20th November, 2025
Below are optimised implementations of 32-bit integer multiplication for Tenstorrent’s Blackhole and Wormhole AI accelerators, achieving throughputs of 8 cycles per row of 32 results on Blackhole, and 40 cycles per row on Wormhole.
Unlike its Wormhole predecessor, Blackhole has a dedicated
instruction for 23-bit integer multiplication: SFPMUL24,
which returns either the low 23 bits or the high 23 bits of the product of two 23-bit
integers.
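For concreteness, here is a minimal C model of these two operations, assuming the semantics just described (the helper names mul24_lo and mul24_hi are mine, mirroring the pseudocode below, not an official API):

#include <stdint.h>

/* Low 23 bits of the product of two operands masked to 23 bits. */
static uint32_t mul24_lo(uint32_t va, uint32_t vb) {
    uint64_t p = (uint64_t)(va & 0x7fffff) * (vb & 0x7fffff);
    return (uint32_t)(p & 0x7fffff);
}

/* High 23 bits (bits 23..45) of the same 46-bit product. */
static uint32_t mul24_hi(uint32_t va, uint32_t vb) {
    uint64_t p = (uint64_t)(va & 0x7fffff) * (vb & 0x7fffff);
    return (uint32_t)(p >> 23);
}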
Given two 32-bit integer values a and b, we
split each value into two 23-bit chunks, such that
a = (a1 << 23) | a0 and
b = (b1 << 23) | b0:
a0 = a & 0x7fffff
b0 = b & 0x7fffff
a1 = a >> 23
b1 = b >> 23
result =
((a1 * b1) << 46) # discard as low 32 bits are zero
+ ((a1 * b0) << 23) # low 23 bits are zero; only need low 9 bits of product
+ ((a0 * b1) << 23) # low 23 bits are zero; only need low 9 bits of product
+ (a0 * b0)             # need all 32 bits of product

Rewrite to use mul24_lo and mul24_hi; note
that we no longer need a0 and b0, since
mul24_lo and mul24_hi implicitly mask inputs
to 23 bits:
result =
(mul24_lo(a1, b) << 23)
+ (mul24_lo( a, b1) << 23)
+ (mul24_hi( a, b) << 23)
+ mul24_lo(a, b)

Collect terms and do a single shift:
result =
(mul24_lo(a1, b) + mul24_lo(a, b1) + mul24_hi(a, b)) << 23
+ mul24_lo(a, b)

This is equivalent to the following sequence of 13 instructions:
sfpload L0, INT32, ADDR_MOD_7, offset0 ; a = load(offset0)
sfpshft -23, L0, L2, 1|4 ; a1 = a >> 23
sfpload L1, INT32, ADDR_MOD_7, offset1 ; b = load(offset1)
sfpshft -23, L1, L3, 1|4 ; b1 = b >> 23
sfpmul24 L1, L2, L9, L2, 0 ; cross0 = mul24_lo(b, a1)
sfpmul24 L0, L3, L9, L3, 0 ; cross1 = mul24_lo(a, b1)
sfpmul24 L0, L1, L9, L4, 1 ; hi = mul24_hi(a, b)
sfpmul24 L0, L1, L9, L0, 0 ; lo = mul24_lo(a, b)
sfpiadd 0, L2, L4, 4 ; hi += cross0
sfpiadd 0, L3, L4, 4 ; hi += cross1
sfpshft 23, L4, L4, 1 ; hi = hi << 23
sfpiadd 0, L4, L0, 4 ; lo += hi
sfpstore L0, INT32, ADDR_MOD_6, offset2 ; store(lo, offset2) with autoincrement
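For reference, here is the same computation as a scalar C sketch (a model of mine with the SFPMUL24 semantics inlined, not Tenstorrent code); comparing it against a plain 32-bit multiply over random inputs is an easy way to validate the identity:

#include <stdint.h>

/* Model of SFPMUL24: mask both inputs to 23 bits, return the low or
 * high 23 bits of the 46-bit product. */
static uint32_t m24(uint32_t x, uint32_t y, int hi) {
    uint64_t p = (uint64_t)(x & 0x7fffff) * (y & 0x7fffff);
    return (uint32_t)(hi ? p >> 23 : p & 0x7fffff);
}

/* Scalar rendering of the 13-instruction sequence; all shifts and
 * additions wrap mod 2^32, as in the INT32 lanes. */
static uint32_t mul32_blackhole(uint32_t a, uint32_t b) {
    uint32_t a1     = a >> 23;         /* sfpshft -23     */
    uint32_t b1     = b >> 23;         /* sfpshft -23     */
    uint32_t cross0 = m24(b, a1, 0);   /* mul24_lo(b, a1) */
    uint32_t cross1 = m24(a, b1, 0);   /* mul24_lo(a, b1) */
    uint32_t hi     = m24(a, b, 1);    /* mul24_hi(a, b)  */
    uint32_t lo     = m24(a, b, 0);    /* mul24_lo(a, b)  */
    return lo + ((hi + cross0 + cross1) << 23);
}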
13 cycles is pretty good, but can we go faster? As always, SFPLOADMACRO is our friend. The following uses four unique macros to schedule four reusable instruction templates, achieving a throughput of 8 cycles per row.
SFPLOADMACRO restricts us to a maximum of four unique
macros, all sharing up to four instruction templates (one per sub-unit).
Important: Each macro can override either
VB or VC for its use of a particular template,
and set its VD from that macro’s particular instantiation.
All other fields are fixed (except for VD
in the case of SFPSTORE, which can be overridden with
VD=16 or VD=0).
We rewrite our code to maximise reuse of instruction templates:
# Variable-to-register mappings:
#
# a,b = 0
# a1 = 1
# b1 = 2
# b2 = 3
# c = 4
#
# Instruction templates ([x] is replaced by SFPLOADMACRO's VD):
#
# 0: [x] = [x] >> 23 via SFPSHFT2(Imm12=-23, MOD1=SFPSHFT2_MOD1_SHFT_IMM)
# 1: [x] = mul24_lo(a or b, [x]) via SFPMUL24(VA=0, VC=9)
# 2: [x] = b1 << 23 via SFPSHFT(Imm12=23, VC=2)
# 3: [x] = [x] + c via SFPIADD(VC=3)
b = load(offset1) # L0 = b
a1 = loadmacro(offset0) # L1 = a
b1 = loadmacro(offset1) # L2 = b
a1 = a1 >> 23 # template 0; L1 = L1 >> 23
b1 = b1 >> 23 # template 0; L2 = L2 >> 23
a1 = mul24_lo(b, a1) # template 1; L1 = mul24_lo(L0, L1)
a = load(offset0) # L0 = a
b2 = loadmacro(offset1) # L3 = b
b1 = mul24_lo(a, b1) # template 1; L2 = mul24_lo(L0, L2)
c = mul24_hi(a, b2) # L4 = mul24_hi(L0, L3)
b2 = mul24_lo(a, b2) # template 1; L3 = mul24_lo(L0, L3)
b1 = b1 + a1 # L2 = L2 + L1
b1 = b1 + c # template 3; L2 = L2 + L4
c = loadmacro(offset2) # L4 = ?
c = b1 << 23 # template 2; L4 = L2 << 23
L16 = b2 + c # template 3; L16 = L3 + L4
store(L16, offset2)      # store L16 to offset2

In the scheduling diagram below, colours denote variables, and instruction slots are coloured by their destination variable, except in the case of the special destination register L16, where we colour the instruction by its original SFPLOADMACRO destination variable.
Note that Blackhole has automatic stalling for SFPMUL24
when issued normally (not via SFPLOADMACRO). Our
mul24_hi and the following instruction are issued as
regular instructions, which means care needs to be taken to avoid
triggering the automatic
stalling logic.
Since Wormhole doesn’t have a dedicated instruction for multiplying
integers, we have to SFPCAST
to FP32 and use SFPMAD
instead. FP32 can only represent integers of up to 24 bits exactly, so we have to split
the input values into multiple chunks such that their products can be
represented exactly in FP32.
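As a quick standalone illustration of that 24-bit limit (not part of the kernel): 16777216 = 2^24 survives a round trip through FP32, but 16777217 does not.

#include <stdio.h>

int main(void) {
    printf("%.1f\n", (double)(float)16777216);  /* 2^24     -> 16777216.0 (exact)   */
    printf("%.1f\n", (double)(float)16777217);  /* 2^24 + 1 -> 16777216.0 (rounded) */
    return 0;
}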
One option is to split each 32-bit value into four 8-bit chunks. This would require a total of 10 multiplications, 9 additions, and 3 shifts (not including the operations required to extract the initial 8-bit chunks):
# Given:
# a = (a3 << 24) | (a2 << 16) | (a1 << 8) | a0
# b = (b3 << 24) | (b2 << 16) | (b1 << 8) | b0
a*b =
(a3*b3 << 48)
+ (a3*b2 + a2*b3) << 40
+ (a3*b1 + a2*b2 + a1*b3) << 32
+ (a3*b0 + a2*b1 + a1*b2 + a0*b3) << 24
+ (a2*b0 + a1*b1 + a0*b2) << 16
+ (a1*b0 + a0*b1) << 8
+ (a0*b0)
# We only need the terms that can contribute to the low 32 bits:
a*b & 0xffffffff =
(a3*b0 + a2*b1 + a1*b2 + a0*b3) << 24
+ (a2*b0 + a1*b1 + a0*b2) << 16
+ (a1*b0 + a0*b1) << 8
+ (a0*b0)
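Expressed as a C sketch of mine (not the shipped kernel), where unsigned arithmetic wraps mod 2^32 and so discards everything above bit 31:

#include <stdint.h>

/* Low 32 bits of a*b from four 8-bit chunks per operand: 10 multiplies,
 * 9 additions and 3 shifts, exactly as counted above. */
static uint32_t mul32_from_8bit_chunks(uint32_t a, uint32_t b) {
    uint32_t a0 = a & 0xff, a1 = (a >> 8) & 0xff, a2 = (a >> 16) & 0xff, a3 = a >> 24;
    uint32_t b0 = b & 0xff, b1 = (b >> 8) & 0xff, b2 = (b >> 16) & 0xff, b3 = b >> 24;
    return ((a3*b0 + a2*b1 + a1*b2 + a0*b3) << 24)
         + ((a2*b0 + a1*b1 + a0*b2) << 16)
         + ((a1*b0 + a0*b1) << 8)
         +  (a0*b0);
}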
An alternative is to use three 11-bit chunks per 32-bit value. This has the attractive property of allowing SFPMAD to be used to sum three 22-bit products without loss of precision, as the sum fits in 24 bits. Now only 6 multiplications, 2 additions, and 2 shifts are needed (the other 3 additions come for free via SFPMAD):
# Given:
# a = (a2 << 22) | (a1 << 11) | a0
# b = (b2 << 22) | (b1 << 11) | b0
a * b & 0xffffffff =
(a0*b2 + a1*b1 + a2*b0) << 22 # top = max 24 bits
+ (a0*b1 + a1*b0) << 11 # mid = max 23 bits
+ (a0*b0)                       # low = max 22 bits

The catch is that there is no dedicated instruction for converting FP32 values larger than 16 bits to an integer, so multiple instructions are required:
sfpexexp 0, L5, L0, 2|8 ; extract unbiased exponent, disable lanes where Exp<0
sfpexman 0, L5, L5, 0 ; extract mantissa with implicit bit set (even for zero!)
sfpiadd 22-23, L0, L0, 1|4 ; calculate shift amount (shift right 23, then left 22)
sfpshft2 L5, L0, L5, 5 ; logical shift of mantissa by amount
sfpencc 0, 0, 0, 0 ; re-enable lanes
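In scalar C, the same trick looks roughly like this (my own model, with the extra left shift of 0, 11 or 22 folded into the shift amount, as the full code below does for mid and top):

#include <stdint.h>
#include <string.h>

/* Model of the exexp/exman/shift sequence for a non-negative, integer-valued
 * FP32 input: extract the unbiased exponent and the mantissa with the
 * implicit bit set, then shift by exp - 23 + lshift. Lanes with a negative
 * exponent are disabled in the assembly; for integer-valued inputs that only
 * happens for 0.0f, whose bit pattern already reads as integer 0. */
static uint32_t fp32_to_int_shifted(float x, int lshift) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    int32_t exp = (int32_t)((bits >> 23) & 0xff) - 127;  /* sfpexexp          */
    if (exp < 0) return 0;                               /* lane disabled     */
    uint32_t man = (bits & 0x7fffff) | 0x800000;         /* sfpexman          */
    int32_t shift = exp - 23 + lshift;                   /* sfpiadd           */
    return shift >= 0 ? man << shift : man >> -shift;    /* sfpshft2          */
}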
Compared to a single SFPSTOCHRND (if it supported values greater than 16 bits), 5 cycles seems expensive. However, note that for top and mid, we need to shift the values anyway, so the overhead of this approach is only 3 cycles for those variables, and 4 cycles for low.
The full code is 40 cycles:
; assumes constant registers L9 = 0; L12 = 0x7ff; L13 = -11
sfpload L0, INT32, ADDR_MOD_3, offset0 ; a0 = load(offset0)
sfpshft2 L0, L13, L2, 5 ; a1 = a0 >> 11
sfpshft2 L2, L13, L4, 5 ; a2 = a1 >> 11
sfpand 0, L12, L2, 0 ; a1 &= mask
sfpcast L2, L2, 0 ; a1 = int_to_fp32(a1)
sfpcast L4, L4, 0 ; a2 = int_to_fp32(a2)
sfpand 0, L12, L0, 0 ; a0 &= mask
sfpcast L0, L0, 0 ; a0 = int_to_fp32(a0)
sfpload L1, INT32, ADDR_MOD_3, offset1 ; b0 = load(offset1)
sfpshft2 L1, L13, L3, 5 ; b1 = b0 >> 11
sfpshft2 L3, L13, L5, 5 ; b2 = b1 >> 11
sfpcast L5, L5, 0 ; b2 = int_to_fp32(b2)
sfpmad L0, L5, L9, L5, 0 ; top = a0*b2
sfpand 0, L12, L3, 0 ; b1 &= mask
sfpcast L3, L3, 0 ; b1 = int_to_fp32(b1)
sfpmad L2, L3, L5, L5, 0 ; top += a1*b1
sfpand 0, L12, L1, 0 ; b0 &= mask
sfpcast L1, L1, 0 ; b0 = int_to_fp32(b0)
sfpmad L4, L1, L5, L5, 0 ; top += a2*b0
sfpmad L0, L3, L9, L6, 0 ; mid = a0*b1
sfpmad L0, L1, L9, L7, 0 ; low = a0*b0
sfpmad L2, L1, L6, L6, 0 ; mid += a1*b0
; low = fp32_to_int(low)
sfpexexp 0, L7, L0, 2|8 ; extract unbiased exponent, disable lanes where Exp<0
sfpexman 0, L7, L7, 0 ; extract mantissa with implicit bit set (even for zero!)
sfpiadd -23, L0, L0, 1|4 ; calculate shift amount (shift right 23)
sfpshft2 L7, L0, L7, 5 ; logical shift of mantissa by amount
sfpencc 0, 0, 0, 0 ; re-enable lanes
; mid = fp32_to_int(mid) << 11
sfpexexp 0, L6, L0, 2|8 ; extract unbiased exponent, disable lanes where Exp<0
sfpexman 0, L6, L6, 0 ; extract mantissa with implicit bit set (even for zero!)
sfpiadd 11-23, L0, L0, 1|4 ; calculate shift amount (shift right 23; left 11)
sfpshft2 L6, L0, L6, 5 ; logical shift of mantissa by amount
sfpencc 0, 0, 0, 0 ; re-enable lanes
; top = fp32_to_int(top) << 22
sfpexexp 0, L5, L0, 2|8 ; extract unbiased exponent, disable lanes where Exp<0
sfpexman 0, L5, L5, 0 ; extract mantissa with implicit bit set (even for zero!)
sfpiadd 22-23, L0, L0, 1|4 ; calculate shift amount (shift right 23; left 22)
sfpshft2 L5, L0, L5, 5 ; logical shift of mantissa by amount
sfpencc 0, 0, 0, 0 ; re-enable lanes
sfpiadd 0, L6, L7, 4 ; low += mid
sfpiadd 0, L5, L7, 4 ; low += top
sfpstore L7, INT32, ADDR_MOD_2, offset2 ; store(low, offset2) with autoincrement
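As a reference model (a sketch of mine, not the kernel above), the whole 11-bit-chunk scheme is a few lines of C; every product and partial sum stays within 24 bits, so the FP32 arithmetic is exact, and the plain casts stand in for the exexp/exman conversion sequences:

#include <stdint.h>

/* Each chunk is at most 11 bits, so every product fits in 22 bits and each
 * partial sum fits in 24 bits; the float arithmetic is therefore exact. */
static uint32_t mul32_wormhole(uint32_t a, uint32_t b) {
    float a0 = (float)(a & 0x7ff), a1 = (float)((a >> 11) & 0x7ff), a2 = (float)(a >> 22);
    float b0 = (float)(b & 0x7ff), b1 = (float)((b >> 11) & 0x7ff), b2 = (float)(b >> 22);

    float top = a0*b2 + a1*b1 + a2*b0;   /* max 24 bits */
    float mid = a0*b1 + a1*b0;           /* max 23 bits */
    float low = a0*b0;                   /* max 22 bits */

    return ((uint32_t)top << 22) + ((uint32_t)mid << 11) + (uint32_t)low;
}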
Leveraging SFPLOADMACRO, we achieve 8/32c throughput for 32-bit integer multiplication on Blackhole. On Wormhole, using SFPMAD with 11-bit chunks achieves 40/32c throughput.
Thanks to Tenstorrent for sponsoring this work.