22nd January, 2026
Below we implement 16-bit integer multiplication
multiply(a: u16, b: u16) → u16 for Tenstorrent’s Blackhole and Wormhole
AI accelerators, using SFPLOADMACRO to achieve throughputs
of:
See also: 32-bit Integer Multiplication on Tenstorrent.
The Blackhole
implementation is relatively trivial, since SFPMUL24
can be used to multiply two 23-bit integers and extract the lower 23
bits of the result. When storing the result to Dst,
SFPSTORE
with mode=UINT16 will truncate the value to 16 bits.
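As a sanity check on the Blackhole approach, here is a minimal Python sketch that models it. The helper names `mul24_lo` and `multiply_u16_blackhole` are ours (hypothetical). The model assumes, as described above, that SFPMUL24 returns the lower 23 bits of the product and that the UINT16 store keeps only the low 16 bits.

```python
MASK23 = (1 << 23) - 1
MASK16 = (1 << 16) - 1


def mul24_lo(a: int, b: int) -> int:
    """Model of SFPMUL24 (assumed semantics): low 23 bits of the product."""
    return ((a & MASK23) * (b & MASK23)) & MASK23


def multiply_u16_blackhole(a: int, b: int) -> int:
    """Model of load -> mul24_lo -> store with UINT16 truncation."""
    return mul24_lo(a, b) & MASK16


def reference(a: int, b: int) -> int:
    """Wrapping 16-bit multiplication, for comparison."""
    return (a * b) & MASK16
```

Truncating to 23 bits and then to 16 bits gives the same answer as truncating to 16 bits directly, which is why no extra masking instruction is needed in this model.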
Assembly syntax: the destination register is the last register
specified, e.g. sfpmad A,B,C,D means
D=A*B+C.
Throughput for this sequential code is 5 cycles per input row of 32 values:
sfpload L0, UINT16, ADDR_MOD_7, offset0 ; a = load(offset0)
sfpload L1, UINT16, ADDR_MOD_7, offset1 ; b = load(offset1)
sfpmul24 L0, L1, LCONST_0, L1, 0 ; b = mul24_lo(a, b)
; automatic stall
sfpstore L1, UINT16, ADDR_MOD_6, offset2 ; store(b, offset2) with autoincrement
Just like we did with where, we can
optimise for the common case where the output overwrites one of the
inputs in-place, achieving a throughput of 2 cycles per
input row of 32 values:
Alternatively, for the out-of-place case, we can achieve 3 cycles per input row:
Wormhole does not have a dedicated integer multiplication instruction, so we are forced to use fp32 multiplication.
The most straightforward (and efficient) method is to split each 16-bit value into two 8-bit chunks and losslessly cast to fp32:
a0 = int_to_fp32(a & 0xff)
a1 = int_to_fp32(a >> 8)
b0 = int_to_fp32(b & 0xff)
b1 = int_to_fp32(b >> 8)
lo = a0 * b0
hi0 = a0 * b1
hi1 = a1 * b0
hi = hi0 + hi1
result = lo + (hi << 8)
Recall our trick from 32-bit integer
multiplication for implementing fp32_to_int23 with a
single instruction. We simply need to add 2^{23} in fp32, and this gives us the integer
in the raw mantissa bits.
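To make the trick concrete, here is a small Python sketch that emulates it with the standard `struct` module. The helper names (`to_fp32`, `fp32_bits`, `mantissa_bits`, `fp32_to_int23`) are ours, and the sketch assumes the input is a non-negative integer below 2^23 stored as a float.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def fp32_bits(x: float) -> int:
    """Raw 32-bit pattern of x encoded as fp32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def mantissa_bits(x: float) -> int:
    """Low 23 bits (the stored mantissa) of the fp32 encoding of x."""
    return fp32_bits(x) & 0x7FFFFF


def fp32_to_int23(x: float) -> int:
    """Add 2^23 in fp32, then read the integer from the mantissa field."""
    biased = to_fp32(x + 2.0**23)
    return mantissa_bits(biased)
```

For any integer in [0, 2^23), the biased value lies in [2^23, 2^24), where fp32 has a spacing of exactly 1, so the mantissa field holds the integer itself.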
Each individual product fits in 16 bits; the cross-term sum
hi = hi0 + hi1 fits in 17 bits, which is less than the
maximum 23 bits allowed by the 2^{23}
mantissa trick.
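The bounds are easy to verify with a few lines of Python. This is only an arithmetic check; the variable names are ours.

```python
MAX8 = 0xFF  # largest 8-bit chunk

max_product = MAX8 * MAX8        # largest single product, e.g. a0 * b0
max_cross_sum = 2 * max_product  # largest value of hi0 + hi1

assert max_product.bit_length() == 16    # each product fits in 16 bits
assert max_cross_sum.bit_length() == 17  # the cross-term sum fits in 17 bits
assert max_cross_sum < (1 << 23)         # well inside the range of the trick
```

Since every intermediate value stays below 2^23, adding the 2^23 bias in fp32 never rounds.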
This allows us to save two instructions, since we can do the addition
hi = hi0 + hi1 for free in fp32 via FMA, and now we only
need one instruction to convert to an integer instead of two:
lo = a0 * b0 + 2.0**23
hi = a0 * b1 + 2.0**23
hi += a1 * b0
# convert to integers
lo = mantissa_bits(lo)
hi = mantissa_bits(hi)
result = lo + (hi << 8)
Some care is required to minimise the number of instructions.
We use SFPSHFT2
instead of SFPSHFT,
because on Wormhole, SFPSHFT can only be applied in-place
and overwrites its input. This shortcoming is fixed in Blackhole with a
new SFPSHFT_MOD1_ARG_IMM_USE_VC modifier. On the other
hand, SFPSHFT2 with
mod1=SFPSHFT2_MOD1_SHFT_LREG performs a bitwise shift on
L[vb] of amount L[vc]. This saves having to
use another instruction to make a copy of the original value, at the
expense of storing the shift amount in a register.
Similarly, SFPAND on Wormhole can only be applied
in-place and overwrites its input. As long as we perform the
SFPSHFT2 before SFPAND, this is fine.
The three 2-cycle SFPMAD
instructions and subsequent SFPEXMAN
instruction are sequenced such that no stalls occur.
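The fused scheme can be checked end to end with a Python model. This is a sketch of the arithmetic only, not of the hardware: the function names are ours, fp32 rounding is emulated with `struct`, and each FMA is modelled as an exact fp64 computation followed by one rounding to fp32 (exact here, since all values stay below 2^24). The shifts are taken before the masks, mirroring the ordering constraint described above.

```python
import struct


def f32(x: float) -> float:
    """Round to fp32, emulating the result of one fp32 FMA."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def mantissa_bits(x: float) -> int:
    """Low 23 bits of the fp32 encoding of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0] & 0x7FFFFF


def multiply_u16_wormhole(a: int, b: int) -> int:
    # High bytes first: the shift reads the original value.
    a1 = float(a >> 8)
    b1 = float(b >> 8)
    # Low bytes: the mask may now overwrite the original value.
    a0 = float(a & 0xFF)
    b0 = float(b & 0xFF)

    bias = 2.0**23
    hi = f32(a0 * b1 + bias)  # hi = a0 * b1 + 2**23
    lo = f32(a0 * b0 + bias)  # lo = a0 * b0 + 2**23
    hi = f32(a1 * b0 + hi)    # hi += a1 * b0

    lo_int = mantissa_bits(lo)
    hi_int = mantissa_bits(hi)
    # The final store with UINT16 keeps the low 16 bits.
    return (lo_int + (hi_int << 8)) & 0xFFFF
```

The a1 * b1 term never appears because it is multiplied by 2^16 and so cannot affect the low 16 bits of the result.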
Our optimised assembly code takes 18 cycles:
; assumes constant registers L12 = -8, L13 = 0xff, L14 = 0x4b000000 (2.0f**23)
sfpload L0, UINT16, ADDR_MOD_3, offset0 ; L0 = a
sfpshft2 L0, L12, L2, 5 ; a1 = a >> 8 (shift amount from L12 = -8)
sfpcast L2, L2, 0 ; a1 = int_to_fp32(a1)
sfpload L1, UINT16, ADDR_MOD_3, offset1 ; L1 = b
sfpshft2 L1, L12, L3, 5 ; b1 = b >> 8
sfpcast L3, L3, 0 ; b1 = int_to_fp32(b1)
sfpand 0, L13, L0, 0 ; a0 = a & 0xff (in-place, after the shift has read a)
sfpcast L0, L0, 0 ; a0 = int_to_fp32(a0)
sfpand 0, L13, L1, 0 ; b0 = b & 0xff (in-place, after the shift has read b)
sfpcast L1, L1, 0 ; b0 = int_to_fp32(b0)
sfpmad L0, L3, L14, L3, 0 ; hi = a0 * b1 + 2**23 (overwriting b1)
sfpmad L0, L1, L14, L0, 0 ; lo = a0 * b0 + 2**23 (overwriting a0)
sfpmad L2, L1, L3, L3, 0 ; hi += a1 * b0
sfpexman 0, L0, L0, 1 ; lo = mantissa_bits(lo)
sfpexman 0, L3, L3, 1 ; hi = mantissa_bits(hi)
sfpshft 8, 0, L3, 1 ; hi = hi << 8
sfpiadd 0, L3, L0, 4 ; lo = lo + hi
sfpstore L0, UINT16, ADDR_MOD_2, offset2 ; store lo, with autoincrement
How fast can we go using SFPLOADMACRO?
Looking at our instructions above, we can split them into their
requisite sub-units:
The SFPSHFT could potentially be replaced by a
SFPSHFT2, moving it to the round sub-unit.
At first glance, our best case is 10+2 = 12 cycles,
combining the number of simple and round instructions.
They are combined because they are difficult to use
simultaneously: if both are active at the same time, one must have
VD=16 and the other VD!=16.
The special L16 register can only be read
during a macro-scheduled SFPSTORE, making it quite limited
and only useful prior to SFPSTORE.
Although we can overlap the final SFPIADD
(VD=16) with a SFPSHFT2 (VD!=16),
the dependency chain forces the effective throughput to remain at
12 cycles.
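The 10+2 tally can be reproduced by counting the instructions in the 18-cycle listing per sub-unit. The mapping in this sketch is our assumption, inferred from the counts quoted above (SFPSHFT2 on the round sub-unit, the remaining integer and cast instructions on the simple sub-unit). It is not taken from the ISA documentation.

```python
from collections import Counter

# Instruction mnemonics in the order they appear in the 18-cycle listing.
SEQUENCE = [
    "sfpload", "sfpshft2", "sfpcast",
    "sfpload", "sfpshft2", "sfpcast",
    "sfpand", "sfpcast", "sfpand", "sfpcast",
    "sfpmad", "sfpmad", "sfpmad",
    "sfpexman", "sfpexman",
    "sfpshft", "sfpiadd", "sfpstore",
]

# Assumed sub-unit per instruction (inferred from the text, not authoritative).
SUB_UNIT = {
    "sfpload": "load",
    "sfpstore": "store",
    "sfpmad": "mad",
    "sfpshft2": "round",
    "sfpcast": "simple",
    "sfpand": "simple",
    "sfpexman": "simple",
    "sfpshft": "simple",
    "sfpiadd": "simple",
}

counts = Counter(SUB_UNIT[op] for op in SEQUENCE)
best_case = counts["simple"] + counts["round"]
print(dict(counts), "simple + round =", best_case)
```

Moving the SFPSHFT to the round sub-unit would change the split to 9+3 but leave the combined total at 12.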
Thanks to Tenstorrent for sponsoring this work.