16-bit Integer Multiplication on Tenstorrent

22nd January, 2026

Introduction

Below we implement 16-bit integer multiplication multiply(a: u16, b: u16) → u16 for Tenstorrent’s Blackhole and Wormhole AI accelerators, using SFPLOADMACRO to achieve throughputs of 2 cycles per input row of 32 values on Blackhole and 12 cycles per input row on Wormhole.

See also: 32-bit Integer Multiplication on Tenstorrent.

Blackhole

The Blackhole implementation is relatively trivial, since SFPMUL24 can be used to multiply two 23-bit integers and extract the lower 23 bits of the result. When storing the result to Dst, SFPSTORE with mode=UINT16 will truncate the value to 16 bits.
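In Python terms, a sketch of what this Blackhole sequence computes (the function name and `mul24_lo` variable are ours):

```python
def multiply_u16_blackhole(a: int, b: int) -> int:
    # SFPMUL24 multiplies two 23-bit operands and keeps the low 23 bits
    mul24_lo = (a * b) & ((1 << 23) - 1)
    # SFPSTORE with mode=UINT16 then truncates to the low 16 bits
    return mul24_lo & 0xffff
```

Since 23 ≥ 16, the intermediate 23-bit truncation never loses bits that survive the final 16-bit truncation, so the result always equals (a * b) mod 2^16.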

Sequential Code

Assembly syntax: the destination register is the last register specified, e.g. sfpmad A,B,C,D means D=A*B+C.

Throughput for this sequential code is 5 cycles per input row of 32 values:

sfpload  L0, UINT16, ADDR_MOD_7, offset0 ; a = load(offset0)
sfpload  L1, UINT16, ADDR_MOD_7, offset1 ; b = load(offset1)
sfpmul24 L0, L1, LCONST_0, L1, 0         ; b = mul24_lo(a, b)
                                         ; automatic stall
sfpstore L1, UINT16, ADDR_MOD_6, offset2 ; store(b, offset2) with autoincrement

Parallel Execution via SFPLOADMACRO

Just like we did with where, we can optimise for the common case where the output overwrites one of the inputs in-place, achieving a throughput of 2 cycles per input row of 32 values.

Alternatively, for the out-of-place case, we can achieve 3 cycles per input row.

Wormhole

Wormhole does not have a dedicated integer multiplication instruction, so we are forced to use fp32 multiplication.

The most straightforward (and efficient) method is to split each 16-bit value into two 8-bit chunks and losslessly cast to fp32:

a0 = int_to_fp32(a & 0xff)
a1 = int_to_fp32(a >> 8)
b0 = int_to_fp32(b & 0xff)
b1 = int_to_fp32(b >> 8)

lo  = a0 * b0
hi0 = a0 * b1
hi1 = a1 * b0

hi = hi0 + hi1
result = lo + (hi << 8)  # a1*b1 only affects bits >= 16, so a u16 result drops it

Recall our trick from 32-bit integer multiplication for implementing fp32_to_int23 with a single instruction. We simply need to add 2^23 in fp32, and this gives us the integer in the raw mantissa bits.

Each individual product fits in 16 bits; the cross-term sum hi = hi0 + hi1 fits in 17 bits, which is less than the maximum 23 bits allowed by the 2^23 mantissa trick.
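The bias trick is easy to check on the host by modelling fp32 with `struct` (a sketch; `mantissa_bits` here plays the role of SFPEXMAN):

```python
import struct

def mantissa_bits(x: float) -> int:
    # Reinterpret the value as fp32 and extract the low 23 mantissa bits,
    # as SFPEXMAN does
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return bits & ((1 << 23) - 1)

# For 0 <= n < 2**23, the fp32 value n + 2**23 has exponent 23, so n
# appears verbatim in the mantissa bits.
assert mantissa_bits(130050.0 + 2.0**23) == 130050  # 130050 = 2*255*255, the largest cross-term sum
```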

This allows us to save two instructions, since we can do the addition hi = hi0 + hi1 for free in fp32 via FMA, and now we only need one instruction to convert to an integer instead of two:

lo = a0 * b0 + 2.0**23
hi = a0 * b1 + 2.0**23
hi += a1 * b0

# convert to integers
lo = mantissa_bits(lo)
hi = mantissa_bits(hi)

result = lo + (hi << 8)
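Putting the pieces together, a host-side sketch of the full Wormhole recipe (plain integer arithmetic models fp32 exactly here, since every intermediate fits in 24 significand bits; the function name is ours):

```python
def multiply_u16_wormhole(a: int, b: int) -> int:
    a0, a1 = a & 0xff, a >> 8
    b0, b1 = b & 0xff, b >> 8
    # Adding 2**23 then masking to 23 bits models the fp32 mantissa trick
    lo = (a0 * b0 + 2**23) & ((1 << 23) - 1)
    hi = (a0 * b1 + a1 * b0 + 2**23) & ((1 << 23) - 1)
    # a1 * b1 would only affect bits >= 16, so a u16 result can drop it
    return (lo + (hi << 8)) & 0xffff
```

For all inputs this agrees with (a * b) mod 2^16, the value the SFPSTORE truncation produces.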

Sequential Code

Some care is required to minimise the number of instructions.

We use SFPSHFT2 instead of SFPSHFT, because on Wormhole, SFPSHFT can only be applied in-place and overwrites its input. This shortcoming is fixed in Blackhole with a new SFPSHFT_MOD1_ARG_IMM_USE_VC modifier. On the other hand, SFPSHFT2 with mod1=SFPSHFT2_MOD1_SHFT_LREG performs a bitwise shift on L[vb] of amount L[vc]. This saves having to use another instruction to make a copy of the original value, at the expense of storing the shift amount in a register.

Similarly, SFPAND on Wormhole can only be applied in-place and overwrites its input. As long as we perform the SFPSHFT2 before SFPAND, this is fine.
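As a sketch of the SFPSHFT2_MOD1_SHFT_LREG behaviour assumed here (our reading: a negative register-held shift amount performs a logical right shift, which is why the code keeps -8 in a constant register to compute `>> 8`):

```python
def sfpshft2_shft_lreg(vb: int, vc: int) -> int:
    # Shift vb by the register-held amount vc on a 32-bit lane:
    # positive shifts left, negative shifts right (logical)
    if vc >= 0:
        return (vb << vc) & 0xffffffff
    return (vb & 0xffffffff) >> -vc
```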

The three 2-cycle SFPMAD instructions and subsequent SFPEXMAN instruction are sequenced such that no stalls occur.

Our optimised assembly code takes 18 cycles:

; assumes constant registers L12 = -8, L13 = 0xff, L14 = 0x4b000000 (2.0f**23)

sfpload  L0, UINT16, ADDR_MOD_3, offset0 ; a0
sfpshft2 L0, L12, L2, 5                  ; a1 = a0 >> 8
sfpcast  L2, L2, 0                       ; a1 = int_to_fp32(a1)

sfpload  L1, UINT16, ADDR_MOD_3, offset1 ; b0
sfpshft2 L1, L12, L3, 5                  ; b1 = b0 >> 8
sfpcast  L3, L3, 0                       ; b1 = int_to_fp32(b1)

sfpand  0, L13, L0, 0                    ; a0 = a0 & 0xff
sfpcast L0, L0, 0                        ; a0 = int_to_fp32(a0)

sfpand  0, L13, L1, 0                    ; b0 = b0 & 0xff
sfpcast L1, L1, 0                        ; b0 = int_to_fp32(b0)

sfpmad L0, L3, L14, L3, 0                ; hi = a0 * b1 + 2**23 (overwriting b1)
sfpmad L0, L1, L14, L0, 0                ; lo = a0 * b0 + 2**23 (overwriting a0)
sfpmad L2, L1, L3, L3, 0                 ; hi += a1 * b0

sfpexman 0, L0, L0, 1                    ; lo = mantissa_bits(lo)
sfpexman 0, L3, L3, 1                    ; hi = mantissa_bits(hi)

sfpshft 8,  0, L3, 1                     ; hi = hi << 8
sfpiadd 0, L3, L0, 4                     ; lo = lo + hi

sfpstore L0, UINT16, ADDR_MOD_2, offset2 ; store lo, with autoincrement

Parallel Execution via SFPLOADMACRO

How fast can we go using SFPLOADMACRO? Looking at our instructions above, we can split them into their requisite sub-units.

The SFPSHFT could potentially be replaced by a SFPSHFT2, moving it to the round sub-unit.

At first glance, our best case is 10 + 2 = 12 cycles: the count of simple instructions plus the count of round instructions. The two counts add rather than overlap because the sub-units are difficult to use simultaneously; if both are active in the same cycle, one instruction must have VD=16 and the other VD!=16.

The special L16 register can only be read during a macro-scheduled SFPSTORE, which makes it quite limited: it is only useful for staging a value immediately before an SFPSTORE.

Although we can overlap the final SFPIADD (VD=16) with a SFPSHFT2 (VD!=16), the dependency chain forces the effective throughput to remain at 12 cycles.

Acknowledgements

Thanks to Tenstorrent for sponsoring this work.