32-bit Integer Multiplication on Tenstorrent

20th November, 2025

Introduction

Below are optimised implementations of 32-bit integer multiplication for Tenstorrent’s Blackhole and Wormhole AI accelerators, achieving throughputs of 8 cycles per row on Blackhole and 40 cycles per row on Wormhole.

Blackhole

Unlike its Wormhole predecessor, Blackhole has a dedicated instruction for 23-bit integer multiplication: SFPMUL24, returning the low or high 23 bits of the product of two 23-bit integers.

Sequential Code

Given two 32-bit integer values a and b, we split each value into a low 23-bit chunk and a high 9-bit chunk, such that a = (a1 << 23) | a0 and b = (b1 << 23) | b0:

a0 = a & 0x7fffff
b0 = b & 0x7fffff
a1 = a >> 23
b1 = b >> 23

result =
  ((a1 * b1) << 46) # discard as low 32 bits are zero
+ ((a1 * b0) << 23) # low 23 bits are zero; only need low 9 bits of product
+ ((a0 * b1) << 23) # low 23 bits are zero; only need low 9 bits of product
+  (a0 * b0)        # need all 32 bits of product

Rewrite to use mul24_lo and mul24_hi; note that we no longer need a0 and b0, since mul24_lo and mul24_hi implicitly mask inputs to 23 bits:

result =
  (mul24_lo(a1,  b) << 23)
+ (mul24_lo( a, b1) << 23)
+ (mul24_hi( a,  b) << 23) + mul24_lo(a, b)

Collect terms and do a single shift:

result =
  (mul24_lo(a1, b) + mul24_lo(a, b1) + mul24_hi(a, b)) << 23
+  mul24_lo(a, b)
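As a sanity check, this identity can be verified in a few lines of Python, modelling mul24_lo and mul24_hi with the semantics described above (inputs implicitly masked to 23 bits; the low or high 23 bits of the 46-bit product returned):

```python
import random

M23 = (1 << 23) - 1  # 23-bit mask

def mul24_lo(x, y):
    # Low 23 bits of the product of the (masked) 23-bit inputs.
    return ((x & M23) * (y & M23)) & M23

def mul24_hi(x, y):
    # High 23 bits of the 46-bit product.
    return ((x & M23) * (y & M23)) >> 23

def mul32(a, b):
    a1, b1 = a >> 23, b >> 23
    hi = mul24_lo(a1, b) + mul24_lo(a, b1) + mul24_hi(a, b)
    return ((hi << 23) + mul24_lo(a, b)) & 0xffffffff

for _ in range(10000):
    a, b = random.getrandbits(32), random.getrandbits(32)
    assert mul32(a, b) == (a * b) & 0xffffffff
```

The dropped high bits of the two cross products are harmless because they would be shifted past bit 31 anyway.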

This is equivalent to the following sequence of 13 instructions:

sfpload L0, INT32, ADDR_MOD_7, offset0  ; a = load(offset0)
sfpshft -23, L0, L2, 1|4                ; a1 = a >> 23

sfpload L1, INT32, ADDR_MOD_7, offset1  ; b = load(offset1)
sfpshft -23, L1, L3, 1|4                ; b1 = b >> 23

sfpmul24 L1, L2, L9, L2, 0              ; cross0 = mul24_lo(b, a1)
sfpmul24 L0, L3, L9, L3, 0              ; cross1 = mul24_lo(a, b1)
sfpmul24 L0, L1, L9, L4, 1              ; hi = mul24_hi(a, b)
sfpmul24 L0, L1, L9, L0, 0              ; lo = mul24_lo(a, b)

sfpiadd  0, L2, L4, 4                   ; hi += cross0
sfpiadd  0, L3, L4, 4                   ; hi += cross1
sfpshft 23, L4, L4, 1                   ; hi = hi << 23
sfpiadd  0, L4, L0, 4                   ; lo += hi

sfpstore L0, INT32, ADDR_MOD_6, offset2 ; store(lo, offset2) with autoincrement

Parallel Execution via SFPLOADMACRO

13 cycles is pretty good, but can we go faster? As always, SFPLOADMACRO is our friend. The following uses four unique macros to schedule four reusable instruction templates, achieving a throughput of 8 cycles per row.

SFPLOADMACRO restricts us to a maximum of four unique macros, all sharing up to four instruction templates (one per sub-unit).

Important: Each macro can override either VB or VC for its use of a particular template, and VD is set from that macro’s particular instantiation. All other fields are fixed (except for VD in the case of SFPSTORE, which can be overridden with VD=16 or VD=0).

We rewrite our code to maximise reuse of instruction templates:

# Variable-to-register mappings:
#
# a,b = 0
# a1  = 1
# b1  = 2
# b2  = 3
# c   = 4
#
# Instruction templates ([x] is replaced by SFPLOADMACRO's VD):
#
# 0: [x] = [x] >> 23 via SFPSHFT2(Imm12=-23, MOD1=SFPSHFT2_MOD1_SHFT_IMM)
# 1: [x] = mul24_lo(a or b, [x]) via SFPMUL24(VA=0, VC=9)
# 2: [x] = b1 << 23 via SFPSHFT(Imm12=23, VC=2)
# 3: [x] = [x] + c via SFPIADD(VC=3)

b  = load(offset1)      # L0 = b
a1 = loadmacro(offset0) # L1 = a
b1 = loadmacro(offset1) # L2 = b

a1 = a1 >> 23           # template 0; L1 = L1 >> 23
b1 = b1 >> 23           # template 0; L2 = L2 >> 23
a1 = mul24_lo(b, a1)    # template 1; L1 = mul24_lo(L0, L1)

a  = load(offset0)      # L0 = a
b2 = loadmacro(offset1) # L3 = b
b1 = mul24_lo(a, b1)    # template 1; L2 = mul24_lo(L0, L2)
c  = mul24_hi(a, b2)    #             L4 = mul24_hi(L0, L3)
b2 = mul24_lo(a, b2)    # template 1; L3 = mul24_lo(L0, L3)

b1 = b1 + a1            #             L2 = L2 + L1
b1 = b1 + c             # template 3; L2 = L2 + L4
c  = loadmacro(offset2) #             L4 = ? (value unused; overwritten below)
c  = b1 << 23           # template 2; L4 = L2 << 23
L16 = b2 + c            # template 3; L16 = L3 + L4
store(L16, offset2)     # store L16 to offset2
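The register-level schedule above can be modelled directly in Python (a sketch only; mul24_lo and mul24_hi follow the 23-bit semantics described earlier, and the final template 3 result stands in for the L16 store):

```python
M23 = (1 << 23) - 1

def mul24_lo(x, y): return ((x & M23) * (y & M23)) & M23
def mul24_hi(x, y): return ((x & M23) * (y & M23)) >> 23

def mul32_scheduled(a, b):
    L0 = b                        # b  = load(offset1)
    L1 = a                        # a1 = loadmacro(offset0)
    L2 = b                        # b1 = loadmacro(offset1)
    L1 >>= 23                     # template 0
    L2 >>= 23                     # template 0
    L1 = mul24_lo(L0, L1)         # template 1
    L0 = a                        # a  = load(offset0)
    L3 = b                        # b2 = loadmacro(offset1)
    L2 = mul24_lo(L0, L2)         # template 1
    L4 = mul24_hi(L0, L3)
    L3 = mul24_lo(L0, L3)         # template 1
    L2 = (L2 + L1) & 0xffffffff
    L2 = (L2 + L4) & 0xffffffff   # template 3
    L4 = (L2 << 23) & 0xffffffff  # template 2
    return (L3 + L4) & 0xffffffff # template 3, into L16

assert all(mul32_scheduled(a, b) == (a * b) & 0xffffffff
           for a in (0, 1, 0xdeadbeef, 0xffffffff)
           for b in (0, 7, 0x12345678, 0xffffffff))
```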

In the scheduling diagram below, colours denote variables, and instruction slots are coloured by their destination variable, except in the case of the special destination register L16, where we colour the instruction by its original SFPLOADMACRO destination variable.

Note that Blackhole has automatic stalling for SFPMUL24 when issued normally (not via SFPLOADMACRO). Our mul24_hi and the following instruction are issued as regular instructions, which means care needs to be taken to avoid triggering the automatic stalling logic.

Wormhole

Since Wormhole doesn’t have a dedicated instruction for multiplying integers, we have to SFPCAST to FP32 and use SFPMAD instead. FP32 can represent 24-bit integers exactly, so we split the input values into multiple chunks such that their products (and sums of products) can be represented exactly in FP32.

8-bit Chunks

One option is to split each 32-bit value into four 8-bit chunks. This would require a total of 10 multiplications, 9 additions, and 3 shifts (not including the operations required to extract the initial 8-bit chunks):

# Given:
#   a = (a3 << 24) | (a2 << 16) | (a1 << 8) | a0
#   b = (b3 << 24) | (b2 << 16) | (b1 << 8) | b0

a*b =
  (a3*b3 << 48)
+ (a3*b2 + a2*b3) << 40
+ (a3*b1 + a2*b2 + a1*b3) << 32
+ (a3*b0 + a2*b1 + a1*b2 + a0*b3) << 24
+ (a2*b0 + a1*b1 + a0*b2) << 16
+ (a1*b0 + a0*b1) << 8
+ (a0*b0)

# We only need the products whose low 32 bits are non-zero:

a*b & 0xffffffff =
  (a3*b0 + a2*b1 + a1*b2 + a0*b3) << 24
+ (a2*b0 + a1*b1 + a0*b2)         << 16
+ (a1*b0 + a0*b1)                 << 8
+ (a0*b0)
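The truncated sum can be checked quickly in Python (a sketch of the identity only, not of the Wormhole instruction sequence):

```python
import random

def mul32_8bit(a, b):
    ab = [(a >> s) & 0xff for s in (0, 8, 16, 24)]
    bb = [(b >> s) & 0xff for s in (0, 8, 16, 24)]
    # Only the products whose low 32 bits are non-zero contribute.
    r  = (ab[3]*bb[0] + ab[2]*bb[1] + ab[1]*bb[2] + ab[0]*bb[3]) << 24
    r += (ab[2]*bb[0] + ab[1]*bb[1] + ab[0]*bb[2]) << 16
    r += (ab[1]*bb[0] + ab[0]*bb[1]) << 8
    r += ab[0]*bb[0]
    return r & 0xffffffff

for _ in range(10000):
    a, b = random.getrandbits(32), random.getrandbits(32)
    assert mul32_8bit(a, b) == (a * b) & 0xffffffff
```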

11-bit Chunks

An alternative is to use three 11-bit chunks per 32-bit value (the topmost chunk in fact needs only 10 bits). This has the attractive property of allowing SFPMAD to sum three 22-bit products without loss of precision, as the sum fits in 24 bits. Now only 6 multiplications, 2 additions, and 2 shifts are needed (3 further additions come for free via SFPMAD):

# Given:
#   a = (a2 << 22) | (a1 << 11) | a0
#   b = (b2 << 22) | (b1 << 11) | b0

a * b & 0xffffffff =
  (a0*b2 + a1*b1 + a2*b0) << 22 # top = max 24 bits
+ (a0*b1 + a1*b0)         << 11 # mid = max 23 bits
+ (a0*b0)                       # low = max 22 bits
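In pure Python the 11-bit decomposition looks as follows (integer arithmetic here; on Wormhole every product and partial sum stays within FP32’s 24-bit exact range, so the SFPMAD chain computes the same values without rounding):

```python
import random

M11 = (1 << 11) - 1

def mul32_11bit(a, b):
    a0, a1, a2 = a & M11, (a >> 11) & M11, a >> 22
    b0, b1, b2 = b & M11, (b >> 11) & M11, b >> 22
    top = a0*b2 + a1*b1 + a2*b0   # max 24 bits: exact in FP32
    mid = a0*b1 + a1*b0           # max 23 bits: exact in FP32
    low = a0*b0                   # max 22 bits: exact in FP32
    return ((top << 22) + (mid << 11) + low) & 0xffffffff

for _ in range(10000):
    a, b = random.getrandbits(32), random.getrandbits(32)
    assert mul32_11bit(a, b) == (a * b) & 0xffffffff
```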

The catch is that there is no dedicated instruction for converting FP32 values larger than 16 bits to an integer, so multiple instructions are required:

sfpexexp 0, L5, L0, 2|8     ; extract unbiased exponent, disable lanes where Exp<0
sfpexman 0, L5, L5, 0       ; extract mantissa with implicit bit set (even for zero!)
sfpiadd  22-23, L0, L0, 1|4 ; calculate shift amount (shift right 23, then left 22)
sfpshft2 L5, L0, L5, 5      ; logical shift of mantissa by amount
sfpencc  0, 0, 0, 0         ; re-enable lanes

Compared to a single SFPSTOCHRND (if it supported values greater than 16 bits), 5 cycles seems expensive. However, note that for top and mid, we need to shift the values anyway, so the overhead of this approach is only 3 cycles for those variables, and 4 cycles for low.
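The extraction sequence can be mirrored in Python by picking apart the FP32 bit pattern with struct (a sketch; inputs are assumed to be non-negative integer-valued FP32, and a zero input, whose lane the hardware disables, simply yields 0):

```python
import struct

def fp32_to_int(x):
    # Reinterpret the FP32 bits, then shift the mantissa (with the implicit
    # bit set) by exponent - 23, mirroring sfpexexp / sfpexman / sfpshft2.
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    exp = ((bits >> 23) & 0xff) - 127       # unbiased exponent
    if exp < 0:                             # lane disabled: result stays 0
        return 0
    man = (bits & 0x7fffff) | 0x800000      # mantissa with implicit bit
    shift = exp - 23
    return man << shift if shift >= 0 else man >> -shift

assert fp32_to_int(0.0) == 0
assert fp32_to_int(5.0) == 5
assert fp32_to_int(float(1 << 23)) == 1 << 23
```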

The full code is 40 cycles:

; assumes constant registers L9 = 0; L12 = 0x7ff; L13 = -11

sfpload  L0, INT32, ADDR_MOD_3, offset0 ; a0 = load(offset0)

sfpshft2 L0, L13, L2, 5                 ; a1 = a0 >> 11
sfpshft2 L2, L13, L4, 5                 ; a2 = a1 >> 11

sfpand   0, L12, L2, 0                  ; a1 &= mask
sfpcast  L2, L2, 0                      ; a1 = int_to_fp32(a1)

sfpcast  L4, L4, 0                      ; a2 = int_to_fp32(a2)

sfpand   0, L12, L0, 0                  ; a0 &= mask
sfpcast  L0, L0, 0                      ; a0 = int_to_fp32(a0)

sfpload  L1, INT32, ADDR_MOD_3, offset1 ; b0 = load(offset1)

sfpshft2 L1, L13, L3, 5                 ; b1 = b0 >> 11
sfpshft2 L3, L13, L5, 5                 ; b2 = b1 >> 11

sfpcast  L5, L5, 0                      ; b2 = int_to_fp32(b2)

sfpmad   L0, L5, L9, L5, 0              ; top = a0*b2

sfpand   0, L12, L3, 0                  ; b1 &= mask
sfpcast  L3, L3, 0                      ; b1 = int_to_fp32(b1)

sfpmad   L2, L3, L5, L5, 0              ; top += a1*b1

sfpand   0, L12, L1, 0                  ; b0 &= mask
sfpcast  L1, L1, 0                      ; b0 = int_to_fp32(b0)

sfpmad   L4, L1, L5, L5, 0              ; top += a2*b0
sfpmad   L0, L3, L9, L6, 0              ; mid  = a0*b1
sfpmad   L0, L1, L9, L7, 0              ; low  = a0*b0
sfpmad   L2, L1, L6, L6, 0              ; mid += a1*b0

; low = fp32_to_int(low)
sfpexexp 0, L7, L0, 2|8                 ; extract unbiased exponent, disable lanes where Exp<0
sfpexman 0, L7, L7, 0                   ; extract mantissa with implicit bit set (even for zero!)
sfpiadd  -23, L0, L0, 1|4               ; calculate shift amount (shift right 23)
sfpshft2 L7, L0, L7, 5                  ; logical shift of mantissa by amount
sfpencc  0, 0, 0, 0                     ; re-enable lanes

; mid = fp32_to_int(mid) << 11
sfpexexp 0, L6, L0, 2|8                 ; extract unbiased exponent, disable lanes where Exp<0
sfpexman 0, L6, L6, 0                   ; extract mantissa with implicit bit set (even for zero!)
sfpiadd  11-23, L0, L0, 1|4             ; calculate shift amount (shift right 23; left 11)
sfpshft2 L6, L0, L6, 5                  ; logical shift of mantissa by amount
sfpencc  0, 0, 0, 0                     ; re-enable lanes

; top = fp32_to_int(top) << 22
sfpexexp 0, L5, L0, 2|8                 ; extract unbiased exponent, disable lanes where Exp<0
sfpexman 0, L5, L5, 0                   ; extract mantissa with implicit bit set (even for zero!)
sfpiadd  22-23, L0, L0, 1|4             ; calculate shift amount (shift right 23; left 22)
sfpshft2 L5, L0, L5, 5                  ; logical shift of mantissa by amount
sfpencc  0, 0, 0, 0                     ; re-enable lanes

sfpiadd  0, L6, L7, 4                   ; low += mid
sfpiadd  0, L5, L7, 4                   ; low += top

sfpstore L7, INT32, ADDR_MOD_2, offset2 ; store(low, offset2) with autoincrement

Conclusion

Leveraging SFPLOADMACRO, we achieve 8/32c throughput for 32-bit integer multiplication on Blackhole. On Wormhole, using SFPMAD with 11-bit chunks achieves 40/32c throughput.

Acknowledgements

Thanks to Tenstorrent for sponsoring this work.