Typecast on Tenstorrent

Below are optimised implementations of various typecast operations for Tenstorrent’s AI accelerators.

Float → Float

fp32 → bf16

We prefer round to nearest with ties to even: it has no systematic bias, and it is what most software uses.

v += ((v >> 16) & 1) + 0x7fff
v >>= 16 # truncation
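
The two lines above can be checked on the host with a small Python model. This is only a sketch operating on raw fp32 bit patterns; the helper names `fp32_bits` and `fp32_to_bf16_rne` are made up for illustration, and NaN inputs are not treated specially.

```python
import struct

def fp32_bits(x: float) -> int:
    """Return the IEEE-754 binary32 bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def fp32_to_bf16_rne(v: int) -> int:
    """Round an fp32 bit pattern to a bf16 bit pattern, ties to even."""
    # Add 0x7fff, plus 1 if the bf16 LSB is odd, so that exact ties
    # carry into the upper half only when that makes the result even.
    v = (v + ((v >> 16) & 1) + 0x7FFF) & 0xFFFFFFFF
    return v >> 16  # truncation

print(hex(fp32_to_bf16_rne(fp32_bits(1.0))))
```

Ties show the difference from a plain `+ 0x8000`: `0x3f808000` stays at `0x3f80` (already even), whereas `0x3f818000` moves up to `0x3f82`.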

We can achieve a throughput of 3 cycles per input row using SFPLOADMACRO. Note that the simple and round sub-units can only be used simultaneously if one has VD=16 and the other VD≠16. In our case, the final SFPIADD has VD=16, allowing SFPSHFT2 to be used simultaneously for the next row.

Note that we also store ((v >> 16) & 1) to the low 16 bits of the fp32 Dst register. This is only necessary to prevent double rounding by packers; it should also be possible to set packers to use truncation instead.

Float → Int

fp32/bf16 → i32

The trick here is to take advantage of the fact that both SFPEXEXP and SFPIADD have the ability to set lane flags, which saves having to use SFPSETCC. Careful ordering of these conditionals means that we can nest them to reduce the number of SFPENCC or SFPCOMPC instructions.

result = 0
exp = in.Exp

if exp >= 0:
  result = INT_MIN
  if exp < 31:
    result = in.Man << (exp - 23)

if in < 0:
  result = -result # no-op if result == INT_MIN (two's complement negation wraps)

sfpload  L0, 0, ADDR_MOD_7, 0       ; load fp32/bf16
sfploadi L1, 0, 0                   ; result = 0

sfpexexp  0, L0, L2, CC_SGN|CC_COMP ; exp = in.Exp (LaneEnabled = exp >= 0)
sfploadi L1, FLOATB, 0x8000         ; result = INT_MIN (identical to fp32 -0.0)
sfpiadd -31, L2, L2, IMM|CC_LT0     ; exp -= 31 (LaneEnabled = exp < 31)
sfpiadd   8, L2, L2, IMM|CC_NONE    ; exp += 8
sfpexman  0, L0, L1, 0              ; result = in.Man, including implicit bit
sfpshft   0, L2, L1, 0              ; result <<= exp
sfpencc   0,  0,  0, 0              ; LaneEnabled = true

sfpsetcc  0, L0,  0, LREG_LT0       ; LaneEnabled = in < 0
sfpiadd   0, L9, L1, NEG|CC_NONE    ; result = 0 - result (two's complement; L9 is constant 0)
sfpencc   0,  0,  0, 0              ; LaneEnabled = true

sfpstore L1, INT32, ADDR_MOD_6, 0   ; store i32 (autoincrement addr)
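
For testing, the pseudocode above can be turned into a host-side reference model. The following Python sketch is an assumption-laden illustration rather than a description of the hardware: `f32_to_i32` is a hypothetical name, a negative shift amount is modelled as a logical right shift, and all arithmetic is kept in 32 bits by masking.

```python
import struct

def f32_to_i32(x: float) -> int:
    """Reference model: truncate towards zero, out-of-range gives INT_MIN."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    exp = ((bits >> 23) & 0xFF) - 127      # unbiased exponent (in.Exp)
    man = (bits & 0x7FFFFF) | 0x800000     # mantissa incl. implicit bit (in.Man)
    result = 0
    if exp >= 0:
        result = 0x80000000                # INT_MIN
        if exp < 31:
            shift = exp - 23
            result = man << shift if shift >= 0 else man >> -shift
    if bits >> 31:                         # in < 0
        result = (-result) & 0xFFFFFFFF    # INT_MIN maps to itself
    # Reinterpret the 32-bit pattern as a signed integer.
    return result - (1 << 32) if result & 0x80000000 else result

print(f32_to_i32(-2.75), f32_to_i32(3e9))
```

Note that in this model both positive and negative overflow (and infinities) end up as INT_MIN, exactly as the pseudocode implies.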

fp32/bf16 → u32

result = 0
exp = in.Exp

if in >= 0 and exp >= 0:
  result = 0xffff_ffff
  if exp < 32:
    result = in.Man << (exp - 23)

sfpload  L0,  0, ADDR_MOD_7, 0      ; load fp32/bf16
sfploadi L1,  0,  0                 ; result = 0

sfpsetcc  0, L0,  0, LREG_GTE0      ; LaneEnabled = in ≥ 0
sfpexexp  0, L0, L2, CC_SGN|CC_COMP ; exp = in.Exp (LaneEnabled = exp >= 0)
sfploadi L1, SHORT, -1              ; result = 0xffffffff
sfpiadd -32, L2, L2, IMM|CC_LT0     ; exp -= 32 (LaneEnabled = exp < 32)
sfpiadd   9, L2, L2, IMM|CC_NONE    ; exp += 9
sfpexman  0, L0, L1, 0              ; result = in.Man, including implicit bit
sfpshft   0, L2, L1, 0              ; result <<= exp
sfpencc   0,  0,  0, 0              ; LaneEnabled = true

sfpstore L1, INT32, ADDR_MOD_6, 0   ; store u32 (autoincrement addr)
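
As with the signed case, a host-side model of the pseudocode is handy for comparing against device output. This Python sketch uses the hypothetical name `f32_to_u32`, treats "in >= 0" as a sign-bit test, and models a negative shift amount as a logical right shift; these are assumptions, not statements about the hardware.

```python
import struct

def f32_to_u32(x: float) -> int:
    """Reference model: truncate towards zero, clamp to [0, 0xffffffff]."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31
    exp = ((bits >> 23) & 0xFF) - 127      # unbiased exponent (in.Exp)
    man = (bits & 0x7FFFFF) | 0x800000     # mantissa incl. implicit bit (in.Man)
    result = 0
    if sign == 0 and exp >= 0:
        result = 0xFFFFFFFF                # saturate by default
        if exp < 32:
            shift = exp - 23
            result = man << shift if shift >= 0 else man >> -shift
    return result

print(f32_to_u32(3e9), hex(f32_to_u32(2.0**32)))
```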

fp32/bf16 → u16

This is simpler, as we can use SFPSTOCHRND, though the rounding mode is round to nearest with ties away from zero.

We still need to detect negative numbers and clamp to zero, since SFPSTOCHRND takes the absolute value before clamping.

The most straightforward way to do this is via SFPSWAP in min/max mode with the constant register containing 0.0. When scheduling via SFPLOADMACRO, we also avoid the automatic stalling behaviour, achieving a throughput of 2 cycles per input row. Moreover, scheduling SFPSTOCHRND via SFPLOADMACRO allows us to schedule it for the same time as SFPSWAP using VD=16.

v = load()
v = max(v, 0.0)
L16 = rnd(v)
store(L16)
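
To see why the max step matters, here is a small Python sketch of the behaviour described above: the rounding step is modelled as taking the absolute value, rounding to nearest with ties away from zero, and saturating at 65535. The function names are hypothetical and NaN inputs are not modelled.

```python
import math

def rnd_u16(v: float) -> int:
    """Model of the rounding step: abs, ties away from zero, saturate."""
    m = abs(v)
    if m >= 65535.0:
        return 0xFFFF
    return math.floor(m + 0.5)

def f32_to_u16(v: float) -> int:
    """Clamp negatives to zero first, then round."""
    return rnd_u16(max(v, 0.0))

print(f32_to_u16(2.5), f32_to_u16(-3.0), rnd_u16(-3.0))
```

Without the clamp, an input of -3.0 would come out as 3 rather than 0, which is the case the max step guards against.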

Int → Float

u32 → fp32

The hardware supports casting from integer to fp32 via SFPCAST, but the source is treated as a sign-magnitude integer.

v = load()
a = v & 0x7fff_ffff   # make safe for sign-magnitude by zeroing the sign bit
a = cast_int_float(a) # convert to fp32
if v < 0:             # check sign bit (v viewed as a signed value)
  a += 2.0**31        # if Sign==1, then add missing 2.0**31
store(a)

The above is around 7 cycles. We can do better, achieving 3 cycles throughput via SFPLOADMACRO.

Instead of conditionally enabling/disabling lanes, we use SFPMAD’s indirect mode to conditionally add 2.0**31 if Sign==1, otherwise add 0.0.

L0 = 0.0
L1 = 2.0**31

a  = load()
b  = load()
L7 = load()
a = setsgn(a, 0)
b = cast_int_float(a)
L7 >>= 31
L16 = L[L7]*1.0 + b
store(L16)
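
The branching and table-lookup formulations should agree on every input. The Python sketch below models both, under the assumption that the cast and the multiply-add each round to nearest with ties to even; `to_f32` and the two function names are illustrative only.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def u32_to_f32_branch(v: int) -> float:
    a = to_f32(float(v & 0x7FFFFFFF))   # cast with the sign bit cleared
    if v >> 31:                         # sign bit set
        a = to_f32(a + 2.0**31)         # add the missing 2**31
    return a

def u32_to_f32_select(v: int) -> float:
    table = [0.0, 2.0**31]              # stands in for L0, L1
    b = to_f32(float(v & 0x7FFFFFFF))
    l7 = v >> 31                        # 0 or 1, used as the register index
    return to_f32(table[l7] * 1.0 + b)

for v in (0, 1, 0x7FFFFFFF, 0x80000000, 0x80000001, 0xFFFFFFFF):
    assert u32_to_f32_branch(v) == u32_to_f32_select(v)
print(u32_to_f32_select(0xFFFFFFFF))
```

The model rounds twice for inputs with the top bit set (once in the cast, once in the add), so it is a model of this instruction sequence rather than of a single correctly rounded conversion.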

i32 → fp32

Similar to the u32 case, but this time we need to use two’s complement negation to obtain the correct magnitude if the input is negative. Thankfully, this is easily achieved via SFPABS.

However, note that abs(INT_MIN) = INT_MIN, i.e. it remains -2**31 and becomes -0.0 when treated as sign-magnitude. We use the same trick with SFPMAD’s indirect mode, this time only detecting the INT_MIN case.

v = load()
a = abs(v)            # two's complement negation for v<0
L7 = a >> 31          # L7==1 iff v==INT_MIN
a = cast_int_float(a) # convert to fp32
a = setsgn(a, v)      # copy v.Sign to a
if L7 == 1:           # handle v=INT_MIN
  a += -2.0**31       # if v==INT_MIN, then a==-0.0 and we need -2.0**31
store(a)

The above is around 8 cycles. This time, we achieve 4 cycles throughput via SFPLOADMACRO.

L0 = 0.0
L1 = -2.0**31

v = load()
t = abs(v)
L7 = t >> 31
t = cast_int_float(t)
v = setsgn(t, v)
L16 = L[L7]*1.0 + v
store(L16)
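
A host-side model of the intended result helps when validating the macro version. The Python sketch below follows the branching pseudocode (including its -2.0**31 correction for INT_MIN) and assumes round to nearest with ties to even; `to_f32` and `i32_to_f32` are hypothetical names.

```python
import math
import struct

def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def i32_to_f32(v: int) -> float:
    t = abs(v) & 0xFFFFFFFF             # 32-bit abs; abs(INT_MIN) stays 0x80000000
    l7 = t >> 31                        # 1 iff v == INT_MIN
    a = to_f32(float(t & 0x7FFFFFFF))   # sign-magnitude cast of the magnitude bits
    a = math.copysign(a, -1.0 if v < 0 else 1.0)  # copy v.Sign to a
    if l7 == 1:                         # a is -0.0 here
        a = to_f32(a + -2.0**31)
    return a

print(i32_to_f32(-2**31), i32_to_f32(2**31 - 1))
```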

u32 → bf16

This is almost identical to the u32 → fp32 case, except we want to use SFPSTOCHRND to round the result to bf16 (round to nearest with ties away from zero).

L0 = 0.0
L1 = 2.0**31

v = load()
L7 = v >> 31
v = setsgn(v, 0)
v = cast_int_float(v)
v = L[L7]*1.0 + v
L16 = rnd(v)
store(L16)

The v register is alive for longer than 4 cycles, so we alternate between two registers.
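
For reference, the expected bf16 bit patterns can be computed on the host. This Python sketch assumes the rounding step behaves as round to nearest with ties away from zero on the fp32 magnitude, which on bit patterns amounts to adding 0x8000 and truncating; the function names are illustrative and no special handling of NaN or infinity is modelled.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def f32_bits(x: float) -> int:
    return struct.unpack("<I", struct.pack("<f", x))[0]

def round_bf16_ties_away(bits: int) -> int:
    """fp32 bit pattern -> bf16 bit pattern, ties away from zero."""
    return ((bits + 0x8000) & 0xFFFFFFFF) >> 16

def u32_to_bf16(v: int) -> int:
    table = [0.0, 2.0**31]                  # stands in for L0, L1
    f = to_f32(float(v & 0x7FFFFFFF))       # cast with the sign bit cleared
    f = to_f32(table[v >> 31] * 1.0 + f)    # add 2**31 if the top bit was set
    return round_bf16_ties_away(f32_bits(f))

print(hex(u32_to_bf16(257)))
```

An input of 257 sits exactly halfway between the bf16 values 256 and 258, so ties away from zero gives 258 (`0x4381`), whereas ties to even would give 256.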

i32 → bf16

Essentially identical to the i32 → fp32 case, except we add rounding (round to nearest with ties away from zero). We retain the throughput of 4 cycles per input row.

v = load()
t = abs(v)            # two's complement negation for v<0
L7 = t >> 31          # stash sign bit; L7==1 iff v==INT_MIN
t = cast_int_float(t) # convert to fp32
v = setsgn(t, v)      # v = {v.Sign,t.Exp,t.Man}
if L7 == 1:           # check sign bit to handle v=INT_MIN
  v += -2.0**31       # if L7==1, then v==-0.0; add missing -2.0**31
v = rnd(v)
store(v)

u16 → fp32

This is trivial: 1 cycle per input row using SFPCAST; round to nearest with ties to even.

u16 → bf16

Similarly trivial: 1 cycle per input row using SFPCAST and SFPSTOCHRND, with the latter making this round to nearest with ties away from zero.

Int → Int

u32 → u16

This is relatively straightforward on Blackhole, since SFPGT can perform a comparison and write either -1 (0xffffffff) or 0 (0x00000000) to the destination register.

The trick is to use SFPLOAD (or SFPLOADMACRO) to load the high 16 bits only, do the comparison, and then use SFPOR to saturate the result.

a = load_hi16()
b = load_lo16()
a = a > 0 ? -1 : 0
store(b | a)
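
The saturation trick is easy to verify on the host. A minimal Python sketch, with a hypothetical function name and the comparison result modelled as an all-ones or all-zeros 32-bit mask:

```python
def u32_to_u16_sat(v: int) -> int:
    a = v >> 16                          # load_hi16
    b = v & 0xFFFF                       # load_lo16
    a = 0xFFFFFFFF if a > 0 else 0       # a > 0 ? -1 : 0
    return (b | a) & 0xFFFF              # store the low 16 bits

for v in (0, 0x1234, 0xFFFF, 0x10000, 0xFFFFFFFF):
    assert u32_to_u16_sat(v) == min(v, 0xFFFF)
```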

This allows us to achieve a throughput of 2 cycles per input row via SFPLOADMACRO.

On Wormhole, we could use SFPIADD instead of SFPGT, computing 0 - a. This would give us 0xffff in the high 16 bits if a > 0, otherwise 0x0000.

When loading b, we use a different loading mode to write b to the high 16 bits of the register, giving the final result in the high 16 bits instead of the low 16 bits.

a = load_hi16()         # writes to low 16 bits of register
b = load_lo16_to_hi16() # writes to high 16 bits of register
a = 0 - a               # 0xffff.... if a > 0, otherwise 0x0000....
store_hi16(b | a)       # store the high 16 bits
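
The subtraction variant can be checked the same way. In this Python sketch (hypothetical function name, 32-bit wraparound modelled by masking), any non-zero `a` in the range 1..0xffff negates to a value whose high 16 bits are all ones:

```python
def u32_to_u16_sat_hi(v: int) -> int:
    a = v >> 16                          # load_hi16: lands in the low 16 bits
    b = (v & 0xFFFF) << 16               # load_lo16_to_hi16
    a = (0 - a) & 0xFFFFFFFF             # 0xffff.... if a > 0, otherwise 0
    return ((b | a) >> 16) & 0xFFFF      # store_hi16

for v in (0, 1, 0xFFFF, 0x10000, 0x12345678, 0xFFFF0000, 0xFFFFFFFF):
    assert u32_to_u16_sat_hi(v) == min(v, 0xFFFF)
```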

Unfortunately, this particular approach wasn’t immediately feasible due to a conflict when bit 11 of RISCV_DEBUG_REG_DBG_FEATURE_DISABLE is set, which prevents writing to low bits of Dst.

All is not lost, and we can still achieve a throughput of 2 cycles per input row on Wormhole, just with a slightly more complex macro setup.

Note that a’s lifetime exceeds 2 cycles, and so we have to alternate between a=L0 and a=L1. This then means that we require two macros for b, since it refers to a when using SFPOR.

The trick is that SFPOR can be scheduled simultaneously with SFPSHFT2, since one has VD=16 and the other VD≠16.

i32 → u16

At first glance, this seems to require 4 cycles, as we need to clamp negative values to 0 and saturate large values to 0xffff, which would require two 2-cycle SFPSWAP instructions.

The trick is to note that SFPSTOCHRND saturates large values to 0xffff when converting from fp32 to u16. It’s executed by the round sub-unit, which means it can be scheduled simultaneously with any simple instruction as long as one has VD=16 and the other VD≠16.

We still need to use SFPSWAP to clamp negative values to 0, since SFPSTOCHRND takes the absolute value before converting to u16.

a = load()
a = cast_fp32(a)
sfpswap_maxmin(a, 0.0) # takes 2 cycles
a = rnd(a)             # converts abs(a) to integer, saturating at 65535
store(a)
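
The sequence above can be modelled on the host as follows. This Python sketch assumes the cast interprets its input as sign-magnitude (so negative two's complement inputs get the wrong magnitude but the right sign, which is all the clamp needs) and that the rounding step takes the absolute value, rounds ties away from zero, and saturates at 65535. The names `to_f32` and `i32_to_u16` are made up for illustration.

```python
import math
import struct

def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def i32_to_u16(v: int) -> int:
    bits = v & 0xFFFFFFFF
    a = to_f32(float(bits & 0x7FFFFFFF))    # magnitude bits, cast to fp32
    if bits >> 31:
        a = -a                              # sign-magnitude: top bit is the sign
    a = max(a, 0.0)                         # clamp negatives to zero
    return min(math.floor(abs(a) + 0.5), 0xFFFF)  # round, saturating at 65535

for v in (-2**31, -5, 0, 1234, 65535, 65536, 2**31 - 1):
    assert i32_to_u16(v) == min(max(v, 0), 0xFFFF)
```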

Acknowledgements

Notes

Assembly syntax: the destination register is the last register specified, e.g. sfpmad A,B,C,D means D=A*B+C.
