Typecast on Tenstorrent

23rd December, 2025

Introduction

Below are optimised implementations of various typecast operations for Tenstorrent’s AI accelerators.

Float → Float

fp32 → bf16

The hardware supports round to nearest with ties away from zero, e.g. via SFPSTOCHRND or during early format conversion via packers.

However, we prefer round to nearest with ties to even: it doesn’t have a systematic bias, and is generally what most software uses.

The implementation is fairly simple, operating on the raw fp32 bit pattern v:

v += ((v >> 16) & 1) + 0x7fff
v >>= 16 # truncation
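As a rough illustration, the two lines above can be modelled in Python on raw bit patterns. The helper names (f32_bits, fp32_to_bf16_rne) are hypothetical, and the sketch does not treat special values such as NaN separately.

```python
import struct

def f32_bits(x: float) -> int:
    """Bit pattern of x after rounding it to fp32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def fp32_to_bf16_rne(v: int) -> int:
    """Round an fp32 bit pattern to a bf16 bit pattern, ties to even."""
    # Bias is 0x7fff when the kept LSB is even and 0x8000 when it is odd,
    # so an exact tie only rounds up if that makes the kept LSB even.
    v += ((v >> 16) & 1) + 0x7FFF
    return (v >> 16) & 0xFFFF  # truncate to the high 16 bits

exact = fp32_to_bf16_rne(f32_bits(1.0))   # 0x3f80, already representable
tie_even = fp32_to_bf16_rne(0x3F808000)   # 0x3f80, tie with even kept LSB
tie_odd = fp32_to_bf16_rne(0x3F818000)    # 0x3f82, tie with odd kept LSB
```

The two tie cases show the difference from ties away from zero, which would round both of them up.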

We can achieve a throughput of 3 cycles per input row using SFPLOADMACRO. Note that the simple and round sub-units can only be used simultaneously if one has VD=16 and the other VD≠16. In our case, the final SFPIADD has VD=16, allowing SFPSHFT2 to be used simultaneously for the next row.

Note that we also store ((v >> 16) & 1) to the low 16 bits of the fp32 Dst register. This is only necessary to prevent double rounding by packers; it should also be possible to set packers to use truncation instead.

Float → Int

fp32/bf16 → i32

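This converts fp32 (or bf16) to i32 with truncation, i.e. rounding toward zero. Inputs whose magnitude is 2**31 or greater produce INT_MIN regardless of sign.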

The trick here is to take advantage of the fact that both SFPEXEXP and SFPIADD have the ability to set lane flags, which saves having to use SFPSETCC. Careful ordering of these conditionals means that we can nest them to reduce the number of SFPENCC or SFPCOMPC instructions.

result = 0
exp = in.Exp

if exp >= 0:
  result = INT_MIN
  if exp < 31:
    result = in.Man << (exp - 23)

if in < 0:
  result = -result # no-op if result == INT_MIN (two's complement negation wraps)

This results in the following 13 cycle sequence:

sfpload  L0, 0, ADDR_MOD_7, 0       ; load fp32/bf16
sfploadi L1, 0, 0                   ; result = 0

sfpexexp  0, L0, L2, CC_SGN|CC_COMP ; exp = in.Exp (LaneEnabled = exp >= 0)
sfploadi L1, FLOATB, 0x8000         ; result = INT_MIN (identical to fp32 -0.0)
sfpiadd -31, L2, L2, IMM|CC_LT0     ; exp -= 31 (LaneEnabled = exp < 31)
sfpiadd   8, L2, L2, IMM|CC_NONE    ; exp += 8
sfpexman  0, L0, L1, 0              ; result = in.Man, including implicit bit
sfpshft   0, L2, L1, 0              ; result <<= exp
sfpencc   0,  0,  0, 0              ; LaneEnabled = true

sfpsetcc  0, L0,  0, LREG_LT0       ; LaneEnabled = in < 0
sfpiadd   0, L9, L1, NEG|CC_NONE    ; result = 0 - result (two's complement; L9 is constant 0)
sfpencc   0,  0,  0, 0              ; LaneEnabled = true

sfpstore L1, INT32, ADDR_MOD_6, 0   ; store i32 (autoincrement addr)
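For checking expected results on a host, the pseudocode above can be sketched as a Python reference model. This is only a model of the pseudocode, not of the instruction sequence; the function name fp32_to_i32 is hypothetical, and in.Exp is assumed to be the unbiased exponent.

```python
import struct

INT_MIN = -2**31

def fp32_to_i32(x: float) -> int:
    """Truncate toward zero; out-of-range inputs give INT_MIN."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    exp = ((bits >> 23) & 0xFF) - 127    # in.Exp (assumed unbiased)
    man = (bits & 0x7FFFFF) | 0x800000   # in.Man, including implicit bit
    result = 0
    if exp >= 0:
        result = INT_MIN
        if exp < 31:
            shift = exp - 23             # negative means shift right
            result = man << shift if shift >= 0 else man >> -shift
    if bits >> 31:                       # in < 0 (sign bit set)
        result = -result
        if result == 2**31:              # -INT_MIN wraps back to INT_MIN
            result = INT_MIN
    return result
```

For example, fp32_to_i32(-1.5) gives -1, while fp32_to_i32(2.0**31) and fp32_to_i32(float("inf")) both give INT_MIN.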

fp32/bf16 → u32

Very similar to the i32 case, except that negative inputs produce 0 and inputs of 2**32 or more produce 0xffff_ffff.

result = 0
exp = in.Exp

if in >= 0 and exp >= 0:
  result = 0xffff_ffff
  if exp < 32:
    result = in.Man << (exp - 23)

This results in the following 11 cycle sequence:

sfpload  L0,  0, ADDR_MOD_7, 0      ; load fp32/bf16
sfploadi L1,  0,  0                 ; result = 0

sfpsetcc  0, L0,  0, LREG_GTE0      ; LaneEnabled = in >= 0
sfpexexp  0, L0, L2, CC_SGN|CC_COMP ; exp = in.Exp (LaneEnabled = exp >= 0)
sfploadi L1, SHORT, -1              ; result = 0xffffffff
sfpiadd -32, L2, L2, IMM|CC_LT0     ; exp -= 32 (LaneEnabled = exp < 32)
sfpiadd   9, L2, L2, IMM|CC_NONE    ; exp += 9
sfpexman  0, L0, L1, 0              ; result = in.Man, including implicit bit
sfpshft   0, L2, L1, 0              ; result <<= exp
sfpencc   0,  0,  0, 0              ; LaneEnabled = true

sfpstore L1, INT32, ADDR_MOD_6, 0   ; store u32 (autoincrement addr)
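A host-side reference model of the u32 pseudocode might look as follows. The name fp32_to_u32 is hypothetical, and the "in >= 0" test is assumed to mean that the sign bit is clear.

```python
import struct

def fp32_to_u32(x: float) -> int:
    """Truncate toward zero; negatives give 0, overflow gives 0xffff_ffff."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    exp = ((bits >> 23) & 0xFF) - 127    # in.Exp (assumed unbiased)
    man = (bits & 0x7FFFFF) | 0x800000   # in.Man, including implicit bit
    result = 0
    if (bits >> 31) == 0 and exp >= 0:   # assumed: "in >= 0" is sign bit clear
        result = 0xFFFF_FFFF
        if exp < 32:
            shift = exp - 23             # negative means shift right
            result = man << shift if shift >= 0 else man >> -shift
    return result
```

The largest fp32 value below 2**32 is 4294967040.0, which this model converts exactly; 2.0**32 itself takes the saturating path.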

fp32/bf16 → u16

This is simpler, as we can use SFPSTOCHRND, though the rounding mode is round to nearest with ties away from zero.

We still need to detect negative numbers and clamp to zero, since SFPSTOCHRND takes the absolute value before clamping.

The most straightforward way to do this is via SFPSWAP in min/max mode with the constant register containing 0.0. When scheduling via SFPLOADMACRO, we also avoid the automatic stalling behaviour, achieving a throughput of 2 cycles per input row. Moreover, scheduling SFPSTOCHRND via SFPLOADMACRO allows us to schedule it for the same time as SFPSWAP using VD=16.

v = load()
v = max(v, 0.0)
L16 = rnd(v)
store(L16)
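The intended numeric behaviour of the steps above can be sketched in Python. This is a model under assumptions rather than a description of the hardware: the upper clamp at 65535 is assumed, NaN inputs are not modelled, and the name fp32_to_u16 is hypothetical.

```python
def fp32_to_u16(x: float) -> int:
    """Clamp negatives to zero, then round to nearest, ties away from zero."""
    v = min(max(x, 0.0), 65535.0)  # max() mirrors the min/max step; upper clamp is assumed
    r = int(v)                     # truncate toward zero (v is non-negative)
    if v - r >= 0.5:               # the fractional part is computed exactly
        r += 1                     # ties round up, i.e. away from zero
    return r
```

Note that 2.5 rounds to 3 here, whereas ties to even would give 2.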

Int → Float

u32 → fp32

The hardware supports casting from integer to fp32 via SFPCAST, but the source is treated as a sign-magnitude integer.

v = load()
a = v & 0x7fff_ffff   # make safe for sign-magnitude by zeroing the sign bit
a = cast_int_float(a) # convert to fp32
if v < 0:             # check sign bit
  a += 2.0**31        # if Sign==1, then add missing 2.0**31
store(a)

The above takes around 7 cycles. We can do better, achieving a throughput of 3 cycles per input row via SFPLOADMACRO.

Instead of conditionally enabling/disabling lanes, we use SFPMAD’s indirect mode to conditionally add 2.0**31 if Sign==1, otherwise add 0.0.

L0 = 0.0
L1 = 2.0**31

a  = load()
b  = load()
L7 = load()
a = setsgn(a, 0)
b = cast_int_float(a)
L7 >>= 31
L16 = L[L7]*1.0 + b
store(L16)
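The data flow of the table-lookup version can be mirrored in Python to sanity-check expected values. The sketch assumes that both the integer cast and the multiply-add round to nearest with ties to even; the names to_f32 and u32_to_fp32 are hypothetical.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float to the nearest fp32 value (ties to even)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def u32_to_fp32(v: int) -> float:
    table = (0.0, 2.0**31)               # stand-ins for L0 and L1
    sign = v >> 31                       # L7: top bit of the u32
    a = to_f32(float(v & 0x7FFF_FFFF))   # setsgn(v, 0), then int -> fp32 (assumed RNE)
    return to_f32(table[sign] * 1.0 + a) # L[L7]*1.0 + b

top = u32_to_fp32(0xFFFF_FFFF)  # 2.0**32 after rounding
```

Indexing the table by the sign bit replaces the conditional add, so every lane executes the same instructions.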

i32 → fp32

Similar to the u32 case, but this time we need to use two’s complement negation to obtain the correct magnitude if the input is negative. Thankfully, this is easily achieved via SFPABS.

However, note that abs(INT_MIN) = INT_MIN, i.e. it remains -2**31 and becomes -0.0 when treated as sign-magnitude. We use the same trick with SFPMAD’s indirect mode, this time only detecting the INT_MIN case.

v = load()
a = abs(v)            # two's complement negation for v<0
L7 = a >> 31          # L7==1 iff v==INT_MIN
a = cast_int_float(a) # convert to fp32
a = setsgn(a, v)      # copy v.Sign to a
if L7 == 1:           # handle v=INT_MIN
  a += -2.0**31       # if v==INT_MIN, then a==-0.0 and we need -2.0**31
store(a)

The above takes around 8 cycles. This time, we achieve a throughput of 4 cycles per input row via SFPLOADMACRO.

L0 = 0.0
L1 = -2.0**31

v = load()
t = abs(v)
L7 = t >> 31
t = cast_int_float(t)
v = setsgn(t, v)
L16 = L[L7]*1.0 + v
store(L16)
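A Python model of this data flow makes the INT_MIN corner case easy to check. It assumes that L1 holds -2.0**31, matching the `a += -2.0**31` step in the first listing, and that the cast and the multiply-add round to nearest with ties to even; to_f32 and i32_to_fp32 are hypothetical names.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float to the nearest fp32 value (ties to even)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def i32_to_fp32(v: int) -> float:
    table = (0.0, -(2.0**31))              # stand-ins for L0 and L1 (L1 assumed negative)
    t = abs(v) & 0xFFFF_FFFF               # 32-bit abs: INT_MIN stays 0x8000_0000
    l7 = t >> 31                           # 1 iff v == INT_MIN
    mag = to_f32(float(t & 0x7FFF_FFFF))   # sign-magnitude cast of the low 31 bits
    signed = -mag if v < 0 else mag        # setsgn(t, v): copy the sign of v
    return to_f32(table[l7] * 1.0 + signed)
```

For v = INT_MIN the cast yields -0.0, and the table entry supplies the missing -2.0**31; for every other input the table contributes 0.0.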

u32 → bf16

This is almost identical to the u32 → fp32 case, except we want to use SFPSTOCHRND to round the result to bf16 (round to nearest with ties away from zero).

We retain the throughput of 3 cycles per input row.

L0 = 0.0
L1 = 2.0**31

v = load()
L7 = v >> 31
v = setsgn(v, 0)
v = cast_int_float(v)
v = L[L7]*1.0 + v
L16 = rnd(v)
store(L16)

The v register stays live for longer than the 3 cycles between successive rows, so we alternate between two registers.
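For comparison with the ties-to-even trick used for fp32 → bf16, the rnd step can be modelled on raw bit patterns as below. This only illustrates round to nearest with ties away from zero; it is not a claim about how SFPSTOCHRND is implemented, and the name fp32_to_bf16_rna is hypothetical.

```python
def fp32_to_bf16_rna(v: int) -> int:
    """Round an fp32 bit pattern to bf16, ties away from zero."""
    # fp32 is sign-magnitude, so adding half a bf16 ulp to the bit pattern
    # increases the magnitude on ties for both positive and negative values.
    return ((v + 0x8000) >> 16) & 0xFFFF

pos_tie = fp32_to_bf16_rna(0x3F808000)  # 0x3f81, where ties to even gives 0x3f80
neg_tie = fp32_to_bf16_rna(0xBF808000)  # 0xbf81, rounded away from zero
```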

i32 → bf16

Essentially identical to the i32 → fp32 case, except we add rounding (round to nearest with ties away from zero). We retain the throughput of 4 cycles per input row.

v = load()
t = abs(v)            # two's complement negation for v<0
L7 = t >> 31          # stash sign bit; L7==1 iff v==INT_MIN
t = cast_int_float(t) # convert to fp32
v = setsgn(t, v)      # v = {v.Sign,t.Exp,t.Man}
if L7 == 1:           # check sign bit to handle v=INT_MIN
  v += -2.0**31       # if L7==1, then v==-0.0; add missing -2.0**31
v = rnd(v)
store(v)

u16 → fp32

This is trivial: 1 cycle per input row using SFPCAST in round to nearest with ties to even mode. Every u16 value is exactly representable in fp32, so the conversion is exact.

u16 → bf16

Similarly trivial: 1 cycle per input row using SFPCAST and SFPSTOCHRND, with the latter making this round to nearest with ties away from zero.

Acknowledgements

Thanks to Tenstorrent for sponsoring this work.

Notes

Assembly syntax: the destination register is the last register specified, e.g. sfpmad A,B,C,D means D=A*B+C.