23rd December, 2025
Below are optimised implementations of various typecast operations for Tenstorrent’s AI accelerators.
The hardware supports round to nearest with ties away from zero, e.g. via SFPSTOCHRND or during early format conversion via packers.
However, we prefer round to nearest with ties to even: it doesn’t have a systematic bias, and is generally what most software uses.
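To make the bias claim concrete, here is a small Python sketch (not hardware code; the helper names are made up for illustration). It rounds a few exact positive ties both ways and sums the rounding errors.

```python
import math

def round_ties_away(x: float) -> int:
    """Round to nearest integer, ties away from zero (illustrative helper)."""
    return int(math.copysign(math.floor(abs(x) + 0.5), x))

def round_ties_even(x: float) -> int:
    """Round to nearest integer, ties to even (Python's built-in round)."""
    return round(x)

ties = [0.5, 1.5, 2.5, 3.5]  # exactly halfway between two integers
bias_away = sum(round_ties_away(t) - t for t in ties)
bias_even = sum(round_ties_even(t) - t for t in ties)
print(bias_away)  # 2.0: every tie was pushed up in magnitude
print(bias_even)  # 0.0: the errors cancel
```

With ties away from zero, every tie grows in magnitude, so the errors accumulate; with ties to even they alternate in sign and cancel on average.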
Implementation is fairly simple:
v += ((v >> 16) & 1) + 0x7fff
v >>= 16 # truncation
We can achieve a throughput of 3 cycles per input row using SFPLOADMACRO.
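As a sanity check of the two-step rule above, here is a hedged Python model that applies it to raw fp32 bit patterns. The function names are ours, and NaN inputs are not treated specially in this sketch.

```python
import struct

def fp32_bits(x: float) -> int:
    """Return the IEEE 754 binary32 bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def fp32_to_bf16_rne(v: int) -> int:
    """Apply the two-step rule to a 32-bit pattern; returns the 16-bit bf16 pattern."""
    v = (v + ((v >> 16) & 1) + 0x7FFF) & 0xFFFF_FFFF  # wrap like a 32-bit register
    return v >> 16  # truncation

print(hex(fp32_to_bf16_rne(fp32_bits(1.0))))  # 0x3f80: exact, unchanged
print(hex(fp32_to_bf16_rne(0x3F80_8000)))     # 0x3f80: tie, stays on the even value
print(hex(fp32_to_bf16_rne(0x3F81_8000)))     # 0x3f82: tie, moves up to the even value
print(hex(fp32_to_bf16_rne(0x3F80_8001)))     # 0x3f81: just above the tie, rounds up
```

The `(v >> 16) & 1` term is what breaks ties: it is 1 only when the truncated result would be odd, which turns the `0x7fff` bias into a full `0x8000`.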
Note that the simple and round sub-units can only be used simultaneously if one has VD=16 and the other VD≠16. In our case, the final SFPIADD has VD=16, allowing SFPSHFT2 to be used simultaneously for the next row.
Note that we also store ((v >> 16) & 1) to the low 16 bits of the fp32 Dst register. This is only necessary to prevent double rounding by packers; it should also be possible to set packers to use truncation instead.
This converts fp32 to i32 with truncation.
Exp < 0: result = 0
Exp ≥ 31: result = overflow; we use INT_MIN as a sentinel to match PyTorch
Otherwise: result = Man << (Exp-23), including implicit bit
The trick here is to take advantage of the fact that both SFPEXEXP and SFPIADD have the ability to set lane flags, which saves having to use SFPSETCC. Careful ordering of these conditionals means that we can nest them to reduce the number of SFPENCC or SFPCOMPC instructions.
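For cross-checking the three cases above on a host machine, here is a hedged Python reference model. It is not the hardware sequence: the function name is ours, a negative shift amount is modelled as a logical right shift, and inputs are assumed to be representable as fp32.

```python
import struct

INT_MIN = -2**31

def fp32_to_i32_trunc(x: float) -> int:
    """Reference model of fp32 -> i32 with truncation and an INT_MIN sentinel."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31
    exp = ((bits >> 23) & 0xFF) - 127      # unbiased exponent
    man = (bits & 0x7F_FFFF) | 0x80_0000   # mantissa including implicit bit
    result = 0
    if exp >= 0:
        result = INT_MIN                   # overflow sentinel
        if exp < 31:
            shift = exp - 23
            result = man << shift if shift >= 0 else man >> -shift
    if sign:
        result = -result
    result &= 0xFFFF_FFFF                  # wrap to 32 bits
    return result - 2**32 if result >= 2**31 else result

print(fp32_to_i32_trunc(-1.9))          # -1
print(fp32_to_i32_trunc(2147483648.0))  # -2147483648 (sentinel)
```

The final wrap to 32 bits is what makes the negation harmless for the sentinel: negating INT_MIN in two's complement gives INT_MIN again.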
result = 0
exp = in.Exp
if exp >= 0:
    result = INT_MIN
    if exp < 31:
        result = in.Man << (exp - 23)
if in < 0:
    result = -result # idempotent if result = INT_MIN
This results in the following 13-cycle sequence:
sfpload L0, 0, ADDR_MOD_7, 0 ; load fp32/bf16
sfploadi L1, 0, 0 ; result = 0
sfpexexp 0, L0, L2, CC_SGN|CC_COMP ; exp = in.Exp (LaneEnabled = exp >= 0)
sfploadi L1, FLOATB, 0x8000 ; result = INT_MIN (identical to fp32 -0.0)
sfpiadd -31, L2, L2, IMM|CC_LT0 ; exp -= 31 (LaneEnabled = exp < 31)
sfpiadd 8, L2, L2, IMM|CC_NONE ; exp += 8
sfpexman 0, L0, L1, 0 ; result = in.Man, including implicit bit
sfpshft 0, L2, L1, 0 ; result <<= exp
sfpencc 0, 0, 0, 0 ; LaneEnabled = true
sfpsetcc 0, L0, 0, LREG_LT0 ; LaneEnabled = in < 0
sfpiadd 0, L9, L1, NEG|CC_NONE ; result = 0 - result (two's complement; L9 is constant 0)
sfpencc 0, 0, 0, 0 ; LaneEnabled = true
sfpstore L1, INT32, ADDR_MOD_6, 0 ; store i32 (autoincrement addr)
Converting fp32 to u32 is very similar to the i32 case.
in ≤ 0 or Exp < 0: result = 0
Exp ≥ 32: result = 0xffff_ffff
Otherwise: result = Man << (Exp-23), including implicit bit

result = 0
exp = in.Exp
if in >= 0 and exp >= 0:
    result = 0xffff_ffff
    if exp < 32:
        result = in.Man << (exp - 23)
This results in the following 11-cycle sequence:
sfpload L0, 0, ADDR_MOD_7, 0 ; load fp32/bf16
sfploadi L1, 0, 0 ; result = 0
sfpsetcc 0, L0, 0, LREG_GTE0 ; LaneEnabled = in ≥ 0
sfpexexp 0, L0, L2, CC_SGN|CC_COMP ; exp = in.Exp (LaneEnabled = exp >= 0)
sfploadi L1, SHORT, -1 ; result = 0xffffffff
sfpiadd -32, L2, L2, IMM|CC_LT0 ; exp -= 32 (LaneEnabled = exp < 32)
sfpiadd 9, L2, L2, IMM|CC_NONE ; exp += 9
sfpexman 0, L0, L1, 0 ; result = in.Man, including implicit bit
sfpshft 0, L2, L1, 0 ; result <<= exp
sfpencc 0, 0, 0, 0 ; LaneEnabled = true
sfpstore L1, INT32, ADDR_MOD_6, 0 ; store i32 (autoincrement addr)
This is simpler, as we can use SFPSTOCHRND, though the rounding mode is round to nearest with ties away from zero.
We still need to detect negative numbers and clamp to zero, since
SFPSTOCHRND takes the absolute value before clamping.
The most straightforward way to do this is via SFPSWAP
in min/max mode with the constant register containing 0.0.
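To see why the clamp has to happen before the rounding step, here is a hedged Python model. `rnd_model` is our assumption of the behaviour described above (absolute value, round to nearest with ties away from zero, clamp); `max_value` is a hypothetical parameter, and the default 0xFFFF assumes a 16-bit unsigned target. NaN inputs are not handled.

```python
import math

def rnd_model(x: float, max_value: int = 0xFFFF) -> int:
    """Assumed model of the rounding step: operates on |x|, rounds to nearest
    with ties away from zero, and saturates at max_value."""
    # Clamping before rounding gives the same result and also handles infinity.
    magnitude = min(abs(x), float(max_value))
    return math.floor(magnitude + 0.5)

def convert(x: float, max_value: int = 0xFFFF) -> int:
    """Clamp negatives to zero first (the max(v, 0.0) step), then round."""
    return rnd_model(max(x, 0.0), max_value)

print(rnd_model(-3.2))  # 3: the sign is lost without the max() step
print(convert(-3.2))    # 0
print(convert(2.5))     # 3: tie rounds away from zero
print(convert(1e9))     # 65535: saturates
```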
When scheduling via SFPLOADMACRO,
we also avoid the automatic stalling behaviour, achieving a throughput
of 2 cycles per input row. Moreover, scheduling
SFPSTOCHRND via SFPLOADMACRO allows us to
schedule it for the same time as SFPSWAP using
VD=16.
v = load()
v = max(v, 0.0)
L16 = rnd(v)
store(L16)
For u32 → fp32, the hardware supports casting from integer to fp32 via SFPCAST, but the source is treated as a sign-magnitude integer, so bit 31 has to be handled separately.
v = load()
a = v & 0x7fff_ffff # make safe for sign-magnitude by zeroing the sign bit
a = cast_int_float(a) # convert to fp32
if v < 0: # check sign bit
    a += 2.0**31 # if Sign==1, then add missing 2.0**31
store(a)
The above is around 7 cycles. We can do better, achieving a throughput of 3 cycles per input row via SFPLOADMACRO.
Instead of conditionally enabling/disabling lanes, we use SFPMAD’s
indirect mode to conditionally add 2.0**31 if
Sign==1, otherwise add 0.0.
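The branch-free idea can be modelled on the host as a table lookup indexed by the sign bit. This is a sketch under assumptions: `cast_sign_magnitude` is our stand-in for a sign-magnitude integer cast with round to nearest, ties to even, and the two-entry `table` stands in for the registers holding 0.0 and 2.0**31.

```python
import struct

def to_fp32(x: float) -> float:
    """Round a Python float to the nearest fp32 value (ties to even)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def cast_sign_magnitude(bits: int) -> float:
    """Model of a sign-magnitude int -> fp32 cast: bit 31 is the sign."""
    magnitude = to_fp32(float(bits & 0x7FFF_FFFF))
    return -magnitude if bits >> 31 else magnitude

def u32_to_fp32(v: int) -> float:
    """Branch-free model: add table[sign bit] to the cast of the low 31 bits."""
    table = [0.0, 2.0**31]
    b = cast_sign_magnitude(v & 0x7FFF_FFFF)  # sign bit cleared before the cast
    return to_fp32(table[v >> 31] * 1.0 + b)

print(cast_sign_magnitude(0x8000_0001))  # -1.0: the naive cast misreads bit 31
print(u32_to_fp32(0x8000_0001))          # 2147483648.0
print(u32_to_fp32(0xFFFF_FFFF))          # 4294967296.0
```

Because the lookup replaces the conditional, every lane executes the same multiply-add regardless of its sign bit.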
L0 = 0.0
L1 = 2.0**31
a = load()
b = load()
L7 = load()
a = setsgn(a, 0)
b = cast_int_float(a)
L7 >>= 31
L16 = L[L7]*1.0 + b
store(L16)
Converting i32 to fp32 is similar to the u32 case, but this time we need to use two’s complement negation to obtain the correct magnitude if the input is negative. Thankfully, this is easily achieved via SFPABS.
However, note that abs(INT_MIN) = INT_MIN, i.e. it
remains -2**31 and becomes -0.0 when treated
as sign-magnitude. We use the same trick with SFPMAD’s
indirect mode, this time only detecting the INT_MIN
case.
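The INT_MIN corner case is easy to reproduce in a host-side model. The sketch below is an assumption-laden illustration, not the hardware sequence: `cast_sign_magnitude` is our stand-in for the sign-magnitude cast, and the table holds 0.0 and -2.0**31, the correction needed when the cast yields -0.0.

```python
import math
import struct

def to_fp32(x: float) -> float:
    """Round a Python float to the nearest fp32 value (ties to even)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def cast_sign_magnitude(bits: int) -> float:
    """Model of a sign-magnitude int -> fp32 cast: bit 31 is the sign."""
    magnitude = to_fp32(float(bits & 0x7FFF_FFFF))
    return -magnitude if bits >> 31 else magnitude

def i32_to_fp32(v: int) -> float:
    t = abs(v) & 0xFFFF_FFFF       # 32-bit abs: abs(INT_MIN) stays 0x8000_0000
    l7 = t >> 31                   # 1 iff v == INT_MIN
    a = cast_sign_magnitude(t)     # INT_MIN becomes -0.0 here
    a = math.copysign(a, -1.0 if v < 0 else 1.0)  # copy v's sign onto a
    table = [0.0, -2.0**31]
    return to_fp32(table[l7] * 1.0 + a)

print(i32_to_fp32(-5))      # -5.0
print(i32_to_fp32(-2**31))  # -2147483648.0
```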
v = load()
a = abs(v) # two's complement negation for v<0
L7 = a >> 31 # L7==1 iff v==INT_MIN
a = cast_int_float(a) # convert to fp32
a = setsgn(a, v) # copy v.Sign to a
if L7 == 1: # handle v=INT_MIN
    a += -2.0**31 # if v==INT_MIN, then a==-0.0 and we need -2.0**31
store(a)
The above is around 8 cycles. This time, we achieve a throughput of 4 cycles per input row via SFPLOADMACRO.
L0 = 0.0
L1 = -2.0**31
v = load()
t = abs(v)
L7 = t >> 31
t = cast_int_float(t)
v = setsgn(t, v)
L16 = L[L7]*1.0 + v
store(L16)
This is almost identical to the u32 → fp32 case, except we want to use SFPSTOCHRND to round the result to bf16 (round to nearest with ties away from zero).
We retain the throughput of 3 cycles per input row.
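As a host-side illustration of how the final rounding differs from ties to even, here is a hedged Python model of the whole u32 → bf16 path. The bf16 rounding step is modelled by adding 0x8000 to the fp32 bit pattern and truncating, which is our assumption for "round to nearest with ties away from zero"; the function names are made up.

```python
import struct

def to_fp32(x: float) -> float:
    """Round a Python float to the nearest fp32 value (ties to even)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def rnd_bf16_ties_away(x: float) -> int:
    """Assumed model: fp32 -> bf16 bit pattern, ties away from zero."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return ((bits + 0x8000) & 0xFFFF_FFFF) >> 16

def u32_to_bf16(v: int) -> int:
    table = [0.0, 2.0**31]                  # selected by the sign bit
    b = to_fp32(float(v & 0x7FFF_FFFF))     # cast of the low 31 bits
    s = to_fp32(table[v >> 31] * 1.0 + b)   # add back 2.0**31 if bit 31 was set
    return rnd_bf16_ties_away(s)

print(hex(u32_to_bf16(1)))            # 0x3f80 (1.0)
print(hex(u32_to_bf16(257)))          # 0x4381 (258.0): the tie rounds away from zero
print(hex(u32_to_bf16(0x8000_0000)))  # 0x4f00 (2.0**31)
```

The input 257 sits exactly halfway between the bf16 values 256 and 258; ties to even would give 256, whereas this path gives 258.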
L0 = 0.0
L1 = 2.0**31
v = load()
L7 = v >> 31
v = setsgn(v, 0)
v = cast_int_float(v)
v = L[L7]*1.0 + v
L16 = rnd(v)
store(L16)
The v register is alive for longer than 4 cycles, so we alternate between two registers.
Essentially identical to the i32 → fp32 case, except we add rounding (round to nearest with ties away from zero). We retain the throughput of 4 cycles per input row.
v = load()
t = abs(v) # two's complement negation for v<0
L7 = t >> 31 # stash sign bit; L7==1 iff v==INT_MIN
t = cast_int_float(t) # convert to fp32
v = setsgn(t, v) # v = {v.Sign,t.Exp,t.Man}
if L7 == 1: # check sign bit to handle v=INT_MIN
    v += -2.0**31 # if L7==1, then v==-0.0; add missing -2.0**31
v = rnd(v)
store(v)
This is trivial: 1 cycle per input row using SFPCAST; round to nearest with ties to even.
Similarly trivial: 1 cycle per input row using SFPCAST and SFPSTOCHRND, with the latter making this round to nearest with ties away from zero.
Thanks to Tenstorrent for sponsoring this work.
Assembly syntax: the destination register is the last register specified, e.g. sfpmad A,B,C,D means D=A*B+C.