23rd December, 2025
Below are optimised implementations of various typecast operations for Tenstorrent’s AI accelerators.
The hardware supports round to nearest with ties away from zero, e.g. via SFPSTOCHRND or during early format conversion via packers.
However, we prefer round to nearest with ties to even: it doesn’t have a systematic bias, and is generally what most software uses.
Implementation is fairly simple:
v += ((v >> 16) & 1) + 0x7fff
v >>= 16 # truncation
We can achieve a throughput of 3 cycles per input row using SFPLOADMACRO.
Note that the simple and round sub-units can only be
used simultaneously if one has VD=16 and the other
VD≠16. In our case, the final SFPIADD
has VD=16, allowing SFPSHFT2
to be used simultaneously for the next row.
Note that we also store ((v >> 16) & 1) to the
low 16 bits of the fp32 Dst register. This is only
necessary to prevent double rounding by packers; it should also be
possible to set packers to use truncation instead.
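As a quick sanity check of the rounding step `v += ((v >> 16) & 1) + 0x7fff`, the following Python sketch applies it to raw fp32 bit patterns and contrasts it with a ties-away-from-zero variant. The function names are illustrative only (they are not part of any Tenstorrent API), and NaN/infinity inputs are deliberately ignored.

```python
def fp32_to_bf16_rne(bits: int) -> int:
    """Round an fp32 bit pattern to bf16 with ties to even; returns the 16-bit pattern."""
    # Add 0x7fff, plus 1 if the bit that will become the bf16 LSB is already set.
    bits = (bits + ((bits >> 16) & 1) + 0x7FFF) & 0xFFFFFFFF
    return bits >> 16  # truncation


def fp32_to_bf16_ties_away(bits: int) -> int:
    """Same conversion, but ties round away from zero (add half an ulp, then truncate)."""
    return ((bits + 0x8000) & 0xFFFFFFFF) >> 16


# 0x3F808000 lies exactly halfway between bf16 0x3F80 and bf16 0x3F81.
print(hex(fp32_to_bf16_rne(0x3F808000)))        # 0x3f80 (even mantissa wins)
print(hex(fp32_to_bf16_ties_away(0x3F808000)))  # 0x3f81
```

The two variants only disagree on exact ties, which is where ties-away picks up its systematic bias.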
This converts fp32 to i32 with truncation.
Exp < 0: result = 0
Exp ≥ 31: result = overflow; we use INT_MIN as a sentinel to match PyTorch
Otherwise: result = Man << (Exp-23), including implicit bit
The trick here is to take advantage of the fact that both SFPEXEXP
and SFPIADD
have the ability to set lane flags, which saves having to use SFPSETCC.
Careful ordering of these conditionals means that we can nest them to
reduce the number of SFPENCC
or SFPCOMPC
instructions.
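The case analysis above can be modelled in plain Python as a reference for testing. This is a minimal sketch, not the SFPU code: the function name is hypothetical, the input is rounded to fp32 via `struct`, and a negative shift amount is assumed to mean a right shift.

```python
import struct

INT_MIN = -2**31


def fp32_to_i32_trunc(x: float) -> int:
    """Truncating fp32 -> i32; out-of-range inputs yield the INT_MIN sentinel."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    exp = ((bits >> 23) & 0xFF) - 127    # unbiased exponent
    man = (bits & 0x7FFFFF) | 0x800000   # mantissa including the implicit bit
    result = 0
    if exp >= 0:
        result = INT_MIN
        if exp < 31:
            shift = exp - 23
            # Assumption: a negative shift amount shifts right.
            result = man << shift if shift >= 0 else man >> -shift
    if bits >> 31:                       # input is negative
        result = -result
    # Wrap to 32-bit two's complement, so negating INT_MIN gives INT_MIN again.
    result &= 0xFFFFFFFF
    return result - (1 << 32) if result & 0x80000000 else result


print(fp32_to_i32_trunc(-1.5))     # -1
print(fp32_to_i32_trunc(2.0**31))  # -2147483648
```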
result = 0
exp = in.Exp
if exp >= 0:
    result = INT_MIN
    if exp < 31:
        result = in.Man << (exp - 23)
if in < 0:
    result = -result # no-op if result = INT_MIN (or 0)
This results in the following 13-cycle sequence:
sfpload L0, 0, ADDR_MOD_7, 0 ; load fp32/bf16
sfploadi L1, 0, 0 ; result = 0
sfpexexp 0, L0, L2, CC_SGN|CC_COMP ; exp = in.Exp (LaneEnabled = exp >= 0)
sfploadi L1, FLOATB, 0x8000 ; result = INT_MIN (identical to fp32 -0.0)
sfpiadd -31, L2, L2, IMM|CC_LT0 ; exp -= 31 (LaneEnabled = exp < 31)
sfpiadd 8, L2, L2, IMM|CC_NONE ; exp += 8
sfpexman 0, L0, L1, 0 ; result = in.Man, including implicit bit
sfpshft 0, L2, L1, 0 ; result <<= exp
sfpencc 0, 0, 0, 0 ; LaneEnabled = true
sfpsetcc 0, L0, 0, LREG_LT0 ; LaneEnabled = in < 0
sfpiadd 0, L9, L1, NEG|CC_NONE ; result = 0 - result (two's complement; L9 is constant 0)
sfpencc 0, 0, 0, 0 ; LaneEnabled = true
sfpstore L1, INT32, ADDR_MOD_6, 0 ; store i32 (autoincrement addr)
Very similar to the i32 case.
in ≤ 0 or Exp < 0: result = 0
Exp ≥ 32: result = 0xffff_ffff
Otherwise: result = Man << (Exp-23), including implicit bit
result = 0
exp = in.Exp
if in >= 0 and exp >= 0:
    result = 0xffff_ffff
    if exp < 32:
        result = in.Man << (exp - 23)
This results in the following 11-cycle sequence:
sfpload L0, 0, ADDR_MOD_7, 0 ; load fp32/bf16
sfploadi L1, 0, 0 ; result = 0
sfpsetcc 0, L0, 0, LREG_GTE0 ; LaneEnabled = in ≥ 0
sfpexexp 0, L0, L2, CC_SGN|CC_COMP ; exp = in.Exp (LaneEnabled = exp >= 0)
sfploadi L1, SHORT, -1 ; result = 0xffffffff
sfpiadd -32, L2, L2, IMM|CC_LT0 ; exp -= 32 (LaneEnabled = exp < 32)
sfpiadd 9, L2, L2, IMM|CC_NONE ; exp += 9
sfpexman 0, L0, L1, 0 ; result = in.Man, including implicit bit
sfpshft 0, L2, L1, 0 ; result <<= exp
sfpencc 0, 0, 0, 0 ; LaneEnabled = true
sfpstore L1, INT32, ADDR_MOD_6, 0 ; store i32 (autoincrement addr)
This is simpler, as we can use SFPSTOCHRND,
though the rounding mode is round to nearest with ties away from
zero.
We still need to detect negative numbers and clamp to zero, since
SFPSTOCHRND takes the absolute value before clamping.
The most straightforward way to do this is via SFPSWAP
in min/max mode with the constant register containing 0.0.
When scheduling via SFPLOADMACRO,
we also avoid the automatic stalling behaviour, achieving a throughput
of 2 cycles per input row. Moreover, scheduling
SFPSTOCHRND via SFPLOADMACRO allows us to
schedule it for the same time as SFPSWAP using
VD=16.
v = load()
v = max(v, 0.0)
L16 = rnd(v)
store(L16)
The hardware supports casting from integer to fp32 via SFPCAST,
but the source is treated as a sign-magnitude integer.
v = load()
a = v & 0x7fff_ffff # make safe for sign-magnitude by zeroing the sign bit
a = cast_int_float(a) # convert to fp32
if v < 0: # check sign bit
    a += 2.0**31 # if Sign==1, then add missing 2.0**31
store(a)
The above is around 7 cycles. We can do better, achieving 3
cycles throughput via SFPLOADMACRO.
Instead of conditionally enabling/disabling lanes, we use SFPMAD’s
indirect mode to conditionally add 2.0**31 if
Sign==1, otherwise add 0.0.
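To see why a table lookup can replace the conditional, here is a small Python model of the idea. It is a sketch under stated assumptions: `to_f32`, `cast_sign_magnitude` and `u32_to_fp32` are hypothetical helpers, the two-entry list stands in for registers L0 and L1, and the integer-to-float cast is assumed to round to nearest with ties to even.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (double) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def cast_sign_magnitude(bits: int) -> float:
    """Model of an int -> fp32 cast that treats its input as sign-magnitude."""
    mag = to_f32(float(bits & 0x7FFFFFFF))
    return -mag if bits >> 31 else mag


def u32_to_fp32(v: int) -> float:
    table = [0.0, 2.0**31]       # stands in for L0 and L1
    idx = v >> 31                # 1 iff the top bit of v is set
    a = v & 0x7FFFFFFF           # clear the sign bit so the cast is safe
    b = cast_sign_magnitude(a)
    return to_f32(table[idx] * 1.0 + b)  # unconditional multiply-add


print(u32_to_fp32(0x80000000))  # 2147483648.0
print(u32_to_fp32(0xFFFFFFFF))  # 4294967296.0
```

Every input takes the same path; only the addend selected by `idx` differs, which is what removes the need for lane flags.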
L0 = 0.0
L1 = 2.0**31
a = load()
b = load()
L7 = load()
a = setsgn(a, 0)
b = cast_int_float(a)
L7 >>= 31
L16 = L[L7]*1.0 + b
store(L16)
Similar to the u32 case, but this time we need to use
two’s complement negation to obtain the correct magnitude if the input
is negative. Thankfully, this is easily achieved via SFPABS.
However, note that abs(INT_MIN) = INT_MIN, i.e. it
remains -2**31 and becomes -0.0 when treated
as sign-magnitude. We use the same trick with SFPMAD’s
indirect mode, this time only detecting the INT_MIN
case.
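The INT_MIN corner case is easy to get wrong, so a small Python model may help. It is only a sketch: the helper names are hypothetical, `abs` is emulated on 32-bit two's complement values, and the table entry for index 1 is taken to be -2.0**31, since that is the value the INT_MIN lane is missing.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (double) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def i32_to_fp32(v: int) -> float:
    t = abs(v) & 0xFFFFFFFF                # two's complement abs; INT_MIN stays 0x80000000
    idx = t >> 31                          # 1 iff v == INT_MIN
    mag = to_f32(float(t & 0x7FFFFFFF))    # sign-magnitude cast of t (magnitude part)
    a = -mag if v < 0 else mag             # setsgn: copy v's sign onto the result
    table = [0.0, -(2.0**31)]              # addend selected by idx
    return to_f32(table[idx] * 1.0 + a)    # for INT_MIN: -2.0**31 + -0.0


print(i32_to_fp32(-1))      # -1.0
print(i32_to_fp32(-2**31))  # -2147483648.0
```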
v = load()
a = abs(v) # two's complement negation for v<0
L7 = a >> 31 # L7==1 iff v==INT_MIN
a = cast_int_float(a) # convert to fp32
a = setsgn(a, v) # copy v.Sign to a
if L7 == 1: # handle v=INT_MIN
    a += -2.0**31 # if v==INT_MIN, then a==-0.0 and we need -2.0**31
store(a)
The above is around 8 cycles. This time, we achieve 4
cycles throughput via SFPLOADMACRO.
L0 = 0.0
L1 = -2.0**31
v = load()
t = abs(v)
L7 = t >> 31
t = cast_int_float(t)
v = setsgn(t, v)
L16 = L[L7]*1.0 + v
store(L16)
This is almost identical to the u32 → fp32 case, except we want to
use SFPSTOCHRND
to round the result to bf16 (round to nearest with ties away from
zero).
We retain the throughput of 3 cycles per input row.
L0 = 0.0
L1 = 2.0**31
v = load()
L7 = v >> 31
v = setsgn(v, 0)
v = cast_int_float(v)
v = L[L7]*1.0 + v
L16 = rnd(v)
store(L16)
The v register is alive for longer than 4 cycles, so we
alternate between two registers.
Essentially identical to the i32 → fp32 case, except we add rounding (round to nearest with ties away from zero). We retain the throughput of 4 cycles per input row.
v = load()
t = abs(v) # two's complement negation for v<0
L7 = t >> 31 # stash sign bit; L7==1 iff v==INT_MIN
t = cast_int_float(t) # convert to fp32
v = setsgn(t, v) # v = {v.Sign,t.Exp,t.Man}
if L7 == 1: # check sign bit to handle v=INT_MIN
    v += -2.0**31 # if L7==1, then v==-0.0; add missing -2.0**31
v = rnd(v)
store(v)
This is trivial: 1 cycle per input row using SFPCAST;
round to nearest with ties to even.
Similarly trivial: 1 cycle per input row using SFPCAST
and SFPSTOCHRND,
with the latter making this round to nearest with ties away from
zero.
Here we need to clamp values larger than 0xffff.
This is relatively straightforward on Blackhole, since SFPGT
conveniently has an option to perform a comparison, and write either
-1 (0xffffffff) or 0
(0x00000000) to the destination register.
The trick is to use SFPLOAD (or
SFPLOADMACRO) to load the high 16 bits only, do the
comparison, and then use SFPOR
to saturate the result.
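The saturation trick is plain integer arithmetic, so it can be checked in Python. This is a minimal sketch with a hypothetical function name; the comparison result is modelled as an all-ones or all-zeros 32-bit mask.

```python
def u32_to_u16_saturate(v: int) -> int:
    """Clamp a u32 to 0xffff by OR-ing the low half with a comparison mask."""
    hi = v >> 16                         # high 16 bits, loaded on their own
    lo = v & 0xFFFF                      # low 16 bits
    mask = 0xFFFFFFFF if hi > 0 else 0   # -1 or 0, as written by the comparison
    return (lo | mask) & 0xFFFF          # only the low 16 bits are stored


print(hex(u32_to_u16_saturate(0x1234)))   # 0x1234
print(hex(u32_to_u16_saturate(0x12345)))  # 0xffff
```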
a = load_hi16()
b = load_lo16()
a = a > 0 ? -1 : 0
store(b | a)
This allows us to achieve a throughput of 2 cycles
per input row via SFPLOADMACRO.
On Wormhole, we could use SFPIADD
instead of SFPGT,
computing 0 - a. This would give us 0xffff in
the high 16 bits if a > 0, otherwise
0x0000.
When loading b, we use a different loading mode to write
b to the high 16 bits of the register, giving the final
result in the high 16 bits instead of the low 16 bits.
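The subtraction-based variant can be modelled the same way. Again a sketch with a hypothetical function name: registers are treated as 32-bit unsigned values, and the final store is modelled as taking the high 16 bits.

```python
def u32_to_u16_saturate_via_sub(v: int) -> int:
    """Saturate using 0 - a instead of a comparison; result lives in the high half."""
    a = v >> 16                      # high half of v, placed in the low 16 bits
    b = (v & 0xFFFF) << 16           # low half of v, placed in the high 16 bits
    a = (0 - a) & 0xFFFFFFFF         # high 16 bits are 0xffff iff a > 0
    return ((b | a) >> 16) & 0xFFFF  # store the high 16 bits


print(hex(u32_to_u16_saturate_via_sub(0x1234)))   # 0x1234
print(hex(u32_to_u16_saturate_via_sub(0x12345)))  # 0xffff
```

This works because any non-zero `a` below 0x10000 negates to a value of the form 0xffffXXXX.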
a = load_hi16() # writes to low 16 bits of register
b = load_lo16_to_hi16() # writes to high 16 bits of register
a = 0 - a # 0xffff.... if a > 0, otherwise 0x0000....
store_hi16(b | a) # store the high 16 bits
Unfortunately, this particular approach wasn’t immediately feasible
due to a conflict when bit 11 of
RISCV_DEBUG_REG_DBG_FEATURE_DISABLE is set, which prevents
writing to low bits of Dst.
All is not lost, and we can still achieve a throughput of 2 cycles per input row on Wormhole, just with a slightly more complex macro setup.
Notation: [x] means scheduled by SFPLOADMACRO with VD=x.
| t | Load | Simple | MAD | Round | Store |
| - | ---- | --------------- | --- | ---------- | ------- |
| 0 | [a] | | | | |
| 1 | ... | [a] = 0 - a | | | |
| 0 | ... | | | [a] >>= 16 | |
| 1 | [b] | | | | |
| 0 | ... | [b] L16 = b \| a | | | |
| 1 | ... | | | | [b] L16 |
Note that a’s lifetime exceeds 2 cycles, and so we have
to alternate between a=L0 and a=L1. This then
means that we require two macros for b, since it refers to
a when using SFPOR.
The trick is that SFPOR can be scheduled simultaneously
with SFPSHFT2, since one has VD=16 and the
other VD!=16.
At first glance, this seems to require 4 cycles, as we need to clamp
negative values to 0, and saturate large values to
0xffff, which would require two SFPSWAPs of two cycles each.
The trick is to note that SFPSTOCHRND
saturates large values to 0xffff when converting from fp32
to u16. It’s executed by the round sub-unit, which means it can
be scheduled simultaneously with any simple instruction as long
as one has VD=16 and the other VD!=16.
We still need to use SFPSWAP to clamp negative values to
0, since SFPSTOCHRND takes the absolute value
before converting to u16.
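A small Python model shows why the explicit clamp is still needed. It is a sketch under assumptions: the input is taken to be a signed 32-bit integer, `rnd_to_u16` is a hypothetical stand-in for the rounding step (absolute value, round to nearest with ties away from zero, saturate at 65535), and `max` stands in for the swap against 0.0.

```python
import math


def rnd_to_u16(x: float) -> int:
    """Assumed model of the rounding step: uses |x| and saturates at 0xffff."""
    return min(math.floor(abs(x) + 0.5), 0xFFFF)


def i32_to_u16(v: int) -> int:
    a = float(v)          # integer -> float (the hardware would use fp32)
    a = max(a, 0.0)       # clamp negatives to zero before rounding
    return rnd_to_u16(a)  # large values saturate here for free


print(rnd_to_u16(-3.0))   # 3, which is why the clamp is required
print(i32_to_u16(-3))     # 0
print(i32_to_u16(70000))  # 65535
```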
a = load()
a = cast_fp32(a)
sfpswap_maxmin(a, 0.0) # takes 2 cycles
a = rnd(a) # converts abs(a) to integer, saturating at 65535
store(a)
This lets us achieve 3 cycles per input row via
SFPLOADMACRO.
Thanks to Tenstorrent for sponsoring this work.
Assembly syntax: the destination register is the last register
specified, e.g. sfpmad A,B,C,D means
D=A*B+C.