4th March, 2026
Below we optimise reciprocal(x) for Tenstorrent’s Wormhole AI
accelerator, achieving latencies of 16 cycles with
faithful rounding for fp32 and 11 cycles with correct
rounding for bf16.
The standard approach is to normalise the input value x
so that it lies within a particular range, and then use a minimax
polynomial function that minimises the maximum relative error over
that range.
Once we have this initial approximation, we apply Newton-Raphson, which approximately doubles the precision with each iteration.
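As a scalar sketch (plain Python rather than SFPI; names are ours), one Newton-Raphson step for y ≈ 1/x computes t = 1 - x*y followed by y = y + y*t, which squares the relative error:

```python
def nr_reciprocal(x, seed, iterations):
    """Refine an initial reciprocal approximation via Newton-Raphson."""
    y = seed
    for _ in range(iterations):
        t = 1.0 - x * y  # residual: zero exactly when y == 1/x
        y = y + y * t    # relative error is squared each iteration
    return y
```

Starting from a seed with relative error e, one step leaves an error of roughly e², which is the doubling of precision counted in the table below. Each step costs two SFPMADs.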
For optimisation purposes, we're interested in minimising the number
of SFPMADs
required to achieve the desired precision, measured as an approximate
number of correct bits in the following table.
| Approximation | Seed precision | N-R iterations | Result precision | SFPMAD count |
|---|---|---|---|---|
| Linear | 3 | 0 | 3 | 1 |
| Linear | 3 | 1 | 6 | 3 |
| Linear | 3 | 2 | 12 | 5 |
| Linear | 3 | 3 | 24 | 7 |
| Quadratic | 6 | 0 | 6 | 2 |
| Quadratic | 6 | 1 | 12 | 4 |
| Quadratic | 6 | 2 | 24 | 6 |
| Cubic | 9 | 0 | 9 | 3 |
For fp32 (24 bits mantissa), it’s most efficient to use a quadratic initial approximation with two Newton-Raphson iterations, which achieves faithfully-rounded results.
First, we need to scale the input value to the range
[1.0, 2.0). This is easily done by setting the raw exponent
to 127 and the sign to zero (positive). However, since
Wormhole lacks NEGATE_VA and NEGATE_VC options
for SFPMAD,
it’s more efficient to negate the input value at the same time, scaling
it to the range (-2.0, -1.0].
This is achieved in a single cycle via SFPSETMAN
to combine the sign and exponent from the constant -1.0
(predefined by the compiler to be in register LReg[11])
with the mantissa from x.
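A scalar emulation of this range reduction (plain Python; `setman` here is our stand-in for SFPSETMAN, combining the sign and exponent of one value with the mantissa of another):

```python
import struct

def f32_bits(x):
    return struct.unpack('<I', struct.pack('<f', x))[0]

def bits_f32(b):
    return struct.unpack('<f', struct.pack('<I', b))[0]

def setman(dst, man_src):
    """Keep dst's sign and exponent; take the mantissa from man_src."""
    return bits_f32((f32_bits(dst) & 0xFF800000) | (f32_bits(man_src) & 0x007FFFFF))
```

With `dst = -1.0`, any finite non-zero x lands in (-2.0, -1.0]: for example `setman(-1.0, 6.5)` gives -1.625, since 6.5 = 1.625 × 2².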
Assuming 0 < in.Exp < 255 (i.e. x is not
±0, denormal, ±infinity, or NaN),
we can rewrite our input as in = (1+m) * 2^(in.Exp - 127),
then out = 1/in = 1/(1+m) * 2^(127 - in.Exp), with
out.Exp = 127 + 127 - in.Exp = 254 - in.Exp if
m=0, or 253-in.Exp otherwise (since 1/(1+m) then lies in (0.5, 1)).
For in.Exp=0 we want
out.Exp=255, and for in.Exp=255 we want
out.Exp=0, i.e. out.Exp = 255-in.Exp; we
ignore denormals and NaN. Observe that 255-in.Exp = (~in.Exp) & 0xff.
Hence scale.{Exp=~in.Exp, Man=0} gives us the correct
(negated) scale factor for inputs x=±0 and
x=±infinity.
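These identities are easy to spot-check in scalar code (plain Python; `raw_exp` is our helper extracting the 8-bit raw exponent):

```python
import struct

def raw_exp(x):
    """Extract the raw (biased) 8-bit exponent of an fp32 value."""
    return (struct.unpack('<I', struct.pack('<f', x))[0] >> 23) & 0xFF

# 255 - e equals the bitwise complement of e within 8 bits
for e in range(256):
    assert 255 - e == (~e) & 0xFF

# m = 0: out.Exp = 254 - in.Exp; m > 0: out.Exp = 253 - in.Exp
assert raw_exp(1.0 / 2.0) == 254 - raw_exp(2.0)  # 2.0 has m = 0
assert raw_exp(1.0 / 1.5) == 253 - raw_exp(1.5)  # 1.5 has m > 0
```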
We can compute this using two operations, SFPNOT
of the input value followed by SFPSETMAN
to clear the mantissa. These two operations can be interleaved with the
SFPMAD
instructions to hide the SFPNOPs
that would otherwise be required.
Finally, observe that scale *= 0.5 leaves
scale=±0 and scale=±infinity unchanged, and otherwise
gives us our desired scale factor with raw exponent
254-in.Exp, except for the inverted sign, which is solved
by using scale *= -0.5 instead. This can be computed using the
floating-point multiply with bfloat16 immediate, SFPMULI,
again interleaved with any SFPMAD
instructions.
```
scale = ~in    # scale.{Sign,Exp,Man} = {!x.Sign, 255-x.Exp, ~x.Man} via sfpnot
scale.Man = 0
scale *= -0.5  # scale.{Sign,Exp} = {x.Sign, scale.Exp-1 = 254-x.Exp}, when finite and non-zero
```

SFPMAD flushes -0.0 to +0.0 on Wormhole, and so does
SFPSTOCHRND.
If we want to preserve signed zeros, we can use SFPSETSGN
at the cost of an extra cycle.
We ignore denormals as these aren’t supported by Tenstorrent and tend to be flushed to zero.
Note that NaN is not preserved with this
approach and will result in zero.
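Putting the three steps together, a scalar emulation of the scale computation (plain Python; `make_scale` and the bit helpers are our names, not SFPI):

```python
import struct

def f2b(x):
    return struct.unpack('<I', struct.pack('<f', x))[0]

def b2f(b):
    return struct.unpack('<f', struct.pack('<I', b))[0]

def make_scale(x):
    s = ~f2b(x) & 0xFFFFFFFF  # sfpnot: flip sign, exponent, and mantissa
    s &= 0xFF800000           # sfpsetman: clear the mantissa
    return b2f(s) * -0.5      # sfpmuli: fix the sign, drop raw exponent to 254-x.Exp
```

The edge cases behave as described: `make_scale(0.0)` gives +infinity and `make_scale(float('inf'))` gives zero, while a normal input such as 3.0 yields 0.5, the power of two that reconstructs 1/3 from 1/1.5.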
The SFPI code is as follows:
```
// fpminimax(1/x, [|0, 1, 2|], [|24, 24, 24|], [1; 2], relative);
// vConstFloatPrgm0 = 0.3232325017452239990234375
// vConstFloatPrgm1 = 1.4545459747314453125
// vConstFloatPrgm2 = 2.121212482452392578125
vFloat _reciprocal_(const vFloat x) {
    vInt scale_bits = ~reinterpret<vUInt>(x);
    vFloat negative_x = setman(vConstNeg1, reinterpret<vInt>(x));
    vFloat y = vConstFloatPrgm1 + vConstFloatPrgm0 * negative_x;
    vFloat scale = setman(reinterpret<vFloat>(scale_bits), 0);
    y = vConstFloatPrgm2 + y * negative_x;
    scale *= -0.5f;
    // First iteration of Newton-Raphson: t = 1.0 - x*y; y = y + y*t.
    vFloat t = vConst1 + negative_x * y;
    y = y + y * t;
    // Second iteration of Newton-Raphson: t = 1.0 - x*y; y = y + y*t.
    t = vConst1 + negative_x * y;
    y = y + y * t;
    y = y * scale;
    return y;
}
```

This compiles to the following 16-cycle assembly sequence:
```
; L1 = x
sfpnot    L2,L1,0         ; scale = ~x
sfpsetman L1,L11,0,0      ; (neg_x = x).{Sign,Exp} = (-1.0).{Sign,Exp}
sfpmad    L0,L12,L1,L13,0 ; y = k0 * neg_x + k1
sfpsetman L2,L2,0,1       ; scale.Man = 0
sfpmad    L0,L0,L1,L14,0  ; y = y * neg_x + k2
sfpmuli   L2,48896,0      ; scale *= -0.5
sfpmad    L3,L1,L0,L10,0  ; t = y * neg_x + 1.0
sfpnop
sfpmad    L0,L0,L3,L0,0   ; y = y * t + y
sfpnop
sfpmad    L1,L1,L0,L10,0  ; t = y * neg_x + 1.0
sfpnop
sfpmad    L0,L0,L1,L0,0   ; y = y * t + y
sfpnop
sfpmul    L0,L0,L2,L9,0   ; y = y * scale
sfpnop
```

For bf16 (8 bits of mantissa), it's actually more efficient not to use Newton-Raphson at all, and to rely only on a cubic polynomial.
By restricting one coefficient (the constant term of the cubic) to 11
bits of precision, we ensure it
fits within a single 16-bit immediate SFPLOADI.
The resulting optimal coefficient (2.828125) needs only 8
bits of precision, allowing us to encode it as bf16.
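We can confirm in scalar code that 2.828125 is exactly representable in bf16, and that its bf16 encoding matches the sfploadi immediate 16437 used in the assembly below (`bf16_bits` is our helper, not an SFPI primitive):

```python
import struct

def bf16_bits(x):
    """Return the bf16 encoding of x, asserting x is exact in bf16."""
    b = struct.unpack('<I', struct.pack('<f', x))[0]
    assert b & 0xFFFF == 0, "not exactly representable in bf16"
    return b >> 16  # bf16 is the top 16 bits of fp32

assert bf16_bits(2.828125) == 0x4035  # == 16437, the sfploadi immediate
```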
We use the same range reduction/result reconstruction method as for fp32.
```
// fpminimax(1/x, [|0, 1, 2, 3|], [|11, 24, 24, 24|], [1; 2], relative);
// vConstFloatPrgm0 = 0.22174917161464691162109375
// vConstFloatPrgm1 = 1.330615520477294921875
// vConstFloatPrgm2 = 2.93872928619384765625
// k3 = 2.828125
vFloat _reciprocal_(const vFloat in) {
    vInt scale_bits = ~reinterpret<vUInt>(in);
    vFloat negative_x = setman(vConstNeg1, reinterpret<vInt>(in));
    vFloat y = vConstFloatPrgm1 + vConstFloatPrgm0 * negative_x;
    vFloat scale = setman(reinterpret<vFloat>(scale_bits), 0);
    y = vConstFloatPrgm2 + y * negative_x;
    vFloat k3 = 2.828125; // bf16
    y = k3 + y * negative_x;
    scale *= -0.5f;
    y = y * scale;
    return y;
}
```

This gives the following 11-cycle assembly sequence:
```
; L1 = x
sfpnot    L2,L1,0         ; scale = ~x
sfpsetman L1,L11,0,0      ; (neg_x = x).{Sign,Exp} = (-1.0).{Sign,Exp}
sfpmad    L0,L12,L1,L13,0 ; y = k0 * neg_x + k1
sfpsetman L2,L2,0,1       ; scale.Man = 0
sfpmad    L0,L0,L1,L14,0  ; y = y * neg_x + k2
sfploadi  L3,16437,0      ; k3 = 2.828125
sfpmad    L0,L0,L1,L3,0   ; y = y * neg_x + k3
sfpmuli   L2,48896,0      ; scale *= -0.5
sfpnop
sfpmul    L0,L0,L2,L9,0   ; y = y * scale
sfpnop
```

It's possible to save another cycle if the constant k3
can be initialised elsewhere. Strictly speaking, we have four
configurable constants, but one constant is reserved by the compiler for
-1.0.
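The cubic's accuracy on [1, 2] can be spot-checked in scalar code (plain Python, evaluating the same Horner scheme with neg_x = -t); the maximum relative error comes out around 1.7e-3, i.e. just over 9 correct bits, matching the cubic row of the table above:

```python
# Coefficients from the fpminimax comment in the bf16 routine
K0 = 0.22174917161464691162109375
K1 = 1.330615520477294921875
K2 = 2.93872928619384765625
K3 = 2.828125

def cubic(t):
    """Evaluate the minimax cubic for 1/t via the same Horner scheme."""
    nx = -t
    y = K1 + K0 * nx
    y = K2 + y * nx
    return K3 + y * nx

# Relative error of cubic(t) against 1/t, i.e. |cubic(t)*t - 1|
max_rel_err = max(abs(cubic(1.0 + i / 4096.0) * (1.0 + i / 4096.0) - 1.0)
                  for i in range(4097))
assert max_rel_err < 2.0 ** -9  # roughly 9 correct bits
```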
Typically we would also need to round the result to bf16 before
storing it, via SFPSTOCHRND,
achieving correctly-rounded results.
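As an end-to-end cross-check, here is a scalar Python emulation of the fp32 routine above. This is an approximation of the hardware, not a definitive model: we assume each SFPMAD behaves as a fused multiply-add rounded once to fp32 (emulated via double-precision arithmetic), and all helper names are ours.

```python
import struct

def f2b(x):
    return struct.unpack('<I', struct.pack('<f', x))[0]

def b2f(b):
    return struct.unpack('<f', struct.pack('<I', b))[0]

def f32(x):
    """Round a Python float (double) to fp32."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

# Quadratic minimax coefficients from the fp32 routine's fpminimax comment
K0 = 0.3232325017452239990234375
K1 = 1.4545459747314453125
K2 = 2.121212482452392578125

def reciprocal(x):
    nx = b2f((f2b(-1.0) & 0xFF800000) | (f2b(x) & 0x007FFFFF))  # sfpsetman
    scale = b2f(~f2b(x) & 0xFF800000)                           # sfpnot + sfpsetman
    scale = f32(scale * -0.5)                                   # sfpmuli
    y = f32(K1 + K0 * nx)     # quadratic minimax seed
    y = f32(K2 + y * nx)
    t = f32(1.0 + nx * y)     # Newton-Raphson iteration 1
    y = f32(y + y * t)
    t = f32(1.0 + nx * y)     # Newton-Raphson iteration 2
    y = f32(y + y * t)
    return f32(y * scale)

def ulp_dist(a, b):
    return abs(f2b(a) - f2b(b))

# Faithful rounding: within 1 ulp of the correctly-rounded reciprocal
for x in [1.0, 1.5, 3.0, 0.1, 7.0, 1234.5678, 1e-20, 1e20]:
    assert ulp_dist(reciprocal(f32(x)), f32(1.0 / f32(x))) <= 1
assert reciprocal(0.0) == float('inf')
assert reciprocal(float('inf')) == 0.0
```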
Thanks to Tenstorrent for sponsoring this work.