Reciprocal on Tenstorrent Wormhole

4th March, 2026

Introduction

Below we optimise reciprocal(x) for Tenstorrent’s Wormhole AI accelerator, achieving latencies of 16 cycles with faithful rounding for fp32 and 11 cycles with correct rounding for bf16.

Method

The standard approach is to normalise the input value x so that it lies within a particular range, and then use a minimax polynomial function that minimises the maximum relative error over that range.

Once we have this initial approximation, we apply Newton-Raphson, which approximately doubles the precision with each iteration.
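The iteration can be sketched in scalar Python (illustrative names, not SFPI): given y ≈ 1/x, each step computes t = 1 - x·y, the relative error of y, and then y ← y + y·t, which squares that error.

```python
# Newton-Raphson for 1/x: given y ~ 1/x, compute t = 1 - x*y (the relative
# error of y), then y += y*t. The error is squared on every iteration.
def refine(x, y):
    t = 1.0 - x * y
    return y + y * t

x, y = 1.7, 0.6          # crude seed: only a few correct bits
for _ in range(3):
    y = refine(x, y)
print(abs(y * x - 1.0))  # error shrinks quadratically, to ~1e-14 here
```

Since x·y′ = (1 - t)(1 + t) = 1 - t², the error after each pass is the square of the error before it, which is why the table counts each iteration as roughly doubling the number of correct bits.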

For optimisation purposes, we’re interested in minimising the number of SFPMADs required to achieve the desired precision, which is given as an approximate number of correct bits in the following table.

Approximation Seed precision N-R iterations Result precision SFPMAD count
Linear 3 0 3 1
Linear 3 1 6 3
Linear 3 2 12 5
Linear 3 3 24 7
Quadratic 6 0 6 2
Quadratic 6 1 12 4
Quadratic 6 2 24 6
Cubic 9 0 9 3

fp32

For fp32 (24 bits mantissa), it’s most efficient to use a quadratic initial approximation with two Newton-Raphson iterations, which achieves faithfully-rounded results.

Range Reduction

First, we need to scale the input value to the range [1.0, 2.0). This is easily done by setting the raw exponent to 127 and the sign to zero (positive). However, since Wormhole lacks NEGATE_VA and NEGATE_VC options for SFPMAD, it’s more efficient to negate the input value at the same time, scaling it to the range (-2.0, -1.0].

This is achieved in a single cycle via SFPSETMAN to combine the sign and exponent from the constant -1.0 (predefined by the compiler to be in register LReg[11]) with the mantissa from x.
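In bit-level terms, this step looks as follows (a scalar Python sketch; the helper names f2u, u2f, and setman_from are ours, not SFPI):

```python
import struct

def f2u(f): return struct.unpack('<I', struct.pack('<f', f))[0]
def u2f(u): return struct.unpack('<f', struct.pack('<I', u))[0]

def setman_from(base, x):
    # Emulates SFPSETMAN: sign and exponent from base, mantissa from x.
    return u2f((f2u(base) & 0xFF800000) | (f2u(x) & 0x007FFFFF))

print(setman_from(-1.0, 6.5))   # -1.625: 6.5 = 1.625 * 2^2, mapped into (-2.0, -1.0]
```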

Result Reconstruction

Observe that 255-in.Exp = (~in.Exp) & 0xff.

Hence scale.{Exp=~in.Exp, Man=0} gives us the correct (negated) scale factor for inputs x=±0 and x=±infinity.

We can compute this using two operations, SFPNOT of the input value followed by SFPSETMAN to clear the mantissa. These two operations can be interleaved with the SFPMAD instructions to hide the SFPNOPs that would otherwise be required.

Finally, observe that scale *= 0.5 leaves scale=±0 and scale=±infinity unchanged, and otherwise gives us our desired scale factor with raw exponent 254-in.Exp, albeit with an inverted sign; this is solved by using scale *= -0.5 instead. It can be computed using a floating-point multiply with a bfloat16 immediate, SFPMULI, again interleaved with the SFPMAD instructions.

scale = ~in   # scale.{Sign,Exp,Man} = {!x.Sign,255-x.Exp,~Man} via sfpnot
scale.Man = 0
scale *= -0.5 # scale.{Sign,Exp} = {x.Sign,scale.Exp-1 = 254-x.Exp if finite and non-zero}
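The reconstruction can be checked bit-for-bit in Python (helper names are ours):

```python
import struct

def f2u(f): return struct.unpack('<I', struct.pack('<f', f))[0]
def u2f(u): return struct.unpack('<f', struct.pack('<I', u))[0]

def scale_factor(x):
    bits = ~f2u(x) & 0xFFFFFFFF   # SFPNOT: scale = ~x
    bits &= 0xFF800000            # SFPSETMAN: scale.Man = 0
    return u2f(bits) * -0.5       # SFPMULI: raw exponent becomes 254 - x.Exp

print(scale_factor(6.5))          # 0.25: indeed 1/6.5 == (1/1.625) * 0.25
print(scale_factor(0.0))          # inf, as required for 1/0
```

The ±0 and ±infinity cases fall out of ordinary IEEE multiplication: the negated-infinity scale for x=±0 becomes ±infinity, and the negated-zero scale for x=±infinity becomes ±0.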

Signed Zeros, Denormals, and NaN

SFPMAD flushes -0.0 to +0.0 on Wormhole, and so does SFPSTOCHRND. If we want to preserve signed zeros, we can use SFPSETSGN at the cost of an extra cycle.

We ignore denormals as these aren’t supported by Tenstorrent and tend to be flushed to zero.

Note that NaN is not preserved with this approach and will result in zero.

Code

The SFPI code is as follows:

// fpminimax(1/x, [|0, 1, 2|], [|24, 24, 24|], [1; 2], relative);
// vConstFloatPrgm0 = 0.3232325017452239990234375
// vConstFloatPrgm1 = 1.4545459747314453125
// vConstFloatPrgm2 = 2.121212482452392578125
vFloat _reciprocal_(const vFloat x) {

    vInt scale_bits = ~reinterpret<vUInt>(x);
    vFloat negative_x = setman(vConstNeg1, reinterpret<vInt>(x));

    vFloat y = vConstFloatPrgm1 + vConstFloatPrgm0 * negative_x;
    vFloat scale = setman(reinterpret<vFloat>(scale_bits), 0);
    y = vConstFloatPrgm2 + y * negative_x;
    scale *= -0.5f;

    // First iteration of Newton-Raphson: t = 1.0 - x*y; y = y + y*t.
    vFloat t = vConst1 + negative_x * y;
    y = y + y * t;

    // Second iteration of Newton-Raphson: t = 1.0 - x*y; y = y + y*t.
    t = vConst1 + negative_x * y;
    y = y + y * t;

    y = y * scale;

    return y;
}

This compiles to the following 16-cycle assembly sequence:

; L1 = x
sfpnot L2,L1,0         ; scale = ~x
sfpsetman L1,L11,0,0   ; (neg_x = x).{Sign,Exp} = (-1.0).{Sign,Exp}
sfpmad L0,L12,L1,L13,0 ; y = k0 * neg_x + k1
sfpsetman L2,L2,0,1    ; scale.Man = 0
sfpmad L0,L0,L1,L14,0  ; y = y * neg_x + k2
sfpmuli L2,48896,0     ; scale *= -0.5
sfpmad L3,L1,L0,L10,0  ; t = y * neg_x + 1.0
sfpnop
sfpmad L0,L0,L3,L0,0   ; y = y * t + y
sfpnop
sfpmad L1,L1,L0,L10,0  ; t = y * neg_x + 1.0
sfpnop
sfpmad L0,L0,L1,L0,0   ; y = y * t + y
sfpnop
sfpmul L0,L0,L2,L9,0   ; y = y * scale
sfpnop
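As a cross-check, the whole routine can be emulated in scalar Python (function and helper names are ours). The arithmetic below is double precision rather than the hardware's per-operation fp32 rounding, so the final ulps differ, but it shows the seed plus two iterations landing near 2^-24 relative error:

```python
import struct

def f2u(f): return struct.unpack('<I', struct.pack('<f', f))[0]
def u2f(u): return struct.unpack('<f', struct.pack('<I', u))[0]

K0 = 0.3232325017452239990234375
K1 = 1.4545459747314453125
K2 = 2.121212482452392578125

def reciprocal(x):
    # Range reduction: mantissa of x with the sign/exponent of -1.0.
    neg_x = u2f((f2u(-1.0) & 0xFF800000) | (f2u(x) & 0x007FFFFF))
    # Result reconstruction: ~x, mantissa cleared, times -0.5.
    scale = u2f(~f2u(x) & 0xFF800000) * -0.5
    # Quadratic minimax seed, ~6 correct bits.
    y = K1 + K0 * neg_x
    y = K2 + y * neg_x
    # Two Newton-Raphson iterations: ~12 then ~24 correct bits.
    for _ in range(2):
        t = 1.0 + neg_x * y
        y = y + y * t
    return y * scale
```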

bf16

For bf16 (8 bits mantissa), it’s actually more efficient not to use Newton-Raphson at all, and rely only on a cubic polynomial.

By restricting the degree-0 coefficient to 11 bits of precision, we ensure it fits within a single 16-bit SFPLOADI immediate (as fp16). The resulting optimal coefficient (2.828125) in fact needs only 8 bits of precision, allowing us to encode it as bf16.
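The encoding is easy to verify in Python: the bf16 pattern of 2.828125 is exactly the upper half of its fp32 bits, and equals the immediate 16437 (0x4035) used in the listing.

```python
import struct

# 2.828125 = 181/64 = 1.0110101b * 2^1: seven fraction bits, so it is
# exactly representable in bf16 (8-bit significand).
bits = struct.unpack('<I', struct.pack('<f', 2.828125))[0]
assert bits & 0xFFFF == 0     # lower 16 bits are zero: no rounding needed
assert bits >> 16 == 16437    # 0x4035, the bf16/SFPLOADI encoding
```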

We use the same range reduction/result reconstruction method as for fp32.

Code

// fpminimax(1/x, [|0, 1, 2, 3|], [|11, 24, 24, 24|], [1; 2], relative);
// vConstFloatPrgm0 = 0.22174917161464691162109375
// vConstFloatPrgm1 = 1.330615520477294921875
// vConstFloatPrgm2 = 2.93872928619384765625
// k3 = 2.828125
vFloat _reciprocal_(const vFloat in) {

    vInt scale_bits = ~reinterpret<vUInt>(in);
    vFloat negative_x = setman(vConstNeg1, reinterpret<vInt>(in));

    vFloat y = vConstFloatPrgm1 + vConstFloatPrgm0 * negative_x;
    vFloat scale = setman(reinterpret<vFloat>(scale_bits), 0);
    y = vConstFloatPrgm2 + y * negative_x;
    vFloat k3 = 2.828125; // bf16
    y = k3 + y * negative_x;
    scale *= -0.5f;
    y = y * scale;

    return y;
}

This gives the following 11-cycle assembly sequence:

; L1 = x
sfpnot L2,L1,0         ; scale = ~x
sfpsetman L1,L11,0,0   ; (neg_x = x).{Sign,Exp} = (-1.0).{Sign,Exp}

sfpmad L0,L12,L1,L13,0 ; y = k0 * neg_x + k1
sfpsetman L2,L2,0,1    ; scale.Man = 0

sfpmad L0,L0,L1,L14,0  ; y = y * neg_x + k2
sfploadi L3,16437,0    ; k3 = 2.828125

sfpmad L0,L0,L1,L3,0   ; y = y * neg_x + k3
sfpmuli L2,48896,0     ; scale *= -0.5
sfpnop

sfpmul L0,L0,L2,L9,0   ; y = y * scale
sfpnop

It’s possible to save another cycle if the constant k3 can be initialised elsewhere. Strictly speaking, we have four configurable constants, but one constant is reserved by the compiler for -1.0.

Typically we would also need to round the result to bf16 before storing it, via SFPSTOCHRND, achieving correctly-rounded results.
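That final rounding can be modelled in scalar Python as round-to-nearest-even truncation to the top 16 bits (assuming SFPSTOCHRND's deterministic rounding mode; overflow and NaN handling omitted):

```python
import struct

def bf16_round(f):
    # Keep the top 16 bits of the fp32 pattern, rounding the discarded
    # half to nearest, ties to even. Overflow/NaN handling omitted.
    u = struct.unpack('<I', struct.pack('<f', f))[0]
    u = u + 0x7FFF + ((u >> 16) & 1)
    return struct.unpack('<f', struct.pack('<I', u & 0xFFFF0000))[0]

print(bf16_round(1.0 / 3.0))   # 0.333984375, the nearest bf16 to 1/3
```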

Acknowledgements

Thanks to Tenstorrent for sponsoring this work.