23rd February, 2026
The “standard” recipe for high-precision sin(x) and
cos(x) is as follows:

1. Reduce x to the range [-π/4, π/4] via k = round(x / (π/2)); a = x - k * π/2. Use Cody-Waite argument reduction to avoid subtractive cancellation for x up to some cutoff value, then switch to the more expensive Payne-Hanek method for large x.
2. Compute the quadrant as k % 4.
3. Evaluate sin(a) or cos(a) depending on the quadrant (and negate depending on the quadrant). In fact, both sin(a) and cos(a) can be evaluated simultaneously, then swapped/negated by quadrant, which is particularly useful if both are needed at once.

However, since Tenstorrent’s vector
engine is SIMD, there’s no conditional branching. This means that
even if we only need one of sin(a) or cos(a),
we always end up evaluating both, which is inefficient.
Similarly, for argument reduction it makes no sense to evaluate
both methods and then select between them. First we look at argument reduction, followed by
methods for evaluating sin(a) and cos(a) via a
single polynomial instead of two.
In order to reduce x to some range
[-P/2, P/2], the naive approach is:
k = round(x / P)
a = x - k * P
This suffers from subtractive cancellation when x and
k * P are close, because low bits are lost in the
subtraction.
By splitting the constant into multiple parts, we preserve low bits at each stage.
For example, we can split π into four constants:
P0 = 0x1.92p+1f; // 0b01000000010010010000000000000000
P1 = 0x1.fbp-11f; // 0b00111010011111011000000000000000
P2 = 0x1.51p-21f; // 0b00110101001010001000000000000000
P3 = 0x1.0b4612p-33f; // 0b00101111000001011010001100001001

Notice how the low 15 bits are zero for P0,
P1, and P2. Incidentally, P0 and
P1 are representable in bf16 and fp16 respectively,
allowing them to be loaded with a single SFPLOADI.
Then compute:
a = x - k * P0;
a = a - k * P1;
a = a - k * P2;
a = a - k * P3;

To reduce by π/2, we adjust exponents to obtain split
constants:
P0 = 0x1.92p+0f; // 0b00111111110010010000000000000000
P1 = 0x1.fbp-12f; // 0b00111001111111011000000000000000
P2 = 0x1.51p-22f; // 0b00110100101010001000000000000000
P3 = 0x1.0b4612p-34f; // 0b00101110100001011010001100001001

For rounding, Tenstorrent has SFPSTOCHRND,
but this only works for |x| ≤ 32767. In order to support
larger |x| efficiently, we can use the following trick:
# compute x/P, then shift fractional mantissa bits to round-to-nearest-even.
k = x * (1/P) + (1.5 * 2.0**23)
# shift mantissa bits back
k -= (1.5 * 2.0**23)

This works for |x/P| < 2^23. Since we already need to
compute x/P, we effectively fold the addition into the FMA
for free.
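As a scalar illustration (plain C++ float standing in for an SFPU lane; the function name is ours, not from the kernel), the magic-constant trick behaves like round-to-nearest-even:

```cpp
#include <cmath>

// Scalar sketch of the magic-constant rounding trick: adding 1.5 * 2^23
// pushes the fractional bits off the bottom of the mantissa, so the
// hardware's round-to-nearest-even mode performs the rounding for us.
// Valid for |x| < 2^23.
float round_magic(float x) {
    const float magic = 1.5f * 8388608.0f; // 1.5 * 2^23
    float k = x + magic;                   // fraction bits rounded away here
    return k - magic;                      // shift the integer back down
}
```

When the division by P is needed anyway, writing k = x * (1/P) + magic folds the addition into the FMA at no extra cost.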
For very large arguments, the classic approach is to switch to Payne-Hanek reduction.
Payne-Hanek works by representing 1/P at very high
precision (typically as a multiword fixed-point constant), multiplying
x by that constant to obtain a high-precision quotient,
extracting the nearest integer k, and then reconstructing
the remainder with a compensated subtraction such as
a = x - k * P using split constants. The key advantage is
that the quotient and remainder are computed with enough guard bits that
argument reduction remains accurate even when |x| is
huge.
Although tempting, Payne-Hanek is almost certainly more expensive to compute than Cody-Waite here, so we intentionally stick with four-stage Cody-Waite reduction, which gives good throughput across a fairly wide range of inputs.
In order to use a single polynomial instead of two, first we reduce
modulo π to get a ∈ [-π/2, π/2] using the above four-stage
Cody-Waite.
Next, find a suitable polynomial via Sollya. sin(a) is odd,
so we constrain the basis to odd powers only.
fpminimax(sin(x), [|1,3,5,7,9|], [|single...|], [2^-60; pi/2], relative);
// Horner form (s = x^2):
// x * (1 + s * (C0 + s * (C1 + s * (C2 + s * C3))))
// C0 = -0x1.55554cp-3
// C1 = 0x1.110edap-7
// C2 = -0x1.9f70fp-13
// C3 = 0x1.5dc908p-19
Map the odd polynomial to Horner form in s = a^2:
r = C3;
r = r * s + C2;
r = r * s + C1;
r = r * s + C0;
a3 = a * s;
r = r * a3 + a;

Finally, the result needs to be negated when k is odd.
An efficient way to do this is to compute
r ^ (k << 31), with the shift operation hidden in one
of the SFPNOP slots.
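Putting the pieces together — magic-constant rounding, four-stage Cody-Waite reduction modulo π, the odd polynomial, and the parity sign flip — here is a scalar sketch in plain C++ (float arithmetic standing in for the SFPU; this is an illustration, not the actual kernel):

```cpp
#include <cmath>
#include <cstdint>

// Scalar sketch of the sin(x) pipeline described above.
float sin_sketch(float x) {
    const float magic = 1.5f * 8388608.0f;   // 1.5 * 2^23
    // k = round(x / pi), with the add folded into the multiply-add.
    float k = x * 0x1.45f306p-2f + magic;    // 0x1.45f306p-2f ~= 1/pi
    k -= magic;
    // Four-stage Cody-Waite reduction modulo pi: a = x - k*pi.
    float a = x - k * 0x1.92p+1f;
    a -= k * 0x1.fbp-11f;
    a -= k * 0x1.51p-21f;
    a -= k * 0x1.0b4612p-33f;
    // Odd minimax polynomial on [-pi/2, pi/2], Horner form in s = a^2.
    float s = a * a;
    float r = 0x1.5dc908p-19f;
    r = r * s - 0x1.9f70fp-13f;
    r = r * s + 0x1.110edap-7f;
    r = r * s - 0x1.55554cp-3f;
    r = r * (a * s) + a;
    // Negate when k is odd (the kernel XORs k << 31 into the sign bit).
    return ((int32_t)k & 1) ? -r : r;
}
```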
This yields a maximum ulp error below 2 for
x ∈ [0, π]:

The four-stage Cody-Waite argument reduction keeps maximum ulp error below 3 up to around 100,000:

As x → π/2, cos(x) approaches zero. This
means we are subtracting a value that approaches 1 from
1 when evaluating an approximation for cos(x),
leading to subtractive cancellation and ULP error spikes in the
immediate vicinity of π/2.
The trick is to reuse the sin(x) polynomial instead and
transform cosine into sine. We could use
cos(x) = sin(π/2 - x), but the subtraction would need care
to avoid cancellation.
Instead, we use cos(x) = sin(x + π/2) with the index
transform i = round(x/π + 0.5); j = 2*i - 1.
Then a = x - j * π/2 and we evaluate the same sine
polynomial on a, with sign selected
from index parity.
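The same transform in scalar form (plain C++, names ours) reuses the sine polynomial unchanged:

```cpp
#include <cmath>
#include <cstdint>

// Scalar sketch of cos(x) = sin(x + pi/2) via the index transform
// i = round(x/pi + 0.5), j = 2*i - 1, a = x - j*pi/2.
float cos_sketch(float x) {
    const float magic = 1.5f * 8388608.0f;   // 1.5 * 2^23
    float i = x * 0x1.45f306p-2f + 0.5f;     // x/pi + 0.5
    i += magic;
    i -= magic;                              // i = round(x/pi + 0.5)
    float j = 2.0f * i - 1.0f;
    // Four-stage Cody-Waite reduction: a = x - j*(pi/2).
    float a = x - j * 0x1.92p+0f;
    a -= j * 0x1.fbp-12f;
    a -= j * 0x1.51p-22f;
    a -= j * 0x1.0b4612p-34f;
    // Same odd sine polynomial as before.
    float s = a * a;
    float r = 0x1.5dc908p-19f;
    r = r * s - 0x1.9f70fp-13f;
    r = r * s + 0x1.110edap-7f;
    r = r * s - 0x1.55554cp-3f;
    r = r * (a * s) + a;
    // cos(x) = (-1)^i * sin(a): sign comes from the parity of i.
    return ((int32_t)i & 1) ? -r : r;
}
```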
This keeps maximum ulp error below 2 for x ∈ [0, π]:

The four-stage Cody-Waite argument reduction keeps maximum ulp error below 3 up to around 50,000:

In contrast to sin(x) and cos(x), we use
the standard treatment for tan(x).
tan(x) has poles at odd multiples of π/2,
so fitting one polynomial on [-π/2, π/2] is a bad idea.
Instead:
1. Reduce modulo π/2 to a ∈ [-π/4, π/4].
2. Fit a minimax polynomial for tan(a) on [-π/4, π/4].
3. For odd quadrants, use the identity tan(x) = -1/tan(a).

fpminimax((tan(sqrt(x))/sqrt(x)-1)/x, [|0,1,2,3,4,5,6|], [|single...|], [2^-120; (pi/4)^2], relative);
// Horner form (s = a^2):
// tan(a) = a + a^3 * (C0 + s * (C1 + s * (C2 + s * (C3 + s * (C4 + s * (C5 + s * C6))))))
// C0 = 0x1.555556p-2
// C1 = 0x1.111072p-3
// C2 = 0x1.ba5716p-5
// C3 = 0x1.620abcp-6
// C4 = 0x1.4787dp-7
// C5 = 0x1.2b404p-10
// C6 = 0x1.fa9f82p-9
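A scalar sketch of the whole path — reduction modulo π/2, the odd polynomial, and a plain division standing in for the hardware reciprocal discussed below — in plain C++ (names ours):

```cpp
#include <cmath>
#include <cstdint>

// Scalar sketch of tan(x): reduce modulo pi/2, evaluate the odd minimax
// polynomial on [-pi/4, pi/4], and apply tan(x) = -1/tan(a) for odd k.
float tan_sketch(float x) {
    const float magic = 1.5f * 8388608.0f;   // 1.5 * 2^23
    float k = x * 0x1.45f306p-1f + magic;    // 0x1.45f306p-1f ~= 2/pi
    k -= magic;                              // k = round(x / (pi/2))
    // Four-stage Cody-Waite reduction: a = x - k*(pi/2).
    float a = x - k * 0x1.92p+0f;
    a -= k * 0x1.fbp-12f;
    a -= k * 0x1.51p-22f;
    a -= k * 0x1.0b4612p-34f;
    // tan(a) = a + a^3 * P(a^2), Horner form in s = a^2.
    float s = a * a;
    float t = 0x1.fa9f82p-9f;
    t = t * s + 0x1.2b404p-10f;
    t = t * s + 0x1.4787dp-7f;
    t = t * s + 0x1.620abcp-6f;
    t = t * s + 0x1.ba5716p-5f;
    t = t * s + 0x1.111072p-3f;
    t = t * s + 0x1.555556p-2f;
    float r = (t * s) * a + a;
    // Odd quadrant: tan(x) = -1/tan(a). Plain division stands in for
    // the SFPU reciprocal sequence.
    return ((int32_t)k & 1) ? -1.0f / r : r;
}
```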
This keeps maximum ulp error below 2 for x ∈ [0, π]:

The four-stage Cody-Waite argument reduction keeps maximum ulp error below 3 up to around 50,000:

Next we cover the specifics of implementing reciprocal on Blackhole and Wormhole.
Blackhole has a dedicated approximate reciprocal instruction, SFPARECIP,
so the odd-quadrant path is straightforward:
1. Compute r = tan(a).
2. Obtain an initial estimate t ≈ 1/r via SFPARECIP.
3. Refine with a Newton-Raphson step: t = t * (2 - r*t).

One issue is that the reciprocal amplifies tiny errors near the poles. We use a compensated reconstruction term to reduce this effect:
s = -r + a;
s = t * a + s;

Then we reconstruct the final value from the refined reciprocal and residual:
r = r * t + 1.0;
r = s * t + r;
r = r * t + t;

The full code for tan(a) on the reduced argument:
// tan(a) for a in [-π/4, π/4]
// i is the quadrant integer k left-shifted by 31, moving the odd/even parity
// bit into the sign bit, allowing an efficient `v_if(i < 0)` sign check to
// apply the reciprocal for odd quadrants.
template <>
sfpi_inline vFloat sfpu_tan<true>(vFloat a, vInt i) {
    vFloat s = a * a;
    vFloat t = 0x1.fa9f82p-9f;
    t = t * s + 0x1.2b404p-10f;
    t = t * s + 0x1.4787dp-7f;
    t = t * s + 0x1.620abcp-6f;
    t = t * s + 0x1.ba5716p-5f;
    t = t * s + 0x1.111072p-3f;
    t = t * s + 0x1.555556p-2f;
    t = t * s;
    vFloat r = t * a + a;
    v_if(i < 0) {
        // Compensated residual for the reciprocal-correction branch.
        // This preserves precision when tan(x) is near its poles.
        s = vConstNeg1 * r + a;
        s = t * a + s;
        t = approx_recip(r);
        // Newton-Raphson refinement.
        // e = 1 - r*t, then t <- t*(1 + e) = t*(2 - r*t)
        vFloat e = -r * t + vConst1;
        // Negate to get t = -1/r.
        t = -t * e - t;
        // Reconstruct tan from corrected reciprocal terms.
        r = r * t + vConst1;
        r = s * t + r;
        r = r * t + t;
    }
    v_endif;
    return r;
}

Wormhole
doesn’t have a dedicated SFPARECIP
instruction.
Instead, we normalise r so it lies in the range
[1, 2) and use a quadratic minimax polynomial on this
range:
fpminimax(1/x, 2, [|single...|], [1; 2], relative);
// 2.121212482452392578125 + x * (-1.4545459747314453125 + x * 0.3232325017452239990234375)
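In scalar form, with frexp/ldexp standing in for the exponent-bit manipulation and assuming r > 0 for simplicity (the vector code tracks the sign via bit tricks), the scheme looks like:

```cpp
#include <cmath>

// Scalar model of the Wormhole reciprocal: normalise r to [1, 2),
// evaluate the quadratic minimax estimate, apply one Newton-Raphson
// step, then undo the scaling. Assumes r > 0; names are ours.
float recip_sketch(float r) {
    int e;
    float m = std::frexp(r, &e);   // r = m * 2^e with m in [0.5, 1)
    float x = 2.0f * m;            // x in [1, 2), so 1/r = (1/x) * 2^(1-e)
    float t = 2.121212482452392578125f
            + x * (-1.4545459747314453125f
                   + x * 0.3232325017452239990234375f);
    t = t * (2.0f - x * t);        // Newton-Raphson: squares the rel. error
    return std::ldexp(t, 1 - e);
}
```

One Newton-Raphson step takes the roughly 1% relative error of the quadratic estimate down to around 1e-4; the vector code applies the same single refinement.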
The full code for tan(a) on the reduced argument:
// tan(a) for a in [-π/4, π/4]
// i is the quadrant integer k left-shifted by 31, moving the odd/even parity
// bit into the sign bit, allowing an efficient `v_if(i < 0)` sign check to
// apply the reciprocal for odd quadrants.
sfpi_inline vFloat sfpu_tan(vFloat a, vInt i) {
    vFloat s = a * a;
    vFloat t = 0x1.fa9f82p-9f;
    t = t * s + 0x1.2b404p-10f;
    t = t * s + 0x1.4787dp-7f;
    t = t * s + 0x1.620abcp-6f;
    t = t * s + 0x1.ba5716p-5f;
    t = t * s + 0x1.111072p-3f;
    t = t * s + 0x1.555556p-2f;
    t = t * s;
    vFloat r = t * a + a;
    v_if(i < 0) {
        // Compensated residual for the reciprocal-correction branch.
        // This preserves precision when tan(x) is near its poles.
        s = vConstNeg1 * r + a;
        vFloat negative_x = setman(vConstNeg1, reinterpret<vInt>(r));
        s = t * a + s;
        // Approximate reciprocal of -r using quadratic initial estimate.
        const float k0 = 0.3232325017452239990234375f;
        const float k1 = 1.4545459747314453125f;
        const float k2 = 2.121212482452392578125f;
        vInt scale_bits = ~reinterpret<vUInt>(r);
        t = k1 + k0 * negative_x;
        vFloat scale = setman(reinterpret<vFloat>(scale_bits), 0);
        t = k2 + t * negative_x;
        scale *= 0.5f;
        // Newton-Raphson refinement.
        vFloat e = vConst1 + negative_x * t;
        t = t * e + t;
        t = t * scale;
        // Reconstruct tan from corrected reciprocal terms.
        r = r * t + vConst1;
        r = s * t + r;
        r = r * t + t;
    }
    v_endif;
    return r;
}

Adjusting the above approaches for bf16 is straightforward. We use lower-degree polynomials, since we don’t need as many bits of precision, and we drop the compensated reconstruction code for tan.
For sin(x), use a shorter polynomial:
sin(a) = a + a^3(C_0 + a^2(C_1 + a^2 C_2)).
fpminimax((sin(sqrt(x))/sqrt(x)-1)/x, [|0,1,2|], [|single...|], [2^-120; (pi/2)^2], relative);
// Horner form (s = a^2):
// sin(a) = a + a^3 * (C0 + s * (C1 + s * C2))
// C0 = -0x1.5554a4p-3
// C1 = 0x1.10c2a2p-7
// C2 = -0x1.8b10a4p-13
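Evaluating this fit in scalar code (name ours) shows it is comfortably within bf16 precision across the reduced range:

```cpp
#include <cmath>

// Scalar evaluation of the lower-degree bf16 sine polynomial:
// sin(a) = a + a^3 * (C0 + s*(C1 + s*C2)), s = a^2, a in [-pi/2, pi/2].
float sin_poly_bf16(float a) {
    float s = a * a;
    float r = -0x1.8b10a4p-13f;
    r = r * s + 0x1.10c2a2p-7f;
    r = r * s - 0x1.5554a4p-3f;
    return r * (a * s) + a;
}
```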

Cosine reuses the same reduced-range sine polynomial path, so bf16 cosine gets the same lower-degree benefit.

Similarly, use a shorter polynomial for tan(x) on
[-π/4, π/4]:
fpminimax((tan(sqrt(x))/sqrt(x)-1)/x, [|0,1,2|], [|single...|], [2^-120; (pi/4)^2], relative);
// Horner form (s = a^2):
// tan(a) = a + a^3 * (C0 + s * (C1 + s * C2))
// C0 = 0x1.55953p-2
// C1 = 0x1.02b98p-3
// C2 = 0x1.4f1f4ep-4
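Again in scalar form (name ours), the three-term tan fit stays within bf16-grade accuracy on the reduced range:

```cpp
#include <cmath>

// Scalar evaluation of the lower-degree bf16 tan polynomial:
// tan(a) = a + a^3 * (C0 + s*(C1 + s*C2)), s = a^2, a in [-pi/4, pi/4].
float tan_poly_bf16(float a) {
    float s = a * a;
    float t = 0x1.4f1f4ep-4f;
    t = t * s + 0x1.02b98p-3f;
    t = t * s + 0x1.55953p-2f;
    return t * (a * s) + a;
}
```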
