23rd February, 2026
The “standard” recipe for high-precision sin(x) and
cos(x) is as follows:

1. Reduce x to the range [-π/4, π/4] via k = round(x / (π/2)); a = x - k * π/2. Use Cody-Waite argument reduction to avoid subtractive cancellation for x up to some cutoff value, then switch to the more expensive Payne-Hanek method for large x.
2. Compute the quadrant as k % 4.
3. Evaluate sin(a) or cos(a) depending on the quadrant (and negate depending on the quadrant). In fact, both sin(a) and cos(a) can be evaluated simultaneously, then swapped/negated by quadrant, which is particularly useful if both are needed at once.

However, since Tenstorrent’s vector
engine is SIMD, there’s no conditional branching. This means that
even if we only need one of sin(a) or cos(a),
we always end up evaluating both, which is inefficient.
Similarly, for argument reduction it makes no sense to evaluate
both methods and then select between them. First we look at argument reduction, followed by
methods for evaluating sin(a) and cos(a) via a
single polynomial instead of two.
In order to reduce x to some range
[-P/2, P/2], the naive approach is:
k = round(x / P)
a = x - k * P
This suffers from subtractive cancellation when x and
k * P are close, because low bits are lost in the
subtraction.
By splitting the constant into multiple parts, we preserve low bits at each stage.
For example, we can split π into four constants:
P0 = 0x1.92p+1f; // 0b01000000010010010000000000000000
P1 = 0x1.fbp-11f; // 0b00111010011111011000000000000000
P2 = 0x1.51p-21f; // 0b00110101001010001000000000000000
P3 = 0x1.0b4612p-33f; // 0b00101111000001011010001100001001

Notice how the low 15 bits are zero for P0,
P1, and P2. Incidentally, P0 and
P1 are representable in bf16 and fp16 respectively,
allowing them to be loaded with a single SFPLOADI.
Then compute:
a = x - k * P0;
a = a - k * P1;
a = a - k * P2;
a = a - k * P3;

To reduce by π/2, we adjust exponents to obtain split
constants:
P0 = 0x1.92p+0f; // 0b00111111110010010000000000000000
P1 = 0x1.fbp-12f; // 0b00111001111111011000000000000000
P2 = 0x1.51p-22f; // 0b00110100101010001000000000000000
P3 = 0x1.0b4612p-34f; // 0b00101110100001011010001100001001

For rounding, Tenstorrent has SFPSTOCHRND,
but this only works for |x| ≤ 32767. In order to support
larger |x| efficiently, we can use the following trick:
# compute x/P, then shift fractional mantissa bits to round-to-nearest-even.
k = x * (1/P) + (1.5 * 2.0**23)
# shift mantissa bits back
k -= (1.5 * 2.0**23)

This works for |x/P| < 2^23. Since we already need to
compute x/P, we effectively fold the addition into the FMA
for free.
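As a scalar illustration (plain C++ float standing in for an SFPU lane; the function name is ours, not from the kernel), the magic-constant trick behaves like round-to-nearest-even:

```cpp
#include <cmath>

// Scalar sketch of the magic-constant rounding trick: adding 1.5 * 2^23
// pushes the fractional bits off the bottom of the mantissa, so the
// hardware's round-to-nearest-even mode performs the rounding for us.
// Valid for |x| < 2^23.
float round_magic(float x) {
    const float magic = 1.5f * 8388608.0f; // 1.5 * 2^23
    float k = x + magic;                   // fraction bits rounded away here
    return k - magic;                      // shift the integer back down
}
```

When the division by P is needed anyway, writing k = x * (1/P) + magic folds the addition into the FMA at no extra cost.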
For very large arguments, the classic approach is to switch to Payne-Hanek reduction.
Payne-Hanek works by representing 1/P at very high
precision (typically as a multiword fixed-point constant), multiplying
x by that constant to obtain a high-precision quotient,
extracting the nearest integer k, and then reconstructing
the remainder with a compensated subtraction such as
a = x - k * P using split constants. The key advantage is
that the quotient and remainder are computed with enough guard bits that
argument reduction remains accurate even when |x| is
huge.
Although tempting, Payne-Hanek is almost certainly more expensive to compute than Cody-Waite here, so we intentionally stick with four-stage Cody-Waite reduction, which gives good throughput across a fairly wide range of inputs.
In order to use a single polynomial instead of two, first we reduce
modulo π to get a ∈ [-π/2, π/2] using the above four-stage
Cody-Waite.
Next, find a suitable polynomial via Sollya. sin(a) is odd,
so we constrain the basis to odd powers only.
fpminimax(sin(x), [|1,3,5,7,9|], [|single...|], [2^-60; pi/2], relative);
// Horner form (s = x^2):
// x * (1 + s * (C0 + s * (C1 + s * (C2 + s * C3))))
// C0 = -0x1.55554cp-3
// C1 = 0x1.110edap-7
// C2 = -0x1.9f70fp-13
// C3 = 0x1.5dc908p-19
Map the odd polynomial to Horner form in s = a^2:
r = C3;
r = r * s + C2;
r = r * s + C1;
r = r * s + C0;
a3 = a * s;
r = r * a3 + a;

Finally, the result needs to be negated when k is odd.
An efficient way to do this is to compute
r ^ (k << 31), with the shift operation hidden in one
of the SFPNOP slots.
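Putting the pieces together — magic-constant rounding, four-stage Cody-Waite reduction modulo π, the odd polynomial, and the parity sign flip — here is a scalar sketch in plain C++ (float arithmetic standing in for the SFPU; this is an illustration, not the actual kernel):

```cpp
#include <cmath>
#include <cstdint>

// Scalar sketch of the sin(x) pipeline described above.
float sin_sketch(float x) {
    const float magic = 1.5f * 8388608.0f;   // 1.5 * 2^23
    // k = round(x / pi), with the add folded into the multiply-add.
    float k = x * 0x1.45f306p-2f + magic;    // 0x1.45f306p-2f ~= 1/pi
    k -= magic;
    // Four-stage Cody-Waite reduction modulo pi: a = x - k*pi.
    float a = x - k * 0x1.92p+1f;
    a -= k * 0x1.fbp-11f;
    a -= k * 0x1.51p-21f;
    a -= k * 0x1.0b4612p-33f;
    // Odd minimax polynomial on [-pi/2, pi/2], Horner form in s = a^2.
    float s = a * a;
    float r = 0x1.5dc908p-19f;
    r = r * s - 0x1.9f70fp-13f;
    r = r * s + 0x1.110edap-7f;
    r = r * s - 0x1.55554cp-3f;
    r = r * (a * s) + a;
    // Negate when k is odd (the kernel XORs k << 31 into the sign bit).
    return ((int32_t)k & 1) ? -r : r;
}
```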
This yields a maximum ulp error below 2 for
x ∈ [0, π]:

The four-stage Cody-Waite argument reduction keeps maximum ulp error below 3 up to around 100,000:

As x → π/2, cos(x) approaches zero. This
means we are subtracting a value that approaches 1 from
1 when evaluating an approximation for cos(x),
leading to subtractive cancellation and ULP error spikes in the
immediate vicinity of π/2.
The trick is to reuse the sin(x) polynomial instead and
transform cosine into sine. We could use
cos(x) = sin(π/2 - x), but the subtraction would need care
to avoid cancellation.
Instead, we use cos(x) = sin(x + π/2) with the index
transform i = round(x/π + 0.5); j = 2*i - 1.
Then a = x - j * π/2 and we evaluate the same sine
polynomial on a, with sign selected
from index parity.
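The same transform in scalar form (plain C++, names ours) reuses the sine polynomial unchanged:

```cpp
#include <cmath>
#include <cstdint>

// Scalar sketch of cos(x) = sin(x + pi/2) via the index transform
// i = round(x/pi + 0.5), j = 2*i - 1, a = x - j*pi/2.
float cos_sketch(float x) {
    const float magic = 1.5f * 8388608.0f;   // 1.5 * 2^23
    float i = x * 0x1.45f306p-2f + 0.5f;     // x/pi + 0.5
    i += magic;
    i -= magic;                              // i = round(x/pi + 0.5)
    float j = 2.0f * i - 1.0f;
    // Four-stage Cody-Waite reduction: a = x - j*(pi/2).
    float a = x - j * 0x1.92p+0f;
    a -= j * 0x1.fbp-12f;
    a -= j * 0x1.51p-22f;
    a -= j * 0x1.0b4612p-34f;
    // Same odd sine polynomial as before.
    float s = a * a;
    float r = 0x1.5dc908p-19f;
    r = r * s - 0x1.9f70fp-13f;
    r = r * s + 0x1.110edap-7f;
    r = r * s - 0x1.55554cp-3f;
    r = r * (a * s) + a;
    // cos(x) = (-1)^i * sin(a): sign comes from the parity of i.
    return ((int32_t)i & 1) ? -r : r;
}
```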
This keeps maximum ulp error below 2 for x ∈ [0, π]:

The four-stage Cody-Waite argument reduction keeps maximum ulp error below 3 up to around 50,000:

In contrast to sin(x) and cos(x), we use
the standard treatment for tan(x).
tan(x) has poles at odd multiples of π/2,
so fitting one polynomial on [-π/2, π/2] is a bad idea.
Instead:
1. Reduce modulo π/2 to a ∈ [-π/4, π/4].
2. Fit a minimax polynomial for tan(a) on [-π/4, π/4].
3. For odd quadrants, use the identity tan(x) = -1/tan(a).

fpminimax((tan(sqrt(x))/sqrt(x)-1)/x, [|0,1,2,3,4,5,6|], [|single...|], [2^-120; (pi/4)^2], relative);
// Horner form (s = a^2):
// tan(a) = a + a^3 * (C0 + s * (C1 + s * (C2 + s * (C3 + s * (C4 + s * (C5 + s * C6))))))
// C0 = 0x1.555556p-2
// C1 = 0x1.111072p-3
// C2 = 0x1.ba5716p-5
// C3 = 0x1.620abcp-6
// C4 = 0x1.4787dp-7
// C5 = 0x1.2b404p-10
// C6 = 0x1.fa9f82p-9
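A scalar sketch of the whole path — reduction modulo π/2, the odd polynomial, and a plain division standing in for the hardware reciprocal discussed below — in plain C++ (names ours):

```cpp
#include <cmath>
#include <cstdint>

// Scalar sketch of tan(x): reduce modulo pi/2, evaluate the odd minimax
// polynomial on [-pi/4, pi/4], and apply tan(x) = -1/tan(a) for odd k.
float tan_sketch(float x) {
    const float magic = 1.5f * 8388608.0f;   // 1.5 * 2^23
    float k = x * 0x1.45f306p-1f + magic;    // 0x1.45f306p-1f ~= 2/pi
    k -= magic;                              // k = round(x / (pi/2))
    // Four-stage Cody-Waite reduction: a = x - k*(pi/2).
    float a = x - k * 0x1.92p+0f;
    a -= k * 0x1.fbp-12f;
    a -= k * 0x1.51p-22f;
    a -= k * 0x1.0b4612p-34f;
    // tan(a) = a + a^3 * P(a^2), Horner form in s = a^2.
    float s = a * a;
    float t = 0x1.fa9f82p-9f;
    t = t * s + 0x1.2b404p-10f;
    t = t * s + 0x1.4787dp-7f;
    t = t * s + 0x1.620abcp-6f;
    t = t * s + 0x1.ba5716p-5f;
    t = t * s + 0x1.111072p-3f;
    t = t * s + 0x1.555556p-2f;
    float r = (t * s) * a + a;
    // Odd quadrant: tan(x) = -1/tan(a). Plain division stands in for
    // the SFPU reciprocal sequence.
    return ((int32_t)k & 1) ? -1.0f / r : r;
}
```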
This keeps maximum ulp error below 2 for x ∈ [0, π]:

The four-stage Cody-Waite argument reduction keeps maximum ulp error below 3 up to around 50,000:

Next we cover the specifics of implementing reciprocal on Blackhole and Wormhole.
Blackhole has a dedicated approximate reciprocal instruction, SFPARECIP,
so the odd-quadrant path is straightforward:
1. Compute r = tan(a).
2. Obtain an initial estimate t ≈ 1/r via SFPARECIP.
3. Refine with a Newton-Raphson step: t = t * (2 - r*t).

One issue is that the reciprocal amplifies tiny errors near the poles. We use a compensated reconstruction term to reduce this effect:
s = -r + a;
s = t * a + s;

Then we reconstruct the final value from the refined reciprocal and residual:
r = r * t + 1.0;
r = s * t + r;
r = r * t + t;

The full code for tan(a) on the reduced argument:
// tan(a) for a in [-π/4, π/4]
// i is the quadrant integer k left-shifted by 31, moving the odd/even parity
// bit into the sign bit, allowing an efficient `v_if(i < 0)` sign check to
// apply the reciprocal for odd quadrants.
template <>
sfpi_inline vFloat sfpu_tan<true>(vFloat a, vInt i) {
    vFloat s = a * a;
    vFloat t = 0x1.fa9f82p-9f;
    t = t * s + 0x1.2b404p-10f;
    t = t * s + 0x1.4787dp-7f;
    t = t * s + 0x1.620abcp-6f;
    t = t * s + 0x1.ba5716p-5f;
    t = t * s + 0x1.111072p-3f;
    t = t * s + 0x1.555556p-2f;
    t = t * s;
    vFloat r = t * a + a;
    v_if(i < 0) {
        // Compensated residual for the reciprocal-correction branch.
        // This preserves precision when tan(x) is near its poles.
        s = vConstNeg1 * r + a;
        s = t * a + s;
        t = approx_recip(r);
        // Newton-Raphson refinement.
        // e = 1 - r*t, then t <- t*(1 + e) = t*(2 - r*t)
        vFloat e = -r * t + vConst1;
        // Negate to get t = -1/r.
        t = -t * e - t;
        // Reconstruct tan from corrected reciprocal terms.
        r = r * t + vConst1;
        r = s * t + r;
        r = r * t + t;
    }
    v_endif;
    return r;
}

Wormhole
doesn’t have a dedicated SFPARECIP
instruction.
Instead, we normalise r so it lies in the range
[1, 2) and use a quadratic minimax polynomial on this
range:
fpminimax(1/x, 2, [|single...|], [1; 2], relative);
// 2.121212482452392578125 + x * (-1.4545459747314453125 + x * 0.3232325017452239990234375)
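In scalar form, with frexp/ldexp standing in for the exponent-bit manipulation and assuming r > 0 for simplicity (the vector code tracks the sign via bit tricks), the scheme looks like:

```cpp
#include <cmath>

// Scalar model of the Wormhole reciprocal: normalise r to [1, 2),
// evaluate the quadratic minimax estimate, apply one Newton-Raphson
// step, then undo the scaling. Assumes r > 0; names are ours.
float recip_sketch(float r) {
    int e;
    float m = std::frexp(r, &e);   // r = m * 2^e with m in [0.5, 1)
    float x = 2.0f * m;            // x in [1, 2), so 1/r = (1/x) * 2^(1-e)
    float t = 2.121212482452392578125f
            + x * (-1.4545459747314453125f
                   + x * 0.3232325017452239990234375f);
    t = t * (2.0f - x * t);        // Newton-Raphson: squares the rel. error
    return std::ldexp(t, 1 - e);
}
```

One Newton-Raphson step takes the roughly 1% relative error of the quadratic estimate down to around 1e-4; the vector code applies the same single refinement.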
The full code for tan(a) on the reduced argument:
// tan(a) for a in [-π/4, π/4]
// i is the quadrant integer k left-shifted by 31, moving the odd/even parity
// bit into the sign bit, allowing an efficient `v_if(i < 0)` sign check to
// apply the reciprocal for odd quadrants.
sfpi_inline vFloat sfpu_tan(vFloat a, vInt i) {
    vFloat s = a * a;
    vFloat t = 0x1.fa9f82p-9f;
    t = t * s + 0x1.2b404p-10f;
    t = t * s + 0x1.4787dp-7f;
    t = t * s + 0x1.620abcp-6f;
    t = t * s + 0x1.ba5716p-5f;
    t = t * s + 0x1.111072p-3f;
    t = t * s + 0x1.555556p-2f;
    t = t * s;
    vFloat r = t * a + a;
    v_if(i < 0) {
        // Compensated residual for the reciprocal-correction branch.
        // This preserves precision when tan(x) is near its poles.
        s = vConstNeg1 * r + a;
        vFloat negative_x = setman(vConstNeg1, reinterpret<vInt>(r));
        s = t * a + s;
        // Approximate reciprocal of -r using quadratic initial estimate.
        const float k0 = 0.3232325017452239990234375f;
        const float k1 = 1.4545459747314453125f;
        const float k2 = 2.121212482452392578125f;
        vInt scale_bits = ~reinterpret<vUInt>(r);
        t = k1 + k0 * negative_x;
        vFloat scale = setman(reinterpret<vFloat>(scale_bits), 0);
        t = k2 + t * negative_x;
        scale *= 0.5f;
        // Newton-Raphson refinement.
        vFloat e = vConst1 + negative_x * t;
        t = t * e + t;
        t = t * scale;
        // Reconstruct tan from corrected reciprocal terms.
        r = r * t + vConst1;
        r = s * t + r;
        r = r * t + t;
    }
    v_endif;
    return r;
}

Adjusting the above approaches for bf16 is straightforward. We use lower-degree polynomials, since we don’t need as many bits of precision, and we drop the compensated reconstruction code for tan.
For sin(x), use a shorter polynomial:
sin(a) = a + a^3(C_0 + a^2(C_1 + a^2 C_2)).
fpminimax((sin(sqrt(x))/sqrt(x)-1)/x, [|0,1,2|], [|single...|], [2^-120; (pi/2)^2], relative);
// Horner form (s = a^2):
// sin(a) = a + a^3 * (C0 + s * (C1 + s * C2))
// C0 = -0x1.5554a4p-3
// C1 = 0x1.10c2a2p-7
// C2 = -0x1.8b10a4p-13
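Evaluating this fit in scalar code (name ours) shows it is comfortably within bf16 precision across the reduced range:

```cpp
#include <cmath>

// Scalar evaluation of the lower-degree bf16 sine polynomial:
// sin(a) = a + a^3 * (C0 + s*(C1 + s*C2)), s = a^2, a in [-pi/2, pi/2].
float sin_poly_bf16(float a) {
    float s = a * a;
    float r = -0x1.8b10a4p-13f;
    r = r * s + 0x1.10c2a2p-7f;
    r = r * s - 0x1.5554a4p-3f;
    return r * (a * s) + a;
}
```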

Cosine reuses the same reduced-range sine polynomial path, so bf16 cosine gets the same lower-degree benefit.

Similarly, use a shorter polynomial for tan(x) on
[-π/4, π/4]:
fpminimax((tan(sqrt(x))/sqrt(x)-1)/x, [|0,1,2|], [|single...|], [2^-120; (pi/4)^2], relative);
// Horner form (s = a^2):
// tan(a) = a + a^3 * (C0 + s * (C1 + s * C2))
// C0 = 0x1.55953p-2
// C1 = 0x1.02b98p-3
// C2 = 0x1.4f1f4ep-4
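Again in scalar form (name ours), the three-term tan fit stays within bf16-grade accuracy on the reduced range:

```cpp
#include <cmath>

// Scalar evaluation of the lower-degree bf16 tan polynomial:
// tan(a) = a + a^3 * (C0 + s*(C1 + s*C2)), s = a^2, a in [-pi/4, pi/4].
float tan_poly_bf16(float a) {
    float s = a * a;
    float t = 0x1.4f1f4ep-4f;
    t = t * s + 0x1.02b98p-3f;
    t = t * s + 0x1.55953p-2f;
    return t * (a * s) + a;
}
```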
