16th March, 2026
We analyse an efficient variation on left-shift binary GCD algorithms, and show how floating-point representations can be used to increase throughput for 32-bit and 64-bit values on NVIDIA Ampere and later.
While optimising Greatest Common Divisor for Tenstorrent, I happened across various alternative binary GCD algorithms that involve left-shifting instead of right-shifting.
The earliest reference I could find is R. P. Brent, “Analysis of the binary Euclidean algorithm” (1976). The following is a modern rendition, using clz to compute the exact left-shift amount instead of using a loop:
```python
def clz(x, width=32):
    # count leading zeros in a width-bit word
    return width - x.bit_length()

def brent_gcd(a, b):
    # invariant: a ≥ b
    if a < b:
        a, b = b, a
    while b != 0:
        # shift b so that its MSB lines up with a's
        e = clz(b) - clz(a)
        t = b << e
        if t > a:
            # e -= 1
            t >>= 1
        a -= t
        # maintain invariant a ≥ b
        if a < b:
            a, b = b, a
    return a
```

Shallit & Sorenson analyse a variant of this in “Analysis of a Left-Shift Binary Greatest Common Divisor (GCD) Algorithm”, with the addition of a min(a - t, (t << 1) - a) step.
```python
def shallit_gcd(a, b):
    # invariant: a ≥ b
    if a < b:
        a, b = b, a
    while b != 0:
        # find e ≥ 0 such that 2**e * b ≤ a < 2**(e+1) * b
        e = clz(b) - clz(a)
        t = b << e
        if t > a:
            # e -= 1
            t >>= 1
        a, b = b, min(a - t, (t << 1) - a)
        # maintain invariant a ≥ b
        if a < b:
            a, b = b, a
    return a
```

Shallit & Sorenson’s variant has some nice properties, such as a worst-case number of iterations similar to Stein’s binary GCD (just over k iterations for k-bit inputs), and fewer iterations in the average case.
| Variant | Average Case | Worst Case |
|---|---|---|
| Stein (right-shift) | 0.7060 k | k |
| Brent | 0.8758 k | 1.4404 k |
| Shallit & Sorenson | ~0.67 k | 1.0527 k |
However, it’s considerably more expensive than Stein’s algorithm due to effectively computing and testing three possible shifted values.
Consider the following variant for the hot loop, which uses only a single shift value to align the MSBs, followed by taking the absolute value of the difference:

```python
# assume a ≥ b
e = clz(b) - clz(a)
t = b << e
a = abs(a - t)
b, a = swap_min_max(b, a)  # i.e. b, a = min(a, b), max(a, b)
```

This hot loop is considerably more efficient, but at the expense of requiring slightly more iterations for random input values: an average of ~24 steps for 32-bit values, compared to ~22 steps for Stein’s algorithm.
The worst case is approximately 1.6 k steps for k-bit inputs, making it unsuitable for use on Tenstorrent’s vector engine, which has no way to easily exit a loop early, meaning the loop must run for the worst-case number of iterations.
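As a complete function, this variant can be sketched in plain Python (a sketch: bit_length stands in for clz, swap_min_max is resolved via min/max, and the step counter exists only to probe the iteration counts quoted above):

```python
from math import gcd
from random import getrandbits

def absdiff_gcd(a, b):
    # single-shift left-shift GCD: align MSBs, then take |a - t|
    a, b = max(a, b), min(a, b)
    steps = 0
    while b != 0:
        # clz(b) - clz(a) == a.bit_length() - b.bit_length()
        t = b << (a.bit_length() - b.bit_length())
        d = abs(a - t)
        b, a = min(b, d), max(b, d)  # restore invariant a ≥ b
        steps += 1
    return a, steps

# sanity check against math.gcd on random 32-bit inputs
total = 0
for _ in range(10000):
    x, y = getrandbits(32), getrandbits(32)
    g, steps = absdiff_gcd(x, y)
    assert g == gcd(x, y)
    total += steps
```

Averaging the step counter over such random inputs lands near the ~24 figure above.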
On NVIDIA Ampere and later, fp32 throughput is roughly twice that of INT32.
Instead of left-shifting integers to align MSBs, we can instead
represent values in fp32 or fp64, and set the exponent of the smaller
value to that of the larger value. This can be achieved efficiently via
a single lop3 operation, since the mantissa field of the
smaller value and the exponent field of the larger value occupy disjoint
bit ranges.
This only works for integers that are exactly representable in
floating point, i.e. u24 for fp32 and u53 for
fp64.
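Both points can be illustrated on the CPU; a sketch using Python's struct module to reinterpret fp32 bits (the masks 0x007fffff and 0xff800000 select the mantissa and sign+exponent fields respectively):

```python
import struct

def f2b(x):
    # reinterpret an fp32 value as its 32-bit pattern (rounding x to fp32)
    return struct.unpack('<I', struct.pack('<f', x))[0]

def b2f(n):
    # reinterpret a 32-bit pattern as an fp32 value
    return struct.unpack('<f', struct.pack('<I', n))[0]

def splice(a, b):
    # mantissa of b combined with sign+exponent of a, i.e. b rescaled
    # so that its MSB lines up with a's (models the lop3 step)
    return b2f((f2b(b) & 0x007fffff) | (f2b(a) & 0xff800000))

# the splice is equivalent to the integer left-shift b << e
a, b = 1000003, 77
e = a.bit_length() - b.bit_length()
assert splice(float(a), float(b)) == float(b << e)

# the u24 limit: 2**24 survives an fp32 round-trip, 2**24 + 1 does not
assert b2f(f2b(float(2**24))) == 2**24
assert b2f(f2b(float(2**24 + 1))) == 2**24
```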
The algorithm then performs a floating-point subtraction, followed by
fmin/fmax to restore the ordering
invariant.
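Putting these pieces together, the fp32/u24 loop can be modelled in Python; a sketch (Python floats are fp64, but every intermediate here stays below 2**24 and is therefore exact in both formats):

```python
import struct
from math import gcd

def f2b(x): return struct.unpack('<I', struct.pack('<f', x))[0]
def b2f(n): return struct.unpack('<f', struct.pack('<I', n))[0]

def splice(a, b):
    # set the exponent of b to that of a (the lop3 step)
    return b2f((f2b(b) & 0x007fffff) | (f2b(a) & 0xff800000))

def fp32_gcd_u24(x, y):
    a, b = float(max(x, y)), float(min(x, y))
    while b != 0.0:
        t = abs(a - splice(a, b))    # subtraction, then absolute value
        a, b = max(b, t), min(b, t)  # fmax/fmin restore invariant a ≥ b
    return int(a)

for x, y in [(1071, 462), (2**24 - 1, 3**10), (12, 0)]:
    assert fp32_gcd_u24(x, y) == gcd(x, y)
```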
On Ampere, the fp32/u24 inner loop lowers to the
following four-instruction SASS sequence. The subtraction appears as
FADD with a negated source operand, and the absolute value
is folded into the FMNMX source modifiers:
```
LOP3.LUT R2, R8, 0x7fffff, R7, ...  ; INT: t = {a.Exp, b.Man}
FADD R3, -R2, R7                    ; FP:  t = a - t
FMNMX R7, |R3|, R8, !PT             ; FP:  a = max(b, abs(t))
FMNMX R8, |R3|, R8, PT              ; FP:  b = min(b, abs(t))
```

In order to support 32-bit and 64-bit inputs, we use hybrid variants: strip common powers of two as in Stein, run 8 Stein frontend iterations for u32 or 11 for u64, and then switch to the floating-point core once the operands are in the exact u24 or u53 regime.
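The hybrid structure can be sketched as follows. For simplicity this sketch runs Stein frontend steps until both operands fit in u24, rather than using a fixed iteration count, and uses an integer model of the floating-point core; the function names are illustrative, not the actual kernel code.

```python
from math import gcd
from random import getrandbits

def core_u24(a, b):
    # integer model of the fp32 core: MSB-aligning shift, abs-diff, min/max
    a, b = max(a, b), min(a, b)
    while b != 0:
        t = abs(a - (b << (a.bit_length() - b.bit_length())))
        a, b = max(b, t), min(b, t)
    return a

def hybrid_gcd_u32(a, b):
    if a == 0 or b == 0:
        return a | b
    # strip common powers of two, as in Stein
    shift = ((a | b) & -(a | b)).bit_length() - 1
    a >>= shift
    b >>= shift
    a >>= (a & -a).bit_length() - 1  # make a odd
    # Stein frontend: right-shift steps until the operands fit in u24
    while b != 0 and max(a, b) >= 1 << 24:
        b >>= (b & -b).bit_length() - 1  # make b odd
        if a > b:
            a, b = b, a
        b -= a  # difference of two odd values is even
    return (core_u24(a, b) if b != 0 else a) << shift

for _ in range(10000):
    x, y = getrandbits(32), getrandbits(32)
    assert hybrid_gcd_u32(x, y) == gcd(x, y)
```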
For NVIDIA
A100 80GB PCIe, we get 1.64x for u24,
and 1.56x for u32 via the hybrid
algorithm, compared to plain Stein.
These measurements use count=1048576,
rounds=1024, launches=20,
block_size=256.
- fp32_u24: 258.993 ms, 8.292e+10 calls/s, 0.012 ns/call
- stein_u24: 424.368 ms, 5.060e+10 calls/s, 0.020 ns/call
- fp32_u32: 348.216 ms, 6.167e+10 calls/s, 0.016 ns/call
- stein_u32: 542.102 ms, 3.961e+10 calls/s, 0.025 ns/call
This throughput improvement extends to fp64, but only for NVIDIA’s HPC/datacenter GPUs. Consumer GeForce cards such as the RTX 3090 and RTX 4090 have severely reduced fp64 throughput (typically 1/64 of fp32).
We get 1.37x for u53, and
1.37x for u64 via the hybrid algorithm,
compared to plain Stein.
- fp64_u53: 1494.136 ms, 1.437e+10 calls/s, 0.070 ns/call
- stein_u53: 2049.027 ms, 1.048e+10 calls/s, 0.095 ns/call
- fp64_u64: 1759.152 ms, 1.221e+10 calls/s, 0.082 ns/call
- stein_u64: 2418.740 ms, 8.879e+09 calls/s, 0.113 ns/call