Floating Point Binary GCD

16th March, 2026

Introduction

We analyse an efficient variation on left-shift binary GCD algorithms, and show how floating-point representations can be used to increase throughput for 32-bit and 64-bit values on NVIDIA Ampere and later.

Left-Shift Binary GCD Variants

While optimising greatest common divisor (GCD) computation for Tenstorrent, I came across various alternative binary GCD algorithms that left-shift instead of right-shifting.

The earliest reference I could find is R. P. Brent, “Analysis of the binary Euclidean algorithm” (1976). The following is a modern rendition, using clz to compute the exact left-shift amount instead of using a loop:

def clz(x, width=32):
    # count leading zeros of x as a width-bit value
    return width - x.bit_length()

def brent_gcd(a, b):
    # invariant: a ≥ b
    if a < b:
        a, b = b, a
    while b != 0:
        # align the MSB of b with the MSB of a
        e = clz(b) - clz(a)
        t = b << e
        if t > a:
            # e -= 1
            t >>= 1
        a -= t
        if a < b:
            a, b = b, a
    return a

Shallit & Sorenson analyse a variant of this in “Analysis of a Left-Shift Binary Greatest Common Divisor (GCD) Algorithm”, with the addition of a min(a - t, (t << 1) - a) step.

def shallit_gcd(a, b):
    # invariant: a ≥ b
    if a < b:
        a, b = b, a
    while b != 0:
        # find e ≥ 0 such that 2**e b ≤ a ≤ 2**(e+1) b
        e = clz(b) - clz(a)
        t = b << e
        if t > a:
            # e -= 1
            t >>= 1
        a, b = b, min(a - t, (t << 1) - a)
        # maintain invariant a ≥ b
        if a < b:
            a, b = b, a
    return a

Shallit & Sorenson’s variant has some nice properties, such as a worst-case number of iterations similar to Stein’s binary GCD, i.e. just over k iterations for k-bit inputs, and fewer iterations in the average case.

Variant                 Average Case   Worst Case
Stein (right-shift)     0.7060 k       k
Brent                   0.8758 k       1.4404 k
Shallit & Sorenson      ~0.67 k        1.0527 k

However, it’s considerably more expensive than Stein’s algorithm due to effectively computing and testing three possible shifted values.

Single-Shift Absolute-Difference Variant

Consider the following variant of the hot loop, which uses only a single shifted value to align the MSBs, then takes the absolute difference:

# assume a ≥ b
e = clz(b) - clz(a)
t = b << e
d = abs(a - t)
a, b = max(b, d), min(b, d)  # restore invariant a ≥ b

This hot loop is considerably more efficient, but at the expense of requiring slightly more iterations for random input values: an average of ~24 steps compared to ~22 steps for Stein’s algorithm for 32-bit values.
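As a rough empirical check of these averages, here is a self-contained sketch (the helper names `single_shift_iters` and `stein_iters` are mine) that counts inner-loop iterations for random 32-bit pairs, using bit_length in place of clz:

```python
import math
import random

def single_shift_iters(a, b):
    # Single-shift absolute-difference variant; returns (gcd, iterations).
    if a < b:
        a, b = b, a
    n = 0
    while b:
        # clz(b) - clz(a) for same-width values
        e = a.bit_length() - b.bit_length()
        t = b << e
        d = abs(a - t)
        a, b = max(b, d), min(b, d)
        n += 1
    return a, n

def stein_iters(a, b):
    # Stein's right-shift binary GCD; returns (gcd, iterations).
    if a == 0 or b == 0:
        return a | b, 0
    k = ((a | b) & -(a | b)).bit_length() - 1  # common powers of two
    a >>= k
    b >>= k
    a >>= (a & -a).bit_length() - 1            # make a odd
    n = 0
    while b:
        b >>= (b & -b).bit_length() - 1        # make b odd
        if a > b:
            a, b = b, a
        b -= a                                 # even (or zero) result
        n += 1
    return a << k, n

random.seed(1)
pairs = [(random.getrandbits(32), random.getrandbits(32)) for _ in range(10000)]
ss = [single_shift_iters(a, b) for a, b in pairs]
st = [stein_iters(a, b) for a, b in pairs]
assert all(g == math.gcd(a, b) for (a, b), (g, _) in zip(pairs, ss))
assert all(g == math.gcd(a, b) for (a, b), (g, _) in zip(pairs, st))
print(f"single-shift: {sum(n for _, n in ss) / len(ss):.1f} iterations")
print(f"stein:        {sum(n for _, n in st) / len(st):.1f} iterations")
```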

The worst case is approximately 1.6 k steps for k-bit inputs, which makes it unsuitable for Tenstorrent’s vector engine: that engine has no cheap way to exit a loop early, so the loop must always run for the worst-case number of iterations.

Floating Point Variant on NVIDIA

On NVIDIA Ampere and later, fp32 throughput is roughly twice that of INT32.

Instead of left-shifting integers to align MSBs, we can instead represent values in fp32 or fp64, and set the exponent of the smaller value to that of the larger value. This can be achieved efficiently via a single lop3 operation, since the mantissa field of the smaller value and the exponent field of the larger value occupy disjoint bit ranges.

This only works for integers that are exactly representable in floating point, i.e. values that fit in 24 bits (u24) for fp32 and 53 bits (u53) for fp64.
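The merge can be modelled in scalar Python (the helper names are mine); the assertion at the end checks that forcing the smaller value’s exponent up to the larger value’s reproduces the MSB-aligning left shift:

```python
import struct

MANTISSA = 0x007fffff  # low 23 bits of an fp32 word

def f2u(x):
    return struct.unpack('<I', struct.pack('<f', x))[0]

def u2f(n):
    return struct.unpack('<f', struct.pack('<I', n))[0]

def force_exponent(a, b):
    # Sign and exponent bits from a, mantissa bits from b: the fields are
    # disjoint, so a single three-input logic op (lop3) can merge them.
    return u2f((f2u(a) & ~MANTISSA) | (f2u(b) & MANTISSA))

# Forcing b's exponent up to a's is the fp analogue of the integer
# MSB-aligning shift: with a = 12, b = 5 we have e = 1 and b << e = 10.
assert force_exponent(12.0, 5.0) == 10.0
```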

The algorithm then performs a floating-point subtraction, followed by fmin/fmax to restore the ordering invariant.
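For reference, the u24 core can be modelled in scalar Python (the function name is mine; struct does the bit-level merge). Every step stays exact: the merged value t has a’s exponent, so a/2 < t < 2a and the subtraction is exact by Sterbenz’s lemma.

```python
import struct

def fp32_gcd_u24(a, b):
    # Scalar model of the fp32 core; requires 0 <= a, b < 2**24 so that
    # every intermediate value is exact in fp32.
    f2u = lambda x: struct.unpack('<I', struct.pack('<f', x))[0]
    u2f = lambda n: struct.unpack('<f', struct.pack('<I', n))[0]
    a, b = float(max(a, b)), float(min(a, b))
    while b != 0.0:
        # merge a's sign/exponent field with b's mantissa field (the lop3)
        t = u2f((f2u(a) & 0xff800000) | (f2u(b) & 0x007fffff))
        d = abs(a - t)               # exact, since a/2 < t < 2a
        a, b = max(b, d), min(b, d)  # restore the ordering invariant
    return int(a)
```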

On Ampere, the fp32/u24 inner loop lowers to the following four-instruction SASS sequence. The subtraction appears as FADD with a negated source operand, and the absolute value is folded into the FMNMX source modifiers:

LOP3.LUT  R2, R8, 0x7fffff, R7, ... ; INT: t = {a.Exp,b.Man}
FADD      R3, -R2, R7               ; FP:  t = a - t
FMNMX     R7, |R3|, R8, !PT         ; FP:  a = max(b, abs(t))
FMNMX     R8, |R3|, R8, PT          ; FP:  b = min(b, abs(t))

In order to support 32-bit and 64-bit inputs, we use hybrid variants: strip common powers of two as in Stein, run 8 Stein frontend iterations for u32 or 11 for u64, and then switch to the floating-point core once the operands are in the exact u24 or u53 regime.
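A scalar sketch of the u32 hybrid follows (names mine). The fp32 core is modelled here with the equivalent integer single-shift loop, and for robustness the frontend loops until both operands fit in u24 rather than hard-coding the eight rounds:

```python
def hybrid_gcd_u32(a, b):
    # Sketch of the hybrid: Stein frontend, then the single-shift core
    # (which on the GPU runs in fp32 once both operands are below 2**24).
    if a == 0 or b == 0:
        return a | b
    k = ((a | b) & -(a | b)).bit_length() - 1  # common powers of two
    a >>= k
    b >>= k
    a >>= (a & -a).bit_length() - 1            # make a odd
    # Stein frontend; the text bounds this at 8 rounds for u32 inputs.
    while b and (a >= 1 << 24 or b >= 1 << 24):
        b >>= (b & -b).bit_length() - 1        # make b odd
        if a > b:
            a, b = b, a
        b -= a
    # Single-shift absolute-difference core (fp32 on the GPU).
    if a < b:
        a, b = b, a
    while b:
        t = b << (a.bit_length() - b.bit_length())
        d = abs(a - t)
        a, b = max(b, d), min(b, d)
    return a << k
```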

On an NVIDIA A100 80GB PCIe, we measure a 1.64x throughput improvement for u24, and 1.56x for u32 via the hybrid algorithm, compared to plain Stein.

These measurements use count=1048576, rounds=1024, launches=20, block_size=256.

- fp32_u24:  258.993 ms, 8.292e+10 calls/s, 0.012 ns/call
- stein_u24: 424.368 ms, 5.060e+10 calls/s, 0.020 ns/call
- fp32_u32:  348.216 ms, 6.167e+10 calls/s, 0.016 ns/call
- stein_u32: 542.102 ms, 3.961e+10 calls/s, 0.025 ns/call

This throughput improvement extends to fp64, but only for NVIDIA’s HPC/datacenter GPUs. Consumer GeForce cards such as the RTX 3090 and RTX 4090 have severely reduced fp64 throughput (typically 1/64 of fp32).

We get 1.37x for u53, and 1.37x for u64 via the hybrid algorithm, compared to plain Stein.

- fp64_u53:  1494.136 ms, 1.437e+10 calls/s, 0.070 ns/call
- stein_u53: 2049.027 ms, 1.048e+10 calls/s, 0.095 ns/call
- fp64_u64:  1759.152 ms, 1.221e+10 calls/s, 0.082 ns/call
- stein_u64: 2418.740 ms, 8.879e+09 calls/s, 0.113 ns/call