16th March, 2026
We analyse an efficient variation on left-shift binary GCD algorithms, and show how floating-point representations can be used to increase throughput for 32-bit and 64-bit values on NVIDIA Ampere and later.
While optimising Greatest Common Divisor for Tenstorrent, I happened across various alternative binary GCD algorithms that involve left-shifting instead of right-shifting.
The earliest reference I could find is R. P. Brent, “Analysis of the binary Euclidean algorithm” (1976). The following is a modern rendition, using clz to compute the exact left-shift amount instead of using a loop:
```python
def clz(x, width=32):
    # count leading zeros in a width-bit word
    return width - x.bit_length()

def brent_gcd(a, b):
    # invariant: a ≥ b
    if a < b:
        a, b = b, a
    while b != 0:
        # shift b so that its MSB lines up with a's
        e = clz(b) - clz(a)
        t = b << e
        if t > a:
            # e -= 1
            t >>= 1
        a -= t
        # maintain invariant a ≥ b
        if a < b:
            a, b = b, a
    return a
```

Shallit & Sorenson analyse a variant of this in “Analysis of a Left-Shift Binary Greatest Common Divisor (GCD) Algorithm”, with the addition of a min(a - t, (t << 1) - a) step.
```python
def shallit_gcd(a, b):
    # invariant: a ≥ b
    if a < b:
        a, b = b, a
    while b != 0:
        # find e ≥ 0 such that 2**e * b ≤ a < 2**(e+1) * b
        e = clz(b) - clz(a)
        t = b << e
        if t > a:
            # e -= 1
            t >>= 1
        a, b = b, min(a - t, (t << 1) - a)
        # maintain invariant a ≥ b
        if a < b:
            a, b = b, a
    return a
```

Shallit & Sorenson’s variant has some nice properties, such as a worst-case number of iterations similar to Stein’s binary GCD (just over k iterations for k-bit inputs), and fewer iterations in the average case.
| Variant | Average Case | Worst Case |
|---|---|---|
| Stein (right-shift) | 0.7060 k | k |
| Brent | 0.8758 k | 1.4404 k |
| Shallit & Sorenson | ~0.67 k | 1.0527 k |
However, it’s considerably more expensive than Stein’s algorithm due to effectively computing and testing three possible shifted values.
Consider the following variant for the hot loop, which uses only a single shift value to align the MSBs, followed by taking the absolute value of the difference:

```python
# assume a ≥ b
e = clz(b) - clz(a)
t = b << e
a = abs(a - t)
b, a = swap_min_max(b, a)  # i.e. b, a = min(a, b), max(a, b)
```

This hot loop is considerably more efficient, but at the expense of requiring slightly more iterations for random input values: an average of ~24 steps for 32-bit values, compared to ~22 steps for Stein’s algorithm.
The worst case is approximately 1.6 k steps for k-bit inputs, making it unsuitable for use on Tenstorrent’s vector engine, which has no way to easily exit a loop early, meaning the loop must run for the worst-case number of iterations.
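As a complete function, this variant can be sketched in plain Python (a sketch: bit_length stands in for clz, swap_min_max is resolved via min/max, and the step counter exists only to probe the iteration counts quoted above):

```python
from math import gcd
from random import getrandbits

def absdiff_gcd(a, b):
    # single-shift left-shift GCD: align MSBs, then take |a - t|
    a, b = max(a, b), min(a, b)
    steps = 0
    while b != 0:
        # clz(b) - clz(a) == a.bit_length() - b.bit_length()
        t = b << (a.bit_length() - b.bit_length())
        d = abs(a - t)
        b, a = min(b, d), max(b, d)  # restore invariant a ≥ b
        steps += 1
    return a, steps

# sanity check against math.gcd on random 32-bit inputs
total = 0
for _ in range(10000):
    x, y = getrandbits(32), getrandbits(32)
    g, steps = absdiff_gcd(x, y)
    assert g == gcd(x, y)
    total += steps
```

Averaging the step counter over such random inputs lands near the ~24 figure above.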
On NVIDIA Ampere and later, fp32 throughput is roughly twice that of INT32.
Instead of left-shifting integers to align MSBs, we can instead
represent values in fp32 or fp64, and set the exponent of the smaller
value to that of the larger value. This can be achieved efficiently via
a single lop3 operation, since the mantissa field of the
smaller value and the exponent field of the larger value occupy disjoint
bit ranges.
This only works for integers that are exactly representable in
floating point, i.e. u24 for fp32 and u53 for
fp64.
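Both points can be illustrated on the CPU; a sketch using Python's struct module to reinterpret fp32 bits (the masks 0x007fffff and 0xff800000 select the mantissa and sign+exponent fields respectively):

```python
import struct

def f2b(x):
    # reinterpret an fp32 value as its 32-bit pattern (rounding x to fp32)
    return struct.unpack('<I', struct.pack('<f', x))[0]

def b2f(n):
    # reinterpret a 32-bit pattern as an fp32 value
    return struct.unpack('<f', struct.pack('<I', n))[0]

def splice(a, b):
    # mantissa of b combined with sign+exponent of a, i.e. b rescaled
    # so that its MSB lines up with a's (models the lop3 step)
    return b2f((f2b(b) & 0x007fffff) | (f2b(a) & 0xff800000))

# the splice is equivalent to the integer left-shift b << e
a, b = 1000003, 77
e = a.bit_length() - b.bit_length()
assert splice(float(a), float(b)) == float(b << e)

# the u24 limit: 2**24 survives an fp32 round-trip, 2**24 + 1 does not
assert b2f(f2b(float(2**24))) == 2**24
assert b2f(f2b(float(2**24 + 1))) == 2**24
```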
The algorithm then performs a floating-point subtraction, followed by
fmin/fmax to restore the ordering
invariant.
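Putting these pieces together, the fp32/u24 loop can be modelled in Python; a sketch (Python floats are fp64, but every intermediate here stays below 2**24 and is therefore exact in both formats):

```python
import struct
from math import gcd

def f2b(x): return struct.unpack('<I', struct.pack('<f', x))[0]
def b2f(n): return struct.unpack('<f', struct.pack('<I', n))[0]

def splice(a, b):
    # set the exponent of b to that of a (the lop3 step)
    return b2f((f2b(b) & 0x007fffff) | (f2b(a) & 0xff800000))

def fp32_gcd_u24(x, y):
    a, b = float(max(x, y)), float(min(x, y))
    while b != 0.0:
        t = abs(a - splice(a, b))    # subtraction, then absolute value
        a, b = max(b, t), min(b, t)  # fmax/fmin restore invariant a ≥ b
    return int(a)

for x, y in [(1071, 462), (2**24 - 1, 3**10), (12, 0)]:
    assert fp32_gcd_u24(x, y) == gcd(x, y)
```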
On Ampere, the fp32/u24 inner loop lowers to the
following four-instruction SASS sequence. The subtraction appears as
FADD with a negated source operand, and the absolute value
is folded into the FMNMX source modifiers:
```
LOP3.LUT R2, R8, 0x7fffff, R7, ...  ; INT: t = {a.Exp, b.Man}
FADD R3, -R2, R7                    ; FP:  t = a - t
FMNMX R7, |R3|, R8, !PT             ; FP:  a = max(b, abs(t))
FMNMX R8, |R3|, R8, PT              ; FP:  b = min(b, abs(t))
```

In order to support 32-bit and 64-bit inputs, we use hybrid variants: strip common powers of two as in Stein, run 8 Stein frontend iterations for u32 or 11 for u64, and then switch to the floating-point core once the operands are in the exact u24 or u53 regime.
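The hybrid structure can be sketched as follows. For simplicity this sketch runs Stein frontend steps until both operands fit in u24, rather than using a fixed iteration count, and uses an integer model of the floating-point core; the function names are illustrative, not the actual kernel code.

```python
from math import gcd
from random import getrandbits

def core_u24(a, b):
    # integer model of the fp32 core: MSB-aligning shift, abs-diff, min/max
    a, b = max(a, b), min(a, b)
    while b != 0:
        t = abs(a - (b << (a.bit_length() - b.bit_length())))
        a, b = max(b, t), min(b, t)
    return a

def hybrid_gcd_u32(a, b):
    if a == 0 or b == 0:
        return a | b
    # strip common powers of two, as in Stein
    shift = ((a | b) & -(a | b)).bit_length() - 1
    a >>= shift
    b >>= shift
    a >>= (a & -a).bit_length() - 1  # make a odd
    # Stein frontend: right-shift steps until the operands fit in u24
    while b != 0 and max(a, b) >= 1 << 24:
        b >>= (b & -b).bit_length() - 1  # make b odd
        if a > b:
            a, b = b, a
        b -= a  # difference of two odd values is even
    return (core_u24(a, b) if b != 0 else a) << shift

for _ in range(10000):
    x, y = getrandbits(32), getrandbits(32)
    assert hybrid_gcd_u32(x, y) == gcd(x, y)
```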
For NVIDIA
A100 80GB PCIe, we get 1.64x for u24,
and 1.56x for u32 via the hybrid
algorithm, compared to plain Stein.
These measurements use count=1048576,
rounds=1024, launches=20,
block_size=256.
- fp32_u24: 258.993 ms, 8.292e+10 calls/s, 0.012 ns/call
- stein_u24: 424.368 ms, 5.060e+10 calls/s, 0.020 ns/call
- fp32_u32: 348.216 ms, 6.167e+10 calls/s, 0.016 ns/call
- stein_u32: 542.102 ms, 3.961e+10 calls/s, 0.025 ns/call
This throughput improvement extends to fp64, but only for NVIDIA’s HPC/datacenter GPUs. Consumer GeForce cards such as the RTX 3090 and RTX 4090 have severely reduced fp64 throughput (typically 1/64 of fp32).
We get 1.37x for u53, and
1.37x for u64 via the hybrid algorithm,
compared to plain Stein.
- fp64_u53: 1494.136 ms, 1.437e+10 calls/s, 0.070 ns/call
- stein_u53: 2049.027 ms, 1.048e+10 calls/s, 0.095 ns/call
- fp64_u64: 1759.152 ms, 1.221e+10 calls/s, 0.082 ns/call
- stein_u64: 2418.740 ms, 8.879e+09 calls/s, 0.113 ns/call