5th March, 2026
Below we optimise gcd(a, b) for computing the greatest
common divisor of two 32-bit signed integers on Tenstorrent’s AI accelerators.
Stein’s binary GCD algorithm relies only on shifts, compares, and subtracts, and so is the obvious choice for Tenstorrent.
Here’s one possible optimised implementation:
def gcd(a: int, b: int) -> int:
a = abs(a)
b = abs(b)
k = ctz(a | b) # k = min(ctz(a), ctz(b))
# ensure b is odd (if possible)
if (b & 1) == 0:
a, b = b, a
while a != 0:
a >>= ctz(a)
if a < b:
a, b = b, a
a -= b
return b << kThere is no direct ctz (count trailing zeros)
instruction for Tenstorrent’s vector engine, but it does have SFPLZ
(count leading zeros). Assuming non-zero x, we use the
following:
ctz(x) = 31 - clz(x & -x)
In assembly, this looks roughly like this:
sfpmov ; y = x
sfpiadd ; y = -y
sfpand ; y &= x
sfplz ; y = clz(y)
sfpiadd ; y = 31 - ySince we have to perform a subtraction anyway, note that we can avoid
the final b << k by instead doing
a >>= ctz(a) - k in the loop:
def gcd(a: int, b: int) -> int:
a = abs(a)
b = abs(b)
k = ctz(a | b) # k = min(ctz(a), ctz(b))
# ensure b >> k is odd (if possible)
if (b & (1 << k)) == 0:
a, b = b, a
while a != 0:
a >>= ctz(a) - k
if a < b:
a, b = b, a
a -= b
return bThe initialisation portion ensures that we end up with
L0 = -a, L1 = b, L3 = k-31, with
a and b positive, and
b >> k odd (if possible).
; assume constant register L9 = 0
sfpmov L2,L0,0 ; c = a
sfpor L2,L1 ; c |= b
sfpmov L3,L2,0 ; L3 = c
sfpiadd L3,L9,0x000,6 ; L3 = -L3
sfpand L3,L2 ; L3 &= c (isolate lsb)
sfplz L3,L3,0 ; L3 = clz(L3) = 31-k
; ensure that b is odd if possible: if k-th bit is zero, then swap with a.
; left-shifting b by 31-k converts its k-th bit to the sign bit.
sfpshft2 L2,L3,L1,5 ; c = b << (31-k)
sfpsetcc L2,0,6 ; if c == 0 then b >> k is even
sfpswap L1,L0,0 ; swap(a, b)
sfpencc 0,0,0,0
sfpabs L0,L0,0 ; a = abs(a)
sfpabs L1,L1,0 ; b = abs(b)
sfpiadd L0,L9,6 ; a = -a
sfpiadd L3,L9,6 ; L3 = k-31The inner loop consists of the following 7 instructions, consuming
8 cycles per iteration (due to SFPSWAP
taking two cycles).
; L0 = -a, L1 = b, L3 = k-31
sfpabs L2,L0,0 ; a = abs(-a)
sfpand L0,L2 ; a & -a
sfplz L0,L0,2 ; clz(a & -a); disable lanes where a=0
sfpiadd L0,L3,0x000,4 ; -(ctz(a) - k) = clz(a & -a) + k - 31
sfpshft2 L0,L0,L2,5 ; a >>= (ctz(a) - k) & 31
sfpswap L1,L0,1 ; ensure b <= a
sfpiadd L0,L1,0x000,6 ; -a = b - a (kept in negated form)The worst case for 31-bit integers is 31 iterations of the inner
loop; however, we can skip the final iteration as it only affects
a, and we can also skip the final instruction of the 30th
iteration as it only affects a (but we need to consume one
additional cycle to reset LaneFlags via SFPENCC).
This gives a total cycle count of
15 + 30 * 8 - 1 + 1 = 255.
NaN sticky bits, which unfortunately didn’t work reliably
due to hardware bugs.