Greatest Common Divisor on Tenstorrent

5th March, 2026

Introduction

Below we optimise gcd(a, b) for computing the greatest common divisor of two 32-bit signed integers on Tenstorrent’s AI accelerators.

Stein’s Binary GCD Algorithm

Stein’s binary GCD algorithm relies only on shifts, compares, and subtracts, and so is the obvious choice for Tenstorrent.

Here’s one possible optimised implementation:

def gcd(a: int, b: int) -> int:
    a = abs(a)
    b = abs(b)
    k = ctz(a | b) # k = min(ctz(a), ctz(b))

    # ensure b is odd (if possible)
    if (b & 1) == 0:
        a, b = b, a

    while a != 0:
        a >>= ctz(a)
        if a < b:
            a, b = b, a
        a -= b

    return b << k

There is no direct ctz (count trailing zeros) instruction for Tenstorrent’s vector engine, but it does have SFPLZ (count leading zeros). Assuming non-zero x, we use the following:

ctz(x) = 31 - clz(x & -x)

In assembly, this looks roughly like this:

sfpmov  ; y = x
sfpiadd ; y = -y
sfpand  ; y &= x
sfplz   ; y = clz(y)
sfpiadd ; y = 31 - y

Since we have to perform a subtraction anyway, note that we can avoid the final b << k by instead doing a >>= ctz(a) - k in the loop:

def gcd(a: int, b: int) -> int:
    a = abs(a)
    b = abs(b)
    k = ctz(a | b) # k = min(ctz(a), ctz(b))

    # ensure b >> k is odd (if possible)
    if (b & (1 << k)) == 0:
        a, b = b, a

    while a != 0:
        a >>= ctz(a) - k
        if a < b:
            a, b = b, a
        a -= b
    return b

Code

The initialisation portion ensures that we end up with L0 = -a, L1 = b, L3 = k-31, with a and b positive, and b >> k odd (if possible).

; assume constant register L9 = 0
sfpmov   L2,L0,0           ; c = a
sfpor    L2,L1             ; c |= b

sfpmov   L3,L2,0           ; L3 = c
sfpiadd  L3,L9,0x000,6     ; L3 = -L3
sfpand   L3,L2             ; L3 &= c (isolate lsb)
sfplz    L3,L3,0           ; L3 = clz(L3) = 31-k

; ensure that b is odd if possible: if k-th bit is zero, then swap with a.
; left-shifting b by 31-k converts its k-th bit to the sign bit.
sfpshft2 L2,L3,L1,5        ; c = b << (31-k)
sfpsetcc L2,0,6            ; if c == 0 then b >> k is even
sfpswap  L1,L0,0           ; swap(a, b)
sfpencc  0,0,0,0
sfpabs   L0,L0,0           ; a = abs(a)
sfpabs   L1,L1,0           ; b = abs(b)

sfpiadd  L0,L9,6           ; a = -a
sfpiadd  L3,L9,6           ; L3 = k-31

The inner loop consists of the following 7 instructions, consuming 8 cycles per iteration (due to SFPSWAP taking two cycles).

; L0 = -a, L1 = b, L3 = k-31
sfpabs   L2,L0,0           ; a = abs(-a)
sfpand   L0,L2             ; a & -a
sfplz    L0,L0,2           ; clz(a & -a); disable lanes where a=0
sfpiadd  L0,L3,0x000,4     ; -(ctz(a) - k) = clz(a & -a) + k - 31
sfpshft2 L0,L0,L2,5        ; a >>= (ctz(a) - k) & 31
sfpswap  L1,L0,1           ; ensure b <= a
sfpiadd  L0,L1,0x000,6     ; -a = b - a (kept in negated form)

The worst case for 31-bit integers is 31 iterations of the inner loop; however, we can skip the final iteration as it only affects a, and we can also skip the final instruction of the 30th iteration as it only affects a (but we need to consume one additional cycle to reset LaneFlags via SFPENCC).

This gives a total cycle count of 15 + 30 * 8 - 1 + 1 = 255.

Acknowledgements