12th November, 2025
Tenstorrent’s AI accelerator chips consist of tiles arranged in a grid, connected with a network-on-chip for efficient dataflow processing. Each tile features a vector unit that executes a limited number of SIMD operations across 32 lanes.
where(condition, t, f, out) selects values from either
t or f, depending on the corresponding value
in condition, writing the result to out.
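As a reference point, the intended semantics can be sketched in Python (an illustrative model of the operation, not the kernel itself; the function name and list representation are assumptions for this sketch):

```python
# Illustrative reference semantics of where() across the 32 SIMD lanes.
LANES = 32

def where(condition, t, f):
    """Per lane: select t[i] where condition[i] is non-zero, else f[i]."""
    assert len(condition) == len(t) == len(f) == LANES
    return [t[i] if condition[i] != 0 else f[i] for i in range(LANES)]
```
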
Below we implement where on the vector unit, achieving
optimal throughput of 3 cycles per row for the common in-place case and
4 cycles per row for the out-of-place case.
Our kernel is given four offset parameters, representing
condition, t, f, and
out. Pseudocode for a relatively optimised sequential
solution on the 32-lane vector unit looks like this:
# parameters: offset0, offset1, offset2, offset3
# (each operation acts across all 32 lanes; the branch is per-lane predication)
condition = load(offset0)
result = load(offset1)
if condition == 0:
    result = load(offset2)
store(result, offset3)
This is equivalent to the following assembly code, achieving 6 cycles per row:
// Parameters: offset0, offset1, offset2, offset3
// ADDR_MOD_7: doesn't increment counters
// ADDR_MOD_6: autoincrements Dst counter afterwards
sfpload L0, 0, ADDR_MOD_7, offset0
sfpload L1, 0, ADDR_MOD_7, offset1
sfpsetcc 0, L0, L0, SFPSETCC_MOD1_LREG_EQ0
sfpload L1, 0, ADDR_MOD_7, offset2
sfpencc 0, L0, L0, 0
sfpstore L1, 0, ADDR_MOD_6, offset3
The vector unit has five sub-units:
load, simple, mad, round, and store. The only way to use more than one
sub-unit at a time is via SFPLOADMACRO,
which allows us to schedule up to one instruction per sub-unit to
execute during future cycles, subject to various constraints.
The following diagram shows the schedule for our sequential code, with register liveness on the right.
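The six-instruction sequence can also be modeled in plain Python. This is a simplified sketch assuming that loads write only to enabled lanes and that lane flags start fully enabled; real SFPU behavior has more modes, and the helper names here are invented for illustration:

```python
# Simplified lane-flag model of the 6-instruction sequential kernel.
LANES = 32

def run_sequential(mem):
    """mem maps offsets 0..3 to rows of 32 values; offset 3 is the output."""
    regs = {"L0": [0] * LANES, "L1": [0] * LANES}
    enabled = [True] * LANES  # per-lane flags

    def load(dst, offset):
        # Loads only write lanes whose flag is enabled.
        for i in range(LANES):
            if enabled[i]:
                regs[dst][i] = mem[offset][i]

    load("L0", 0)                        # sfpload L0, offset0 (condition)
    load("L1", 1)                        # sfpload L1, offset1 (t)
    for i in range(LANES):               # sfpsetcc, LREG_EQ0 mode:
        enabled[i] = regs["L0"][i] == 0  #   enable lanes where condition == 0
    load("L1", 2)                        # sfpload L1, offset2 (f), masked
    enabled[:] = [True] * LANES          # sfpencc: re-enable all lanes
    for i in range(LANES):               # sfpstore L1, offset3
        mem[3][i] = regs["L1"][i]

cond = [i % 2 for i in range(LANES)]
mem = {0: cond, 1: [10] * LANES, 2: [20] * LANES, 3: [0] * LANES}
run_sequential(mem)
assert mem[3] == [20 if c == 0 else 10 for c in cond]
```
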
The most common pattern is to call where(0, 1, 2, 0), so
that the result is written to condition.
This allows us to schedule the SFPSTORE
that writes the result back to condition so that it
executes in the same cycle as the next row's SFPLOAD.
Note that one of the constraints of SFPLOADMACRO is that
a scheduled SFPSTORE has to write to the address its macro
loaded from, so this is only possible when the output is written to
condition.
We require two macros, followed by a regular SFPLOAD:
1. SFPLOADMACRO:
load from condition to the L0 register. Also,
schedule two additional instructions: an SFPSETCC, which will enable
only the lanes whose condition value is zero, and an SFPSTORE, which
will later write the result back.
2. SFPLOADMACRO:
load from t to L0, and schedule an SFPENCC, which will
re-enable all lanes.
3. SFPLOAD:
load from f, overwriting the value in L0 in
lanes that are enabled. Also, auto-increment the address counters (which
happens regardless of lane flags).
Finally, the SFPSTORE
scheduled in the first step will write the result in L0
back to memory. At this point, the next SFPLOADMACRO call
can be executed, simultaneously with the SFPSTORE.
The trick here is that it’s safe to schedule an instruction that reads from a register for the same time as an instruction that writes to that register: the read happens at the beginning of the cycle, and the write happens at the end of the cycle.
For example, SFPSETCC L0 L0 reads L0 while
SFPLOAD L0 1 writes to L0 during the same
cycle, as illustrated by the diagram below, with register liveness shown
on the right.
Note also that we require only one register, since the
condition value only needs to live for one cycle.
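This read-at-start, write-at-end rule can be sketched with a tiny Python model (the `cycle` helper is hypothetical, purely for illustration; 4 lanes for brevity):

```python
# Sketch of the same-cycle read/write rule: reads sample register state at
# the start of the cycle, and writes commit at the end of the cycle.
def cycle(regs, reads, writes):
    """Execute one cycle: all reads see the pre-cycle state, then writes land."""
    snapshot = {r: list(v) for r, v in regs.items()}
    results = {name: fn(snapshot) for name, fn in reads.items()}
    for reg, value in writes.items():
        regs[reg] = value
    return results

regs = {"L0": [0, 5, 0, 7]}  # condition values
out = cycle(
    regs,
    # SFPSETCC: compute lane flags from the condition still in L0 ...
    reads={"flags": lambda s: [v == 0 for v in s["L0"]]},
    # ... while SFPLOAD overwrites L0 with the values of t in the same cycle.
    writes={"L0": [10, 11, 12, 13]},
)
assert out["flags"] == [True, False, True, False]  # saw the old L0
assert regs["L0"] == [10, 11, 12, 13]              # new value after the cycle
```
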
3 cycles is theoretically optimal for this case: the operation requires loading from three different memory addresses, and the load sub-unit can issue only one load per cycle.
If we are required to write the result to a distinct address,
e.g. where(0, 1, 2, 3), we can no longer use SFPLOADMACRO
to schedule the SFPSTORE,
as it can only write to the address it loaded from.
Instead, we add a regular SFPSTORE
instruction, achieving 4 cycles per input row:
A stall between instructions will disrupt the timing-sensitive sequence of disabling and enabling lanes, leading to incorrect results.
A stall after the first instruction prevents
sfpload L0 1 from executing unconditionally, as it now
executes after sfpsetcc:
A stall after the second instruction executes
sfpload L0 2 unconditionally, as it now executes after
sfpencc:
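A toy model (hypothetical helpers, 4 lanes for brevity) illustrates the first failure mode: a stall lets sfpsetcc's flags take effect before the load of t, turning it into a masked load and corrupting the result:

```python
# Sketch of why a stall corrupts results: if the load of t slips to after
# sfpsetcc's flags take effect, it becomes a masked load (illustrative model).
LANES = 4
cond = [0, 5, 0, 7]
t, f = [10] * LANES, [20] * LANES

def kernel(stall_after_first):
    L0 = list(cond)                      # load condition into L0
    flags = [c == 0 for c in cond]       # lanes sfpsetcc will enable
    if not stall_after_first:
        # No stall: the load of t issues in the same cycle as sfpsetcc,
        # before the new flags take effect, so it writes all lanes.
        L0 = list(t)
        enabled = flags
    else:
        # Stall: the flags are already live, so the load of t is masked
        # and the disabled lanes keep stale condition values.
        enabled = flags
        L0 = [t[i] if enabled[i] else L0[i] for i in range(LANES)]
    # Load f into the enabled lanes, then (after sfpencc) store the result.
    L0 = [f[i] if enabled[i] else L0[i] for i in range(LANES)]
    return L0

expected = [f[i] if cond[i] == 0 else t[i] for i in range(LANES)]
assert kernel(stall_after_first=False) == expected
assert kernel(stall_after_first=True) != expected  # stale data leaks through
```
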
This can be avoided by first loading the instructions into a replay
buffer, and using a single REPLAY
instruction to issue the 3 (or 4) instructions in sequence without
stalls. Alternatively, a MOP
expansion could be used.
Note that it’s possible to specify
DelayKind=WaitForElapsedInstructions, which only decrements
the delay counter every time a thread issues an instruction to the
vector unit, instead of every cycle. However, a scheduled instruction
executes on the cycle after the cycle where it counts down from
1 to 0; or, in the case of Delay=0, the scheduled
instruction will execute on the next cycle regardless, and so this
doesn’t help avoid the issue.
Leveraging SFPLOADMACRO,
we achieve a theoretically optimal 3-cycle throughput for in-place
where and a 4-cycle throughput for out-of-place
where, while ensuring stable execution by avoiding
instruction stalls.
Thanks to Tenstorrent for sponsoring this work.