Optimal "where" on Tenstorrent

12th November, 2025

Introduction

Tenstorrent’s AI accelerator chips consist of tiles arranged in a grid, connected with a network-on-chip for efficient dataflow processing. Each tile features a vector unit that executes a limited number of SIMD operations across 32 lanes.

where(condition, t, f, out) selects values from either t or f, depending on the corresponding value in condition, writing the result to out.
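As a reference for the rest of the post, the semantics can be sketched in plain Python (a minimal model; the flat list standing in for memory and the function signature are our own illustration, not Tenstorrent's API):

```python
# Reference semantics of where() for one row of 32 lanes.
# mem is a flat list standing in for memory; offsets index into it.
def where(mem, offset_cond, offset_t, offset_f, offset_out, lanes=32):
    for lane in range(lanes):
        condition = mem[offset_cond + lane]
        t = mem[offset_t + lane]
        f = mem[offset_f + lane]
        mem[offset_out + lane] = t if condition != 0 else f
```

Each lane selects independently; the vector unit performs all 32 selections at once.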

Below we implement where on the vector unit, achieving optimal throughput of 3 cycles per row for the common in-place case and 4 cycles per row for the out-of-place case.

Sequential Code

Our kernel is given four offset parameters, representing condition, t, f, and out. Pseudocode for a relatively optimised sequential solution on the 32-lane vector unit looks like this:

# parameters: offset0, offset1, offset2, offset3
# each operation acts across all 32 lanes at once
condition = load(offset0)
result = load(offset1)
if condition == 0:  # per-lane predication
  result = load(offset2)
store(result, offset3)

This is equivalent to the following assembly code, achieving 6 cycles per row:

// Parameters: offset0, offset1, offset2, offset3
// ADDR_MOD_7: doesn't increment counters
// ADDR_MOD_6: autoincrements Dst counter afterwards

sfpload  L0, 0, ADDR_MOD_7, offset0
sfpload  L1, 0, ADDR_MOD_7, offset1
sfpsetcc 0, L0, L0, SFPSETCC_MOD1_LREG_EQ0
sfpload  L1, 0, ADDR_MOD_7, offset2
sfpencc  0, L0, L0, 0
sfpstore L1, 0, ADDR_MOD_6, offset3
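The lane-flag mechanics of this sequence can be modelled in Python (a sketch under our own memory model; sfpsetcc with SFPSETCC_MOD1_LREG_EQ0 enables exactly the lanes whose value is zero, and sfpencc re-enables all lanes):

```python
# Model of the 6-instruction sequence, with explicit per-lane flags.
def where_seq(mem, offset0, offset1, offset2, offset3, lanes=32):
    L0 = [mem[offset0 + i] for i in range(lanes)]   # sfpload L0 <- condition
    L1 = [mem[offset1 + i] for i in range(lanes)]   # sfpload L1 <- t
    flags = [v == 0 for v in L0]                    # sfpsetcc: enable lanes where L0 == 0
    for i in range(lanes):
        if flags[i]:                                # predicated sfpload:
            L1[i] = mem[offset2 + i]                #   L1 <- f in enabled lanes only
    flags = [True] * lanes                          # sfpencc: re-enable all lanes
    for i in range(lanes):
        if flags[i]:
            mem[offset3 + i] = L1[i]                # sfpstore L1 -> out
```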

Parallel Execution via SFPLOADMACRO

The vector unit has five sub-units: load, simple, mad, round, and store. The only way to use more than one sub-unit at a time is via SFPLOADMACRO, which allows us to schedule up to one instruction per sub-unit to execute during future cycles, subject to various constraints.

The following diagram shows the schedule for our sequential code, with register liveness on the right.

In-Place Output

The most common pattern is to call where(0, 1, 2, 0), so that the result is written to condition.

This allows us to schedule the SFPSTORE to write the result back to condition, while the next SFPLOAD executes at the same time.

Note that one of the constraints of SFPLOADMACRO is that a scheduled SFPSTORE has to write to the address its macro loaded from, so this is only possible when the output is written to condition.

We require two macros, followed by a regular SFPLOAD:

  1. SFPLOADMACRO: load from condition to the L0 register. Also, schedule two additional instructions: an SFPSETCC to restrict the lane flags to lanes where the condition is zero, and the SFPSTORE that will write the result back.
  2. SFPLOADMACRO: load from t to L0, and schedule an SFPENCC to re-enable all lanes.
  3. SFPLOAD: load from f, overwriting the value in L0 in lanes that are enabled. Also, auto-increment the address counters (which happens regardless of lane flags).

Finally, the SFPSTORE scheduled in the first step will write the result in L0 back to memory. At this point, the next SFPLOADMACRO call can be executed, simultaneously with the SFPSTORE.
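Putting the pieces together, one plausible per-cycle schedule looks like the following (our reconstruction as a data sketch; the exact cycle each scheduled instruction lands on is an assumption, and the diagram is authoritative):

```python
# Hypothetical cycle-by-cycle schedule for one in-place row.
# Instructions in the same list execute during the same cycle.
schedule = {
    1: ["SFPLOADMACRO L0 <- [condition]"],          # schedules SFPSETCC and SFPSTORE
    2: ["SFPLOADMACRO L0 <- [t]",                   # schedules SFPENCC
        "SFPSETCC (reads old L0 at cycle start)"],
    3: ["SFPLOAD L0 <- [f] (enabled lanes only)",
        "SFPENCC (re-enable all lanes)"],
    4: ["SFPSTORE L0 -> [condition]",               # scheduled back in cycle 1
        "next row's SFPLOADMACRO"],                 # 3-cycle steady state
}
```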

The trick here is that it’s safe to schedule an instruction that reads from a register for the same time as an instruction that writes to that register: the read happens at the beginning of the cycle, and the write happens at the end of the cycle.

For example, SFPSETCC L0 L0 reads L0 while SFPLOAD L0 1 writes to L0 during the same cycle, as illustrated by the diagram below, with register liveness shown on the right.
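This read-before-write ordering can be demonstrated with a toy register model (illustrative only; the instruction representation and flag register are our simplification):

```python
# Toy model of one cycle: every instruction samples its source registers
# at the start of the cycle; all writes commit at the end of the cycle.
def run_cycle(regs, instructions):
    # instructions: list of (reads, compute) pairs, where compute maps
    # the sampled values to a dict of register writes.
    sampled = [{r: regs[r] for r in reads} for reads, _ in instructions]
    writes = {}
    for (_, compute), vals in zip(instructions, sampled):
        writes.update(compute(vals))
    regs.update(writes)          # all writes land after all reads

regs = {"L0": 7, "flags": True}
run_cycle(regs, [
    (["L0"], lambda v: {"flags": v["L0"] == 0}),   # sfpsetcc reads the old L0
    ([],     lambda v: {"L0": 42}),                # sfpload writes the new L0
])
# flags was computed from the old L0 value (7), even though L0 is now 42
```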

Note also that we require only one register, since the condition value only needs to live for one cycle.

3 cycles is theoretically optimal for this case: the operation requires loading from three different memory addresses, and the load sub-unit can issue at most one load per cycle.

Out-of-Place Output

If we are required to write the result to a distinct address, e.g. where(0, 1, 2, 3), we can no longer use SFPLOADMACRO to schedule the SFPSTORE, as it can only write to the address it loaded from.

Instead, we add a regular SFPSTORE instruction, achieving 4 cycles per input row:

Avoiding Stalls

A stall between instructions will disrupt the timing-sensitive sequence of disabling and enabling lanes, leading to incorrect results.

A stall after the first instruction causes sfpload L0 1 to execute after sfpsetcc rather than unconditionally, so the load of t is incorrectly predicated:

A stall after the second instruction causes sfpload L0 2 to execute unconditionally, as it now runs after sfpencc:

This can be avoided by first loading the instructions into a replay buffer, and using a single REPLAY instruction to issue the 3 (or 4) instructions in sequence without stalls. Alternatively, a MOP expansion could be used.

Note that it’s possible to specify DelayKind=WaitForElapsedInstructions, which decrements the delay counter only when a thread issues an instruction to the vector unit, rather than every cycle. However, a scheduled instruction executes on the cycle after its counter counts down from 1 to 0 (or, with Delay=0, simply on the next cycle regardless), so this doesn’t help avoid the issue.

Conclusion

Leveraging SFPLOADMACRO, we achieve a theoretically optimal 3-cycle throughput for in-place where and a 4-cycle throughput for out-of-place where, while ensuring stable execution by avoiding instruction stalls.

Acknowledgements

Thanks to Tenstorrent for sponsoring this work.