5th February, 2026
Tenstorrent’s AI accelerators have low-level APIs that typically operate on 32×32 tiles. However, the built-in hardware instruction for matrix transpose operates on 16×16 subtiles, and moreover, it only supports 19-bit values.
Below, we explore how to optimise in-place 32×32 matrix transpose,
where the matrix already resides in Dst;
first for 19-bit values, and then for 32-bit values.
Hardware support for transpose is exposed via TRNSPSRCB,
which transposes a 16×16 matrix of 19-bit values in SrcB.
Note: a 16×16 matrix of 19-bit values can be transposed by unpacker
0 when moving data from L1 to SrcA, which
makes swapping the two middle subtiles trivial. Here we’ll focus only on
the case where the matrix is already in Dst.
A straightforward approach for transposing a 32×32 matrix might look
something like the following. Note that TRNSPSRCB operates
on SrcB[16:32]. Here we use SrcB[0:16] as
temporary storage to swap the middle subtiles (1 and 2).
Subtile layout:
[ 0 | 1 ]
[ 2 | 3 ]
Subtile 0 → SrcB[16:32]; transpose; SrcB[16:32] → Subtile 0
Subtile 2 → SrcB[0:16]
Subtile 1 → SrcB[16:32]; transpose; SrcB[16:32] → Subtile 2
SrcB[0:16] → Subtile 1
Subtile 1 → SrcB[16:32]; transpose; SrcB[16:32] → Subtile 1
Subtile 3 → SrcB[16:32]; transpose; SrcB[16:32] → Subtile 3
This involves a total of 10 subtile movements between Dst
and SrcB.
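The step list above can be checked with a small software model. This is a minimal sketch, not hardware code: Dst is modelled as a dictionary of four 16×16 subtiles, SrcB as two 16-row halves, and the names `transpose16` and `naive_transpose` are illustrative assumptions rather than anything from Tenstorrent's SDK.

```python
def transpose16(sub):
    """Transpose a 16x16 subtile stored as a list of 16 rows."""
    return [list(col) for col in zip(*sub)]


def naive_transpose(dst):
    """Apply the straightforward schedule to dst (subtile index -> 16x16 rows).

    Returns the number of subtile movements between Dst and SrcB.
    """
    srcb = {"lo": None, "hi": None}  # stand-ins for SrcB[0:16] and SrcB[16:32]
    moves = 0

    def transpose_via_srcb(src, dest):
        nonlocal moves
        srcb["hi"] = dst[src]                 # Dst -> SrcB[16:32]
        srcb["hi"] = transpose16(srcb["hi"])  # models TRNSPSRCB
        dst[dest] = srcb["hi"]                # SrcB[16:32] -> Dst
        moves += 2

    transpose_via_srcb(0, 0)
    srcb["lo"] = dst[2]       # park subtile 2 in SrcB[0:16]
    moves += 1
    transpose_via_srcb(1, 2)  # transposed subtile 1 lands in slot 2
    dst[1] = srcb["lo"]       # untransposed subtile 2 goes into slot 1
    moves += 1
    transpose_via_srcb(1, 1)  # now transpose it where it sits
    transpose_via_srcb(3, 3)
    return moves


# Tag every element with (subtile, row, col) so the result is easy to check.
dst = {k: [[(k, r, c) for c in range(16)] for r in range(16)] for k in range(4)}
moves = naive_transpose(dst)
print(moves)  # 10
```

After the call, slot 1 holds the transpose of the original subtile 2 and slot 2 holds the transpose of the original subtile 1, which is exactly what a full 32×32 transpose requires.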
Copying subtile 2 to SrcB[0:16] just to swap it with subtile 1,
then moving it back to Dst and on to SrcB[16:32] again for the
transpose, seems suboptimal.
Instead, we can copy subtile 1 to SrcB,
perform the transpose, and then copy to SrcA
via MOVB2A,
storing it there temporarily until subtile 2 has been read, and finally
copying it to the subtile 2 position in Dst
via MOVA2D.
A nice bonus is that MOVA2D
supports moving 8 rows of 16 columns at a time, rather than 4 rows of 16
columns at a time like MOVB2D.
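To see the saving, the same kind of software model can be applied to the SrcA-based schedule. This is a sketch under the same assumptions as before (Dst as a dictionary of four subtiles; `optimised_transpose` is a hypothetical name), and it counts whole-subtile movements only, ignoring how many instructions each movement takes.

```python
def transpose16(sub):
    """Transpose a 16x16 subtile stored as a list of 16 rows."""
    return [list(col) for col in zip(*sub)]


def optimised_transpose(dst):
    """Transpose four subtiles, parking transposed subtile 1 in SrcA.

    Returns the number of subtile movements performed.
    """
    moves = 0

    def load_and_transpose(k):
        nonlocal moves
        moves += 1                  # Dst -> SrcB (MOVD2B)
        return transpose16(dst[k])  # models TRNSPSRCB

    srcb = load_and_transpose(0)
    dst[0] = srcb                   # SrcB -> Dst (MOVB2D)
    moves += 1

    srcb = load_and_transpose(1)
    srca = srcb                     # SrcB -> SrcA (MOVB2A)
    moves += 1

    srcb = load_and_transpose(2)
    dst[1] = srcb                   # transposed subtile 2 into slot 1
    moves += 1

    srcb = load_and_transpose(3)
    dst[3] = srcb
    moves += 1

    dst[2] = srca                   # SrcA -> Dst (MOVA2D) into slot 2
    moves += 1
    return moves


dst = {k: [[(k, r, c) for c in range(16)] for r in range(16)] for k in range(4)}
moves = optimised_transpose(dst)
print(moves)  # 9
```

The model reports 9 movements, one fewer than the straightforward schedule, and the final MOVA2D step is the one that benefits from moving 8 rows at a time.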
The full code looks like this:
; subtile 0
movd2b 0, 16, addr_mod_1, 1, 0
movd2b 0, 20, addr_mod_1, 1, 4
movd2b 0, 24, addr_mod_1, 1, 8
movd2b 0, 28, addr_mod_1, 1, 12
trnspsrcb
movb2d 0, 16, addr_mod_1, 1, 0
movb2d 0, 20, addr_mod_1, 1, 4
movb2d 0, 24, addr_mod_1, 1, 8
movb2d 0, 28, addr_mod_0, 1, 12 ; dst += 16
; subtile 1
movd2b 0, 16, addr_mod_1, 1, 0
movd2b 0, 20, addr_mod_1, 1, 4
movd2b 0, 24, addr_mod_1, 1, 8
movd2b 0, 28, addr_mod_1, 1, 12
trnspsrcb
movb2a 0, addr_mod_1, 1, 16
movb2a 4, addr_mod_1, 1, 20
movb2a 8, addr_mod_1, 1, 24
movb2a 12, addr_mod_0, 1, 28 ; dst += 16
; subtile 2
movd2b 0, 16, addr_mod_1, 1, 0
movd2b 0, 20, addr_mod_1, 1, 4
movd2b 0, 24, addr_mod_1, 1, 8
movd2b 0, 28, addr_mod_2, 1, 12 ; dst -= 16
trnspsrcb
movb2d 0, 16, addr_mod_1, 1, 0
movb2d 0, 20, addr_mod_1, 1, 4
movb2d 0, 24, addr_mod_1, 1, 8
movb2d 0, 28, addr_mod_3, 1, 12 ; dst += 32
; subtile 3
movd2b 0, 16, addr_mod_1, 1, 0
movd2b 0, 20, addr_mod_1, 1, 4
movd2b 0, 24, addr_mod_1, 1, 8
movd2b 0, 28, addr_mod_1, 1, 12
trnspsrcb
movb2d 0, 16, addr_mod_1, 1, 0
movb2d 0, 20, addr_mod_1, 1, 4
movb2d 0, 24, addr_mod_1, 1, 8
movb2d 0, 28, addr_mod_0, 1, 12 ; dst += 16
; subtile 1
mova2d 0, 0, addr_mod_2, 1,-32 ; dst -= 16
mova2d 0, 8, addr_mod_2, 1, -8 ; dst -= 16
The REPLAY
buffer can store up to 32 instructions, though by convention,
Tenstorrent’s software reserves the first 16 slots for SFPU
operations, and the last 16 slots for FPU
operations.
Writing the sequence out in full and giving each distinct instruction a letter, there are 17 unique instructions (A to Q):
E movd2b 0, 16, addr_mod_1, 1, 0
F movd2b 0, 20, addr_mod_1, 1, 4
G movd2b 0, 24, addr_mod_1, 1, 8
H movd2b 0, 28, addr_mod_1, 1, 12
I trnspsrcb
J movb2d 0, 16, addr_mod_1, 1, 0
K movb2d 0, 20, addr_mod_1, 1, 4
L movb2d 0, 24, addr_mod_1, 1, 8
M movb2d 0, 28, addr_mod_0, 1, 12 ; dst += 16
E movd2b 0, 16, addr_mod_1, 1, 0
F movd2b 0, 20, addr_mod_1, 1, 4
G movd2b 0, 24, addr_mod_1, 1, 8
H movd2b 0, 28, addr_mod_1, 1, 12
I trnspsrcb
A movb2a 0, addr_mod_1, 1, 16
B movb2a 4, addr_mod_1, 1, 20
C movb2a 8, addr_mod_1, 1, 24
D movb2a 12, addr_mod_0, 1, 28 ; dst += 16
E movd2b 0, 16, addr_mod_1, 1, 0
F movd2b 0, 20, addr_mod_1, 1, 4
G movd2b 0, 24, addr_mod_1, 1, 8
P movd2b 0, 28, addr_mod_2, 1, 12 ; dst -= 16
I trnspsrcb
J movb2d 0, 16, addr_mod_1, 1, 0
K movb2d 0, 20, addr_mod_1, 1, 4
L movb2d 0, 24, addr_mod_1, 1, 8
Q movb2d 0, 28, addr_mod_3, 1, 12 ; dst += 32
E movd2b 0, 16, addr_mod_1, 1, 0
F movd2b 0, 20, addr_mod_1, 1, 4
G movd2b 0, 24, addr_mod_1, 1, 8
H movd2b 0, 28, addr_mod_1, 1, 12
I trnspsrcb
J movb2d 0, 16, addr_mod_1, 1, 0
K movb2d 0, 20, addr_mod_1, 1, 4
L movb2d 0, 24, addr_mod_1, 1, 8
M movb2d 0, 28, addr_mod_0, 1, 12 ; dst += 16
N mova2d 0, 0, addr_mod_2, 1,-32 ; dst -= 16
O mova2d 0, 8, addr_mod_2, 1, -8 ; dst -= 16
All of the credit goes to @corsix for observing that these can be dispatched elegantly using a replay buffer of 15 instructions and the following seven stages:
Replay of EFGHIJKLM
Replay of EFGHI
Replay of ABCDEFG
P (single instruction, doesn't need to use replay)
Replay of IJKL
Q (single instruction, doesn't need to use replay)
Replay of EFGHIJKLMNO
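The seven stages can be checked mechanically against the lettered listing. In this sketch the replay buffer is modelled as the string A to O, and `replay(start, length)` is a hypothetical helper that stands for replaying a contiguous run of recorded slots; the real REPLAY encoding and slot offsets are not modelled.

```python
REPLAY_BUFFER = "ABCDEFGHIJKLMNO"  # 15 recorded instructions, one letter each


def replay(start, length):
    """Return the contiguous run of buffer entries a replay would issue."""
    return REPLAY_BUFFER[start:start + length]


stages = [
    replay(4, 9),   # EFGHIJKLM
    replay(4, 5),   # EFGHI
    replay(0, 7),   # ABCDEFG
    "P",            # issued directly
    replay(8, 4),   # IJKL
    "Q",            # issued directly
    replay(4, 11),  # EFGHIJKLMNO
]
expanded = "".join(stages)

# The full lettered listing, grouped by subtile.
full_sequence = (
    "EFGHIJKLM"  # subtile 0
    "EFGHIABCD"  # subtile 1 (MOVB2A into SrcA)
    "EFGPIJKLQ"  # subtile 2
    "EFGHIJKLM"  # subtile 3
    "NO"         # MOVA2D from SrcA
)

print(expanded == full_sequence)  # True
```

Every replayed stage is a contiguous slice of the buffer, which is what makes the arrangement work with only 15 recorded slots, while P and Q each appear once and are cheaper to issue directly.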
What’s special about the number seven? Of course, it means we can use the MOP expander with template 0, which lets us invoke seven arbitrary instructions!
Using MOP expansion here may seem excessive, since we’re already
using REPLAY, but as it replaces the seven instructions
with just one instruction, I do find it quite elegant.
For 32-bit values, since SrcA
and SrcB only support 19-bit values, we transpose the low 16 bits and
the high 16 bits separately using UseDst32bLo, then recombine
them in place.
The basic idea is that given a 16×16 matrix of 32-bit values in Dst,
we can transpose it as follows:
Then:
Thanks to @corsix for pointing out that
MOVB2D with UseDst32bLo=true can be used to
write to the low 16 bits while preserving the high 16 bits.
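The arithmetic behind the split can be illustrated in plain Python. This sketch only shows that transposing the two 16-bit halves independently and recombining them gives the same result as transposing the 32-bit values directly; it does not model UseDst32bLo or any register behaviour, and `transpose16_32bit` is a hypothetical name.

```python
def transpose16(sub):
    """Transpose a 16x16 subtile stored as a list of 16 rows."""
    return [list(col) for col in zip(*sub)]


def transpose16_32bit(sub):
    """Transpose 32-bit values by transposing each 16-bit half separately."""
    lo = [[v & 0xFFFF for v in row] for row in sub]
    hi = [[(v >> 16) & 0xFFFF for v in row] for row in sub]
    lo_t = transpose16(lo)  # each half fits comfortably within 19 bits
    hi_t = transpose16(hi)
    # Recombine: the high half goes back into bits 16..31, the low half into 0..15.
    return [[(h << 16) | l for h, l in zip(hi_row, lo_row)]
            for hi_row, lo_row in zip(hi_t, lo_t)]


# Arbitrary 32-bit test pattern that exercises both halves.
sub = [[(r * 0x01000193 + c * 0x9E3779B1) & 0xFFFFFFFF for c in range(16)]
       for r in range(16)]
result = transpose16_32bit(sub)
print(result == transpose16(sub))  # True
```

Because a transpose only moves elements without combining them, the halves never interact, so the two passes can be performed in either order.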
If we took the same approach as for the 19-bit 32×32 transpose, this
would involve a total of 9 * 2 = 18 subtile movements (9 for each 16-bit half).
Unfortunately, we wouldn’t be able to stick to the convention of only using 16 slots of the replay buffer with this approach.
Instead, we can swap the two middle subtiles using the vector
unit, which loads and stores 32 32-bit values at a time. Leveraging
SFPLOADMACRO,
the middle subtiles can be swapped using only 16 cycles (the latency is
18 cycles, but the final 2 cycles can be overlapped). Each subtile can
then be transposed individually, involving a total of
8 * 2 = 16 subtile movements using MOVD2B/MOVB2D.