In-Place 32×32 Matrix Transpose on Tenstorrent

5th February, 2026

Introduction

Tenstorrent’s AI accelerators have low-level APIs that typically operate on 32×32 tiles. However, the built-in hardware instruction for matrix transpose operates on 16×16 subtiles, and moreover, it only supports 19-bit values.

Below, we explore how to optimise an in-place 32×32 matrix transpose where the matrix already resides in Dst: first for 19-bit values, and then for 32-bit values.

19-bit In-Place Transpose: Minimising Data Movement

Hardware support for transpose is exposed via TRNSPSRCB, which transposes a 16×16 matrix of 19-bit values in SrcB.

Note: a 16×16 matrix of 19-bit values can be transposed by unpacker 0 when moving data from L1 to SrcA, which makes swapping the two middle subtiles trivial. Here we’ll focus only on the case where the matrix is already in Dst.

A straightforward approach to transposing a 32×32 matrix is outlined below. Note that TRNSPSRCB operates on SrcB[16:32], so here we use SrcB[0:16] as temporary storage to swap the middle subtiles (1 and 2).

Subtile layout:

[ 0 | 1 ]
[ 2 | 3 ]

Subtile 0 → SrcB[16:32]; transpose; SrcB[16:32] → Subtile 0
Subtile 2 → SrcB[0:16]
Subtile 1 → SrcB[16:32]; transpose; SrcB[16:32] → Subtile 2
SrcB[0:16] → Subtile 1
Subtile 1 → SrcB[16:32]; transpose; SrcB[16:32] → Subtile 1
Subtile 3 → SrcB[16:32]; transpose; SrcB[16:32] → Subtile 3

This involves a total of 10 subtile movements between Dst and SrcB.
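
To make the count concrete, here is a minimal Python model of the schedule above. It is only a sketch: subtiles are plain 16×16 lists, the names naive_transpose, split and join are made up for illustration, and the model says nothing about real instruction timing.

```python
def transpose16(m):
    """Transpose a 16x16 subtile (stand-in for TRNSPSRCB)."""
    return [list(col) for col in zip(*m)]

def split(tile):
    """Cut a 32x32 tile into subtiles 0..3 in the [0 1 / 2 3] layout."""
    return [[row[c0:c0 + 16] for row in tile[r0:r0 + 16]]
            for r0 in (0, 16) for c0 in (0, 16)]

def join(s):
    """Inverse of split()."""
    return ([s[0][r] + s[1][r] for r in range(16)]
            + [s[2][r] + s[3][r] for r in range(16)])

def naive_transpose(subtiles):
    """Run the straightforward schedule; return (subtiles, movements)."""
    dst = [[row[:] for row in s] for s in subtiles]
    srcb = {"lo": None, "hi": None}  # "lo" = SrcB[0:16], "hi" = SrcB[16:32]
    moves = 0

    def d2b(i, half):  # Dst subtile i -> SrcB half
        nonlocal moves
        srcb[half] = [row[:] for row in dst[i]]
        moves += 1

    def b2d(half, i):  # SrcB half -> Dst subtile i
        nonlocal moves
        dst[i] = [row[:] for row in srcb[half]]
        moves += 1

    def trn():  # transpose SrcB[16:32] in place
        srcb["hi"] = transpose16(srcb["hi"])

    d2b(0, "hi"); trn(); b2d("hi", 0)
    d2b(2, "lo")
    d2b(1, "hi"); trn(); b2d("hi", 2)
    b2d("lo", 1)
    d2b(1, "hi"); trn(); b2d("hi", 1)
    d2b(3, "hi"); trn(); b2d("hi", 3)
    return dst, moves
```

Running it on a tile whose entries are all distinct confirms both that the result is the transpose and that the schedule costs 10 movements.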

This is wasteful: the original subtile 2 is copied to SrcB[0:16], back to Dst in the subtile 1 position, and then to SrcB[16:32], so it moves three times before it is even transposed.

Instead, we can copy subtile 1 to SrcB, transpose it, and then park the result in SrcA via MOVB2A. Once subtile 2 has been read out of Dst, the parked copy can be written to the subtile 2 position in Dst via MOVA2D.

A nice bonus is that MOVA2D supports moving 8 rows of 16 columns at a time, rather than 4 rows of 16 columns at a time like MOVB2D, so that final copy takes two instructions instead of four.

The full code looks like this:

; subtile 0
movd2b 0, 16, addr_mod_1, 1,  0
movd2b 0, 20, addr_mod_1, 1,  4
movd2b 0, 24, addr_mod_1, 1,  8
movd2b 0, 28, addr_mod_1, 1, 12
trnspsrcb
movb2d 0, 16, addr_mod_1, 1,  0
movb2d 0, 20, addr_mod_1, 1,  4
movb2d 0, 24, addr_mod_1, 1,  8
movb2d 0, 28, addr_mod_0, 1, 12 ; dst += 16

; subtile 1
movd2b 0, 16, addr_mod_1, 1,  0
movd2b 0, 20, addr_mod_1, 1,  4
movd2b 0, 24, addr_mod_1, 1,  8
movd2b 0, 28, addr_mod_1, 1, 12
trnspsrcb
movb2a     0, addr_mod_1, 1, 16
movb2a     4, addr_mod_1, 1, 20
movb2a     8, addr_mod_1, 1, 24
movb2a    12, addr_mod_0, 1, 28 ; dst += 16

; subtile 2
movd2b 0, 16, addr_mod_1, 1,  0
movd2b 0, 20, addr_mod_1, 1,  4
movd2b 0, 24, addr_mod_1, 1,  8
movd2b 0, 28, addr_mod_2, 1, 12 ; dst -= 16
trnspsrcb
movb2d 0, 16, addr_mod_1, 1,  0
movb2d 0, 20, addr_mod_1, 1,  4
movb2d 0, 24, addr_mod_1, 1,  8
movb2d 0, 28, addr_mod_3, 1, 12 ; dst += 32

; subtile 3
movd2b 0, 16, addr_mod_1, 1,  0
movd2b 0, 20, addr_mod_1, 1,  4
movd2b 0, 24, addr_mod_1, 1,  8
movd2b 0, 28, addr_mod_1, 1, 12
trnspsrcb
movb2d 0, 16, addr_mod_1, 1,  0
movb2d 0, 20, addr_mod_1, 1,  4
movb2d 0, 24, addr_mod_1, 1,  8
movb2d 0, 28, addr_mod_0, 1, 12 ; dst += 16

; subtile 1 (transposed, held in SrcA) → subtile 2 position
mova2d 0,  0, addr_mod_2, 1,-32 ; dst -= 16
mova2d 0,  8, addr_mod_2, 1, -8 ; dst -= 16
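
The pointer arithmetic in this listing is easy to get wrong, so here is a small Python model that replays it. Treat it as a sketch under assumptions rather than a description of the hardware: Dst is modelled as 64 rows of 16 columns with subtile n at rows 16n to 16n+15, the address-mod increments are taken from the comments in the listing, each increment is assumed to apply after its instruction executes, and the tuple encoding and the names four, PROGRAM and run are invented for illustration.

```python
# Dst pointer increments, taken from the comments in the listing.
ADDR_MOD = {0: +16, 1: 0, 2: -16, 3: +32}

def four(op, last_mod):
    """Four 4-row moves covering one subtile; only the last bumps the pointer."""
    return [(op, r, 1 if r < 12 else last_mod, r) for r in (0, 4, 8, 12)]

T = [("trnspsrcb", 0, 1, 0)]
# Each entry is (op, register row, addr_mod, Dst offset).
PROGRAM = (
    four("movd2b", 1) + T + four("movb2d", 0)      # subtile 0
    + four("movd2b", 1) + T + four("movb2a", 0)    # subtile 1, parked in SrcA
    + four("movd2b", 2) + T + four("movb2d", 3)    # subtile 2 -> subtile 1 slot
    + four("movd2b", 1) + T + four("movb2d", 0)    # subtile 3
    + [("mova2d", 0, 2, -32), ("mova2d", 8, 2, -8)]  # SrcA -> subtile 2 slot
)

def run(tile):
    """Apply PROGRAM to a 32x32 tile and return the resulting 32x32 tile."""
    dst = [tile[16 * (f // 2) + r][16 * (f % 2):16 * (f % 2) + 16]
           for f in range(4) for r in range(16)]
    srca = [None] * 16
    srcb = [None] * 32
    ptr = 0
    for op, row, mod, off in PROGRAM:
        if op == "movd2b":
            srcb[16 + row:20 + row] = [r[:] for r in dst[ptr + off:ptr + off + 4]]
        elif op == "movb2d":
            dst[ptr + off:ptr + off + 4] = [r[:] for r in srcb[16 + row:20 + row]]
        elif op == "movb2a":
            srca[row:row + 4] = [r[:] for r in srcb[16 + row:20 + row]]
        elif op == "mova2d":  # 8 rows per instruction
            dst[ptr + off:ptr + off + 8] = [r[:] for r in srca[row:row + 8]]
        elif op == "trnspsrcb":
            srcb[16:32] = [list(c) for c in zip(*srcb[16:32])]
        ptr += ADDR_MOD[mod]  # assumed to apply after the instruction
    return [dst[32 * (r // 16) + r % 16] + dst[32 * (r // 16) + 16 + r % 16]
            for r in range(32)]
```

Under these assumptions, run() returns the transpose of its input, which is a quick way to sanity-check the offsets -32 and -8 on the two final MOVA2D instructions.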

MOP and Replay Expansion

The REPLAY buffer can store up to 32 instructions, though by convention, Tenstorrent’s software reserves the first 16 slots for SFPU operations, and the last 16 slots for FPU operations.

Writing the instructions out in full, there are 38 instructions, of which 17 are unique (labelled A to Q below).

E movd2b 0, 16, addr_mod_1, 1,  0
F movd2b 0, 20, addr_mod_1, 1,  4
G movd2b 0, 24, addr_mod_1, 1,  8
H movd2b 0, 28, addr_mod_1, 1, 12
I trnspsrcb
J movb2d 0, 16, addr_mod_1, 1,  0
K movb2d 0, 20, addr_mod_1, 1,  4
L movb2d 0, 24, addr_mod_1, 1,  8
M movb2d 0, 28, addr_mod_0, 1, 12 ; dst += 16

E movd2b 0, 16, addr_mod_1, 1,  0
F movd2b 0, 20, addr_mod_1, 1,  4
G movd2b 0, 24, addr_mod_1, 1,  8
H movd2b 0, 28, addr_mod_1, 1, 12
I trnspsrcb
A movb2a     0, addr_mod_1, 1, 16
B movb2a     4, addr_mod_1, 1, 20
C movb2a     8, addr_mod_1, 1, 24
D movb2a    12, addr_mod_0, 1, 28 ; dst += 16

E movd2b 0, 16, addr_mod_1, 1,  0
F movd2b 0, 20, addr_mod_1, 1,  4
G movd2b 0, 24, addr_mod_1, 1,  8
P movd2b 0, 28, addr_mod_2, 1, 12 ; dst -= 16
I trnspsrcb
J movb2d 0, 16, addr_mod_1, 1,  0
K movb2d 0, 20, addr_mod_1, 1,  4
L movb2d 0, 24, addr_mod_1, 1,  8
Q movb2d 0, 28, addr_mod_3, 1, 12 ; dst += 32

E movd2b 0, 16, addr_mod_1, 1,  0
F movd2b 0, 20, addr_mod_1, 1,  4
G movd2b 0, 24, addr_mod_1, 1,  8
H movd2b 0, 28, addr_mod_1, 1, 12
I trnspsrcb
J movb2d 0, 16, addr_mod_1, 1,  0
K movb2d 0, 20, addr_mod_1, 1,  4
L movb2d 0, 24, addr_mod_1, 1,  8
M movb2d 0, 28, addr_mod_0, 1, 12 ; dst += 16

N mova2d 0,  0, addr_mod_2, 1,-32 ; dst -= 16
O mova2d 0,  8, addr_mod_2, 1, -8 ; dst -= 16

All of the credit goes to @corsix for observing that these can be dispatched elegantly using a replay buffer of 15 instructions (A to O, in that order) and the following seven stages.

Replay of EFGHIJKLM
Replay of EFGHI
Replay of ABCDEFG
P (single instruction, doesn't need to use replay)
Replay of IJKL
Q (single instruction, doesn't need to use replay)
Replay of EFGHIJKLMNO
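
As a quick check that these stages reproduce the full listing, the following Python sketch expands them against a buffer holding A to O in order. The letters are the labels used above; the function name expand and the ("replay", start, length) tuples are illustrative only and are not the encoding of the real REPLAY instruction.

```python
REPLAY = "ABCDEFGHIJKLMNO"  # assumed buffer contents, in slot order

STAGES = ["EFGHIJKLM", "EFGHI", "ABCDEFG", "P", "IJKL", "Q", "EFGHIJKLMNO"]

# The full instruction stream, read off the labelled listing.
FULL = "EFGHIJKLM" "EFGHIABCD" "EFGPIJKLQ" "EFGHIJKLM" "NO"

def expand(stages, replay=REPLAY):
    """Return (commands, stream): what is dispatched, and what it expands to."""
    commands, stream = [], []
    for stage in stages:
        if len(stage) == 1:
            # A single instruction is issued directly, outside the buffer.
            commands.append(("issue", stage))
            stream.append(stage)
            continue
        start = replay.find(stage)
        if start < 0:
            raise ValueError(f"{stage} is not a contiguous run of the buffer")
        commands.append(("replay", start, len(stage)))
        stream.append(replay[start:start + len(stage)])
    return commands, "".join(stream)
```

Every multi-instruction stage is a contiguous run of the buffer, which is what allows a single replay to cover it.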

What’s special about the number seven? Of course, it means we can use the MOP expander with template 0, which lets us invoke seven arbitrary instructions!

Using MOP expansion here may seem excessive, since we’re already using REPLAY, but as it replaces the seven instructions with just one instruction, I do find it quite elegant.

32-bit In-Place Transpose

Since SrcA and SrcB only support 19-bit values, we transpose the low 16 bits and the high 16 bits separately using UseDst32bLo, recombining them in place.

The basic idea is that given a 16×16 matrix of 32-bit values in Dst, we can transpose it as follows, starting with the high 16 bits:

  1. MOVD2B with UseDst32bLo=false
  2. TRNSPSRCB
  3. MOVB2D with UseDst32bLo=false

Then, for the low 16 bits:

  1. MOVD2B with UseDst32bLo=true
  2. TRNSPSRCB
  3. MOVB2D with UseDst32bLo=true

Thanks to @corsix for pointing out that MOVB2D with UseDst32bLo=true can be used to write to the low 16 bits while preserving the high 16 bits.
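
The two passes can be modelled in a few lines of Python. This is a sketch, not the hardware behaviour: values are plain 32-bit integers, the function names are made up, and it assumes that each pass touches only its own 16-bit half (the text above states this for the low pass; the model assumes the same for the high pass).

```python
MASK16 = 0xFFFF

def transpose_half(dst, use_dst32b_lo):
    """Transpose one 16-bit half of a 16x16 matrix of 32-bit values, in place."""
    shift = 0 if use_dst32b_lo else 16
    # MOVD2B: copy the selected half of every value into SrcB.
    srcb = [[(v >> shift) & MASK16 for v in row] for row in dst]
    # TRNSPSRCB: transpose the 16x16 matrix held in SrcB.
    srcb = [list(col) for col in zip(*srcb)]
    # MOVB2D: write the half back, keeping the other half of each value.
    keep = 0xFFFFFFFF ^ (MASK16 << shift)
    for r, row in enumerate(srcb):
        for c, half in enumerate(row):
            dst[r][c] = (dst[r][c] & keep) | (half << shift)

def transpose_32bit(dst):
    """High half first, then low half, as in the two step lists above."""
    transpose_half(dst, use_dst32b_lo=False)
    transpose_half(dst, use_dst32b_lo=True)
```

After only the first pass each element is a mix of the transposed high half and the original low half; the matrix is consistent again once the second pass completes.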

If we took the same approach as for the 19-bit 32×32 transpose, this would involve a total of 9 * 2 = 18 subtile movements: the 9 movements of the 19-bit version (counting the MOVB2A), once for each 16-bit half.

Unfortunately, we wouldn’t be able to stick to the convention of only using 16 slots of the replay buffer with this approach.

Instead, we can swap the two middle subtiles using the vector unit, which loads and stores 32 32-bit values at a time. Leveraging SFPLOADMACRO, the middle subtiles can be swapped using only 16 cycles (the latency is 18 cycles, but the final 2 cycles can be overlapped). Each subtile can then be transposed individually, involving a total of 8 * 2 = 16 subtile movements.
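
The arithmetic behind the swap can be sketched in Python. This is a simplified model, not SFPU code: Dst is a flat list with the four subtiles stored back to back, each vector access is assumed to cover 32 consecutive values (the real lane layout may well differ), and the per-subtile transpose is done directly rather than in two 16-bit passes. All names are illustrative.

```python
LANES = 32  # values per vector load/store, per the text

def swap_middle(dst):
    """Swap subtiles 1 and 2 in place; return the number of vector loads."""
    loads = 0
    for off in range(0, 256, LANES):
        a = dst[256 + off:256 + off + LANES]
        b = dst[512 + off:512 + off + LANES]
        loads += 2
        dst[256 + off:256 + off + LANES] = b
        dst[512 + off:512 + off + LANES] = a
    return loads

def transpose_tile(dst):
    """Swap the middle subtiles, then transpose each subtile on its own."""
    loads = swap_middle(dst)
    for f in range(4):
        base = 256 * f
        sub = [dst[base + 16 * r:base + 16 * r + 16] for r in range(16)]
        dst[base:base + 256] = [v for col in zip(*sub) for v in col]
    return loads

def to_dst(tile):
    """Flatten a 32x32 tile into subtile order 0, 1, 2, 3."""
    return [tile[16 * (f // 2) + r][16 * (f % 2) + c]
            for f in range(4) for r in range(16) for c in range(16)]

def from_dst(dst):
    """Inverse of to_dst()."""
    return [[dst[256 * (2 * (r // 16) + c // 16) + 16 * (r % 16) + c % 16]
             for c in range(32)] for r in range(32)]
```

Two subtiles of 256 values each, at 32 values per load, give the 16 loads behind the 16-cycle figure, and swapping first leaves four independent subtile transposes of two movements per 16-bit half each.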

Acknowledgements