Forum
- Helpex
- Forum
How to transpose a 8x8 int64 matrix with AVX512
Consider 8 AVX512 registers containing rows of a matrix, so that each 64-bit lane is a cell of an 8x8 matrix. How to transpose such a matrix in C/C++?
What I have tried so far: 8 _mm512_i32scatter_epi64()
intrinsics. It's
extremely slow and uses the cache heavily so that
other CPU cores can't get enough L3 bandwidth.
@ harold do you _mm512_permutex2var_epi64()
intrinsic something?
Serge Rogatch
2 hours ago
A bunch of those yes. This should be possible with 3 rounds of "quadrant swapping", ie exchanging the 4x4 off-diagonal quadrants of the 8x8 matrix, then exchanging 2x2 quadrants within each 4x4 block and then exchanging the off-diagonal entries of each 2x2 block
That would be 24 shuffles in total I think, which isn't great.. maybe there's a better approach? harold 2 hours ago
1 Answer
If the values are originally in RAM, try
_mm512_i32gather_epi64
instead.Simultaneous reading from cache could lead to
less stalling than writing to it.
If the destination is memory, dummy writing zeroes to a cache-line-aligned destination area could ensure that the instructions don't stall waiting on cache. With the scatter instructions the CPU doesn't know that whole destination area will be overwritten, so it has to wait for loading of the previous values.
If the values are already in registers and the goal is to keep them in
registers, _mm512_permutex2var_pd
seems like the
most useful instruction. It lets to pick individual elements by index
from two sources. But because values are
distributed across so many registers, shuffling them into place will
take a lot of operations.
I'll mark the matrix rows A-H and columns 0-7
. Initially each register will
contain one row, which I'll mark e.g. A0-7
.
Goal is to have each register contain one column, marked e.g. A-H0.
A0 .. A7 .. .. H0 .. H7
The best sequence I found is still 24 shuffles:
-
Do 8 shuffles on pairs of registers in blocks of 4:
- Combine
A0-7
andB0-7
toA-B0-3
- Combine
A0-7
andB0-7
toA-B4-7
- ...
- Combine
G0-7
andH0-7
toG-H4-7
- Combine
-
Do 8 shuffles in blocks of 2:
- Combine
A-B0-3
andC-D0-3
toA-D0-1
- Combine
A-B0-3
andC-D0-3
toA-D2-3
- ...
- Combine
E-F4-7
andG-H4-7
toE-H6-7
- Combine
-
Do 8 shuffles in blocks of 1:
- Combine
A-D0-1
andE-H0-1
toA-H0
- Combine
A-D0-1
andE-H0-1
toA-H1
- ...
- Combine
A-D6-7
andE-H6-7
toA-H7
- Combine
I think this is the minimum possible number of shuffles when operating only in registers. Each AVX512 instruction takes at most two registers as input, and this sequence takes full advantage of the inputs and outputs to gather data from the 8 source registers.
Your Answer