Forum

Forum

How to transpose a 8x8 int64 matrix with AVX512

Asked

today

Modified today
Viewed

38 times

Consider 8 AVX512 registers containing rows of a matrix, so that each 64-bit lane is a cell of an 8x8 matrix. How to transpose such a matrix in C/C++?

What I have tried so far: 8 _mm512_i32scatter_epi64() intrinsics. It's extremely slow and uses the cache heavily so that other CPU cores can't get enough L3 bandwidth.

@ harold do you _mm512_permutex2var_epi64() intrinsic something? Serge Rogatch 2 hours ago

A bunch of those yes. This should be possible with 3 rounds of "quadrant swapping", ie exchanging the 4x4 off-diagonal quadrants of the 8x8 matrix, then exchanging 2x2 quadrants within each 4x4 block and then exchanging the off-diagonal entries of each 2x2 block

That would be 24 shuffles in total I think, which isn't great.. maybe there's a better approach? harold 2 hours ago

1 Answer

If the values are originally in RAM, try _mm512_i32gather_epi64 instead.Simultaneous reading from cache could lead to less stalling than writing to it.

If the destination is memory, dummy writing zeroes to a cache-line-aligned destination area could ensure that the instructions don't stall waiting on cache. With the scatter instructions the CPU doesn't know that whole destination area will be overwritten, so it has to wait for loading of the previous values.

If the values are already in registers and the goal is to keep them in registers, _mm512_permutex2var_pd seems like the most useful instruction. It lets to pick individual elements by index from two sources. But because values are distributed across so many registers, shuffling them into place will take a lot of operations.

I'll mark the matrix rows A-H and columns 0-7. Initially each register will contain one row, which I'll mark e.g. A0-7. Goal is to have each register contain one column, marked e.g. A-H0.

A0 .. A7
..    ..
H0 .. H7
                                                

The best sequence I found is still 24 shuffles:

  1. Do 8 shuffles on pairs of registers in blocks of 4:

    • Combine A0-7 and B0-7 to A-B0-3
    • Combine A0-7 and B0-7 to A-B4-7
    • ...
    • Combine G0-7 and H0-7 to G-H4-7
  2. Do 8 shuffles in blocks of 2:

    • Combine A-B0-3 and C-D0-3 to A-D0-1
    • Combine A-B0-3 and C-D0-3 to A-D2-3
    • ...
    • Combine E-F4-7 and G-H4-7 to E-H6-7
  3. Do 8 shuffles in blocks of 1:

    • Combine A-D0-1 and E-H0-1 to A-H0
    • Combine A-D0-1 and E-H0-1 to A-H1
    • ...
    • Combine A-D6-7 and E-H6-7 to A-H7

I think this is the minimum possible number of shuffles when operating only in registers. Each AVX512 instruction takes at most two registers as input, and this sequence takes full advantage of the inputs and outputs to gather data from the 8 source registers.

Your Answer