The Motivation

I recently decided I wanted to spend some time understanding intrinsics and SIMD at a deeper level.

In developing code for my project ffs, I wanted to make sure that the code was running with at least AVX instructions (because that's my target architecture, and honestly very few computers don't have AVX these days..).

This led me down a path of discovery; first I discovered the amazing-ness that is godbolt.org, then I joined their discord, where I then asked for some help with understanding basic .asm compiler output.

Ultimately though, I needed to know how to write SIMD-tuned functions for complex numbers i.e. std::complex<T>. These were the types that were going into ffs.

Old but Gold Stackoverflows

Some google searching led me to one of the best references for complex-number SIMD multiplication in an answer by Peter Cordes on StackOverflow.

Although it's almost 8 years old at this point, his recommendation is still valid: use -ffast-math and label arrays with __restrict__, and GCC will do a pretty damned good job vectorising with -mavx on.

So with that I placed essentially the same code and compiler arguments - -O3 -ffast-math -mavx - into godbolt again and re-tried it with newer compilers.

I tried first with a simple loop over 4 std::complex<float> values, enough to fill a 256-bit AVX YMM register:

for (int i = 0; i < 4; ++i)
{
    z[i] = x[i] * y[i];
}

And indeed, x86-64 GCC 13.2 outputs the following:

vmovups ymm2, YMMWORD PTR [rdi]
vmovups ymm0, YMMWORD PTR [rsi]
vpermilps       ymm1, ymm2, 160
vpermilps       ymm2, ymm2, 245
vmulps  ymm1, ymm1, ymm0
vpermilps       ymm0, ymm0, 177
vmulps  ymm0, ymm0, ymm2
vaddsubps       ymm1, ymm1, ymm0
vmovups YMMWORD PTR [rdx], ymm1
vzeroupper
ret

MSVC.. doesn't auto-vectorise

Using MSVC, the equivalent compiler options would be to use /O2 /fp:fast /arch:AVX. You also have to change __restrict__ to __restrict. This, however, emits the following:

    sub     rsp, 24
    vmovss  xmm5, DWORD PTR [rcx]
    vmulss  xmm1, xmm5, DWORD PTR [rdx]
    vmulss  xmm2, xmm5, DWORD PTR [rdx+4]
    vmovss  xmm5, DWORD PTR [rcx+8]
    vmovaps XMMWORD PTR [rsp], xmm6
    vmovss  xmm6, DWORD PTR [rcx+4]
    vmulss  xmm0, xmm6, DWORD PTR [rdx+4]
    vsubss  xmm3, xmm1, xmm0
    vmulss  xmm1, xmm6, DWORD PTR [rdx]
    vmovss  xmm6, DWORD PTR [rcx+12]
    vaddss  xmm0, xmm2, xmm1
    vmulss  xmm1, xmm5, DWORD PTR [rdx+8]
    vmulss  xmm2, xmm5, DWORD PTR [rdx+12]
    vmovss  xmm5, DWORD PTR [rcx+16]
    vmovss  DWORD PTR [r8+4], xmm0
    vmulss  xmm0, xmm6, DWORD PTR [rdx+12]
    vmovss  DWORD PTR [r8], xmm3
    vsubss  xmm3, xmm1, xmm0
    vmulss  xmm1, xmm6, DWORD PTR [rdx+8]
    vmovss  xmm6, DWORD PTR [rcx+20]
    vaddss  xmm0, xmm2, xmm1
    vmulss  xmm1, xmm5, DWORD PTR [rdx+16]
    vmulss  xmm2, xmm5, DWORD PTR [rdx+20]
    vmovss  xmm5, DWORD PTR [rcx+24]
    vmovss  DWORD PTR [r8+12], xmm0
    vmulss  xmm0, xmm6, DWORD PTR [rdx+20]
    vmovss  DWORD PTR [r8+8], xmm3
    vsubss  xmm3, xmm1, xmm0
    vmulss  xmm1, xmm6, DWORD PTR [rdx+16]
    vmovss  xmm6, DWORD PTR [rcx+28]
    vaddss  xmm0, xmm2, xmm1
    vmulss  xmm1, xmm5, DWORD PTR [rdx+24]
    vmulss  xmm2, xmm5, DWORD PTR [rdx+28]
    vmovss  DWORD PTR [r8+20], xmm0
    vmulss  xmm0, xmm6, DWORD PTR [rdx+28]
    vmovss  DWORD PTR [r8+16], xmm3
    vsubss  xmm3, xmm1, xmm0
    vmulss  xmm1, xmm6, DWORD PTR [rdx+24]
    vmovaps xmm6, XMMWORD PTR [rsp]
    vaddss  xmm0, xmm2, xmm1
    vmovss  DWORD PTR [r8+28], xmm0
    vmovss  DWORD PTR [r8+24], xmm3
    add     rsp, 24
    ret     0

Obviously, not a single YMM register used, and no packed instructions either. You can enable the following MSVC compiler flag, /Qvec-report: 2, to ask it why it didn't vectorise the loop:

<source>(15) : info C5002: loop not vectorized due to reason '1200'

Reason 1200 is apparently 'Loop contains loop-carried data dependencies that prevent vectorization. Different iterations of the loop interfere with each other such that vectorizing the loop would produce wrong answers, and the auto-vectorizer can't prove to itself that there aren't such data dependencies.', which obviously is complete garbage.

I posted all these to the godbolt discord to ask the grandmasters of assembly over there if there was any way around MSVC's issues:

If you'd like to play around with the MSVC implementation, here's the link on godbolt.

Since GCC works, just copy them?

I needed the above to work with MSVC and Windows, so the easiest thing I could think of was to just reverse-translate GCC's output into the appropriate intrinsic calls; I just went to Intel's catalogue and searched them one by one. The model answers are literally given, so why not right?

Word for word translation of the GCC instructions:

__m256 ymm2 = _mm256_loadu_ps(reinterpret_cast<const float*>(x));
__m256 ymm0 = _mm256_loadu_ps(reinterpret_cast<const float*>(y));

__m256 ymm1 = _mm256_permute_ps(ymm2, 160);
ymm2 = _mm256_permute_ps(ymm2, 245);

ymm1 = _mm256_mul_ps(ymm1, ymm0);

ymm0 = _mm256_permute_ps(ymm0, 177);

ymm0 = _mm256_mul_ps(ymm0, ymm2);

ymm1 = _mm256_addsub_ps(ymm1, ymm0);

_mm256_storeu_ps(reinterpret_cast<float*>(z), ymm1);

And their assembly output:

vmovups ymm1, YMMWORD PTR [rdi]
vmovups ymm0, YMMWORD PTR [rsi]
vpermilps       ymm3, ymm1, 160
vpermilps       ymm2, ymm0, 177
vpermilps       ymm1, ymm1, 245
vmulps  ymm0, ymm0, ymm3
vmulps  ymm1, ymm1, ymm2
vaddsubps       ymm0, ymm0, ymm1
vmovups YMMWORD PTR [rdx], ymm0
vzeroupper
ret

If you look closely, the assembly is not 100% identical, but it is functionally identical (just track the registers and you'll see it).

Performance-wise, there is also no difference, as confirmed by the man himself:

Links to my godbolt are here. Note that if you change it back to MSVC, the original movups instructions will disappear (this seems to be a thing with MSVC assuming/optimizing the arguments away directly into registers).

Some Concluding Remarks

I didn't go into the workings of how the complex number multiplies work, because that's pretty much inside the stackoverflow link.

Also, some might ask why I didn't use a library like Eigen. Yes, I could have, but that would require including a giant library just for one function which is ffs. Also, it was a pretty good learning opportunity.