C++ SSE2 or AVX2 intrinsics for grayscale to ARGB conversion

I was wondering if there is an SSE2/AVX2 integer instruction, or a sequence of instructions (or intrinsics), that achieves the following result:

Given a row of 8 byte pixels of the form:

A = {a, b, c, d, e, f, g, h}

Is there any way to load these pixels into a YMM register holding eight 32-bit ARGB pixels, such that each grayscale value gets broadcast to the other two color bytes of its 32-bit pixel? The result should look something like this (the 0 is the alpha value):

B = {0aaa, 0bbb, 0ccc, 0ddd, 0eee, 0fff, 0ggg, 0hhh}

I'm a complete beginner in vector extensions so I'm not even sure how to approach this, or if it's at all possible.

Any help would be appreciated. Thanks!


Thanks for your answers. I still have a problem though:

I put this small example together and compiled with VS2015 on x64.

#include <immintrin.h>
#include <malloc.h>
#include <string.h>

int main()
{
    // 64 bytes, 32-byte aligned, zero-filled
    unsigned char* pixels = (unsigned char*)_aligned_malloc(64, 32);
    memset(pixels, 0, 64);

    // eight test gray values: 0xaa, 0xab, ...
    for (unsigned char i = 0; i < 8; i++)
        pixels[i] = 0xaa + i;

    __m128i grayscalePix = _mm_load_si128((const __m128i*)pixels);
    __m256i rgba = _mm256_cvtepu8_epi32(grayscalePix);
    __m256i mulOperand = _mm256_set1_epi32(0x00010101);

    __m256i result = _mm256_mullo_epi32(rgba, mulOperand);

    _aligned_free(pixels);
    return 0;
}

The problem is that after doing

__m256i rgba = _mm256_cvtepu8_epi32(grayscalePix);

rgba only has the first four doublewords set. The last four are all 0.

The Intel developer manual says:

VPMOVZXBD ymm1, xmm2/m64 Zero extend 8 packed 8-bit integers in the low 8 bytes of xmm2/m64 to 8 packed 32-bit integers in ymm1.

I'm not sure if this is intended behaviour or I'm still missing something.


Update: @chtz's answer is an even better idea, using a cheap 128->256 broadcast load instead of vpmovzx to feed vpshufb, saving shuffle-port bandwidth.

Start with PMOVZX like Mark suggests.

But after that, PSHUFB (_mm256_shuffle_epi8) will be much faster than PMULLD, except that it competes for the shuffle port with PMOVZX. (And it operates in-lane, so you still need the PMOVZX).

So if you only care about throughput, not latency, then _mm256_mullo_epi32 is good. But if latency matters, or if your throughput bottlenecks on something other than 2 shuffle-port instructions per vector anyway, then PSHUFB to duplicate the bytes within each pixel should be best.

Actually, even for throughput, _mm256_mullo_epi32 is bad on HSW and BDW: it's 2 uops (10c latency), both for p0, so it can only sustain one per 2 clocks.

On SKL, it's 2 uops (10c latency) for p01, so it can sustain the same one per clock throughput as VPMOVZXBD. But it's an extra 1 uop, making it more likely to bottleneck.

(VPSHUFB is 1 uop, 1c latency, for port 5, on all Intel CPUs that support AVX2.)
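The PMOVZX + PSHUFB sequence described above could look roughly like the sketch below. The function name gray8_to_argb8_pshufb and the GRAY_AVX2 macro are made up for illustration; the macro only exists so that GCC/Clang accept the AVX2 intrinsics without -mavx2 (MSVC needs nothing here). Because VPSHUFB works within each 128-bit lane, the same 16-byte control pattern is repeated for both lanes.

```cpp
#include <immintrin.h>
#include <stdint.h>

// Assumption: GCC/Clang per-function target attribute; a no-op elsewhere.
#if defined(__GNUC__) || defined(__clang__)
#define GRAY_AVX2 __attribute__((target("avx2")))
#else
#define GRAY_AVX2
#endif

// Hypothetical helper: expands 8 gray bytes at src into 8 pixels of the
// form 0x00gggggg at dst (dst needs no particular alignment).
GRAY_AVX2 void gray8_to_argb8_pshufb(const uint8_t* src, uint32_t* dst)
{
    // Load exactly 8 bytes into the low half of an XMM register.
    __m128i gray = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(src));

    // VPMOVZXBD: each byte becomes the low byte of its own 32-bit element.
    __m256i wide = _mm256_cvtepu8_epi32(gray);

    // Per 128-bit lane: copy byte 0 of each dword into bytes 0..2 and
    // zero byte 3 (a control byte with the high bit set writes zero).
    const char Z = static_cast<char>(0x80);
    const __m256i ctrl = _mm256_setr_epi8(
        0, 0, 0, Z,  4, 4, 4, Z,  8, 8, 8, Z,  12, 12, 12, Z,
        0, 0, 0, Z,  4, 4, 4, Z,  8, 8, 8, Z,  12, 12, 12, Z);

    __m256i argb = _mm256_shuffle_epi8(wide, ctrl);
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst), argb);
}
```

Calling it with a pointer to 8 gray bytes and a pointer to 8 uint32_t outputs converts one group of pixels; it still requires a CPU with AVX2 at run time.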


You can load the packed bytes into a register, call __m256i _mm256_cvtepu8_epi32 (__m128i a) to convert to 32 bit values, then multiply by 0x00010101 to replicate the gray scale into R,G and B.
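The reason the multiply works is plain arithmetic: for a zero-extended gray value g (at most 0xFF), g * 0x00010101 equals g + (g << 8) + (g << 16), and no carries cross byte boundaries. A scalar sketch of that identity, with the made-up name gray_to_argb_scalar, can also serve as a reference when checking a vector version:

```cpp
#include <stdint.h>

// Hypothetical scalar reference: replicates one gray byte into the
// low three bytes of a 32-bit pixel, leaving the top (alpha) byte 0.
constexpr uint32_t gray_to_argb_scalar(uint8_t g)
{
    // g * 0x00010101 == g | (g << 8) | (g << 16) because g <= 0xFF
    return static_cast<uint32_t>(g) * 0x00010101u;
}
```

For example, a gray value of 0xaa becomes 0x00aaaaaa.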

You can convert 16 pixels with one vbroadcasti128 and two vpshufb. The broadcast does not require port 5, if it directly loads from memory, so the shuffles can fully utilize that port (it will still bottleneck on that port, or on storing back to memory).

#include <immintrin.h>
#include <stddef.h>

void gray2rgba(char const* input, char* output, size_t length)
{
    length &= size_t(-16); // only handle sizes that are multiples of 16 here ...

    // Shuffle controls: each output dword takes one input byte three times;
    // 0x80 in the top byte zeroes the alpha channel.
    __m256i shuflo = _mm256_setr_epi32(
        0x80000000, 0x80010101, 0x80020202, 0x80030303,
        0x80040404, 0x80050505, 0x80060606, 0x80070707
    );
    __m256i shufhi = _mm256_setr_epi32(
        0x80080808, 0x80090909, 0x800a0a0a, 0x800b0b0b,
        0x800c0c0c, 0x800d0d0d, 0x800e0e0e, 0x800f0f0f
    );

    for(size_t i=0; i<length; i+=16)
    {
        // Unaligned 16-byte load, broadcast to both 128-bit lanes
        __m256i in = _mm256_broadcastsi128_si256(_mm_loadu_si128(reinterpret_cast<const __m128i*>(input+i)));
        __m256i out0 = _mm256_shuffle_epi8(in, shuflo); // pixels 0..7
        __m256i out1 = _mm256_shuffle_epi8(in, shufhi); // pixels 8..15
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(output+4*i),    out0);
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(output+4*i+32), out1);
    }
}

Godbolt Demo: https://godbolt.org/z/dUx6GZ
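Since gray2rgba rounds length down to a multiple of 16, up to 15 trailing pixels are left unconverted. One possible way to finish them is a plain scalar loop; gray2rgba_tail below is a hypothetical companion, not part of the answer's code, and is meant to be called with the same arguments right after gray2rgba.

```cpp
#include <stddef.h>

// Hypothetical companion to gray2rgba: converts only the pixels past the
// last full group of 16, i.e. indices [length rounded down to 16, length).
void gray2rgba_tail(char const* input, char* output, size_t length)
{
    for (size_t i = length & ~size_t(15); i < length; ++i)
    {
        output[4*i + 0] = input[i]; // three color bytes get the gray value
        output[4*i + 1] = input[i];
        output[4*i + 2] = input[i];
        output[4*i + 3] = 0;        // alpha byte, matching the 0x80 shuffle lanes
    }
}
```

Writing the bytes individually keeps the output layout identical to what the shuffle produces, independent of how a 32-bit store would be ordered.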


  • Your code looks right. Are you sure you're not just testing it wrong? Or that the compiler didn't optimize some / all of it away because the results are unused? On Godbolt, I had to use -O0 to make the compiler keep the vector ops. Even -Og or -O1 optimized away everything except the malloc/free. Try storing the vector into a uint32_t array and printing it with printf, or something C++ish.
  • The optimizer is not a concern as I was testing this in debug mode, but you were still right though :) Apparently, the VS debugger does not display __m256i values correctly. It almost feels like it truncates them at a __m128i boundary. The registers window was not much help either. Everything looks fine after I stored the vector to memory and did a printf, so I guess thanks are in order :)
  • Oh wow, things are BAD when you can't trust the debugger! Does it get any better with the debugger when you are using the result?
  • I no longer bother looking at __m256i values in the debugger. When I need to test my code for correctness, I use #ifdef _DEBUG code where I just copy everything to memory and look at it there.
  • pshufb is often going to be better than a multiply. See my answer.
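The store-and-print check from the comments above could be sketched as follows. The name debug_gray8_mullo and the GRAY_AVX2 macro are hypothetical; the macro is only there so GCC/Clang accept the AVX2 intrinsics without -mavx2. The function runs the question's PMOVZX + PMULLD sequence, writes the whole YMM register to memory, and prints all eight lanes so that nothing depends on how a debugger renders __m256i.

```cpp
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

// Assumption: GCC/Clang per-function target attribute; a no-op elsewhere.
#if defined(__GNUC__) || defined(__clang__)
#define GRAY_AVX2 __attribute__((target("avx2")))
#else
#define GRAY_AVX2
#endif

// Hypothetical debug helper: converts 8 gray bytes at src, stores the
// result to lanes[0..7], and prints each 32-bit lane.
GRAY_AVX2 void debug_gray8_mullo(const uint8_t* src, uint32_t* lanes)
{
    __m128i gray = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(src));
    __m256i wide = _mm256_cvtepu8_epi32(gray);
    __m256i argb = _mm256_mullo_epi32(wide, _mm256_set1_epi32(0x00010101));

    // Inspect the value in memory rather than in the debugger.
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(lanes), argb);
    for (int i = 0; i < 8; ++i)
        printf("lane %d: 0x%08x\n", i, static_cast<unsigned>(lanes[i]));
}
```

If the upper four lanes print as nonzero for nonzero input, the conversion itself is fine and the problem is in how the value was being viewed.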