The Linux Page

Fast render from image A to image B using alpha for blending

The attached package currently includes a C function and a SIMD function, both of which blend two images together.

At this point I do not like the fact that the result is slightly wrong: it is off by 1 for large values (with an alpha of 255, you are likely to get the source component value minus 1). Outside of that, it works nicely and it is VERY fast.

I can blend 3,300 NTSC images per second with my i5 (Intel(R) Core(TM) i5 CPU 760 @ 2.80GHz). With my newer computers, which have Xeon processors (Intel(R) Xeon(R) CPU E31220 @ 3.10GHz), I can run 3,750 NTSC images per second. The increase is probably due in part to the faster processor (10.7% faster in GHz) and in part to the better memory. The CPU caches are very similar. Also, there are 4 cores, so I could run 2 processes together, but that wouldn't make much difference on such a memory-intensive algorithm.
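To put the "memory-intensive" remark in numbers, here is a rough back-of-the-envelope sketch. The post does not state the frame size, so the 720x480 resolution and 4 bytes per pixel used in the example call are assumptions, and the function name `blend_bytes_per_second` is made up for illustration. Each blend reads two source images and writes one destination image, so it moves three frames' worth of bytes.

```c
#include <stdint.h>

/* Hypothetical helper: bytes moved per second by the blend.
 * Each frame costs two source reads plus one destination write. */
static uint64_t blend_bytes_per_second(uint32_t width, uint32_t height,
                                       uint32_t bytes_per_pixel,
                                       uint32_t frames_per_second)
{
    uint64_t frame = (uint64_t)width * height * bytes_per_pixel;
    return frame * 3u * frames_per_second;
}
```

Under those assumptions, `blend_bytes_per_second(720, 480, 4, 3300)` comes to 13,685,760,000 bytes, or roughly 13.7 GB moved per second, and 3,750 frames per second comes to about 15.6 GB per second. With that much traffic for so little arithmetic per byte, it is plausible that memory rather than the CPU sets the limit.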

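Anyway, the code is attached as a tar.gz if you want to have a look. The merge-with-alpha.cpp is the file you're interested in. The C and the SIMD versions are both included. Just for the hell of it, here is the SIMD version, written as inline assembly in a C function (compiles with gcc/g++). Despite the _mmx suffix, it works on the SSE xmm registers and uses pshufb, so it needs a CPU with SSSE3, and the %rip-relative addressing makes it x86-64 only: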

void _merge_with_alpha_mmx(unsigned char *d, const unsigned char *s1, const unsigned char *s2, int w, int h)
{
    asm(
    // the data is expected to be contiguous, and w * h must be a non-zero multiple of 4
    "imul %[w],%[h]\n\t"
    "sar $2,%[h]\n\t"  // we do 4 pixels at a time, so divide by 4
    "jmp loop\n\t"
    ".align 16\n"
"explode_alpha:\n\t"
    ".long 0x80038003, 0x80078007, 0x800B800B, 0x800F800F\n"
"explode_rgbe:\n\t"
    ".long 0x80028000, 0x80068004, 0x800A8008, 0x800E800C\n"
"explode_rgbo:\n\t"
    ".long 0x80038001, 0x80078005, 0x800B8009, 0x800F800D\n"
"m255:\n\t"
    ".long 0x00FF00FF, 0x00FF00FF, 0x00FF00FF, 0x00FF00FF\n"
"m257:\n\t"
    ".long 0x01010101, 0x01010101, 0x01010101, 0x01010101\n"
"solid:\n\t"
    ".long 0xFF000000, 0xFF000000, 0xFF000000, 0xFF000000\n"
"loop:\n\t"
    // load 4 pixels from both sources
    // (rrggbbaa:rrggbbaa:rrggbbaa:rrggbbaa)
    "movups (%[s1]),%%xmm1\n\t"
    "movups (%[s2]),%%xmm2\n\t"
    // extract the 4 alpha from source 1 (ignore source 2 alpha)
    // (00aa00aa:00aa00aa:00aa00aa:00aa00aa)
    "movdqa %%xmm1,%%xmm3\n\t"
    "pshufb explode_alpha(%%rip),%%xmm3\n\t"
    // compute (inverted = (255 - alpha <=> 255 ^ alpha))
    // (00ii00ii:00ii00ii:00ii00ii:00ii00ii)
    "movdqa %%xmm3,%%xmm4\n\t"
    "pxor m255(%%rip),%%xmm4\n\t"
    "movdqa %%xmm1,%%xmm5\n\t"
    "movdqa %%xmm2,%%xmm6\n\t"
    // explode two pixels per register (xmm1,2,5,6)
    // (even components: 00bb00rr:00bb00rr:00bb00rr:00bb00rr)
    // and multiply with alpha and (255 - alpha)
    "pshufb explode_rgbe(%%rip),%%xmm1\n\t"
    "pmullw %%xmm3,%%xmm1\n\t"
    "pshufb explode_rgbe(%%rip),%%xmm2\n\t"
    "pmullw %%xmm4,%%xmm2\n\t"
    // (odd components: 00aa00gg:00aa00gg:00aa00gg:00aa00gg)
    // and multiply with alpha and (255 - alpha)
    "pshufb explode_rgbo(%%rip),%%xmm5\n\t"
    "pmullw %%xmm3,%%xmm5\n\t"
    // even: sum both results (color * alpha) + (color * (255 - alpha))
    "paddw %%xmm1,%%xmm2\n\t"
    // (nearly) divide that result by 255, xmm2 is final: 00bb00rr:...
    "pmulhuw m257(%%rip),%%xmm2\n\t" // nearly equivalent to div m255(%rip),xmm2
    "pshufb explode_rgbo(%%rip),%%xmm6\n\t"
    "pmullw %%xmm4,%%xmm6\n\t"
    // odd: sum both results (color * alpha) + (color * (255 - alpha))
    "paddw %%xmm5,%%xmm6\n\t"
    // (nearly) divide that result by 255
    "pmulhuw m257(%%rip),%%xmm6\n\t" // nearly equivalent to div m255(%rip),xmm6
    // adjust the odd components back to the right place (value << 8)
    // xmm6 is final: aa00gg00:...
    "psllw $8,%%xmm6\n\t"
    // combine the even and odd components back into full pixels (rrggbbaa:...)
    "por %%xmm6,%%xmm2\n\t"
    "por solid(%%rip),%%xmm2\n\t"
    "movups %%xmm2,(%[d])\n\t"
    // move to the next 4 pixels
    "add $16,%[s1]\n\t"
    "add $16,%[s2]\n\t"
    "add $16,%[d]\n\t"
    "dec %[h]\n\t" // [h] was set to ([w] * [h]) / 4 before the loop
    "jne loop\n\t"
    : /* in/out: all four are modified by the loop */ [d] "+r" (d), [s1] "+r" (s1), [s2] "+r" (s2), [h] "+r" (h)
    : /* input */ [w] "r" (w)
    : "cc", "memory", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6"
    );
}
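For readers who do not read SSE assembly, the per-pixel arithmetic of the loop above can be sketched in plain C. This is not the C version shipped in the package (that one is not shown in this post); the names `merge_with_alpha_ref` and `div255_approx` are made up for illustration. The sketch assumes 4 bytes per pixel with alpha in the fourth byte, as the shuffle masks imply, and like the assembly it takes alpha from source 1 only and forces the output alpha to 0xFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Same approximation as pmulhuw with 257: (x * 257) >> 16, for x <= 255 * 255. */
static uint8_t div255_approx(uint32_t x)
{
    return (uint8_t)((x * 257u) >> 16);
}

/* Hypothetical scalar reference: blend s1 over s2 using the alpha of s1. */
void merge_with_alpha_ref(unsigned char *d, const unsigned char *s1,
                          const unsigned char *s2, int w, int h)
{
    size_t count = (size_t)w * (size_t)h;
    for (size_t i = 0; i < count; ++i, d += 4, s1 += 4, s2 += 4) {
        uint32_t a = s1[3];       /* alpha of source 1 */
        uint32_t inv = 255u ^ a;  /* same as 255 - alpha */
        for (int c = 0; c < 3; ++c) {
            d[c] = div255_approx(s1[c] * a + s2[c] * inv);
        }
        d[3] = 0xFF;              /* output is forced opaque, like "solid" */
    }
}
```

Unlike the assembly, this sketch handles any pixel count, which makes it usable as a slow reference for comparing results. It also reproduces the off-by-one: a component of 255 with an alpha of 255 comes out as 254.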

What's annoying, at least to me, is the lack of a pdiv instruction. Instead I had to multiply by 257 and use the top 16 bits of the result (really only 8 bits, since the top 8 bits are always zero). This generates an error: if you have a component that's 0xFF with an alpha of 0xFF (not transparent), it will be changed to 0xFE instead of 0xFF! I'm thinking I could round up by adding a value, such as 0x0080, before the pmulhuw.
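The rounding idea can be checked exhaustively in scalar C before touching the assembly. The sketch below is not from the package, and the function names are made up for illustration. It compares the current multiply-high by 257 against the variant that adds 0x0080 first, over every value the blend can produce (0 through 255 * 255), using round-to-nearest division by 255 as the reference.

```c
#include <stdint.h>

/* What the loop does today: floor(x * 257 / 65536). */
static uint32_t div255_trunc(uint32_t x) { return (x * 257u) >> 16; }

/* Proposed variant: add 0x0080 first, then the same multiply-high. */
static uint32_t div255_round(uint32_t x) { return ((x + 0x80u) * 257u) >> 16; }

/* Count inputs in [0, 255 * 255] where f(x) differs from round(x / 255). */
static uint32_t count_mismatches(uint32_t (*f)(uint32_t))
{
    uint32_t bad = 0;
    for (uint32_t x = 0; x <= 255u * 255u; ++x) {
        /* round to nearest; 255 is odd, so there are no ties */
        uint32_t exact = (x + 127u) / 255u;
        if (f(x) != exact)
            ++bad;
    }
    return bad;
}
```

Here `count_mismatches(div255_round)` is zero while `count_mismatches(div255_trunc)` is not, and `div255_round(255 * 255)` gives 255 where the truncating version gives 254. The sum never overflows 16 bits, since 255 * 255 + 128 is 65153. In the SSE loop this would presumably mean a paddw with a constant holding 0x0080 in each 16-bit lane before each pmulhuw; that constant is not part of the code above.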

Otherwise, I'll be testing this with yet another optimization to see whether it makes any difference: there are 2 registers still unused, so we can pre-load two of the constants in registers.

Well! By putting those 2 values in registers I could render another 230+ images per second!

Attachment                     Size
images-0.1.1-Source.tar_.gz    862.28 KB