Compiled Bitmaps
June 22nd, 2005
Ancient programmers speak in hushed tones of the legend known as the "compiled bitmap." Ok, no they don't, but there is such a thing. It's a technique that was sometimes used in the DOS days for rapidly drawing an image to the screen. But how can you compile a bitmap? And is there any value remaining in this technique, or should it be left in the dustbin along side TSRs and DOSSHELL? Let's find out!

Blitting

Drawing an image is called "blitting." The usual way we blit is to load a pixel from the source image buffer, store it in the destination, and repeat until we've blitted every pixel.

But with compiled bitmaps, the data for the image is in the instruction stream itself. So instead of "load a pixel from the source," we say "load red," and then store it as usual. Then "load green", or whatever the next color is, and store that, etc. A compiled bitmap is a function that knows how to draw a specific image.

Instruction counts

So an advantage of a compiled bitmap is that there's no need to load data from an image when we blit; the image data is in the instruction stream itself. But this is ameliorated by the fact that we'll have many more instructions in the function. How many more? That depends on the processor architecture, of course, but since I'm a Mac guy, I'll stick to PowerPC.

To load a single 32 bit pixel into a register requires two instructions: one to load each 16-bit half. And storing a pixel requires another instruction. By taking advantage of the literal offset field in the stw instruction, we can avoid having to increment our destination pointer (except when we overflow the 16 bit field every sixteen thousand pixels). The PowerPC also has a special instruction form, the store word with update instructions, stwu, that lets us increment the target pointer for free. But my tests showed that to be slower than stw, which makes sense, since it introduces more inter-instruction dependencies.

But of course, if the next pixel is the same color as the current one, there's no need to load a new color into the register: you can just store the current value again. So sometimes we can get away with only one instruction per pixel. So the number of instructions per pixel in a compiled bitmap ranges between one and three, depending on the complexity of the image and the sophistication of our compiler.

This compares favorably to ordinary blitting, where we have more instructions per pixel (load pixel, store pixel, check for loop exit, and pointer updates). So at first blush, a compiled bitmap may appear ~ 33% faster than ordinary blitting.

Bandwidth

But this isn't the whole story, of course; a processor is only as fast as it can be fed. How much memory bandwidth do we use? In standard blitting, the instruction stream is a very short loop which fits entirely into the instruction cache. The main bandwidth hog is reading and writing the pixels: 32 bits read, 32 bits written. That's 32 bits each way, per pixel.

With a compiled bitmap, the instruction stream no longer fits in the cache, so the instructions themselves become the bandwidth bottleneck. 32 bits for each instruction (remember, there's one to three), and 32 bits to write out each pixel, means between 32 and 96 bits read and 32 bits written per pixel. Based on that, we might expect a compiled bitmap to be up to three times slower than memcpy().

Test approach

I wrote a test program that blits three different images, at three different sizes, using each of three different methods. The images were the standard OS X Stones background image, an image with random pixels generated via random(), and a solid blue image. I blit (blitted?) to the screen using the CGDirectDisplay API, and repeated the tests to an in-memory buffer. The blitters were a memcpy() driven blitter, a bitmap compiler, and (for the screen only) the draw method of an NSBitmapImageRep. In each case, I blit 50,000 times and timed the number of seconds it took. Division gives the frames per second, and multiplying by the number of pixels gives pixels per second.

Here is the Xcode project I wrote. I place it in the public domain. (Note that it outputs frames per second, not pixels per second. You can calculate pixels per second by multiplying the framerate by the number of pixels in a frame. Note that I lowered the test count to 500 from 50,000 so you don't tie up your computer for too long by accident. I also left in the stwu version of the compiler in case you're interested.)

Results

Here's what I got.
Pixels/second for blitting to memory:
Image Size
128x128 256x256 512x512
memcpy Stones 2,582,493,355 401,030,682 244,721,918
Compiled Bitmap Stones 961,846,601 116,857,588 116,034,503
memcpy Random 2,599,446,090 392,847,697 244,445,199
Compiled Bitmap Random 732896429 116,561,007 115,185,567
memcpy Solid 2,592,274,147 431,870,370 245,302,313
Compiled Bitmap Solid 990,974,688 351,106,794 234,451,230
Pixels/second for blitting to the screen:
Image Size
128x128 256x256 512x512
memcpy Stones 40,984,006 40,984,335 40,933,309
CoreGraphics Stones 20,115,402 10,588,428 10,611,671
Compiled Bitmap Stones 13,321,951 13,300,135 13,303,299
memcpy Random 40,996,351 40,994,852 40,935,011
CoreGraphics Random 20,128,212 10,581,521 10,602,114
Compiled Bitmap Random 13,326,014 13,299,336 13,300,511
memcpy Solid 40,993,883 40,998,191 40,936,245
CoreGraphics Solid 20,125,009 10,574,159 10,602,177
Compiled Bitmap Solid 13,324,779 13,321,780 13,310,426

Man, that's boring. All right, let's draw some graphs.


Count the digits...one...two...ok, yeah, note that the scales are very different here. We can blit two and a half billion pixels to memory in the same amount of time as it takes to blit forty million pixels to the screen.

So it's pretty obvious that memcpy() is faster than my bitmap compiler. The only point where my compiled bitmap is competitive is blitting a large solid color.

Ex Post Facto Explanation

Was my Latin right? Darn if I know. Anyways, I noticed that memcpy() is using a sequence of unrolled lvx and stvx instructions. That means Altivec acceleration! It may be part of the reason memcpy() is so fast. Unfortunately, the only way to fill a vector register with literal values is to assemble them in memory and then load them in, so it would not be easy to vectorize the bitmap compiler, except in special cases. The results might be somewaht different on an Altivec-lacking G3.

Blitting the solid image to memory via a compiled bitmap is between two and three times faster than the non-solid images, and is competitive with memcpy(), except in the 128x128 case. I don't know why memcpy() is so much faster in the 128x128 case. I also don't know why blitting to the screen is so much faster with memcpy() compared with the other techniques. If you know or have ideas, please comment on them!

It's also very likely that I've missed a major optimization opportunity. I'm far from a guru when it comes to the PowerPC. If you spot one, please post it in the comments!

Looking ahead

My clumsy tests seem to confirm that compiled bitmaps aren't very fast. But note that the PowerPC has fairly big instructions, so it may not be the best candidate for a bitmap compiler. It takes twelve bytes of PowerPC instructions to write a 32 bit literal to a location in memory, but only six bytes of x86 instructions, which halves our bandwidth requirements. When the ICBMs ship, I'll write a bitmap compiler for them and repeat these tests.