Compiled Bitmaps

Blitting

Drawing an image is called "blitting." The usual way we blit is to load a pixel from the source image buffer, store it in the destination, and repeat until we've blitted every pixel.

But with compiled bitmaps, the data for the image is in the instruction stream itself. So instead of "load a pixel from the source," we say "load red," and then store it as usual. Then "load green", or whatever the next color is, and store that, etc. A compiled bitmap is a function that knows how to draw a specific image.

Instruction counts

So an advantage of a compiled bitmap is that there's no need to load data from an image when we blit; the image data is in the instruction stream itself. But this is ameliorated by the fact that we'll have many more instructions in the function. How many more? That depends on the processor architecture, of course, but since I'm a Mac guy, I'll stick to PowerPC.

To load a single 32 bit pixel into a register requires two instructions: one to load each 16-bit half. And storing a pixel requires another instruction. By taking advantage of the literal offset field in the stw instruction, we can avoid having to increment our destination pointer (except when we overflow the 16 bit field every sixteen thousand pixels). The PowerPC also has a special instruction form, the store word with update instructions, stwu, that lets us increment the target pointer for free. But my tests showed that to be slower than stw, which makes sense, since it introduces more inter-instruction dependencies.

But of course, if the next pixel is the same color as the current one, there's no need to load a new color into the register: you can just store the current value again. So sometimes we can get away with only one instruction per pixel. So the number of instructions per pixel in a compiled bitmap ranges between one and three, depending on the complexity of the image and the sophistication of our compiler.

This compares favorably to ordinary blitting, where we have more instructions per pixel (load pixel, store pixel, check for loop exit, and pointer updates). So at first blush, a compiled bitmap may appear ~ 33% faster than ordinary blitting.

Bandwidth

But this isn't the whole story, of course; a processor is only as fast as it can be fed. How much memory bandwidth do we use? In standard blitting, the instruction stream is a very short loop which fits entirely into the instruction cache. The main bandwidth hog is reading and writing the pixels: 32 bits read, 32 bits written. That's 32 bits each way, per pixel.

With a compiled bitmap, the instruction stream no longer fits in the cache, so the instructions themselves become the bandwidth bottleneck. 32 bits for each instruction (remember, there's one to three), and 32 bits to write out each pixel, means between 32 and 96 bits read and 32 bits written per pixel. Based on that, we might expect a compiled bitmap to be up to three times slower than memcpy().

Test approach

I wrote a test program that blits three different images, at three different sizes, using each of three different methods. The images were the standard OS X Stones background image, an image with random pixels generated via random(), and a solid blue image. I blit (blitted?) to the screen using the CGDirectDisplay API, and repeated the tests to an in-memory buffer. The blitters were a memcpy() driven blitter, a bitmap compiler, and (for the screen only) the draw method of an NSBitmapImageRep. In each case, I blit 50,000 times and timed the number of seconds it took. Division gives the frames per second, and multiplying by the number of pixels gives pixels per second.

Here is the Xcode project I wrote. I place it in the public domain. (Note that it outputs frames per second, not pixels per second. You can calculate pixels per second by multiplying the framerate by the number of pixels in a frame. Note that I lowered the test count to 500 from 50,000 so you don't tie up your computer for too long by accident. I also left in the stwu version of the compiler in case you're interested.)

Results

Here's what I got.

Pixels/second for blitting to memory:

		Image Size
		128x128	256x256	512x512
memcpy	Stones	2,582,493,355	401,030,682	244,721,918
Compiled Bitmap	Stones	961,846,601	116,857,588	116,034,503
memcpy	Random	2,599,446,090	392,847,697	244,445,199
Compiled Bitmap	Random	732896429	116,561,007	115,185,567
memcpy	Solid	2,592,274,147	431,870,370	245,302,313
Compiled Bitmap	Solid	990,974,688	351,106,794	234,451,230

Pixels/second for blitting to the screen:

		Image Size
		128x128	256x256	512x512
memcpy	Stones	40,984,006	40,984,335	40,933,309
CoreGraphics	Stones	20,115,402	10,588,428	10,611,671
Compiled Bitmap	Stones	13,321,951	13,300,135	13,303,299
memcpy	Random	40,996,351	40,994,852	40,935,011
CoreGraphics	Random	20,128,212	10,581,521	10,602,114
Compiled Bitmap	Random	13,326,014	13,299,336	13,300,511
memcpy	Solid	40,993,883	40,998,191	40,936,245
CoreGraphics	Solid	20,125,009	10,574,159	10,602,177
Compiled Bitmap	Solid	13,324,779	13,321,780	13,310,426

Man, that's boring. All right, let's draw some graphs.

Count the digits...one...two...ok, yeah, note that the scales are very different here. We can blit two and a half billion pixels to memory in the same amount of time as it takes to blit forty million pixels to the screen.

So it's pretty obvious that memcpy() is faster than my bitmap compiler. The only point where my compiled bitmap is competitive is blitting a large solid color.

Ex Post Facto Explanation

Was my Latin right? Darn if I know. Anyways, I noticed that memcpy() is using a sequence of unrolled lvx and stvx instructions. That means Altivec acceleration! It may be part of the reason memcpy() is so fast. Unfortunately, the only way to fill a vector register with literal values is to assemble them in memory and then load them in, so it would not be easy to vectorize the bitmap compiler, except in special cases. The results might be somewaht different on an Altivec-lacking G3.

Blitting the solid image to memory via a compiled bitmap is between two and three times faster than the non-solid images, and is competitive with memcpy(), except in the 128x128 case. I don't know why memcpy() is so much faster in the 128x128 case. I also don't know why blitting to the screen is so much faster with memcpy() compared with the other techniques. If you know or have ideas, please comment on them!

It's also very likely that I've missed a major optimization opportunity. I'm far from a guru when it comes to the PowerPC. If you spot one, please post it in the comments!

Looking ahead

My clumsy tests seem to confirm that compiled bitmaps aren't very fast. But note that the PowerPC has fairly big instructions, so it may not be the best candidate for a bitmap compiler. It takes twelve bytes of PowerPC instructions to write a 32 bit literal to a location in memory, but only six bytes of x86 instructions, which halves our bandwidth requirements. When the ICBMs ship, I'll write a bitmap compiler for them and repeat these tests.