Continuing with the goal of increasing the speed of the rendering to the mode 0x12 screen, I'm looking at improving what used to be called C2P, the Chunky-to-Planar conversion. This is the procedure which takes the "chunky" or "pixel-packed" data, where all the bits encoding one pixel are next to one another, and converts it into "planar" data, where the first bits of all the pixels are next to one another, followed by the second bits of all the pixels, etc.
In my code, the output from the quantiser is a bitmap of "chunky" bytes where the least significant four bits identify one of the 16 possible palette colours (the upper bits are unused). To output this to the mode 0x12 screen, I take this data eight pixels at a time and convert it into four bytes where the first byte contains the first bit from each of the eight pixels, the second byte contains the second bit from each of the eight pixels, and so on. These four bytes can then be put into the back buffer, one in each plane.
The previous routine was written entirely in C, and looped through each of the eight pixels and used bit masking and shifting to populate an array of four bytes, then it loaded these into the planes. Suspecting that this heavily used piece of code could be more efficient, I thought I'd put in a small blob of assembly to do the heavy lifting.
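For reference, a plain C loop of the kind described above might look something like this. It is a minimal sketch rather than the original routine: the function name c2p_8pixels and the plane-by-plane loop order are assumptions, and it only fills the four-byte array without writing to the planes.

```c
#include <stdint.h>

/* Convert eight chunky pixels (one per byte, colour index in the low
   four bits) into four planar bytes. planes[p] receives bit p of every
   pixel, with the first pixel ending up in the most significant bit. */
void c2p_8pixels(const uint8_t chunky[8], uint8_t planes[4])
{
    for (int p = 0; p < 4; p++) {
        uint8_t out = 0;
        for (int i = 0; i < 8; i++) {
            /* Mask out bit p of pixel i and shift it in from the right. */
            out = (uint8_t)((out << 1) | ((chunky[i] >> p) & 1u));
        }
        planes[p] = out;
    }
}
```

That is 32 shift-and-mask steps for every eight pixels, which is the work the assembly version tries to cut down.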
After a few tests, some experiments, much confusion, very unhappies, such code and many wow, I came up with a mechanism that gives the correct output. I also unrolled the loop to make the best use of the CPU pipeline (it was only four iterations after all). The code now uses the assembly RCR instruction to rotate the first bit from the first byte into the CPU CARRY flag, then it uses the RCL instruction to rotate that CARRY bit into the DL register, then increments the pointer to move onto the first bit of the second byte.
The assembly looks something like this (eax starts out loaded with the address of the first pixel):
rcrb (%%eax)
rclb %%dl
inc %%eax
After running these three instructions eight times, all of the first-plane bits have been rotated into the DL register (with the first pixel in the most significant bit), and we can load this into the corresponding plane byte of the back buffer.
The code then repeats the exact same procedure again, although as the pixel bytes were rotated by one bit the first time, this iteration gets all the bits corresponding to the second plane. Two more iterations and the C2P conversion is complete in 100 assembly instructions (including reloading eax) with no jumps.
The only downside is that the source pixel data is destroyed along the way, as each rotation fills the top bit of the byte with whatever was left in the CARRY flag (actually it's whatever came out of the DL register, so I guess that means that whatever data was in DL before we started will be "P2C" converted back into the source buffer).
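To see what the rotations do without running the assembly, here is a portable C model of the trick. It is only a sketch: the carry variable and the rcr8 and rcl8 helpers are made-up stand-ins for the CPU CARRY flag and the RCR and RCL instructions, and the loop is left rolled up for readability.

```c
#include <stdint.h>

/* Stand-in for the CPU CARRY flag shared by both rotate helpers. */
static unsigned carry;

/* Model of RCR by one: bit 0 drops into carry, old carry enters bit 7. */
static void rcr8(uint8_t *v)
{
    unsigned out = *v & 1u;
    *v = (uint8_t)((*v >> 1) | (carry << 7));
    carry = out;
}

/* Model of RCL by one: bit 7 drops into carry, old carry enters bit 0. */
static void rcl8(uint8_t *v)
{
    unsigned out = (*v >> 7) & 1u;
    *v = (uint8_t)((*v << 1) | carry);
    carry = out;
}

/* Four passes over the same eight bytes, one pass per plane.
   Note that chunky[] is modified: the source pixels do not survive. */
void c2p_rotate(uint8_t chunky[8], uint8_t planes[4])
{
    uint8_t dl = 0;   /* plays the part of the DL register */
    carry = 0;
    for (int p = 0; p < 4; p++) {
        for (int i = 0; i < 8; i++) {
            rcr8(&chunky[i]);   /* rcrb (%%eax) */
            rcl8(&dl);          /* rclb %%dl, then inc %%eax */
        }
        planes[p] = dl;         /* ready for plane p of the back buffer */
    }
}
```

DL never needs clearing between passes, because eight rotations replace all eight of its bits. The old bits that fall out of DL are the ones that end up in the top of the source bytes.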
With this rotator in place, it has knocked about 300ms off each frame, so (with dithering) it's maintaining 0.99 FPS. It's good, but it still hasn't broken the 1 FPS mark.
With some more tweaking, I've moved where in the process the resolution is doubled. Because Doom outputs a 320x200 resolution screen, I double this to 640x400 to display on my Mode 0x12 screen. Before, I had the video driver specifically written to double the input buffer by calculating the required byte offset for each pixel. I wasn't keen on that because it was a special case in the kernel, even if this is the only thing it does yet.
I've moved this into the Doom renderer itself so that Doom sends me a 640x400 bitmap that the video routine can just iterate through to quantise and render. This has increased the frame rate to 96 ticks per frame = 1.04 Frames Per Second \ o /
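The doubling itself is simple enough to sketch in C. This is not the actual change to the Doom renderer: the function name double_pixels is hypothetical, and it assumes one 32-bit value per pixel, which may not match the real buffer format.

```c
#include <stddef.h>
#include <stdint.h>

/* Write each source pixel as a 2x2 block in the destination.
   src is w x h pixels, dst must have room for (2*w) x (2*h) pixels. */
void double_pixels(const uint32_t *src, uint32_t *dst, size_t w, size_t h)
{
    for (size_t y = 0; y < h; y++) {
        uint32_t *row0 = dst + (2 * y) * (2 * w);  /* upper output row */
        uint32_t *row1 = row0 + 2 * w;             /* lower output row */
        for (size_t x = 0; x < w; x++) {
            uint32_t px = src[y * w + x];
            row0[2 * x] = row0[2 * x + 1] = px;
            row1[2 * x] = row1[2 * x + 1] = px;
        }
    }
}
```

With w = 320 and h = 200 this gives the 640x400 bitmap, so the video routine can walk the buffer linearly with no per-pixel offset calculation.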
Moving on, I'm looking at the mechanism used to perform the dithering. Currently the procedure calculates the error for each component into a set of signed 8-bit integers, then adds these onto the next pixel before quantising (see the "Displaying Images in 16 Colours" article for more information). This is all done in C and takes nearly half a second per frame on the emulator to perform.
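As a rough illustration of that per-component procedure, here is a minimal sketch. The names rgb_t, add_clamped, snap and dither_row are all invented for the example, and snap is a stand-in quantiser that just pushes each component to 0 or 255; the real quantiser picks from the 16-colour palette.

```c
#include <stdint.h>

typedef struct { uint8_t r, g, b; } rgb_t;

/* Add a signed error to a component, clamping the result to 0..255. */
static uint8_t add_clamped(uint8_t value, int8_t error)
{
    int sum = value + error;
    if (sum < 0) return 0;
    if (sum > 255) return 255;
    return (uint8_t)sum;
}

/* Stand-in quantiser (an assumption): nearest of 0 or 255. */
static uint8_t snap(uint8_t v)
{
    return v < 128 ? 0 : 255;
}

/* Quantise a row, carrying each component's error onto the next pixel. */
void dither_row(const rgb_t *src, rgb_t *dst, int count)
{
    int8_t err_r = 0, err_g = 0, err_b = 0;
    for (int i = 0; i < count; i++) {
        uint8_t r = add_clamped(src[i].r, err_r);
        uint8_t g = add_clamped(src[i].g, err_g);
        uint8_t b = add_clamped(src[i].b, err_b);
        dst[i].r = snap(r);
        dst[i].g = snap(g);
        dst[i].b = snap(b);
        /* With this quantiser the error always fits in -127..127. */
        err_r = (int8_t)(r - dst[i].r);
        err_g = (int8_t)(g - dst[i].g);
        err_b = (int8_t)(b - dst[i].b);
    }
}
```

Each pixel costs three clamped adds and three subtractions on top of the quantising, which is the per-byte work the 32-bit idea tries to avoid.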
I'm wondering if I can perform the arithmetic on the 32-bit 0RGB integer instead. If I set the least significant bit of each byte ( data | 0x01010101 ) before performing a 32-bit subtraction so that a "borrow" in one component will not affect the next, then clear the least significant bit of each byte ( data & 0xFEFEFEFE ) before performing the add to prevent a carry from one component overflowing into the next, I should be able to calculate the three component error values in only a few operations without having to extract each byte from the 32-bit value.
With this new carry mechanism in place, the frame rate is up to 1.6 FPS, but the dithering is not correct, so I may have to look into this more.
UPDATE: As I'm sure you were all quick to point out, this logic is fatally flawed. It doesn't work when a component plus its carry equals more than 255, because it doesn't clamp the value at 255; it just wraps around and generates a very low value instead. I'm also not convinced that the byte-wise negative numbers are going to work either. I have abandoned this for now until I can come up with an efficient way to address these issues.
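The wrap-around is easy to reproduce in isolation. The sketch below is modelled on the description above rather than on the real code: both function names are hypothetical, and it only handles a positive error to keep the example short.

```c
#include <stdint.h>

/* The packed approach: clear the low bit of each byte, then add the
   error to all components with a single 32-bit addition. */
uint32_t add_error_packed(uint32_t pixel, uint32_t error)
{
    return (pixel & 0xFEFEFEFEu) + error;
}

/* The slow but safe approach: add each component separately and
   clamp it at 255. */
uint32_t add_error_clamped(uint32_t pixel, uint32_t error)
{
    uint32_t out = 0;
    for (int shift = 0; shift < 24; shift += 8) {
        uint32_t c = ((pixel >> shift) & 0xFFu) + ((error >> shift) & 0xFFu);
        if (c > 255u) c = 255u;
        out |= c << shift;
    }
    return out;
}
```

Taking a blue value of 0xF0 and an error of 0x20, the packed add returns 0x00000110: blue has wrapped round to 0x10 and a stray 1 has leaked into green. The clamped version returns 0x000000FF.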
With the dithering disabled, the new C2P routine in place and Doom doubling the output, I get a jaw-dropping 1.96 FPS. It almost looks fast enough to play, except without dithering, almost everything quantises to black :(
I'm happy with the C2P routine, but the other two seem to give only minor improvements. I'm wondering about trying to get this onto real hardware to see how much of the performance is down to the Bochs emulator on my machine.