2014-01-31

2014-01-31 Doom on Real Hardware

"So let's change the main() routine back to loading the Doom executable and see what happens."

Well, the short answer is this:



I didn't want to post this before because the video quality is quite terrible because it's a camera pointing at the monitor, but it shows off the frame rate quite nicely.  As it turns out, all the frame rate concerns I had were completely due to the Bochs emulator, and on real hardware it just flies.

So, let's break it down ...

Firstly, the audio.  The Octopus kernel has no support for sound yet, and while the music is a really good fit, it was added with YouTube to replace the terrible background noise and camera zoom noise.

The video has some significant banding on it.  This is because it's not using the Vertical Retrace interval (VSync) to update the screen.  I have the back buffer already, so this should be easy to implement.

Currently, Doom takes around four and a half minutes to load (most of which I didn't bother to record) but then as the game loads more data on the fly, it pauses briefly.  You can see the reflection of the blue light on the USB stick flashing at the bottom of the screen when it does this.

I have a list of things to address in the short term including:
  • the load time,
  • the VSync
  • a big code refactor
  • getting a half-decent recording

2014-01-30

2014-01-30 USB EHCI

"Hmmm, USB Midi devices?  Actually, probably the best thing to use is my USB IO board.  Actually, probably the best thing to do is to stop procrastinating and write the bloody EHCI mass storage code :("

So I'm onto to investigating EHCI (USB2).  I've already spent some time looking through the specifications and one of the most notable differences compared to UHCI is the way that Queue Heads are used.  In UHCI, the Queue Heads are arbitrary structures that are only used by software to organise Transfer Descriptors as it sees fit.  In EHCI, the Queue Heads contain all the information that pertains to a single USB Device Endpoint.  This makes a lot of sense as the transfer descriptors don't need to duplicate all this information each and every time, they can be linked to the Queue Head which will contain it.

Another significant difference is that the Frame List for periodic, scheduled transfers has been separated away from the "reclamation" list of transfers to happen as soon as possible (what it refers to as the Asynchronous list).

If you read the UHCI article, you'll know that I tried to disable the EHCI controller before by adjusting the CONFIGURE flag.  Turns out that I may have missed a fairly vital step from that code that I have implemented this time around.  The EHCI controller has a flag in it to indicate if it is "owned" by the BIOS (for USB legacy support of keyboards and whatnot) and another for if it is "owned" by the OS (which is what we want).  There is a protocol for requesting ownership from the BIOS so the two code-bases aren't competing and confusing things.  That actually explains some of the problems I was having before ...

After writing structures which match the EHCI Queue Head and Transfer Descriptors (called a Queue Element Transfer Descriptor) from the specification, I have used many of same steps as the UHCI code to reset the EHCI controller.  I have also created three Queue Head structures as global variables, one for the Control endpoint of the Flash Drive, and one each for the Bulk In and Bulk Out endpoints that the USB Mass Storage Class specification says it should have.  I have connected these three Queue Heads together in a linked list, programmed the EHCI adapter with the start address and set it running.  My intention is to create Transfer Descriptors as required and add them to the linked list from the relevant Queue Head.

One note about the structures for the Queue Head and the other USB in-memory structures is that they all have to be aligned to specific byte alignments.  I'll admit I may have cheated here and padded each of the structures to a multiple of their required alignment.  This allows me to create an array of each type of structure and for every element of the array to be correctly aligned.

In order to test that the EHCI controller was actually parsing the list of queue heads, I changed some of the Queue Head values to be gibberish and confirmed that the EHCI controller reported an error (tee hee).

The EHCI controller has the same "Port Status and Control" registers, and I have written some code to loop through these and display the values of the various status bits.  This I have used with a single test device to identify which port numbers correspond to which physical ports.

Happy that the Queue Heads are working correctly, I've moved onto forming a few Transfer descriptors to try to get the device descriptor for my test device.

64-bit Woes

At this point, I will spare you many of the details, but there was a lot of swearing and many many hours (days) of reading back and forth through the EHCI specification, my code, a whole load of articles on-line and my book on USB all to try to track down a single issue that had a stupid cause.  The issue was that as the EHCI controller processed the transfer descriptor, it experienced a transfer error (not surprising at the moment) but then the USB controller encountered a "Serious Error" and stopped itself.  With investigation, the transfer descriptor was being copied into the Queue head (which is correct) but in doing so it appeared to overwrite the beginning of the next queue head, which caused the queue processing to fail.  The irritatingly simple answer was that the EHCI controller is 64-bit capable and as such, I need to use the 64-bit version of the Transfer Descriptor, which is longer than the 32-bit one that I had coded.  The adapter copied this extra length, and trashed the following data.

Once that was fixed, progress was swift, and I soon had the device returning its device descriptor.

Device Descriptors

In USB, every device has a Device Descriptor which gives basic information about the device, including the number of configurations it has (almost every device only has one configuration though).  It then has one Configuration Descriptor for each configuration which includes a number of Interface Descriptors and Endpoint Descriptors.  It's important to note here (as it kept me confused for a day) that when you issue the GET_DESCRIPTOR request for each Configuration Descriptor, the Interface and Endpoint Descriptors for that Configuration are returned IN THE SAME REQUEST, then you parse them for the interface and endpoint details.

With all this information, I was able to read the Device Descriptor and read the single Configuration Descriptor (good) which had a single Interface Descriptor (with a class of Mass Storage device, good) and two Endpoint Descriptors (on for data in, one for data out).  The interface descriptor also told me that the device used the SCSI command set to actually read and write blocks which is fine because I used this briefly before when writing an ATAPI driver for CD-ROMs.

At this point, I got bored and rather than writing a proper parser to read this information to set up the queue heads for the mass storage endpoints, I just hard-coded the endpoint numbers and the other parameters.  They won't change for this device and I'm keen to keep up the progress.

Next thing to do is to actually send the SCSI command to the correct endpoint to request the data.  For this, the USB Mass storage specification says that I create a Command Block Wrapper (CBW) structure which wraps the SCSI command for the transfer, and use a Command Status Wrapper structure to receive the SCSI response.  I've played around with a few different SCSI commands, but that only one that seems to not error is the READ(10) command (so called because it is a 10-byte buffer that represents the command), but with this IT HAS READ A BLOCK!!  tada.wav!

One odd thing here is that if I issue a SCSI command that is invalid, the error is reported by the device endpoint halting.  This is really annoying as surely this error should be reported in the SCSI status data.  I don't know if all USB mass storage devices do this, but in my view this breaks encapsulation.  If the SCSI command is invalid or fails, then the SCSI response should indicate this, not the USB layer.

With this success, I have run a few stress tests, and unfortunately it stopped reading blocks after 50 or so, this was a fairly simple cause in that I wasn't free()ing the CBW and CSW structures, so eventually I was running out of memory (I only keep a small amount of memory available to the kernel to discover such memory leaks).

The Block Device Driver

Now that the EHCI controller is capable of reading blocks, I have created my OS's wrapper structure for the USB mass storage device (this structure is the one that provides a consistent interface for all block devices with such functions as ReadBlock() and WriteBlock()).

In debugging this, I encountered another issue I hadn't expected.  Now that I'm testing on real hardware the memory is not clear when the machine boots, instead is contains a good deal of whatever was in it before the last reboot.  This caused some issues where I had been sloppy and not properly cleared memory that I had allocated prior to using it and the code had assumed that unset areas were zero (bad practice, I know).  A quick tweak to the malloc() routine and all memory is being cleared automatically when it is allocated (although I guess I should use calloc()).


===========================================================================
                 Commercial product - do not distribute!
         Please report software piracy to the SPA: 1-800-388-PIR8
===========================================================================
M_Init: Init miscellaneous info.
R_Init: Init DOOM refresh daemon - Error: W_GetNumForName: PNAMES not found!
Exited!


Ok, now what?

Filesystem Testing

When I encountered this error before, it was because the file seeking wasn't working properly and it couldn't seek to the end of the WAD file where the index is.  Rather than continue to use Doom to validate the filesystem code, I thought I'd take a different approach.

I have written a small .NET program which generates a 16Mb file of random data, which it then tweaks so that every 512-byte block of the file checksums to a specific value which is 0xBEEF0000 where the lower 16 bits contain the block number.  Using this file on the key and a checksumming algorithm in the kernel (referred to as the Beef Test), I can quickly test various file operations to make sure they are behaving correctly.

With the aid of this, I was able to track down one of those really-simple-but-hard-to-debug issues.  It was a simple matter of the USB device driver dereferencing a NULL pointer, then using the resultant information as the size to a memcpy() operation, which wrote beyond the allocated memory block, which trashed the memory block header, which made the memory allocator think it was out of memory, which caused an allocation in a completely different part of the kernel to fail and error.

If I get the OS working with any hair left, it'll be impressive.

Anyway, according to my checksum test file, all the file operations from the USB flash drive are working, the system is capable of reading a 16MB file without issue, and there are no memory leaks being encountered.

So let's change the main() routine back to loading the Doom executable and see what happens.

2014-01-12

2014-01-12 USB UHCI

Ok, so on with the USB coding.  I'm starting with the UHCI controller (which is one of the USB 1 standards) as my test machine has a UHCI adapter in it.

Introduction to USB

When you want the USB adapter to perform a data transfer of some sort, you create a structure called a Transfer Descriptor in memory and add it onto the end of a linked list of work.  The USB adapter periodically scans this list and performs any outstanding work it finds.

As well as Transfer Descriptors, USB also has a structure called a Queue Head.  These aren't important as far as what data gets sent to the devices, but they are really useful on the software side to organise and arrange Transfer Descriptors in a sensible manner.  For example, we could have one Queue Head for each device to keep things tidy.

The starting point for all of this (and the first thing we need to configure) is a register in I/O space which holds the beginning of the Frame List.  This is an array of 1024 pointers, each of which is the start of a separate list of work.  Each millisecond, the USB adapter moves to the next entry in the frame list and does all the work.

"Why 1024 separate lists?  Why not just one list?"  I hear you cry.  Sometimes you need to transfer data that is time dependant, such as data from a USB camera or data to a USB sound device.  As one entry of the frame list is examined per millisecond, you can add work to the list that won't be processed until later on.  Using the sound device example, you need to keep the sound device loaded with data at the right time so that it can keep playing, and the USB frame list allows you to queue up the next second's-worth of sound data and forget about it.  USB refers to these transfers as Isochronous (in USB 1) or Periodic (in USB 2).

"But what about transfers that I want as soon as possible?"  For this we create a Queue Head, and we add all this non-time dependent work to this Queue Head.  This makes a list which the USB specification refers to as the Reclamation list.  Each of the frame list pointers then points to the time-dependent-work-for-this-millisecond which then points to the reclamation list.  Frame list pointers with no time-dependent work just point straight to the reclamation list.  This means that for each millisecond, the USB adapter transfers the data that is scheduled for that millisecond, then uses the remaining time to do anything from the reclamation list that it can fit in.

Implementing

So, first off we create a Queue Head structure, and set its pointer to a special value called the TERMINATE value.  This just indicates to the USB adapter that this is the end of the list and there is no more work to do.  Then we allocate the Frame List which is 1024 x 32-bit integers (totalling 4KB) and set each of these entries to point to our Queue Head.  Then we use the USB registers (which are in I/O space for the UHCI adapter) to tell the adapter where the frame list is, and then to set it "running".

At this point, I have verified that the USB controller is running because the Current Frame Number (another of the I/O space registers) is incrementing as the USB adapter parses the list.

Ports

The UHCI registers also contain two "Port Status and Control" registers, one for each of the physical ports connected to the adapter.  These registers allow you to find out if a device is connected, how fast the device is, and allow you to reset the device so that it can be configured.

Unfortunately, these don't seem to be doing anything, I can't enable the root ports or see connected devices.

I had thought from the USB specification that the EHCI (USB 2) controller wouldn't be a problem.  In a system that has both USB1 and USB2 controllers (which is most of them these days), the EHCI adapter is responsible for controlling which ports are routed to which controllers.  If the EHCI controller's CONFIGURE flag is not set, the hardware routes all the ports to the companion controller (the UHCI controller that I'm using).  What I suspect is happening is that the EHCI controller has been configured by the BIOS because I'm using the BIOS' Legacy Support in order to boot from the USB key in the first place, and this is messing things up.

Finding the EHCI controller

I ran the PCI scan again and there is no sign of an EHCI controller, but I pretty sure that the Revo specification says it has USB 2.  I was also able to find a forum post on the web where someone had posted a device list for their Revo R3700 as they were asking for help with an unrelated issue, but it lists the EHCI controller as a function of one of the PCI devices.

After much confusion, I found a bug in the PCI scanning code that I had been using for years.  Some PCI devices have multiple "functions" and these functions are addressed separately when interacting with the device.  Other devices have only one function, but that function responds to all the function addresses for the device (I found this with a RealTek network card before).  In order to find out if a device has many identical functions, or has one function responding to all the addresses, you read part of the PCI configuration space for a flag which indicates if it is a single-function device, which I was doing.  Unfortunately, this bit can be set on functions beyond the first function of a multi-function device, and that was being misinterpreted by my code.  It was getting to the second function of the multi-function device, finding this flag was set, thinking it was a single-function device (which was obviously wrong) and skipping onto the next device.

With that issue fixed, I have located the EHCI controller (and a Wireless LAN controller which apparently fitted to this machine, who knew?)

Disabling the EHCI controller

With the address of the EHCI controller, I quickly skimmed the EHCI specification and wrote some quick and dirty code to find and reset the CONFIGURE flag for the EHCI adapter which should return the control of all of the ports to the UHCI controller.

UPDATE:  Don't do this.  I didn't realise that I also had to "take ownership" of the EHCI controller from the BIOS as well.  I didn't and that caused the system to hang for about ten seconds after I connected or disconnected a device, presumably as the BIOS tried to get the disabled EHCI controller to connect to a port that it didn't own and things got confused.  Anyway, lesson learned.

Moving on from this, I also found that the EHCI controller had six ports, and the three UHCI controllers had two ports each (which makes sense) and I had to work my way through the different UHCI controllers to find the one that had my USB Flash Drive connected (it was the only port with something in).

Resetting the Ports

A bit more USB theory now.  Every USB device is assigned an address by the system and that address is used when sending commands to the device, except that the address is assigned by using a command, so which came first?  Well, when you reset a USB port (by using the Port Status and Control register on the USB controller) the device also resets and then responds to address 0, which is a special address specifically for this task.  Once the device has been reset, you issue a Set Address command to the device at address 0 to give it a number that is not already in use.  This is also why the USB system only supports 127 devices instead of 128, because address 0 cannot (well, "should" not) be used for normal transfers,

Many hours of hacking and confusion later, I have a rough function which resets the port, issues a Set Address to the device, and then sends the Get Descriptor request to read the Device Descriptor  \o/

I will post this function onto the blog once I have cleaned it up.  It is a very rough-and-ready single function that goes through the steps of resetting the UHCI adapter, configuring the frame list, resetting a port, and issuing a GET_DESCRIPTOR transfer to the device.  I'll post it in an attempt to save others the headache of getting to this point.  Just having a function like this, in my opinion, puts a great deal of the specification into context and allows a developer to research a particular flag or variable with the knowledge of where it fits into the process.  On top of all that, having some functional code that can be tweaked can be used to test out various variables to see what effect each of them have, and to compare to another implementation to find differences when one works and the other doesn't.  I've seen forum posts where a new developer has been asking questions about implementing a USB driver and the responses have been to go and look at another driver's source code (such as Linux) which (again, my opinion) is a terrible idea.  They need a clean implementation, not one that requires days of sifting through to see past all the implementation-specific details such as how they allocate the memory or what structures they use to store the device metadata, not to mention tracking through function pointers to work out where they were set and to what? I'm really annoyed that it has taken me so long and at least three attempts to get to this point, and annoyed at the documentation for dramatically over-complicating this basic process, and annoyed at various on-line sources for not having code like this posted already.  Hopefully I can try to remedy that, at least to a small degree.  /rant

I've taken a bit handful of various USB devices and run this code against each of them to read the device descriptors, most of them work but some are reporting HALT errors and some of them are reporting CRC errors.  I guess some of them may not be compatible with USB 1, or my code is just plain wrong somewhere.  :)

In either case, I think I might leave UHCI there and take the knowledge I have gained and look at applying it to a more elegant EHCI driver next as it's likely that will be more use in the long term.

2014-01-09

2014-01-09 USB ... En Garde!

In order to get my OS onto some real hardware for further tests, I've decided to revisit an old enemy, USB.  Before now I have used PXE network booting to get various monolithic test kernels onto hardware for some real world testing, and at least one of my previous test kernels had a UDP based filesystem that it could use to load files from a network server.  Whilst I could use the same approach for the Octopus kernel, I would like to crack the USB nut to be able to use USB Mass Storage devices.  I quite like the idea of having a flash drive that I can plug into almost any machine and boot my OS.

I have looked at USB before, but I have always found a very large amount of USB documentation and specification (much of it only relevant if you're creating USB hardware), and very small amount of good USB example code.  While I appreciate the point that developers shouldn't implement something they don't understand and that asking for example code is a big no-no, a small amount of actual implementation answers a significant number of questions that new developers may have, provides a codebase full of keywords that can be researched in the (significant) documentation, and also gives a working baseline implementation that can be tweaked and experimented with to explore the interface and help debug their own implementation.

All that aside, I had no specific goal to achieve by investigating USB before and many reasons to investigate other things, so the specification documents were filed away under "tl;dr".



Anyway, before we can actually get down to some proper USB implementation, I'll need a system on which to test.  For this purpose, I have set up one of my machines, an Acer Aspire Revo R3700, and dug out a random 4GB USB2.0 flash drive.

Up until now I've been using the Bochs emulator to test the Octopus kernel so far.  My Bochs configuration has two hard drives, the first is a small image file onto which I installed Grub (Grub Legacy), the second is a VVFAT-type hard drive where Bochs takes the contents of a folder on the host OS and emulates a FAT partition which contains those files.  The configuration file for Grub then loads the kernel from the second hard drive.  Grub Legacy has to be installed onto an image-type hard drive because it has to know exactly which blocks contain the rest of its code.

I copied the Grub hard drive image onto the key and this worked beautifully on the Revo, and I set up a second partition on the key by editing the partition table entries (1337 H4XX!).  Unfortunately, it appears as if Windows doesn't fully support partitions on a USB flash drive (no idea why).  It reads the partition table correctly, and handles the first partition correctly, but none of the others show up.

A LOT more tinkering later and I have a 1GB partition on the USB drive with GRUB installed, a copy of my kernel, Windows recognises it, and the Revo boots from it.  (I know it's 4Gb key, but I REALLY don't need that much space yet.)  It's worth noting that the reason it boots is because the BIOS has the capability to load from a USB mass storage device, but as soon as I enter protected mode (which Grub does for me) and the BIOS interrupts are no longer available, I'm on my own.

With this set up in place, I can compile a new version of the kernel in NetBeans, copy the output file onto the USB drive, put it into the Revo, and boot the new changes.  It's a bit more cumbersome than the PXE booting I've used before, but should work on most modern machines without needing a network cable and a second machine to run the TFTP server.

Now to actually start writing the USB driver ...

2014-01-01

2014-01-01 Chunky-To-Planar Conversion

Continuing with the goal of increasing the speed of the rendering to the mode 0x12 screen, I'm looking at improving what used to be called C2P, the Chunky-to-Planar conversion.  This is the procedure which takes the "chunky" or "pixel-packed" data where all the bits encoding one pixel are next to one another, and converting it into "planar" data where where the first bit of each pixel are next to one another, followed by the second bit for each pixel, etc.

In my code, the output from the quantiser is a bitmap of "chunky" bytes where the least significant four bits identify one of the 16 possible palette colours (the upper bits are unused).  To output this to the mode 0x12 screen, I take this data eight pixels at a time and convert it into four bytes where the first byte contains the first bit from each of the eight pixels, the second byte contains the second bit from each of the eight pixels, and so on.  These four bytes can then be put into the back buffer, one in each plane.

The previous routine was written entirely in C, and looped through each of the eight pixels and used bit masking and shifting to populate an array of four bytes, then it loaded these into the planes.  Suspecting that this heavily used piece of code could be more efficient, I thought I'd put in a small blob of assembly to do the heavy lifting.

After a few tests, some experiments, much confusion, very unhappies, such code and many wow, I came up with a mechanism that gives the correct output.  I also unrolled the loop to make the best use of the CPU pipeline (it was only four iterations after all).  The code now uses the assembly RCR instruction to rotate the first bit from the first byte into the CPU CARRY flag, then it uses the RCL instruction to rotate that CARRY bit into the DL register, then increments the pointer to move onto the first bit of the second byte.

The assembly looks something like this (eax starts out loaded with the address of the first pixel):

rcrb   (%%eax)
rclb   %%dl
inc    %%eax

After running these three instructions eight times, each of the first plane bits have been rotated into the DL register, and we can load this into the corresponding plane byte of the back buffer.

The code then repeats the exact same procedure again, although as the pixel bytes were rotated by one bit the first time, this iteration gets all the bits corresponding to the second plane.  Two more iterations and the C2P conversion is complete in 100 assembly instructions (including reloading eax) with no jumps.

The only downside to this process is that the source pixel data is destroyed by the process as the rotation fills the byte with whatever was left in the CARRY flag (actually it's whatever came out of the DL register, so I guess that means that whatever data was in DL before we started will be "P2C" converted back into the source buffer).

With this rotator in place, it has knocked about 300ms off each frame, so (with dithering) it's maintaining 0.99 FPS.  It's good, but it still hasn't broken the 1 FPS mark.



With some more tweaking, I've moved where in process the resolution is doubled.  Because Doom outputs a 320x200 resolution screen, I double this to 640x400 to display on my Mode 0x12 screen.  Before I had the video driver specifically written to double the input buffer by calculating the required byte offset for each pixel.  I wasn't keen on that because it was a special case in the kernel, even if this is the only thing it does yet.

I've moved this into the Doom renderer itself so that Doom sends me a 640x400 bitmap that the video routine can just iterate through to quantise and render.  This has increased the frame rate to 96 ticks per frame = 1.04 Frames Per Second  \ o /



Moving on, I'm looking at the mechanism used to perform the dithering.  Currently the procedure calculates the error for each component into a set of signed 8-bit integers, then adds these onto the next pixel before quantising (see the "Displaying Images in 16 Colours" article for more information).  This is all done in C and takes nearly half a second per frame on the emulator to perform.

I'm wondering if I can perform the arithmetic on the 32-bit 0RGB integer instead.  If I set the least significant bit of each byte ( data | 0x01010101 ) before performing a 32-bit subtraction so that a "borrow" in one component will not affect the next, then clear the least significant bit of each byte ( data & 0xFEFEFEFE ) before performing the add to prevent a carry from one component overflowing into the next, I should be able to calculate the three component error values in only a few operations without having to extract each byte from the 32-bit value.

With this new carry mechanism in place, the frame rate is up to 1.6 FPS, but the dithering is not correct, so I may have to look into this more.

UPDATE:  As I'm sure you were all quick to point out, this logic is fatally flawed.  This doesn't work when a component plus its carry equal more than 255 because it doesn't clamp the value at 255, it just wraps around and generates a very low value instead.  I'm also not convinced that the byte-wise negative numbers are going to work either.  I have abandoned this for now until I can come up with an efficient way to address these issues.



With the dithering disabled, the new C2P routine and with Doom doubling the output gives a jaw-dropping 1.96 FPS.  It almost looks fast enough to play, except without dithering, almost everything quantises to black  :(

I'm happy with the C2P routine, but the other two seem to give only minor improvements.  I'm wondering about trying to get this onto real hardware to see how much of the performance is the Bochs emulator on my machine.