2016-09-24

2016-09-24 More Multitasking

A significant feature of many operating systems is the ability to multi-task.  You may be running Calculator and Sound Recorder and Wordpad and Sticky Notes and Doom and Paint and Command Prompt and Clippy, but they aren't actually running at the same time in the processor.  Instead, the processor is very quickly switching between them (with each running for maybe a few microseconds at a time) to make it appear if they are happening at once.

My multitasking system will be what is called pre-emptive multitasking, which is where the OS uses interrupts to pause the running programme in order to switch to the next programme.  This means that the programmes can be written as if they're the only programme on the system, the kernel takes care of the heavy lifting, and the programmes aren't usually aware that they were even interrupted.

Before I could design the multitasking system (or even fully understand a lot of the tutorials and articles on designing a multitasking OS) I had to go back and do a lot more research to get a solid footing in the concepts (hence the confused previous post).

Interrupts and the TSS


I spent four days at the beginning of this week trawling through the Intel Processor manuals reading up about the hardware multitasking system that the i386 architecture provides including the TSS, the specifics of CPL, DPLs, and RPLs, how interrupts work and what the IRET function does.  I knew a lot of this already but this was a whole new level of detail, altogether.

The i386 provides a hardware multitasking capability which uses what it calls Task State Segments (TSSes) in order to store the state of each task that might be running.  This feature is only used by a few OSes because it means the OS would have to be significantly changed in order to run on another platform.  Instead, many OSes use what is referred to as Software Multitasking where the kernel developer does a lot more of the work.  However, on the i386, software multitasking still requires at least one TSS (hardware task state) in order to use software multitasking.

The Linux kernel v0.01 used i386 hardware task switching, with each task having a TSS and an LDT (Local Descriptor Table), but it was limited to 64 tasks.

Why do I need a Task State Segment (TSS)?


The short answer is that Intel says I do.

The longer (and more useful) answer is that as soon as the system starts running protected mode code outside of Ring 0 (the most privileged mode of operation), it will need a TSS in order to handle interrupts.  If the whole OS is written to run in Ring 0 (as mine was until recently) then it doesn't need a TSS, but then any code that runs on the system has full privileges and can potentially change anything it wants.  As soon as code is running in the lesser privilege modes (Rings 1, 2 or 3), then it needs a TSS or it will not be able to handle interrupts.

When running in a lower privilege mode, such as Ring 3, the processor is running code from a Ring 3 code segment (indicated by the code segment descriptor having a DPL of 3) and the processor stack is pointing to a Ring 3 stack segment.  In order to jump to code running in any other privilege level (such as an interrupt handler which normally runs in ring 1 or 0), the processor must also change to a stack segment that is the same Ring level as the code.  If it didn't, the low privilege task could potentially change the stack (and therefore parameters, local variables, and return addresses) of the privileged code, causing a massive security flaw.  In order to do this automatically, the processor requires a TSS to have been set up by the kernel which defines a stack segment for each ring level that may be required.  When the processor changes ring level, it will load the stack segment for that ring from the TSS and carry on.

Interrupt handlers frequently run with greater privileges (lower numbered rings) in order to be able to interact with IO ports, update kernel tables or the like.

A quick note about multi-processor systems, because all the processors could potentially be running interrupt handlers at the same time (or other code which requires a change of ring level, such as traps or gates or whatnot), they each need a different stack so that they don't tread on each others' toes and have a bad time.  It's common therefore to use one TSS for each processor, each pointing to a different stack.

Interrupt Worked Example


So our processor is currently running a lowly ring 3 task, such as playing Tic-Tac-Toe against itself, when an interrupt arrives.  Here's what happens:


  • The CPL of the currently running code (more specifically the DPL of the current Code Segment) is 3 (least privileged) in Ring 3.  This means that the current processor stack will be in a Data Segment that is also in Ring 3 (has a DPL of 3).  If you were to read the CS or SS registers, the lower two bits would contain the value 3, indicating the ring number.


  • Moooooooooo!  (That's the interrupting cow, signalling an interrupt has occurred).


  • The kernel interrogates the IDT (Interrupt Descriptor Table) to find the Interrupt descriptor.


  • It reads the DPL of the Interrupt Descriptor and finds that it is set to 0, indicating the the Interrupt Service Routine it points to should be run with full privileges.


  • The processor will not use the current Ring 3 stack for processing the Ring 0 interrupt handler.  Instead, the processor performs what is called a "Stack Switch" to AUTOMATICALLY change to a stack that is in a Ring 0 Data Segment (a segment with a DPL of 0).


  • To do this, it goes to the current TSS, and loads the TSS' SS0 value into the Stack Segment (SS) register, and the TSS' ESP0 value into the ESP register.  It uses the "0" versions of these values because (in our example) it's going to be running a Ring 0 Interrupt handler.  If our interrupt handler were in ring 1 or 2, it would use SS1:ESP1 or SS2:ESP2 instead.


  • As soon as the processor has performed the stack switch, it needs to know what stack it was using before the switch so that it can switch back, so it immediately pushes the previous values of SS and ESP to the new Ring 0 stack.


Now the processor continues to process the interrupt just the same as if no stack switch occurred.  It pushes the EFLAGS register, then CS and EIP registers so that it knows where to go back to after the interrupt, and pushes an Error Code if this interrupt generated one.
Then the processor calls the Interrupt Service Routine.  The ISR does ISR-like things such as acknowledge the interrupt, deal with the device, handle the page fault, whatever.

When the ISR finishes, it must have restored the Ring 0 stack so that the ESP register points to the EIP value that was pushed, then it calls the IRET (return from interrupt) instruction.


  • The processor looks at the CS value that was pushed to the stack earlier to determine the Ring number of the code that was interrupted.  This is again in the two least significant bits in the CS value.


  • If the Ring number of the code that was interrupted is the same as the ring number of the interrupt handler (they have the same privilege level), the processor concludes that no stack switch occurred, so it performs a normal IRET, it restores the EFLAGS from the stack, then jumps to the CS:EIP value from the stack.


  • If the Ring number of the code that was interrupted was different to the interrupt handler, the IRET performs an additional operation, which is to undo the stack switch by restoring the SS and ESP from the Ring 0 stack and going to back to the Ring 3 Stack at the same time as it returns to the Ring 3 Code Segment.


Notes:
The SS0 and ESP0 values from the TSS do not get changed by these operations, the Ring 0 stack is left all ready to process the next interrupt.
Nothing gets pushed to the Ring 3 stack at all during the interrupt event.
If you don't have a TSS, and an interrupt or call gate or the like requires a change to a different Ring level, the processor will have a bad time.

Designing Multitasking


Armed with this knowledge, I started designing the multitasking system.  I identified three scenarios in which the processor could be when an interrupt happens:


  1. The Processor could be running a kernel thread in Ring 0.  When the interrupt occurs, there will not be a stack switch.
  2. The Processor could be running a user thread in Ring 3.  When the interrupt occurs, it will automatically switch to the stack pointed to by the TSS.
  3. The Processor could be running a user thread which is performing a system call, and is therefore in Ring 0 code with a Ring 0 stack.  When the interrupt occurs, there will not be a stack switch.


For my multitasking, the interrupt handler (actually all interrupt handlers) will push the bulk of the processor state to the stack (specifically the kernel stack, either the one for the current thread, or the one pointed to by the TSS).  From here, I take a pointer to this state and then if I need to perform a task switch, I can copy the state from the stack to the process table (which is kept simple by using the same structure in both).  I can then use the stack to restore the state for the new task, then complete the interrupt handler which restores the state and IRETs back out.

If I'm using a kernel thread, or a process thread in a system call, it will already have its own stack in kernel memory which is specific to that task, so I save some time by leaving the task state in its own stack, ready for next time.  I do copy it to the process table for completeness, but I don't copy it back.

If I'm starting a new thread, I will need to create a structure on the stack that is as the interrupt handler return expects.  This is quite easy because I can memcpy() most of it from the process table, then set up a few other values manually.

Implementation


This part took less time than I expected, even with the inevitable debugging it required.  The biggest problem that had me tearing out my hair was that I hadn't assigned enough stack space to each task, so they all overflowed their stack and overwrote the Process Table entry for the previous process.  I wouldn't have expected an infinite loop in C to have required more than 512 bytes of stack.

I started by allocating a TSS structure for the kernel, adding it into the GDT, and then selecting it as the current TSS with the "LTR" (Load Task Register) instruction.  I half expected the LTR to cause an implicit jump to whatever the TSS has in its EIP value, but apparently there's a special case if the Task Register is null before an LTR that it doesn't.  I allocated some space for an interrupt stack in kernel memory (Ring 0), and set the corresponding value into the SS0 and ESP0 values in the TSS.  With this in place, the kernel should be able to handle an interrupt correctly if its running in Ring 3.

Next I created an infinite loop that I would run in Ring 3 (infinite loops are much better in ring 3).  Apparently, the Intel architecture is quite odd here in that you can't just say "be in Ring 3 now", you have to create a stack frame that looks exactly like an interrupt frame, then use IRET to "return" from that frame to a different Ring, just as if a Ring 3 task had been interrupted.  It sounds like more of a pain than it is, ultimately it's a single function that pushes the correct values, then uses some assembly.

With all that in place, my kernel main() started up, set up the TSS and interrupt stack, then switched to Ring 3 and went into an infinite loop.  With that in place, I could see that the Ring 3 loop was being correctly interrupted (by the keyboard and timer events and whatnot), the kernel was switching to Ring 0, handling the interrupt, then returning to Ring 3.

\ o /

(I would normally include a picture here to show the OS doing the thing I've just talked about, but it's just an infinite loop, it doesn't make for a good picture.)

No comments:

Post a Comment