OpenBSD 5.3, now with nscan
OpenBSD 5.3 is coming soon, the order page is up now. I don’t think I made a single commit for this release, which represents a new low. At least things can only get better, I already made a few minor commits for 5.4. One feature that I did hack on but which may go unnoticed because it’s largely invisible is a new disk I/O sorting algorithm. beck@ was the driving force in testing this and pushing me to write the code he wanted. :)
OpenBSD has used the same elevator scan algorithm ever since the days when it was just called BSD. The concept is simple. To minimize seek times, we sweep back and forth across the disk, from the start cylinder to the end cylinder, issuing requests. As the file system code makes new requests, we insert them in order into the queue. This works great except for the possibility of starvation. If you try to read something from the other end of the disk, you have to wait for the disk to work it’s way over there, which can take a while (or longer) if another process is making requests in between. (Even if the other process makes a request chronologically after you, it will be processed first if it’s closer on disk.) This problem only gets worse as the buffer cache size increases, leading to longer and longer chains.
The code for disksort in sys/kern/subr_disk.c isn’t super complicated, but it does have its share of subtleties, not to mention relying on directly manipulating list pointers and two key variables named bp and bq.
For a while now, OpenBSD has abstracted the disk queuing mechanism via a layer called bufq. Before that, each disk driver directly called disksort. The theory was to allow changing the queuing algorithm, though in practice bufq just called disksort. But now we can finally put bufq into effect. We added a new discipline called nscan, made it the default, and presto, all the common disk drivers took advantage of it automatically.
NSCAN, or N-Step-SCAN, or whatever you want to call it, is a combination of FIFO and SCAN. We use similar logic as in disksort to manage a short queue of requests in cylinder order. When that queue empties, we refill it from a FIFO (first in, first out) queue. The combination means that a request can’t languish at the end of a super long sorted queue that will take ages to process and continues to grow. (If you’re at the end of the line, you still have to wait, but nobody will jump in front of you.) Every 128 (in the current implementation) requests are sorted into order, to prevent the disk from thrashing back and forth.
The code for nscan lives in sys/kern/kern_bufq.c and is, if I’m allowed to brag, a lot easier to understand than disksort. It uses regular list macros instead of direct pointer manipulation and separates the algorithm logic from the bit banging that makes it happen. During development, we were able to rapidly experiment with variations on the number and size of queues just by making a few tweaks.
Combined with some other changes to limit the number of pending writes, copying a large file to a slow disk will no longer cripple your workstation. The bufq experiment finally paid off, allowing us to drop in a new algorithm without major driver surgery.
Bob is also giving a talk at BSDCan about the buffer cache.