2Q buffer cache algorithm
Since the dawn of time, the OpenBSD buffer cache replacement algorithm has been LRU. It’s not always ideal, but it often comes close enough and it’s simple enough to implement that it’s remained the tried and true classic for a long time. I just changed the algorithm to one modelled somewhat after the 2Q algorithm by Johnson and Shasha. (PDF)
LRU is simple enough it doesn’t require much explanation. Keep a list of all buffers. Whenever you use one, put it on the front of the list. Whenever you need a new (recycled) buffer, take it from the end of the list. Those are the oldest, least recently used buffers. In high level terms, the current working set is at the front of the list and the previous working set is fading away off the end. It’s responsive to changes in the working set, very quickly replacing old unused buffers with the latest. In other words, it has a short history; it’s not “sticky”.
This strength is also a weakness. LRU is particularly vulnerable to erroneous replacement during scanning; i.e. it’s not scan resistant. After weeks of uptime, your system is running great and everything you need from disk is in cache. Then you download a few Linux ISOs to see what the fuss is about. Suddenly, even simple commands like ps need to be read in from disk. Seek, seek, thrash, seek, grind. Should have bought an SSD. What happened? Those ISOs were so large that they pushed everything out of the disk cache. (The obvious optimization, to only cache reads and not writes, doesn’t help here because you’re equally obviously going to run sha256 and checksum those ISOs, leading to the same result.)
This problem has been known for long time, but thanks again to the short history rarely causes permanent damage. And caches with history are often harder to implement due to the additional state tracking required, and they can even suffer from too much history. After weeks of uptime, the exec pages for a command like ls are going to be pegged in memory. What if you catch color fever and switch to using colorls? Sorry, those plain ls buffers aren’t going to be recycled. Eldritch buffers are eldritch.
At the hackathon I decided to look into the problem of scan resistance a little more. A hundred CS papers have proposed better algorithms; I settled on 2Q. Or at least something inspired by it, as the final result is more or less different depending on your squint.
Reading the paper is the best way to understand exactly what they’re doing. I’ll just summarize the highlights that I thought were applicable. The main insight is that in addition to LRU’s current working set and previous working set, there’s a third working set: the long term working set. We start by assuming that a buffer is in the current working set, then quickly move it to the previous working set (or sets; they keep quite a long history). If we need that buffer again, that’s a hint that it’s really in the long term working set. To conserve memory, 2Q doesn’t actually keep the previous working set data cached in memory. Only the addresses are retained which is enough to determine when a cache miss is new data vs previously seen data.
The 2Q algorithm is scan resistant. The memory reserved for current working set is kept quite small; most of it is dedicated to long term. This keeps transient data out. At the same time, the long term list is itself managed as LRU, so it’s responsive to changes. All good.
Earlier, I refactored the buffer cache interface so that the line between the buf cache and buf midlayer was better defined. The bufcache family of functions does all the work now, allowing us to make algorithm changes in one place. At the time I was experimenting with a different LFU style algorithm, but it was sufficient to prove the interface would work. 2Q represents the first real world test.
not quite 2Q
The code in OpenBSD resembles 2Q, but contains some significant differences as well. In large part, this is because I had another priority: ease of implementation. 2Q keeps a history of past buffer addresses to know when a miss was previously in the cache; bufcache doesn’t provide that capability in its current form. There were some other obstacles as well. The result is that I tried to implement the simplest variation I could, while also considering future plans. There are comments in the code as well, but I repeat the explanation here.
We retain the key three working set distinction. In the OpenBSD code, they are named hot, cold, and warm, and each is an LRU queue. New buffers start hot. They stay that way as long as they remain on the hot queue. Eventually, a buffer will slip from the end of the hot queue onto the front of the cold queue. (We preserve the data, not just the address.) When a new buffer is needed, we recycle one from the tail of the cold queue. The oldest and coldest. If, on the other hand, we have a cache hit on a cold buffer, it turns into a warm buffer and goes to the front of the warm queue. Then as the warm queue lengthens, buffers start slipping from the end onto the cold queue. Both the hot and warm queues are capped at one third of memory each to ensure balance.
Scan resistance is achieved by means of the warm queue. Transient data will pass from hot queue to cold queue and be recycled. Responsiveness is maintained by making the warm queue LRU so that expired long term set buffers fade away.
In some ad hoc md5 some large files testing, I confirmed the contents of the bin directory remained in cache when they otherwise would not. Other interactive responsiveness testing went well, too. The primary goal had been to achieve some degree of scan resistance: mission accomplished.
The 2Q paper results are based on replaying traces in a simulator, which is the kind of thing you have to do to get papers published, but which also means they didn’t have to integrate with an existing system which anybody used. I’m not trying to publish a paper, so my results are lacking, but I did have to weld this thing into the existing kernel’s buf layer, which is why it didn’t come out quite the same. I’m all about rigor, as long somebody else does the work.
The numbers and ratios for the various queue lengths may need tuning. The current design and numbers were selected in part to minimize regressions, not maximize performance. One hopeful change is to finally add support for highmem on amd64 systems. The same algorithm that works to balance between hot and warm queues will work to balance between low and high mem. Now at least we have a launch point for future efforts to build upon.
A name? TU-Q?