
go may require prefaulting mmap

Trying to go too fast may be slow.

mmap

I was reading a bunch of modestly large files, and thought, hey, I could mmap these files to speed things up. In general, mmap can be faster because the kernel doesn't need to copy any data into your memory; it just appears there. Even if the kernel uses fancy page flipping techniques to turn read into invisible mappings of the page cache, those mappings have to be more complex because changes should not be visible. A shared mmap mapping says, sure, it's okay if it changes (I know it won't).

Also, a language like go wants to zero a buffer as soon as you allocate it, so we waste a lot of time on memset. A fresh allocation may be lazily zeroed, but eventually it gets garbage collected and recycled, and then it does need zeroing, even if we know we're only going to read into it. Generally, manual control over the mmap isn't hard, the required lifetime of the file data is well understood, and we can save the garbage collector the trouble of accounting for our big blobs.

go exposes mmap via the syscall.Mmap wrapper. There are also a dozen other “friendly” wrappers, though I found the syscall package version quite sufficient. Easy to use, let's do it. Done.
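A minimal sketch of what that looks like (the file name is made up, error handling abbreviated):

    package main

    import (
    	"fmt"
    	"os"
    	"syscall"
    )

    func main() {
    	f, err := os.Open("data.bin") // hypothetical input file
    	if err != nil {
    		panic(err)
    	}
    	defer f.Close()

    	fi, err := f.Stat()
    	if err != nil {
    		panic(err)
    	}

    	// Map the whole file, shared and read-only. The returned []byte is
    	// the file contents; pages are faulted in on first access.
    	data, err := syscall.Mmap(int(f.Fd()), 0, int(fi.Size()),
    		syscall.PROT_READ, syscall.MAP_SHARED)
    	if err != nil {
    		panic(err)
    	}
    	defer syscall.Munmap(data)

    	fmt.Println(len(data), "bytes mapped")
    }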

faults

The problem arises when we take page faults. If the page isn't resident, accessing it will stall the current thread until it's loaded. go mostly assumes this never happens.

The go runtime uses a M:N thread model, where it tries to pack many userland threads/goroutines onto only a few kernel threads, and switch between them when it predicts they may block. It gives the illusion of full concurrency. But if the prediction is wrong and somebody blocks unexpectedly, the process can grind to a halt.

This is either surprising or completely obvious.

What we need is a way to make sure the data is in memory. Or if it's not, a way to load it that doesn't upset the go scheduler.

prefault

The naive approach would be to walk the memory in another goroutine, to prime it. But that’s exactly the problem. go will not expect that to block, and won’t schedule a kernel thread to do it.
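For concreteness, a sketch of that naive version, assuming data is the mapped slice from above; the sink variable is only there to keep the compiler from discarding the reads:

    var sink byte

    // naivePrefault touches one byte per page from a goroutine. The reads
    // still fault, and the faults still stall whatever kernel thread the
    // scheduler happened to run this goroutine on.
    func naivePrefault(data []byte) {
    	go func() {
    		var sum byte
    		for i := 0; i < len(data); i += 4096 {
    			sum += data[i] // force a real read of each page
    		}
    		sink = sum
    	}()
    }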

I came up with a workaround: write the data to /dev/null, as in devNull.Write(data). This should fault in all the data, and go knows that write is a blocking operation, so it will make sure that a dedicated thread takes the hit. But now we're copying the data back into the kernel, and the idea was to eliminate some copies. We still get some other benefits, like better memory management, but oof.
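A sketch of the trick, using the os package; the function name is mine:

    // prefaultDevNull writes the mapping to /dev/null. Every page has to be
    // read to be written, so the whole file gets faulted in, and since Write
    // is a system call, go parks the goroutine and keeps scheduling others.
    func prefaultDevNull(data []byte) error {
    	devNull, err := os.OpenFile("/dev/null", os.O_WRONLY, 0)
    	if err != nil {
    		return err
    	}
    	defer devNull.Close()
    	_, err = devNull.Write(data)
    	return err
    }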

Or we can use a C function. Just take a peek at every 4096th byte to get the pages into memory. go knows that C functions can do anything, so this will also get a dedicated thread. And probably faster than a system call.
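A sketch of the cgo version; the 4096 page size is hardcoded here and the names are made up:

    package prefault

    /*
    // Touch one byte per page so the kernel faults the whole mapping in.
    static void touchpages(const unsigned char *p, long len) {
    	volatile unsigned char c = 0;
    	long i;
    	for (i = 0; i < len; i += 4096)
    		c = p[i];
    	(void)c;
    }
    */
    import "C"

    import "unsafe"

    // Prefault hands the mapping to C. go knows a C call can do anything,
    // so it gets its own kernel thread and the page faults don't stall the
    // scheduler.
    func Prefault(data []byte) {
    	if len(data) == 0 {
    		return
    	}
    	C.touchpages((*C.uchar)(unsafe.Pointer(&data[0])), C.long(len(data)))
    }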

Or ignore the problem. If we’re going to write the data into something like a socket, that’s already a system call, and it’s okay to fault in data during that operation. We only need to prefault if we’re going to be accessing the memory in our go code.

Or really ignore the problem. The data is probably hot, right? And SSDs are fast, right? How bad can it be?

reading

Everything old is new again. More discussion of trying to achieve nonblocking I/O with mmap, prefaulting via a helper process, can be found in the Flash web server paper.

Posted 28 May 2025 18:17 by tedu Updated: 28 May 2025 18:17
Tagged: go