
parallel tree running

For the first version of where’s all the code I did a simple tree walk to find all the code, one file at a time. But I wrote it in go, a supposedly concurrency friendly language, so it seemed the obvious thing should be to walk the tree in parallel. This went great, until it didn’t, but then it ended up okay.


We want to walk a tree in parallel. We don’t know how big or deep it is. We will discover new nodes of unknown branchiness as we go. As we progress, we will end up performing two types of work. Sometimes we find a directory and recurse deeper. Sometimes we find a file and have to count the lines.

Ideally, our code will balance IO and CPU demands. I would also prefer that it not completely hammer my computer. I’m already running a web browser. I don’t need a second application that behaves like I bought my computer for the sole purpose of running it.

As a bonus, it’d be nice if we have some way to balance tasks. Like two tree walkers and two line counters. Or decide in which order things get done. Generally, more control over what happens would be better than leaving it to the whims of some other component. If you know me, you’ll know that bad things happen when magic is involved.


The naive go approach is to use the go keyword to launch a goroutine whenever we find a new file to process. This is bad because I don’t want 9000 threads jammed up in the stat syscall at the same time. The way I structured the code, this probably wouldn’t even work because it would crash before it even managed to open 9000 files.

The next obvious idea is a limited set of worker threads reading from a channel. Every new work item is sent into the channel, and the workers dig through them. This kinda works for small cases, but eventually it hangs (and then dies when the runtime's deadlock detector notices) because all the worker threads are trying to send more work into the channel and nobody is left to read from it. Then people try to fix this with more buffering, but that only delays the inevitable.
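A stripped-down sketch of the failure. One loop plays both consumer and producer against a bounded channel, standing in for the whole pool; the fanout numbers are invented. To keep it runnable it uses a non-blocking send and counts the sends that would have blocked; with a plain blocking send, a real set of workers all piles up on that send and deadlocks.

```go
package main

import "fmt"

// drain consumes from a bounded channel while also producing into it.
// Every node below maxDepth queues three children, like finding a
// directory and queueing its contents. The select/default makes the
// full-buffer condition visible instead of hanging on it.
func drain(bufSize, maxDepth int) (processed, wouldBlock int) {
	work := make(chan int, bufSize)
	work <- 0 // seed with the root
	for len(work) > 0 {
		depth := <-work
		processed++
		if depth >= maxDepth {
			continue
		}
		for c := 0; c < 3; c++ {
			select {
			case work <- depth + 1:
			default:
				wouldBlock++ // a blocking send here is the hang
			}
		}
	}
	return
}

func main() {
	p, b := drain(4, 5)
	fmt.Println("processed", p, "sends that would have blocked", b)
}
```

Note the buffer fills up almost immediately; no reasonable buffer size outruns exponential fanout.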

I guess I wasn’t upfront about the requirement that work consumers may also be work producers, but come on. Everybody knows the more you get done the more there is to do. Of course our solution needs to support that.

very smart

The very smart way would be to have a limited set of threads running, sending work back and forth over channels, as above, but with the work queued in an unlimited size work list. We add one more management goroutine to handle all the busy work of delegating work to and fro. I like this idea because the central dispatch thread gives us a nice place to add visibility into what’s happening. How much work is left, how much completed, etc.
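A sketch of that shape, with a synthetic tree and names (runPool, results) of my own invention: one loop owns an unbounded slice, the workers only ever see channels, and an in-flight counter tells us when all the work, including work produced by work, is finished.

```go
package main

import "fmt"

type item struct{ depth int }

// runPool runs a fixed number of workers against a central dispatcher.
// The dispatcher owns the queue outright, so it always knows how much
// work is pending, in flight, and done. Each item below depth 3 yields
// two children, standing in for recursing into a directory.
func runPool(workers int) int {
	queue := []item{{0}}
	out := make(chan item)
	results := make(chan []item)
	inflight, processed := 0, 0

	for i := 0; i < workers; i++ {
		go func() {
			for it := range out {
				var children []item
				if it.depth < 3 {
					children = []item{{it.depth + 1}, {it.depth + 1}}
				}
				results <- children // report back, possibly with more work
			}
		}()
	}

	for len(queue) > 0 || inflight > 0 {
		if len(queue) > 0 {
			select {
			case out <- queue[len(queue)-1]: // pop from the back, it's cheap
				queue = queue[:len(queue)-1]
				inflight++
			case children := <-results:
				inflight--
				processed++
				queue = append(queue, children...)
			}
		} else { // nothing to hand out; wait for stragglers
			children := <-results
			inflight--
			processed++
			queue = append(queue, children...)
		}
	}
	close(out)
	return processed
}

func main() {
	fmt.Println(runPool(4), "items processed")
}
```

The queue can grow without bound, but it's a slice of small items, not a pile of parked goroutines.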

I decided on a simple limit of four workers. I think that gives the right balance between speeding up and not being abusive. If I have a 64 core threadripper workstation, is the goal to launch 256 threads here? (64 x 2x HT x 2x magic scale factor)

So I built this, and ran it a few times to see the speed up, but noticed that sometimes the total file count was changing. Ran it a few more times to check, and woah, one time a thousand files disappeared. This is very bad. Rechecked all my queue code. Looked good. Added some of the aforementioned counters, and everything seemed correct. We were doing all the work, but sometimes a different amount of work showed up. Very weird.

Spent some more time pondering the problem, but was unable to find the error. This approach is clearly very smart. I programmed myself right past the debug horizon.

magic token bags

Another approach to concurrency control in go is the magic token bag, which go calls a buffered channel. We start up infinitely many goroutines, but before they are allowed to do any work, they must get a token. And then when finished they put it back. Nobody gets to go on the CPU ride without a token, and so we can limit concurrency by only filling the bag with a limited number of tokens.

tokens := make(chan bool, 4)
for i := 0; i < cap(tokens); i++ {
    tokens <- true // fill the bag
}
for _, w := range items {
    go func(w item) {
        <-tokens // no work without a token
        process(w)
        tokens <- true // put it back
    }(w)
}
I think it’s more common to switch the channel send and receive operations, so you’re actually sending before doing the work, but I find it conceptually easier to reason about it this way. You have to invert the metaphor to cramming into a phone booth or something the other way.
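For comparison, a runnable sketch of the inverted idiom: send into the buffered channel before working, receive after, with the buffer size as the concurrency limit. The names are illustrative, and the peak counter is only there to show the limit actually binds.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// processAll runs n trivial work items with at most limit of them
// active at once. Returns how many ran and the peak concurrency seen.
func processAll(n int, limit int64) (done, peak int64) {
	sem := make(chan struct{}, limit)
	var cur int64
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem <- struct{}{} // cram into the phone booth
			c := atomic.AddInt64(&cur, 1)
			for { // record the high-water mark
				p := atomic.LoadInt64(&peak)
				if c <= p || atomic.CompareAndSwapInt64(&peak, p, c) {
					break
				}
			}
			atomic.AddInt64(&done, 1) // the "work"
			atomic.AddInt64(&cur, -1)
			<-sem // step back out
		}()
	}
	wg.Wait()
	return done, peak
}

func main() {
	done, peak := processAll(100, 4)
	fmt.Println(done, "items, peak concurrency", peak)
}
```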

I have also heard magic token bags referred to as semaphores.

Anyway, I reverted to the single threaded tree walking code, stuck a few go calls around the recursive calls with some tokens, and got a nice speedup while still counting all the files every time. So fine, it works. It was simpler to write, and rather less code than the previous queue approach.

But I still don’t like it. Across even a modestly large tree structure, this will create thousands of pending goroutines. True, they won’t be hammering on the kernel all at once. And a goroutine is cheap, but it’s not free. A 2KB stack times thousands of goroutines is... okay, a few megabytes of memory. But the memory required for a work item is only about 16 bytes. It kills me to be wasting 99% of memory, even if it’s small in absolute terms. Maybe someday it won’t be.

go tools

I decided to take a look at what some similar tools do. Walking through a directory tree is a pretty well explored space. And everybody wants max nerd points for being the fastest.

godu uses a magic token bag called c.

gdu does the same, but nicely calls it concurrencyLimit. Less nicely (IMO) it uses a channel of struct{}, which I find kinda ugly. My efficiency fetish only goes so far. I’ll pay the overhead to send bools around. I know this is kinda idiomatic; I just don’t like looking at it.

So it’s a fine approach, but as stated, not my favorite. The swarm of goroutines is quick and easy to write, but it’s hard to inspect.

I just found scc which is a much more advanced line counter than watc, and it uses a custom thread pool as well.

rust tools

How do people approach this problem in rust?

diskus uses rayon, a thread pool implementation, with an explicitly sized pool. There are comments in main.rs discussing why they believe 2x CPU count is the right mix for cold and warm cache runs.

dust uses rayon with the default thread pool.

I couldn’t immediately determine what dua-cli does.

So thread pools for the win.

thread pools

If you search the google for “go thread pool” the first hit is Go by Example: Worker Pools. And it’s not great. It uses a fixed size buffered channel for storing work. In the rather common scenario where the consumers are also producers, this will eventually deadlock. But it looks like it will work in toy examples.

Two other projects on github are threadpool and tunny. Neither appears to be very much code, or have any dependencies.

One thing I thought was kinda clever was the worker sending a channel back to the dispatcher to receive more work. I just have a single outgoing work channel that workers read from, but this is interesting. From threadpool:

workerPool  chan chan interface{}

case job := <-t.jobQueue:
    // Got job
    func(job interface{}) {
        //Find a worker for the job
        jobChannel := <-t.workerPool
        //Submit job to the worker
        jobChannel <- job

Alas, this deadlocks if the workers pass more work back. Nobody will announce availability via workerPool and the worker side send to jobQueue will block. Actually, it won’t, because there’s a check for a full channel, and then you get an error back, but what then? In a tree traversal, I guess you can then recurse yourself, but now you have to add queue full logic.

Abstractly, I don’t think immediate recursion is always viable. Even in tree walking, you have the directory file descriptor open. If you keep recursing, you may run out of descriptors before hitting the bottom. Always adding work to the queue for deferred processing means less resources tied up in half finished work.
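The queue full fallback mentioned above looks something like this sketch, with submit and the job type my own invention:

```go
package main

import "fmt"

// submit tries to hand work off to the bounded queue; if the queue
// won't take it, the caller processes it inline, effectively recursing
// itself. This keeps things moving, at the cost of recursion depth
// (and open descriptors) you no longer control.
func submit(jobs chan string, process func(string), job string) {
	select {
	case jobs <- job: // someone else will get to it
	default:
		process(job) // queue full: do it ourselves, right now
	}
}

func main() {
	jobs := make(chan string, 1)
	jobs <- "root" // queue is already full
	handled := 0
	submit(jobs, func(string) { handled++ }, "child")
	fmt.Println("handled inline:", handled)
}
```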

People just really love go channels. I think they have their place, but there’s quite some edge cases.


Freshly inspired with confidence that it must work, I once again built the original thread pool, the one that didn’t work before, and once again it didn’t work. But I refused to give up, and spent some more time looking at what was happening.

Long story short, the queue was fine, but the walk code was sending it incomplete larval work items. Due to a quirk of how I was assembling the tree, the node for “src/sys/kern” was initially created as just “kern”, and didn’t earn its prefix until a little bit later. Later, as in after it was submitted to the work queue. Since the “kern” directory doesn’t exist, no files were found or scanned.

Fixing that bug and everything worked just the way it should.


The swarm of goroutines approach appeared to work because the go scheduler picked goroutines to run in an order that allowed the work items to be finished up after submission and before processing. The bug was still there, and to be clear, it was definitely my bug, but it was masked by implementation detail happenstance. The thread pool I wrote pulled from the back, because that’s the cheapest way to pop items from a slice.

One of the things I like to do whenever building a FIFO or LIFO, if the order doesn’t really matter, is add the option to reverse it periodically or randomly. This really helps uncover these hidden bugs. But you can only do this if you control the run queue, which the go scheduler takes away from us.

I wrote the code to silently handle missing files and incomplete tree nodes because it seemed like a “clean” and “elegant” way to handle various edge cases. Don’t want to crash if a file is deleted mid run. Have to handle the root directory with no parent case somehow. Etc., etc. But then when something was going wrong, the code continued handling that as if expected. There’s a fine line between robust and oblivious.

There’s a school of thought that all work queues should be bounded, because otherwise what are you going to do when you run out of memory. Fair point. Although for a task like this, right sizing the queue doesn’t seem easy. It can trivially handle millions of entries, but as a developer I’d feel weird typing queue.SetMaxWork(10000000). FWIW the swarm of goroutines has no queue limit unless you implement a second counter of some sort.

The go race checker found and pinpointed the bug instantly. I should have used it much sooner.

Posted 16 May 2022 02:23 by tedu Updated: 20 May 2022 16:41
Tagged: go programming