some gerc notes
gerc (good enough revision control) is a partial reimplementation of mercurial. Between got and bitbucket, it seems source control is back in the news. Here are some scattered notes about gerc and its development. It’s not complete or recommended for use, so don’t expect much.
background
I wrote humungus as a web frontend for hg project hosting when I didn’t like other available options. The easiest solution at the time was to run hg and communicate via pipes. This had a few drawbacks. Unstructured text protocols are fragile to parse and frequently become desynced. Some more commentary in linked post. Also, hg isn’t slow when running in daemon mode, but it’s not fast either.
I’d initially started with pygments for code highlighting and this was slow. Replacing it with simpler, if less featureful, native go code was 10x faster. Maybe I can do the same for mercurial as well.
internals
The mercurial repo format is quite nice conceptually. The main file format is the revlog, an indexed append only log for each file. Of course, the revlog format has been the victim of reality and learning from experience. I had a number of repos created at various stages of mercurial development, and not knowing that certain things had changed between them made the development of gerc very frustrating.
Initially, the revlog always contained a diff against the previous entry. To recreate any particular version, you start with the last complete version, then apply diffs in sequence. Great. But then I noticed some corruption. Ah, ha, I observed. We don’t apply every diff. We follow a chain of parent revision pointers. In many cases, this was the same as the linear chain, but occasionally it’s different. But then I noticed some other corruption in a repo that previously worked.
I ended up spending entirely too much time flipping code around back and forth trying to test things both ways before noticing there’s a flag in the file header which indicates the method to use. So both approaches were correct, but only one works for any given file.
The right thing to do would have been to study the file format more closely, but early easy successes led me to believe it was a simple matter of a tweak here or there.
After a few frustrating late nights (I’m sure it’ll work, just need to try one more thing, ad nauseam) it all came together. The revlog format really is quite nice to work with once you know what’s going on.
diff
The internal diff format is binary deltas, but humans like more readable text diffs. I implemented patience diffing because it was easy. Sometimes this means gerc and hg will show different diffs, but usually it’s fine. (And sometimes patience is better, too.)
garbage
I got to play with the go profiling tools. For the most part, runtime isn’t too bad, but initially I was allocating far more memory than necessary. For example, reconstructing a file starts with an original, then applies a series of deltas to create a new file. Applying many changes in sequence, I would allocate a new file for each delta and throw away the old one. Swapping between two buffers instead eliminates an allocation at every step.
In general, I found the total number of allocations to be the best correlation with performance. Figure out where I was allocating memory, stop doing that, observe big wins.
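The buffer swap described above might look roughly like this. It is a sketch under assumptions, not gerc's code: the `hunk` type is a hypothetical simplified delta in which each hunk replaces a byte range of the old text, and hunks are assumed to be sorted and non-overlapping.

```go
package main

import "fmt"

// hunk is a hypothetical delta fragment: replace old[start:end] with data.
type hunk struct {
	start, end int
	data       []byte
}

// applyDelta writes the patched form of src into dst, reusing dst's
// capacity. dst and src must not share memory.
func applyDelta(dst, src []byte, delta []hunk) []byte {
	dst = dst[:0]
	pos := 0
	for _, h := range delta {
		dst = append(dst, src[pos:h.start]...) // unchanged bytes
		dst = append(dst, h.data...)           // replacement bytes
		pos = h.end
	}
	return append(dst, src[pos:]...)
}

// rebuild applies the deltas in order, ping-ponging between two
// buffers instead of allocating a fresh one per delta.
func rebuild(base []byte, deltas [][]hunk) []byte {
	cur := append([]byte(nil), base...) // private copy of the snapshot
	var spare []byte
	for _, d := range deltas {
		spare = applyDelta(spare, cur, d)
		cur, spare = spare, cur // old text becomes the next scratch buffer
	}
	return cur
}

func main() {
	deltas := [][]hunk{
		{{0, 5, []byte("HELLO")}},
		{{5, 6, []byte(", ")}},
	}
	fmt.Printf("%s\n", rebuild([]byte("hello world"), deltas))
}
```

After the first couple of deltas both buffers usually have enough capacity, so later steps only allocate when the file grows.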
Also, I found that for whatever reason, at least on OpenBSD, using multiple threads in a nominally single threaded program is very slow. I got a 2x speedup by setting GOMAXPROCS to 1. Most of the lost time appeared to be the garbage collector contending against itself.
use
Replacing piped calls to hg with library calls to gerc made humungus dramatically simpler. It’s a lot easier to iterate over arrays of structs.
There’s also a command line shell that mimics the hg command.
performance
Consistently faster than hg. Or chg. gerc diff in a working directory is close to instant.
carbon:~/proj/honk> time chg log -p | wc
32009 140897 1010760
0m02.20s real 0m00.01s user 0m00.07s system
carbon:~/proj/honk> time gerc log -p | wc
30774 130752 969487
0m00.39s real 0m00.32s user 0m00.06s system
carbon:~/proj/honk> time chg log -p | wc
32009 140897 1010760
0m02.25s real 0m00.01s user 0m00.05s system
carbon:~/proj/honk> time gerc log -p | wc
30774 130752 969487
0m00.39s real 0m00.33s user 0m00.07s system
status
gerc is nowhere close to done. I poke at it from time to time, but it’s unlikely to ever really come close to being an hg replacement.
fin
Mercurial is dying. I’m riding this ship all the way to the bottom.