thoughts on hyperthreading
From the time Intel released CPUs with hyperthreading until quite recently, I was a skeptic. A hater even. But now that I own some computers that feature hyperthreading, I’ve naturally had to change my mind, so now I like it.
revised view
I’ve revised my view of hyperthreading: I see it not so much as more cores, but as faster context switches. Normally, if you’re running two processes on a single core, every 10ms the kernel interrupts one, saves all its registers, then switches to the other process and restores all its registers. One downside is that even if you only have 1ms of work to do, you have to wait for the other process to finish its timeslice. You could have the operating system interrupt every 1ms instead, but shorter timeslices mean more busywork saving and restoring registers.
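To put a rough number on that save-and-restore cost, the classic trick is to bounce a byte between two processes over a pair of pipes and count round trips. A minimal sketch in C; the round count is arbitrary and nothing here pins the two processes to one CPU, so on a multicore machine treat the result as a ballpark figure.

/*
 * Ping-pong a byte between two processes over two pipes.  Each round
 * trip costs (at least) two context switches when the processes share
 * a CPU.  Only a sketch: no pinning, no error checking on the reads
 * and writes.
 */
#include <sys/types.h>
#include <sys/wait.h>

#include <err.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int
main(void)
{
	int ab[2], ba[2];	/* parent to child, child to parent */
	const int rounds = 100000;
	struct timespec start, end;
	double elapsed;
	char c = 'x';
	pid_t pid;
	int i;

	if (pipe(ab) == -1 || pipe(ba) == -1)
		err(1, "pipe");
	if ((pid = fork()) == -1)
		err(1, "fork");

	if (pid == 0) {
		/* child: echo every byte straight back */
		for (i = 0; i < rounds; i++) {
			read(ab[0], &c, 1);
			write(ba[1], &c, 1);
		}
		_exit(0);
	}

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < rounds; i++) {
		write(ab[1], &c, 1);
		read(ba[0], &c, 1);
	}
	clock_gettime(CLOCK_MONOTONIC, &end);
	waitpid(pid, NULL, 0);

	elapsed = (end.tv_sec - start.tv_sec) +
	    (end.tv_nsec - start.tv_nsec) / 1e9;
	printf("%.2f microseconds per round trip\n",
	    elapsed / rounds * 1e6);
	return 0;
}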
Let’s think about another problem. A process reads from disk. That’s slow, and we don’t want to wait for it, so the kernel picks another process to run, and later we continue running the first process. Everybody knows how that works. Now let’s pretend that main memory is like disk. Compared to CPU cache, it’s pretty slow. What if we could switch to a different process every time a process reads from memory, without waiting for that operation to complete? Obviously, that’s outside the scope of what the OS can do, but it’s something the CPU can do.
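To see how disk-like main memory looks from the CPU’s point of view, you can chase a chain of pointers through an array that fits in cache and one that doesn’t. Another rough sketch: the array sizes and step counts are arbitrary, and arc4random_uniform() is an OpenBSD libc call you’d swap for something else elsewhere.

/*
 * Chase a randomized chain of pointers through an array.  A small
 * array stays in cache; a big one forces most loads out to main
 * memory.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static volatile size_t sink;

static double
chase(size_t nelem, size_t steps)
{
	struct timespec start, end;
	size_t *next, i, j, tmp, pos;
	double elapsed;

	if ((next = malloc(nelem * sizeof(*next))) == NULL)
		return -1;

	/*
	 * Sattolo's shuffle: build one big cycle so the hardware
	 * prefetcher can't guess the next address.
	 */
	for (i = 0; i < nelem; i++)
		next[i] = i;
	for (i = nelem - 1; i > 0; i--) {
		j = arc4random_uniform(i);
		tmp = next[i];
		next[i] = next[j];
		next[j] = tmp;
	}

	pos = 0;
	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < steps; i++)
		pos = next[pos];
	clock_gettime(CLOCK_MONOTONIC, &end);
	sink = pos;	/* keep the loop from being optimized away */

	free(next);
	elapsed = (end.tv_sec - start.tv_sec) +
	    (end.tv_nsec - start.tv_nsec) / 1e9;
	return elapsed / steps * 1e9;	/* nanoseconds per load */
}

int
main(void)
{
	/* 32KB fits in cache; 128MB does not */
	printf("cache:  %.1f ns per load\n", chase(4096, 10000000));
	printf("memory: %.1f ns per load\n", chase(16 * 1024 * 1024, 10000000));
	return 0;
}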
Enter hyperthreading. The OS programs the CPU with a list of processes it would like to run. The CPU then runs them, round robin, switching every time one of them does something slow, like reading from memory. Of course, if you fill the scheduler with too much work, it’s not going to perform miracles. The same is true of a regular OS scheduler. make -j 4 will speed things up, but make -j 128 on a dual-core system isn’t the brightest idea.
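If you’d rather not guess at that -j number, the kernel will tell you how many logical CPUs it sees. A minimal sketch using hw.ncpu (from the shell, sysctl hw.ncpu does the same thing); keep in mind that with hyperthreading it counts logical CPUs, not cores.

/*
 * Ask the kernel how many logical CPUs it sees.  On OpenBSD hw.ncpu
 * counts logical CPUs, so with hyperthreading it's larger than the
 * number of physical cores.
 */
#include <sys/types.h>
#include <sys/sysctl.h>

#include <err.h>
#include <stdio.h>

int
main(void)
{
	int mib[2] = { CTL_HW, HW_NCPU };
	int ncpu;
	size_t len = sizeof(ncpu);

	if (sysctl(mib, 2, &ncpu, &len, NULL, 0) == -1)
		err(1, "sysctl");
	printf("logical CPUs: %d\n", ncpu);
	return 0;
}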
Once I came around to this view, I realized hyperthreading isn’t so bad. It’s actually a great solution to the problem that CPUs are getting faster than memory, much like memory long ago got faster than disk. We don’t serialize programs waiting for disk; we shouldn’t serialize programs waiting for memory. The problem is that Intel decided the interface the OS uses to program the CPU is virtual CPUs, which confuses both the user, who may think they’re getting more CPU than they really are, and the OS, which needs updating to understand this new world of CPUs that aren’t quite CPUs.
Windows 7 scheduler
I decide to test the Windows 7 scheduler by running my favorite CPU benchmark (md5 -t) in two OpenBSD VMs. At first glance, the results are not good. Task manager shows all four CPUs in use, with lines going up, down, and in between. It’s like there’s no CPU affinity at all. But then I play around a bit, pausing one VM, then the other. Setting the affinity to a single CPU. Forcing the VMs onto adjacent HT CPUs on the same core. Forcing them onto the same CPU.
These numbers are only approximate. For reference, the benchmark (md5 -tt in this case) takes 2.0 seconds to run. Left up to the Windows scheduler, I can run two instances that take 2.0 seconds each. If I pin one VM to CPU 3 and one VM to CPU 4, forcing them to share resources, the time goes up to about 2.3 seconds. If I force both onto CPU 4, the time goes up to 3.5 seconds. We can see there’s a performance penalty when the VMs are forced to share a core, but by default Windows avoids it.
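The same pinning can be done in code rather than by clicking around in task manager. A sketch using the Win32 affinity call; the bit positions are my guess at the mapping, with task manager’s CPU 3 and CPU 4 corresponding to bits 2 and 3 of the mask.

/*
 * Pin the current process to particular logical CPUs with the Win32
 * affinity mask.  Bit n corresponds to logical CPU n.
 */
#include <windows.h>
#include <stdio.h>

int
main(void)
{
	DWORD_PTR mask = (1 << 2) | (1 << 3);	/* both threads of one core */

	if (!SetProcessAffinityMask(GetCurrentProcess(), mask)) {
		fprintf(stderr, "SetProcessAffinityMask failed: %lu\n",
		    GetLastError());
		return 1;
	}
	/* ... now run the benchmark ... */
	return 0;
}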
Why does task manager show such crazy behavior then? Because there’s really no difference between running on CPU 1 or 2, or on CPU 3 or 4. Migrations between logical CPUs on the same core are essentially free. When I pinned one VM to CPU 1, CPU 2 usage went to zero. CPUs 3 and 4 were still bouncing like crazy, but Windows knew that if CPU 1 was in use, it should stay off CPU 2.
OpenBSD scheduler
The situation on OpenBSD is a little less pleasant. On my Atom system md5 -t should take about 0.4s, but if I run several at once, that degrades to about 0.6s for the ones that are sharing a core. (i.e., I run md5 -tttt to peg a CPU, then while that’s running, also run md5 -t to measure performance.) The scheduler isn’t aware of hyperthreading, so whether I get 0.4s or 0.6s is luck of the draw. On the bright side, running four md5 -t processes, I get 1.6 seconds of work done in only 0.6s, compared to 0.8s if I had run two processes at a time, then two more. So the solution here seems to be keeping all the logical CPUs busy.
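That last measurement is easy to script if you want to repeat it. A rough sketch that launches one md5 -t per logical CPU and times the whole batch; the job count of 4 assumes a machine that shows four logical CPUs, and you’d adjust it for anything else.

/*
 * Run one md5 -t per logical CPU and time the whole batch.  Output
 * from the md5 processes will interleave, which is fine for a rough
 * measurement.
 */
#include <sys/types.h>
#include <sys/wait.h>

#include <err.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int
main(void)
{
	const int njobs = 4;
	struct timespec start, end;
	int i;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < njobs; i++) {
		switch (fork()) {
		case -1:
			err(1, "fork");
		case 0:
			execlp("md5", "md5", "-t", (char *)NULL);
			err(1, "execlp");
		}
	}
	for (i = 0; i < njobs; i++)
		wait(NULL);
	clock_gettime(CLOCK_MONOTONIC, &end);

	printf("%d jobs in %.2f seconds\n", njobs,
	    (end.tv_sec - start.tv_sec) +
	    (end.tv_nsec - start.tv_nsec) / 1e9);
	return 0;
}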
AMD crashes the party
AMD, starting with the Bulldozer architecture, has begun releasing CPUs that aren’t hyperthreaded, but aren’t quite fully multicore either: each pair of integer cores in a Bulldozer module shares a front end and a floating point unit. The problem I have with this is that AMD refers to these almost-cores as cores. Intel may have exaggerated the benefits of hyperthreading (the wonders of “up to” marketing claims), but they clearly market the i7 as a quad-core, not an octa-core.