probing my ssd's latency
My SSD is probably pretty fast, but maybe a faster one would let me compile a kernel even quicker by reducing the time spent waiting for I/O to complete. First though, I need to determine its latency, and the benchmark tool available to me, dd, measures throughput not latency. We need to go deeper.
To measure latency, we’ll have to write a custom tool. By which I mean somewhere on the order of a dozen lines of code, taking advantage of the capabilities of dt, btrace, and bt.
initial probe
The dt dynamic tracer allows userland to enable trace points in the kernel and collect events. It’s present but not enabled by default. The first thing is to add kern.allowdt=1 to sysctl.conf and reboot.
After that we can run btrace -l to see the available probes. There’s a bunch of syscall probes. We could look at read and write calls, but I’m really more interested in physical disk activity. The closest match I can see relates to the buffer cache. We can trace when buffers are acquired and released. This will probably correlate with disk activity, even if it’s not quite our target measurement.
Just to get things started, I wrote a very simple bt script. I want to see some numbers on the screen first, and then we’ll figure out what else needs to be done.
tracepoint:vfs:bufcache_take { printf("get %u\n", nsecs) }
tracepoint:vfs:bufcache_rel { printf("rel %u\n", nsecs) }
Sure enough, running this prints lots of numbers. If we had some way of correlating the timestamps, we could figure out how long each buffer is in use. Alas, the trace point arguments available to us include the buffer flags and some other info, but not the buffer pointer. In any case, I want disk time, not buffer time, so we’ll move on.
targeted probe
If the probe we want isn’t available, we’ll add our own. My SSD attaches as an sd device, so that’s where the new tracepoint will be going. The function that initiates I/O is called sdstrategy and the function that’s called after it completes is sd_buf_done. We’ll make sure to include the buffer pointer value in our trace point to allow correlation. All in all, a very minor edit to scsi/sd.c.
@@ -529,6 +530,8 @@ sdstrategy(struct buf *bp)
 	struct sd_softc *sc;
 	int s;
 
+	TRACEPOINT(vfs, sdstrategy, bp, bp->b_dev, bp->b_flags);
+
 	sc = sdlookup(DISKUNIT(bp->b_dev));
 	if (sc == NULL) {
 		bp->b_error = ENXIO;
@@ -763,6 +766,8 @@ retry:
 		bp->b_resid = bp->b_bcount;
 		break;
 	}
+
+	TRACEPOINT(vfs, sd_buf_done, bp, bp->b_dev, bp->b_flags);
 
 	disk_unbusy(&sc->sc_dk, bp->b_bcount - xs->resid, bp->b_blkno,
 	    bp->b_flags & B_READ);
One more edit is required to tie it into the system. All probes are defined in dev/dt/dt_prov_static.c.
@@ -68,6 +68,8 @@ DT_STATIC_PROBE2(uvm, pool_put, "void *"
 DT_STATIC_PROBE3(vfs, bufcache_rel, "long", "int", "int64_t");
 DT_STATIC_PROBE3(vfs, bufcache_take, "long", "int", "int64_t");
 DT_STATIC_PROBE4(vfs, cleaner, "long", "int", "long", "long");
+DT_STATIC_PROBE3(vfs, sdstrategy, "void *", "int", "long");
+DT_STATIC_PROBE3(vfs, sd_buf_done, "void *", "int", "long");
 
 #ifdef __amd64__
 /*
@@ -115,6 +117,8 @@ struct dt_probe *const dtps_static[] = {
 	&_DT_STATIC_P(vfs, bufcache_rel),
 	&_DT_STATIC_P(vfs, bufcache_take),
 	&_DT_STATIC_P(vfs, cleaner),
+	&_DT_STATIC_P(vfs, sdstrategy),
+	&_DT_STATIC_P(vfs, sd_buf_done),
 #ifdef __amd64__
 	/* VMM */
 	&_DT_STATIC_P(vmm, guest_enter),
Recompile. Reboot. Off to a pretty breezy start.
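Before getting fancy, a quick sanity check in the spirit of the first script. The three TRACEPOINT arguments show up in bt as arg0 through arg2, so arg0 is the buffer pointer, arg1 the device, and arg2 the flags. A throwaway sketch (assuming printf is as happy printing arg0 with %u as it was printing nsecs) to confirm the same pointer value appears on both sides:
tracepoint:vfs:sdstrategy { printf("start %u %u\n", arg0, nsecs) }
tracepoint:vfs:sd_buf_done { printf("done %u %u\n", arg0, nsecs) }
If starts and dones come out with matching first numbers, the correlation key works.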
historic histograms
I see in the bt manual that there’s a hist function for incrementing a histogram, which sounds like exactly what I want. Although I’m not exactly sure which histogram the value is going to be incrementing, since it seems the function is missing at least one argument, and the manual is a little thin on this point. Ah well, we have the source. The definitive reference for any parser is the yacc grammar. Just pop open bt_parse.y and take a look.
* Copyright (c) 2015 Ted Unangst <tedu@openbsd.org>
Woah. I was not expecting that. Apparently this file started life as doas parse.y. I’m feeling right at home. (I’m fairly certain any code I wrote has long been replaced, but it’s cool that somebody went looking for a skeleton yacc file and chose one I wrote.)
We search around for “hist” which leads to OP1 and eventually the matching rule.
| gvar '=' OP1 '(' pat ')' { $$ = bh_inc($1, $5, NULL); }
The mist falls from our eyes. We add to the histogram by assigning to it, and the grammar reveals the magic wormhole by which the destination becomes a function argument. Does that sound weird? The bpftrace manual doesn’t try explaining with prose either; it just provides examples. This syntax originates in dtrace, which also seems to prefer the show don’t tell approach to documenting how aggregate functions work. It’s not awkward if you simply don’t talk about it.
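For instance, with the trace point we just added, a do-nothing one-liner exercising the syntax might look like this (arg2 here is the buffer flags we passed in; the resulting histogram isn’t interesting, the assignment form is the point):
tracepoint:vfs:sdstrategy { @flags = hist(arg2) }
Assign a value to the map, and it lands in the appropriate bucket.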
analysis
We now have the trace point in place and enough knowledge to be dangerous. The advanced bt script isn’t really so much longer than the test prototype.
tracepoint:vfs:sdstrategy {
	@diskstarts[arg0] = nsecs
}
tracepoint:vfs:sd_buf_done {
	$delay = nsecs - @diskstarts[arg0]
	@disktimes = hist($delay)
}
END {
	clear(@diskstarts)
}
Run btrace sd.bt, compile a kernel, ^C, and we have results.
@disktimes:
[0] 1009 |@@ |
[8K, 16K) 9496 |@@@@@@@@@@@@@@@@@@@@@@@@ |
[16K, 32K) 8845 |@@@@@@@@@@@@@@@@@@@@@@ |
[32K, 64K) 20370 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64K, 128K) 12415 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[128K, 256K) 12394 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[256K, 512K) 5309 |@@@@@@@@@@@@@ |
[512K, 1M) 1749 |@@@@ |
[1M, 2M) 166 | |
[2M, 4M) 155 | |
[4M, 8M) 70 | |
[8M, 16M) 130 | |
[16M, 32M) 102 | |
[32M, 64M) 204 | |
[64M, 128M) 331 | |
[128M, 256M) 358 | |
[256M, 512M) 203 | |
[512M, 1G) 187 | |
[1G, 2G) 27 | |
[2G, 4G) 15 | |
[4G, 8G) 6 | |
[16G, 32G) 675 |@ |
[32G, 64G) 8 | |
[64G, 128G) 29 | |
[128G, 256G) 26 | |
[256G, 512G) 5 | |
[1K, 2K) 8911 |@@@@@@@@@@@@@@@@@@@@@@ |
Something weird going on with the 1K bucket landing at the bottom. Otherwise, it looks like we have latency in the modest microseconds, with the occasional foray into the milliseconds. Is that fast?
on demand
Ordinarily, one prints the results from btrace by killing the process, but maybe we’d like to take a peek at the current results and keep collecting. Unfortunately, there’s no other signal handler. Instead of editing more code though, we can play with the flexibility of the bt script itself. I open a spare terminal and run echo $$ to learn my shell’s pid is 690.
syscall:kill:entry / pid == 690 /{
	print(@disktimes)
}
Now whenever that shell makes a kill system call, we’ll print out the current disk times histogram. kill 1 is easy to type and mostly harmless...
There is no space between the “/” and the “{” because of a lexer quirk: it doesn’t recognize “/ {”. I spent a few puzzling moments trying to figure out the parse error before going back into bt_parse.y.
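Putting the pieces together, the whole sd.bt ends up looking something like this (my shell’s pid is baked in, so adjust 690 to taste):
tracepoint:vfs:sdstrategy {
	@diskstarts[arg0] = nsecs
}
tracepoint:vfs:sd_buf_done {
	$delay = nsecs - @diskstarts[arg0]
	@disktimes = hist($delay)
}
syscall:kill:entry / pid == 690 /{
	print(@disktimes)
}
END {
	clear(@diskstarts)
}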
results
I’ve thought about adding disk latency recording to the kernel a few times in the past, but it always seemed like a lot of work. Record the numbers, funnel them to userland somehow, crunch them, print them pretty. Do I really care enough to do all that? The combination of dt and btrace takes care of all of it. It was only a few minutes’ work to add exactly the trace points I wanted and get results.