lessons learned about TRIM
TRIM support actually comes in a few different flavors, depending on whether the disk is SCSI or ATA. In this case, I was working with a SATA Vertex 2 drive, but attached as SCSI behind OpenBSD’s atascsi layer. We were using bufs with a special flag set to indicate a trim vs a regular write operation.
Before we get to the part about actually trimming the disk, we need to fix the filesystem free code. The current code receives a block number to mark free, reads the relevant part of the bitmap, marks it free, and asynchronously writes the bitmap to disk. The important thing to note is that the in-memory and on-disk representations of the bitmap are kept in sync because the in-memory part is really just the disk cache. (There’s a window where the disk lags the cache, but it’s not important for us.) We can’t trim the disk after marking the bitmap, because then a different file might start using those blocks for storage and have its data erased. We can do the trim first, but we have to wait for it to finish before updating the bitmap or the same problem can happen. We solve this problem by attaching the bitmap free operation as a callback on the trim buf. We execute trims asynchronously and, as they finish, update the respective parts of the bitmap.
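The ordering constraint is easier to see in a toy model. The sketch below is not the OpenBSD code; Disk, Bitmap and free_block are hypothetical names, and a simple queue stands in for asynchronous bufs. It only shows that the bitmap stays "in use" until the trim has completed.

```python
from collections import deque


class Disk:
    """Toy disk: trims are queued and completed later, like an async buf."""

    def __init__(self):
        self.pending = deque()  # (block, callback) pairs still in flight
        self.trimmed = []       # blocks whose trim has completed

    def trim_async(self, block, on_done):
        self.pending.append((block, on_done))

    def complete_one(self):
        block, on_done = self.pending.popleft()
        self.trimmed.append(block)
        on_done(block)          # the callback runs only after the trim is done


class Bitmap:
    def __init__(self, nblocks):
        self.free = [False] * nblocks  # every block starts allocated in this toy

    def mark_free(self, block):
        self.free[block] = True


def free_block(disk, bitmap, block):
    # Trim first. The bitmap update is the completion callback, so the
    # allocator cannot hand the block out while the trim is in flight.
    disk.trim_async(block, bitmap.mark_free)


disk = Disk()
bitmap = Bitmap(8)
free_block(disk, bitmap, 3)
in_flight = bitmap.free[3]  # trim queued but not finished
disk.complete_one()
after = bitmap.free[3]      # trim finished, callback has run
```

Here in_flight is False and after is True: the block only becomes allocatable once the disk has finished with it.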
For each block marked free, we notify the disk that the block is free. One block, one buf. The first thing observed is that trim commands are slow. Very very slow. I was only able to do about 20 trims per second. With a file system block size of 32k, that works out to deleting a file at only about 0.6 MB/s. At that rate a gigabyte file takes close to half an hour to erase, which is a substantial regression from the previously instantaneous removal.
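The arithmetic behind those numbers, spelled out. The only assumptions are binary units (32k meaning 32 KiB, a gigabyte meaning 1 GiB) and one trim per filesystem block.

```python
TRIMS_PER_SECOND = 20
BLOCK_SIZE = 32 * 1024   # 32 KiB filesystem block (assumed binary units)
FILE_SIZE = 1024 ** 3    # 1 GiB file (assumed binary units)

throughput = TRIMS_PER_SECOND * BLOCK_SIZE  # bytes freed per second
trims_needed = FILE_SIZE // BLOCK_SIZE      # one trim per block
seconds = trims_needed / TRIMS_PER_SECOND

print(f"{throughput / 2**20:.3f} MiB/s")
print(f"{trims_needed} trims, {seconds / 60:.1f} minutes")
```

That is 32768 separate trim commands for a single file, a bit over 27 minutes of waiting.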
The trim command supported by most disks actually accepts ranges of blocks that are free. I modified the FFS code to coalesce consecutive trim operations. This works because FFS normally allocates adjacent blocks for a file. I also discarded trims shorter than 96k as irrelevant. This pushed performance up to nearly 300 MB/s, though the system experienced high latency while trimming a large file. Still not good enough.
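A minimal sketch of that coalescing step, assuming blocks arrive as a stream of block numbers in the order they are freed. The function name and the (start, count) representation are made up for illustration; the real FFS change is not shown here. Only blocks that are freed back to back and are adjacent on disk get merged, and anything shorter than 96k is dropped.

```python
BLOCK_SIZE = 32 * 1024  # filesystem block size from the text
MIN_TRIM = 96 * 1024    # extents shorter than this are not worth a trim


def coalesce(blocks, block_size=BLOCK_SIZE, min_bytes=MIN_TRIM):
    """Turn a stream of freed block numbers into (start, count) extents."""
    extents = []
    start = count = None
    for b in blocks:
        if start is not None and b == start + count:
            count += 1  # extends the current run
            continue
        # The run ended: keep it only if it is long enough.
        if start is not None and count * block_size >= min_bytes:
            extents.append((start, count))
        start, count = b, 1
    if start is not None and count * block_size >= min_bytes:
        extents.append((start, count))
    return extents


extents = coalesce([10, 11, 12, 13, 40, 41, 100, 101, 102])
```

With 32k blocks this yields [(10, 4), (100, 3)]. The two-block run at 40 is only 64k, so it is discarded.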
In theory, marking space free should only require the SSD to update its page tables, or whatever system it uses to map from logical disk blocks to physical flash blocks. I expected this to be a fast operation. Instead, it appears to flush and sync data in a rather expensive operation. In hindsight, this isn’t too surprising. In normal operation, the page tables probably don’t change much and are cached in RAM, but trim operations require the new version to be committed to flash. But at least the time taken to perform a trim command is the same regardless of how much data is trimmed. That tells us how to build a better version.
This code hasn’t been written yet, but it’s the plan for the future. It’s now clear that we should be using a disk’s ability to trim multiple extents with the same command. Instead of having the filesystem code collect one extent, it’s going to pass all requests to the disk driver. The disk driver will now collect and coalesce ranges, and only pass them to the disk periodically. This introduces one difficulty for the filesystem code. We had been marking the bitmap free when the trim finishes. If we coalesce and postpone too many trim operations, the bitmap will have many potentially free blocks still marked as in use. Deleting a large file and then creating a new large file may result in out-of-space errors.
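Since none of this exists yet, the following is only a sketch of what the driver-side collection could look like. TrimBatcher and its issue callback are hypothetical; a real driver would also have to respect the per-command limits of the device and decide when "periodically" is.

```python
class TrimBatcher:
    """Collects trim extents and issues them as one multi-extent command."""

    def __init__(self, issue):
        self.issue = issue   # hypothetical: called with a list of (start, count)
        self.extents = []    # half-open [start, end) ranges, unsorted

    def add(self, start, count):
        self.extents.append((start, start + count))

    def flush(self):
        merged = []
        for start, end in sorted(self.extents):
            if merged and start <= merged[-1][1]:
                # Touches or overlaps the previous range: grow it.
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        self.extents = []
        if merged:
            self.issue([(s, e - s) for s, e in merged])


commands = []
batcher = TrimBatcher(commands.append)
batcher.add(100, 4)
batcher.add(8, 8)
batcher.add(104, 4)
batcher.add(16, 2)
batcher.flush()
```

Four requests, arriving in any order, become a single command carrying two extents: (8, 10) and (100, 8). Since a trim costs about the same no matter how much it covers, this is where the savings come from.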
What I want to do is to mark the bitmap free immediately, then schedule those blocks for trimming. When a new allocation requests space, it first checks the trim queue and removes any pending trim command for the blocks it wants. When a trim operation is actually about to begin, it first marks those blocks used in the bitmap, and frees them again upon completion. In effect, this gives us back the space immediately, but allows trim commands to hint to the disk so it can start cleaning more space ahead of time.
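Here is a toy model of that scheme, with invented names (Allocator, release, start_trim and so on) and a plain set standing in for the trim queue. It tracks single blocks rather than extents to keep the bookkeeping visible.

```python
class Allocator:
    def __init__(self, nblocks):
        self.free = [False] * nblocks  # toy bitmap, everything in use at first
        self.pending_trim = set()      # freed blocks not yet trimmed

    def release(self, block):
        self.free[block] = True        # the space is available at once
        self.pending_trim.add(block)   # the trim is only a hint, done later

    def allocate(self, block):
        if not self.free[block]:
            raise ValueError("block in use")
        self.pending_trim.discard(block)  # cancel any queued trim
        self.free[block] = False

    def start_trim(self):
        # Hide the blocks from the allocator while the trim is in flight.
        batch = sorted(self.pending_trim)
        self.pending_trim.clear()
        for b in batch:
            self.free[b] = False
        return batch

    def finish_trim(self, batch):
        for b in batch:
            self.free[b] = True        # free again once the disk is done


alloc = Allocator(8)
for b in (1, 2, 3):
    alloc.release(b)
alloc.allocate(2)                      # reused before it was ever trimmed
batch = alloc.start_trim()
during = [alloc.free[b] for b in batch]
alloc.finish_trim(batch)
```

Block 2 never gets trimmed because it was reallocated first; blocks 1 and 3 are briefly marked used while the trim runs and come back as free afterwards.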
I think this will work well with FFS’s allocation policy. FFS will frequently reuse the same blocks. It would be a poor choice for non-wear-leveled flash, but all SSDs map logical blocks to physical blocks. Reusing an existing block is good, because the drive knows that the old physical block is unused. TRIM is only needed to inform the drive of blocks that are free, not of blocks that have been reused.