my int is too big
The thrsleep system call is a part of the kernel code that supports threads. As the name implies, it gives userland a measure of control over scheduling and lets a thread sleep until something happens. As such, it takes a timeout in the form of a timespec. The kernel, however, internally implements time keeping using ticks (there are HZ, 100, ticks per second). The tsleep function (t is for timed) takes an int number of ticks and performs basic validation by checking that it’s not negative. A negative timeout would indicate that the caller has miscalculated. The kernel panics so you can fix the bug, instead of stalling forever.
The trouble therefore is when userland is allowed to specify a timeout that could be negative. The existing code made an attempt to handle various tricks by converting the timespec to a ticks value stored as a 64 bit long long which was checked against INT_MAX before passing to tsleep. Any value over INT_MAX would be truncated, so we can’t allow that. Instead, we saturate the value to INT_MAX. Unfortunately, this check didn’t account for the possibility that the tick conversion from the timespec could also overflow and result in a negative value.
I first fixed this in the most obvious way. Change the type to uint64_t. If the calculations overflowed, the check against INT_MAX would still work. We don’t really care if the overflow results in a short timeout, because userland can only screw with itself. But Tim pointed out that the intermediate calculations could still overflow, which is undefined. I think it’d take a really malicious compiler, even by compiler standards, to actually break this code, but better sound than sorry. We fixed several more instances by also casting the initial calculation to uint64_t.
Every kqueue keeps a list of all the attached events it’s watching for. Because it’s possible to monitor a number of different event types which can have a variety of identifiers, they are kept in a hash table. However, as a practical consideration, the majority of kqueue events are going to be file descriptors, all of which are identified by small integers. In addition to the hash table, a simple array is used to store file events, indexed by fd.
This array is scaled to accommodate the largest fd that needs to be stored. This would obviously cause trouble, consuming too much memory, if the identifier were not validated first. Which is exactly what kqueue tries to do. The fd_getfile function checks that the identifier is a file that the process has open. One wrinkle. fd_getfile takes an int argument but ident is a uintptr_t, possibly 64 bits. An ident of 2^32 + 2 will look like a valid file descriptor, but then cause the array to be resized to gargantuan proportions.
Again, the fix is pretty simple. We must check that the ident is bounded by INT_MAX before calling fd_getfile. There’s obviously a smaller bound as well, since a process could never have an fd anywhere close to that, but that’s already enforced by the called function. We need only verify that the argument is compatible with the input range. This bug likely would have been exploitable beyond a panic, but the array allocation was changed to use mallocarray instead of multiplying arguments by hand, thus preventing another overflow.
Some time ago a secret magic __MAP_NOFAULT flag was added to mmap which enables secret magic behavior. Normally, when a file is truncated, accessing any mapped segments of the file will crash. By setting this flag, such faults are converted to anonymous mappings instead. This allows a server process, such as X, to receive untrusted fds from users and map segments of those files without worrying that an abusive user will truncate the file underneath it. Because this is the moral equivalent of MAP_ANON it should be subject to the same checks, but this was overlooked. Consequently users can allocate more memory than they should.
That’s bad, but things get really bad when the allocation gets really big. As described in the UVM paper, anonymous memory is managed by amaps, which have a certain number of slots, one for each page in the map. Slot accounting is computed in plain ints, probably because it wasn’t anticipated that somebody could allocate so much memory that it would take more than 31 bits to count the number of slots required. A not unreasonable assumption at the time. Or even today, barring other bugs, though some terabyte systems are pushing the limit. Alas, the bug above allowed bad slot calculations to take place. In this case, mallocarray wasn’t enough to prevent corruption because the slot count had already overflowed.
The hunt for this bug also turned about another historic artifact. The mmap system call doesn’t look exactly like the libc function. Sometimes even the bug hunters have bugs.