humanely dealing with humungus crawlers
I host a bunch of hobby code on my server. I would think it’s really only interesting to me, but it turns out every day, thousands of people from all over the world are digging through my code, reviewing years-old changesets. On the one hand, wow, thanks, this is very flattering. On the other hand, what the heck is wrong with you?
This has been building up for a while, and I’ve been intermittently developing and deploying countermeasures. It’s been a lot like solving a sliding block puzzle. Lots of small moves and changes, and eventually it starts coming together.
My primary principle is that I’d rather not annoy real humans more than strictly intended. If there’s a challenge, it shouldn’t be too difficult, but ideally, we want to minimize the number of challenges presented. You should never suspect that I suspected you of being an enemy agent.
First measure is we only challenge on the deep URLs. So, for instance, I can link to the anticrawl repo no problem, or even the source for anticrawl.go, and that’ll be served immediately. All the pages any casual browser would visit make up less than 1% of the possible URLs that exist, but probably contain 99% of the interesting content.
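A minimal sketch of what "deep" could mean, in Go. The URL layout (/r/repo/v/rev/...) is inferred from the links and logs in this post, and the function name isDeepURL is hypothetical; this is not lifted from anticrawl itself.

```go
package main

import (
	"fmt"
	"strings"
)

// isDeepURL reports whether path looks like a view of some historical
// revision rather than a page a casual browser would land on.
// Assumed layout: /r/<repo>/v/<rev>/..., where rev "tip" is the current state.
func isDeepURL(path string) bool {
	parts := strings.Split(strings.Trim(path, "/"), "/")
	if len(parts) < 4 || parts[0] != "r" || parts[2] != "v" {
		return false // front page, repo page, or anything else shallow
	}
	return parts[3] != "tip"
}

func main() {
	for _, p := range []string{
		"/r/anticrawl",
		"/r/anticrawl/v/tip/f/anticrawl.go",
		"/r/vertigo/v/b5ea481ff167",
	} {
		fmt.Printf("%-36s deep=%v\n", p, isDeepURL(p))
	}
}
```

Only paths that come back true would ever be candidates for a challenge; everything else goes straight through.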
Also, these pages get cached by the reverse proxy first, so anticrawl doesn’t even evaluate them. We’ve already done the work to render the page, and we’re trying to shed load, so why would I want to increase load by generating challenges and verifying responses? It annoys me when I click a seemingly popular blog post and immediately get challenged, when I’m 99.9% certain that somebody else clicked it two seconds before me. Why isn’t it in cache? We must have different objectives in what we’re trying to accomplish. Or who we’re trying to irritate.
The next step is that anybody loading style.css gets marked friendly. Big Basilisk doesn’t care about my artisanal styles, but most everybody else loves them. So if you start at a normal page, and then start clicking deeper, that’s fine, still no challenge. (Sorry lynx browsers, but don’t worry, it’s not game over for you yet.)
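Roughly, the stylesheet trick could look like the Go sketch below. Identifying visitors by client address, the 30 minute lifetime, and all the names here are assumptions for illustration; the post doesn’t say how the real thing keys or expires its marks.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// friendTTL is an assumed lifetime for a "friendly" mark.
const friendTTL = 30 * time.Minute

// friends remembers which clients recently fetched the stylesheet.
type friends struct {
	mu   sync.Mutex
	seen map[string]time.Time // client -> time the mark expires
}

func newFriends() *friends {
	return &friends{seen: make(map[string]time.Time)}
}

// observe is called for every request; only a stylesheet fetch leaves a mark.
func (f *friends) observe(client, path string) {
	if path != "/style.css" {
		return
	}
	f.mu.Lock()
	defer f.mu.Unlock()
	f.seen[client] = time.Now().Add(friendTTL)
}

// isFriendly reports whether client holds an unexpired mark.
func (f *friends) isFriendly(client string) bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	exp, ok := f.seen[client]
	return ok && time.Now().Before(exp)
}

func main() {
	f := newFriends()
	f.observe("192.0.2.1", "/style.css")
	fmt.Println(f.isFriendly("192.0.2.1"), f.isFriendly("192.0.2.2"))
}
```

A friendly client skips the challenge even on deep URLs. A real deployment would also want to sweep expired entries so the map doesn’t grow forever.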
And then let’s say somebody directly links to a changeset like /r/vertigo/v/b5ea481ff167. The first visitor will probably hit a challenge, but then we record that URL as in use. The bots are shotgun crawling all over the place, but if a single link is visited more than once, I’ll assume it’s human traffic, and bypass the challenge. No promises, but clicking that link will most likely just return content, no challenge.
The very first version of anticrawl relied on a weak POW challenge (find a SHA hash with first byte 42), just to get something launched, but this does seem counterintuitive. Why are we making humans solve a challenge optimized for machines? Instead I have switched to a much more diabolical challenge. You are asked how many Rs are in strawberry. Or maybe something else. To be changed as necessary. But really, the key observation is that any challenge, anything at all, easily sheds like 99.99% of the crawling load.
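For reference, a weak proof-of-work check of that shape might look like this in Go. SHA-256, reading 42 as decimal, and the seed-plus-counter input format are all assumptions; the post only says "a SHA hash with first byte 42".

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"strconv"
)

// target is the required first byte of the hash (assumed decimal 42).
const target = 42

// verify is the cheap server-side check: one hash, one comparison.
func verify(seed string, nonce int) bool {
	sum := sha256.Sum256([]byte(seed + strconv.Itoa(nonce)))
	return sum[0] == target
}

// solve is what the client has to do: try counters until one fits.
// With a one byte target that takes about 256 attempts on average.
func solve(seed string) int {
	for n := 0; ; n++ {
		if verify(seed, n) {
			return n
		}
	}
}

func main() {
	seed := "example-seed" // hypothetical per-visitor value
	n := solve(seed)
	fmt.Println("nonce:", n, "valid:", verify(seed, n))
}
```

A few hundred hashes is nothing for a script and meaningless to a person, which is the mismatch described above.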
Notably, because the challenge does not include its own javascript solver, even a smart crawler isn’t going to solve it automatically. If you include the solution on the challenge page, at least some bots are going to use it. All anticrawl challenges now require some degree of contemplation, not just blind interpretation.
It took a few iterations because the actual deployment involves a few pieces. I had to reduce the style.css cache time, so that visitors would periodically refresh it (and thus their humanity). And then exclude it from the caching proxy, so that the request would be properly observed. Basically, a few minutes tinkering now and then while I wait for my latte to arrive, and now I think I’ve gotten things to the point where it’s unlikely to burden anybody except malignant crawlers.
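The stylesheet side of that could be sketched as a bit of Go middleware. The ten minute max-age and the function names are made up for illustration, and the private directive only keeps the response out of the proxy if the proxy honors it; the actual setup may well do the exclusion in the proxy config instead.

```go
package main

import (
	"fmt"
	"net/http"
)

// cacheControlFor returns the Cache-Control value to force for path,
// or "" to leave whatever the handler sets.
func cacheControlFor(path string) string {
	if path == "/style.css" {
		// Short lifetime so browsers come back and get observed again;
		// "private" asks shared caches not to store the response.
		return "private, max-age=600"
	}
	return ""
}

// withCacheHeaders applies cacheControlFor before calling next.
func withCacheHeaders(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if v := cacheControlFor(r.URL.Path); v != "" {
			w.Header().Set("Cache-Control", v)
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	fmt.Printf("%q\n", cacheControlFor("/style.css"))
	fmt.Printf("%q\n", cacheControlFor("/r/anticrawl"))
}
```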
elsewhere
I have focused my bot detection efforts on humungus because the ratio of crawler to legit traffic was out of control. But now that I know what to look for, I see the same patterns scraping everywhere else. Seems really unlikely a worldwide collective of Opera users is suddenly interested in my old honks. I’m starting to deploy similar countermeasures.
appendix
Some log samples. There’s always somebody to insist these could be real humans, and I have somehow misjudged them. Make your own decision.
logs
136.158.49.199 810.886µs humungus.tedunangst.com [2025/09/07 11:31:06] "GET /r/old-flak/v/d720c11fbb57 HTTP/1.1" 402 904 "" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
179.42.10.152 831.235µs humungus.tedunangst.com [2025/09/07 11:31:31] "GET /r/flak/v/8b31c923ca0f HTTP/1.1" 402 900 "" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36"
43.128.250.84 863.565µs humungus.tedunangst.com [2025/09/07 11:31:32] "GET /r/vertigo/v/71df18bb3819 HTTP/1.1" 402 961 "" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36"
78.182.153.38 639.086µs humungus.tedunangst.com [2025/09/07 11:31:46] "GET /r/gerc/v/692abbdefe18 HTTP/1.1" 402 900 "" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
119.28.100.182 525.152µs humungus.tedunangst.com [2025/09/07 11:31:47] "GET /r/honk3/v/462b7440c563 HTTP/1.1" 402 959 "" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
185.163.26.50 678.609µs humungus.tedunangst.com [2025/09/07 11:31:49] "GET /r/azorius/v/b48a3aa3e060 HTTP/1.1" 402 961 "" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
43.167.157.150 758.749µs humungus.tedunangst.com [2025/09/07 11:32:01] "GET /r/humungus/v/0442f94c95fc HTTP/1.1" 402 904 "" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"
43.134.75.63 574.875µs humungus.tedunangst.com [2025/09/07 11:32:03] "GET /r/azorius/v/969314b9f388 HTTP/1.1" 402 903 "" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
119.28.203.219 497.67µs humungus.tedunangst.com [2025/09/07 11:32:04] "GET /r/gerc/v/8ddbf7307214 HTTP/1.1" 402 900 "" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
43.128.84.91 727.05µs humungus.tedunangst.com [2025/09/07 11:32:06] "GET /r/vertigo/v/eb31940f6fa2 HTTP/1.1" 402 903 "" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36"
178.163.207.155 566.9µs humungus.tedunangst.com [2025/09/07 11:32:09] "GET /r/lua-tedu/v/300e67089469 HTTP/1.1" 402 904 "" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
181.120.225.184 561.48µs humungus.tedunangst.com [2025/09/07 11:32:12] "GET /r/azorius/v/43120b8aac5a HTTP/1.1" 402 961 "" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
150.109.20.69 612.716µs humungus.tedunangst.com [2025/09/07 11:32:13] "GET /r/gojxl/v/tip/f/pool.go HTTP/1.1" 404 19 "" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
49.51.170.84 629.056µs humungus.tedunangst.com [2025/09/07 11:32:27] "GET /r/gerc/v/41b8b28ee893 HTTP/1.1" 402 958 "" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
5.62.146.6 843.157µs humungus.tedunangst.com [2025/09/07 11:32:33] "GET /r/honk/v/f6b8a7bee881 HTTP/1.1" 402 900 "" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36"
129.226.112.105 595.173µs humungus.tedunangst.com [2025/09/07 11:32:40] "GET /r/azorius/v/6179155ac315 HTTP/1.1" 402 903 "" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.81 Safari/537.36"
43.167.204.48 514.371µs humungus.tedunangst.com [2025/09/07 11:32:42] "GET /r/vertigo/v/fae0082c32c2 HTTP/1.1" 402 961 "" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"
43.132.108.190 715.327µs humungus.tedunangst.com [2025/09/07 11:32:56] "GET /r/vertigo/v/8b5fcd06f8c6 HTTP/1.1" 402 903 "" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"
The big brain solution is you just cache all these requests, but unfortunately I have slightly less than 2TB RAM. Dealing with these relentless scans renders the cache for actual content less useful because everything gets LRUd out.
Take a look at this guy. Apparently a pro starcraft player has taken up speed browsing as a side hustle, middle clicking ten times per second. These links aren’t even on the same page, so he’s switching tabs between clicks, too. Amazing. But he doesn’t solve a single challenge. I thought gamers liked puzzles? Maybe that’s why this totally real human quit gaming.
logs
46.183.108.190 1.361957ms humungus.tedunangst.com [2025/07/08 13:41:32] "GET /r/azorius/v/a0824eb087c7 HTTP/1.1" 402 1808 "" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:140.0) Gecko/20100101 Firefox/140.0"
46.183.108.190 2.450003ms humungus.tedunangst.com [2025/07/08 13:41:32] "GET /r/azorius/v/1a4d35ff94ef HTTP/1.1" 402 1808 "" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:140.0) Gecko/20100101 Firefox/140.0"
46.183.108.190 3.049854ms humungus.tedunangst.com [2025/07/08 13:41:32] "GET /r/azorius/v/9ca7ed390641 HTTP/1.1" 402 1808 "" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:140.0) Gecko/20100101 Firefox/140.0"
46.183.108.190 3.215804ms humungus.tedunangst.com [2025/07/08 13:41:32] "GET /r/humungus/v/67e77258e203 HTTP/1.1" 402 1809 "" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:140.0) Gecko/20100101 Firefox/140.0"
46.183.108.190 3.26703ms humungus.tedunangst.com [2025/07/08 13:41:32] "GET /r/honk/v/2fc6f904deaa HTTP/1.1" 402 1805 "" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:140.0) Gecko/20100101 Firefox/140.0"
46.183.108.190 830.924µs humungus.tedunangst.com [2025/07/08 13:41:32] "GET /r/humungus/v/1437b0d26457 HTTP/1.1" 402 1809 "" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:140.0) Gecko/20100101 Firefox/140.0"
46.183.108.190 1.23004ms humungus.tedunangst.com [2025/07/08 13:41:32] "GET /r/honk/v/945572a3b51d HTTP/1.1" 402 1805 "" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:140.0) Gecko/20100101 Firefox/140.0"
46.183.108.190 789.496µs humungus.tedunangst.com [2025/07/08 13:41:32] "GET /r/humungus/v/63f9c1f17606 HTTP/1.1" 402 1809 "" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:140.0) Gecko/20100101 Firefox/140.0"
46.183.108.190 841.414µs humungus.tedunangst.com [2025/07/08 13:41:32] "GET /r/humungus/v/b5711a883e66 HTTP/1.1" 402 1809 "" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:140.0) Gecko/20100101 Firefox/140.0"
46.183.108.190 916.233µs humungus.tedunangst.com [2025/07/08 13:41:32] "GET /r/humungus/v/a5307922b3f5 HTTP/1.1" 402 1809 "" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:140.0) Gecko/20100101 Firefox/140.0"
46.183.108.190 565.518µs humungus.tedunangst.com [2025/07/08 13:41:33] "GET /r/honk/v/ab1e84cac5e6 HTTP/1.1" 402 1805 "" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:140.0) Gecko/20100101 Firefox/140.0"
46.183.108.190 574.083µs humungus.tedunangst.com [2025/07/08 13:41:33] "GET /r/honk/v/a9043d011e41 HTTP/1.1" 402 1805 "" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:140.0) Gecko/20100101 Firefox/140.0"
46.183.108.190 602.026µs humungus.tedunangst.com [2025/07/08 13:41:33] "GET /r/azorius/v/1cc1393b6832 HTTP/1.1" 402 1808 "" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:140.0) Gecko/20100101 Firefox/140.0"
46.183.108.190 497.831µs humungus.tedunangst.com [2025/07/08 13:41:33] "GET /r/azorius/v/4d53be2bdbd5 HTTP/1.1" 402 1808 "" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:140.0) Gecko/20100101 Firefox/140.0"
46.183.108.190 516.365µs humungus.tedunangst.com [2025/07/08 13:41:33] "GET /r/honk/v/302e58335796 HTTP/1.1" 402 1805 "" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:140.0) Gecko/20100101 Firefox/140.0"
46.183.108.190 614.239µs humungus.tedunangst.com [2025/07/08 13:41:33] "GET /r/azorius/v/1a4d35ff94ef HTTP/1.1" 402 1808 "" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:140.0) Gecko/20100101 Firefox/140.0"
46.183.108.190 530.161µs humungus.tedunangst.com [2025/07/08 13:41:33] "GET /r/azorius/v/a0824eb087c7 HTTP/1.1" 402 1808 "" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:140.0) Gecko/20100101 Firefox/140.0"
46.183.108.190 625.91µs humungus.tedunangst.com [2025/07/08 13:41:33] "GET /r/azorius/v/9ca7ed390641 HTTP/1.1" 402 1808 "" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:140.0) Gecko/20100101 Firefox/140.0"
46.183.108.190 757.497µs humungus.tedunangst.com [2025/07/08 13:41:33] "GET /r/humungus/v/67e77258e203 HTTP/1.1" 402 1809 "" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:140.0) Gecko/20100101 Firefox/140.0"
46.183.108.190 841.494µs humungus.tedunangst.com [2025/07/08 13:41:33] "GET /r/vertigo/v/6b3ffb3b21f5 HTTP/1.1" 402 1808 "" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:140.0) Gecko/20100101 Firefox/140.0"
I think there’s a common perception that 10 req/s just isn’t that much, based on some simple benchmarks. But that doesn’t account for TLS, etc. 10 handshakes/s requires a bit more juice than GET hello. I’ve worked to keep response times in the low millisecond range, just seems like good sense, but I think people should be allowed to program in slower languages and frameworks. You shouldn’t need a fairly substantial EPYC server like I have, either.
And there’s always stuff happening in the background. Mastodon hits me once a second every time anybody deletes something. Lemmy hits me twice a second every time somebody likes anything. There’s a bunch of nitwits with misconfigured RSS readers.
A 4x overhead in one area doesn’t matter, but stack up a CPU 1/4 as powerful, with 1/4 as many cores, running a language 1/4 as fast, all of which are entirely realistic, and that’s roughly 64x less headroom. Pretty soon we’re running close to the edge.