It’s the internet, so there’s crawlers, and it’s the future, so they’re mindless wannabe chatbot scrapers, and it’s the cyberpunk world we always dreamed of, so the cool thing to do is to write your own force field to keep the bots out. Which I did.
For background, I have not actually noticed any load from bot scrapers (other than the google go cache, different story), but found them a different way. There’s a bug in humungus which I’m too lazy to fix that causes it to 500 error whenever a file revision that doesn’t exist is requested. There’s a second bug I’m too lazy to even look for that generates these links for the bots to find. But the punchline is I have a bunch of 500 errors in my log file. The robots file excludes these URLs because I know they’re useless for a crawler. I’m trying to help you, stupid bot, but some bots are beyond help, so we need a bigger hammer. Initially, I banned netblocks in pf, and after eliminating all of BabaWei’s IP blocks, we’re down to Brazilian ISPs, which I don’t want to block at the network level.
anticrawl is a simple go http handler. Stick it in the affected service. Then configure a regex because I like problems. I don’t care if you scrape my README 700 times, that’s what it’s for, but leave the other junk alone. Also, I’d rather not bother humans, even a little bit, until they start clicking around deeper.
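For a picture of how that slots together, here’s a rough sketch of a path-gated handler wrapper. To be clear, none of these names or the regex come from anticrawl itself; it’s just the general shape of a regex-configured middleware sitting inside the service.

package main

import (
	"net/http"
	"regexp"
)

// challengeIfDeep only intercepts requests whose path matches the regex,
// so the README and other shallow pages pass straight through.
func challengeIfDeep(deep *regexp.Regexp, challenge, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if deep.MatchString(r.URL.Path) {
			challenge.ServeHTTP(w, r)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	// hypothetical pattern: guard revision and file URLs, leave the rest alone
	deep := regexp.MustCompile(`/(rev|file)/`)
	challenge := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.Error(w, "prove you are not a crawler", http.StatusForbidden)
	})
	site := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("readme, scrape away\n"))
	})
	http.ListenAndServe(":8080", challengeIfDeep(deep, challenge, site))
}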
The challenge is super easy. If you have javascript, you have to find some 42s. (Why is it always zeroes we’re forced to search for?) If you don’t, you have to solve the riddle of the llama. Either way, it’s trivial, because the adversary isn’t exactly basilisk-class AI. I was told cookies are evil, so the state is just stored server side. I’m thinking I might change the design so it’s even easier to bypass by starting at a normal entry point. So far, it appears very effective.
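And since the state lives server side instead of in a cookie, the bookkeeping can be as dumb as a map keyed by client IP. Again a sketch under assumptions, not the real code: the riddle text, its answer, and the paths are invented, and only the no-javascript branch is shown.

package main

import (
	"net"
	"net/http"
	"strings"
	"sync"
)

var (
	mu     sync.Mutex
	solved = map[string]bool{} // client IP -> has answered the riddle
)

func clientIP(r *http.Request) string {
	if host, _, err := net.SplitHostPort(r.RemoteAddr); err == nil {
		return host
	}
	return r.RemoteAddr
}

// verified reports whether this client already passed a challenge.
func verified(r *http.Request) bool {
	mu.Lock()
	defer mu.Unlock()
	return solved[clientIP(r)]
}

// riddle is the no-javascript fallback: a plain form, with a correct answer
// recorded server side so no cookie is ever set.
func riddle(w http.ResponseWriter, r *http.Request) {
	if strings.EqualFold(strings.TrimSpace(r.FormValue("answer")), "llama") {
		mu.Lock()
		solved[clientIP(r)] = true
		mu.Unlock()
		http.Redirect(w, r, "/", http.StatusSeeOther)
		return
	}
	w.Write([]byte(`<form method="post">Which animal is the riddle about? <input name="answer"> <input type="submit" value="answer"></form>`))
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/challenge", riddle)
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		if !verified(r) {
			http.Redirect(w, r, "/challenge", http.StatusSeeOther)
			return
		}
		w.Write([]byte("welcome back, human\n"))
	})
	http.ListenAndServe(":8080", mux)
}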
There’s also a standalone proxy server for people who can never run enough servers.
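The wiring for that is presumably just the same check parked in front of a reverse proxy instead of inside the service. Here’s a bare-bones guess at it, with a do-nothing gate standing in for the real check and addresses that are entirely made up.

package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// gate is where the anticrawl decision would go; this stand-in just logs
// and passes everything through to the backend.
func gate(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		log.Printf("would check %s before proxying %s", r.RemoteAddr, r.URL.Path)
		next.ServeHTTP(w, r)
	})
}

func main() {
	backend, err := url.Parse("http://127.0.0.1:8080") // the protected service
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(backend)
	log.Fatal(http.ListenAndServe(":9090", gate(proxy)))
}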
Posted 17 Apr 2025 16:53 by tedu
Updated: 17 Apr 2025 16:53
Tagged: project