what the go proxy has been doing
A follow-up to what is the go proxy even doing? now with more answers. Russ Cox (rsc) and I traded a few emails trying to work out what was going on. The go team is going to make a few changes, and I also learned a few things it would have been helpful to know.
The first issue is that whenever the proxy refreshes a module hosted by mercurial, it performs a full clone. Changing that to reuse the existing clone is tracked in issue 75119. So after go 1.26 comes out and the proxy is updated, problem solved. A second issue, issue 75191, also notes that for some lightly accessed modules, the refresh traffic can exceed organic traffic by quite a lot.
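To make the distinction concrete, here's a rough sketch, not the proxy's actual code, of what a refresh looks like today versus what the fix amounts to. The repo URL and cache path are just examples.

package main

import (
	"log"
	"os"
	"os/exec"
)

// Example origin URL; the real proxy obviously tracks many modules.
const repo = "https://humungus.tedunangst.com/r/webs"

// refreshFull is what happens today: every refresh clones the whole
// repository into a fresh directory.
func refreshFull() error {
	tmp, err := os.MkdirTemp("", "modfetch")
	if err != nil {
		return err
	}
	// -U skips the working copy update; only the history is transferred.
	return exec.Command("hg", "clone", "-U", repo, tmp).Run()
}

// refreshIncremental is roughly what the fix amounts to: reuse an existing
// clone and only pull the new changesets.
func refreshIncremental(cachedClone string) error {
	cmd := exec.Command("hg", "pull", repo)
	cmd.Dir = cachedClone
	return cmd.Run()
}

func main() {
	if err := refreshFull(); err != nil {
		log.Fatal(err)
	}
	// Hypothetical cache location, just for the sketch.
	if err := refreshIncremental(os.TempDir() + "/modfetch-webs"); err != nil {
		log.Fatal(err)
	}
}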
In the meantime, humungus has been added to the skip list. (And instructions for adding yourself to the skip list have been added to the proxy web site.)
On to the main event, the thundering herd. Unfortunately, I posted some logs from an event that happened in May. At the time, I was looking into a variety of things, noticed the surge, and stashed those log events with the intention of writing it up, but got distracted and next thing I know it’s August. New issue: calendar seems to skip months. I still had the log extracts, and figured it was kinda interesting even delayed, so I posted it. I wasn’t really expecting much action, or I would have tried harder not to resurrect a cold case.
On the bright side, shortly after Russ emailed me, it happened again. The proxy started cloning my repo every seven seconds for 15 minutes. This time it wasn’t even connected to anything I was doing.
I later noticed the proxy also had a tremendous hunger for honk. This repo is somewhat larger, and slower to serve, such that it really would become a problem if this kept up. Thankfully, I think we're close to a solution.

logs
172.253.192.245 808.60907ms humungus.tedunangst.com [2025/08/23 01:41:06] "GET /r/webs HTTP/1.1" 200 167250 "" "mercurial/proto-1.0 (Mercurial 5.6.1)"
172.253.192.114 770.224567ms humungus.tedunangst.com [2025/08/23 01:41:12] "GET /r/webs HTTP/1.1" 200 167250 "" "mercurial/proto-1.0 (Mercurial 5.6.1)"
172.253.7.55 764.764434ms humungus.tedunangst.com [2025/08/23 01:41:19] "GET /r/webs HTTP/1.1" 200 167250 "" "mercurial/proto-1.0 (Mercurial 5.6.1)"
172.253.195.239 777.648443ms humungus.tedunangst.com [2025/08/23 01:41:26] "GET /r/webs HTTP/1.1" 200 167250 "" "mercurial/proto-1.0 (Mercurial 5.6.1)"
172.217.40.170 780.867724ms humungus.tedunangst.com [2025/08/23 01:41:33] "GET /r/webs HTTP/1.1" 200 167250 "" "mercurial/proto-1.0 (Mercurial 5.6.1)"
74.125.72.46 770.989918ms humungus.tedunangst.com [2025/08/23 01:41:41] "GET /r/webs HTTP/1.1" 200 167250 "" "mercurial/proto-1.0 (Mercurial 5.6.1)"
172.253.214.44 756.97786ms humungus.tedunangst.com [2025/08/23 01:41:49] "GET /r/webs HTTP/1.1" 200 167250 "" "mercurial/proto-1.0 (Mercurial 5.6.1)"
74.125.17.221 767.122926ms humungus.tedunangst.com [2025/08/23 01:57:53] "GET /r/webs HTTP/1.1" 200 167250 "" "mercurial/proto-1.0 (Mercurial 5.6.1)"
172.253.249.87 770.427036ms humungus.tedunangst.com [2025/08/23 01:58:03] "GET /r/webs HTTP/1.1" 200 167250 "" "mercurial/proto-1.0 (Mercurial 5.6.1)"
172.253.214.43 805.764069ms humungus.tedunangst.com [2025/08/23 01:58:09] "GET /r/webs HTTP/1.1" 200 167250 "" "mercurial/proto-1.0 (Mercurial 5.6.1)"
172.253.214.50 784.409488ms humungus.tedunangst.com [2025/08/23 01:58:17] "GET /r/webs HTTP/1.1" 200 167250 "" "mercurial/proto-1.0 (Mercurial 5.6.1)"
74.125.115.13 1.41371139s humungus.tedunangst.com [2025/08/28 05:11:02] "GET /r/honk HTTP/1.1" 200 1667100 "" "mercurial/proto-1.0 (Mercurial 5.6.1)"
74.125.115.26 1.562936385s humungus.tedunangst.com [2025/08/28 05:11:09] "GET /r/honk HTTP/1.1" 200 1667100 "" "mercurial/proto-1.0 (Mercurial 5.6.1)"
74.125.76.104 1.499745566s humungus.tedunangst.com [2025/08/28 05:11:18] "GET /r/honk HTTP/1.1" 200 1667100 "" "mercurial/proto-1.0 (Mercurial 5.6.1)"
172.253.251.5 1.502836868s humungus.tedunangst.com [2025/08/28 05:16:32] "GET /r/honk HTTP/1.1" 200 1667100 "" "mercurial/proto-1.0 (Mercurial 5.6.1)"
74.125.189.66 1.471472244s humungus.tedunangst.com [2025/08/28 05:16:40] "GET /r/honk HTTP/1.1" 200 1667100 "" "mercurial/proto-1.0 (Mercurial 5.6.1)"
74.125.187.224 1.599957929s humungus.tedunangst.com [2025/08/28 05:16:49] "GET /r/honk HTTP/1.1" 200 1667100 "" "mercurial/proto-1.0 (Mercurial 5.6.1)"
Russ was able to correlate with logs on their side. Well, it turns out somebody has written a python script which attempts to download every tag of a module for reasons unknown. This is the mystery “intern” who’s been causing me so much trouble. Thanks.
Something I learned is that the go proxy includes an auxiliary index service; see proxy.golang.org for details. Whenever a new tag hits the proxy, it gets announced here too, which explains why I get a burst of traffic after retrieving a new tag. The instructions say one should use the /cached-only endpoint for bulk downloads, but not everyone reads the manual.
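For the curious, polling the index looks roughly like this. A minimal sketch, assuming the feed format I understand index.golang.org to serve (newline-delimited JSON of newly announced versions); well behaved bulk consumers are supposed to pair it with the /cached-only endpoint instead of forcing origin fetches.

package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// One record per line in the feed, as I understand the format.
type indexEntry struct {
	Path      string
	Version   string
	Timestamp string
}

func main() {
	// Ask for versions announced since a given time; limit keeps it small.
	url := "https://index.golang.org/index?since=2025-08-23T00:00:00Z&limit=10"
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		var e indexEntry
		if err := json.Unmarshal(sc.Bytes(), &e); err != nil {
			log.Fatal(err)
		}
		fmt.Println(e.Timestamp, e.Path, e.Version)
	}
	if err := sc.Err(); err != nil {
		log.Fatal(err)
	}
}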
What’s been happening is I push a tag, then I pull that tag, which informs the mod proxy that there’s a new tag, which gets published in the index, finally unleashing the hungry hungry python. Technically, the proxy has been trying to download different tags, but the traffic all looks the same.
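For reference, here's roughly what downloading every tag of a module looks like at the proxy protocol level (see go help goproxy for the real spec). This is only a sketch, using one of my modules as the example path; I don't know what the actual python script does.

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"strings"
)

func main() {
	// Example module; the script apparently does this for lots of them.
	const mod = "humungus.tedunangst.com/r/webs"
	base := "https://proxy.golang.org/" + mod + "/@v/"

	// First, the list of known versions, one per line.
	resp, err := http.Get(base + "list")
	if err != nil {
		log.Fatal(err)
	}
	list, err := io.ReadAll(resp.Body)
	resp.Body.Close()
	if err != nil {
		log.Fatal(err)
	}

	// Then a request per version. Each one the proxy hasn't cached turns
	// into a fetch from the origin server, which is where the hg clones
	// in the logs above come from.
	for _, v := range strings.Fields(string(list)) {
		info, err := http.Get(base + v + ".info")
		if err != nil {
			log.Fatal(err)
		}
		info.Body.Close()
		fmt.Println(v, info.Status)
	}
}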
This is also where two choices I had made amplified the trouble. Ordinarily, the proxy should be caching these requests. But because I didn't have a recognized LICENSE file, they were not. And because my server had been returning 429s, some of the tags that might have been cached were never successfully fetched, so each one turned into a new request when this script started enumerating everything. I don't think I did anything wrong, but the design of the system probably didn't anticipate this. The end result is that when somebody scrapes the proxy, they end up scraping my site, literally by proxy.
I need to mention Russ really went out of his way to download all my repos and check their licenses and tags to determine which were in good shape and which were not.
The first thing to change was adding a LICENSE to every repo. I usually stick the license in each file (when I remember). The go team documents the license requirements on pkg.go.dev, but I think the full implications for how this affects the module proxy are not spelled out. I don't particularly care that some personal modules may not have their nonexistent documentation displayed, but letting the proxy know that it's cool to keep my code cached would be nice.
I can't reasonably add a LICENSE to 100 past tags, so we'll need to see what happens during Mr. Python's future visits. But for now I've eliminated the 429 responses, which (hopefully?) means less traffic overall once we reach equilibrium. Also, once the change from clone to pull lands, this will all be trivial. (I keep saying they're going to switch from clone to pull; the technical implementation will be somewhat different, but same effect.)
My advice to anyone hosting any go code: make sure you've got a LICENSE file. Also, make sure the text of the file is very simple, just the license, and not some commentary on why you've chosen the license, or it won't count.
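If you want to sanity check a LICENSE file before the proxy does, something along these lines may help. I'm assuming the github.com/google/licensecheck package, which I believe is what pkgsite uses for detection; the field names are from my reading of v0.3.x and may differ in other versions.

package main

import (
	"fmt"
	"log"
	"os"

	"github.com/google/licensecheck"
)

func main() {
	data, err := os.ReadFile("LICENSE")
	if err != nil {
		log.Fatal(err)
	}

	// Scan reports how much of the file matched known license text.
	cov := licensecheck.Scan(data)
	fmt.Printf("recognized %.1f%% of the file\n", cov.Percent)
	for _, m := range cov.Match {
		fmt.Println("matched:", m.ID)
	}
	// Extra commentary in the file drags the match percentage down, which
	// is presumably why "just the license" is the safe choice.
}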
The go side of the story is in issue 75120.
Current status is that everything has a LICENSE file and there are no 429 limits. I'm going to let this run for a while, and then reanalyze the logs. Now that I know more about the nature of the python script poking the proxy, if it's still a problem, I can deal with it in a more targeted manner. So far, I can still see it poking around, when three clones of the same repo arrive six seconds apart, but the long runs of a dozen plus clones are gone.