commit: 9de9ddf072fc1752dee896c5db35b974c438938d
parent: ccb5bbe41197b48e59d650f57266dd42e04e9ecb
Author: Drew DeVault <sir@cmpwn.com>
Date: Wed, 25 May 2022 19:43:22 +0200

Thanks Google

Diffstat:
A content/blog/Google-has-been-DDoSing-sourcehut.md | 88 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 88 insertions(+), 0 deletions(-)
diff --git a/content/blog/Google-has-been-DDoSing-sourcehut.md b/content/blog/Google-has-been-DDoSing-sourcehut.md
@@ -0,0 +1,88 @@
+---
+title: Google has been DDoSing SourceHut for over a year
+date: 2022-05-25
+---
+
+Just now, I took a look at the HTTP logs on git.sr.ht. Of the past 100,000 HTTP
+requests received by git.sr.ht (representing about 2½ hours of logs), 4,774
+came from GoModuleProxy, or 5% of all traffic. And their requests
+are not cheap: every one is a complete git clone. They come in bursts, so every
+few minutes we get a big spike from Go, along with a constant murmur of Go
+traffic.
+
+This has been ongoing since around the release of Go 1.16, which came with some
+changes to how Go uses modules. Since this release, following a gradual ramp-up
+in traffic as the release was rolled out to users, git.sr.ht has had a constant
+floor of I/O and network load, the majority of which can be attributed to Go.
+
+I started to suspect that something strange was going on when our I/O alarms
+started going off in February 2021 (we eventually had to tune these alarms up
+above the floor of I/O noise generated by Go), correlated with lots of activity
+from a Go user agent. I was able to narrow it down with some effort, but to the
+credit of the Go team they did [change their User-Agent to make more apparent
+what was going on][0]. Ultimately, this proved to be the end of the Go team's
+helpfulness in this matter.
+
+[0]: https://github.com/golang/go/issues/44468
+
+I did narrow it down: it turns out that the Go Module Mirror runs some crawlers
+that periodically clone Git repositories with Go modules in them to check for
+updates. Once we had narrowed this down, I filed [a second ticket][1] to address
+the problem.
+
+[1]: https://github.com/golang/go/issues/44577
+
+I came to understand that the design of this feature is questionable. For a
+start, I never really appreciated the fact that Go secretly calls home to Google
+to fetch modules through a proxy (you can set [GOPROXY=direct][2] to fix this).
+Even taking the utility at face value, however, the implementation leaves much
+to be desired. The service is distributed across many nodes which all crawl
+modules independently of one another, resulting in very redundant git traffic.
+
+[2]: https://drewdevault.com/2021/08/06/goproxy-breaks-go.html
+
+```
+140 8a42ab2a4b4563222b9d12a1711696af7e06e4c1092a78e6d9f59be7cb1af275
+ 57 9cc95b73f370133177820982b8b4e635fd208569a60ec07bd4bd798d4252eae7
+ 44 9e730484bdf97915494b441fdd00648f4198be61976b0569338a4e6261cddd0a
+ 44 80228634b72777eeeb3bc478c98a26044ec96375c872c47640569b4c8920c62c
+ 44 5556d6b76c00cfc43882fceac52537b2fdaa7dff314edda7b4434a59e6843422
+ 40 59a244b3afd28ee18d4ca7c4dd0a8bba4d22d9b2ae7712e02b1ba63785cc16b1
+ 40 51f50605aee58c0b7568b3b7b3f936917712787f7ea899cc6fda8b36177a40c7
+ 40 4f454b1baebe27f858e613f3a91dfafcdf73f68e7c9eba0919e51fe7eac5f31b
+```
+
+This is a sample from [a larger set][3] which shows the hashes of git
+repositories on the right (names were hashed for privacy reasons), and the
+number of times they were cloned over the course of an hour. The main culprit is
+the fact that the nodes all crawl independently and don't communicate with each
+other, but the per-node stats are not great either: each IP address still clones
+the same repositories 8-10 times per hour. [Another user][4] hosting their own
+git repos noted a single module being downloaded over 500 times in a single day,
+generating 4 GiB of traffic.
+
+[3]: https://paste.sr.ht/~sircmpwn/b46ad0b13e864923df80cb8e8285bf1661e6f872
+[4]: https://github.com/golang/go/issues/44577#issuecomment-851079949
+
+The Go team holds that this service is not a crawler, and thus they do not obey
+robots.txt. If they did, I could use it to configure a more
+reasonable "Crawl-Delay" to control the pace of their crawling efforts. I also
+suggested keeping the repositories stored on-site and only doing a git fetch,
+rather than a fresh git clone every time, or using shallow clones. They could
+also just fetch fresh data when users request it, instead of proactively
+crawling the cache all of the time. All of these suggestions fell on deaf ears:
+the Go team has not prioritized them, and a year later I am still being DDoSed
+by Google as a matter of course.
+
+I was banned from the Go issue tracker for mysterious reasons,[^1] so I cannot
+continue to nag them for a fix. I can't blackhole their IP addresses, because
+that would make all Go modules hosted on git.sr.ht stop working for default Go
+configurations (i.e. without GOPROXY=direct). I tried to advocate for Linux
+distros to patch out GOPROXY by default, citing privacy reasons, but I was
+unsuccessful. I have no further recourse but to tolerate having our little-fish
+service DoS'd by a 1.38 trillion dollar company. But I will say that if I were
+in their position, and my service were mistakenly sending an excessive amount of
+traffic to someone else, I would make it my first priority to fix it. But I
+suppose no one will get promoted for prioritizing that at Google.
+
+[^1]: In violation of Go's own Code of Conduct, by the way, which requires that participants are notified of moderator actions against them and given the opportunity to appeal. I happen to be well versed in Go's CoC given that I was banned once before without notice, a ban which was later overturned on the grounds that the moderator was wrong in the first place. Great community, guys.