by larkost 267 days ago

Some years back, at a previous employer, I had a related thundering herd problem: I was running an automated testing lab, and if a new job came in after a period of idleness, we would have 100+ computers all downloading 3 (or more) multi-gigabyte files at the same time (software-under-test, symbol files, and compiled tests).

To make matters worse, due to the budget for this lab, we had just three servers that the testing computers could download from. In the worst case, the resulting snarl-up would leave computers waiting as long as two hours before they got the materials needed to run the tests.

My solution was to use peer-to-peer BitTorrent (no trackers involved) with HTTP seeding. The .torrent files listed no trackers, but did list the three servers as HTTP seeds, and the clients were all started with local peer discovery enabled. So the first couple of computers to get the job would pull most or all of the file contents from our servers, and the rest of the computers would wind up getting the file chunks mostly from their peers.
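As a rough illustration of what a trackerless torrent with HTTP seeds looks like, here is a minimal sketch in Python. It is not the code I ran back then. It assumes the BEP 19 style of web seeding (a top-level `url-list` key); the original setup may have used a different HTTP seeding scheme. The function names, file name, and server URL are made up for the example. Local peer discovery is a client-side setting, so it does not appear in the file at all.

```python
import hashlib


def bencode(value):
    """Encode ints, bytes, str, lists, and dicts in BitTorrent's bencoding."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys must be byte strings, emitted in sorted order.
        items = sorted(
            ((k.encode("utf-8") if isinstance(k, str) else k, v)
             for k, v in value.items()),
            key=lambda kv: kv[0],
        )
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError("cannot bencode %r" % type(value))


def build_metainfo(name, stream, piece_length, web_seeds):
    """Build single-file torrent metainfo with web seeds and no tracker.

    `stream` is a binary file-like object, read one piece at a time so that
    multi-gigabyte files never have to fit in memory.
    """
    piece_hashes = []
    length = 0
    while True:
        chunk = stream.read(piece_length)
        if not chunk:
            break
        length += len(chunk)
        piece_hashes.append(hashlib.sha1(chunk).digest())
    info = {
        "name": name,
        "length": length,
        "piece length": piece_length,
        "pieces": b"".join(piece_hashes),
    }
    # Deliberately no "announce" key: peers are expected to find each other
    # via local peer discovery, and the servers act only as HTTP seeds.
    return {"info": info, "url-list": list(web_seeds)}
```

Writing `bencode(build_metainfo(...))` to disk would give a .torrent file that a client with web seed support could load; a real piece length would be much larger than anything used in a quick test, for example 1 MiB or more.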

I did need to do some work so that the clients would first request a URL on the servers that checked for the .torrent file and, if it did not exist, built it (sending the clients a 503 in the meantime, which made them wait a minute or two before retrying).
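The check-or-build endpoint could be sketched roughly like this. Again, this is a hedged illustration and not the original code: `handle_torrent_request`, the `start_build` callback, the directory layout, and the 90-second `Retry-After` value are all assumptions made for the example.

```python
import os
import threading

_builds_in_progress = set()
_builds_lock = threading.Lock()


def handle_torrent_request(name, torrent_dir, start_build, retry_after=90):
    """Return (status, headers, body) for a request for `name`'s .torrent file.

    `start_build(name, path)` is a hypothetical callback that kicks off
    torrent generation in the background and returns immediately. It should
    write to a temporary file and rename it into place, so that a
    half-written .torrent is never served.
    """
    name = os.path.basename(name)  # refuse path traversal in the job name
    path = os.path.join(torrent_dir, name + ".torrent")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return 200, {"Content-Type": "application/x-bittorrent"}, f.read()
    with _builds_lock:
        # Only the first of the 100+ simultaneous requests starts the build.
        if name not in _builds_in_progress:
            _builds_in_progress.add(name)
            start_build(name, path)
    # Everyone else is told to come back later.
    return 503, {"Retry-After": str(retry_after)}, b""
```

The important design point is that the herd of clients hitting this URL triggers exactly one build; every other request gets a cheap 503 and retries after the delay.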

There are lots of things I would do differently if I rebuilt the system (for one, I would write my own peer-to-peer code), but the result was that we rarely had systems waiting more than a few minutes to get the full files. It took the thundering herd and made it its own solution.

2 comments

raffraffraff 267 days ago

Cool. That reminds me of an approach I took back in 2011 when implementing a Linux build / update system in a (small) bank. 8000 machines across hundreds of branches, no servers in the branches, no internet access, limited bandwidth. The goal was to wake one machine (WOL), which would detect an update (via an LDAP attribute) and then rsync the repo update + torrent file. Once complete, that machine would load the torrent, verify the synced files, update its version in LDAP and wake all of its peers. Each peer host would also query LDAP and detect the need to update, but would also notice a peer with the latest version, so it would skip the repo rsync and just grab the torrent file and load it. So a branch with hundreds of hosts would torrent the repo update to each other pretty quickly. Pretty cool tbh, you could PXE boot and rebuild a bunch of hosts remotely, and once built, any one of them could act as an installation point. I even used this to do a distribution change, switching from SLES to Ubuntu.
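The per-host decision boils down to something like the sketch below. This is only an illustration of the logic described above, not the original scripts: the function name and the plain dict of peer versions are stand-ins for the real LDAP queries, and versions are compared as opaque strings.

```python
def choose_update_source(my_version, target_version, peer_versions):
    """Decide where a freshly woken host should get the update from.

    `peer_versions` maps peer hostname -> version string, standing in for
    what each host would read from LDAP. Returns a (source, host) tuple.
    """
    if my_version == target_version:
        return ("none", None)  # already up to date
    for host, version in sorted(peer_versions.items()):
        if version == target_version:
            # A branch peer already has it: fetch the torrent file from
            # that peer and skip the rsync over the slow WAN link.
            return ("peer", host)
    # Nobody in the branch has it yet: this host does the repo rsync.
    return ("repo", None)
```

Only the first machine woken in a branch ends up in the "repo" case; every host woken after it sees an updated peer and takes the "peer" path.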

achalshah 267 days ago

Uber built Kraken to solve the same problem with distributing images: https://github.com/uber/kraken