Hacker News new | ask | show | jobs
by KMag 3732 days ago
Former LimeWire engineer here. I implemented randomized chunk selection for swarmed downloads in LimeWire.

I think the proper swarmed streaming solution is to make the percentage chance of requesting the rarest chunk be a smooth function of the number of replicas of the rarest chunk. If the rarest chunk has only one source, the probability should be 1.0. If there are multiple chunks that are all the most rare, you probably want to randomly select which chunk to request next, with an approximately exponential distribution rather than uniform distribution. Y probability of selecting the first chunk, Y^2 for the second, Y^3 for the third, etc. You'd want to run some simulations to fine-tune the probability function and also the Y percentage.

What I did for LimeWire was (1) if the MIME type wasn't on a streamable whitelist, download all chunks in randomized order. (2) if the file was a streming type and less than 10% complete, start downloading the available chunk closest to the front of the file that isn't currently in progress (3) if the file is a streamable type and 10% to N% complete, randomly select either in-order or randomized selection with probability X. (4) beyond N% complete, always use random chunk selection. I'm pretty sure N was 50 and nearly certain X was 0.5. I originally proposed making X a smooth function of the % downloaded instead of 0, 0.5. 1.0 stair-steps, but the lead developer strongly preferred stair-steps.

The random selection algorithm actually tried to keep the number of ranges of bytes (extents) below 5. So even after 50% downloaded, you still have a 25% chance of getting in-order downloading.

The reason I used randomization instead of rarest-first was that it was my first change to LimeWire, and this was the least invasive change to make. At that time, LimeWire had a global list of verified downloaded chunks and a global list of in-progress chunks, but no global counter for number of replicas.

2 comments

Oh, and if the user was idle more than something like 5, 15, or 30 minutes, LimeWire would switch to random chunk requests regardless of MIME type, assuming the user didn't need a streaming download. I hope full-screen media players prevented the user from being counted as idle.

I vaguely seem to remember WMV and ASF also needing some information from footers in the file, and therefor also prioritizing the last MB of the file.

This was all implemented using the Strategy object oriented design pattern, to make it easier to play around with many alternatives and make specialized strategies for specific MIME types.

> Former LimeWire engineer here

Wow that's an entity I haven't heard in a long time. It would be super interesting to hear more about the engineering behind the software, do you have any blogposts anywhere?

No blog entries specifically about LimeWire, but I do have a few observations that maybe I'll blog about:

    (1) Merkle trees are tough to get right
      (a) Bittorrent's BEP 30 is vulnerable
      (b) A small tweak would have allowed Gnutella's THEX to carry a proof of file length [0]
      (c) Use the Sakura tree construction [0]
      (d) There was an attack against LW where one could respond quickly with a bogus THEX root for a popular SHA-1
      (e) The THEX root should have been the unique identifier in both DHT and query responses
    (2) Using HTTP for data transfer was definitely the right choice
      (a) It uses X-alts and X-nalts "experimental" HTTP headers for swarm control
      (b) I prototyped an Apache plugin to allow it to transparently participate in Gnutella swarms
      (c) HTTP/2.0 would be ideal now
    (3) Gnutella uses query broadcast
      (a) exponential fan-out means most traffic is in the last hop
      (b) if the fanout is 19:1, 95% of traffic is the last hop
      (c) LW used Bloom Filters to often skip the last hop
      (d) We should have used mulitple hash functions in the Bloom filter
      (e) Adding new hash functions is backward-compatible, at the cost of increased query traffic during transition
    (4) LW connection handshake includes the 32-bit serial number of the latest XML version message
      (a) The message is signed using DSA
      (b) Newly signed XML messages propagate to 95% of the network within 60 seconds
      (c) We accidentally DDoSed our servers by having everyone come for updates at the same time
      (d) So we added user alert time randomization parameters in the XML message
      (e) There was no mechanism to roll over or expand version message serial numbers.
      (f) We could have locked ourselves out of asking users to upgrade by signing an INT_MAX serial XML message.
    (4) We wrote a minimal C++ agent capable of downloading the latest free LW version from LW nodes
      (a) SHA-1 of the free installer is part of the signed XML version message above
      (b) SHA-1 was checked before running the full installer, preventing malware injection
      (c) It was great for saving bandwidth and reducing legacy support
    (5) I misplaced a paren in LimeWire QueryKey crypto code (later fixed)
      (a) QueryKeys prevent turning the LW network into a DDoS botnet
      (b) I knew the code wasn't behaving quite right
      (c) I convinced myself that my reasoning was wrong and the code must be right
    (6) Random seeks are tough on equipment
      (a) Apache would kernel-panic OSX on random HTTP range requests (ca 2006)
      (b) Anecdotally, random block download order wasn't great for hard drive life
      (c) Random download order code tried to minimize number of file extents
         (i) Saves bandwidth in describing what you have
         (ii) Might be better for hard drive life 
[0] http://kmagsoftware.blogspot.hk/2016/02/on-content-addressed...