That's wild (in a good/interesting way) to me.

I've found the BitTorrent protocol tries to be more suited to accessing popular data on-demand (i.e. streaming a popular file) vs. archival.

IPFS' BitSwap protocol strikes me as optimized for longer-term preservation (higher time-to-first-byte latency in exchange for more resilient pinning/discovery/propagation of rare data).

It's cool you're observing the opposite. I've had a growing suspicion that both protocols haven't quite realized the benefits they were hoping to get from the trade-offs they made in their transfer/discovery protocols.

Would love to compare notes at some point if you'd be open to it.

We've been playing around with both BitTorrent and IPFS. Some of the datasets we are working towards supporting are approaching the scale you work at (100TB archives).

Ultimately both BitTorrent and IPFS have fallen short for me when trying to seed 100TB datasets.

I've got a hunch that we're going to need to roll a new protocol to tackle these larger datasets that merges some of HTTP's, BitTorrent's, and IPFS' approaches to sharing content.

I have a personal R&D list for pushing a file sharing protocol past the 100TB limit (not in any particular order):

* Better chunking using a mix of:

  * Rolling hashes

  * File boundary splitting

  * (should enable deduplication of identical files across nonhomogeneous archives, and allow for adding content to an archive without losing the existing seeders)

  * (inspired by prior art in container storage: https://github.com/hinshun/ipcs)

* "online" deterministic archive formats w/ detached metadata

  * Ability to share a directory as an archive, or a partial slice of an archive, without having to generate the archive on disk. (Announce a "tarball"-like archive on the DHT without having to build it first, by generating the "chunks" on demand from the directory)

  * Detach the manifest containing the archive's contents from the archive, so you can download/parse the manifest without downloading the full dataset. (You can use this to find the chunks specific files are in, so you can download a single file from a 1TB archive, and the client can seed that file back to the network as part of the archive.)

  * Chunking of manifest files for large datasets, since the manifest itself might grow to many GBs in size (manifest resolution inspired by IPLD's data structure)

  * Normalize file metadata in the archive header so timestamps etc. don't muck up your CIDs

  * Deterministic ordering of files in an archive

* Chunking/Transfers/Announcing/Discovering

  * Support increasing chunk sizes for large files past 1MB. A 100TB dataset w/ 1MB chunks requires ~105M CIDs just for the chunks (~210M at 512KB); that's a lot of load on the DHT and a lot of work on the seeding node to keep the data available.

  * Support interruptible/resumable/recoverable downloads from peers using something similar to HTTP Range header semantics

  * Merge BitTorrent's DHT query approach w/ IPFS' DHT query approach, asking connected peers for CIDs and tit-for-tat reciprocity while simultaneously hedging your bet by kicking off the slower DHT traversal to find more peers

* Connectivity

  * Bringing mobile devices and browser tabs into the fold as first class peers that can both download and seed content

  * (e.g. WebRTC: https://github.com/libp2p/rust-libp2p/tree/master/examples/browser-webrtc)

  * (proof-of-concept NAT hole punching appliance for end-users: https://github.com/retrohacker/turn-it-up)
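
To make the "rolling hashes + file boundary splitting" item above a bit more concrete, here's a rough Python sketch. It is only an illustration under my own assumptions: the gear-style rolling hash, the SHA-256-derived table, the `avg_bits`/`min_size`/`max_size` parameters, and the `split_file`/`chunk_archive` names are all hypothetical and not taken from BitTorrent, IPFS, or ipcs.

```python
import hashlib
import random

# Deterministic table of 64-bit values for a gear-style rolling hash (assumed design).
GEAR = [int.from_bytes(hashlib.sha256(bytes([i])).digest()[:8], "big") for i in range(256)]
MASK64 = (1 << 64) - 1


def split_file(data, avg_bits=12, min_size=1024, max_size=16384):
    """Cut one file into content-defined chunks using a rolling hash."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Older bytes shift out of the 64-bit state, so h tracks a sliding window.
        h = ((h << 1) + GEAR[byte]) & MASK64
        size = i - start + 1
        # Once min_size is reached, cut with probability about 1 in 2**avg_bits per byte.
        at_boundary = size >= min_size and (h >> (64 - avg_bits)) == 0
        if at_boundary or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def chunk_archive(files):
    """Chunk every file on its own so no chunk ever spans a file boundary."""
    return {
        path: [hashlib.sha256(c).hexdigest() for c in split_file(files[path])]
        for path in sorted(files)
    }


# Two unrelated archives that happen to contain one identical file.
rng = random.Random(7)


def sample(n):
    return bytes(rng.getrandbits(8) for _ in range(n))


shared = sample(40_000)
archive_a = {"a/only.bin": sample(20_000), "common/shared.bin": shared}
archive_b = {"b/other.bin": sample(30_000), "common/shared.bin": shared}
manifest_a = chunk_archive(archive_a)
manifest_b = chunk_archive(archive_b)
```

Because chunking restarts at every file boundary, `common/shared.bin` gets the same chunk hashes in both archives, so a peer seeding one archive could in principle serve those chunks for the other. Appending a new file likewise leaves the chunk hashes of existing files untouched.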

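Similarly, the detached-manifest, normalized-metadata, and deterministic-ordering items can be sketched in a few lines of Python. Again this is a guess at one possible shape, not an existing format: the `Entry` type, the manifest fields, the canonical-JSON hashing, and the `build_manifest`/`manifest_id`/`chunks_for_file` helpers are all made up for illustration, and the "archive" is just the files concatenated in sorted-path order.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    path: str
    data: bytes
    mtime: float = 0.0   # deliberately ignored when building the manifest
    mode: int = 0o644


def build_manifest(entries):
    """Describe a virtual archive: files concatenated in sorted-path order."""
    files, offset = [], 0
    for e in sorted(entries, key=lambda e: e.path):
        files.append({
            "path": e.path,
            "offset": offset,                      # byte offset in the virtual archive
            "size": len(e.data),
            "sha256": hashlib.sha256(e.data).hexdigest(),
            "executable": bool(e.mode & 0o111),    # only metadata kept; no timestamps
        })
        offset += len(e.data)
    return {"version": 1, "files": files}


def manifest_id(manifest):
    """Hash a canonical JSON encoding so equal manifests get equal IDs."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def chunks_for_file(manifest, path, chunk_size):
    """Return the fixed-size chunk indices that cover one file."""
    for f in manifest["files"]:
        if f["path"] == path:
            if f["size"] == 0:
                return range(0)
            first = f["offset"] // chunk_size
            last = (f["offset"] + f["size"] - 1) // chunk_size
            return range(first, last + 1)
    raise KeyError(path)


# Same content, different listing order and timestamps.
run_1 = [Entry("b/data.bin", b"x" * 10, mtime=1.0), Entry("a.txt", b"hello", mtime=2.0)]
run_2 = [Entry("a.txt", b"hello", mtime=99.0), Entry("b/data.bin", b"x" * 10, mtime=42.0)]
m1, m2 = build_manifest(run_1), build_manifest(run_2)
```

Here `manifest_id(m1)` and `manifest_id(m2)` match even though the two runs saw the files in a different order with different timestamps, and `chunks_for_file(m1, "b/data.bin", 4)` tells a client which chunks to fetch for that single file without touching the rest of the dataset.
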
Thank you for everything you do