Hacker News
by lobsterslive 990 days ago
It works well in practice. The DHT protocol includes announce messages that broadcast when new files are shared on BitTorrent. It then includes a "geometric" way to find people who are sharing those files. It doesn't include the files themselves, just the torrents which include a file list and location hashes.

If you listen to BitTorrent's DHT network, you'll build an index of everything shared on BitTorrent (over time), this will include commercial movies and such.


>It works well in practice.

Hi, I worked on Gnutella and lots of P2P systems in the early 00s. This will devolve into noise and spam once the number of users who adopt this feature passes a critical mass. With a fully decentralized system, there are no gatekeepers, and as such, there is no way to filter out counterfeit items. While your client will present you with the data you are searching for, you will find out (usually hours later) that your supposed pirated download is actually just a 2-hour loop of Rick Astley (still piracy though, so you are still winning... I think?).

I don't think this project changes any of this? Torrents have been around for decades and this hasn't been a problem yet. We can't rule it out entirely, but at this point it seems unlikely to be worth doing; otherwise we'd see more exploitation.

If the criticism is that a DHT crawler is going to be more subject to this than a website where people upload torrents, that may be the case, but I think the author of this project underestimates the DHT crawling going on. I believe the torrent ecosystem is largely automated and there's little in the way of manual submission or human review going on.

The "problem" is that most users aren't crawling the DHT to find torrents right now. The more people start using DHT crawlers as their primary way of finding new torrents, the more incentive there is to spam the DHT with junk, malware, etc. (because there will be more eyeballs on it).

That is, the usefulness of DHT crawling is inversely proportional to how many people are doing it.

But my second point is that I really think they are crawling the DHT, albeit indirectly. There are many torrent websites and they tend to have the same content. It seems fairly clear to me that this is what most torrent sites are doing. Maybe not the major names that users might submit to, but the long tail of other torrent search indexes certainly. It also seems to be what Popcorn Time does.
While you're technically correct, the protocol is resilient to such attacks, as the number of people participating in a particular torrent is a good indicator of its validity. After all, everyone who was fooled will delete such items and stop sharing them.

New releases of something that just came out tend to suffer from this, though. Sometimes the counterfeits reach escape velocity - the rate of people joining in downloading the counterfeit exceeds the rate of people realizing and stopping, thus giving the illusion of a legit torrent.

Currently this problem is being solved by torrent sites' reputation and comment systems. If we imagine a world where only decentralized indexes like Bitmagnet exist, your prediction is 100% accurate. Participant count only works as a signal if reputation from a reliable site bootstraps the initial popularity of a torrent.

(btw my comment was/is about the DHT crawler)

You are describing a pay-to-play model. The validator is a high seeder/leecher count. Well, does the DHT provide the aggregate bandwidth of each torrent? If not, you can easily spin up 1000+ nodes and connect to your torrent. Tada fake popularity. If bandwidth is known, then you simply raise your costs a bit by running fake clients. There are anti-piracy groups whose entire mandate is to provide noise in the piracy ecosystem. Food for thought: bandwidth costs for this would be a rounding error for e.g. MGM, Universal, or any major content creator.

The DHT does not offer any sort of reputation or comment system. So it's back to centralized torrenting, which is why I suspect DHT crawling has not been a very popular feature.

> If not, you can easily spin up 1000+ nodes and connect to your torrent. Tada fake popularity. If bandwidth is known, then you simply raise your costs a bit by running fake clients.

Sure, but like the other commenter said, this has been possible for years, and yet public trackers aren't swamped with fake torrents. I think in all my years of using BitTorrent I've only ever found a single fake torrent, where the content was inside an encrypted RAR with no key (obviously there was no way to know it was encrypted ahead of time).

You are making my point. A decentralized system will be abused with spam and fraud. A centralized system allows you to moderate the results.
It seems like it would be pretty easy to make it appear that your spam torrent is highly active.
You are correct.
Once you've discovered a torrent being seeded, is there no way to interrogate the seeders and/or the DHT itself, to find out the oldest active seeder registration on that torrent hash; and then use the time-of-oldest-observed-registration to rank torrents that claim to be "the same thing" in their metadata, but which have different piece-trie-hash-root?

I ask, because a similar heuristic is used in crypto wallet software, visibility-weighting the various "versions" of a crypto token with the same metadata, by (in part) which were oldest-created. (The logic being: scam clones of a thing need to first observe the real thing, before they can clone it. So the real thing will always come first.)

Of course, I'm assuming here that you're searching for an "expected to exist" release of a thing by a specific distributor, where the distributor has a known-to-you structured naming scheme to the files in their releases, and so you'll only be trying to rank "versions" of the torrent that all have identical names under this naming scheme, save for e.g. the [hash] part of the file name being different to match the content. This won't help if you're trying to find e.g. "X song by Y artist, by any distributor."
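A rough sketch of that oldest-first ranking, in Python. Everything here is an assumption for illustration: the `Sighting` record, the idea that the crawler logs its own first-observed timestamp for each infohash (I'm not aware of the DHT exposing a registration time you could query), and the premise that release names have already been normalised so that competing "versions" share one name.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sighting:
    """One crawler observation (hypothetical record, not a DHT message)."""
    name: str        # normalised release name, [hash] part stripped
    infohash: str    # hex infohash that was announced
    first_seen: int  # unix time at which *our* crawler saw the announce


def rank_by_first_seen(sightings):
    """Group infohashes by release name and order each group oldest-first."""
    groups = defaultdict(dict)
    for s in sightings:
        earliest = groups[s.name].get(s.infohash)
        # Keep only the earliest observation of each infohash.
        if earliest is None or s.first_seen < earliest:
            groups[s.name][s.infohash] = s.first_seen
    # Sort the infohashes of each group by their earliest observation time.
    return {name: sorted(seen, key=seen.get) for name, seen in groups.items()}


sightings = [
    Sighting("show.s01e01", "bbbb", 200),
    Sighting("show.s01e01", "aaaa", 100),
    Sighting("show.s01e01", "bbbb", 150),
]
ranked = rank_by_first_seen(sightings)
```

Note the ranking is only as good as the crawler's own uptime: a crawler that started late sees the real torrent and the clones at roughly the same moment.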

Gatekeeping is just a bad moderation method in the first place.

What you need is sorting and categorization. If you really want to involve authoritative opinions on metadata, then use a web of trust.

I've yet to see a moderation method that works better than gatekeeping.
But you can still pick the option with the most seeders, which should get you what you're looking for most of the time.

The spam problem isn't nonexistent within the centralized services either.

Hehe, in a popular P2P client from the '03-'05 period, we said the same thing. It turns out there are groups with large amounts of funding that will provide a fake seed count. They either just faked metadata so that there appeared to be a high seed count, backed by bogus nodes that refused connections (which was also real behavior from clients on bad ISPs; we saw valid cases in Asia and Eastern Europe), or they actually streamed data (some of them were on good hosts, seeding multiple Mbps of bad data).

What I'm saying is that it becomes a numbers game, and those fake seeders usually have deep pockets, financed by the content creators themselves.

The way to filter out garbage is to download things with lots of seeds, and if you still happen to download garbage, to immediately stop sharing it.
Chicken/egg problem... as mentioned by someone else above...

https://news.ycombinator.com/item?id=37779341

> New releases of something that just came out tend to suffer from this, though. Sometimes the counterfeits reach escape velocity - the rate of people joining in downloading the counterfeit exceeds the rate of people realizing and stopping, thus giving the illusion of a legit torrent.

It's possible. I never follow new releases. But back in the ed2k days, I'd say about half of the results for just about any file you cared to search for were fake, regardless of age.
You are what's called an edge case. A statistical anomaly. While that is great, you are far from the norm and not the target of this product (or even this particular thread :)
>If you listen to BitTorrent's DHT network, you'll build an index of everything shared on BitTorrent (over time),

Correct me if I'm wrong but as far as I understand, passively listening on DHT would only mean you build up a list of infohashes of everything shared on BitTorrent. You'd actually have to reach out to your DHT peers to know what files the infohashes actually represent.

Wrapping back to grandparent's question of

>Also what happens if illegal content gets scooped up into the index?

I think this could get dicey if someone announces something very illegal like CP, and your crawler starts asking every peer that announced the infohash about its contents with this[0] protocol. This would put your IP into a pretty awful exclusive club of

A, other crawlers

B, actual people wanting to download said CP

[0]: https://www.bittorrent.org/beps/bep_0009.html

> Correct me if I'm wrong but as far as I understand, passively listening on DHT would only mean you build up a list of infohashes of everything shared on BitTorrent. You'd actually have to reach out to your DHT peers to know what files the infohashes actually represent.

Yes, you're correct! I should have stated that you still need to resolve the metadata from the peers that are hosting the files behind each infohash. That's a separate operation from downloading the files' content.
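To make the infohash/metadata split concrete: for v1 torrents the infohash is the SHA-1 of the bencoded `info` dictionary, so whatever metadata a peer hands back can be checked against the hash you already saw on the DHT. A minimal Python sketch follows; the `bencode` helper is a hand-rolled illustration rather than any client's API, the example `info` dict is made up, and no networking is shown.

```python
import hashlib


def bencode(value):
    """Minimal bencoder for ints, bytes/str, lists and dicts (illustrative)."""
    if isinstance(value, bool):
        raise TypeError("bencode has no boolean type")
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        items = [(k.encode("utf-8") if isinstance(k, str) else k, v)
                 for k, v in value.items()]
        items.sort(key=lambda kv: kv[0])  # keys are sorted as raw bytes
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def infohash_v1(info):
    """SHA-1 over the bencoded info dict: the 20-byte v1 infohash."""
    return hashlib.sha1(bencode(info)).digest()


def metadata_matches(raw_info, infohash):
    """Check raw bencoded metadata received from a peer against the infohash."""
    return hashlib.sha1(raw_info).digest() == infohash


# Made-up single-file torrent, for illustration only.
info = {"name": "example.txt", "length": 11,
        "piece length": 16384, "pieces": b"\x00" * 20}
digest = infohash_v1(info)
```

So a passive listener only ever collects the 20-byte digests; the file list lives in the `info` dict, which has to be fetched from a peer and then verified as above.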

Would this get hashes of items shared on private trackers too?
No, because private trackers enforce that all uploaded torrents have DHT, PEX, and LPD disabled. This is usually done with a single tickbox that says “Make torrent private” in the client.

Of course, respecting these options in the torrent file is still up to the client. This is one of the reasons why all private trackers have a client whitelist too.
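For reference, that tickbox sets `private=1` inside the torrent's `info` dictionary (BEP 27), and honouring it is just a check on the client side. A minimal sketch of what a well-behaved client's check might look like, in Python; the function names and the set of source labels are made up for illustration and are not taken from any real client.

```python
def is_private(info):
    """True if the decoded info dict carries the BEP 27 private flag."""
    # Decoders differ on whether keys come back as bytes or str; accept both.
    flag = info.get(b"private", info.get("private", 0))
    return flag == 1


def allowed_peer_sources(info):
    """Peer discovery methods a compliant client would use for this torrent."""
    if is_private(info):
        return {"tracker"}  # no DHT, PEX or LPD for private torrents
    return {"tracker", "dht", "pex", "lpd"}


public_sources = allowed_peer_sources({"name": "a"})
private_sources = allowed_peer_sources({"name": "a", "private": 1})
```

Because the flag sits inside `info`, flipping it changes the infohash, so a client that ignores it is still announcing the same hash the private tracker knows about. Hence the whitelists.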