Hacker News
by mikece 2418 days ago
The peer-to-peer storage and sharing of video files isn't the hard part: it's the ability to search for and discover content that makes YouTube compelling in my opinion. Are any of these p2p YouTube replacements presenting a compelling way to search/discover content in a distributed way? Or if the search index is centralized, do any of them have a compelling model for staying funded without selling everyone's search/catalog information to data brokers?
5 comments

> Are any of these p2p YouTube replacements presenting a compelling way to search/discover content in a distributed way?

Being "distributed" over p2p/federated architecture opposes the end-user convenience of search/discovery/ranking/recommendations because of speed-of-light limitations. I wrote a previous comment about this: https://news.ycombinator.com/item?id=17578332

Also, the earlier reply by posting2fast proposing a partially centralized server doesn't really replace Youtube, because that idea creates new problems of spam videos and untrusted/fake videos. E.g. the central index of metadata says "www.johndoehomeserver.com" has a tutorial video for Algebra but when you actually stream the video from "johndoehomeserver.com", you get a spam video for Viagra instead of math instruction. Therefore, users will naturally gravitate toward the centralized servers that have both the metadata and the actual video content. This emergent group preference would end up recreating another "Youtube"-like clone.

p2p architecture and torrents work well for things like pirated Photoshop or ripped Marvel Avengers movies because the users already have the content's title _preloaded_ in their brain and therefore a centralized index for discovery/serendipity of unknown content isn't necessary.

It does not necessarily oppose it. All that needs to be done is to expose a static, daily generated JSON file that contains all videos on the instance. This has nothing to do with the speed of light.

Anyone then could build a search index and build a good search experience.
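
A minimal sketch of what such an index could look like, in Python. The manifest shape here (a JSON list of objects with `url`, `title`, and an optional `description`) is an assumption for illustration, not PeerTube's actual format, and the `a.example` / `b.example` instances are made up. The manifests are inlined as strings so the example runs without any network access.

```python
import json
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def build_index(manifests):
    """Build an inverted index (token -> set of video URLs) from manifest JSON strings."""
    index = defaultdict(set)
    for raw in manifests:
        for video in json.loads(raw):
            text = video["title"] + " " + video.get("description", "")
            for token in tokenize(text):
                index[token].add(video["url"])
    return index

def search(index, query):
    """Return the sorted URLs of videos that contain every query token (AND search)."""
    tokens = tokenize(query)
    if not tokens:
        return []
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return sorted(results)

# Hypothetical daily manifests published by two instances.
manifest_a = json.dumps([
    {"url": "https://a.example/v/1", "title": "Algebra tutorial", "description": "Linear equations"},
    {"url": "https://a.example/v/2", "title": "Cat toys review"},
])
manifest_b = json.dumps([
    {"url": "https://b.example/v/9", "title": "Intro to algebra", "description": "A tutorial for beginners"},
])

index = build_index([manifest_a, manifest_b])
print(search(index, "algebra tutorial"))
```

This only covers the indexing step; ranking, spam handling, and fetching the manifests on a schedule are left out.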

To combat spam, instances should reveal up/downvotes to indicate quality; I guess your fake math video would not get much love from the community.

>This has nothing to do with the speed of light.

Please take extra care to correctly parse what I actually wrote in response to the gp. Yes, speed-of-light is still a limitation based on the gp's constraint of "search/discovery in a _distributed_ way" which means the search algorithm avoids central servers and loops through a bunch of remote p2p nodes to parse a bunch of exposed JSON manifest files.

If instead, the search algorithm loops through data in a cached index server, that's no longer "search in a distributed way" that the gp was originally wondering about. That's the particular point I was responding to.

>Anyone then could build a search index and build a good search experience.

Now, as to the issue with that "cache index server" that pre-parses the JSON files...

The cache server that also contains the actual video data will naturally attract the most users because when they hit the "play" button on their smartphone, the video starts immediately instead of waiting or suffering stuttering from somebody's flakey home video server.

So, the index server with the "good experience" as perceived by users will be the one that also includes the actual videos -- basically acts as a CDN -- and this emergent behavior of user preferences defeats the decentralized ideals of p2p video.

We see that p2p of things like illegal software already works and is proven. However, p2p of mainstream videos has massive technical hurdles that oppose how typical users like to discover content and play them with immediate gratification.

> If instead, the search algorithm loops through data in a cached index server, that's no longer "search in a distributed way" that the gp was originally wondering about.

So DNS isn't distributed because my computer caches queries?

I think this is arguing semantics rather than practicalities. Centralization isn't binary -- it's a continuum, and we care about it because of the benefits it provides, not because we think it's an end in and of itself. What we care about is the ability to aggregate search results from multiple places, to bypass search if we have a specific video URL that's being shared, and to build our own search engines without running into copyright problems.

If all of those goals can be accomplished with a caching server, then does anyone actually care if it's technically decentralized?

> So, the index server with the "good experience" as perceived by users will be the one that also includes the actual videos -- basically acts as a CDN -- and this emergent behavior of user preferences defeats the decentralized ideals of p2p video.

My reading of this argument is I might as well just host my blog on Medium, because Google search is just another point of centralization. And after all, for speed reasons users will prefer to use a search engine that hosts both the blog and the search results -- so eventually Google search is definitely going to lose to Medium anyway.

But of course Medium isn't going to unseat Google, because in the real world speed improvements are relative, and at a certain point users stop caring, or at least other concerns like range of accessible content and network effects begin to matter a lot more.

> Centralization isn't binary -- it's a continuum

It's both, I would argue. Distributed systems professor here. My lab has been working on an "academically pure" distributed Youtube for 14 years and 7 months now. That means no central servers, no web portals, and no discovery website. Pure Peer-to-Peer and lawyer-proof hopefully. Distributing everything usually means developer productivity drops by roughly 95%. Plus half of our master-level students are not capable of significantly contributing. Decentralised==hard. This is something the "Distributed Apps" generation is re-discovering after the Napster-age Devs got kids/s

> All there needs to be done is to expose a static, daily generated JSON file that contains all videos on the instance.

Or simply make it real-time gossip. Disclaimer: promoting our work here. We implemented a semantic clustered overlay back in 2014 for decentralised video search that could make it just as fast as Google Servers [1]. This year we finished implementing a real-time channel feed of Magnet links (protocol + deployment to our users). Our 51k concurrent users ensure that we can simply re-seed a new Bittorrent hash with 1 million hashes, then everybody updates. Our complete research portfolio, including our decentralised trust function, is at [2].

> does anyone actually care if it's technically decentralized?

That is an interesting question. Our goal is real Internet freedom. In our case, logical decentralisation is a hard requirement. Our users often don't care. Caching servers quickly introduce brittleness and legal issues into your architecture.

[1] https://www.usenix.org/system/files/conference/foci14/foci14...
[2] https://github.com/Tribler/tribler/wiki#current-items-under-...

>So DNS isn't distributed because my computer caches queries?

Again, I'm not talking about a technical engineering component. I'm talking about users' aggregate behaviors. Please see my other reply on how we seem to be talking at different abstraction levels.

>Centralization isn't binary -- it's a continuum, and we care about it because of the benefits it provides, not because we think it's an end in and of itself.

Right, but that's not what I'm arguing. I'm talking about centralization as an emergent phenomenon that bypasses the ideals of decentralized protocols in ways the protocols' designers didn't intend.

>If all of those goals can be accomplished with a caching server, then does anyone actually care if it's technically decentralized?

I guess I don't understand the premise then because if that were true, why would the adjective "distributed" even be mentioned in the question "search/discovery in a _distributed_ way?" To me, something about distributed/decentralized as a characteristic in the technical implementation is very important to the person asking the question.

EDIT: here's another example of that type of "search without central indexing server" question: https://news.ycombinator.com/item?id=20282397

> I'm talking about users' aggregate behaviors.

So am I.

For example, Github currently hosts the majority of Git repositories online, and I've heard people argue that this means Git isn't really decentralized, because the user behavior is to stick everything into a central repository on a central server. But when Microsoft bought Github, lots of people migrated to Gitlab, and (issues notwithstanding) it was easy for them to do so because of Git's distributed architecture. Git was decentralized enough that pivoting from a bad event was still way easier than it would have been with a different architecture.

When I talk about decentralization as a practical concern, I'm not worried about users aggregating around good services. I'm worried about whether the architecture supports moving away from or augmenting those services if something goes wrong in the future.

And what I mean when I talk about centralization as a continuum is that the social aggregated behaviors you're worried about are still strictly better under a PeerTube system than they are under a Youtube system -- so there's no point in bashing PeerTube just because it doesn't solve literally every problem.

If I'm removed from a centralized PeerTube indexing service, my video is still online under the same URL, and I can still point users at a different indexing service. If censorship becomes problematic or widespread, users will move to different indexes because the network lock-in of an indexer is less than the lock-in of a social platform. As far as speed concerns go, users can fall back on slower indexers only when fast ones fail. All of this is workable.

But if I'm removed from Youtube, I have to start over from scratch with a new URL on a different site with different features that doesn't play nicely with any of the existing tools or infrastructure.
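
The indexer fallback described above (fast indexers first, slower ones only when they fail) can be sketched roughly like this. The two indexer functions and their names are hypothetical stand-ins for real network clients, and the ordering of the list is assumed to encode "fastest first".

```python
def search_with_fallback(indexers, query):
    """Try each (name, lookup) pair in order; return the first successful answer."""
    errors = []
    for name, lookup in indexers:
        try:
            return name, lookup(query)
        except Exception as exc:  # a real client would catch narrower network errors
            errors.append((name, exc))
    raise RuntimeError(f"all indexers failed: {errors}")

def fast_indexer(query):
    # Stand-in for a popular indexer that is down or has delisted the video.
    raise ConnectionError("fast indexer unavailable")

def slow_indexer(query):
    # Stand-in for a smaller indexer that still lists the same video URL.
    catalog = {"algebra": ["https://a.example/v/1"]}
    return catalog.get(query, [])

indexers = [("fast", fast_indexer), ("slow", slow_indexer)]
print(search_with_fallback(indexers, "algebra"))
```

The point of the sketch is that the video URL returned by the slow indexer is the same one the fast indexer would have returned, so switching indexers does not break existing links.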

> I'm talking about centralization as an emergent phenomenon

The emergent phenomenon you're talking about is that sometimes better, faster services have more users than bad services. That's not a problem with decentralization, and that's not a problem decentralization is trying to solve. Decentralization is only trying to mitigate the harmful effects of that phenomenon.

It is not a desirable goal of decentralization to make every node in a graph have the same traffic levels -- and I mean that both on a technical and on a cultural level.

You're using "search/discovery in a distributed way" in a very literal manner. I interpreted the question as being one much more meaningful to the vast majority of users: "can a user search for content across the various distributed servers?"

That's all an end-user cares about.

Indexing videos once a day (or once an hour or whatever) would be very feasible. Indeed, different servers could create their own indexes, and some might be better at sorting for relevance than others.

>That's all an end-user cares about.

I imagined gp (mikece) as a HN techie (not an oblivious end-user) and thought he was wondering about how to use programming technology to avoid central servers ... and therefore, me interpreting "search/discovery in a distributed way" in a very literal manner was the appropriate level of abstraction to mikece. Avoiding central servers (if possible) is an interesting goal to discuss because they have a tendency to attract disproportionate users which defeats the goals of decentralization.

>Indexing videos once a day (or once an hour or whatever) would be very feasible.

And here, you're interpreting what's feasible only at the level of the technical stack instead of considering, several chess moves ahead, the emergent group behaviors that render a metadata-only index not end-user friendly as a solution.

>, and some might be better at sorting for relevance than others.

And that's the server that would end up becoming a de facto "centralized" server that people were trying to avoid. This is especially true if that superior server also includes the video data.

Consider that HTTP itself is already decentralized. If that's true, why do people perceive Youtube and Facebook as centralized when they're only nodes on an HTTP network? Because decentralized protocols don't stop emergent group behavior towards centralization.

There could be multiple search servers. I don't understand how that goes against the decentralized nature; links to the video hosting instances would still be essential. Distributed services still benefit from the traffic from Google, DDG, etc. Why is this project special?

I fail to see how a search index would be a bad user experience. Compare that to the current situation of 61 isolated, unsearchable PeerTube instances.

> Being "distributed" over p2p/federated architecture opposes the end-user convenience of search/discovery/ranking/recommendations because of speed-of-light limitations.

Speed of light is not the bottleneck in reaching 1000ms search response time anywhere on earth. Calling it a speed-of-light limitation does a disservice to your point, which really is that querying many peers for search results is slow, for reasons that have nothing to do with the speed of light.
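
A back-of-the-envelope check of that claim, as a small Python calculation. The figures are assumptions: roughly 20,000 km as the farthest great-circle distance between two points on Earth, and light in optical fiber travelling at about 2/3 of its vacuum speed. Real routes are longer than great circles, so treat the result as a lower bound on latency.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458   # speed of light in vacuum
FIBER_FACTOR = 2 / 3                # assumed: signal speed in fiber relative to c
HALF_CIRCUMFERENCE_KM = 20_000      # assumed: approx. farthest point-to-point distance on Earth

def round_trip_ms(distance_km, speed_km_s):
    """Time in milliseconds for a signal to travel the distance and back."""
    return 2 * distance_km / speed_km_s * 1000

vacuum = round_trip_ms(HALF_CIRCUMFERENCE_KM, SPEED_OF_LIGHT_KM_S)
fiber = round_trip_ms(HALF_CIRCUMFERENCE_KM, SPEED_OF_LIGHT_KM_S * FIBER_FACTOR)
print(f"vacuum: {vacuum:.0f} ms, fiber: {fiber:.0f} ms")
# prints "vacuum: 133 ms, fiber: 200 ms"
```

Under these assumptions, even a worst-case round trip uses about a fifth of a 1000 ms budget. What makes querying many peers slow is the number of sequential round trips and the slowest peers, not the propagation delay of a single hop.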

> E.g. the central index of metadata says "www.johndoehomeserver.com" has a tutorial video for Algebra but when you actually stream the video from "johndoehomeserver.com", you get a spam video for Viagra instead of math instruction.

That some video content may not reflect its supposed category or title is not a new problem, is it?

> discover content

Discover heavily sponsored content from content farms. As an experiment, even with an old account, start just browsing the content Youtube highlights. You will soon end with a recommendation page full of shit with 500K+ views using the same template.

I don't understand your point; garbage in, garbage out.

I have a very old account and browse suggestions using my brain, not randomly. While not perfect, almost every recommendation right now looks like something I could watch.

Bicycling, civil engineering, cat toys, and weird metal music mashups: all similar to things I intentionally watched, but videos I haven't seen yet.

You could look at LBRY (https://lbry.tech https://lbry.com https://beta.lbry.tv).

LBRY is similar to PeerTube in openness and decentralization, but all content metadata is written to a blockchain, which means everyone/anyone can access the index. This blockchain can then be searched (https://github.com/lbryio/lighthouse) or extracted to SQL (https://github.com/lbryio/chainquery).

Sadly, LBRY isn't able to track view counts, which makes organic network effects really difficult.

Maybe they could implement a naive "view tracking" by having the client do small proof of work when they interact with content?

Similar to voting in NotaBug? https://github.com/notabugio/notabug

Well that's the real problem when you start competing with Google and the like. They're data behemoths. As long as you don't have as much data collected, they'll have a significant advantage. The solution to this is really simple and hardly conceivable at the same time: make competitively important data public. In this case, that's anonymized video browsing/viewing data. Market competition would flourish, the big guys would lose the monopoly.

>In this case, that's anonymized video browsing/viewing data

So, the publicly available view count and something like the already available Google search analytics? Personalized data offers a huge competitive advantage.

Personalized (as in per user) and anonymized are not mutually exclusive.

Sure, but non-anonymized personalized data still provides a competitive advantage.

I'd say the use for personal identification is cross-platform linking of users. I.e., this user on Google Search and this user on Google Maps are the same person. I agree that this is a competitive advantage, because, in the same vein, it gives the platform owners more data. Technically, anonymized cross-platform-linked data is conceivable.

I think if the legislation ball ever gets rolling two things we're likely to see, because they're low-hanging fruit, are the end of mass tracking on the internet and a meaningful shift in who controls the data gathered.

I can imagine a platform akin to internet banking where you manage your data and its usage.

Something I'd love to see is a "publication" of big-data algorithms. A private entity designs the algorithm for profit and leases it and you run it in your (trusted) environment, owning both the input and output. Nothing leaks.

>I'd say the use for personal identification is cross-platform linking of users.

It's "this person watched this, so they would also be interested in this video and this ad." You can't make this anonymous and nearly as useful, and it is currently YouTube and Google's premium money maker.

Most other data is already available with a little work; providing the data you describe doesn't help competition that much.

But then who will be incentivized to gather the data?

Anyone who needs more, I guess? Or, is the premise that gathering data is resource intensive (like innovation and patents for example)?

These aren't big problems, I think. The problem is that the people who hold the monopolies on data at the moment are also the people who are extremely powerful lobbyists.

> Are any of these p2p YouTube replacements presenting a compelling way to search/discover content in a distributed way?

Ideally we’d see blogs return as the curated content mediators.