by nightpool 595 days ago
But why is it required? Do you really need a copy of everyone's data locally? If the only way to self-host Bluesky is to keep a copy of the entire database, that seems really bad from a scaling perspective.
3 comments

What else would "self-hosting all of Bluesky" mean other than a copy of the entire site? If you just want to participate in the network host a PDS, which only stores your own posts.
Surely there's some middle ground between hosting only your own data (and relying on another site to keep track of your following / followers) and hosting a duplicate copy of the entire network?
For sure. If you just want to host your own data, you can do that. A PDS for you and maybe some friends is very small and cheap to host.
My understanding though is that having a PDS on its own is useless without an AppView to collect the data from the relay? Or am I misunderstanding the architecture here? https://docs.bsky.app/docs/advanced-guides/federation-archit...
I'm talking about the case where you want to run your own PDS and use all of the other infrastructure being run by Bluesky.

If you fully want your own copy of everything, then you'd want to run a copy of everything. But you don't have to. It really depends on what your goals are. That's why the post is about the maximal scenario. "Just your own PDS" is the minimalist scenario. But I think it's the one that makes sense for 95% of users who want to self-host.

Right, and I'm saying "surely there must be a middle ground between 'using all of Bluesky's infrastructure' and 'having a 4.5 TB copy of every post ever made on the network'".
Your following list is stored in your own repo, so it lives on your PDS. You can theoretically have partial replicas of the network, but nobody has bothered yet. If you want to make software like that, a good start would be subscribing to the firehose and filtering down to the DIDs you care about, or supplying the watched DIDs parameter to a Jetstream instance.
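To make the filtering idea concrete, here is a rough sketch, not anything official. It assumes Jetstream accepts a repeated `wantedDids` query parameter and that each event is a JSON object with a top-level `did` field; check both against the Jetstream docs before relying on them. The host in the usage note and the helper names `build_subscribe_url` and `filter_events` are made up for illustration, and the actual WebSocket connection is left out.

```python
import json
from urllib.parse import urlencode


def build_subscribe_url(base_url, dids):
    """Build a subscribe URL that asks the server to pre-filter by DID.

    Assumption: the server reads a repeated `wantedDids` query parameter.
    """
    query = urlencode([("wantedDids", did) for did in dids])
    return f"{base_url}?{query}" if query else base_url


def filter_events(raw_messages, watched_dids):
    """Client-side filter: yield only events authored by a watched DID.

    Assumption: each message is a JSON object with a top-level "did" key.
    """
    watched = set(watched_dids)
    for raw in raw_messages:
        event = json.loads(raw)
        if event.get("did") in watched:
            yield event
```

Usage would look something like `build_subscribe_url("wss://jetstream.example.invalid/subscribe", my_follows)`, then feeding whatever the socket yields through `filter_events` and storing the results. The client-side filter is the part you would need anyway if you read the full firehose instead of a pre-filtered stream.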
The middle ground you're looking for is impossible in the AT protocol; it is, however, what the Nostr protocol is aiming for.
"Self-host an entire copy of all user data" is a pretty cool capability to have, kind of proof that the infrastructure is really open and forkable. You seem to have misunderstood OP's goals. Serving your own data from a personal data server is a much less arduous affair.
Uh, it is not required. You can run only a PDS if you want to self-host your data, and everything will work.

But it is indeed very cool that you can actually host a relay if you want (for fun, learning, or whatever reason).