But why is it required? Do you really need a copy of everyone's data locally? If the only way to self-host bluesky is to have an entire copy of the entire database, that seems like it's really bad from a scaling perspective.
What else would "self-hosting all of Bluesky" mean other than a copy of the entire site? If you just want to participate in the network, host a PDS, which only stores your own posts.
Surely there's some middle ground between only hosting your own data and being reliant on another site to keep track of your following / followers and hosting a duplicate copy of the entire network?
I'm talking about the case where you wanted to run your own PDS and use all of the other infrastructure being run by Bluesky.
If you fully want your own copy of everything, then you'd want to run a copy of everything. But you don't have to. It really depends on what your goals are. That's why the post is about the maximal scenario. "Just your own PDS" is the minimalist scenario. But I think it's the one that makes sense for 95% of users who want to self-host.
Your following list is stored in your own repo, so it lives on your PDS. You can theoretically have partial replicas of the network, but nobody has bothered yet. If you want to make software like that, a good start would be subscribing to the firehose and filtering down to the DIDs you care about, or supplying the wantedDids parameter to a Jetstream instance.
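A rough sketch of the Jetstream route, in case it helps. Assumptions, not gospel: the hostname below is one of the public Jetstream instances as far as I know, `wantedDids` and `wantedCollections` are the query parameter names I believe Jetstream accepts, the event shape (`did`, `kind`, `commit`) is what I understand it emits, and the helper names and example DIDs are made up for illustration. Only the URL building and the client-side filter are shown; the actual WebSocket connection needs a third-party client.

```python
import json
from urllib.parse import urlencode

# Assumed public Jetstream instance; swap in whichever one you actually use.
JETSTREAM_URL = "wss://jetstream2.us-east.bsky.network/subscribe"


def build_subscribe_url(dids, collections=(), base=JETSTREAM_URL):
    """Build a subscribe URL asking the server to pre-filter by DID.

    Repeats the query parameter once per value, e.g. wantedDids=a&wantedDids=b.
    """
    params = [("wantedDids", d) for d in dids]
    params += [("wantedCollections", c) for c in collections]
    return f"{base}?{urlencode(params)}" if params else base


def keep_event(raw, watched):
    """Client-side filter for one JSON message.

    Returns the decoded event if it is a commit from a watched DID,
    otherwise None. Useful as a second check, or when reading an
    unfiltered stream.
    """
    event = json.loads(raw)
    if event.get("kind") == "commit" and event.get("did") in watched:
        return event
    return None


# Hypothetical DIDs, for illustration only.
watched = {"did:plc:example1", "did:plc:example2"}
url = build_subscribe_url(sorted(watched), ["app.bsky.feed.post"])
```

Feed each message from your WebSocket client of choice through `keep_event` and store whatever comes back non-`None`; that is already a partial replica for the accounts you care about.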
"Self-host an entire copy of all user data" is a pretty cool capability to have, kind of proof that the infrastructure is really open and forkable. You seem to have misunderstood OP's goals. Serving your own data from a personal data server is a much less arduous affair.
My point is not the current size, it's the eventual size if bluesky succeeds. Facebook ingests 100TB/day. Self-hosting a bluesky relay isn't (won't be) a thing.
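To put that figure in bandwidth terms, here is a back-of-the-envelope conversion. It takes the 100TB/day number above at face value (not verified) and assumes decimal terabytes and a perfectly even ingest rate, which real traffic would not have.

```python
TB = 10**12              # decimal terabyte, an assumption
SECONDS_PER_DAY = 86_400


def sustained_rate(bytes_per_day):
    """Return (bytes/s, bits/s) for a perfectly even daily ingest."""
    bytes_per_sec = bytes_per_day / SECONDS_PER_DAY
    return bytes_per_sec, bytes_per_sec * 8


bytes_per_sec, bits_per_sec = sustained_rate(100 * TB)
pb_per_year = 100 * 365 / 1000  # TB/day -> PB/year

print(f"{bytes_per_sec / 1e9:.2f} GB/s, "
      f"{bits_per_sec / 1e9:.2f} Gbit/s, "
      f"{pb_per_year:.1f} PB/year")
# prints "1.16 GB/s, 9.26 Gbit/s, 36.5 PB/year"
```

So at that scale a full relay would need roughly a saturated 10 Gbit/s link around the clock, plus tens of petabytes of new storage per year if nothing is discarded.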
It could be a thing. Not for individual tinkerers, but for companies. The fact that today, with 14 million users already, it is still possible for an individual to host it is amazing.