by papaver-somnamb 873 days ago
Tried and rejected SeaweedFS due to Postgres failing to even initialize itself on a POSIX FS volume mounted over SeaweedFS' CSI driver. And that's too bad, because SeaweedFS was otherwise working well!

What we need and haven't identified yet is an SDS system that provides both fully compliant POSIX FS and S3 volumes, is FOSS, has a production story where individuals can do all tasks competently/quickly/effectively (management, monitoring, disaster recovery incl. erasure coding and tooling), and has CSI drivers that work with Nomad.

This rules out Ceph and friends. GarageFS, also mentioned in this thread, is S3 only. We went through everything applicable on the K8S drivers list https://kubernetes-csi.github.io/docs/drivers.html except for Minio, because it claimed it needed a backing store anyways (like Ceph) although just a few days ago I encountered Minio being used standalone.

While I'm on this topic, I noticed that the CSI drivers (SeaweedFS and SDS's in general) use tremendous resources when effecting mounts, instantiating nearly a whole OS (OCI image w/ functional userland) just to mount over what appears to be NFS or something.

3 comments

You do know that you cannot implement a fully compliant POSIX FS with only the S3 API? None of the scalable SDS' support random writes. Atomic rename (for building transactional systems like lakehouse table formats) is not there. Listing of files is often eventually consistent. The closest functional API to a POSIX-compliant one in scalable SDS' is the HDFS API. Only ADLS supports that. But then again, they are the only ones who enable you to FUSE-mount a directory for local FS read/write access. All of the S3 FUSE-mount stuff is fundamentally limited by the S3 API.
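To make the random-write and rename points concrete, here is a toy sketch of what a filesystem layer has to do when all it has is whole-object put/get/delete/list. Everything in it is an assumption for illustration: `ObjectStore` is an in-memory dict, not a real S3 client, and `write_at` and `rename_prefix` are hypothetical helpers. Real S3 has a server-side copy call, but a "rename" is still one copy plus one delete per key.

```python
class ObjectStore:
    """Toy in-memory stand-in for an S3-like API (hypothetical, not a real client).

    Only whole-object operations exist: there is no partial write and no rename.
    """

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = bytes(data)

    def get(self, key):
        return self._objects[key]

    def delete(self, key):
        del self._objects[key]

    def list(self, prefix):
        return sorted(k for k in self._objects if k.startswith(prefix))


def write_at(store, key, offset, data):
    """Emulate a random write: fetch the whole object, patch it, upload it again.

    Returns the number of bytes re-uploaded, which is the full object size
    no matter how small the patch was.
    """
    buf = bytearray(store.get(key))
    end = offset + len(data)
    if end > len(buf):
        buf.extend(b"\0" * (end - len(buf)))
    buf[offset:end] = data
    store.put(key, buf)
    return len(buf)


def rename_prefix(store, src, dst):
    """Emulate a directory rename as copy + delete per object.

    Not atomic: a reader listing either prefix midway sees a partial result.
    Returns the number of store operations issued.
    """
    steps = 0
    for key in store.list(src):
        store.put(dst + key[len(src):], store.get(key))  # "copy"
        store.delete(key)                                 # then remove the source
        steps += 2
    return steps
```

Changing one byte of an 11-byte object re-uploads all 11 bytes, and moving a two-object "directory" costs four operations with a visible intermediate state, which is the gap between this API and POSIX semantics.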
This is where we learned that! Ceph does it, because separate components are responsible for each of underlying storage, S3 API, and FS API. We tripped on the Seaweed FS and the Garage FS indicia, where "FS" in these contexts typically means File System. But, neither SeaweedFS nor GarageFS is a File System at all; with grace and lenience they could be mildly regarded as Filing Systems, but the reality is that they are actually object stores. SeaweedOS? SeaweedS3?
running something like postgres over a networked filesystem sounds very wrong
There was some work done to add an S3 storage backend for ZFS[1], precisely with the goal of running PostgreSQL on effectively external storage.

A key point was to effectively treat S3 as a huge, reliable disk with 10MB "sectors". So the bucket would contain tons of 10MB chunks and ZFS would let S3 handle the redundancy. For performance it was coupled with a large, local SSD-based write-back cache.
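As a rough sketch of the "disk with 10MB sectors" idea, this is how a byte range on the virtual disk could map onto chunk objects in a bucket. It is not the actual ZFS backend: the `chunks/<index>` key layout, the function names, and reading "10MB" as 10 * 1024 * 1024 bytes are all assumptions for illustration.

```python
CHUNK_SIZE = 10 * 1024 * 1024  # assumed: "10MB" taken as 10 MiB


def chunk_key(index):
    """Hypothetical object key for the chunk at a given index."""
    return f"chunks/{index:012d}"


def chunks_for_range(offset, length):
    """Return (key, offset_in_chunk, nbytes) for every chunk a byte range touches."""
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    pieces = []
    end = offset + length
    while offset < end:
        index, within = divmod(offset, CHUNK_SIZE)
        nbytes = min(CHUNK_SIZE - within, end - offset)
        pieces.append((chunk_key(index), within, nbytes))
        offset += nbytes
    return pieces
```

A small write that straddles a chunk boundary touches two objects, so each would have to be rewritten in full. That is presumably part of why a local write-back cache matters for performance.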

Sadly it seems the company behind this figured it needed to keep this closed-source in order to get ROI[2].

[1]: https://youtu.be/opW9KhjOQ3Q

[2]: https://github.com/openzfs/zfs/issues/12119

Neon does this for PostgreSQL and it's open source (more of a code dump, though).
But it also sounds like a dream if it could actually work. If you have enough local, performant disk that you are sharing with the cluster you should be able to get good performance and rely on the system to provide resilience and extra space.
In practice you can't get high availability this way without additional logic and circuit breakers. Running multiple postgres instances with postgres-aware replication and failover is safer and more performant (though harder to set up).
What about JuiceFS?

I've never used it myself and just learned about it from this thread, but it seems to fit the bill.