Hacker News
by chrislusf 867 days ago
Thanks for sharing! I work on SeaweedFS.

SeaweedFS is built on top of a blob storage based on Facebook's Haystack paper. The features are not fully developed yet, but what makes it different is a new way of programming for the cloud era.

When needing some storage, just fallocate some space to write to, and a file_id is returned. Use the file_id similar to a pointer to a memory block.

There will be more features built on top of it. File system and Object store are just a couple of them. Need more help on this.


> what makes it different is a new way of programming for the cloud era.

> just fallocate some space to write to, and a file_id is returned. Use the file_id similar to a pointer to a memory block.

How is that not mmap?

Also what is the difference between a file, an object, a blob, a filesystem and an object store? Is all this just files indexed with sql?

> How is that not mmap?

The allocated storage is append only. For updates, just allocate another blob. The deleted blobs would be garbage collected later. So it is not really mmap.
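
A minimal sketch of the append-only model described here, to make the contrast with mmap concrete. This is a toy in-memory illustration, not SeaweedFS code: the class `AppendOnlyVolume` and all of its method names are hypothetical. The point it shows is that an update never touches existing bytes; it appends a new blob, and the old one becomes garbage until a later compaction pass.

```python
import itertools


class AppendOnlyVolume:
    """Toy append-only blob volume (illustrative only, not SeaweedFS code)."""

    def __init__(self):
        self._data = bytearray()        # the append-only log
        self._index = {}                # file_id -> (offset, size) of live blobs
        self._ids = itertools.count(1)  # hypothetical file_id generator

    def write(self, payload: bytes) -> int:
        """Append a blob and return its new file_id."""
        file_id = next(self._ids)
        self._index[file_id] = (len(self._data), len(payload))
        self._data += payload
        return file_id

    def read(self, file_id: int) -> bytes:
        """One index lookup, then one contiguous read."""
        offset, size = self._index[file_id]
        return bytes(self._data[offset:offset + size])

    def delete(self, file_id: int) -> None:
        """Drop the index entry; the bytes stay in the log until compaction."""
        del self._index[file_id]

    def update(self, file_id: int, payload: bytes) -> int:
        """An update is a fresh write plus a delete of the old blob."""
        new_id = self.write(payload)
        self.delete(file_id)
        return new_id

    def garbage_bytes(self) -> int:
        """Bytes in the log that no live blob refers to."""
        live = sum(size for _, size in self._index.values())
        return len(self._data) - live

    def compact(self) -> None:
        """Garbage-collect by copying only live blobs into a new log."""
        new_data = bytearray()
        new_index = {}
        for file_id, (offset, size) in self._index.items():
            new_index[file_id] = (len(new_data), size)
            new_data += self._data[offset:offset + size]
        self._data, self._index = new_data, new_index
```

With this sketch, `vol.update(a, b"hello world")` returns a different id than `a`, and the five bytes of the old blob count as garbage until `vol.compact()` runs.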

> Also what is the difference between a file, an object, a blob, a filesystem and an object store?

The answer would be too long to fit here. Maybe ChatGPT can help. :)

> Is all this just files indexed with sql?

Sort of yes.

I, too, am interested in your views on the last 2 questions, since your views, not chatGPT's, are what informed the design. Part of learning from others' designs [0] is understanding what the designers think about their own design, and how they came about it.

Would you mind elaborating on them? HN gives a lot of space, and I'm confident you can find a way to summarize without running out, or sounding dismissive (which is what the response kind of sounds like now).

0 – https://aosabook.org/en/

Blob storage is what SeaweedFS is built on. Every blob access takes O(1) network and disk operations.

Files and S3 are higher layers above the blob storage. They require metadata to manage the blobs, and other metadata for directories, S3 access, etc.

This metadata usually sits together with the disks containing the files. But in highly scalable systems, the metadata has dedicated stores, e.g., Google's Colossus, Facebook's Tectonic, etc. The SeaweedFS file system layer is built as a web application that manages the metadata of blobs.

Actually, the SeaweedFS file system implementation is just one way to manage the metadata. There are other possible variations, depending on requirements.

There are a couple of slides on the SeaweedFS GitHub README page. You may get more details there.
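
The layering described in this reply can be sketched roughly as follows. This is an illustration under stated assumptions, not the real SeaweedFS design or API: `BlobStore`, `Filer`, `write_file`, `read_file`, `list_dir`, and the tiny `chunk_size` are all hypothetical names and values. The blob layer only maps a file_id to bytes; the file layer is pure metadata that maps a path to an ordered list of file_ids, and it splits a file into chunks as mentioned further down the thread.

```python
class BlobStore:
    """Stand-in for the blob layer: file_id -> bytes, one lookup per access."""

    def __init__(self):
        self._blobs = {}
        self._next_id = 1

    def put(self, payload: bytes) -> int:
        file_id = self._next_id
        self._next_id += 1
        self._blobs[file_id] = payload
        return file_id

    def get(self, file_id: int) -> bytes:
        return self._blobs[file_id]


class Filer:
    """Hypothetical metadata layer: path -> ordered list of blob file_ids."""

    def __init__(self, blobs: BlobStore, chunk_size: int = 4):
        self._blobs = blobs
        self._chunk_size = chunk_size
        # A plain dict here; a scalable system would keep this metadata
        # in a dedicated store instead.
        self._entries = {}

    def write_file(self, path: str, data: bytes) -> None:
        """Split the data into chunks, store each as a blob, record the ids."""
        size = self._chunk_size
        chunks = [data[i:i + size] for i in range(0, len(data), size)]
        self._entries[path] = [self._blobs.put(chunk) for chunk in chunks]

    def read_file(self, path: str) -> bytes:
        """Look up the chunk list, then fetch each blob by id."""
        return b"".join(self._blobs.get(fid) for fid in self._entries[path])

    def list_dir(self, prefix: str) -> list:
        """Directory listing is a metadata-only operation."""
        return sorted(p for p in self._entries if p.startswith(prefix))
```

Swapping the dict in `Filer` for a different backing store changes nothing in `BlobStore`, which is one way to read the remark that the file system is just one way to manage the metadata.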

Thank you, that was very informative. I appreciate your succinct, information dense writing style, and appreciate it in the documentation, too, after reviewing that.

You made the claim:

> what makes it different is a new way of programming for the cloud era.

but you aren't even explaining how anything is different from what a normal file system can do, let alone what makes it a "new way of programming for the cloud era".

Sorry it was not so clear. Previously, fallocate just allocated disk space on a local server. Now SeaweedFS can allocate a blob on remote storage.

What is the difference between a blob and a file and what is the difference between allocating a blob on remote storage or a file on remote storage?

A large file can be chunked into blobs.

First, the feature set you have built is very impressive.

I think SeaweedFS would really benefit from more documentation on what exactly it does.

People who want to deploy production systems need that, and it would also help potential contributors.

Some examples:

* It says "optimised for small files", but it is not super clear from the whitepaper and other documentation what that means. It mostly talks about how small the per-file overhead is, but that's not enough. For example, on Ceph I can also store 500M files without problem, but then later discover that some operations that happen only infrequently, such as recovery or scrubs, are O(files) and thus have O(files) many seeks, which can mean 2 months of seeks for a recovery of 500M files to finish. ("Recovery" here means when a replica fails and the data is copied to another replica.)

* More on small files: Assuming small files are packed somehow to solve the seek problem, what happens if I delete some files in the middle of the pack? Do I get fragmentation (space wasted by holes)? If yes, is there a defragmentation routine?

* One page https://github.com/seaweedfs/seaweedfs/wiki/Replication#writ... says "volumes are append only", which suggests that there will be fragmentation. But here I need to piece together info from different unrelated pages in order to answer a core question about how SeaweedFS works.

* https://github.com/seaweedfs/seaweedfs/wiki/FAQ#why-files-ar... suggests that "vacuum" is the defragmentation process. It says it triggers automatically when deleted-space overhead reaches 30%. But what performance implications does a vacuum have, can it take long and block some data access? This would be the immediate next question any operator would have.

* Scrubs and integrity: It is common for redundant-storage systems (md-RAID, ZFS, Ceph) to detect and recover from bitrot via checksums and cross-replica comparisons. This requires automatic regular inspections of the stored data ("scrubs"). For SeaweedFS, I can find no docs about it, only some GitHub issues (https://github.com/seaweedfs/seaweedfs/issues?q=scrub) that suggest that there is some script that runs every 17 minutes. But looking at that script, I can't find which command is doing the "repair" action. Note that just having checksums is not enough for preventing bitrot: It helps detect it, but does not guarantee that the target number of replicas is brought back up (as it may take years until you read some data again). For that, regular scrubs are needed.

* Filers: For a production store of a highly-available POSIX FUSE mount I need to choose a suitable Filer backend. There's a useful page about these on https://github.com/seaweedfs/seaweedfs/wiki/Filer-Stores. But they are many, and information is limited to ~8 words per backend. To know how a backend will perform, I need to know both the backend well, and also how SeaweedFS will use it. I will also be subject to the workflows of that backend, e.g. running and upgrading a large HA Postgres is unfortunately not easy. As another example, Postgres itself also does not scale beyond a single machine, unless one uses something like Citus, and I have no info on whether SeaweedFS will work with that.

* The word "Upgrades" seems generally un-mentioned in Wiki and README. How are forward and backward compatibility handled? Can I just switch SeaweedFS versions forward and backward and expect everything will automatically work? For Ceph there are usually detailed instructions on how one should upgrade a large cluster and its clients.
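
Two of the bullets above rest on simple arithmetic, sketched here. The 10 ms per seek figure is an assumption (a typical spinning-disk ballpark, not stated in the comment), and both function names are hypothetical. The vacuum check only restates the 30% threshold quoted from the FAQ; whether the real trigger uses "greater than" or "at least", and how it measures deleted space, is not confirmed here.

```python
def seek_time_days(num_files: int, seek_seconds: float) -> float:
    """Total time for one seek per file, expressed in days."""
    return num_files * seek_seconds / 86_400  # 86,400 seconds per day


def needs_vacuum(deleted_bytes: int, total_bytes: int,
                 threshold: float = 0.30) -> bool:
    """True once the deleted share of a volume reaches the threshold."""
    return total_bytes > 0 and deleted_bytes / total_bytes >= threshold


# 500M files at an assumed 10 ms per seek
days = seek_time_days(500_000_000, 0.010)
print(f"{days:.1f} days")
```

At the assumed 10 ms per seek, one seek per file for 500M files comes to a little under 58 days, which matches the "2 months of seeks" estimate.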

In general the way this should be approached is: Pretend to know nothing about SeaweedFS, and imagine what a user that wants to use it in production wants to know, and what their followup questions would be.

Some parts of that are partially answered in the presentations, but it is difficult to piece together how the software currently works from presentations of different ages (maybe they are already outdated?), and the presentations are also quite light on information (usually only 1 slide per topic). I think the GitHub Wiki is a good way to do it, but it, too, is light on information, and I'm not sure it has everything that's in the presentations.

I understand the README already says "more tools and documentation", I just want to highlight how important the "what does it do and how does it behave" part of documentation is for software like this.