| HN Mirror


by bnewbold 867 days ago

SeaweedFS does the thing: I've used it to store billions of medium-sized XML documents, image thumbnails, PDF files, etc. It fills the gap between "databases" (broadly defined; maybe you can do few-tens-KByte docs but stretching things) and "filesystems" (hard/inefficient in reality to push beyond tens/hundreds of millions of objects; yes I know it is possible with tuning, etc, but SeaweedFS is better-suited).

The docs and operational tooling feel a bit janky at first, but they get the job done, and the whole project is surprisingly feature-rich. I've dealt with basic power-outages, hardware-caused data corruption (cheap old SSDs), etc, and it was possible to recover.

In some ways I feel like the surprising thing is that there is such a gap in open source S3 API blob stores. Minio is very simple and great, but is one-file-per-object on disk (great for maybe 90% of use-cases, but not billions of thumbnails). Ceph et al are quite complex. There are a bunch of almost-sort-kinda solutions like base64-encoded bytes in HBase/postgresql/etc, or chunking (like MongoDB), but really you just want to concatenate the bytes like a .tar file, and index in with range requests.

The Wayback Machine's WARC files plus CDX (index files with offset/range) is pretty close.
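A minimal sketch of the "concatenate the bytes and index in with ranges" idea described above. The `BlobPack` class and its method names are hypothetical, made up for illustration; this is not how SeaweedFS, WARC, or CDX are actually implemented, and it leaves out durability, deletion, and index persistence.

```python
import io


class BlobPack:
    """Append-only pack: blobs are concatenated into one file-like object,
    and an in-memory index maps key -> (offset, length)."""

    def __init__(self, fileobj):
        self.f = fileobj
        self.index = {}

    def put(self, key, data):
        # Always append at the end; remember where the blob landed.
        self.f.seek(0, io.SEEK_END)
        offset = self.f.tell()
        self.f.write(data)
        self.index[key] = (offset, len(data))
        return offset, len(data)

    def get(self, key):
        # A "range request": seek to the offset, read exactly `length` bytes.
        offset, length = self.index[key]
        self.f.seek(offset)
        return self.f.read(length)


pack = BlobPack(io.BytesIO())
pack.put("thumb/1.png", b"\x89PNG-one")
pack.put("doc/2.xml", b"<doc>two</doc>")
print(pack.get("doc/2.xml"))
```

Because every object is only an (offset, length) pair into a large file, a billion small objects cost a billion index entries rather than a billion filesystem entries.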

5 comments

seized 867 days ago

GarageS3 is a nice middle ground: it is not one file on disk per object, but it's simpler than SeaweedFS as well.

https://garagehq.deuxfleurs.fr/

mdaniel 866 days ago

One will want to be cognizant that Garage, like recent MinIO releases, is AGPL https://git.deuxfleurs.fr/Deuxfleurs/garage/src/tag/v0.9.1/L...

I'm not trying to start trouble, only raising awareness because in some environments such a thing matters

anonzzzies 866 days ago

Yes, garage sourcecode is very easy to read and understand. Didn’t read seaweed yet.

ddorian43 866 days ago

Garage has no intention to support erasure coding though.

no_wizard 867 days ago

Written in Go no less, a GC language!

I was expecting C/C++ or Rust, pleasantly surprised to see Go.

maayank 866 days ago

Why pleasantly surprised compared to Rust? What’s the significance of GCing?

anonzzzies 866 days ago

A lot of people regard GCs as something one should not use for low level components like file systems and databases. So that this performs so well might be the surprise for GP.

Varriount 866 days ago

Which is annoying, as there are various GC systems that are near, or even equal to, performance of comparable non-GC systems. (I personally blame Java for most of this)

maeln 866 days ago

Yes and no. While for most applications the GC is hardly an issue and is fast enough, the problem is applications where you need to be able to control exactly when and how memory/objects will be freed. These will never do well with any form of GC. But a looot of software can perform perfectly fine with a GC. If anything, it is mostly Go error handling that is the bigger issue...

vendiddy 866 days ago

Why is Go error handling the bigger issue?

riku_iki 867 days ago

> almost-sort-kinda solutions like base64-encoded bytes in HBase/postgresql/etc

Why would you base64-encode them? Don't they all store binary formats?
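For a sense of what base64 costs when a store can hold raw bytes anyway, here is a small sketch using only Python's standard library. The 30,000-byte payload is an arbitrary example size, not a figure from the thread.

```python
import base64
import os

# Arbitrary example payload: 30,000 random bytes.
raw = os.urandom(30_000)

# base64 emits 4 output characters for every 3 input bytes.
encoded = base64.b64encode(raw)

overhead = len(encoded) / len(raw)
print(len(raw), len(encoded), round(overhead, 3))
```

The encoded form is about one third larger than the raw bytes, before counting the CPU time spent encoding and decoding.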

pilgrim0 867 days ago

I was quite surprised to discover that minio is one file per object. Having read some papers about object stores, this is definitely not what I expected.

blr_lpm 866 days ago

What are the pros/cons of storing one file per object? As a noob in this domain, this made sense to me.

It would be great if you could share the names of, or references to, some papers on this. Thank you in advance.

dspillett 866 days ago

For many small objects a generic filesystem can be less efficient than a more specialised store. Things are being managed that aren't needed for your blob store, block alignment can waste a lot of space, there are often inefficiencies in directories with many files leading to a hierarchical splitting that adds more inefficiency through indirection, etc. The space waste is mitigated somewhat by some filesystems by supporting partial blocks, or including small files directly in the directory entry or other structure (the MFT in NTFS) but this adds an extra complexity.
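The block-alignment point can be put in rough numbers. This sketch assumes a 4096-byte block size and made-up file sizes, and it ignores inode or metadata overhead as well as the partial-block and inline-file optimisations mentioned above, so treat it as illustrative arithmetic only.

```python
def allocated_size(size, block=4096):
    """Bytes occupied on disk if a file is rounded up to whole blocks.

    The 4096-byte block is an assumption; real filesystems vary.
    """
    return -(-size // block) * block  # ceiling division, then scale back up


# Hypothetical small-object sizes in bytes.
sizes = [300, 1_500, 5_000, 12_000]

logical = sum(sizes)
on_disk = sum(allocated_size(s) for s in sizes)
wasted = on_disk - logical

print(logical, on_disk, wasted)
```

With these example sizes, more than a third of the allocated space is padding, and the ratio gets worse as objects get smaller relative to the block size.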

The significance of these inefficiencies will vary depending on your base filesystem. The advantage of using your own storage format rather than naively using a filesystem is you can design around these issues taking different choices around the trade-offs than a general filesystem might, to produce something that is both more space efficient and more efficient to query and update for typical blob access patterns.

The middle ground of using a database rather than a filesystem is usually a compromise: still less efficient than a specially designed storage structure, but perhaps more efficient than a filesystem. Databases tend to have issues (if only inefficiencies) with large objects though, so your blob storage mechanism needs to work around those or just put up with them. A file-per-object store may have a database anyway, for indexing purposes.

A huge advantage of one file per object is simplicity of implementation. Also for some end users the result (a bunch of files rather than one large object) might better fit into their existing backup strategies¹. For many data and load patterns, the disadvantages listed above may hardly matter so the file-per-object approach can be an appropriate choice.

--

[1] Assuming they are not relying on the distributed nature of the blob store², which is naive³ as it doesn't protect you against some things a backup does unless the blob store implements features to help out there (a minimum distributed duplication guarantee for any given piece of data, keeping past versions, etc).

[2] Also note that not all blob stores are distributed, and many are but support single node operation.

[3] Perhaps we need a new variant of the "RAID is not a backup" mantra. "Distributed storage properties are not, by themselves, a backup" or some such.

pilgrim0 866 days ago

The other commenter already outlined the main trade-offs, which boil down to increased latency and storage overhead for the file-per-object model. As for papers, I like the design of Haystack.

https://www.usenix.org/legacy/event/osdi10/tech/full_papers/...

ddorian43 866 days ago

When using HDDs, you want to chunk files at about 1MB-10MB. This helps with read/write scaling/throughput etc.
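A rough sketch of what fixed-size chunking looks like as (offset, length) ranges. The `chunk_ranges` helper is hypothetical, and the 4 MiB default is simply a value picked from inside the 1MB-10MB range mentioned above, not a recommendation from any particular system.

```python
def chunk_ranges(total_size, chunk_size=4 * 1024 * 1024):
    """Yield (offset, length) pairs that cover total_size bytes.

    chunk_size defaults to 4 MiB, an assumed value within the
    1MB-10MB range; the last chunk may be shorter.
    """
    offset = 0
    while offset < total_size:
        length = min(chunk_size, total_size - offset)
        yield offset, length
        offset += length


# A 10 MiB object splits into two full chunks and one partial chunk.
ranges = list(chunk_ranges(10 * 1024 * 1024))
for offset, length in ranges:
    print(offset, length)
```

Each range can then be read, written, or placed on a different disk or server independently.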

XorNot 866 days ago

I imagine very large objects you'd like to be able to shard across multiple servers.

vdm 865 days ago

This has not been true since 2021. https://blog.min.io/minio-optimizes-small-objects/

kyledrake 866 days ago

When you had corruption and failures, what was the general procedure to deal with that? I love SeaweedFS and want to try it (Neocities is a nearly perfect use case), but part of my concern is not having a manual/documentation for the edge cases so I can figure things out on the fringes. I didn't see any documentation around that when I last looked but maybe I missed something.

(If any SeaweedFS devs are seeing this, having a section of the wiki that describes failure situations and how to manage them would be a huge add-on.)

tempest_ 866 days ago

The dev is surprisingly helpful, but yeah, I agree the wiki is in need of some beefing up w.r.t. operations.