by pjc50 113 days ago
Several things going on here:

- concurrency is very hard

- .. but object storage "solves" most of that for you, handing you a set of semantics which work reliably

- single file throughput sucks hilariously badly

- .. because 1Gb is ridiculously large for an atomic unit

- (this whole thing resembles a project I did a decade ago for transactional consistency on TFAT on Flash, except that one somehow managed faster commit times despite running on a 400MHz MIPS CPU. Edit: maybe I should try to remember how that worked and write it up for HN)

- therefore, all of the actual work is shifted to the broker. The broker is just periodically committing its state in case it crashes

- it's not clear whether the broker ACKs requests before they're in durable storage? Is it possible to lose requests in flight anyway?

- there's a great design for a message queue system between multiple nodes that aims for at least once delivery, and has existed for decades, while maintaining high throughput: SMTP. Actually, there's a whole bunch of message queue systems?


> The broker runs a single group commit loop on behalf of all clients, so no one contends for the object. Critically, it doesn't acknowledge a write until the group commit has landed in object storage. No client moves on until its data is durably committed.
Yea, the group commit is the real insight here.

I read this blog post and, to help wrap my head around it, I put together a simple TCP-based KV store with group commit. That helped make it click for me.

https://github.com/a10y/group-commit/
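
For anyone who wants the idea without the TCP part, here is a minimal in-process sketch of group commit. It is not the blog's broker or the linked repo's code; `GroupCommitter` and `commit_fn` are made-up names, and `commit_fn` stands in for whatever durably writes a batch (for example one PUT to object storage). The point is only that many writers share one commit and nobody is acknowledged before it lands.

```python
import threading


class GroupCommitter:
    """Batches writes from many clients into one durable commit per round."""

    def __init__(self, commit_fn):
        # commit_fn is a hypothetical callable that durably persists a list of records.
        self._commit_fn = commit_fn
        self._lock = threading.Lock()
        self._pending = []  # (record, done_event) pairs waiting for the next commit

    def submit(self, record):
        """Queue a record; the returned event is set once the record is committed."""
        done = threading.Event()
        with self._lock:
            self._pending.append((record, done))
        return done

    def commit_once(self):
        """Commit everything queued so far as one batch, then acknowledge the writers."""
        with self._lock:
            batch, self._pending = self._pending, []
        if not batch:
            return 0
        # One durable write covers the whole batch.
        self._commit_fn([record for record, _ in batch])
        # Acknowledge only after the commit has returned.
        for _, done in batch:
            done.set()
        return len(batch)
```

A client would call `submit(...)` and then `wait()` on the returned event, while a single background thread calls `commit_once()` in a loop. Because only that one thread ever writes, there is no contention on the object.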

AFAIK you can kinda "seek" reads in S3 using a range header, WCGW? =D
You can, and it's actually great if you store little "headers" etc to tell you those offsets. Their design doesn't seem super amenable to it because it appears to be one file, but this is why a system that actually intends to scale would break things up. You then cache these headers and, on cache hit, you know "the thing I want is in that chunk of the file, grab it". Throw in bloom filters and now you have a query engine.

Works great for Parquet.
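
A rough sketch of the "cached header tells you the offset" idea: keep a small index of where each chunk lives, and turn a lookup into an HTTP `Range` header. The index layout and the names `range_header` and `header_for` are assumptions for illustration, not anything from the blog post. The only protocol fact relied on is that `Range: bytes=a-b` is inclusive on both ends.

```python
def range_header(offset, length):
    """Build an HTTP Range header for `length` bytes starting at `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # Byte ranges are inclusive, so the last byte is offset + length - 1.
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


def header_for(index, name):
    """Look up a chunk in a cached index of name -> (offset, length).

    Returns None on a cache miss, in which case the caller would have to
    fetch the real header from the object first.
    """
    entry = index.get(name)
    if entry is None:
        return None
    return range_header(*entry)


# Hypothetical cached index for one object.
index = {"users": (0, 4096), "orders": (4096, 10000)}
print(header_for(index, "orders"))
```

The resulting dict can be passed as request headers to any HTTP client that talks to the object store.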

Yep! Other than random reads (p99 of ~200ms on larger ranges), range requests are essential to get good download performance of a single file. A single (range) request can "only" drive ~500 MB/s, so you need multiple requests at different offsets.

https://github.com/sirupsen/napkin-math
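
Here is a sketch of the "multiple offsets" approach: split the object into byte ranges and fetch them concurrently. The `fetch` callable is a placeholder for a real ranged GET (it takes an inclusive start and end and returns bytes); `split_ranges`, `fetch_all`, the part size, and the worker count are all assumptions to be tuned, not measured recommendations.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(size, part_size):
    """Return inclusive (start, end) byte ranges covering an object of `size` bytes."""
    if size < 0 or part_size <= 0:
        raise ValueError("size must be >= 0 and part_size must be > 0")
    return [
        (start, min(start + part_size, size) - 1)
        for start in range(0, size, part_size)
    ]


def fetch_all(fetch, size, part_size, workers=8):
    """Download an object as parallel range requests and reassemble it in order.

    `fetch(start, end)` is a caller-supplied function that returns the bytes
    for one inclusive range, e.g. a wrapper around a ranged GET.
    """
    ranges = split_ranges(size, part_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the same order as the input ranges.
        parts = pool.map(lambda r: fetch(*r), ranges)
        return b"".join(parts)
```

This buffers the whole object in memory, which is fine for a sketch; a real downloader would likely stream parts to disk instead.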

Amazon S3 Select enables SQL queries directly on CSV, JSON, or Apache Parquet objects, allowing retrieval of filtered data subsets to reduce latency and costs
S3 Select is, very sadly, deprecated. It also supported HTTP RANGE headers! But they've killed it and I'll never forgive them :)

Still, it's nbd. You can cache a billion Parquet headers/footers on disk/memory and get 90% of the performance (or better tbh).

Caching Parquet headers/footers sounds super interesting. Can you say more about how you implemented it?
Currently there's nothing in my headers, but the footer is straightforward. There's the schema, row group metadata, some statistics, byte offsets for each column in a group, page index, etc. It's everything you'd want if you wanted to reject a query outright or, if necessary, query extremely efficiently.

min/max stats for a column are huge because I pre-encode any low-cardinality strings into integers. This means I can skip entire row groups without ever touching S3, just with that footer information, and if I don't have it cached I can read it and skip decoding anything that doesn't have my data.
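
To make the row-group skipping concrete, here is a toy version of min/max pruning. `RowGroupStats`, `LEVELS`, and `groups_to_read` are invented for this sketch and are not the commenter's implementation or a Parquet library API; the stats are assumed to have been pulled out of a cached footer already.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    offset: int      # byte offset of the row group in the file
    length: int      # byte length of the row group
    min_value: int   # smallest encoded value of the column in this group
    max_value: int   # largest encoded value of the column in this group


# Hypothetical encoding of a low-cardinality string column into integers.
LEVELS = {"debug": 0, "info": 1, "warn": 2, "error": 3}


def groups_to_read(stats, lo, hi):
    """Keep only row groups whose [min, max] interval overlaps the query range [lo, hi]."""
    return [g for g in stats if g.max_value >= lo and g.min_value <= hi]


stats = [
    RowGroupStats(offset=4, length=1000, min_value=0, max_value=1),
    RowGroupStats(offset=1004, length=1000, min_value=1, max_value=2),
    RowGroupStats(offset=2004, length=1000, min_value=2, max_value=3),
]
code = LEVELS["error"]
# Only groups that might contain "error" survive; the rest are never fetched.
candidates = groups_to_read(stats, code, code)
```

Pruning like this can only rule groups out. A surviving group may still contain no matching rows, so it still has to be read and filtered.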

Footers can get quite large in one sense - 10s-100s of KB for a very large file. But that's obviously tiny compared to a multi-GB Parquet file, and the data can compress extremely well for a second/third-tier cache. You can store 1000s of these pre-parsed in memory no problem, and store 10s of thousands more on disk.

I've spent 0 time optimizing my footers currently. They can get smaller than they are, I assume, but I haven't put much thought into it. In fact, I don't have to assume, I know that my own custom metadata overlaps with the existing Parquet stats and I just haven't bothered to deal with it. TBH there are a bunch of layout optimizations I've yet to explore. Using headers would obviously have some benefits (streaming), whereas right now I do a sort of "attempt to grab the footer from the end in chunks until we find it lol". But it doesn't come up because... caching. And there are worse things than a few spurious RANGE requests.
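
For reference, the "grab the footer from the end" step can be made exact, because an unencrypted Parquet file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`. The sketch below is not the commenter's code, and `footer_range` is a made-up helper. It takes the last bytes of a file (for example from a suffix range request) and returns the inclusive byte range of the footer metadata.

```python
import struct

MAGIC = b"PAR1"
TRAILER_LEN = 8  # 4-byte little-endian footer length + 4-byte magic


def footer_range(tail, file_size):
    """Return the inclusive (start, end) byte range of the Parquet footer metadata.

    `tail` must contain at least the last 8 bytes of the file. Assumes an
    unencrypted file, which ends with the magic bytes PAR1.
    """
    if len(tail) < TRAILER_LEN or tail[-4:] != MAGIC:
        raise ValueError("tail does not look like the end of a Parquet file")
    (footer_len,) = struct.unpack("<I", tail[-8:-4])
    start = file_size - TRAILER_LEN - footer_len
    end = file_size - TRAILER_LEN - 1
    if start < len(MAGIC):
        # The footer cannot overlap the magic bytes at the start of the file.
        raise ValueError("footer length is larger than the file allows")
    return start, end
```

If the tail you already fetched is longer than the footer plus 8 bytes, the footer is already in hand and no second request is needed. Otherwise one more range request for `(start, end)` gets it, which replaces the guess-in-chunks loop with at most two requests.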

Have you tried AWS S3 Tables, which is a managed Iceberg service?
Wow, I didn't know that. To be fair, now that S3 Tables exists, it is rather redundant.