Hacker News new | ask | show | jobs
by sujayakar 807 days ago
this is a really cool idea.

one followup I was thinking of is whether this can generalize to queries other than key value point lookups. if I'm understanding correctly, the article is suggesting to take a key value store, and for every `(key, value)` in the system, split `value` into fragments that are stored on different shards with some `k` of `M` code. then at query time, we can split a query for `key` into `k` subqueries that we send to the relevant shards and reassemble the query results into `value`.

so, if we were to do the same business for an ordered map with range queries, we'd need to find a way to turn a query for `interval: [start, end]` into some number of subqueries that we could send to the different shards and reassemble into the final result. any ideas?

2 comments

There https://ydb.tech/ open source db that uses erasure coding for replication in single zone/region.
In YDB with block 4+2 erasure coding, you need half the disk space compared to mirror-3-dc schema. Meanwhile CPU usage is just a little bit higher, thus in high throughput tests mirror-3-dc wins. Indeed as mentioned in the post there might be a tail latency win in latency runs, but if your task is throughput with a reasonable latencies, replication might be a better choice.
If you only care about throughput, just fetch the data and read should be the same speed as triple replicated.

For writing, triple-rep has to write 2x as much data or more, so it's going to be slower unless your CPUs are horribly slow compared to your drives.

I expect it to save a lot of CPU by only needing 1/3x of compactions. You might want to do a benchmark on that ;). An example is quickwit (building inverted indexes is very expensive).
> so, if we were to do the same business for an ordered map with range queries, we'd need to find a way to turn a query for `interval: [start, end]` into some number of subqueries that we could send to the different shards and reassemble into the final result. any ideas?

Dbs that are backed by s3-like-storage, the storage does this for you, but for blocks of, say, 1MB, and not per-kv (high overhead).

Think you use rocksdb in your db, and erasure-code the sstables.