| HN Mirror


by hnkimb3558 3704 days ago

(employee/founder here)

Anything under 64K is perfectly reasonable to store in a CockroachDB column. Between 64K and maybe 1M is trending towards trouble. Values greater than this are going to go through CockroachDB like a goat through a python.

Why is this the case? For starters, at the level of RocksDB, values greater than 64K are not jammed into SSTables (to avoid constantly rewriting them during compactions of the LSM tree). Instead, individual files are created. Also, CockroachDB has quite a lot of write amplification, which is generally OK for structured relational data, but becomes progressively more terrible for large blobs. Write amplification comes from the Raft log, as well as RocksDB's write-ahead log.

What we really need is an integrated storage system for immutable blobs, something we're taking very seriously. Roughly half of the original team that built Colossus at Google is working on CockroachDB, so there's some knowledge of how to go about building such a system.

While we're not sure where it would fall on our roadmap, the idea is that large blob values would be efficiently replicated and maintained through a separate subsystem. The blob column itself would just contain a pointer to the blob. The value of tight integration (a single CockroachDB cluster providing both an OLTP SQL database and a distributed blob store) would be one deployment and admin console, plus transactionally consistent blob column values (e.g. no fighting S3 eventual consistency).
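To picture the "column holds a pointer" idea, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `BlobStore`, `content_ptr`, and the choice of a SHA-256 digest as the pointer are hypothetical names and choices, not anything CockroachDB provides or has committed to. Content addressing is used only because it is a natural fit for immutable blobs.

```python
import hashlib


class BlobStore:
    """Hypothetical stand-in for a separate immutable blob subsystem."""

    def __init__(self):
        self._blobs = {}  # pointer -> bytes; a real system would replicate this

    def put(self, data: bytes) -> str:
        # Assumption: the pointer is the SHA-256 digest of the content.
        pointer = hashlib.sha256(data).hexdigest()
        # Blobs are immutable, so storing the same content twice is a no-op.
        self._blobs.setdefault(pointer, data)
        return pointer

    def get(self, pointer: str) -> bytes:
        return self._blobs[pointer]


store = BlobStore()

# The relational row carries only the small pointer, never the blob itself.
row = {"id": 1, "filename": "report.pdf", "content_ptr": store.put(b"large payload")}

assert store.get(row["content_ptr"]) == b"large payload"
```

The row stays small no matter how large the blob is, which is the point of keeping large values out of the regular write path described above.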

1 comment

LoSboccacc 3704 days ago

Thanks! That's fantastic to hear. Even if it's going to materialize later/eventually/never, at least it's great to know the need is recognized.

Disregarding write amplification issues for a sec, would it make things better to split binaries into 64K chunks and keep them in a chunk table keyed by name and offset?
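For concreteness, the chunk-table idea could look roughly like the sketch below. It uses Python's built-in `sqlite3` purely as a stand-in so the example runs anywhere; the table name `blob_chunks`, its columns, and the helper functions are all hypothetical, and nothing here says how such a schema would actually perform on CockroachDB.

```python
import sqlite3

CHUNK_SIZE = 64 * 1024  # 64K, the size floated in this thread


def open_demo_db():
    """Create an in-memory database with a chunk table keyed by (name, offset)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE blob_chunks ("
        " name TEXT NOT NULL,"
        " chunk_offset INTEGER NOT NULL,"
        " data BLOB NOT NULL,"
        " PRIMARY KEY (name, chunk_offset))"
    )
    return conn


def split_into_chunks(data, chunk_size=CHUNK_SIZE):
    """Yield (offset, chunk) pairs that cover data in order."""
    for offset in range(0, len(data), chunk_size):
        yield offset, data[offset:offset + chunk_size]


def store_blob(conn, name, data, chunk_size=CHUNK_SIZE):
    """Insert every chunk of one blob inside a single transaction."""
    with conn:
        conn.executemany(
            "INSERT INTO blob_chunks (name, chunk_offset, data) VALUES (?, ?, ?)",
            ((name, off, chunk) for off, chunk in split_into_chunks(data, chunk_size)),
        )


def load_blob(conn, name):
    """Reassemble a blob by reading its chunks in offset order."""
    rows = conn.execute(
        "SELECT data FROM blob_chunks WHERE name = ? ORDER BY chunk_offset",
        (name,),
    )
    return b"".join(row[0] for row in rows)
```

Usage would be along the lines of `conn = open_demo_db()`, then `store_blob(conn, "photo.jpg", payload)` and `load_blob(conn, "photo.jpg")`. Keying on the byte offset rather than a sequence number also allows range reads of part of a blob.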


hnkimb3558 3704 days ago

I'm not really sure which strategy would benchmark best between 64K chunks, 1M chunks, or even 8M chunks. I think this requires some experimentation. Pushing them all through as 64K chunks has a lot of overhead, and you'd reap the full write amplification. Could you tell me a bit more about your use case? You can email me: spencer at cockroachlabs.com.
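To put rough numbers on the overhead side of that trade-off, this small Python sketch counts how many rows one blob turns into at each chunk size. The 100 MiB blob size is an arbitrary assumption, and row count is only a crude proxy for per-row overhead; it says nothing about write amplification, which would need the benchmarking mentioned above.

```python
KIB = 1024
MIB = 1024 * KIB


def chunk_count(blob_size: int, chunk_size: int) -> int:
    """Number of chunks needed to cover blob_size bytes (ceiling division)."""
    return -(-blob_size // chunk_size)


blob_size = 100 * MIB  # hypothetical blob, chosen only for illustration

for label, size in [("64K", 64 * KIB), ("1M", MIB), ("8M", 8 * MIB)]:
    print(f"{label} chunks: {chunk_count(blob_size, size)} rows")
```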

I've been meaning to work on the "CockroachDB Egg Store" (my disgusting name for a blob storage subsystem) as a Free Fridays side project for a while, but have been distracted with all manner of other enticing options. There aren't enough hours in a week...
