| (employee/founder here) Anything under 64K is perfectly reasonable to store in a CockroachDB column. Between 64K and maybe 1M is trending towards trouble. Values larger than that are going to go through CockroachDB like a goat through a python. Why is this the case? For starters, at the level of RocksDB, values greater than 64K are not jammed into SSTables (to avoid constantly rewriting them during compactions of the LSM tree). Instead, individual files are created. Also, CockroachDB has quite a lot of write amplification, which is generally OK for structured relational data but becomes progressively more terrible for large blobs. The write amplification comes from the Raft log as well as RocksDB's write-ahead log. What we really need is an integrated storage system for immutable blobs, something we're taking very seriously. Roughly half of the original team that built Colossus at Google are working on CockroachDB, so there's some knowledge of how to go about building such a system. While we're not sure where it would fall on our roadmap, the idea is that large blob values would be efficiently replicated and maintained through a separate subsystem. The blob column itself would just contain a pointer to the blob. The value of tight integration (a single CockroachDB cluster providing both an OLTP SQL database and a distributed blob store) would be one deployment and admin console, plus transactionally consistent blob column values (e.g. no fighting S3 eventual consistency). |
Disregarding write amplification issues for a sec, would it make things better to split binaries into 64K chunks and store them in a chunk table keyed by name and offset?
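
To make the question concrete, here is a rough sketch of the kind of chunk table I have in mind. It uses Python's built-in `sqlite3` purely as a stand-in so the snippet runs anywhere; it is not CockroachDB-specific, and the table name `blob_chunks`, the column names, and the helpers `make_store`, `put_blob` and `get_blob` are all made up for illustration. It only shows the layout (one row per 64K slice, primary key on name and offset) and says nothing about whether this actually helps with the write amplification described above.

```python
import sqlite3

CHUNK_SIZE = 64 * 1024  # the 64K threshold mentioned in the parent comment


def make_store():
    """Create an in-memory database with a hypothetical chunk table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE blob_chunks ("
        " name TEXT NOT NULL,"
        " off INTEGER NOT NULL,"   # byte offset of this chunk within the blob
        " data BLOB NOT NULL,"
        " PRIMARY KEY (name, off))"
    )
    return conn


def put_blob(conn, name, data, chunk_size=CHUNK_SIZE):
    """Split `data` into fixed-size chunks and store them in one transaction.

    Returns the number of chunks written.
    """
    rows = [
        (name, off, data[off:off + chunk_size])
        for off in range(0, len(data), chunk_size)
    ]
    with conn:  # commits on success, rolls back if an exception is raised
        # Replace any previous version of the blob so stale chunks don't linger.
        conn.execute("DELETE FROM blob_chunks WHERE name = ?", (name,))
        conn.executemany(
            "INSERT INTO blob_chunks (name, off, data) VALUES (?, ?, ?)", rows
        )
    return len(rows)


def get_blob(conn, name):
    """Reassemble a blob by reading its chunks in offset order."""
    cur = conn.execute(
        "SELECT data FROM blob_chunks WHERE name = ? ORDER BY off", (name,)
    )
    return b"".join(row[0] for row in cur)
```

Usage would be something like `put_blob(conn, "image.bin", payload)` followed by `get_blob(conn, "image.bin")`. Since every chunk is written in the same transaction, readers would see either the whole old blob or the whole new one, at least in this toy version.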