

by lobster_johnson 5440 days ago

Yes, one thing that is not apparent from the documentation is that a "bucket" is merely a namespacing thing: Internally, all data is stored in one big bucket, with bucket names prefixing keys.

This means that if you have buckets A (1 million items) and B (5 items), sequentially scanning bucket B will take just as long as scanning bucket A -- because Riak has to scan through the entire store. In other words, the usual advice (avoid scanning a bucket because it's slow when the bucket holds lots of stuff) understates the problem: a scan is too slow to be practically usable in any situation where the store as a whole has more than a few hundred keys, no matter how small the bucket you're scanning is.
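
To make that concrete, here's a toy sketch in Python. It is not Riak's actual storage code; the `ToyStore` class, the `/` separator and the key names are all made up for illustration. It just models "one big keyspace with the bucket name as a key prefix" and counts how many keys a bucket listing has to look at:

```python
from typing import Dict, List, Tuple


class ToyStore:
    """Hypothetical model of a single flat keyspace; not Riak's real implementation."""

    def __init__(self) -> None:
        # One map for everything; the bucket name is only a prefix on the key.
        self.data: Dict[str, str] = {}

    def put(self, bucket: str, key: str, value: str) -> None:
        # The "/" separator is an assumption made for this sketch.
        self.data[f"{bucket}/{key}"] = value

    def list_keys(self, bucket: str) -> Tuple[List[str], int]:
        """Return the keys in `bucket` and how many stored keys were examined."""
        prefix = f"{bucket}/"
        examined = 0
        found: List[str] = []
        for full_key in self.data:  # walks every key in the store
            examined += 1
            if full_key.startswith(prefix):
                found.append(full_key[len(prefix):])
        return found, examined


store = ToyStore()
for i in range(100_000):
    store.put("A", f"item{i}", "x")
for i in range(5):
    store.put("B", f"item{i}", "y")

keys_a, examined_a = store.list_keys("A")
keys_b, examined_b = store.list_keys("B")
print(len(keys_a), examined_a)
print(len(keys_b), examined_b)
```

Listing B returns 5 keys but examines the same number of stored keys as listing A, because the loop can't skip the keys that belong to other buckets.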

I think calling them buckets is a big mistake because the name creates the expectation that they really are separate things. "Namespace" or "keyspace" would have been more appropriate. (Can buckets have different replication semantics? If not, that's even worse.)

Cassandra is loosely based on the same tech as Riak, and supports sequential range queries very well, I hear.