HN Mirror

by devongall 5441 days ago

Definitely agree on the expense of the list-keys operation; be sure to avoid it at all costs.

Some of the Riak documentation was incomplete or incorrect, which made implementation a little sticky, but the mailing list is extremely responsive and helpful.

Otherwise, have had a great experience with Riak thus far. Looking forward to the ease of scaling as well!

2 comments

willbmoss 5440 days ago

It's worth noting that not only is the list-keys operation expensive, but since it uses Bloom filters, it's not guaranteed to return all keys.

My sources at Basho tell me that this is fixed in 1.0, but until that's officially released, basically don't try to list keys.

seancribbs 5440 days ago

Yes, the problem was not necessarily using a Bloom filter, but that it was too small. However, 1.0 is smarter about which vnodes it sends list-keys requests to and thus obviates the need for the Bloom filter (at least for that operation).
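To illustrate how an undersized Bloom filter can lose keys, here is a minimal Python sketch. It assumes the filter is used to skip keys that were "already seen" while merging results; that usage, the `BloomFilter` class, and the `dedupe` helper are hypothetical illustrations and not Riak's actual code. A false positive makes a new key look like a duplicate, so it is silently dropped.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter; one byte per bit for readability (hypothetical, not Riak's)."""

    def __init__(self, num_bits, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, key):
        # Derive num_hashes positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def __contains__(self, key):
        # May return True for a key that was never added (false positive).
        return all(self.bits[pos] for pos in self._positions(key))


def dedupe(keys, num_bits):
    """Return keys not flagged as 'already seen' by a filter of num_bits bits."""
    seen = BloomFilter(num_bits)
    result = []
    for key in keys:
        if key in seen:
            continue  # real duplicate OR false positive: the key is dropped
        seen.add(key)
        result.append(key)
    return result


keys = [f"user:{i}" for i in range(1000)]  # all distinct
small = dedupe(keys, 64)        # far too small for 1000 keys
large = dedupe(keys, 1 << 20)   # comfortably sized
print(len(small), len(large))
```

With only 64 bits, every key that gets through must flip at least one fresh bit, so at most 64 of the 1000 distinct keys can ever be returned. The larger filter makes false positives vanishingly unlikely for this input.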

spahl 5440 days ago

Another operation to avoid is map/reduce over a whole bucket (so-called "bucket scans"). It is extremely heavy and can often be replaced with a more clever schema.
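One possible shape for such a schema, sketched in Python against a plain dict standing in for a key/value store: keep a small index object under a key you can compute, so a lookup becomes a handful of direct gets instead of a scan. The key layout (`orders/...`, `orders_by_user/...`) and the function names are made up for illustration; this is not a Riak client API.

```python
# A plain dict stands in for the key/value store (assumption for illustration).
store = {}


def put_order(user_id, order_id, order):
    """Write the order and append its id to a per-user index object."""
    store[f"orders/{order_id}"] = order
    index_key = f"orders_by_user/{user_id}"
    store.setdefault(index_key, []).append(order_id)


def orders_for_user(user_id):
    """Fetch a user's orders with direct key lookups; no scan over all orders."""
    order_ids = store.get(f"orders_by_user/{user_id}", [])
    return [store[f"orders/{oid}"] for oid in order_ids]


put_order("alice", "o1", {"total": 10})
put_order("bob", "o2", {"total": 25})
put_order("alice", "o3", {"total": 7})
alice_orders = orders_for_user("alice")
print(alice_orders)
```

In a real distributed store, the read-modify-write on the index object would need some form of conflict handling for concurrent writers; the sketch ignores that.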

lobster_johnson 5440 days ago

Yes, one thing that is not apparent from the documentation is that a "bucket" is merely a namespacing thing: Internally, all data is stored in one big bucket, with bucket names prefixing keys.

This means that if you have buckets A (1 million items) and B (5 items), sequentially scanning bucket B will take just as long as scanning bucket A -- because Riak has to scan through the entire store. In other words, it's not enough to say that one should avoid scanning a bucket because it's slow when you have lots of stuff in a bucket; it's always too slow to be practically usable in any situation where you have more than a few hundred keys.
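A toy Python model of the behavior described above, assuming a single flat store whose keys are prefixed with the bucket name. The `store` dict, the `bucket/key` layout, and the helper names are illustrative assumptions, not Riak internals; the bucket sizes are scaled down from the example.

```python
# One flat store for everything; the bucket is only a key prefix (assumed model).
store = {}


def put(bucket, key, value):
    store[f"{bucket}/{key}"] = value


def list_keys(bucket):
    """Return (keys in bucket, number of stored keys examined)."""
    prefix = f"{bucket}/"
    found = []
    examined = 0
    for full_key in store:  # no per-bucket structure, so every key is visited
        examined += 1
        if full_key.startswith(prefix):
            found.append(full_key[len(prefix):])
    return found, examined


for i in range(100_000):
    put("A", f"item{i}", i)
for i in range(5):
    put("B", f"item{i}", i)

keys_b, examined_b = list_keys("B")
keys_a, examined_a = list_keys("A")
print(len(keys_b), examined_b)
print(len(keys_a), examined_a)
```

Listing the 5 keys in B examines exactly as many stored keys as listing the 100,000 in A, which is why the size of the bucket being scanned does not help.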

I think calling them "buckets" is a big mistake, because the name creates the expectation that they really are separate things. "Namespace" or "keyspace" would have been more appropriate. (Can buckets have different replication semantics? If not, that's even worse.)

Cassandra is loosely based on the same tech as Riak, and supports sequential range queries very well, I hear.
