Hacker News new | ask | show | jobs
by jemfinch 4790 days ago
The article's credibility is significantly reduced by referring to Judy Trie lookups as O(log n) operations. They're actually O(log w) operations: the number of operations is bounded by the array's word size, not the size of the Judy array. A million elements in a Judy32 array will cause at most 7 node hops, not the 20 you'd expect from a binary tree.
3 comments

This is the genius of tries. We had to code up the data structures for Twitter for our coursework, for some reason as if it was all in memory. I found a suffix tree (not as clever, it branches on character, so normalised there were 36, and is designed for full text searching) was a really clever way to store that data because search time for a phrase doesn't increase as your data set does. Unrealistic for something like Twitter, but a really useful trick that this article definitely misses.
Twitter's data is all in memory. That memory may not be random-access memory, though.

A pet peeve of mine is how people can do all the CS necessary to get nice data structures for their in-RAM data, but seem to forget everything they know and use very bad structures when they spill to flash or disk storage.

Oops, of course. I meant we didn't have to set up/have access to persistent storage like a database.

That sounds basically exactly like our course so far. I'm only in the first year though, so I guess it's a bit early to comment on course coverage. But yep, all RAM data structures so far.

Can you recommend or link to introductory material on data structures optimized for flash and disk?
You could start by looking at databases.

One useful data structure is the B+ tree http://en.wikipedia.org/wiki/B%2B_tree

Also log-structured merge trees:

http://en.wikipedia.org/wiki/Log-structured_merge-tree

There're a bunch of fascinating performance tradeoffs between B+ trees and LSM trees, which come out when you're choosing between something like MySQL InnoDB or PostGres (B+ tree) vs. LevelDB or Apache Cassandra (LSM trees). LSM trees tend to be faster for inserts and for write-heavy applications, and they offer very fast reads and writes for frequently-accessed data. They also depend upon the bandwidth of the disk rather than the seek time, and bandwidth has of late been increasing significantly faster than seek times. OTOH, they offer very variable latency, as an insert might trigger a major compaction, while B+ trees have a bounded latency limit. A lot of the distributed systems work at Google (a big user of LSM trees via BigTable) comes from the need to work around the poor 99th percentile latencies of LSM trees.

You might want to look for Cache Oblivious Data Structures (and algorithms).
Actually its O(w) and so are optimal hash-tables (since the hashing function must be O(w).

Interestingly enough you can consider hash tables and Tries to be O(log n) since in order to have N distinct keys, the maximum key length must be at least log(N).

7 node hops? It should be at most 4. Judy32 has a maximum branching factor of 256, and log base 256 of 2^32 is 4.
Iirc tries are not dense