| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by jemfinch 4837 days ago
	The article's credibility is significantly reduced by referring to Judy Trie lookups as O(log n) operations. They're actually O(log w) operations: the number of operations is bounded by the array's word size, not the size of the Judy array. A million elements in a Judy32 array will cause at most 7 node hops, not the 20 you'd expect from a binary tree.

3 comments

mcintyre1994 4837 days ago

This is the genius of tries. We had to code up the data structures for Twitter for our coursework, for some reason as if it was all in memory. I found a suffix tree (not as clever, it branches on character, so normalised there were 36, and is designed for full text searching) was a really clever way to store that data because search time for a phrase doesn't increase as your data set does. Unrealistic for something like Twitter, but a really useful trick that this article definitely misses.

link

Peaker 4837 days ago

Twitter's data is all in memory. That memory may not be random-access memory, though.

A pet peeve of mine is how people can do all the CS necessary to get nice data structures for their in-RAM data, but seem to forget everything they know and use very bad structures when they spill to flash or disk storage.

link

mcintyre1994 4837 days ago

Oops, of course. I meant we didn't have to set up/have access to persistent storage like a database.

That sounds basically exactly like our course so far. I'm only in the first year though, so I guess it's a bit early to comment on course coverage. But yep, all RAM data structures so far.

link

snprbob86 4837 days ago

Can you recommend or link to introductory material on data structures optimized for flash and disk?

link

dantheman 4837 days ago

You could start by looking at databases.

One useful data structure is the B+ tree http://en.wikipedia.org/wiki/B%2B_tree

link

nostrademons 4837 days ago

Also log-structured merge trees:

http://en.wikipedia.org/wiki/Log-structured_merge-tree

There're a bunch of fascinating performance tradeoffs between B+ trees and LSM trees, which come out when you're choosing between something like MySQL InnoDB or PostGres (B+ tree) vs. LevelDB or Apache Cassandra (LSM trees). LSM trees tend to be faster for inserts and for write-heavy applications, and they offer very fast reads and writes for frequently-accessed data. They also depend upon the bandwidth of the disk rather than the seek time, and bandwidth has of late been increasing significantly faster than seek times. OTOH, they offer very variable latency, as an insert might trigger a major compaction, while B+ trees have a bounded latency limit. A lot of the distributed systems work at Google (a big user of LSM trees via BigTable) comes from the need to work around the poor 99th percentile latencies of LSM trees.

link

eru 4837 days ago

You might want to look for Cache Oblivious Data Structures (and algorithms).

link

aidenn0 4837 days ago

Actually its O(w) and so are optimal hash-tables (since the hashing function must be O(w).

Interestingly enough you can consider hash tables and Tries to be O(log n) since in order to have N distinct keys, the maximum key length must be at least log(N).

link

AaronFriel 4837 days ago

7 node hops? It should be at most 4. Judy32 has a maximum branching factor of 256, and log base 256 of 2^32 is 4.

link

jamesaguilar 4836 days ago

Iirc tries are not dense

link