Hacker News new | ask | show | jobs
by Peaker 4790 days ago
Twitter's data is all in memory. That memory may not be random-access memory, though.

A pet peeve of mine is how people can do all the CS necessary to get nice data structures for their in-RAM data, but seem to forget everything they know and use very bad structures when they spill to flash or disk storage.

2 comments

Oops, of course. I meant we didn't have to set up/have access to persistent storage like a database.

That sounds basically exactly like our course so far. I'm only in the first year though, so I guess it's a bit early to comment on course coverage. But yep, all RAM data structures so far.

Can you recommend or link to introductory material on data structures optimized for flash and disk?
You could start by looking at databases.

One useful data structure is the B+ tree http://en.wikipedia.org/wiki/B%2B_tree

Also log-structured merge trees:

http://en.wikipedia.org/wiki/Log-structured_merge-tree

There're a bunch of fascinating performance tradeoffs between B+ trees and LSM trees, which come out when you're choosing between something like MySQL InnoDB or PostGres (B+ tree) vs. LevelDB or Apache Cassandra (LSM trees). LSM trees tend to be faster for inserts and for write-heavy applications, and they offer very fast reads and writes for frequently-accessed data. They also depend upon the bandwidth of the disk rather than the seek time, and bandwidth has of late been increasing significantly faster than seek times. OTOH, they offer very variable latency, as an insert might trigger a major compaction, while B+ trees have a bounded latency limit. A lot of the distributed systems work at Google (a big user of LSM trees via BigTable) comes from the need to work around the poor 99th percentile latencies of LSM trees.

You might want to look for Cache Oblivious Data Structures (and algorithms).