by MichaelGG
2962 days ago
Sometimes it's a huge advantage. I wrote a network search engine. On a single 1TB spinning disk, I could handle 5TB of traffic per day, stored and indexed. That's around 2 billion packets indexed. The key was having a log/merge system with only a couple of bits of overhead per entry, plus compressed storage of chunks of packets for the actual data. (This was before LevelDB and Elasticsearch.) In practice the index overhead per packet was only 2-3 bits. This was accomplished with lossy indexes, using hashes of just the right size to minimise false hits. The trade-off is that an occasional extra lookup is worth the vastly reduced size of the compressed indexes. To this day, I'm not aware of any general-purpose, lossy, write-once hashtables that get close to such little overhead. Competitors would use MySQL and insert a row per packet. Their row overhead was more than my entire index. But that approach still worked out: just toss 50k worth of hardware at it. But... it does take a lot of engineering time to write such bespoke software. Just compressing the hashes (a common information retrieval problem) is a huge area, now with SIMD-optimised algorithms and everything.
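
To make the lossy-index trade-off concrete, here is a minimal sketch in Python. It is not the system described above: the names `LossyChunkIndex` and `fingerprint`, the 12-bit fingerprint width, and the use of BLAKE2b are all assumptions chosen for illustration, and it makes no attempt at the compression that would be needed to reach a few bits per entry. It only shows the core idea: store a short hash per key instead of the key itself, accept occasional false hits, and resolve them with an extra lookup against the real data.

```python
import hashlib

# Hypothetical fingerprint width. Fewer bits means a smaller index
# but more false hits; more bits means the opposite.
FINGERPRINT_BITS = 12


def fingerprint(key: bytes, bits: int = FINGERPRINT_BITS) -> int:
    """Return the top `bits` bits of a 64-bit hash of `key`."""
    digest = hashlib.blake2b(key, digest_size=8).digest()
    return int.from_bytes(digest, "big") >> (64 - bits)


class LossyChunkIndex:
    """Write-once index from keys to chunks, storing only short fingerprints."""

    def __init__(self, bits: int = FINGERPRINT_BITS):
        self.bits = bits
        self.chunks = []        # stand-in for compressed chunks of packets
        self.fingerprints = []  # one frozenset of fingerprints per chunk

    def add_chunk(self, keys):
        """Store a chunk and index it by the fingerprints of its keys."""
        keys = list(keys)
        self.chunks.append(keys)
        self.fingerprints.append(
            frozenset(fingerprint(k, self.bits) for k in keys)
        )
        return len(self.chunks) - 1

    def candidates(self, key: bytes):
        """Chunks that might contain `key` (may include false hits)."""
        fp = fingerprint(key, self.bits)
        return [i for i, fps in enumerate(self.fingerprints) if fp in fps]

    def lookup(self, key: bytes):
        """Return (chunks really containing key, number of false hits)."""
        hits, false_hits = [], 0
        for i in self.candidates(key):
            if key in self.chunks[i]:   # the "extra lookup" against real data
                hits.append(i)
            else:
                false_hits += 1
        return hits, false_hits
```

The index can never miss a key that was stored, because equal keys always produce equal fingerprints. It can only over-report, and each over-report costs one wasted chunk read. Picking the fingerprint width is the balance between index size and how often that wasted read happens.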