

by MichaelGG 3009 days ago

Sometimes it's a huge advantage. I wrote a network search engine. On a single 1TB spinning disk, I could handle 5TB of traffic, stored and indexed, per day. That's around 2 billion packets indexed. The key was having a log/merge system with only a couple of bits of overhead per entry, and compressed storage of chunks of packets for the actual data. (This was before LevelDB and Elasticsearch.)

In practice the index overhead per packet was only 2-3 bits. This was accomplished by lossy indexes, using hashes of just the right size to minimise false hits. The trade-off being that an occasional extra lookup is worth the vastly reduced size of compressed indexes.
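To make the idea concrete, here is a minimal sketch of a lossy index built on truncated hashes. It is not the original implementation: the `LossyIndex` class, the `truncated_hash` helper, the choice of BLAKE2 and the per-chunk layout are all assumptions for illustration, and it stores the hashes uncompressed, so it doesn't get anywhere near 2-3 bits per entry. It only shows the trade-off: the index can return a false hit, never a false miss, and the extra lookup against the real data filters out the false hits.

```python
import bisect
import hashlib


def truncated_hash(key: bytes, bits: int) -> int:
    """Hash the key to 64 bits, then keep only the top `bits` bits."""
    digest = hashlib.blake2b(key, digest_size=8).digest()
    return int.from_bytes(digest, "big") >> (64 - bits)


class LossyIndex:
    """Hypothetical per-chunk index of truncated hashes (illustration only)."""

    def __init__(self, bits: int):
        self.bits = bits
        self.chunks = []  # list of (sorted truncated hashes, {key: value})

    def add_chunk(self, records: dict) -> None:
        # Only the truncated hashes go in the index; the keys stay in the chunk.
        hashes = sorted({truncated_hash(k, self.bits) for k in records})
        self.chunks.append((hashes, records))

    def candidates(self, key: bytes) -> list:
        """Chunks that *may* hold the key (false hits are possible)."""
        h = truncated_hash(key, self.bits)
        hits = []
        for i, (hashes, _) in enumerate(self.chunks):
            pos = bisect.bisect_left(hashes, h)
            if pos < len(hashes) and hashes[pos] == h:
                hits.append(i)
        return hits

    def lookup(self, key: bytes) -> list:
        """Check each candidate chunk against the real data to drop false hits."""
        return [
            self.chunks[i][1][key]
            for i in self.candidates(key)
            if key in self.chunks[i][1]
        ]
```

Fewer bits means a smaller index and more wasted chunk reads; more bits means the opposite. Picking "just the right size" is tuning that one parameter against the cost of an extra lookup.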

To this day, I'm not aware of any general-purpose, lossy, write-once hashtables that get close to such little overhead.

Competitors would use MySQL and insert per packet. The row overhead was more than my entire index. But it worked out: just toss 50k of hardware at it.

But... It does take a lot of engineering time to write such bespoke software. Just compressing the hashes (a common info retrieval problem) is a huge area, now with SIMD-optimised algorithms and everything.


tambourine_man 3009 days ago

That’s fascinating. I guess you probably can’t open source it, but I’d love to read a blog post about it.


MichaelGG 3005 days ago

Here's a version: https://github.com/michaelgg/cidb -- Just some of the raw integer k-v storage part. It assumes you already have the hashed entries (you truncate them and the compression takes it from there). It's really more what you'd expect from a college-course IR project, but since I never went to school... oh well.

I used this same library to encode telephone porting (LNP) instructions. That is a database of about 600M entries, mapping one phone number to another. With a bit of manipulation when creating the file, you go from 12GB+ with a naive encoding as strings (one client was using nearly 50GB after expanding it to a hashtable) to under a GB. Still better than any RDBMS can do, and small enough to easily keep in RAM on every routing box.
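For a feel of why sorted integers shrink so much, here is a minimal sketch of delta encoding plus variable-length integers (varints). This is not cidb's actual format, which I'm not reproducing here; the function names are made up for illustration, and real schemes (including the SIMD ones) are more elaborate. It assumes the input is sorted and non-negative.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative int, 7 bits per byte, high bit = 'more follows'."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int) -> tuple:
    """Decode one varint starting at `pos`; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_sorted(values: list) -> bytes:
    """Store each value as the varint-encoded gap from the previous one."""
    out = bytearray()
    prev = 0
    for v in values:
        out += encode_varint(v - prev)
        prev = v
    return bytes(out)


def decode_sorted(buf: bytes) -> list:
    """Rebuild the original sorted values by summing the gaps."""
    values = []
    pos = 0
    prev = 0
    while pos < len(buf):
        delta, pos = decode_varint(buf, pos)
        prev += delta
        values.append(prev)
    return values
```

Ten-digit phone numbers need 5 bytes each as plain varints, or 10+ bytes as strings, but once they are sorted the gaps between neighbours are small and most of them fit in a byte or two.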

Some day I'd like to write it in Rust and implement vectorized encoding and more compression schemes. Like an optimized SSTable just for integers.


CyberDildonics 3009 days ago

I'm going to go out on a limb and guess that it would have been cheaper to upgrade the hardware.


MichaelGG 3009 days ago

Depends on scale. At the higher end, it was near impossible to scale when you're e.g. inserting a MySQL row per packet. But maybe good enough for a viable business. I would probably try to take it as far as possible on Elastic if I were to write it today.

Same thing if you read the Dremel paper. Worrying about bits helps when scaling.


fizx 3009 days ago

Because Lucene wasn't good at near-realtime in 2009 or so, Twitter's original search (acquired via Summize) was written in MySQL. It might have even been a row for every token, not quite sure.

IIRC, when we moved to a highly-customized-Lucene-based system in 2011, we dropped the server count on the cluster from around 400 nodes to around 15.


teraflop 3009 days ago

You can only upgrade hardware so much. If by doing a lot of low-level optimizations, you can remove (or delay) the need to build a complex distributed system, then the optimizations end up paying big dividends beyond just the cost of the machines.


mmt 3009 days ago

I think it's also important to know when this occurred. I've found there's a general tendency among software engineers to (surprise!) believe that it's easier/cheaper to solve the problem of scale in software rather than hardware, and it's often fueled by the misconception that the alternative to doing so is a complex, distributed system.

This is a false dichotomy.

Maybe during the days of the dot-com boom, it was true enough because scaling a single server "vertically" became cost prohibitive very quickly, especially since truly large machines came only from brand-name vendors. That was, however, a very long time ago.

A naive interpretation of Moore's law implies CPU performance today is in the high hundreds of times as fast as back then. Even I/O throughput has improved something like a multiple of mid-10s, IIRC. More importantly, cost has come down, too.

The purchase price premium for getting the highest-performance CPUs (and mobo they need) in a server over the lowest cost per performance option is about 3x. Considering that this is, necessarily, a single [1] server, the base for that premium isn't exactly tremendous. The total cost would seem to be on the same order of magnitude as a team of programmers.

Of course, in the instant example, the database was particularly specialized, including what strikes me as a unique feature, a lossy index. I'd expect data integrity to be one of the huge challenges of databases, which, if relaxed, makes writing a custom one a more reasonable proposition.

[1] Or a modest number, on the order of a dozen, for something like read slaves, rather than the multiple dozens if not hundreds that the distributed system would need.


Retric 3008 days ago

It's not an either-or situation. Often there are ~10-1,000x performance gains to be had in software from the initial production version to the optimized version. Similarly, you can often get a ~10-1,000x speed bump from better hardware.

But the gains become more expensive as you move up the scale. So, at least starting down the software path is often very cheap, with many large gains to be had. Similarly, at least looking at the software before you scale to the next level of hardware tends to be a great investment.

It's not about always looking at software; it's about going back to the software regularly rather than in a one-time push.


mmt 3008 days ago

> It's not an either or situation.

I'm a bit confused... are you agreeing or disagreeing? My point was to call out a false dichotomy and offer a third option.

> It's not about always looking at software

Yet that's exactly what happens. Software engineers completely dominate the field, including management, so they always look at software and only software.


Shikadi 3008 days ago

Moore's law is about transistor count, not performance


mmt 3008 days ago

I'm well aware, which is why I specified a naive interpretation. Still, are you actually saying that transistor count increase in the range of a multiple of 1024 hasn't been matched by comparable CPU performance improvements during that time?
