

by codahale 4327 days ago

You separate your load generation tools from your system under test to eliminate confounding factors of shared resources like CPU, thread scheduler, memory allocator, disk, etc. This benchmark doesn’t do that, which means it’s impossible to distinguish between a queue system which is being saturated with work and a queue system which is out-competing the load generators for CPU. Also, it’s a laptop running OS X. Do you plan on fielding a queue system in a DC built out with Macbooks? No? Then this benchmark might as well be on a phone for all the inferential value it provides to people running Linux servers.

A single producer and a single consumer means zero contention on either side for most implementations. How well does this scale to your actual workload? What’s the overhead of a producer or a consumer? What’s the saturation point or the sustainable throughput according to the Universal Scalability Law? It’s impossible to tell, since this is a data set of one point, which means it can’t distinguish between a system which has a mutex around accepting producer connections and a purely lock-free system. And that’s a shame because wow would those two systems have very different behaviors in production environments with real workloads.
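To make the Universal Scalability Law point concrete, here is a minimal Python sketch of the standard USL formula, X(N) = λN / (1 + σ(N − 1) + κN(N − 1)). The function names and the coefficient values are made up for illustration and are not measurements of any real queue. Fitting σ (contention) and κ (coherency cost) needs throughput measured at several concurrency levels, which a single producer/consumer data point cannot provide.

```python
import math


def usl_throughput(n, lam, sigma, kappa):
    """Throughput predicted by the Universal Scalability Law at concurrency n.

    lam   -- throughput of a single client (N = 1)
    sigma -- contention (serialization) coefficient
    kappa -- coherency (crosstalk) coefficient
    """
    return lam * n / (1 + sigma * (n - 1) + kappa * n * (n - 1))


def usl_peak_concurrency(sigma, kappa):
    """Concurrency at which USL throughput peaks; beyond it, throughput falls."""
    return math.sqrt((1 - sigma) / kappa)


if __name__ == "__main__":
    # Hypothetical coefficients, chosen only to show the shape of the curve.
    lam, sigma, kappa = 1000.0, 0.05, 0.001
    for n in (1, 2, 8, 32, 128, 512):
        print(n, round(usl_throughput(n, lam, sigma, kappa)))
    print("peak near N =", round(usl_peak_concurrency(sigma, kappa), 1))
```

With N = 1 every coefficient drops out of the denominator, so a mutex-guarded system and a lock-free one can report the same number and only diverge as N grows.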

Finally, measuring the mean of the latency distribution is wrong. Latency is never normally distributed, and if they recorded the standard deviation they’d notice it’s several times larger than the mean. What matters with latency are quantiles, ideally corrected for coordinated omission (http://www.infoq.com/presentations/latency-pitfalls).
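A small Python sketch of why the mean misleads and what a coordinated-omission correction does. The sample data is synthetic (99 requests at 1 ms and one 500 ms stall, with an assumed send rate of one request per millisecond), and the helper names are hypothetical. The correction follows the back-filling idea used by HdrHistogram's expected-interval recording: when a response takes longer than the expected interval, add the latencies that the requests stuck behind it would have seen.

```python
import math
import statistics


def quantile(values, q):
    """Nearest-rank quantile of a list of samples, for 0 < q <= 1."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def correct_coordinated_omission(samples, expected_interval):
    """Back-fill the samples a blocked load generator never sent.

    For each recorded latency longer than expected_interval, also record
    latency - interval, latency - 2 * interval, ... down to the interval.
    """
    corrected = []
    for value in samples:
        corrected.append(value)
        missing = value - expected_interval
        while missing >= expected_interval:
            corrected.append(missing)
            missing -= expected_interval
    return corrected


# Synthetic latencies in milliseconds: one stall among fast responses.
raw = [1.0] * 99 + [500.0]
corrected = correct_coordinated_omission(raw, expected_interval=1.0)

for name, data in (("raw", raw), ("corrected", corrected)):
    print(
        name,
        "mean=%.2f" % statistics.mean(data),
        "stdev=%.2f" % statistics.pstdev(data),
        "p50=%.1f" % quantile(data, 0.5),
        "p99.9=%.1f" % quantile(data, 0.999),
    )
```

In the raw data the standard deviation is several times the mean, and the median hides the stall entirely; after back-filling, the median itself moves, because most of the requests that should have been sent during the stall would have waited a long time.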

This is not a benchmark, this is a diary entry.

5 comments

hox 4326 days ago

Not only that, but there is no attempt made to gather metrics based on similar configurations / feature sets. The author specifically mentions that the AMQP systems persist their messages to disk by default, and this was the configuration used for the testing. How, then, are the "benchmarks" even comparable to the ephemeral message queues that don't provide any sort of persistence? Why wasn't persistence turned off to provide more comparable tests?

And why were nanomsg / 0mq even included?

I despise articles like this. They only serve to clutter up useful communication on such technologies.


tylertreat 4326 days ago

Persistence was disabled. Nano and ZMQ are in their own group, and no attempt is made to compare them to the other groups.


adanto6840 4327 days ago

Do you know of any similar posts/links/sources that have more accurate & realistic benchmarks of the same software / messaging queues? That'd be immensely helpful...

We're attempting to optimize this aspect of our stack currently, and I'm sure many others face very similar challenges right now. It's proven to be quite difficult & time-consuming to accurately measure this stuff -- any insight into more accurate/reasonably realistic benchmarks of this type of MQ software would be awesome. :-)


rdtsc 4326 days ago

I think to find out you'd need to at least measure on the same OS and hardware. A lot of things happen between the physical hardware and the kernel socket layer and those might be different between operating systems.

Some of the stuff is difficult and time-consuming because "messaging" is generic enough to be configured and used differently by different users.

Obviously you can cut away some of the choices right off the bat if you are worried about support for some OS (like having to ship on HP-UX), or you need durability, acknowledgement, and high availability, or you want a project with a certain level of maturity and stability, and so on. That cuts the number of systems to test.

Then of course there are things like how they handle concurrency. Just because a single producer and single consumer can do 500K messages per second (which maybe a small benchmark on a co-worker's laptop will show) doesn't mean that the whole thing won't blow up and crash in a burning mess if there are 1000 consumers and producers.


mrottenkolber 4327 days ago

Kudos for summing that up nicely. Collecting and interpreting data for studies comes with great responsibility.


tylertreat 4327 days ago

Yes, you're correct. "Benchmark" is a very unfortunate misnomer.


alexk 4327 days ago

Thanks for the link, do you have more of those?
