by dormando 2950 days ago
How is ElastiCache so slow? What instances does it run on?

Edit: r4.4xlarge, per the link. 16 vCPUs? You should be able to beat it on latency, but beating it on throughput likely means ElastiCache is misconfigured. Or you're sending far too much set traffic (I think I saw you set the benchmark to a 1:1 ratio of gets to sets?).

I wouldn't characterize ElastiCache as slow; a single instance in this case is handling 1.3M requests/sec. But we can be 9x faster by batching multiple requests per packet and then offloading the TCP network stack and memcached operations to the FPGA. The FPGA allows us to handle the requests at network line rate, even with small 100-byte requests. On ElastiCache, past a certain point these small requests start to overload the CPU.

The interesting part is the FPGA could still do much more computation (for example, compression or encryption) while maintaining the same throughput due to hardware pipelining. We described this concept further in the blog post I linked to.

I characterize it as slow because I know software memcached can saturate the packet rate AWS gives that instance. If the packet rate were much higher, then you might win out.

The only reason you can claim 9x latency is that you've saturated the worker threads. You should still win on latency even if it were properly bottlenecking on the network, but 9x throughput and 9x latency is completely false as a capacity limit in this test.

The other issue is that 100 bytes isn't typical. It's common, but almost every user has a varied workload. Deploying FPGAs for the larger cache values ends up being a waste. I even designed a new storage system based on offloading larger cold keys to flash.

Do you have a source showing ElastiCache running faster than this? For example, Redis Labs was only able to achieve 10M req/sec by using 6 m4.16xlarge instances, which are double the price of the CPU instance we used: https://dzone.com/articles/10m-opssec-1msec-latency-with-onl...

100-500 byte values are the majority of requests at companies like Facebook and Lyft for their key-value clusters. For large value sizes the network interface becomes the bottleneck, so FPGAs won’t be able to help.

I wrote the current version of OSS memcached. I don't know how ElastiCache is configured, but as I said, memcached itself can definitely saturate the network from that instance. Either the version they run is too old or it's misconfigured. If I were to compare a "custom FPGA caching service" against something memcached-like, I would take the same 4xlarge instance and just run memcached on it.

On a large enough machine I've gotten it up past 55 million read ops/sec. It's quite good at read throughput.

I'm also familiar with the cache clusters at major companies.

Our assumption was that ElastiCache would be highly optimized by Amazon. Remember that these are virtual machines, which means limitations such as packet-per-second throttling. What specific configuration options do you think are missing?

In the latest version of memcached have you added support for batching/pipelining multiple requests per packet? Because this was crucial for achieving high requests/sec in this example.

Were the 55M requests/sec coming from another machine? Even with small 100B values you would need a minimum of a 44 Gbps network link. How many cores were required? In our benchmark we wanted a fair comparison between instances of similar price and RAM size.
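For reference, the 44 Gbps figure follows from the value payload alone. A quick sketch of that arithmetic, assuming only the 100-byte values are counted (keys, protocol framing, and TCP/IP headers are ignored, so the real requirement would be higher); the function name is just illustrative:

```python
def required_gbps(requests_per_sec: float, payload_bytes: int) -> float:
    """Payload-only bandwidth in Gbit/s.

    Assumption: counts value bytes only, not keys, protocol
    framing, or TCP/IP headers.
    """
    return requests_per_sec * payload_bytes * 8 / 1e9


# 55M requests/sec at 100 bytes each
print(required_gbps(55e6, 100))  # 44.0
```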

They stopped updating it a few years ago; it's probably also not as well tuned as you think. I'd need to see the output of "stats settings" from a running instance to know for sure. I also have no idea if it's a hacked fork or not.

Odds are pretty good it's left at the default of 4 worker threads... so on a 16 vcpu instance that's not going to reach great heights. Since it's a 1.4.x version (years old), it's missing some newer features that both help in average latency and memory efficiency. Or rather, a lot of them are there but disabled by default.
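To check the thread count on an instance you control, something like the following might do. It is a minimal sketch, assuming the server speaks the plain ASCII protocol and answers "stats settings" with "STAT name value" lines followed by END; the helper names are hypothetical and the sample bytes are made up for illustration, not captured from ElastiCache:

```python
import socket


def parse_stats(raw: bytes) -> dict:
    """Turn 'STAT <name> <value>' lines into a dict, ignoring END and blanks."""
    stats = {}
    for line in raw.decode("ascii", errors="replace").split("\r\n"):
        parts = line.split(" ", 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def fetch_stats_settings(host: str, port: int = 11211, timeout: float = 2.0) -> dict:
    """Send 'stats settings' over TCP and read until the END terminator."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"stats settings\r\n")
        buf = b""
        while not buf.endswith(b"END\r\n"):
            chunk = sock.recv(4096)
            if not chunk:  # connection closed early
                break
            buf += chunk
    return parse_stats(buf)


# Illustrative response only (not real output from any particular server).
sample = b"STAT maxconns 1024\r\nSTAT num_threads 4\r\nEND\r\n"
print(parse_stats(sample)["num_threads"])  # 4
```

On a self-managed memcached the worker thread count is set with the -t option at startup, so a 16 vCPU box left at 4 threads would show up here as num_threads 4.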

Memcached has allowed pipelining since it was created. For the ASCII protocol, packing multiple responses into single packets is done via a straight multiget. You can send multiple requests in a single packet for any protocol and any command.
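To make the distinction concrete, here is a small sketch of the two ASCII-protocol request shapes: several separate get commands concatenated into one write (pipelining), versus a single get carrying several keys (multiget). The function names are made up for this example, and it only builds the request bytes; it does not talk to a server:

```python
def pipelined_gets(keys):
    """One 'get' per key, concatenated so all of them can go out in a single write."""
    return b"".join(b"get " + k.encode("ascii") + b"\r\n" for k in keys)


def multiget(keys):
    """A single 'get' naming several keys; the reply is a run of VALUE blocks and one END."""
    return b"get " + b" ".join(k.encode("ascii") for k in keys) + b"\r\n"


keys = ["user:1", "user:2", "user:3"]
print(pipelined_gets(keys))  # b'get user:1\r\nget user:2\r\nget user:3\r\n'
print(multiget(keys))        # b'get user:1 user:2 user:3\r\n'
```

Either byte string can be handed to one send call on a TCP socket; whether it ends up in a single packet depends on its size and the network stack.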

My stress utility (https://github.com/memcached/mc-crusher) has options for pipelining requests and for using ASCII multigets, which pack get responses. I test to the limit of lock scaling for each individual subsystem.

The 55M test required running mc-crusher via localhost; there's no network that can go that fast. My point is that you're limited by network throughput, not the CPU. In that particular 55M test all cores were used, but ~7-8 of them were used by mc-crusher... so the real limit for the machine is even higher. It did have a lot of cores. 48ish?

You can still do apples/apples with instance sizes... but given everything I know about this thing, unless those cores are extremely slow, hitting 11m ops/sec shouldn't be an issue. Or at least, with minimal fiddling it should hit 6-8m, which doesn't give you a crazy 9x figure.

You do need to stop doing a 1:1 get/set ratio though. Sets don't scale as well as gets, mostly because I've never had complaints about their speed, so they haven't been a priority. I'd say a highly conservative test would be 5:1 get/set. Production workloads are typically even higher than that. (That said, I do intend to speed them up more; it's just lowish priority. The LRU locks are highly granular, so spreading sets across different slab classes can help mutation perf a lot.)
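As a rough illustration of what a 5:1 mix means for a benchmark driver, here is a tiny sketch that emits a fixed pattern of five gets followed by one set. It is a hypothetical helper, not how mc-crusher is configured, and a real load generator would likely randomize the order and vary value sizes:

```python
def op_mix(n_ops: int, gets_per_set: int = 5):
    """Return a list of 'get'/'set' ops with gets_per_set gets before each set."""
    cycle = gets_per_set + 1
    return ["set" if i % cycle == gets_per_set else "get" for i in range(n_ops)]


ops = op_mix(12)
print(ops.count("get"), ops.count("set"))  # 10 2
```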