Hacker News
by andrewcanis 2944 days ago
Do you have a source showing elasticache running faster than this? For example, Redis labs was only able to achieve 10M req/sec by using 6 m4.16xlarge instances which are double the price of the CPU instance we used: https://dzone.com/articles/10m-opssec-1msec-latency-with-onl...

100-500 byte values are the majority of requests at companies like Facebook and Lyft for their key value clusters. For large value sizes the network interface becomes the bottleneck so FPGAs won’t be able to help.


I wrote the current version of OSS memcached. I don't know how elasticache is configured, but as I said memcached itself can definitely saturate the network from that instance. Either the version they run is too old or it's misconfigured. If I were to compare a "custom FPGA caching service" vs something memcached like, I would take the same 4xlarge instance and just run memcached on it.

On a large enough machine I've gotten it up past 55 million read ops/sec. It's quite good at read throughput.

I'm also familiar with the cache clusters at major companies.

Our assumption was that elasticache would be highly optimized by Amazon. Remember that these are virtual machines which means limitations such as packet per second throttling. What specific configuration options do you think are missing?

In the latest version of memcached have you added support for batching/pipelining multiple requests per packet? Because this was crucial for achieving high requests/sec in this example.

Were the 55M requests/sec coming from another machine? Even with small 100B values you would need a minimum of a 44 Gbps network link. How many cores were required? In our benchmark we wanted a fair comparison between instances of similar price and RAM size.
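
For reference, the 44 Gbps figure is plain payload arithmetic: requests per second times value size times 8 bits. A minimal sketch follows; the function name is made up for illustration, and it counts only the value bytes, ignoring keys, protocol framing, and TCP/IP headers, so a real link would need more than this.

```python
def required_gbps(requests_per_sec: float, value_bytes: int) -> float:
    """Payload-only bandwidth in Gbit/s (decimal units).

    Assumption: only the value bytes are counted. Keys, protocol framing,
    and TCP/IP/Ethernet headers are ignored, so this is a lower bound.
    """
    bits_per_sec = requests_per_sec * value_bytes * 8
    return bits_per_sec / 1e9


if __name__ == "__main__":
    # 55M requests/sec with 100-byte values, the numbers quoted in the thread.
    print(required_gbps(55e6, 100))
```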

They stopped updating it a few years ago; it's probably also not as well tuned as you think. I'd need to see the output of "stats settings" from a running instance to know for sure. I also have no idea if it's a hacked fork or not.

Odds are pretty good it's left at the default of 4 worker threads... so on a 16 vcpu instance that's not going to reach great heights. Since it's a 1.4.x version (years old), it's missing some newer features that both help in average latency and memory efficiency. Or rather, a lot of them are there but disabled by default.

Memcached has allowed pipelining since it was created. For the ASCII protocol, packing multiple responses into single packets is done via a straight multiget. You can send multiple requests in a single packet for any protocol and any command.
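
To make the distinction concrete, here is a rough sketch of what the two request styles look like on the wire in the ASCII protocol. The helper names and the foobar0..foobar2 keys are invented for illustration; only the "get <key> ...\r\n" framing comes from the protocol itself.

```python
def pipelined_gets(keys):
    """One 'get <key>' command per key, concatenated into a single buffer.

    Writing this buffer in one send lets several commands share a packet;
    the server still answers each command separately.
    """
    return b"".join(b"get " + k.encode("ascii") + b"\r\n" for k in keys)


def multiget(keys):
    """A single 'get' command naming every key (a multiget)."""
    return b"get " + b" ".join(k.encode("ascii") for k in keys) + b"\r\n"


if __name__ == "__main__":
    keys = ["foobar%d" % i for i in range(3)]  # hypothetical key names
    print(pipelined_gets(keys))
    print(multiget(keys))
```

Both buffers can be written to the socket in one call. The difference is that the first is three commands with three separate responses, while the second is one command with one combined response.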

My stress utility (https://github.com/memcached/mc-crusher) has options for pipelining requests and for using ASCII multigets to pack get responses. I test to the limit of lock scaling for each individual subsystem.

The 55M test required running mc-crusher via localhost, there's no network that can go that fast. My point is you're limited by the network throughput, not the CPU. In that particular 55M test, all cores were used, but ~7-8 of them were used by mc-crusher... so the real limit for the machine is even higher. It did have a lot of cores. 48ish?

You can still do apples/apples with instance sizes... but given everything I know about this thing, unless those cores are extremely slow, hitting 11m ops/sec shouldn't be an issue. Or at least, with minimal fiddling it should hit 6-8m, which doesn't give you a crazy 9x figure.

You do need to stop doing 1:1 get/set ratio though. Sets don't scale very well since I've generally never had complaints about the speed. I'd say a highly conservative test would be 5:1 get/set. Production workloads are typically even higher than that. (that said I do intend to speed them up more, it's just lowish priority.. the LRU locks are highly granular, so spreading sets across different slab classes can help mutation perf a lot).

I'm still seeing similar results (~1M req/sec) after compiling your latest version of memcached from github and running with 16 worker threads. I just spun up two r4.4xlarge instances (one for client and one for the memcached server). I'm using memtier_benchmark with pipelining of 16 requests, 100B values, 10:1 get/set ratio. I compiled mc-crusher but you'll have to let me know the command to run because the readme wasn't clear.

One main constraint here is that we are using AWS virtual machine instances in the cloud. My guess is your previous experience is with physical servers. The FPGA performance is also significantly better when you can use the physical board with a direct Ethernet connection; pipelining isn't required in this case because the FPGA can handle minimum-sized Ethernet packets at line rate.

Another question, in your experience is compression/encryption used much with memcached? Because this is another area where the FPGA can compute much faster.

mis-threaded my response below (didn't have a reply button?), so see that too.

Just signed up for a personal AWS account and manually started an r4.4xlarge for target and a c5.4xlarge for source (same CPUs and networking capability, I think; it wasn't allowing me to just start two r4.4xlarge...).

Got it up to 15M hits/sec for a pure mget test.

results: https://gist.github.com/dormando/910134e85279710b970bd2c8af8...

Thanks for the details on how to use your benchmark script and for taking the time to investigate this. I hadn’t heard of your benchmark before and mc-crusher seems to work a bit differently than memtier_benchmark.

First a few significant differences:

1) Your value size is 10B which completely changes the results. Let’s keep the value size at 100B, which is more realistic.

2) The ratio of gets to sets significantly affects the requests per sec. We were assuming 1:1 ratio when we did our measurements. Increasing the percentage of gets really speeds up req/sec. We didn’t observe this effect on elasticache. Is this a recent improvement in the github version of memcached?

3) Your benchmark is using multiple keys in the same get command. What memtier does is pipeline multiple get commands each with one key. This seems more realistic.

4) We pipelined 16 get commands per packet while your configuration had 50 keys per get command.

I was able to reproduce the same setup as we had with ~1.2M req/sec with your mc-crusher benchmark using the following config. This has 1:1 get to set ratio with pipeline 16 and value size 100B.

send=ascii_set,recv=blind_read,conns=50,key_prefix=foobar,key_prealloc=0,pipelines=16,value_size=100
send=ascii_set,recv=blind_read,conns=50,key_prefix=foobar,key_prealloc=0,pipelines=16,value_size=100,thread=1
send=ascii_set,recv=blind_read,conns=50,key_prefix=foobar,key_prealloc=0,pipelines=16,value_size=100,thread=1
send=ascii_set,recv=blind_read,conns=50,key_prefix=foobar,key_prealloc=0,pipelines=16,value_size=100,thread=1
send=ascii_get,recv=blind_read,conns=50,pipelines=16,key_prefix=foobar,key_prealloc=1
send=ascii_get,recv=blind_read,conns=50,pipelines=16,key_prefix=foobar,key_prealloc=1,thread=1
send=ascii_get,recv=blind_read,conns=50,pipelines=16,key_prefix=foobar,key_prealloc=1,thread=1
send=ascii_get,recv=blind_read,conns=50,pipelines=16,key_prefix=foobar,key_prealloc=1,thread=1
send=ascii_get,recv=blind_read,conns=50,pipelines=16,key_prefix=foobar,key_prealloc=1,thread=1

I used the github memcached on an r4.4xlarge. I ran memcache-top on the server instance to measure the requests per second, showing about 750k gets/sec and 600k sets/sec.

With a ratio of 10:1 gets to sets I’m seeing about 3.5M req/sec which seems better than elasticache.

Try 12 workers to start, there are some bg threads.

I've plenty of experience with both hardware and virtual machines, I just don't use AWS myself much. I can get something like 800k read ops/sec from a 4core raspberry pi2, and I hope the AWS instance isn't that terrible.

with mc-crusher:

./mc-crusher conf/someconfigfile ipaddress port

https://github.com/memcached/mc-crusher/blob/master/conf/asc... - this is a decent read test with pipelining (give the test a few seconds to get through its sets). The inbound requests are pipelined, but it'll still send each get response in individual packets. This is what I use to test syscall/interrupt overhead.

https://github.com/memcached/mc-crusher/blob/master/conf/mge... this is the same thing, but with mgets. I'd copy the set line from ascii too:

send=ascii_set,recv=blind_read,conns=10,key_prefix=foobar,key_prealloc=0,pipelines=4,stop_after=200000,usleep=1000,value_size=10
send=ascii_mget,recv=blind_read,conns=50,mget_count=50,key_prefix=foobar,key_prealloc=1

You can vary value_size and mget_count to see how that changes things. You can also pre-warm with the 'bench-warmer' script that comes with it, or remove stop_after and adjust usleep to adjust get/set ratios.

Watch top on the client host, and if mc-crusher is capping out its CPU cores, add more lines to the test but with the (confusing, sorry) threading enabled:

send=ascii_set,recv=blind_read,conns=10,key_prefix=foobar,key_prealloc=0,pipelines=4,stop_after=200000,usleep=1000,value_size=10
send=ascii_mget,recv=blind_read,conns=50,mget_count=50,key_prefix=foobar,key_prealloc=1
send=ascii_mget,recv=blind_read,conns=50,mget_count=50,key_prefix=foobar,key_prealloc=1,thread=1

That puts the first two tests on the "main" thread, then spawns an extra thread for the third test. you can keep copy/pasting that last line until the client or the server are saturated.
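
If copy/pasting gets tedious, something like the following could generate the config text. This is only a sketch: the function name and its parameters are invented, and the option strings are copied verbatim from the example lines in this comment rather than from any documented mc-crusher interface.

```python
def mget_config(extra_threads: int, value_size: int = 10, mget_count: int = 50) -> str:
    """Build mc-crusher config text: one set line, one mget line on the main
    thread, then `extra_threads` copies of the mget line with thread=1.

    Assumption: option names are taken as-is from the example config lines
    in the thread; nothing here is validated against mc-crusher itself.
    """
    set_line = (
        "send=ascii_set,recv=blind_read,conns=10,key_prefix=foobar,"
        "key_prealloc=0,pipelines=4,stop_after=200000,usleep=1000,"
        "value_size=%d" % value_size
    )
    get_line = (
        "send=ascii_mget,recv=blind_read,conns=50,mget_count=%d,"
        "key_prefix=foobar,key_prealloc=1" % mget_count
    )
    lines = [set_line, get_line]
    # Each extra line runs the same mget test on its own thread.
    lines += [get_line + ",thread=1"] * extra_threads
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Two extra mget threads; write the output to a file and pass that file
    # as the config argument to mc-crusher.
    print(mget_config(2), end="")
```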

edit: sorry, the enc/compression question:

1) compression is typically done in the client to reduce bandwidth overhead. It's not very useful in the server.

2) encryption is becoming more popular, but doesn't currently exist much. The mainline OSS doesn't even have TLS support yet. Almost all use cases are on internal networks. FPGAs could potentially help there... AES-NI on Intel CPUs isn't awful though.