| Hi! I'd like to offer some hopefully useful information if any Slack folks end
up reading this, or anyone else with a similar infrastructure. I'll start with
some tech and make a separate philosophical comment.

Also caveat: I have no deep view into Slack's infrastructure so anything I say
here may not even be relevant. YMMV.

First some self promotion: https://github.com/memcached/memcached/wiki/Proxy
memcached itself is shipping router/proxy software. Mcrouter is difficult to
manage and unsupported. This proxy is community developed, more
flexible, likely faster, and will support more native features of memcached.
We're currently in a stabilization round ensuring it won't eat pets but all of
the basic features have been in for a while. Documentation and example
libraries are still needed, but community feedback (or any kind of
question/help request) helps speed those up tremendously.

It's not clear to me why memcached is being managed like this; mcrouter seems
to only be used to abstract the configuration from the clients. Mcrouter has a
lot of features for redundant pools and so on. Especially with what sounds
like globally immutable data and the threat of cascading failures during
rolling upgrades, those features sound like they would be very helpful here.

If cost or pool sizes are the main reasons why the structure is flat, using
Extstore (https://github.com/memcached/memcached/wiki/Extstore) can likely
help. Even if object value sizes are in the realm of 500 bytes, using flash
storage can still greatly reduce the amount of RAM necessary or reduce the
pool size (provided the network can still keep up) with nearly identical
performance. Extstore makes a lot of tradeoffs (e.g., keeping keys in RAM) to
ensure most operations don't actually write to flash or double-read.
Extstore's in use in tons of places and everyone's immediately addicted.

Finally, the Meta Protocol
(https://github.com/memcached/memcached/wiki/MetaCommands) can help with
stampeding herds, keeping DB load from exploding without adding excess
network roundtrips under normal conditions. I've seen lots of workarounds
people build, but this protocol extension gives a lot of flexibility you can
use to help survive degraded states: anti-stampeding herd, serve-stale, better
counter semantics, and so on. |
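
To make the anti-stampede idea a bit more concrete, here is a rough Python sketch of how a client might build a meta get request and decide what to do with the reply. The mg flags (v, t, c, N, R) and the W/X/Z return flags are from the meta protocol; the helper names (build_mg, parse_header, next_step), the TTL values, and the policy of treating a zero-length value as "someone else is still filling this" are my own assumptions for illustration, so check protocol.txt before relying on any of it.

```python
from dataclasses import dataclass, field


@dataclass
class MetaHeader:
    status: str                      # "VA", "HD", "EN", ...
    size: int = 0                    # value length, only present for "VA"
    flags: dict = field(default_factory=dict)


def build_mg(key: str, vivify_ttl: int = 30, recache_below: int = 30) -> bytes:
    """Build a meta get request (hypothetical helper, TTLs are placeholders).

    v: return the value, t: return remaining TTL, c: return the CAS value
    N<ttl>: on a miss, create a placeholder so only one client "wins"
    R<ttl>: once remaining TTL drops below this, hand out one recache win
    """
    return f"mg {key} v t c N{vivify_ttl} R{recache_below}\r\n".encode()


def parse_header(line: bytes) -> MetaHeader:
    """Parse only the response header line; the value bytes follow separately."""
    parts = line.decode().strip().split()
    status, rest = parts[0], parts[1:]
    size = 0
    if status == "VA":
        size = int(rest[0])
        rest = rest[1:]
    # Each return flag is one letter plus an optional token, e.g. "t30" or "W".
    flags = {tok[0]: tok[1:] for tok in rest}
    return MetaHeader(status, size, flags)


def next_step(h: MetaHeader) -> str:
    """Illustrative policy, not the only reasonable one."""
    if h.status == "EN":
        return "fetch"    # plain miss: go to the DB
    if "W" in h.flags:
        return "recache"  # this client won: refresh the item from the DB
    if "Z" in h.flags and h.size == 0:
        return "wait"     # another client is filling it: retry shortly
    return "serve"        # fresh, or stale (X) while someone else recaches
```

With this policy only the client that sees W goes to the database. Everyone else either serves what is already there (possibly stale) or waits briefly and retries, and all of that is decided from the one mg round trip.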