Improving distributed caching performance and efficiency at Pinterest

	Improving distributed caching performance and efficiency at Pinterest (medium.com)
	42 points by jparise 1449 days ago

5 comments

myrion 1448 days ago

I wish Pinterest were less effective at making Google image search useless, not more efficient...

xvello 1448 days ago

Just use the two following uBlock Origin / Adguard rules, the first one for text results, the second one for images:

    google.*##.g:has(a[href*=".pinterest."])
    google.*##a[href*=".pinterest."]:upward(1)

You can build custom rules for other websites and other search engines at https://letsblock.it/filters/search-results

xdfgh1112 1448 days ago

You can also use the uBlacklist extension which supports many websites and can block arbitrary domains.

anothernewdude 1448 days ago

Between Pinterest and now Britannica filling up normal Google search results, I've started using Bing unironically.

ignoramous 1448 days ago

> Today, Pinterest's memcached fleet spans over 5000 EC2 instances across a variety of instance types optimized along compute, memory, and storage dimensions. Collectively, the fleet serves up to ~180 million requests per second and ~220 GB/s of network throughput over a ~460 TB active in-memory and on-disk dataset, partitioned among ~70 distinct clusters.

Wow.

Assuming $0.09 per GB egress on EC2, that's $51,321,600/mo. Of course, they must be on some Enterprise plan of some sort, but how much discount must they get to make it affordable?

By comparison, 180m requests per second on an "egress-free" serverless compute like Workers would cost $77,760,000/mo (assuming 6m per $1) or $233,280,000/mo (2m per $1).
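The arithmetic behind these monthly figures can be reproduced with a short sketch. It assumes a 30-day month and that the quoted peak rates ("up to" ~220 GB/s and ~180M requests/s) are sustained for the whole month, so the results are upper bounds. The function names are made up for illustration, and the prices are the ones assumed in the comment, not actual quotes.

```python
# Back-of-the-envelope reproduction of the figures above.
# Assumptions: 30-day month, peak rates sustained all month (an upper bound).
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2,592,000 seconds


def monthly_egress_cost(gb_per_second, usd_per_gb):
    """Cost of sustaining `gb_per_second` of billed transfer for a month."""
    return gb_per_second * SECONDS_PER_MONTH * usd_per_gb


def monthly_request_cost(requests_per_second, requests_per_usd):
    """Cost of sustaining `requests_per_second` when $1 buys `requests_per_usd`."""
    return requests_per_second * SECONDS_PER_MONTH / requests_per_usd


egress = monthly_egress_cost(220, 0.09)
workers_cheap = monthly_request_cost(180_000_000, 6_000_000)
workers_pricey = monthly_request_cost(180_000_000, 2_000_000)

print(f"egress at $0.09/GB:    ${egress:,.0f}/mo")
print(f"requests at 6m per $1: ${workers_cheap:,.0f}/mo")
print(f"requests at 2m per $1: ${workers_pricey:,.0f}/mo")
```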

Cloud is wild.

LINKIWI 1447 days ago

Hi, I'm the original author of this article, though I have left Pinterest since this article was published.

Many customers with large AWS footprints, Pinterest included, have enterprise plans with highly custom pricing. It is often the case that general public pricing isn't directly comparable to enterprise pricing on an individual component level.

On the topic of network transfer, many of our highest network bandwidth memcached clusters are replicated with an egress routing policy that exercises an availability zone affinity [0]. For the most efficient clusters, this means that 99.9+% of network bandwidth remains in the client-colocated AZ (within the same region and VPC), which is free [1].

[0] https://pin.it/scaling-cache-infrastructure

[1] https://aws.amazon.com/blogs/architecture/overview-of-data-t...
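As a rough illustration of what an availability zone affinity policy means, the sketch below picks a replica in the client's own AZ when one exists and falls back to any replica otherwise. This is not Pinterest's routing configuration; the `Replica` type, `pick_replica` function, and host names are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    host: str  # hypothetical host name
    az: str    # availability zone the replica runs in


def pick_replica(replicas, client_az, rng=random):
    """Prefer a replica in the client's AZ; otherwise use any replica."""
    if not replicas:
        raise ValueError("no replicas configured")
    local = [r for r in replicas if r.az == client_az]
    # Same-AZ traffic is the cheap path; cross-AZ is only a fallback.
    return rng.choice(local or list(replicas))


replicas = [
    Replica("cache-a.example.internal", "us-east-1a"),
    Replica("cache-b.example.internal", "us-east-1b"),
]

same_az = pick_replica(replicas, "us-east-1a")   # stays in us-east-1a
fallback = pick_replica(replicas, "us-east-1c")  # no local replica, crosses AZs
print(same_az.host, fallback.host)
```

With one replica per AZ and clients that always find a local replica, nearly all traffic stays on the same-AZ path, which is the effect the comment describes.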

jonatron 1448 days ago

I wouldn't expect memcached servers to have any egress. Still I'd bet they're paying a lot for whatever their egress actually is.

rasz 1447 days ago

It doesn't really matter how much they pay for cloud as long as it's less than what they get from Google for the image spam/pollution.

infocollector 1448 days ago

I do hope Pinterest would fund memcached development going forward, especially the language clients for memcached. I've been using pylibmc to access memcached fast, and that project seems to be almost dead (https://github.com/lericson/pylibmc/issues).

dormando 1448 days ago

The client ecosystem is definitely a sore point now. I've just sort of started working on a replacement for libmemcached to hopefully cut down on the complexity... but then that's a migration and nobody wants to do that.

Pinterest should drop me a line if they're interested in sponsoring work though :)

infocollector 1448 days ago

Thanks for memcached! I am surprised that pylibmc even works at this point; its last released version was in 2019. I do hope Pinterest sponsors.

mandeepj 1448 days ago

> the fleet serves up to ~180 million requests per second

For comparison, Google serves about 63k queries per second. I hope there's not a typo in the above line in Pinterest's blog.

LINKIWI 1447 days ago

Hi, I'm the original author of this article, though I have since left Pinterest.

180M/s is the peak throughput I've observed from the entire fleet, and is an accurate figure.

It's worth noting that:

(1) When it comes to caching workloads, there is often a wide spread of request amplification factor from a single inbound user request to the site. For example, a search query on Pinterest might internally fan out to an order of magnitude more requests to memcached across several different services along the request path used for servicing that query.

(2) There are many systems outside of the online critical path that use caching, which add load on the system independent of the rate at which users are posting or viewing content on the Pinterest site.
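The two points above can be combined into a simple model: cache load is the user request rate multiplied by an amplification factor, plus whatever offline systems add. The sketch below only illustrates that relationship; every number in it is invented for the example and none is a Pinterest figure.

```python
def cache_request_rate(user_rps, fanout_per_request, offline_rps=0):
    """Estimate cache requests/s from user traffic, fan-out, and offline load."""
    return user_rps * fanout_per_request + offline_rps


# Hypothetical inputs, chosen only to show the shape of the calculation.
user_rps = 1_000_000        # inbound user requests per second
fanout_per_request = 50     # cache lookups triggered per user request
offline_rps = 10_000_000    # load from systems outside the critical path

total = cache_request_rate(user_rps, fanout_per_request, offline_rps)
print(f"{total:,} cache requests/s")
```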

As another reference point, Facebook, who developed mcrouter, shared that their memcached deployment serves on the order of billions of requests per second [0]. And this figure is from 2014; I imagine it's grown a lot since then.

[0] https://www.usenix.org/system/files/conference/nsdi13/nsdi13...

divyekapoor 1448 days ago

Google serves up way way more than 63k queries per second (several million requests/s is common for the critical services around search). Source: I worked there.

However, your main point is valid (180 million/s seems way too high). I've started a thread internally to double check these numbers. Please wait for an update.

ddorian43 1448 days ago

It's 180M memcached requests per second. A single page view probably triggers tens of cache requests, just like a Google search may hit 1K+ servers.

mandeepj 1447 days ago

> Google serves up way way more than 63k queries per second (several million requests/s is common for the critical services around search). Source: I worked there.

All right. But here's my source: https://www.google.com/search?q=how+many+requests+does+googl... . Maybe it's flawed.

mutreta 1448 days ago

Their numbers seem reasonable.

As an anecdote, I've worked at a scale-up that handled ~20M HTTP(S) req/min on its service mesh.

The rule of thumb was that our Redis cache layer would see 10x the number of HTTP requests, so 200M req/min.

Note that my units are per minute, not per second. But it wouldn't surprise me that Pinterest handles 60x the load that we had back then.
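To make the unit comparison concrete, here is a small sketch that puts both figures on the same per-minute basis. It uses only the numbers quoted in this thread; the variable names are illustrative.

```python
# Figures quoted in the thread.
redis_per_minute = 200_000_000       # anecdotal Redis layer, requests/min
pinterest_per_second = 180_000_000   # Pinterest memcached fleet peak, requests/s

# Convert Pinterest's per-second figure to per-minute before comparing.
pinterest_per_minute = pinterest_per_second * 60
ratio = pinterest_per_minute / redis_per_minute

print(f"Pinterest: {pinterest_per_minute:,} req/min")
print(f"Ratio: {ratio:.0f}x")
```

The exact ratio from these numbers is 54x, so "60x" is a fair rounding.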

ksec 1447 days ago

It is 180 million divided by 5,000 instances, or 36K requests per second per instance.

Honestly, that's not a high number from memcached's [1] perspective. It could easily handle 10x that even with SSD extstore.

[1] https://memcached.org/blog/nvm-multidisk/
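The per-instance estimate is a plain division, sketched below with the fleet numbers from the article. The 10x headroom factor is the commenter's estimate, not a measured limit, and an even spread across instances is an assumption.

```python
fleet_rps = 180_000_000   # peak fleet-wide requests per second (from the article)
instances = 5_000         # approximate instance count (from the article)

# Assumes load is spread evenly, which real clusters will not match exactly.
per_instance_rps = fleet_rps / instances

# Commenter's estimate: memcached could handle roughly 10x this rate.
estimated_capacity_rps = per_instance_rps * 10
utilization = per_instance_rps / estimated_capacity_rps

print(f"{per_instance_rps:,.0f} req/s per instance")
print(f"about {utilization:.0%} of the estimated capacity")
```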

LINKIWI 1447 days ago

Yeah, 36k per instance (especially on an xlarge or 2xlarge EC2 instance) is well within the serving capacity of memcached. While it depends a lot on the workload profile for a specific cluster, some clusters serve on the order of ~5k/instance while others are as high as ~100k/instance. We've done a lot of experimentation with extstore as well; it certainly eats up more compute cycles on average than an equivalent in-memory only cluster, but is still quite efficient.

aparsons 1448 days ago

180 million a day sounds about right

divyekapoor 1448 days ago

I've started a thread internally to double check these numbers. Please wait for an update.

hw 1447 days ago

Curious why they chose self-hosted memcached over ElastiCache for Memcached?

Nextgrid 1447 days ago

Better control over it would be my guess, as well as disaster recovery.

Control over the OS, kernel and server source code would expose values you can tune to make it perform better under their specific workload (whereas managed services tend to strike a balance between a wide range of workloads).

For disaster recovery, a previous client of mine got bitten by an RDS instance that was stuck in “modifying” state for 12+ hours (presumably until an AWS engineer manually fixed the problem). Being able to SSH into the machine as root would’ve saved us quite a bit of time (we ended up starting a new RDS and restoring from a - thankfully very recent - backup to get the service back online).
