| HN Mirror


by singron 2369 days ago

Keeping GC off for a long-running service might become problematic. Also, the steady state might have few allocations, but startup may produce a lot of garbage that you might want collected. I've never done this, but in Go you can also turn GC off at runtime with debug.SetGCPercent(-1).

I think with that, you could turn off GC after startup, then turn it back on at desired intervals (e.g. once an hour or after X cache misses).
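A minimal sketch of that idea in Go, assuming the standard `runtime` and `runtime/debug` packages. The helper names `disableGC`, `restoreGC`, and `collectEvery` are hypothetical, and the one-hour interval is just the example figure from the comment, not a recommendation.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
	"time"
)

// disableGC switches off automatic collection and returns the previous
// GC percentage so the caller can restore it later.
func disableGC() int {
	return debug.SetGCPercent(-1)
}

// restoreGC sets the GC percentage back to a saved value and returns
// the value that was in effect before the call.
func restoreGC(percent int) int {
	return debug.SetGCPercent(percent)
}

// collectEvery forces a full collection once per interval until stop is
// closed. runtime.GC runs a collection even while the percentage is -1.
func collectEvery(interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			runtime.GC()
		case <-stop:
			return
		}
	}
}

func main() {
	// Startup work that produces garbage would run here.
	runtime.GC() // collect startup garbage once before switching the collector off

	prev := disableGC()
	defer restoreGC(prev)

	stop := make(chan struct{})
	go collectEvery(time.Hour, stop) // hypothetical interval, "once an hour"
	defer close(stop)

	fmt.Println("automatic GC disabled; previous percent was", prev)
}
```

Between forced collections nothing bounds heap growth, so a burst of garbage can still OOM the process. On Go 1.19 and later, `debug.SetMemoryLimit` may be worth considering as a backstop, since the runtime still collects when the heap approaches that limit even with the percentage set to -1.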

It's definitely risky though. E.g. if there is a hiccup with the database backend, the client library might suddenly produce more garbage than normal, and all instances might OOM near the same time. When they all restart with cold caches, they might hammer the database again and cause the issue to repeat.


ignoramous 2369 days ago

> ...all instances might OOM near the same time.

CloudFront, for this reason, allocates heterogeneous fleets in its PoPs, with different RAM sizes and CPUs [0], and even different software versions [1].

> When they all restart with cold caches, they might hammer the database again and cause the issue to repeat.

Reminds me of the DynamoDB outage of 2015 that essentially took out us-east-1 [2]. Also, ELB had a similar outage due to an unending backlog of work [3].

Someone must write a book on design patterns for distributed system outages or something?

[0] https://youtube.com/watch?v=pq6_Bd24Jsw&t=50m40s

[1] https://youtube.com/watch?v=n8qQGLJeUYA&t=39m0s

[2] https://aws.amazon.com/message/5467D2/

[3] https://aws.amazon.com/message/67457/


singron 2368 days ago

Google's SRE book covers some of this (if you aren't cheekily referring to that). E.g. chapters 21 and 22 are "Handling Overload" and "Addressing Cascading Failures". The SRE book also covers mitigation by operators (e.g. manually setting traffic to 0 at the load balancer and ramping back up, or manually increasing capacity), but it also talks about engineering the service in the first place.

This is definitely a familiar problem if you rely on caches for throughput (I think caches are most often introduced for latency, but eventually the service is rescaled to traffic and unintentionally needs the cache for throughput). You can e.g. pre-warm caches before accepting requests or load-shed. Load-shedding is really good and more general than pre-warming, so it's probably a great idea to deploy throughout the service anyway. You can also load-shed on the client, so servers don't even have to accept, shed, then close a bunch of connections.
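Server-side load-shedding can be sketched in Go roughly like this, assuming a plain `net/http` service. The `shedder` type, its method names, and the choice of a 503 with `Retry-After` are illustrative assumptions, not something taken from the SRE book.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// shedder admits at most cap(slots) concurrent requests and rejects the rest.
type shedder struct {
	slots chan struct{}
}

func newShedder(maxInFlight int) *shedder {
	return &shedder{slots: make(chan struct{}, maxInFlight)}
}

// tryAcquire reports whether a slot was free. It never blocks.
func (s *shedder) tryAcquire() bool {
	select {
	case s.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees a slot taken by a successful tryAcquire.
func (s *shedder) release() { <-s.slots }

// wrap returns a handler that answers 503 right away when no slot is free,
// so excess requests fail fast instead of slowing every request down.
func (s *shedder) wrap(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !s.tryAcquire() {
			w.Header().Set("Retry-After", "1")
			http.Error(w, "overloaded", http.StatusServiceUnavailable)
			return
		}
		defer s.release()
		next.ServeHTTP(w, r)
	})
}

func main() {
	s := newShedder(1) // a limit of 1 only to make the demo easy to follow
	h := s.wrap(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	}))

	s.tryAcquire() // simulate one request already in flight

	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest("GET", "/", nil))
	fmt.Println(rec.Code) // 503 while the only slot is taken
}
```

A real limit would come from measuring how many concurrent requests an instance can serve with a cold cache. The same non-blocking check could live in the client to get the client-side shedding mentioned above.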

The more general pattern behind load-shedding is to make sure you handle a subset of the requests well instead of degrading all requests equally. E.g. processing incoming requests FIFO means that as queue sizes grow, all requests become slower. Using LIFO will allow some requests to be just as fast, and the rest will time out.
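To make the FIFO vs LIFO point concrete, here is a small Go sketch of a bounded LIFO request queue. The `lifoQueue` type is hypothetical, and evicting the oldest entry when the queue is full is an added assumption that stands in for those requests timing out.

```go
package main

import "fmt"

// lifoQueue holds pending request IDs, oldest first, up to a fixed limit.
type lifoQueue struct {
	items []int
	limit int
}

func newLIFO(limit int) *lifoQueue {
	return &lifoQueue{limit: limit}
}

// push queues id. If the queue is full it evicts the oldest waiting request
// and returns it with ok set to true, so the caller can fail it fast.
func (q *lifoQueue) push(id int) (dropped int, ok bool) {
	if len(q.items) == q.limit {
		dropped, ok = q.items[0], true
		q.items = q.items[1:]
	}
	q.items = append(q.items, id)
	return dropped, ok
}

// pop returns the most recently queued request, which has waited the least.
func (q *lifoQueue) pop() (int, bool) {
	n := len(q.items)
	if n == 0 {
		return 0, false
	}
	id := q.items[n-1]
	q.items = q.items[:n-1]
	return id, true
}

func main() {
	q := newLIFO(2)
	for id := 1; id <= 3; id++ {
		if dropped, ok := q.push(id); ok {
			fmt.Println("dropped oldest request:", dropped)
		}
	}
	for {
		id, ok := q.pop()
		if !ok {
			break
		}
		fmt.Println("serving request:", id)
	}
}
```

Under overload the newest requests are served with almost no queueing delay, while the oldest ones, whose callers have most likely given up already, are the ones that get dropped.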


ignoramous 2365 days ago

Your comment reminds me of this excellent ACM article by Facebook on the topic: https://queue.acm.org/detail.cfm?id=2839461

I've read the first SRE book, but having worked on large-scale systems, I find it impossible to relate to the book or internalise the advice and process outlined in it unless you've been burned by scale.

I must note that there are two Google SRE books in circulation now: https://landing.google.com/sre/books/
