by jordanthoms 2134 days ago
There was an outage on August 19th, 2019 - almost 1 year ago to the day. As I posted at the time: "Google often has an outage or two around this time of the year when all the US schools come back and millions of students log in at the same time."

My pet theory wasn't too popular but I'm going to stick with it :)

1- https://news.ycombinator.com/item?id=20740997

8 comments

I work on the educational part of the product my company develops, and I can attest that the first day of school is a stressful one, with login attempts, assignment lookups and other setup activities for the school period.

I wouldn't doubt that Google Classroom and other systems that use Google's SSO will be under strain from millions of students.

Google does a once per year disaster recovery training... They do things like deliberately turn off datacenters with no warning. Sometimes failover systems don't work as intended.

Was that this week?

It was not this week, sorry!
Every year around the same time people have to work on Perf (the internal performance review). Maybe people were more focused on that than on keeping the systems up... or maybe they needed to push the latest update so it would be included in their perf...
I like this theory too - but is performance review this week?
Yes.
See you in August 2021, good sir!
US schools don't all start on the same day though. It's pretty staggered, with some starting in early to mid August, and most in the Northeast starting right after Labor Day.
It's still probably a normalish distribution
Right, which I would expect Google or any half-decent service to be able to withstand easily. It's not a sudden spike that reaches several orders of magnitude above the average weekly peak in under a few minutes; this is a fairly gentle upward slope.

And if this happened last year too, you would think it would be at the top of the list of things to watch for and add capacity for the following year. Amazon and Walmart start planning and drilling now for their holiday season.

That's an interesting theory because the timing does correlate.

A lot of people would immediately dismiss it because Google has the resources to scale up. But having resources doesn't guarantee someone actually turns the knob that increases the number of instances. (Whether automatic or manual, the adjustment could be too slow to match an unanticipated spike in demand.)

But there's another reason I don't think that's the explanation. Gmail has 1.5 billion active users[1]. Millions of students logging in at the same time sounds like a lot, but if Gmail has 100 million more active users today than yesterday, that's not even a 10% increase!

---

[1] Source: https://en.wikipedia.org/wiki/Gmail
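
As a rough check of the arithmetic above, here is a minimal Python sketch. It uses only the two figures from the comment (1.5 billion active users and a hypothetical 100 million extra users in one day); the helper name `relative_increase` is made up for illustration.

```python
def relative_increase(baseline_users: int, extra_users: int) -> float:
    """Return the extra users as a percentage of the baseline."""
    return 100.0 * extra_users / baseline_users


# Figures from the comment: 1.5 billion active users, 100 million extra.
pct = relative_increase(1_500_000_000, 100_000_000)
print(f"{pct:.1f}% increase")  # roughly 6.7%, so under 10%
```

This only compares headcounts. It says nothing about how much load each returning user generates, which is a separate question.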

I don't think it's the load on Gmail that's an issue. I'd point more to Google Drive, Docs and the underlying shared storage infrastructure. Also keep in mind most of those 1.5 billion users won't be very active - a few million users that have no usage at all for a few months and then all come back to being extremely active within a few days can be pretty disruptive!

IMO it's not really about having the resources to scale, but about the unpredictable emergent behaviours which can happen when the load profile suddenly changes.

Actually a legit observation. Sorry to say but the naysayers on here are super dumb and missing the point with their comparison to other Google services or claiming lack of evidence.
Millions of people are searching simultaneously at google.com or youtube.com but servers are not crashing. The issue is not traffic overload but something else.
These are neither the same products nor the same infrastructure.
But I'm sure similar infrastructure architecture was applied to gmail.com as it was to google.com and youtube.com.

And the sysadmins follow similar maintenance practices.

Hah...

Press and hold the F5 key on your keyboard for 2 minutes while on gmail.com. You will get a "service unavailable" error. About 500 other people whose data happens to be cohosted with you will also get the same error, and all of you will be unable to send or receive email, even by IMAP, for about 10 mins while your particular corner of the data store is restarted and the data integrity checked.

That doesn't happen on Google.com

Ok, I definitely want to know how you discovered that... (and found one of those 500 people to verify?)
Not sure if this is still the case, but if you did this a couple of times, your account data would be permanently migrated to an instance with more CPU and RAM allocated - you'd also be in with all the other badly behaved accounts, so reliability goes down lots. The benefit was much quicker complex searches, and being able to bulk label or delete emails without it taking minutes or hours.

Don't believe me about how slow it is on a regular instance? Try going to "All mail", selecting all of your emails, and applying a label to them all. In my experience, it can only label about 50 mails per second, so it can take hours to do them all. It will keep going if you quit the browser, but will stop if the Gmail devs do a software update, which they seem to do usually on Tuesdays, but never on Fridays or weekends.
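
To put the 50 mails per second figure in perspective, here is a back-of-the-envelope sketch in Python. The rate is the one quoted above; the mailbox sizes are hypothetical examples rather than measurements, and `hours_to_label` is just an illustrative helper.

```python
def hours_to_label(num_mails: int, mails_per_second: float = 50.0) -> float:
    """Estimate the hours needed to label num_mails at a fixed rate."""
    return num_mails / mails_per_second / 3600


# Hypothetical mailbox sizes, chosen only for illustration.
for size in (100_000, 500_000, 1_000_000):
    print(f"{size:>9,} mails: {hours_to_label(size):.1f} hours")
```

At that rate a mailbox needs about 180,000 mails before labeling takes a full hour, so "hours" assumes a fairly large mailbox.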

Interesting. I held F5 down for less than a minute and I got an "Unusual usage - account temporarily locked down" message. It disappeared after a few seconds' pause though...
Drive and Gmail are not the same thing as search. The bottlenecks are different, the architecture and problem spaces aren't the same either.
Primarily cache hits with no state.
This. And to expand on it:

For Gmail, you have to log in. It has to know it is you. It has to know that everything it is serving you is only for you. It has to retain copies of documents that -- as far as it knows -- are unique to you or whomever you share with. It has to do all this while keeping that information safe from other people who might want to take a look at it.

And that's just the items I can think of in real time while typing.