Hacker News
by pwf 3477 days ago
50k seems like a low bar to start losing messages at. If this was done with Celery and a decently sized RabbitMQ box, I would expect it to get into the millions before problems started happening.
4 comments

These machines do more than just push. They also buffer messages for each individual user to "potentially" push if they don't read them on the desktop client. This happens before the flow this article talks about.

We currently have 3 machines doing this for millions of concurrent users. At the time this article was written, it was 2 machines.

What size machines are these? I'm shocked that this volume is the most you can handle with Erlang, unless you're using a smaller T-series AWS instance for this.
These are n1-standard-8 on GCE.

Each one easily gets over 30,000 requests a second to update queues for new messages. They are also subscribed to presence events from our presence system for millions of people. It is a very busy service that ensures we only deliver messages to people who are not at their computer.

So if they are delivering 30k a second per box and the max "backlog" you allow to build is ~50k, then you cap your backlog at under 2sec worth of delivery? Or am I missing something?
At some point, when a system has been in a failure mode for a while, it makes sense to start shedding load rather than attempting to deliver every single push notification. Also worth mentioning: a minute of downtime is already a million backed-up pushes. Beyond that, it becomes infeasible to attempt to deliver them.

Edit: Also worth mentioning, the 50k buffer is for a single server, we run multiple push servers in the cluster.
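
For readers who want to see the load-shedding idea concretely, here is a minimal sketch of a bounded buffer that rejects new items once it is full. It is not the implementation from the article: the `SheddingBuffer` class, its method names, and the choice to drop the newest item (rather than the oldest) are all assumptions made for illustration. Only the 50k default comes from the thread.

```python
from collections import deque


class SheddingBuffer:
    """Hypothetical bounded FIFO buffer that sheds load when full."""

    def __init__(self, max_size=50_000):
        self.max_size = max_size  # per-server cap; 50k is the figure quoted above
        self.items = deque()
        self.dropped = 0          # count of notifications shed so far

    def offer(self, item):
        """Enqueue item. Return False and count a drop if the buffer is full."""
        if len(self.items) >= self.max_size:
            self.dropped += 1
            return False
        self.items.append(item)
        return True

    def take(self, n):
        """Remove and return up to n items in arrival order."""
        batch = []
        while self.items and len(batch) < n:
            batch.append(self.items.popleft())
        return batch


# Small demonstration with a tiny cap so the shedding is visible.
buf = SheddingBuffer(max_size=3)
accepted = [buf.offer(i) for i in range(5)]
print(accepted)      # [True, True, True, False, False]
print(buf.dropped)   # 2
print(buf.take(2))   # [0, 1]
```

Dropping at the point of enqueue keeps memory bounded and keeps whatever is already queued reasonably fresh. Whether to drop the newest or the oldest item is a separate design choice that the thread does not settle.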

At 15k notifications per minute, a million notifications would take 1hr to clear before the queue returns to normal. I would imagine they prefer to shed load early so notifications don't get delayed, hence the small buffer.
The issue was not the ability of their servers to handle the load, but the ability of Firebase to ingest the notifications - at least, that's how I read it.
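
The back-of-envelope figures quoted in this thread can be checked with a few lines of arithmetic. This is only a sketch that takes the numbers at face value (15k per minute, 30k per second, 50k buffer, one million backlog). The function names are made up for illustration, and it assumes a constant rate with no new arrivals while the backlog drains.

```python
def drain_minutes(backlog, rate_per_minute):
    """Minutes needed to clear a backlog at a constant delivery rate."""
    return backlog / rate_per_minute


def backlog_seconds(buffer_size, rate_per_second):
    """Seconds of traffic that a buffer of the given size represents."""
    return buffer_size / rate_per_second


# One million queued notifications at 15k per minute: a little over an hour.
print(round(drain_minutes(1_000_000, 15_000), 1))  # 66.7

# A 50k buffer compared with 30k requests per second: under two seconds.
print(round(backlog_seconds(50_000, 30_000), 2))   # 1.67
```

Note that the 30k per second figure describes queue-update requests on the buffering machines, so comparing it with the 50k push buffer follows the question asked above and is not a confirmed property of the system.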