by hotdogs 3473 days ago
"Obviously a few notifications were dropped. If a few notifications weren’t dropped, the system may never have recovered, or the Push Collector might have fallen over."

How many is a few? It looks like the buffer reaches about 50k, does a few mean literally in the single digits or 100s?


Good question. We don't have metrics on the exact number dropped. We're using an earlier version of GenStage that doesn't give any information about dropped events. Once we upgrade we'll have a better idea.
That seems too important to have zero visibility into. Just eyeing the graphs, your queue size grew at 750m/s from 17:49 to 17:50. You then started shedding at 17:50 for 40s. Assuming the ingress rate was roughly linear (which it looks like it was), you shed ~30,000 requests out of 3-4M. Does that not seem high to you?
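The estimate in this comment can be checked with a few lines of arithmetic. This is only a sketch: it assumes "750m/s" means 750 messages per second and that the shed rate during the 40 s window equals the queue growth rate seen just before it. Both are assumptions read off the comment, not measured values.

```python
# Figures taken from the comment above; the unit of "750m/s" is ASSUMED
# to be messages per second.
growth_per_second = 750
shedding_seconds = 40
total_low, total_high = 3_000_000, 4_000_000  # "3-4M" requests overall

# Assumption: while shedding, everything above drain capacity is dropped,
# i.e. drops happen at the same rate the queue was growing.
shed = growth_per_second * shedding_seconds


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


print(shed)  # 30000
print(f"{percent(shed, total_high):.2f}% to {percent(shed, total_low):.2f}%")
```

Under these assumptions the shed fraction is roughly 0.75% to 1% of the traffic in that period.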

This system seems great for at-most-once delivery. I wish I had more problems to solve with that constraint.

Yep, such is life with new tech. GenStage was 0.3.0 when we started using it. We'll be able to get this visibility once we update to a newer version (hasn't been prioritized).

FWIW, the buffer only fills up to the peak about once a month. Load shedding is the last-ditch effort in catastrophic situations so that everything doesn't fall apart.

I don't find load shedding acceptable; perhaps if there were a criterion for dropping based on priority it would make more sense.
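One way to picture priority-based dropping is a bounded buffer that evicts its lowest-priority entry when a more important one arrives. The sketch below is illustrative only: `PriorityShedBuffer` and its priority numbers are hypothetical and are not part of GenStage or the system described in the article.

```python
import heapq
import itertools


class PriorityShedBuffer:
    """Bounded buffer that sheds the lowest-priority item when full.

    Hypothetical illustration; higher numbers mean more important.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []                 # min-heap: lowest priority at index 0
        self._seq = itertools.count()   # tie-breaker so items are never compared
        self.dropped = 0

    def push(self, priority, item):
        """Try to buffer item; return True if it was kept."""
        entry = (priority, next(self._seq), item)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
            return True
        if priority > self._heap[0][0]:
            # Evict the current lowest-priority entry to make room.
            heapq.heapreplace(self._heap, entry)
            self.dropped += 1
            return True
        # The new item is the least important one: shed it instead.
        self.dropped += 1
        return False

    def items(self):
        """Buffered items, lowest priority first."""
        return [item for _, _, item in sorted(self._heap)]


buf = PriorityShedBuffer(capacity=2)
buf.push(1, "marketing blast")
buf.push(5, "direct message")
buf.push(3, "mention")          # evicts "marketing blast"
print(buf.items(), buf.dropped)
```

Counting drops in the buffer itself also gives the visibility asked for earlier in the thread.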
I think it depends on your problem space.
Curious why you don't have metrics on the other end (whatever is sending to the "Push Collector")?

What if the Push Collector is down or has a random bug where it throws away XX% of requests for no good reason? How would you know if you don't instrument the other end? Something like StatsD works fantastic for this, but also just logging those failures and using a log search/aggregation tool like Kibana or Splunk would be a step in the right direction.
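A minimal sketch of what instrumenting both ends could look like, assuming a StatsD daemon listening on its conventional UDP port 8125. The metric names `push.sent` and `push.received` are made up for illustration; only the `<name>:<value>|c` counter line format is standard StatsD.

```python
import socket


def statsd_counter(name, value=1):
    """Format a counter increment in the StatsD line protocol."""
    return f"{name}:{value}|c".encode("ascii")


def send_counter(name, value=1, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; host and port are assumptions."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(statsd_counter(name, value), (host, port))


def loss_percent(sent, received):
    """Share of requests the sender counted that the receiver never saw."""
    if sent == 0:
        return 0.0
    return 100.0 * (sent - received) / sent


# The sender would call send_counter("push.sent") per request and the
# collector send_counter("push.received"); comparing the two totals
# exposes silent drops.
print(statsd_counter("push.sent"))   # b'push.sent:1|c'
print(loss_percent(1000, 990))       # 1.0
```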

Good point, we actually do. We could do the math.
There's another important question: How will the clients deal with the fact that they did not get a notification delivered? Will that mean they probably never receive a chat message? That could in some cases be catastrophic for the user. Or would it only mean that they may not get something instantly, which would not be too bad if the client also polls the server or catches up on notifications on reconnect?
When the push notifications hit FCM, Firebase does not guarantee delivery of those messages to clients (usually iOS or Android devices). There are quite a few reasons that FCM/APNS might fail to deliver a message, so applications almost never have functionality depend on them.

As you say, you might not get the notification pushed to the device, but you should still see the message if you open the messaging app as normal.

This is indeed the case. Our real-time system is outside of Firebase and APNS, and it handles the actual real-time updates of chat state once the app is launched. We also have a delivery system that accounts for network cuts/switches and the like.
Sounds like delayed delivery of messages?

If the buffers are filling faster than the servers can clear them, then you're headed for "catastrophic" failure anyway and notifications are going to get dropped regardless. You can handle it more or less gracefully while you spin up more capacity.

The other choice is to always keep around more spare capacity, but that can get expensive if, as they describe, these drastic peaks happen once a month.

I like to differentiate between 2 categories of "catastrophic": The one you mention is that the service can't keep up with the demand and that it has to take some actions to stay alive and not crash. That's an important thing which should be incorporated into the design.

The other category, which is the one I meant, is that in such situations the system should not run into inconsistent/weird behavior which is catastrophic from the end user's point of view (not the system's). E.g. in a chat application, user A sends a message to B and gets an acknowledgement on his client that the message has been delivered. However, if the server under high load simply drops the message before forwarding it to B, B might never get it. A sees that something was delivered while in reality it was not. If A depends on that information, it surely might be catastrophic to him.

All in all, you should have a complete system design that works deterministically for the users even under high pressure. E.g. user A only gets an ACK that the message was sent after it was somehow persisted on the server. And if the server can't deliver the message to the other client because it was dropped, it is still marked somewhere for retry later on. Or it will get fetched at a later time by the client through some poll operation.
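The persist-then-ACK idea can be sketched in a few lines. Everything here is hypothetical: `ChatServer`, its in-memory dict standing in for durable storage, and the `push` callback are illustration only, not the design of the system discussed in the article.

```python
class ChatServer:
    """Toy server: a message is stored before the sender is ACKed,
    and the push notification is only a best-effort hint."""

    def __init__(self):
        self._log = {}      # recipient -> list of (msg_id, text); stands in for a database
        self._next_id = 0

    def send(self, recipient, text, push):
        # 1. Persist first, so the message survives a dropped notification.
        msg_id = self._next_id
        self._next_id += 1
        self._log.setdefault(recipient, []).append((msg_id, text))
        # 2. Best-effort push; under load this may be shed or fail.
        try:
            push(recipient, text)
        except Exception:
            pass  # the stored copy remains available via poll()
        # 3. Only now does the sender get its ACK (the message id).
        return msg_id

    def poll(self, recipient, after_id=-1):
        """Catch-up path: everything stored for recipient after after_id."""
        return [(i, t) for i, t in self._log.get(recipient, []) if i > after_id]


def overloaded_push(recipient, text):
    raise RuntimeError("notification shed under load")


server = ChatServer()
ack = server.send("B", "hello", overloaded_push)
print(ack, server.poll("B"))
```

Even though the push was dropped, B still receives the message on the next poll, so A's ACK stays truthful.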

Can you explain why it's necessary that some notifications were dropped?
2 reasons.

- To avoid OOMing the Erlang VM.
- If the notification queue is backed up then older notifications are not worth delivering if we can speed up delivering of more recent ones.

Except that's not what's happening here if I'm reading right. They are throwing away the newer ones and not the older ones because the ones rejected by the Push Collector are the most recent ones.
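The difference between the two policies being argued about here can be shown with a bounded buffer. This is a generic sketch, not the article's GenStage configuration; both function names are made up, and the code assumes a capacity of at least 1.

```python
from collections import deque


def keep_oldest(events, capacity):
    """Reject new arrivals once the buffer is full (newest events are lost)."""
    buf, dropped = deque(), []
    for event in events:
        if len(buf) < capacity:
            buf.append(event)
        else:
            dropped.append(event)
    return list(buf), dropped


def keep_newest(events, capacity):
    """Evict the oldest buffered event to admit each new one."""
    buf, dropped = deque(maxlen=capacity), []
    for event in events:
        if len(buf) == capacity:
            dropped.append(buf[0])  # deque(maxlen=...) discards this on append
        buf.append(event)
    return list(buf), dropped


arrivals = [1, 2, 3, 4, 5]  # in arrival order, with nothing draining the buffer
print(keep_oldest(arrivals, 3))  # ([1, 2, 3], [4, 5])
print(keep_newest(arrivals, 3))  # ([3, 4, 5], [1, 2])
```

Rejecting at the collector's front door corresponds to `keep_oldest`, whereas the stated goal of favoring recent notifications corresponds to `keep_newest`.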
I was wondering the same thing. Dropping an unknown number of requests isn't all that impressive. It seems like a simpler approach would have been to use a Message Queue of some sort with pushers pulling items from the queue.