| HN Mirror



	by hotdogs 3519 days ago
	"Obviously a few notifications were dropped. If a few notifications weren’t dropped, the system may never have recovered, or the Push Collector might have fallen over." How many is a few? It looks like the buffer reaches about 50k, does a few mean literally in the single digits or 100s?

2 comments

Sikul 3519 days ago

Good question. We don't have metrics on the exact number dropped. We're using an earlier version of GenStage that doesn't give any information about dropped events. Once we upgrade we'll have a better idea.


teej 3519 days ago

To me, that seems too important to have zero visibility into. Just eyeing the graphs, your queue size grew at 750m/s from 17:49 to 17:50. You then started shedding at 17:50 for 40s. Assuming the ingress rate was roughly linear (which it looks like it was), you shed ~30,000 requests out of 3-4M. Does that not seem high to you?

This system seems great for at most once delivery. I wish I had more problems to solve with that constraint.
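The estimate above can be checked with a few lines of arithmetic. This is only a sketch: it assumes "750m/s" means 750 messages per second, and that once the buffer is full the drop rate equals the rate at which the queue had been growing. The variable names are illustrative.

```python
# Assumption: "750m/s" in the comment means 750 messages per second.
growth_rate_per_s = 750                        # queue growth read off the graph
shedding_seconds = 40                          # how long shedding lasted
total_low, total_high = 3_000_000, 4_000_000   # "3-4M" requests

# Once the buffer is full, the excess that used to grow the queue is dropped.
shed = growth_rate_per_s * shedding_seconds
share_low = shed / total_high
share_high = shed / total_low

print(shed)                                    # 30000
print(f"{share_low:.2%} to {share_high:.2%}")  # 0.75% to 1.00%
```

Under those assumptions, the shed traffic comes to roughly 0.75% to 1% of the total.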


Sikul 3519 days ago

Yep, such is life with new tech. GenStage was 0.3.0 when we started using it. We'll be able to get this visibility once we update to a newer version (hasn't been prioritized).

FWIW, the buffer only fills up to the peak about once a month. Load shedding is the last ditch effort in catastrophic situations so that everything doesn't fall apart.


henrygrew 3518 days ago

I don't find load shedding acceptable. Perhaps it would make more sense if there were criteria for dropping based on priority.


Sikul 3518 days ago

I think it depends on your problem space.


jsjohnst 3519 days ago

Curious why you don't have metrics on the other end (whatever is sending to the "Push Collector")?

What if the Push Collector is down or has a random bug where it throws away XX% of requests for no good reason? How would you know if you don't instrument the other end? Something like StatsD works fantastic for this, but also just logging those failures and using a log search/aggregation tool like Kibana or Splunk would be a step in the right direction.
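A minimal sketch of the sender-side accounting suggested above, using plain in-process counters rather than a real StatsD client. `SenderStats` and its methods are hypothetical names for illustration, not part of any library or of the system under discussion.

```python
from collections import Counter


class SenderStats:
    """Hypothetical sender-side counters; not a real client library."""

    def __init__(self):
        self.counts = Counter()

    def record(self, accepted: bool) -> None:
        # Count every attempt, then bucket it by outcome.
        self.counts["sent"] += 1
        self.counts["accepted" if accepted else "rejected"] += 1

    def rejected_ratio(self) -> float:
        sent = self.counts["sent"]
        return self.counts["rejected"] / sent if sent else 0.0


stats = SenderStats()
for ok in [True, True, False, True]:
    stats.record(ok)
print(stats.rejected_ratio())  # 0.25
```

Comparing these counts with what the receiving side reports would expose drops that the receiver itself never records.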


Sikul 3519 days ago

Good point, we actually do. We could do the math.


Matthias247 3519 days ago

There's another important question: How will the clients deal with the fact that a notification was not delivered? Does that mean they will probably never receive a chat message? That could in some cases be catastrophic for the user. Or would it only mean that they may not get something instantly, which would not be too bad if the client also polled the server or tried to catch up on notifications on reconnect?


estel 3519 days ago

When the push notifications hit FCM, Firebase does not guarantee delivery of those messages to clients (usually iOS or Android devices). There are quite a few reasons that FCM/APNS might fail to deliver a message, so applications almost never make functionality depend on them.

As you say, you might not get the notification pushed to the device, but you should still see the message if you open the messaging app as normal.


jhgg 3518 days ago

This is indeed the case. Our real-time system is outside of Firebase and APNS, and it handles the actual real-time updates of chat state once the app is launched. We also have a delivery system that accounts for network cuts/switches and the like.


munchbunny 3519 days ago

Sounds like delayed delivery of messages?

If the buffers are filling faster than the servers can clear them, then you're headed for "catastrophic" failure anyway and notifications are going to get dropped regardless. You can handle it more or less gracefully while you spin up more capacity.

The other choice is to always keep more spare capacity around, but that can get expensive if, as they describe, these drastic peaks happen only once a month.


Matthias247 3518 days ago

I like to differentiate between 2 categories of "catastrophic": The one you mention is that the service can't keep up with the demand and has to take some action to stay alive and not crash. That's an important thing which should be incorporated into the design.

The other category, which is the one I meant, is that in such situations the system should not run into inconsistent/weird behavior which is catastrophic from the end user's point of view (not the system's). E.g. in a chat application, user A sends a message to B and gets an acknowledgement on his client that the message has been delivered. However, if the server under high load simply drops the message before forwarding it to B, that user might never get it. A sees that something was delivered while in reality it was not. If A depends on that information, it surely might be catastrophic to him.

All in all you should have a complete system design that even under high pressure works deterministically for the users. E.g. user A only gets an ACK that the message was sent after it was somehow persisted on the server. And if the server can't deliver the message to the other client because it was dropped it is still marked somewhere for retry later on. Or it will get fetched at a later time by the client through some poll operation.
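The design described above (ACK only after persisting, keep undelivered messages around for retry or polling) could be sketched roughly as follows. This is a toy in-memory model under stated assumptions: `ChatServer` and all of its methods are hypothetical, and a real system would use durable storage rather than a dict.

```python
from dataclasses import dataclass, field


@dataclass
class ChatServer:
    """Toy in-memory model; a real system would persist to durable storage."""
    stored: dict = field(default_factory=dict)   # message id -> text
    pending: set = field(default_factory=set)    # ids not yet delivered
    next_id: int = 0

    def send(self, text: str) -> int:
        # Persist first; returning the id plays the role of the ACK to user A.
        msg_id = self.next_id
        self.next_id += 1
        self.stored[msg_id] = text
        self.pending.add(msg_id)
        return msg_id

    def try_deliver(self, msg_id: int, push_ok: bool) -> None:
        # A dropped push leaves the message pending for retry or polling.
        if push_ok:
            self.pending.discard(msg_id)

    def poll(self) -> list:
        # Client catch-up: fetch everything still undelivered, in id order.
        msgs = [self.stored[i] for i in sorted(self.pending)]
        self.pending.clear()
        return msgs


server = ChatServer()
first = server.send("hello")
second = server.send("are you there?")
server.try_deliver(first, push_ok=True)
server.try_deliver(second, push_ok=False)   # push was shed under load
print(server.poll())  # ['are you there?']
```

The point of the sketch is that a shed push notification only delays the message; the ACK that A saw still corresponds to something the server can hand to B later.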


bcherny 3519 days ago

Can you explain why it's necessary that some notifications were dropped?


Vishnevskiy 3519 days ago

2 reasons.

- To avoid OOMing the Erlang VM.
- If the notification queue is backed up then older notifications are not worth delivering if we can speed up delivering of more recent ones.


jsjohnst 3519 days ago

Except that's not what's happening here if I'm reading right. They are throwing away the newer ones and not the older ones because the ones rejected by the Push Collector are the most recent ones.
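The difference between the two policies being discussed can be shown with a generic bounded buffer. This is a sketch of the two strategies in general, not of GenStage or the Push Collector; both function names are made up for illustration, and `capacity` is assumed to be at least 1.

```python
from collections import deque


def fill_drop_newest(events, capacity):
    """Reject incoming events once the buffer is full (keeps the oldest)."""
    buffer, dropped = deque(), []
    for event in events:
        if len(buffer) < capacity:
            buffer.append(event)
        else:
            dropped.append(event)
    return list(buffer), dropped


def fill_drop_oldest(events, capacity):
    """Evict from the front to make room (keeps the newest)."""
    buffer, dropped = deque(), []
    for event in events:
        if len(buffer) == capacity:
            dropped.append(buffer.popleft())
        buffer.append(event)
    return list(buffer), dropped


events = [1, 2, 3, 4, 5]
print(fill_drop_newest(events, 3))  # ([1, 2, 3], [4, 5])
print(fill_drop_oldest(events, 3))  # ([3, 4, 5], [1, 2])
```

Rejecting at the door is the first policy; delivering recent notifications at the expense of stale ones needs the second.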


DougN7 3518 days ago

I was wondering the same thing. Dropping an unknown number of requests isn't all that impressive. It seems like a simpler approach would have been to use a Message Queue of some sort with pushers pulling items from the queue.
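For illustration, the pull-based alternative suggested above might look like this in miniature, with threads standing in for pushers and an in-process `queue.Queue` standing in for a message queue. `run_pushers` is a hypothetical helper, and appending to a list stands in for the real push call.

```python
import queue
import threading


def run_pushers(notifications, worker_count=2):
    """Workers pull from a shared queue at their own pace."""
    work = queue.Queue()
    delivered = []
    lock = threading.Lock()

    def pusher():
        while True:
            item = work.get()
            if item is None:            # sentinel: no more work
                break
            with lock:
                delivered.append(item)  # stand-in for the real push call

    workers = [threading.Thread(target=pusher) for _ in range(worker_count)]
    for w in workers:
        w.start()
    for n in notifications:
        work.put(n)
    for _ in workers:
        work.put(None)                  # one sentinel per worker
    for w in workers:
        w.join()
    return delivered


print(sorted(run_pushers(["a", "b", "c"])))  # ['a', 'b', 'c']
```

Note that an unbounded queue like this one only moves the memory problem: if producers outpace the pushers for long enough, the queue itself still needs a size limit or a drop policy.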
