Yeah, that's true. But my environment is such that any one of 100 or so app servers has a significantly lower chance of running out of memory than the Redis server does.
The 0MQ high water mark is set high enough that it's virtually impossible for a broken DB to go unfixed before the messages queued in memory on the client side cause an OOM condition.
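As a rough illustration of that margin, here is a back-of-the-envelope sketch in Python. The high water mark, message rate, and message size are made-up assumptions, not figures from the setup described above, and the helper names are hypothetical.

```python
def seconds_until_hwm(hwm_messages: int, msgs_per_sec: float) -> float:
    """Time for a client's backlog to reach the high water mark
    when the downstream DB is accepting nothing at all."""
    if msgs_per_sec <= 0:
        raise ValueError("msgs_per_sec must be positive")
    return hwm_messages / msgs_per_sec


def backlog_bytes(hwm_messages: int, avg_msg_bytes: int) -> int:
    """Approximate memory held by a full backlog (payload only, no overhead)."""
    return hwm_messages * avg_msg_bytes


# Assumed figures for one app server; substitute your own.
HWM_MESSAGES = 1_000_000   # hypothetical high water mark
MSGS_PER_SEC = 50          # hypothetical per-server send rate
AVG_MSG_BYTES = 512        # hypothetical average message size

if __name__ == "__main__":
    hours = seconds_until_hwm(HWM_MESSAGES, MSGS_PER_SEC) / 3600
    mib = backlog_bytes(HWM_MESSAGES, AVG_MSG_BYTES) / (1024 * 1024)
    print(f"time to fix the DB before hitting the HWM: {hours:.1f} h")
    print(f"memory held by a full backlog: {mib:.0f} MiB")
```

With numbers in that range, the window for repairing the DB is measured in hours, and the backlog stays well within what a single app server can hold. Different rates or message sizes change the picture, so it's worth plugging in real values.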
Ultimately, it's all about the odds. That's what HA, replication, and DR are all about. Certain things are so statistically unlikely that they just fall outside the realm of reason. Most operations folk I've talked to don't even consider their disaster recovery plans to be within the realm of feasibility. A catastrophic event that renders the owners of the system defunct is many orders of magnitude more likely than an event that breaks the standard data fail-safes that most datacenters have in place.
Unless there are so many items per sec that your persistence can't keep up. Wouldn't this kind of create the same situation: you can't accept all the new items and have to throw some away. Only, now everything is slower. A lot slower.
Ok, the first scenario is caused by the workers being too slow, so it's not exactly the same :)
Yes, persistent queues have issues, which is why 0MQ exists. But using a non-persistent queue to deal with overflow just delays the problem, which was why I asked...
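The "just delays the problem" point can be sketched with a toy simulation. This is a simplified model with assumed rates and capacities, not a model of 0MQ itself: items arrive at a fixed rate, workers drain at a fixed rate, and anything beyond the buffer's capacity is dropped. The function name and all numbers are hypothetical.

```python
from typing import Optional, Tuple


def simulate(arrival_per_tick: int, drain_per_tick: int,
             capacity: int, ticks: int) -> Tuple[Optional[int], int]:
    """Return (tick of the first drop, total items dropped) for a
    bounded in-memory buffer fed faster or slower than it is drained."""
    queued = 0
    dropped = 0
    first_drop_tick = None
    for tick in range(1, ticks + 1):
        queued += arrival_per_tick
        if queued > capacity:
            # Buffer is full: the overflow is thrown away.
            dropped += queued - capacity
            queued = capacity
            if first_drop_tick is None:
                first_drop_tick = tick
        # Workers take what they can handle this tick.
        queued = max(0, queued - drain_per_tick)
    return first_drop_tick, dropped


if __name__ == "__main__":
    # Same overload (10 in, 8 out per tick), two buffer sizes.
    print(simulate(10, 8, capacity=100, ticks=1000))
    print(simulate(10, 8, capacity=1000, ticks=1000))
    # Workers faster than arrivals: nothing is ever dropped.
    print(simulate(8, 10, capacity=100, ticks=1000))
```

In this model, a bigger buffer only moves the first drop later. Once the buffer is full, the drop rate equals the gap between arrival and drain rates regardless of capacity. Whether that delay is long enough to fix the slow side is the odds question raised above.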