| HN Mirror

The use case in mainly deduplication of a high volume data stream (though it's got a few other uses). The write volume is fairly stable so it's sized in such away that we'll never emit dupes even when the upstream source crashes and needs to be rebuilt from backups (for this case that means > a billion cache entries). Something like the opposite of a bloom filter (https://www.somethingsimilar.com/2012/05/21/the-opposite-of-...) didn't work because we don't want false negatives either. Since the cache is fed by a Kafka log HA is achieved simply by having multiple consumers individually populating their own cache instance. The persistence mechanisms are to allow for code deployments that don't blow away the cache, not HA.

We actually experimented with grid caches (ignite in particular since it offers off-heap in memory storage as well), but the performance simply isn't there. At the volume we're writing even millisecond latency is a non-starter. We did explore both memcached and redis, but we need strict FIFO and both of those solutions provide nondeterministic LRU.