Hacker News new | ask | show | jobs
by potatosareok 3734 days ago
Wondering what your use case is for such a large cache that is not clustered (unless I misunderstood the code). Both in terms of single process dying = lost 10gb of data and also what single process needs a 10gb cache. I do see ability to read from disk listed.

How would you compare your cache to open source data grids that also provide off heap (infinispan, geode/gemfire, ignite) or just other general cache solutions (redis/memcache)?

1 comments

The use case in mainly deduplication of a high volume data stream (though it's got a few other uses). The write volume is fairly stable so it's sized in such away that we'll never emit dupes even when the upstream source crashes and needs to be rebuilt from backups (for this case that means > a billion cache entries). Something like the opposite of a bloom filter (https://www.somethingsimilar.com/2012/05/21/the-opposite-of-...) didn't work because we don't want false negatives either. Since the cache is fed by a Kafka log HA is achieved simply by having multiple consumers individually populating their own cache instance. The persistence mechanisms are to allow for code deployments that don't blow away the cache, not HA.

We actually experimented with grid caches (ignite in particular since it offers off-heap in memory storage as well), but the performance simply isn't there. At the volume we're writing even millisecond latency is a non-starter. We did explore both memcached and redis, but we need strict FIFO and both of those solutions provide nondeterministic LRU.