Hacker News
by TexanFeller 1127 days ago
Yeah, from your description it sounds like those problems are solved by Spark. Spark doesn't persist intermediate state to Cassandra, which might make it better, since its in-memory persistence mechanisms (RDDs, Datasets; in-memory by default, though you can allow spill to disk) are fast, keep data near compute, and can scale elastically during a run.
1 comment

Regarding in-memory storage: an early prototype of Capillaries used Redis for storage, and the performance was stellar. I decided to drop it for two reasons. First, the indexing mechanism required a root-level sorted set, and Redis cannot partition it. Second, most of the intermediate data is supposed to remain available until the end of the run, which means hours, and I was not sure that typical Capillaries users would agree to bear the cost of that much RAM rather than disk space. Am I willing to return to the discussion about replacing Cassandra with some in-memory storage? Maybe.