by kleineshertz 1128 days ago
Nothing wrong with this question. I do not have any experience with Spark, but I guess Capillaries belongs to the same or a similar ecosystem. My understanding is that Spark is a much more generic framework that revolves around DAG-defined workflows and map/reduce-style functionality.

Capillaries is about:

- taking a very structured, stage-by-stage approach to batch data processing, with the possibility to control the results of a specific stage (although some kind of workflow DAG is there as well);
- executing SQL-style aggregation and denormalization on data in Cassandra;
- executing workflows without actually writing code (besides one-liner Go expressions and Python math formulas when needed).
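
To make the "SQL-style aggregation and denormalization" point above concrete, here is a minimal sketch of two row-based stages in plain Python. It is not Capillaries code or configuration: the tables, field names, and the `denormalize`/`aggregate` functions are all made up for illustration, and everything runs in memory rather than against Cassandra.

```python
from collections import defaultdict

# Hypothetical sample data; table and field names are illustrative only.
customers = {
    1: {"name": "Alice", "country": "CA"},
    2: {"name": "Bob", "country": "US"},
}
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 25.0},
    {"order_id": 11, "customer_id": 2, "amount": 40.0},
    {"order_id": 12, "customer_id": 1, "amount": 15.0},
]


def denormalize(orders, customers):
    """Stage 1: lookup join that attaches customer fields to each order row."""
    for order in orders:
        customer = customers.get(order["customer_id"])
        if customer is None:
            continue  # inner-join semantics: skip orders with no matching customer
        yield {**order, **customer}


def aggregate(rows):
    """Stage 2: roughly GROUP BY country with SUM(amount) and COUNT(*)."""
    totals = defaultdict(lambda: {"total": 0.0, "orders": 0})
    for row in rows:
        group = totals[row["country"]]
        group["total"] += row["amount"]
        group["orders"] += 1
    return dict(totals)


# Each stage produces a result that can be inspected before the next one runs.
stage1 = list(denormalize(orders, customers))
stage2 = aggregate(stage1)
print(stage2)
```

Materializing `stage1` before running `aggregate` is the part that mirrors the stage-by-stage idea: the intermediate rows exist as a checkable result instead of being hidden inside one fused pipeline.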

Sorry if I am missing the point with Spark; as I said, I have never worked with it.

2 comments

Should you have made something from scratch without having used its competitor to better understand the problem set and the offerings out there?

"I've never used Postgres so I made my own SQL database"

Funny sentence, right?

Reasonable question, although stated kind of harshly by assuming the worst. Sometimes when you know exactly what you need, it’s reasonable to just build that rather than researching all the possibilities of things that could be adapted to your problem. Fear of reinventing the wheel can be a sort of analysis paralysis where you waste a lot of time looking for an overly-generic solution you will never need. It’s a balance.
It's always a balance. I have been working with teams on both sides of the fence, and I think I am well aware of the dangers of both: keeping the custom wheel running for years vs. fighting the particularities of a third-party tool (up to the point where they start dictating architectural decisions). Most of the operations Capillaries is intended to perform are row-based, so Spark's stellar map-reduce capabilities were not a big selling point, while the price of tech lock-in seemed pretty high.

On a more general note (Spark discussion aside), I like working with third-party solutions that do only one thing, but do it perfectly. And I am OK supporting in-house-built frameworks that behave the same way and do not pretend to be a world peace solution.

If it's not invented here, it can't be any good.
Yeah, from your description it sounds like those problems are solved by Spark. Spark doesn't persist intermediate state to Cassandra, which might make it better, since its in-memory persistence mechanisms (RDDs, Datasets; in-memory normally, but you can allow spill to disk) are fast, keep data near compute, and can scale up elastically during a run.
Regarding in-memory storage: an early prototype of Capillaries used Redis for storage, and the performance was stellar. I decided to drop it for two reasons. First, the indexing mechanism required a root-level sorted set, and Redis cannot partition it. Second, most of the intermediate data is supposed to be available until the end of the run, which means hours, and I was not sure that typical Capillaries users would agree to carry the cost of providing that much RAM instead of disk space. Am I willing to return to the discussion about replacing Cassandra with some in-memory storage? Maybe.
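
For readers unfamiliar with the sorted-set point, here is a rough sketch of what a single sorted index looks like, written as a toy Python class. The `SortedIndex` name, its methods, and the sample keys are hypothetical; the thread does not describe the actual Capillaries index layout. The relevant property is that all entries sit in one ordered structure, much like a Redis sorted set sits under one key, and in Redis Cluster a single key is stored on a single node.

```python
import bisect


class SortedIndex:
    """Toy stand-in for one sorted-set index (hypothetical, not Capillaries code)."""

    def __init__(self):
        # One ordered list of (key, row_id) pairs: the whole index lives in one place.
        self._entries = []

    def add(self, key, row_id):
        # Insert while keeping the list sorted by key, then by row_id.
        bisect.insort(self._entries, (key, row_id))

    def range(self, lo, hi):
        """Return row ids whose key is in [lo, hi], in key order."""
        # (lo,) sorts before every (lo, row_id), so this finds the first key >= lo.
        start = bisect.bisect_left(self._entries, (lo,))
        result = []
        for key, row_id in self._entries[start:]:
            if key > hi:
                break
            result.append(row_id)
        return result


index = SortedIndex()
index.add("2021-03", 7)
index.add("2021-01", 3)
index.add("2021-02", 5)
index.add("2021-02", 4)
print(index.range("2021-02", "2021-03"))  # → [4, 5, 7]
```

Because range lookups depend on the global ordering of entries, splitting such a structure across machines needs an explicit partitioning scheme on top, which is the kind of thing a partitioned store like Cassandra provides out of the box.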