Show HN: Capillaries: Distributed data processing with Go and Cassandra (capillaries.io)
70 points by kleineshertz 1132 days ago
I started thinking about this approach after working on a large-scale project for a major financial company where our group developed a distributed in-house data processing solution. On a regular basis, it ingested a few gigabytes of financial data and, within a tight SLA time limit, produced a lot of enriched/aggregated/validated data for a number of customers. Sometimes, source data had errors, so operators with domain knowledge had to verify data validity at some checkpoints, immediately make corrections, and re-run parts of the workflow manually. The solution involved complex web service orchestration and a custom database, and was very demanding on infrastructure availability.

Capillaries is a built-from-scratch, open-source Go solution that does just that: ingests data and applies user-defined transforms (Go one-liner expressions, Python formulas, joins, aggregations, denormalization), using Cassandra for intermediate data storage and RabbitMQ for task scheduling. End users just have to provide:
- source data in CSV files;
- Capillaries script (JSON file) that defines the workflow and the transforms;
- Python code that performs complex calculations (only if needed).
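To give a feel for what a "Go one-liner expression" transform could look like in practice, here is a minimal sketch that evaluates a small arithmetic expression against one row of data using the standard go/parser package. It is an illustration only, not Capillaries' actual evaluator: the function names (evalExpr, evalNode), the field names (price, qty), and the supported subset (numbers, field names, parentheses, + - * /) are all assumptions made for this example.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strconv"
)

// evalExpr parses a Go expression and evaluates it against a row of named
// float64 fields. Hypothetical helper, not part of Capillaries.
func evalExpr(expr string, row map[string]float64) (float64, error) {
	node, err := parser.ParseExpr(expr)
	if err != nil {
		return 0, err
	}
	return evalNode(node, row)
}

// evalNode walks the AST and supports only a tiny subset of Go expressions.
func evalNode(n ast.Expr, row map[string]float64) (float64, error) {
	switch e := n.(type) {
	case *ast.BasicLit:
		// Only numeric literals are accepted.
		if e.Kind != token.INT && e.Kind != token.FLOAT {
			return 0, fmt.Errorf("unsupported literal %s", e.Value)
		}
		return strconv.ParseFloat(e.Value, 64)
	case *ast.Ident:
		// An identifier is treated as a field of the current row.
		v, ok := row[e.Name]
		if !ok {
			return 0, fmt.Errorf("unknown field %q", e.Name)
		}
		return v, nil
	case *ast.ParenExpr:
		return evalNode(e.X, row)
	case *ast.BinaryExpr:
		x, err := evalNode(e.X, row)
		if err != nil {
			return 0, err
		}
		y, err := evalNode(e.Y, row)
		if err != nil {
			return 0, err
		}
		switch e.Op {
		case token.ADD:
			return x + y, nil
		case token.SUB:
			return x - y, nil
		case token.MUL:
			return x * y, nil
		case token.QUO:
			if y == 0 {
				return 0, fmt.Errorf("division by zero")
			}
			return x / y, nil
		}
		return 0, fmt.Errorf("unsupported operator %s", e.Op)
	}
	return 0, fmt.Errorf("unsupported expression %T", n)
}

func main() {
	row := map[string]float64{"price": 12.5, "qty": 4}
	v, err := evalExpr("price * qty + 1", row)
	fmt.Println(v, err)
}
```

Parsing with go/parser means the user-facing expression syntax is plain Go, while the evaluator decides which constructs are allowed; anything outside the supported subset is rejected with an error rather than executed.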

The whole data processing pipeline can be split into separate runs that can be started independently and re-run by the user if needed.

The goal is to build a platform that is tolerant to database and processing node failures, and allows users to focus on data transform logic and data quality control.

The “Getting started” Docker-based demo calculates ARK funds performance, using EOD holdings and transactions data acquired from public sources. There are also integration tests that use non-financial data. There is a test deploy tool that uses the OpenStack API for provisioning in the cloud.

6 comments

Surprised you didn't pick ScyllaDB over Cassandra. ScyllaDB has excellent out-of-the-box support for Golang Change Data Capture and can handle much more load than Cassandra on the same hardware. It has nice integration with stuff like Confluent as well. Given ScyllaDB is mostly a drop-in replacement, I guess it would be quite straightforward to swap Cassandra out if someone wanted.
ScyllaDB is definitely on the radar. The main reason I picked Cassandra at the prototyping stage was that the default Cassandra configuration gave me much better performance than ScyllaDB (I know, it is supposed to be vice versa). Another obvious reason was Cassandra's maturity and community support. If gocqlx is indeed a drop-in replacement for gocql, I can't see problems having a separate config/fork using ScyllaDB along with Cassandra.
Yeah, gocql works fine. The CDC on Golang is pretty good as well.

I asked for support on Twitter for some specific load tests and got 2 detailed responses in a day. Only later did I realise both were from the CEO of the company! :D

At this point, I wonder if they should simply have picked a more memorable name. Yes, it's clever, but even I regularly forget its name. Never happens with Cassandra.
Oh this is funny for me, since I am founder of https://capillary.io (not related at all)
https://capillaries.io/ (Cassandra & RabbitMQ) reminds me of https://temporal.io/ (PostgreSQL)

Next up, TemporiarriesLite™

Go + SQLite using https://litestream.io for single-instance, low-power systems or server-less, seldom-used apps that need distributed backups and statefulness.

Jokes aside, these at-least-once operation state managers are really nice and help us avoid littering SQS / NATS / etc. queues all over the place. The focus on data processing by Capillaries is nice. Looking forward to trying it out.
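For readers unfamiliar with the term: at-least-once delivery means the same message may arrive more than once, so the handler has to be idempotent. A minimal sketch of that idea follows; the Ledger type and the message IDs are made up for illustration and are not taken from Capillaries or Temporal.

```go
package main

import "fmt"

// Ledger applies each message at most once, even if the queue redelivers it.
// Hypothetical type used only to illustrate idempotent handling.
type Ledger struct {
	seen  map[string]bool
	Total int
}

func NewLedger() *Ledger {
	return &Ledger{seen: make(map[string]bool)}
}

// Apply returns true if the message changed state and false if it was a
// duplicate that had already been applied.
func (l *Ledger) Apply(msgID string, amount int) bool {
	if l.seen[msgID] {
		return false
	}
	l.seen[msgID] = true
	l.Total += amount
	return true
}

func main() {
	l := NewLedger()
	// "m1" is delivered twice, as an at-least-once queue is allowed to do.
	deliveries := []struct {
		id     string
		amount int
	}{{"m1", 10}, {"m2", 5}, {"m1", 10}}
	for _, d := range deliveries {
		fmt.Println(d.id, "applied:", l.Apply(d.id, d.amount))
	}
	fmt.Println("total:", l.Total)
}
```

In a real system the set of processed IDs would live in durable storage rather than in a map, otherwise a restart of the worker loses the deduplication state.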

Temporal is a different ecosystem (and a much more ambitious solution), but one of the principles is the same: users want a platform that solves scalability issues and lets them focus on biz logic and customer value.
Temporal also shares the principle of being tolerant to database and processing node failures
Temporal has different database options: Cassandra, Postgres, MySQL, SQLite.

> source data in CSV files; - Capillaries script (JSON file) that defines the workflow and the transforms; - Python code that performs complex calculations (only if needed).

Temporal is more general purpose: source data anywhere, and you write code to define workflows and transforms instead of JSON, and the code can be in Go/Java/Python/JS/TS/.NET

Right. And this is where our paths go separate ways. As I see it, Capillaries users are not necessarily tech companies and they do not have much appetite for writing and maintaining a lot of code. All they want is to run, say, 50 kinds of workflows on a regular basis and to keep those workflow definitions very formalized and stable. It would be hard to sell the idea of maintaining 50 different codebases to their management.

As for more complex calculations: in Capillaries, Python is not a programming platform, it's just a scripting engine.

There are definitely ease-of-use benefits to more tailored solutions. If workflow definitions are really simple and don't change much, JSON might be easy. For most things, I prefer the DX of writing the logic in code. And it wouldn't be 50 different codebases: it would usually be a single codebase with 50 different functions.
My suggestion… read/write Avro and Parquet files so that big data pipelines could use Capillaries. I was working on something along these lines, not sure if you support it or not. If not, you really should.
Parquet support is on the radar for sure, and I would like to have it before diving into database connector development.
Couldn't you just run it on Airflow / Luigi / Keboola / Dagster / Flyte?
Maybe. The scenarios Capillaries is intended for do not need complex/flexible workflow, we just need some basic dependency rules (easy to implement) and really reliable scheduling (RabbitMQ).
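To illustrate what "basic dependency rules" can amount to, here is a minimal sketch: a node is ready to run once every node it depends on has succeeded. The node names, the Status type, and readyNodes are assumptions made for this example and do not reflect Capillaries' actual script format or scheduler.

```go
package main

import (
	"fmt"
	"sort"
)

// Status of a workflow node. Hypothetical, for illustration only.
type Status int

const (
	Pending Status = iota
	Succeeded
	Failed
)

// readyNodes returns, in sorted order, the pending nodes whose dependencies
// have all succeeded. Nodes missing from the status map count as Pending.
func readyNodes(deps map[string][]string, status map[string]Status) []string {
	var ready []string
	for node, parents := range deps {
		if status[node] != Pending {
			continue
		}
		ok := true
		for _, p := range parents {
			if status[p] != Succeeded {
				ok = false
				break
			}
		}
		if ok {
			ready = append(ready, node)
		}
	}
	sort.Strings(ready)
	return ready
}

func main() {
	deps := map[string][]string{
		"read_csv":  nil,
		"join":      {"read_csv"},
		"aggregate": {"join"},
	}
	status := map[string]Status{}

	fmt.Println(readyNodes(deps, status)) // only the node without dependencies

	status["read_csv"] = Succeeded
	fmt.Println(readyNodes(deps, status)) // now its direct dependent
}
```

With a rule like this, a failed node simply blocks everything downstream of it until an operator fixes the data and re-runs that part of the workflow.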
Maybe I’m asking the wrong question but how does it compare to Apache Spark etc?
Nothing wrong with this question. I do not have any experience with Spark, but I guess Capillaries belongs to the same or a similar ecosystem. My understanding is that Spark is a far more generic framework that revolves around DAG-defined workflows and map/reduce-style functionality.

Capillaries is about:

- taking a very structured, stage-by-stage approach to batch data processing, with the possibility to control the results of a specific stage (although some kind of workflow DAG is there as well);
- executing SQL-style aggregation and denormalization on data in Cassandra;
- executing workflows without actually writing code (besides one-liner Go expressions and Python math formulas when needed).
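As a rough illustration of the "SQL-style aggregation" item, the sketch below does the in-memory equivalent of SELECT fund, COUNT(*), SUM(value) ... GROUP BY fund. The Holding and FundTotal types and the sample data are invented for this example; Capillaries performs this kind of operation on data stored in Cassandra, which the sketch does not attempt to model.

```go
package main

import "fmt"

// Holding is a hypothetical input row.
type Holding struct {
	Fund   string
	Ticker string
	Value  float64
}

// FundTotal is the aggregated output: one entry per fund.
type FundTotal struct {
	Count int
	Sum   float64
}

// groupByFund mimics SELECT fund, COUNT(*), SUM(value) ... GROUP BY fund.
func groupByFund(rows []Holding) map[string]FundTotal {
	out := make(map[string]FundTotal)
	for _, r := range rows {
		t := out[r.Fund] // zero value for a fund seen for the first time
		t.Count++
		t.Sum += r.Value
		out[r.Fund] = t
	}
	return out
}

func main() {
	rows := []Holding{
		{Fund: "FUND_A", Ticker: "AAA", Value: 100},
		{Fund: "FUND_A", Ticker: "BBB", Value: 50},
		{Fund: "FUND_B", Ticker: "CCC", Value: 25},
	}
	for fund, t := range groupByFund(rows) {
		fmt.Printf("%s: count=%d sum=%.2f\n", fund, t.Count, t.Sum)
	}
}
```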

Sorry if I am missing the point with Spark, as I said - I never worked with it.

Should you have made something from scratch without first having used its competitor to better understand the problem set/offerings out there?

"I've never used Postgres so I made my own SQL database"

Funny sentence, right?

Reasonable question, although stated kind of harshly by assuming the worst. Sometimes when you know exactly what you need, it’s reasonable to just build that rather than researching all the possibilities of things that could be adapted to your problem. Fear of reinventing the wheel can be a sort of analysis paralysis where you waste a lot of time looking for an overly-generic solution you will never need. It’s a balance.
It's always a balance. I have been working with teams on both sides of the fence and I think I am well aware of the dangers of both: keeping the custom wheel running for years vs fighting the particularities of a third-party tool (up to the point they start dictating architectural decisions). Most of the operations Capillaries is intended to perform are row-based, and stellar Spark map-reduce capabilities were not a big selling point, while the tech lock-in price seemed pretty high.

On a more general note (Spark discussion aside), I like working with third-party solutions that can do only one thing, but they do it perfectly. And I am ok supporting in-house-built frameworks that behave the same way and do not pretend to be a world peace solution.

If it's not invented here, it can't be any good.
Yeah, from your description it sounds like those problems are solved by Spark. Spark doesn't persist intermediate state to Cassandra, which might make it better, since its persistence mechanisms (RDDs, Datasets; normally in-memory, though you can allow spill to disk) are fast, keep data near compute, and can scale up elastically during a run.
Regarding using in-memory storage: an early prototype of Capillaries used Redis for storage and the performance was stellar. I decided to drop it for two reasons. First, the indexing mechanism required a root-level sorted set, and Redis cannot partition it. Second, most of the intermediate data is supposed to be available until the end of the run, which means hours, and I was not sure that typical Capillaries users would agree to carry the cost of providing so much RAM vs disk space. Am I willing to return to the discussion about replacing Cassandra with some in-memory storage? Maybe.
I would have had the same question with Apache Storm, it sounds to me that these tools would solve the described problem relatively well (and now that I think about it, Spark even has Python support).
Storm positions itself as a stream processing solution, while Capillaries is 100% batch-oriented.