by hichkaker 2137 days ago
Those tools are definitely vastly powerful. Have you used either of them?

TBH, I haven't, but judging from our current post-Informatica users and from reading questions on the official Informatica/Talend user forums, I concluded that the diffing problem (to be specific: not only schema diffing but also data diffing) is not directly addressed by them. The answers are in the realm of "there is no diff feature but you can write SQL...".
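
To make "data diffing" concrete, here is a minimal sketch of a row-level diff between two snapshots of a table keyed by primary key. It is only an illustration, not how Datafold, Informatica, or Talend implement it; the function name `diff_rows` and the dict-of-rows layout are assumptions made for the example.

```python
from typing import Any, Dict, Hashable, List

Row = Dict[str, Any]


def diff_rows(old: Dict[Hashable, Row], new: Dict[Hashable, Row]) -> Dict[str, Any]:
    """Compare two snapshots of a table, each mapping primary key -> row.

    Hypothetical helper for illustration only.
    """
    # Keys present only in the new snapshot.
    added: List[Hashable] = sorted(k for k in new if k not in old)
    # Keys present only in the old snapshot.
    removed: List[Hashable] = sorted(k for k in old if k not in new)

    # For keys in both snapshots, list the columns whose values differ.
    changed: Dict[Hashable, List[str]] = {}
    for key in old.keys() & new.keys():
        columns = old[key].keys() | new[key].keys()
        differing = sorted(c for c in columns if old[key].get(c) != new[key].get(c))
        if differing:
            changed[key] = differing

    return {"added": added, "removed": removed, "changed": changed}
```

A real tool would push this comparison down into the database rather than load rows into memory, but the output shape (added, removed, changed per column) is the part that hand-written SQL tends to leave to the user.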

In general, we see data stacks becoming increasingly modularized and tools more specialized. For example, there are at least 20x more teams using OSS like Airflow/Luigi/Dagster for managing their data pipelines (and 2-5 other tools for the rest of the workflow) than using the end-to-end platforms you mentioned. We see Datafold as a regression testing tool in a modular stack.


Thank you for your reply.

I used Talend extensively 3 years ago, but I didn't have a use case for schema diff at the time. For data diff, though, you can easily define a workflow. And I have to admit these workflows are crazy powerful and can even help fix the data with any transformation required (no-code or code).

However, I do see the use case for a lightweight tool with a visual aspect. I like this. But will this problem be big enough for VC investment? That is the question. I can see schema diff being just a plugin in one of the existing database tools. And if you are getting into data diff, you have to look at what those tools do too.

> But will this problem be big enough for VC investment is the question?

That's a great question. Thinking about where problems arise in data pipelines, there are fundamentally two moving pieces: 1) Your data – you're continuously getting new data without a real ability to enforce your assumptions on its schema or shape. 2) Your code for ingestion and transformation that needs to evolve with the business and to adapt to changes in other parts of the infra.

Datafold's Diff tool currently mostly addresses #2. It can add value to any company that runs ETL pipelines, but it is most impactful at large data engineering teams (a similar story to CI or automated testing tools).

Regarding #1, wouldn't it be useful if we tracked ALL your datasets across time and alerted you on anomalies in those datasets? And I am not talking about rigid "unit" tests e.g. X <= value < Y, but actual stats-based anomaly detection, akin to what Uber does: https://eng.uber.com/monitoring-data-quality-at-scale/
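
As a rough illustration of the difference between a rigid range test and a stats-based check, here is a minimal sketch that flags a new value of a tracked metric (say, a daily row count) when it falls too far from its own history. It uses a plain z-score, which is a deliberately simple stand-in and not the method from the Uber post; the function name `is_anomalous` and the default threshold of 3.0 are assumptions for the example.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(history: Sequence[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag `latest` if it deviates from `history` by more than `threshold` standard deviations.

    Hypothetical helper: a simple z-score check, not a production detector.
    """
    if len(history) < 2:
        # Not enough data to estimate spread, so do not alert.
        return False

    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation

    if sigma == 0:
        # History is constant: any change at all is a deviation.
        return latest != mu

    return abs(latest - mu) / sigma > threshold
```

Unlike `X <= value < Y`, the acceptable range here is derived from the dataset's own past behavior, so nobody has to pick and maintain bounds per column.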

So, with diff, we already compute and store detailed statistical profiles on every column in the table. Next, we are going to track those profiles across time.
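
For a sense of what a per-column statistical profile might contain, here is a minimal sketch over numeric columns. The chosen statistics and the names `profile_column` and `profile_table` are assumptions for illustration; the actual profiles Datafold stores are not described in this thread.

```python
from statistics import mean
from typing import Any, Dict, List, Optional


def profile_column(values: List[Optional[float]]) -> Dict[str, Any]:
    """Summarize one numeric column; None represents a NULL. Hypothetical helper."""
    non_null = [v for v in values if v is not None]
    total = len(values)
    return {
        "count": total,
        "null_fraction": (total - len(non_null)) / total if total else 0.0,
        "distinct": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "mean": mean(non_null) if non_null else None,
    }


def profile_table(rows: List[Dict[str, Optional[float]]]) -> Dict[str, Dict[str, Any]]:
    """Profile every column that appears in any row; missing keys count as NULL."""
    columns = sorted({column for row in rows for column in row})
    return {c: profile_column([row.get(c) for row in rows]) for c in columns}
```

Storing one such profile per table per run gives a time series for each statistic, which is the input a history-based anomaly check would need.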

Diff is just the first tool we've built to get a wedge into the workflows of high-velocity data teams and start adding value, but it's just the beginning of a more comprehensive and, hopefully, valuable product we aspire to deliver.

I much appreciate your response.