| > But will this problem be big enough for VC investment is the question? That's a great question. Thinking about where problems arise in data pipelines, there are fundamentally two moving pieces:
1) Your data – you're continuously getting new data without a real ability to enforce your assumptions on its schema or shape.
2) Your code for ingestion and transformation, which needs to evolve with the business and adapt to changes in other parts of the infra.
Datafold's Diff tool currently addresses mostly #2. It can add value to any company that runs ETL pipelines, but it is most impactful for large data engineering teams (a similar story to CI or automated testing tools).
Regarding #1, wouldn't it be useful if we tracked ALL your datasets across time and alerted you to anomalies in those datasets? And I am not talking about rigid "unit" tests, e.g. X <= value < Y, but actual stats-based anomaly detection, akin to what Uber does: https://eng.uber.com/monitoring-data-quality-at-scale/
So, with Diff, we already compute and store detailed statistical profiles for every column in the table. Next, we are going to track those profiles across time. Diff is just the first tool we've built to get a wedge into the workflows of high-velocity data teams and start adding value; it is the beginning of a more comprehensive and, hopefully, valuable product we aspire to deliver. |
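
To make the contrast between a rigid range test and a stats-based check concrete, here is a minimal sketch in Python. It is not Datafold's implementation or Uber's method; it is a plain z-score check, and the function name `is_anomalous`, the threshold of 3.0, and the sample numbers are all assumptions made up for illustration. The idea is that the acceptable range is derived from the metric's own history rather than hard-coded.

```python
from statistics import mean, stdev


def is_anomalous(history, today, z_threshold=3.0):
    """Return True if `today` is more than `z_threshold` sample standard
    deviations away from the mean of `history` (a simple z-score check)."""
    if len(history) < 2:
        return False  # not enough history to estimate a spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # History is perfectly constant, so any change counts as an anomaly.
        return today != mu
    return abs(today - mu) / sigma > z_threshold


# Hypothetical daily profile metric: fraction of NULLs in one column.
null_fraction_history = [0.010, 0.012, 0.011, 0.009, 0.010, 0.011, 0.012]

print(is_anomalous(null_fraction_history, 0.011))  # within the usual spread
print(is_anomalous(null_fraction_history, 0.150))  # far outside the usual spread
```

A fixed test such as "null fraction < 0.2" would pass both values; the history-based check flags only the second one. A real system would likely also need to handle seasonality and trends, which this sketch ignores.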