by georgewfraser 2881 days ago
Data pipelines are not a great subject for an open-source project. We've been building these for the last 3+ years at Fivetran, and I can tell you that the challenges are:

  - Studying each source to figure out the right data model
  - Chasing down a million weird corner cases
  - Working around dumb bugs in the data sources
This is the kind of problem where paying for software really works better. When people build data pipelines in-house, they tend to hack at it until it works for their use case and then stop. When we build data pipelines, we map out every feature of the data source, implement the whole thing at once, and then put it through a beta period with multiple real users. This is easy to do when you have a tight-knit dev team; much harder for a group of part-time open-source contributors.
4 comments

I think the point is to provide a set of tools for people who build data pipelines. Period. The software being open source doesn't determine in any way WHO will use this tool. Depending on the success of this project, it might be that you could switch your team to this new tool at some point.

Personally I work as a "lone wolf" (much to my own complaint) because I'm in a small company that can't afford a huge team. Most of my (ETL) transforms are done in SQL, which happens to be pretty standardized compared with many ETL products I've seen so far.

This solution is probably far from being ready, but I find this approach quite interesting, because it looks like a code-based ETL that uses SQL for transforms (so I might be biased). Overall this might result in a more maintainable/versionable data pipeline model than GUI-first ETL, which usually generates spaghetti code. Because you are usually forced to regularly adapt data pipelines to unstable external inputs, being able to easily diff the ETL process would be a blessing.

The scope of Meltano isn't limited to just data pipelines, though that is the first major part of it.

One thing that gets me really excited about it is the way we want to build version control in from the start. To give you an example of where that's really powerful - we have a bunch of dashboards in Looker. Right now, figuring out what Looks/Dashboards rely on a given field is very challenging. If I change a column in my extraction, right now I can fairly easily propagate it to my final transformed table (thanks to dbt!) and even to the LookML. But knowing what in Looker is going to change / break if I change the LookML is way harder.

But if everything was defined in code from extraction, loading, transformation, modeling, _and_ visualization, that'd be really powerful from my perspective.
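
A rough sketch of what "everything defined in code" could buy you, in Python. It assumes each layer declares which upstream artifacts it reads from; the `DEPENDS_ON` mapping, the `downstream_of` helper, and all table/dashboard names are hypothetical and are not part of Meltano, dbt, or Looker.

```python
from collections import defaultdict, deque

# Hypothetical declarations: each artifact lists the upstream artifacts it reads from.
DEPENDS_ON = {
    "raw.orders.amount": [],
    "analytics.orders.revenue": ["raw.orders.amount"],
    "lookml.orders.total_revenue": ["analytics.orders.revenue"],
    "dashboard.sales_overview": ["lookml.orders.total_revenue"],
    "dashboard.support_queue": [],
}


def downstream_of(field, depends_on):
    """Return every artifact that directly or indirectly reads from `field`."""
    # Invert the edges so we can walk from a source towards its consumers.
    readers = defaultdict(list)
    for artifact, upstreams in depends_on.items():
        for upstream in upstreams:
            readers[upstream].append(artifact)

    # Breadth-first walk over the inverted graph.
    seen = set()
    queue = deque([field])
    while queue:
        current = queue.popleft()
        for reader in readers[current]:
            if reader not in seen:
                seen.add(reader)
                queue.append(reader)
    return sorted(seen)


affected = downstream_of("raw.orders.amount", DEPENDS_ON)
print(affected)
```

With declarations like these under version control, "what breaks if I change this column?" becomes a graph walk instead of a manual search through dashboards.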

The Meltano team has several user personas that they're looking at focusing on. Data engineers are definitely one of them, but data analysts/BI users are as well, and we want the product to work well for the whole data team.

Can't agree more.

IMHO, if you want to make a dent in the space, figure out better debugging tools!

In particular, tools that explain how a certain (specific) value was calculated in the system, tools that let you bisect the source data in some way and focus on the source data that is likely to have a problem, tools that help you figure out that a certain intermediate value in the calculations is an outlier, tools that let you test certain assumptions about the data over the whole pipeline.
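
As an illustration of the "bisect the source data" idea, here is a minimal Python sketch. It assumes the failure is caused by a single bad row, and that a hypothetical `pipeline_ok(subset)` check fails exactly when the subset contains that row. Both `bisect_bad_row` and `pipeline_ok` are made-up names, not functions from any existing tool.

```python
def bisect_bad_row(rows, pipeline_ok):
    """Narrow a failing batch down to one offending row.

    Assumes `pipeline_ok(subset)` is False whenever the subset contains
    the single bad row, and True otherwise.
    """
    if pipeline_ok(rows):
        return None  # the batch is fine, nothing to find

    lo, hi = 0, len(rows)  # invariant: rows[lo:hi] contains the bad row
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if not pipeline_ok(rows[lo:mid]):
            hi = mid  # the bad row is in the first half
        else:
            lo = mid  # otherwise it must be in the second half
    return rows[lo]


def pipeline_ok(subset):
    """Hypothetical pipeline step: sum the amounts, fail on unparseable values."""
    try:
        sum(float(row["amount"]) for row in subset)
        return True
    except ValueError:
        return False


rows = [{"id": i, "amount": str(i * 10)} for i in range(8)]
rows[5]["amount"] = "N/A"  # plant one bad source value

culprit = bisect_bad_row(rows, pipeline_ok)
print(culprit)
```

This needs roughly log2(n) pipeline runs instead of n, which matters when each run is slow. With several bad rows it still returns one of them, but not necessarily the first.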

You're talking about more debugging tools within the transformation steps of a pipeline, right? dbt is helping with that via data tests (see https://gitlab.com/meltano/analytics/tree/master/elt/dbt/tes...) as an example.
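
For anyone unfamiliar with the pattern: a data test is a query that selects the rows violating an assumption, and the test passes when it returns nothing. The sketch below imitates that idea with Python's built-in `sqlite3`; it is not dbt's implementation, and the `orders` table, the test names, and `run_data_tests` are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 19.99), (2, NULL), (2, 5.00);
""")

# Hypothetical tests: each query selects the rows that violate an assumption.
DATA_TESTS = {
    "amount_not_null": "SELECT id FROM orders WHERE amount IS NULL",
    "id_unique": "SELECT id FROM orders GROUP BY id HAVING COUNT(*) > 1",
    "amount_non_negative": "SELECT id FROM orders WHERE amount < 0",
}


def run_data_tests(connection, tests):
    """Return {test_name: number_of_violating_rows} for every test."""
    return {
        name: len(connection.execute(sql).fetchall())
        for name, sql in tests.items()
    }


results = run_data_tests(conn, DATA_TESTS)
failed = sorted(name for name, count in results.items() if count > 0)
print(failed)
```

Because the tests are plain SQL, they can live in the same repository as the transforms and run on every change.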

I'd love a more robust way to test data pipelines and the data within them generally. I was at DataEngConf earlier this year and many people were talking about exactly this problem. One way we're trying to address it a bit is by using the Review Apps feature on Merge Requests within GitLab. Right now, when you open an MR on our repo it will create a clone of the data warehouse that's completely isolated from production. This, obviously, can't scale once the DW is beyond a certain size, but I think there are ways to keep this sort of practice going.

I kind of agree with this. To take an example outside of ETL/DW/BI, when I first saw Zapier I was skeptical of how many APIs they could support because I'd seen a decent number of open source ESBs like Mulesoft run out of steam after a certain number of connectors. Zapier, being proprietary from day one (albeit less featureful than a full-blown enterprise ESB), has done better than I expected. Still, they only support 100 or so data sources, and the types of data/objects/triggers/whatever they support are limited at times. IMO at some point both open source and proprietary models fall apart in the face of the long tail. Amazon has tackled the long tail of ecommerce, but that's an enormous market that allows them to employ hundreds of thousands of people to tackle that long tail. Tackling the long tail of connectors (whether it's for ESBs/SaaS integration or ETL/DW/BI) is just too expensive compared to the size of the markets that are willing to take a shot at it.
Thanks for the great advice! Coming from your years of experience and trial and error, it is greatly appreciated.

The idea is to give users a set of default extractors (which are the ones we use internally, so they are battle tested), along with loaders, transformers etc., with documentation on how to build their own. For our MVP, and possibly into the future, it will work similarly to Wordpress plugins: you have an extractor directory where you place your extractor, written following our protocol, and the UI will recognize it and give you a choice of extractors to run. The same goes for loaders, and so on.
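
A directory-based plugin scheme like this can be sketched in a few lines of Python. To be clear, this is only an illustration under assumptions: the "protocol" here is simply "a module that defines a callable named `extract`", and `discover_extractors` plus the file names are hypothetical, not Meltano's actual protocol.

```python
import importlib.util
import tempfile
from pathlib import Path


def discover_extractors(directory):
    """Load every .py file in `directory` that defines a callable `extract`."""
    extractors = {}
    for path in sorted(Path(directory).glob("*.py")):
        # Import the file as a module named after the file itself.
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        # Files that do not follow the (assumed) protocol are skipped.
        if callable(getattr(module, "extract", None)):
            extractors[path.stem] = module.extract
    return extractors


# Demo: build a throwaway plugin directory with one valid extractor.
with tempfile.TemporaryDirectory() as plugin_dir:
    Path(plugin_dir, "demo_source.py").write_text(
        "def extract():\n    return [{'id': 1}, {'id': 2}]\n"
    )
    Path(plugin_dir, "notes.py").write_text("# no extract function here\n")

    extractors = discover_extractors(plugin_dir)
    available = sorted(extractors)        # what a UI could offer as choices
    rows = extractors["demo_source"]()    # run the chosen extractor

print(available, rows)
```

Note that importing a plugin executes its code, so a scheme like this implicitly trusts whatever is dropped into the directory.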

We do not want to be chasing down every last corner case for extractors (except for our own), because that's just not a good long-term solution: it needs constant maintenance, as we've seen already. With user contributions, I believe it can work.