by jumpman500 1649 days ago
I've always found ETL frameworks to have their own problems. They seem great on paper, but usually they don't account for your specific source systems, APIs, applications, data size, data distribution or scheduling situations. If your project uses one, developers end up hacking the framework instead of writing simple code that does the specific thing they need to do.

Before you know it you have super long and super inefficient code just to fit the framework. It takes about as long to read and understand an ETL framework as it does to write your own python/bash script, and at least with your own code it's easier to see bottlenecks.

6 comments

I started writing and never completed a dead simple ETL framework that left most of the work up to the programmer. It was basically just an http server that you could hit with cron jobs, a DAG data structure for mapping the process, and some annotations that you could use to specify that a node/program step was concurrency safe, blocking, etc. You’re entirely right that there’s way more to ETL than meets the eye, but this still makes me want to dig it back up.
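
For illustration, the DAG-plus-annotations idea above could be sketched roughly like this. It is not the commenter's actual code: the `step` decorator, the `STEPS` registry and the flag names are all hypothetical, and the HTTP/cron trigger is left out.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical registry: step name -> function, dependencies and annotations.
STEPS = {}

def step(name, after=(), concurrency_safe=False, blocking=False):
    """Register a function as a DAG node and attach scheduling hints to it."""
    def register(fn):
        STEPS[name] = {
            "fn": fn,
            "after": tuple(after),
            "concurrency_safe": concurrency_safe,
            "blocking": blocking,
        }
        return fn
    return register

@step("extract", concurrency_safe=True)
def extract(results):
    return [1, 2, 3]

@step("transform", after=["extract"], concurrency_safe=True)
def transform(results):
    return [x * 10 for x in results["extract"]]

@step("load", after=["transform"], blocking=True)
def load(results):
    return len(results["transform"])

def run_all():
    """Run every registered step once, in dependency order, sequentially."""
    graph = {name: spec["after"] for name, spec in STEPS.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        results[name] = STEPS[name]["fn"](results)
    return results
```

This runner ignores the `concurrency_safe` and `blocking` flags and just runs steps one at a time; a fuller scheduler could read them to decide which nodes may overlap.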
Prophecy.io lets you create visual components from any Spark function. Same with Airflow. So you can use standard components (built-in or your new ones) without being restricted.

Founder here - we’re working to solve this exact problem.

What's the difference between Prophecy and the multitude of other ETL tools out there, like StreamSets, Talend, Ab Initio, and plenty more?
We're very different from the ETL tools in that we're bringing software development best practices to data.

When you do visual drag and drop, Prophecy generates high-quality code on Git that is 100% open source (Spark, Airflow), and you have tests and CI/CD, so you're visually doing solid data engineering.

You can toggle between code and visual - so if you change the code (some), the visual graph updates - so small edits directly to git don’t break the visual layer.

All visual components are generated from a spec: think a Spark function with some more info. So the data platform teams will create their own library/framework and roll it out to the wider teams. How it works is that in the visual editor, you start with the standard Spark library, but can load visual components for Delta, or encryption, or data quality.

Our customers are typically fed up with these ETL tools and are moving to us. We can also import the ETL formats (Ab Initio, Informatica, …) in an automated way (we reverse engineered their formats and created source-to-source compilers).

Couple that with the way ETL frameworks quickly become undocumented and featuriferous, opaque to anyone who isn't deeply embedded in the framework, yeah.
Not sure if this provides any insight or value.

But I had this same experience.

First example was connecting Iterable, which it looks like Airbyte supports, to BigQuery.

In the past I had someone help me set up Snowflake, which was too complicated for me to maintain and understand myself, especially since AWS is too complicated for me compared to the simpler Google Cloud.

I have also tried Stitch and Fivetran at different times, mostly to try to save time setting up non-webhook syncs from FB Marketing and Front. The volume of Iterable data would be prohibitively expensive for us on those paid platforms.

In the end I was able to do FB Marketing myself in less than 1k lines of Python, modified from a script I found on GitHub, which used Google Cloud Scheduler and a Cloud Function. I don't know Python, so that felt somewhat satisfying.

Another nuance in favor of a hosted/paid platform is that it looks like Airbyte uses an API lookup sync instead of webhooks. That lets Airbyte get some more metadata to join to that I don't collect. That's valuable!

For Iterable I ended up making a GAE app to accept incoming webhook data -> which pushes to Pub/Sub -> which pushes to a function -> which writes to BigQuery.

The latency for BigQuery writes was too high to do it all in one request, and I don't think Iterable does webhook retries. Iterable is also MEGA bursty: I've seen our GAE app scale up to sometimes 40+ instances within minutes after we hit send on a campaign. That was the hardest problem to figure out: getting the latency down for cold starts and scaling. Cloud Functions didn't work. It's not perfect, but it's good enough for our needs. The simpler FB function grabs data 100% correctly each day, which feels good. Last I talked to some of the paid ETL providers, it was a flat $500 minimum a month, which is not worth it.

From learning all this I've been able to reuse this gae, pubsub, function, bq/spanner pattern for other stuff I build and it has saved a lot of time and headache.
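
A rough sketch of why the webhook -> queue -> worker -> warehouse split helps with slow writes and bursts. This is not the commenter's code: an in-process `queue.Queue` stands in for Pub/Sub and a plain list stands in for the BigQuery table, and `handle_webhook` and `drain` are hypothetical names.

```python
import json
import queue

# Stand-ins (assumptions): the queue plays the role of Pub/Sub and the list
# plays the role of the warehouse table.
events = queue.Queue()
warehouse_rows = []

def handle_webhook(body: bytes) -> int:
    """Accept a webhook POST: validate, enqueue, and return a status at once."""
    try:
        payload = json.loads(body)
    except ValueError:  # covers malformed JSON and undecodable bytes
        return 400
    events.put(payload)
    return 204  # acknowledge fast; no slow warehouse write on the request path

def drain(batch_size: int = 500) -> int:
    """Worker side: pull up to batch_size events and write them as one batch."""
    batch = []
    while len(batch) < batch_size:
        try:
            batch.append(events.get_nowait())
        except queue.Empty:
            break
    if batch:
        warehouse_rows.extend(batch)  # a real worker would do a batched insert
    return len(batch)
```

Because the handler only validates and enqueues, a burst of webhooks costs one cheap enqueue each, and the slower warehouse writes happen in batches at whatever pace the worker can sustain.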

Indeed. A classic one is dealing with OAuth2…

Airbyte docs:

> Note that the OAuth2Authenticator currently only supports refresh tokens and not the full OAuth2.0 loop.

I think this is saying that particular class expects to receive a refresh token as input. The "full oauth loop" means the UI needs to produce a refresh token via user consent in the browser.
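
To make the "refresh token as input" part concrete, here is a minimal sketch of the standard refresh-token grant from RFC 6749 section 6. It is not Airbyte's implementation; `build_refresh_request` is a hypothetical helper and the URL and credentials in the usage note are placeholders. It only builds the request and does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_refresh_request(token_url, client_id, client_secret, refresh_token):
    """Build (but do not send) a refresh-token grant request (RFC 6749, section 6)."""
    form = urlencode({
        "grant_type": "refresh_token",
        "refresh_token": refresh_token,
        # Assumption: the provider accepts client credentials in the form body;
        # some providers expect HTTP Basic auth instead.
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode("ascii")
    return Request(
        token_url,
        data=form,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
```

For example, `build_refresh_request("https://auth.example.com/token", "cid", "secret", "rt-123")` yields a POST whose body contains `grant_type=refresh_token`. Getting that first refresh token still requires the browser consent step, which is the part the quoted note says is not covered.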
I agree. I've used AWS Data Pipeline for some jobs, but there is a steep learning curve. It is good for launching servers on demand to run your ETL jobs, if you need that.

The best solution I have found is writing ETL scripts in Laravel which I use for most projects anyway. The framework has built-in scheduling and error reporting.

FYI, I think Data Pipeline has not been actively developed for a while and is a 'zombie' product.