| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by greenie_beans 847 days ago
	this is what i try to do but i want to learn more about approaches like this, do you know any good resources about how to design ETL pipelines?

4 comments

jjice 847 days ago

I wish I did. I currently work at a startup with our core offering being ETL, so I've learned along the way as we've continued. If anyone has any, I'd love to hear as well.

Keeping raw data when possible has been huge. We keep some in our codebase for quick tests during development and then we keep raws from production runs that we can evaluate with each change, giving us an idea of the production impact of the change.

link

65 847 days ago

I built an ETL pipeline for a government client using just AWS, Node, and Snowflake. All Typescript. To cache the data I store responses in S3. If there's a cache available, use the S3 data, if not get the new data. We can also clean the old cache occasionally with a cron job. Then do transforms and put it in Snowflake. Sometimes we need to do transforms before caching the data in S3 (e.g. adding a unique ID to CSV rows), or doing things like splitting giant CSV files into smaller files that can then be inserted into Snowflake (Snowflake has a 50mb payload limit). We have alerts, logging, and metadata set up as well in AWS and Snowflake. Most of this comes down to your knowledge of cloud data platforms.

It's honestly not that difficult to build ETL pipelines from scratch. We're using a ton of different sources with different data formats as well. Using the Serverless framework to set up all the Lambda functions and cron jobs also makes things a lot easier.

link

greenie_beans 847 days ago

i appreciate you sharing all that, but it seems like we might be on similar levels of knowledge/experience. i've been a dev who does a lot data engineering for 5 years. i'm looking more for best practices and theory about designing the pipeline, how to arrange the order of operations, how to separate each step, logging practices, how to make it reproducible, how to restart when it fails halfway in without going back to the beginning, how many retries, what to do if a step gets stuck in failed state, how to flag that bad data, etc. so. many. questions. while i build these pipelines.

i have figured out these questions by seeing how more experienced devs do it and on my own, but i want to learn from a book or video series because you can only figure out so much yourself, eventually you need to seek out experts and sometimes the experts around you also figured it out themselves and you need to find an expert outside of your circle. unfortunately a lot of the "ETL experts" teaching stuff online are trying to sell me on prefect or airflow or snowflake etc

link

NortySpock 847 days ago

Disclaimer: previous job had a lot of cases where CSVs were dropped by SFTP, your milage may vary..., JSON APIs are said to be a different flavor of crazy...

Haven't heard much beyond "ask the Old Ones", but "Murphy's law strikes again", "eventually someone will want that data even though they swore it was unnecessary", "eventually someone will ask for a backfill/replay", "eventually someone will give you a duplicate file", "eventually someone will want to slice-and-dice the data a different way" and "eventually someone will change the schema without telling you" have been some things I have noticed.

Even de-duplicating data is, in a sense, deletion (or someone will eventually want to get at the data with duplicates -- e.g. for detecting errors or repeats or fraud or some other analysis that mirrors looking at the bullet holes in World War 2 bombers)

Store the data as close to the original form as you can. Keep a timestamp of when you landed the data. Create a UUID for the record. Create a hash of the record if you can. Create a batch_id if you load multiple things at once (e.g. multiple CSVs). Don't truncate and reload a table - rather, append to it. If you still need something that looks like atomic table changes, I've gotten away with something close: "a view that shows only the most recent valid batch". (Yes this is re-inventing the database wheel, but sometimes you make do with the tools you are forced to use.)

Someone, somewhere, will hand you a file that does not conform to the established agreement. You want to log that schema change, with a timestamp, so you can complain to them with evidence that they ain't sending you what they used to, and they didn't bother sending you an email beforehand...

They're not going to fix it on your timeline, so you're probably going to end up hacking your code to work a different way... Until, you know, they switch it back...

So, yeah. Log it. Timestamp it. Hash it. UUID it. Don't trust the source system to do it right, because they will eventually change the script on you. Keep notes, and plan in such a way that you have audit logs and can move with agility.

link

NortySpock 847 days ago

In conclusion...

I find, in data engineering ,the goal is not to prevent everything, it's to be flexible and prepared to handle lots of change, even silly changes, and be able to audit it, observe it, maneuver around it, and keep the mean-time-to-resolution low.

link

greenie_beans 846 days ago

sound advice. thank you.

link

computershit 847 days ago

There's quite a bit of new tooling in this space, selecting the right one is going to depend on your needs then you can spike from there. Check out Prefect, Dagster, Windmill, Airbyte (although the latter is more ELT than ETL).

link