| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by jjice 856 days ago
	I've found this to be a good practice for ETL in general. Separate the steps, and save the raw data from "E" if you can because it makes testing and verifying "T" later much easier.

2 comments

iamacyborg 856 days ago

I realise from working a few places that this isn't entirely common practice, but when we built the data warehouse at a startup I worked at, we engaged with a consultancy who taught us the fundamentals of how to do it properly.

One of those fundamentals was separating out the steps of landing the data vs subsequent normalisation and transformation steps.

link

ethbr1 856 days ago

It's unfortunate that "ETL" stuck in mindshare, as afaik almost all use cases are better with "ELT"

I.e. first preserve your raw upstream via a 1:1 copy, then transform/materialize as makes sense for you, before consuming

Which makes sense, as ELT models are essentially agile for data... (solution for not knowing what we don't yet know)

link

dragonwriter 856 days ago

I think ETL is right from the perspective where E refers to “from the source of data” and L refers to “to the ultimate store of data”.

But the ETL functionality should itself lives in a (sub)system that has its own logical datastore (which may or may not be physically separate from the destination store), and things should be ELT where the L is with respect to that store. So, its E(LTE)L, in a sense.

link

appplication 855 days ago

For those confused as to whether ETL or ELT is ultimately more appropriate for you… almost everyone is really just doing ETLTLTLT or ELTLTLTL anyways. The distinction is really moot.

link

ethbr1 855 days ago

Maybe my understanding is incorrect, but expansion on the distinction.

Assumptions -- We're talking about two separate systems (source and destination) with non-neglible transfer time (although perhaps "quick")

ETL -- Performing the transform before/during the load, such that fields in the destination are not guaranteed to have existed in the source (i.e. 2 db model)

ELT -- Performing a 1:1 copy of source into an intermediary table/db (albeit perhaps with filtering), then performing a transform on the intermediary table/db to generate the destination table/db (either realized or materialized at query time), with the intermediary table/database history retained (i.e. 3 table/db model)

In short distinction, if regeneration or altering the destination is required, ETL relies on history being available in the upstream source.

ELT pulls control of that to the destination-owner, as they're retaining the raw data on their side.

link

greenie_beans 856 days ago

this is what i try to do but i want to learn more about approaches like this, do you know any good resources about how to design ETL pipelines?

link

jjice 856 days ago

I wish I did. I currently work at a startup with our core offering being ETL, so I've learned along the way as we've continued. If anyone has any, I'd love to hear as well.

Keeping raw data when possible has been huge. We keep some in our codebase for quick tests during development and then we keep raws from production runs that we can evaluate with each change, giving us an idea of the production impact of the change.

link

65 856 days ago

I built an ETL pipeline for a government client using just AWS, Node, and Snowflake. All Typescript. To cache the data I store responses in S3. If there's a cache available, use the S3 data, if not get the new data. We can also clean the old cache occasionally with a cron job. Then do transforms and put it in Snowflake. Sometimes we need to do transforms before caching the data in S3 (e.g. adding a unique ID to CSV rows), or doing things like splitting giant CSV files into smaller files that can then be inserted into Snowflake (Snowflake has a 50mb payload limit). We have alerts, logging, and metadata set up as well in AWS and Snowflake. Most of this comes down to your knowledge of cloud data platforms.

It's honestly not that difficult to build ETL pipelines from scratch. We're using a ton of different sources with different data formats as well. Using the Serverless framework to set up all the Lambda functions and cron jobs also makes things a lot easier.

link

greenie_beans 855 days ago

i appreciate you sharing all that, but it seems like we might be on similar levels of knowledge/experience. i've been a dev who does a lot data engineering for 5 years. i'm looking more for best practices and theory about designing the pipeline, how to arrange the order of operations, how to separate each step, logging practices, how to make it reproducible, how to restart when it fails halfway in without going back to the beginning, how many retries, what to do if a step gets stuck in failed state, how to flag that bad data, etc. so. many. questions. while i build these pipelines.

i have figured out these questions by seeing how more experienced devs do it and on my own, but i want to learn from a book or video series because you can only figure out so much yourself, eventually you need to seek out experts and sometimes the experts around you also figured it out themselves and you need to find an expert outside of your circle. unfortunately a lot of the "ETL experts" teaching stuff online are trying to sell me on prefect or airflow or snowflake etc

link

NortySpock 856 days ago

Disclaimer: previous job had a lot of cases where CSVs were dropped by SFTP, your milage may vary..., JSON APIs are said to be a different flavor of crazy...

Haven't heard much beyond "ask the Old Ones", but "Murphy's law strikes again", "eventually someone will want that data even though they swore it was unnecessary", "eventually someone will ask for a backfill/replay", "eventually someone will give you a duplicate file", "eventually someone will want to slice-and-dice the data a different way" and "eventually someone will change the schema without telling you" have been some things I have noticed.

Even de-duplicating data is, in a sense, deletion (or someone will eventually want to get at the data with duplicates -- e.g. for detecting errors or repeats or fraud or some other analysis that mirrors looking at the bullet holes in World War 2 bombers)

Store the data as close to the original form as you can. Keep a timestamp of when you landed the data. Create a UUID for the record. Create a hash of the record if you can. Create a batch_id if you load multiple things at once (e.g. multiple CSVs). Don't truncate and reload a table - rather, append to it. If you still need something that looks like atomic table changes, I've gotten away with something close: "a view that shows only the most recent valid batch". (Yes this is re-inventing the database wheel, but sometimes you make do with the tools you are forced to use.)

Someone, somewhere, will hand you a file that does not conform to the established agreement. You want to log that schema change, with a timestamp, so you can complain to them with evidence that they ain't sending you what they used to, and they didn't bother sending you an email beforehand...

They're not going to fix it on your timeline, so you're probably going to end up hacking your code to work a different way... Until, you know, they switch it back...

So, yeah. Log it. Timestamp it. Hash it. UUID it. Don't trust the source system to do it right, because they will eventually change the script on you. Keep notes, and plan in such a way that you have audit logs and can move with agility.

link

NortySpock 855 days ago

In conclusion...

I find, in data engineering ,the goal is not to prevent everything, it's to be flexible and prepared to handle lots of change, even silly changes, and be able to audit it, observe it, maneuver around it, and keep the mean-time-to-resolution low.

link

greenie_beans 855 days ago

sound advice. thank you.

link

computershit 856 days ago

There's quite a bit of new tooling in this space, selecting the right one is going to depend on your needs then you can spike from there. Check out Prefect, Dagster, Windmill, Airbyte (although the latter is more ELT than ETL).

link