
by albertstanley 1043 days ago

Sure, I’ll clarify some of the terms used in that one-liner in case it’s helpful for anyone else as well.

ETL is the process of extracting, transforming, and loading data from a source to a destination in a data pipeline. Spark, an engine for large-scale data processing, lets us write code that works with large amounts of data. dbt is a tool you can use to break up your SQL scripts into smaller “models”, which are other SQL scripts that can be reused and tested.

We described ourselves as end-to-end because we also have extractors and loaders, whereas dbt focuses on the T (the transformation step of ETL). Each step of extraction, transformation, and loading corresponds to a specific Python object defined in our Python framework. I have also updated the README in our repo to hopefully better explain how the config file links to user-defined readers, writers, and transformers.
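To make the "config links to readers, writers, and transformers" idea concrete, here is a rough sketch in plain Python. It is only an illustration, not the framework's actual API: every class name, the `REGISTRY` dict, the `run_pipeline` function, and the config keys (`reader`, `transformers`, `writer`, `type`, `args`) are hypothetical, and it uses in-memory lists instead of Spark.

```python
from typing import Dict, List

Row = Dict[str, object]


# Hypothetical base classes: one per ETL step.
class Reader:
    def read(self) -> List[Row]:
        raise NotImplementedError


class Transformer:
    def transform(self, rows: List[Row]) -> List[Row]:
        raise NotImplementedError


class Writer:
    def write(self, rows: List[Row]) -> None:
        raise NotImplementedError


# Hypothetical user-defined implementations.
class InMemoryReader(Reader):
    def __init__(self, rows: List[Row]) -> None:
        self.rows = rows

    def read(self) -> List[Row]:
        return list(self.rows)


class UppercaseName(Transformer):
    def transform(self, rows: List[Row]) -> List[Row]:
        # Return new rows with the "name" field upper-cased.
        return [{**r, "name": str(r["name"]).upper()} for r in rows]


class ListWriter(Writer):
    def __init__(self, sink: List[Row]) -> None:
        self.sink = sink

    def write(self, rows: List[Row]) -> None:
        self.sink.extend(rows)


# Maps the "type" strings used in the config to Python classes (assumed design).
REGISTRY = {
    "in_memory": InMemoryReader,
    "uppercase_name": UppercaseName,
    "list": ListWriter,
}


def build(spec: dict):
    """Instantiate the class named by spec["type"] with spec["args"]."""
    return REGISTRY[spec["type"]](**spec.get("args", {}))


def run_pipeline(config: dict) -> None:
    reader = build(config["reader"])
    transformers = [build(t) for t in config["transformers"]]
    writer = build(config["writer"])

    rows = reader.read()            # E: extract
    for t in transformers:          # T: transform, in config order
        rows = t.transform(rows)
    writer.write(rows)              # L: load


sink: List[Row] = []
config = {
    "reader": {"type": "in_memory", "args": {"rows": [{"id": 1, "name": "ada"}]}},
    "transformers": [{"type": "uppercase_name"}],
    "writer": {"type": "list", "args": {"sink": sink}},
}
run_pipeline(config)
print(sink)
```

In a real setup the config would more likely live in a file and the objects would wrap Spark DataFrames, but the shape is the same: the config names a step, and the name resolves to a Python object that does the work.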

1 comment

mrwnmonm 1041 days ago

Thanks
