| HN Mirror



by mrwnmonm 1044 days ago

> Serra is a low-code, object-oriented ETL framework that allows developers to write PySpark jobs easily—think end-to-end dbt with the benefits of object-oriented Spark.

Could you please explain this as if I am three years old? (Also, I don't know dbt.)

2 comments

albertstanley 1043 days ago

Sure, I’ll clarify some of the terms used in that one-liner in case it’s helpful for anyone else as well.

ETL is the process of extracting, transforming, and loading data from a source to a destination in a data pipeline. Spark, an engine for large-scale data processing, allows us to write code that can work with large amounts of data. dbt is a tool you can use to break up your SQL scripts into smaller “models”: other SQL scripts that can be reused and tested.

We describe ourselves as end-to-end because we also have extractors and loaders, whereas dbt focuses on the T (the transformation step of ETL). Each extraction, transformation, and loading step corresponds to a specific Python object defined in our Python framework. I have also updated the README in our repo to hopefully better explain how the config file links to user-defined readers, writers, and transformers.
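To make the "each step is a Python object, wired together by a config" idea concrete, here is a minimal sketch in plain Python. It is not Serra's actual API: the class names (`ListReader`, `RenameColumn`, `ListWriter`), the `STEP_TYPES` registry, the config layout, and `run_job` are all hypothetical, and it uses lists of dicts instead of Spark DataFrames so it runs anywhere.

```python
# Hypothetical sketch only: none of these names come from Serra itself.

class ListReader:
    """Extract step: returns rows from an in-memory source."""
    def __init__(self, rows):
        self.rows = rows

    def read(self):
        return list(self.rows)


class RenameColumn:
    """Transform step: renames one key in every row."""
    def __init__(self, old, new):
        self.old = old
        self.new = new

    def transform(self, rows):
        return [
            {(self.new if key == self.old else key): value
             for key, value in row.items()}
            for row in rows
        ]


class ListWriter:
    """Load step: appends rows to an in-memory sink."""
    def __init__(self, sink):
        self.sink = sink

    def write(self, rows):
        self.sink.extend(rows)


# Registry that maps the names used in the config to step classes.
STEP_TYPES = {
    "ListReader": ListReader,
    "RenameColumn": RenameColumn,
    "ListWriter": ListWriter,
}


def run_job(config):
    """Build each step object from its config block and run the blocks in order."""
    rows = None
    for block in config:
        step = STEP_TYPES[block["type"]](**block.get("args", {}))
        if hasattr(step, "read"):
            rows = step.read()
        elif hasattr(step, "transform"):
            rows = step.transform(rows)
        else:
            step.write(rows)


sink = []
config = [
    {"type": "ListReader", "args": {"rows": [{"id": 1, "nm": "a"}]}},
    {"type": "RenameColumn", "args": {"old": "nm", "new": "name"}},
    {"type": "ListWriter", "args": {"sink": sink}},
]
run_job(config)
```

The point of the design is that the config only names steps and their arguments, while the behavior lives in small classes that can be tested and reused on their own.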


mrwnmonm 1041 days ago

Thanks


esafak 1043 days ago

If it is really a dbt clone, it is an ELT tool, not ETL:

https://en.wikipedia.org/wiki/Extract,_load,_transform

https://en.wikipedia.org/wiki/Data_build_tool

It's about (big) data munging.


Alanhlwang 1043 days ago

Thanks for these links! We consider ourselves both an ELT and an ETL tool. If you run a Serra job in your own warehouse (e.g., Databricks), you can easily specify extracting from AWS, loading the Parquet files into your warehouse, then transforming them with our config block approach (ELT).

The same is true for ETL. If you have a Spark cluster separate from your warehouse, you can define your config file to run in the order E, T, L: you can extract from your data source, run the transformations on the separate cluster, then load the result into your warehouse.
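The difference between the two orderings can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `extract`, `transform`, `run_etl`, `run_elt`, and the dict standing in for a warehouse are hypothetical and are not Serra's config format or API.

```python
# Hypothetical illustration of step ordering; a dict stands in for the warehouse.

def extract():
    """Pretend source: amounts arrive as strings."""
    return [{"amount": "3"}, {"amount": "4"}]


def transform(rows):
    """Cast the amount column to int."""
    return [{"amount": int(row["amount"])} for row in rows]


def run_etl(warehouse):
    # E -> T -> L: rows are transformed outside the warehouse,
    # so only the cleaned table is ever loaded.
    warehouse["clean"] = transform(extract())


def run_elt(warehouse):
    # E -> L -> T: raw rows land in the warehouse first,
    # then the transformation runs against the loaded data.
    warehouse["raw"] = extract()
    warehouse["clean"] = transform(warehouse["raw"])


etl_warehouse = {}
run_etl(etl_warehouse)

elt_warehouse = {}
run_elt(elt_warehouse)
```

Both runs end with the same cleaned table; what differs is whether the raw data is ever stored in the warehouse and where the transformation runs.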
