by mulmen 506 days ago

> Extracting the data should be the most challenging aspect of an ETL pipeline.

Why should this be difficult? It’s the easiest part. You run SELECT * and you’re done.

The difficult part is transforming all the disparate upstream systems and their evolving schemas into a useful analytical model for decision support.

1 comment

bob1029 506 days ago

Not all data lives in a SQL database. Much of the extraction code I write does things like loading flat files from unusual sources and querying APIs.

If the source data is already in a SQL store, then the solution should be obvious. You don't need any other tools to produce the desired view of the business at that point. Transforming for an upstream schema is a select statement per target table. This doesn't need to be complicated.
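To make the "one select statement per target table" point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names (`orders`, `customers`, `daily_sales`), the columns, and the sample rows are all hypothetical and chosen only for illustration; a real source schema will look different.

```python
import sqlite3


def demo_source() -> sqlite3.Connection:
    """Create a tiny in-memory source database (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER,
            order_date TEXT,
            amount INTEGER
        );
        INSERT INTO customers VALUES (1, 'EU'), (2, 'US');
        INSERT INTO orders VALUES
            (1, 1, '2024-01-01', 10),
            (2, 1, '2024-01-01', 5),
            (3, 2, '2024-01-01', 7),
            (4, 2, '2024-01-02', 3);
    """)
    return conn


def build_daily_sales(conn: sqlite3.Connection) -> list[tuple]:
    """Populate one target table from the source tables with a single SELECT."""
    conn.executescript("""
        DROP TABLE IF EXISTS daily_sales;
        CREATE TABLE daily_sales AS
        SELECT o.order_date  AS sale_date,
               c.region      AS region,
               SUM(o.amount) AS total_amount
        FROM orders AS o
        JOIN customers AS c ON c.id = o.customer_id
        GROUP BY o.order_date, c.region;
    """)
    # Read the target table back in a stable order.
    return conn.execute(
        "SELECT sale_date, region, total_amount "
        "FROM daily_sales ORDER BY sale_date, region"
    ).fetchall()


if __name__ == "__main__":
    for row in build_daily_sales(demo_source()):
        print(row)
```

Each additional target table would get its own SELECT in the same style; nothing beyond the database itself is involved.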

mulmen 506 days ago

Yeah I extract a lot of data out of Dynamo. It’s still the easiest part. Change capture just isn’t complicated. You need some basic constructs and then you’re golden. The data mart design phase is orders of magnitude more effort.
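The comment does not say which constructs it means. One common one is a high-water mark: remember the latest change timestamp already extracted and pull only rows newer than that. The sketch below assumes each record carries an `updated_at` field; that field name and the `extract_changes` helper are hypothetical, and the code works on plain Python dicts rather than any Dynamo API.

```python
from typing import Any


def extract_changes(
    rows: list[dict[str, Any]], watermark: int
) -> tuple[list[dict[str, Any]], int]:
    """Return rows changed after `watermark`, plus the advanced watermark.

    Assumes every row has a comparable `updated_at` value (hypothetical field).
    """
    changed = [r for r in rows if r["updated_at"] > watermark]
    changed.sort(key=lambda r: r["updated_at"])
    # If nothing changed, keep the old watermark so the next run starts there.
    new_watermark = changed[-1]["updated_at"] if changed else watermark
    return changed, new_watermark


if __name__ == "__main__":
    source = [
        {"id": "a", "updated_at": 100},
        {"id": "b", "updated_at": 205},
        {"id": "c", "updated_at": 150},
    ]
    batch, mark = extract_changes(source, watermark=120)
    print([r["id"] for r in batch], mark)
```

Persisting the returned watermark between runs is what makes the extraction incremental. Handling deletes and rows that share a timestamp would need extra care that this sketch leaves out.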
