| HN Mirror


by cm 3610 days ago

I'm an engineer at Stitch. Our approach to transformation is to do just enough to move data from one system to another without losing precision or fidelity. So, we transform datatypes and structures into more appropriate forms for the target system, but we don't have any transformation operators like aggregation or windowing.

We have found that this approach works well for our users, who prefer to get the rawest possible data, and for the systems we target, like Redshift, which are themselves powerful transformation engines. This gives the user unlimited flexibility for defining transformations, and a full audit trail for understanding how their data has changed.

We are always evolving, though, so if there's a use case that you think requires this approach, I would be eager to hear more about it.
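To make the "just enough" idea above concrete, here is a minimal Python sketch of a structure-and-type conversion that does no aggregation or windowing. It is not Stitch's actual code: the function names, the `__` column separator, and the choice to treat naive timestamps as UTC are all assumptions made for illustration.

```python
from datetime import datetime, timezone
from decimal import Decimal


def to_target_value(value):
    """Convert one source value into a form a SQL warehouse column could hold."""
    if isinstance(value, datetime):
        # Assumption: naive timestamps are treated as UTC.
        if value.tzinfo is None:
            value = value.replace(tzinfo=timezone.utc)
        # Normalize to UTC and emit ISO 8601 so no precision is dropped.
        return value.astimezone(timezone.utc).isoformat()
    # Ints, strings, booleans, and Decimals pass through untouched.
    return value


def flatten(record, prefix=""):
    """Flatten nested dicts into column-like keys; values are never combined."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Hypothetical naming convention: parent__child.
            flat.update(flatten(value, prefix=f"{name}__"))
        else:
            flat[name] = to_target_value(value)
    return flat


record = {
    "id": 1,
    "amount": Decimal("10.50"),
    "user": {"name": "a", "signup": datetime(2016, 1, 2, 3, 4, 5, tzinfo=timezone.utc)},
}
row = flatten(record)
```

Every source field still maps to exactly one target column, so any aggregation can be written later as SQL in the warehouse.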

2 comments

specialist 3610 days ago

I have no idea what you're talking about. Scanning your docs, I'm no more illuminated.

I've done a lot of ETL, mostly for healthcare.

Yes, engineers should be doing ETL work. Any "workflow engine" that promises patch cord or visual programming is hooey. At the end of the day, someone somewhere is gonna be writing some code. And it's not the "business analyst" or "subject area expert". No, it's a dev. And all that clever framework stuff is just an angry 800lb gorilla sitting between her and her work.

ETL is just fancy talk for data processing. Input, processing, output. Copy a string from a source, maybe mangle it a bit, paste that string somewhere else. Extra credit for type awareness, e.g. "oh! that string's a date!". Trophies for logging, alerts, and services which heal themselves.
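The input, processing, output loop described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the source is CSV text, the "sink" is a plain list, a date is recognized only in `YYYY-MM-DD` form, and the names `parse_value` and `run` are made up for the example.

```python
import csv
import io
import logging
from datetime import date, datetime

log = logging.getLogger("etl")


def parse_value(text):
    """Trim the string; if it looks like YYYY-MM-DD, return a date instead."""
    text = text.strip()
    try:
        return datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        # Not a date: keep the trimmed string.
        return text


def run(source, sink):
    """Read CSV rows from source, clean each value, append the row to sink."""
    count = 0
    for row in csv.DictReader(source):
        sink.append({key: parse_value(value) for key, value in row.items()})
        count += 1
    log.info("copied %d rows", count)
    return count


source = io.StringIO("name,dob\n Ada ,1815-12-10\n")
sink = []
copied = run(source, sink)
```

Alerts and self-healing are left out; the logging call only marks where that kind of instrumentation would hook in.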


Eridrus 3610 days ago

Do you have any sort of SDK for adding integrations that you do not support?

While this looks super useful if you support all of the integrations someone needs, it seems like the moment that's not the case, someone has to maintain a complete ETL pipeline for the data sources you don't support, and their load is reduced only by having fewer data sources to maintain.


cm 3610 days ago

We do have an API for sending data into the Pipeline; documentation for it can be found here: https://docs.stitchdata.com/hc/en-us/categories/203326787-Im...

Additionally, we'll be releasing a Java client library any day now, with other languages and platforms to follow.
