Ask HN: Is ETL (data integration in batch processing mode) really dead?
9 points by srigan 3483 days ago
I recently came across this presentation: https://www.infoq.com/presentations/etl-streams?utm_source=infoq&utm_medium=popular_widget&utm_campaign=popular_content_list&utm_content=homepage Should every data integration or data processing pipeline be built on a stream processing architecture, even when there is no need for one from day zero? The argument I hear for doing so is that we might need real-time processing in the future. I would like to hear what others think.
5 comments

Far from it. CSV is by far the most common data exchange format for ERP, CRM and business systems in general. EDI is another. Good luck communicating with SAP ERP or NetSuite without good old-fashioned SOAP. Judging from the documentation none of these seems to be supported by Confluent.

SOAP and CSV are not sexy. They have plenty of shortcomings. However, those are the formats that are used in the real world today (and for some time to come).

Stream processing is a very useful design pattern but like any design pattern it should be used carefully and only where appropriate (see: Microservices).

If I were to build a new complex ERP from the ground up I'd be remiss not to use something like Kafka or Confluent for data processing.

If I want to communicate with legacy systems, though, that's an entirely different matter. The same applies when targeting SMBs. You'd have a hard time explaining to small business owners why they suddenly need a newfangled stream processing architecture when their old "Export CSV and load that into Excel" process worked just fine.

Not all data comes in at a rate where streaming approaches are necessary. Sensors, click or IoT data perhaps, but for things like purchases, signups, or other "daily" activities batch processing is suitable and less complex to build.

I would wager that most data is not of a streaming nature, but as the ability to process live pipelines is relatively new, it gets more attention.

It is possible, but there are a lot of caveats.

For example, how do you detect when a source's semantics change? Such a change will break any cleansing or transforms done in the stream platform. Until it is fixed, data may be missing, wrong, or worse (e.g. corrupted), and the damage propagates downstream.
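
One way to catch part of this problem early is to validate each incoming batch against the layout you expect before any transform runs. The following is a minimal sketch, not a definitive implementation: the column names in `EXPECTED_COLUMNS` and the function `check_batch` are hypothetical. It only detects structural drift (renamed columns, values that no longer parse). A pure change in meaning, such as an amount switching from dollars to cents, would need additional value-range checks.

```python
import csv
import io

# Hypothetical expected layout of a CSV export; the column names and
# types are assumptions made for this illustration.
EXPECTED_COLUMNS = {"order_id": int, "amount": float, "currency": str}


def check_batch(csv_text):
    """Return a list of human-readable problems found in a CSV batch."""
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = []

    # Compare the header row against the expected column names.
    found = set(reader.fieldnames or [])
    missing = set(EXPECTED_COLUMNS) - found
    extra = found - set(EXPECTED_COLUMNS)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    if extra:
        problems.append(f"unexpected columns: {sorted(extra)}")

    # Check that every value still parses as the expected type.
    # Data rows start at line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        for name, cast in EXPECTED_COLUMNS.items():
            if name in missing:
                continue
            try:
                cast(row[name])
            except (TypeError, ValueError):
                problems.append(
                    f"line {line_no}: {name}={row[name]!r} is not {cast.__name__}"
                )
    return problems
```

A batch that returns a non-empty list could be held back instead of being transformed and propagated downstream.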

When data is cleansed and transformed early, there is no way to go back to the raw data, unless you carry it forwards too.

Consider these sorts of questions for your use case.

Isn't this a problem with ETL-based solutions too? Could you please explain this with an example?
Yes, especially with early cleansing and transformation. When source semantics are dynamic, try to build recognizers rather than expect sources to obey some semantic agreement. They won't. Cast what you can recognize into an intermediate shape. I like object, property, value (i.e. a triple) with metadata. Don't cleanse or transform the source data. Let the data be the data. Cleansed and transformed data fall into the category of assertions. Assertions can be made by humans and by software, so keep metadata about who made them. Allow your applications and your analysts to overlay their own semantic meaning on the triples. Naturally, a consensus understanding of source semantics is desirable, but you don't want to prevent analysts from using the raw or asserted data as they see fit. You still need software that analyzes the triples for bad values and asserts that data is suspect. Otherwise, your downstream programs and analysts will need to make such assertions themselves. Given such a process, applications and analysts can write queries that ignore suspect data, undesirable assertions, etc. if they so choose.
Of course one of the original authors of Kafka will say that ETL is dead and that streams are the only future. But that's not the reality in many cases.
I've implemented an ETL pipeline at one of my clients just this quarter that runs nightly and gives them the data they need in the format they want (Web UI + CSV exports). Just having this available at all is a huge win. Having the data be "fresher" would be nice, but that is a small gain compared to the original win.
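
The triple-plus-assertions idea above could be sketched roughly as follows. This is an illustration under assumptions, not the commenter's actual implementation: the names `Triple`, `Assertion`, `to_triples`, `assert_suspect_amounts`, and `without_suspect`, as well as the `amount` property and the `erp.csv` source label, are all hypothetical. The point it shows is that raw values are never modified; software only adds assertions on top, and each consumer decides whether to honor them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Triple:
    obj: str     # identifier of the object the value belongs to
    prop: str    # property name as recognized in the source
    value: str   # raw value, stored exactly as received
    source: str  # metadata: where the value came from


@dataclass(frozen=True)
class Assertion:
    triple: Triple
    kind: str                    # e.g. "suspect" or "cleansed"
    asserted_by: str             # human or program that made the assertion
    value: Optional[str] = None  # cleansed value, if the assertion carries one


def to_triples(obj_id, record, source):
    """Cast a recognized record into triples without altering any value."""
    return [Triple(obj_id, prop, value, source) for prop, value in record.items()]


def assert_suspect_amounts(triples):
    """Flag 'amount' values that do not parse as numbers; raw data is untouched."""
    assertions = []
    for t in triples:
        if t.prop == "amount":
            try:
                float(t.value)
            except ValueError:
                assertions.append(Assertion(t, "suspect", "amount_checker"))
    return assertions


def without_suspect(triples, assertions):
    """A consumer-side query that chooses to ignore triples flagged as suspect."""
    suspect = {a.triple for a in assertions if a.kind == "suspect"}
    return [t for t in triples if t not in suspect]


triples = (
    to_triples("order-1", {"amount": "12.50", "currency": "USD"}, "erp.csv")
    + to_triples("order-2", {"amount": "n/a"}, "erp.csv")
)
assertions = assert_suspect_amounts(triples)
usable = without_suspect(triples, assertions)
```

Because `triples` itself is never changed, an analyst who disagrees with the `amount_checker` assertion can still query the raw `"n/a"` value.
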