Hacker News new | ask | show | jobs
by njd 3482 days ago
It is possible, but there are a lot of caveats.

For example, how do you detect when a source's semantics change? This will break any cleansing or transforms done in the stream platform. Until it gets fixed, data may be missing, wrong, or worse (e.g. corrupted) and propagated down stream.

When data is cleansed and transformed early, there is no way to go back to the raw data, unless you carry it forwards too.

Consider these sort of questions for your use case.

1 comments

Isn't this a problem even with ETL based solutions too? Could you please explain this with an example?
Yes, especially with early cleansing and transformation. When source semantics are dynamic, try to build recognizers rather than expect sources to obey some semantic agreement. They won't. Cast what you can recognize into an intermediate shape. I like object, property, value (i.e triple) with metadata. Don't cleanse or transform the source data. Let the data be the data. Cleansed and transformed data fall into the category of assertions. Assertions can be made by humans and software, keep metadata. Allow your applications and your analysts to overlay their own semantic meaning on the triples. Naturally, consensus understanding of source semantics is desirable, but don't want to prevent analysts from using the raw or asserted data as they see fit. Still need software to analyze triples for the bad, etc. to make assertions that data is suspect. Otherwise, your downstream programs and analysts will need to make such assertions. Given such a process, applications and analysts can write queries to ignore suspect data, undesirable assertions, etc. if they so choose.