Hacker News new | ask | show | jobs
by njd 3479 days ago
Yes, especially with early cleansing and transformation. When source semantics are dynamic, try to build recognizers rather than expect sources to obey some semantic agreement. They won't. Cast what you can recognize into an intermediate shape. I like object, property, value (i.e triple) with metadata. Don't cleanse or transform the source data. Let the data be the data. Cleansed and transformed data fall into the category of assertions. Assertions can be made by humans and software, keep metadata. Allow your applications and your analysts to overlay their own semantic meaning on the triples. Naturally, consensus understanding of source semantics is desirable, but don't want to prevent analysts from using the raw or asserted data as they see fit. Still need software to analyze triples for the bad, etc. to make assertions that data is suspect. Otherwise, your downstream programs and analysts will need to make such assertions. Given such a process, applications and analysts can write queries to ignore suspect data, undesirable assertions, etc. if they so choose.