by ekzhu 1676 days ago
There is no reason the two approaches cannot coexist: a centralized catalog managed by a small team sets the “gold standard” for the many decentralized data producers and curators, who are incentivized to maximize their impact (i.e., usage) by publishing higher-quality data that follows the standard.

Another thing to point out: besides relying on the future promises of ML, there are already many signals that a centralized catalog can use for data discovery. For example: data sketches (MinHash, HyperLogLog) for finding joinable datasets, social signals (likes, comments, stars, etc.; see Alation and Select Star), and lineage through data movements (e.g., Azure Data Factory and Azure Purview). If the centralized catalog uses those signals, then the data producers are incentivized to provide them for better visibility.
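
To make the data-sketch signal concrete, here is a minimal MinHash sketch in plain Python. It is an illustration under stated assumptions, not how any particular catalog implements it: the function names (`minhash_signature`, `estimate_jaccard`), the choice of 128 hash functions, and the use of salted BLAKE2b as the hash family are all mine. The idea is that each producer publishes a small fixed-size signature per column, and the catalog compares signatures to estimate how much two columns overlap (a hint that they may be joinable) without reading the full data.

```python
import hashlib


def _salted_hash(value: str, seed: int) -> int:
    """64-bit hash of `value`; each seed acts as a different hash function."""
    digest = hashlib.blake2b(
        value.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(values, num_hashes: int = 128) -> list[int]:
    """Return a MinHash signature for the distinct values of a column.

    For each of `num_hashes` hash functions, keep the minimum hash seen
    over the distinct values. The signature size is fixed regardless of
    how large the column is.
    """
    distinct = {str(v) for v in values}
    if not distinct:
        raise ValueError("cannot sketch an empty column")
    return [
        min(_salted_hash(v, seed) for v in distinct)
        for seed in range(num_hashes)
    ]


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Estimate Jaccard similarity as the fraction of matching positions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must use the same number of hashes")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    # Two hypothetical key columns that share half of their values.
    orders_customer_id = range(0, 1000)
    crm_customer_id = range(500, 1500)
    sig_orders = minhash_signature(orders_customer_id)
    sig_crm = minhash_signature(crm_customer_id)
    # True Jaccard similarity is 500 / 1500; the estimate is approximate.
    print(estimate_jaccard(sig_orders, sig_crm))
```

Jaccard similarity is only a proxy for joinability (a small lookup table fully contained in a huge fact table scores low), so a real catalog would likely pair it with a containment estimate and a cardinality sketch such as HyperLogLog.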

3 comments

I agree that in theory they could both co-exist for the reasons you state, but in practice I think it's unlikely a company that invests in a data fabric (which is largely a technology cost) is going to simultaneously invest in the incentives for the data product creators that are necessary for the data mesh not to become the wild west.
I had never heard of Data Fabric before. Now that I have, I'm not sure the two can exist without each other. In fact, I would imagine that the metadata accumulated through the data fabric would/could end up driving the data mesh implementation.

Perhaps apps and services will end up having to go through data-coverage and data-quality verification steps before being released. Treating analytics (and caching, joins, etc.) as an afterthought is unacceptable in this day and age.

Maybe it's good to think of "fabric", "mesh", "warehouse", and "lake" as design patterns for data.