There's been a growing amount of research on data-centric AI, and now there's software dedicated to it. This one is super fresh in Neurocomputing, which is a Q1 journal.
In short, ydata-profiling is a Python tool that generates a detailed report about your data: missing values, distributions, correlations, data quality alerts, etc.
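For anyone who hasn't seen this kind of alert before, here's a rough sketch in plain Python of the two checks I care about most (missing values and class imbalance). To be clear, this is not how ydata-profiling implements anything: the `quality_alerts` helper, the toy data, and both thresholds are made up for illustration.

```python
from collections import Counter


def quality_alerts(rows, target, missing_threshold=0.2, imbalance_threshold=0.8):
    """Return simple data quality alerts for a list of dict records.

    Hypothetical helper: the thresholds are illustrative assumptions,
    not values taken from ydata-profiling.
    """
    alerts = []
    n = len(rows)
    if n == 0:
        return alerts

    # Flag every column whose share of missing (None) values is too high.
    columns = sorted({key for row in rows for key in row})
    for col in columns:
        missing = sum(1 for row in rows if row.get(col) is None)
        ratio = missing / n
        if ratio > missing_threshold:
            alerts.append(f"{col}: {ratio:.0%} missing")

    # Flag the target column if one class dominates the observed labels.
    labels = [row[target] for row in rows if row.get(target) is not None]
    if labels:
        label, count = Counter(labels).most_common(1)[0]
        share = count / len(labels)
        if share > imbalance_threshold:
            alerts.append(f"{target}: imbalanced, {label!r} is {share:.0%} of labels")

    return alerts


# Toy dataset: 10 records, 3 missing incomes, 1 missing age, 9 "no" vs 1 "yes".
ages = [34, None, 51, 29, 45, 38, 62, 27, 41, 55]
incomes = [None, 39000, 42000, None, 58000, 61000, None, 33000, 47000, 52000]
churn = ["no", "no", "no", "no", "yes", "no", "no", "no", "no", "no"]
rows = [{"age": a, "income": i, "churn": c} for a, i, c in zip(ages, incomes, churn)]

alerts = quality_alerts(rows, target="churn")
for alert in alerts:
    print(alert)
```

On this toy data, the income column and the churn label get flagged, while age (only one missing value) stays under the threshold. The real tool covers far more than this, but the idea of threshold-based alerts is what I have in mind below.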
I work specifically in data quality (imbalanced and missing data), so I've been following the project for a while, but I'm curious whether you make a point of really exploring your data characteristics beforehand and how seriously you take these alerts.
Do you think this shift towards a "data-centric" approach is really set to be the "next big paradigm" in AI? It's cool to see it valued, but idk...
I guess LLMs have huge potential, but they're super dependent on high-quality data, so from that perspective, it's imperative to follow best practices.
Especially taking into account the new regulations and anti-bias concerns.
It's a bit scary to think of the widespread use of LLMs with just random, untreated data :/