HN Mirror


by paddy_m 588 days ago

Nice work!

Do you have any plans for data cleaning?

I am working on a somewhat similar open source project, and I intend to add heuristic data cleaning. With the UI I want to be able to toggle between different strategies quickly: strip characters from a column to treat it as numeric if fewer than 2% or 5% of values contain a character, fill NA with the mean, interpret dates in different formats and drop if the date doesn't parse. The idea is that if it's really quick to change between strategies, you can create more opinionated strategies and get to the right answer faster.

Happy to collaborate and talk tables with anyone who's interested.
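For concreteness, here is a minimal sketch of the three strategies mentioned above, using only the Python standard library. It is not taken from either project: the function names, the `max_dirty_fraction` parameter, and the choice to drop whole rows on an unparseable date are all assumptions made for illustration.

```python
import re
from datetime import datetime
from statistics import mean

# Hypothetical helper names; nothing here comes from an existing library.
_NON_NUMERIC = re.compile(r"[^0-9.\-]")


def _to_float(text):
    """Return float(text), or None if it cannot be parsed."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def strip_to_numeric(values, max_dirty_fraction=0.05):
    """Treat a column as numeric if only a small share of values is dirty.

    If more than max_dirty_fraction of the non-missing values fail to
    parse as numbers, the column is returned unchanged.
    """
    present = [v for v in values if v not in (None, "")]
    dirty = [v for v in present if _to_float(v) is None]
    if not present or len(dirty) / len(present) > max_dirty_fraction:
        return list(values)

    def clean(v):
        if v in (None, ""):
            return None
        direct = _to_float(v)
        if direct is not None:
            return direct
        # Only dirty values get their stray characters stripped.
        return _to_float(_NON_NUMERIC.sub("", v))

    return [clean(v) for v in values]


def fill_na_with_mean(values):
    """Replace None with the mean of the values that are present."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)
    average = mean(present)
    return [average if v is None else v for v in values]


def parse_dates_or_drop(rows, column, formats):
    """Try each format in order; drop rows where none of them matches."""
    kept = []
    for row in rows:
        for fmt in formats:
            try:
                parsed = datetime.strptime(row[column], fmt)
            except (TypeError, ValueError):
                continue
            kept.append({**row, column: parsed})
            break
    return kept
```

Switching strategy then amounts to changing one argument, e.g. calling `strip_to_numeric(col, max_dirty_fraction=0.02)` instead of `0.05`, which is what makes quick toggling in a UI plausible.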

2 comments

kengoa 588 days ago

Yes, I do have plans for data preprocessing using DuckDB WebAssembly (there is an upcoming features section in this blog post: https://kengoa.github.io/software/2024/11/03/small-software....), but this will require SQL, which some of the target audience might not be familiar with. I'm thinking of something like the visual query builder from Metabase.

> With the UI I want to be able to toggle between different strategies quickly - strip characters from a column to treat it as numeric, if less than 2% or 5% of values have a character, fill na with mean, interpret dates in different formats - drop if the date doesn't parse

Those are really good examples, and I can make those predefined preprocessing techniques available as toggles in the dataset tab. Thanks for the feedback!
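As a rough illustration of what "predefined techniques as toggles" could look like underneath a UI, here is a small sketch. The `Step` and `Pipeline` names and the two example steps are hypothetical and are not part of the tool discussed here; the point is only that each toggle maps to a named function and the enabled ones run in a fixed order.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Column = List[Optional[str]]


@dataclass
class Step:
    """One predefined preprocessing technique with an on/off switch."""
    label: str
    apply: Callable[[Column], Column]
    enabled: bool = False


def trim_whitespace(col: Column) -> Column:
    return [v.strip() if v is not None else None for v in col]


def empty_to_none(col: Column) -> Column:
    return [None if v == "" else v for v in col]


class Pipeline:
    """Runs the enabled steps in the order they were registered."""

    def __init__(self, steps: List[Step]) -> None:
        self.steps = {step.label: step for step in steps}

    def toggle(self, label: str) -> bool:
        """Flip one step on or off and return its new state."""
        step = self.steps[label]
        step.enabled = not step.enabled
        return step.enabled

    def run(self, col: Column) -> Column:
        result = list(col)  # never mutate the caller's data
        for step in self.steps.values():
            if step.enabled:
                result = step.apply(result)
        return result


pipeline = Pipeline([
    Step("Trim whitespace", trim_whitespace),
    Step("Empty string as missing", empty_to_none),
])
```

Because `run` always starts from the original column, flipping a toggle and re-running is cheap and reversible, which fits the quick-switching idea from the parent comment.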

remolacha 588 days ago

Not quite what you're describing, but I open-sourced a fuzzy deduplication tool last week: https://dedupe.it. I'd be interested in expanding it to deal with data cleaning more broadly.

turtlebits 588 days ago

Not sure if you have introduced an artificial delay, but deduping ~25 rows shouldn't take 5+ seconds...

edit: I see you're using an LLM, but "~$8.40 per 1k records" sounds unsustainable.