Hacker News
Show HN: One-Click CSV Deduplication (open-source) (app.dedupe.it)
4 points by remolacha 596 days ago
I made an app to fuzzy-deduplicate my Google Sheets and CRM records

- No manual configuration required
- Works out-of-the-box on most data types (e.g. people, companies, product catalogs)

Implementation details:

- Embeds records using an E5 model
- Performs similarity search using DuckDB w/ vector similarity extension
- Does last-mile comparison and merges duplicates using Claude
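
For anyone who wants the shape of that pipeline without reading the repo, here is a rough, standard-library-only sketch of the three stages: embed, find candidate pairs by similarity, then confirm and merge. It is not the code in the repo. Every name in it is hypothetical, and each stage uses a loudly labeled stand-in: hashed character trigrams instead of an E5 model, a brute-force cosine scan instead of DuckDB's vector similarity extension, and a simple token-prefix rule plus "keep the longest record" instead of the Claude comparison and merge.

```python
import math
import re
import zlib
from collections import defaultdict

DIM = 256  # toy embedding size (assumption; unrelated to real E5 dimensions)


def embed(text: str) -> list[float]:
    """STAND-IN for an E5 embedding: hashed character trigrams, L2-normalised."""
    vec = [0.0] * DIM
    padded = f"  {text.lower().strip()}  "
    for i in range(len(padded) - 2):
        vec[zlib.crc32(padded[i:i + 3].encode()) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def candidate_pairs(vectors, k=3, min_sim=0.3):
    """STAND-IN for the vector search: brute-force top-k cosine neighbours."""
    pairs = set()
    for i, v in enumerate(vectors):
        scored = sorted(
            ((sum(a * b for a, b in zip(v, w)), j)
             for j, w in enumerate(vectors) if j != i),
            reverse=True,
        )
        for sim, j in scored[:k]:
            if sim >= min_sim:
                pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def is_duplicate(a: str, b: str) -> bool:
    """STAND-IN for the LLM comparison: tokens match if one is a prefix of the other."""
    ta, tb = tokens(a), tokens(b)
    matched = sum(any(x.startswith(y) or y.startswith(x) for y in tb) for x in ta)
    return matched / max(len(ta), len(tb), 1) >= 0.8


def group(n: int, pairs) -> list[list[int]]:
    """Union-find over confirmed pairs, returning clusters of record indices."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in pairs:
        parent[find(j)] = find(i)
    clusters = defaultdict(list)
    for i in range(n):
        clusters[find(i)].append(i)
    return sorted(clusters.values())


def dedupe(records: list[str]) -> list[str]:
    vectors = [embed(r) for r in records]
    confirmed = [(i, j) for i, j in candidate_pairs(vectors)
                 if is_duplicate(records[i], records[j])]
    # STAND-IN for the LLM merge: keep the longest record in each cluster.
    return [max((records[i] for i in g), key=len)
            for g in group(len(records), confirmed)]


if __name__ == "__main__":
    for row in dedupe([
        "Acme Corp, 12 Main St, Springfield",
        "ACME Corporation, 12 Main Street, Springfield",
        "Initech LLC, 4 Elm Rd, Austin",
    ]):
        print(row)
```

The point of the two-stage design is that the cheap similarity search only proposes pairs, and the expensive comparison only runs on those proposals. Note that the brute-force scan in this sketch is quadratic in the number of records; a real vector index would replace that loop.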

Demo video: https://youtu.be/7mZ0kdwXBwM

Github repo (Apache 2.0 licensed): https://github.com/SnowPilotOrg/dedupe_it

Lmk any feedback on how to make this better!

1 comments

Curious how this scales. Just tried this with the test dataset and it was probably the slickest deduplication experience I’ve had.

Appreciate the kind words! It scales linearly in both speed and cost. We haven't yet optimized the prompts and choice of model to minimize token usage, so I'd recommend emailing us for advice if you want to run this on a large dataset.