I made an app to fuzzy-deduplicate my Google Sheets and CRM records
- No manual configuration required
- Works out of the box on most data types (e.g. people, companies, product catalogs)
Implementation details:
- Embeds records using an E5 model
- Runs similarity search in DuckDB with its vector similarity search extension
- Does last-mile comparison and merges duplicates using Claude
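
For anyone curious how the three steps fit together, here is a rough Python sketch of the general shape, not the app's actual code. Everything in it is a stand-in: `embed` is a toy hashed character-trigram vectorizer rather than an E5 model, `candidate_pairs` is a brute-force cosine scan rather than a DuckDB vector search, and `judge` is a hypothetical callable marking where the Claude comparison would go. All function names and the 0.6 threshold are made up for illustration.

```python
import math
import zlib
from itertools import combinations

DIM = 4096  # size of the toy embedding space (assumption, not an E5 dimension)


def embed(text):
    """Toy stand-in for an embedding model: hashed character trigrams, L2-normalized."""
    padded = f" {text.lower().strip()} "
    vec = [0.0] * DIM
    for i in range(len(padded) - 2):
        trigram = padded[i:i + 3].encode("utf-8")
        vec[zlib.crc32(trigram) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    """Cosine similarity of two already-normalized vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


def candidate_pairs(records, threshold=0.6):
    """Stand-in for the vector search step: brute-force scan over all pairs.

    This is quadratic and only meant for small demos; a vector index is what
    would replace it in practice.
    """
    vectors = [embed(r) for r in records]
    pairs = []
    for i, j in combinations(range(len(records)), 2):
        score = cosine(vectors[i], vectors[j])
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


def dedupe(records, judge, threshold=0.6):
    """Group record indices that `judge` confirms as duplicates.

    `judge(a, b) -> bool` is the last-mile comparison; in the app this is
    where an LLM call would sit. Groups are built with a small union-find.
    """
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j, _score in candidate_pairs(records, threshold):
        if judge(records[i], records[j]):
            parent[find(j)] = find(i)

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def same_normalized(a, b):
    """Hypothetical rule-based judge used only for this demo."""
    return a.lower().split() == b.lower().split()


records = ["Acme Corp", "acme corp ", "Zebra Widgets Ltd"]
print(dedupe(records, same_normalized))
```

The point of splitting it this way is that the cheap similarity step narrows the field, so the expensive comparison only runs on pairs that already look alike.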
Appreciate the kind words! Speed and cost both scale linearly with dataset size. We haven't yet optimized the prompts or the choice of model to minimize token usage, so I'd recommend emailing us for advice before you run this on a large dataset.