| Hey HN Crew! We all have lists... and they can be annoying to de-duplicate.
* User feedback
* Groceries
* Employee Surveys
* Bug reports
* You name it

Most ways to consolidate like items work off of keywords or, worse, exact phrases (Sheets/Excel). But LLMs are much better at understanding an item's semantic meaning and determining whether two items should be combined. I decided to build my first Python package, The Semantic Deduplicator, to help me consolidate items based on their meaning, not keywords.

For example, on groceries:
['We need more berries', 'I want more more milk', 'Can we get more carbonated water please?', 'We need more sparkling water']
...deduplicated...
['Berries', 'Milk', 'Sparkling Water']

How it works:

1. Start with an empty list ready to populate
2. The first item you add will get 1) transformed into a clean name (user feedback > product request) and 2) added to the list
3. While you're adding more items
   * Check to see if your new item's embedding is close to any existing item
   * If so, ask the LLM to compare your two items to see if they should be combined
   * If so, combine them

This package is more of an exploration and POC, so be careful with it. I'd love to hear any feedback.

All the links:
* YT Explainer Video: https://www.youtube.com/watch?v=etLsNgkGbeM
* Twitter Thread: https://twitter.com/GregKamradt/status/1719760658936545336
* PyPI: https://pypi.org/project/semantic-deduplicator/
* GitHub: https://github.com/gkamradt/SemanticDeduplicator |
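For anyone who wants to see the "How it works" loop from the post in code, here is a rough, offline sketch. It is not the package's actual API: `toy_embed`, `stub_clean_name`, `stub_judge`, `dedupe`, and the `0.4` threshold are all hypothetical names and values made up for illustration. The bag-of-words vector stands in for a real embedding model, the two stubs stand in for LLM calls, and embedding the clean name (rather than the raw text) is an assumption.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical stand-ins: a real setup would call an embedding model and an LLM.
FILLER = {"we", "i", "need", "want", "more", "can", "get", "please"}
SYNONYMS = {"carbonated": "sparkling"}


def toy_embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (assumption, not a real model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def stub_clean_name(raw: str) -> str:
    """Stub for the 'clean name' LLM step: drop filler words, title-case the rest."""
    words = re.findall(r"[a-z]+", raw.lower())
    return " ".join(w for w in words if w not in FILLER).title()


def stub_judge(existing: str, new: str) -> bool:
    """Stub for the LLM comparison: equal once known synonyms are normalized."""
    def normalize(name: str) -> List[str]:
        return [SYNONYMS.get(w, w) for w in name.lower().split()]
    return normalize(existing) == normalize(new)


def dedupe(
    items: List[str],
    clean_name: Callable[[str], str],
    should_combine: Callable[[str, str], bool],
    threshold: float = 0.4,  # arbitrary cutoff chosen for this toy embedding
) -> List[str]:
    kept: List[Tuple[str, Counter]] = []  # step 1: start with an empty list
    for raw in items:
        name = clean_name(raw)  # step 2: turn the raw item into a clean name
        vec = toy_embed(name)
        for existing_name, existing_vec in kept:
            # step 3: cheap embedding check first, then ask the judge
            if cosine(vec, existing_vec) >= threshold and should_combine(existing_name, name):
                break  # combined: this sketch keeps the existing name
        else:
            kept.append((name, vec))  # no match found, so it is a new item
    return [name for name, _ in kept]


items = [
    "We need more berries",
    "I want more more milk",
    "Can we get more carbonated water please?",
    "We need more sparkling water",
]
result = dedupe(items, stub_clean_name, stub_judge)
print(result)
```

With these stubs the result is `['Berries', 'Milk', 'Carbonated Water']`. This sketch keeps whichever name arrived first when two items are combined, so it ends up with 'Carbonated Water' where the example in the post shows 'Sparkling Water'. The embedding check is there to keep the number of (expensive) LLM comparisons down: 'Berries' and 'Milk' never reach the judge at all.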
We had the same idea and made it a core product feature - https://docs.arguflow.ai/duplicate_detection