NLP Algorithms for Clustering AI Content Search Keywords
1 points by alexander-g 853 days ago
I'm working on an NLP project to organize a dataset of 1.5 million keywords and phrases extracted from millions of generative AI prompts for image/video creation. The challenge lies in the diverse variations of keywords, especially names, that need to be grouped under unified categories to enhance searchability and data utility.

A typical example is the many spellings of 'Margot Robbie', plus the fine-tuned models/LoRAs trained on her pictures. The goal is straightforward but daunting: clean and standardize the raw strings, extract meaningful keywords, and cluster the terms by how they relate to each other.
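
To make the task concrete, here is a minimal sketch of the normalize-then-cluster idea using only the Python standard library. Everything in it is an assumption for illustration, not a settled approach: the function names, the character-trigram Jaccard similarity, and the 0.5 threshold are all hypothetical choices.

```python
from __future__ import annotations

import re
import unicodedata
from collections import defaultdict


def normalize(keyword: str) -> str:
    """Lowercase, strip accents, and collapse punctuation/whitespace runs."""
    text = unicodedata.normalize("NFKD", keyword)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def trigrams(text: str) -> set[str]:
    """Character trigrams of the text with spaces removed."""
    compact = text.replace(" ", "")
    if len(compact) < 3:
        return {compact}
    return {compact[i:i + 3] for i in range(len(compact) - 2)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(keywords: list[str], threshold: float = 0.5) -> list[list[str]]:
    """Single-link clustering: union any pair whose similarity >= threshold."""
    grams = [trigrams(normalize(k)) for k in keywords]
    parent = list(range(len(keywords)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Assumption: brute-force pairwise comparison, fine for a small sample only.
    for i in range(len(keywords)):
        for j in range(i + 1, len(keywords)):
            if jaccard(grams[i], grams[j]) >= threshold:
                parent[find(j)] = find(i)

    groups: dict[int, list[str]] = defaultdict(list)
    for i, keyword in enumerate(keywords):
        groups[find(i)].append(keyword)
    return list(groups.values())


if __name__ == "__main__":
    sample = ["Margot Robbie", "margot_robbie", "MargotRobbie",
              "margot robbie lora", "cyberpunk city"]
    for group in cluster(sample):
        print(group)
```

The pairwise loop is quadratic, so at 1.5 million keywords it would need some form of blocking or approximate nearest-neighbor search first. String similarity also only catches spelling variants, not nicknames or model names that share no characters with the person's name.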

Given the sheer volume of data, manual sorting is impractical. Hence, I'm looking for scalable, automated solutions.

I'm here to gather insights on tools, libraries, and algorithms that might be particularly effective for this kind of task. If anyone has tackled similar challenges or has relevant experience, your advice would be invaluable.

Appreciate any pointers!

2 comments

The first thing that comes to mind is CLIP: https://github.com/openai/CLIP. Maybe combining it with some sort of knowledge graph is what you're looking for.
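
A rough sketch of what that combination could look like: embed each keyword, then link it to the closest entry in a small table of canonical entities (a stand-in for the knowledge graph). The `embed` callable, the entity table, the `link_to_entity` name, and the 0.8 cutoff are all hypothetical. In a real setup `embed` might wrap CLIP's text encoder (`clip.tokenize` plus `model.encode_text` from the linked repo), but the demo below uses hand-written 3-d vectors so it runs without any model.

```python
from __future__ import annotations

import math
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def link_to_entity(
    keyword: str,
    entities: dict[str, Vector],
    embed: Callable[[str], Vector],
    min_similarity: float = 0.8,
) -> Optional[str]:
    """Return the most similar canonical entity, or None if none is close enough."""
    vec = embed(keyword)
    best_name, best_score = None, min_similarity
    for name, entity_vec in entities.items():
        score = cosine(vec, entity_vec)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


if __name__ == "__main__":
    # Hypothetical canonical entities with made-up embeddings.
    entities = {"Margot Robbie": [1.0, 0.0, 0.0], "Cyberpunk": [0.0, 1.0, 0.0]}
    # Made-up keyword embeddings standing in for a real text encoder.
    fake_vectors = {
        "margot robbie lora": [0.9, 0.1, 0.0],
        "neon city": [0.1, 0.9, 0.2],
        "oil painting": [0.0, 0.0, 1.0],
    }
    for keyword in fake_vectors:
        print(keyword, "->", link_to_entity(keyword, entities, fake_vectors.__getitem__))
```

Keywords that match no entity come back as `None`, which leaves room to either add a new entity or send them to a separate clustering pass.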