|
|
|
|
|
by autokad
588 days ago
|
|
I have found that embeddings + LLM is very successful. I'm going to make the words up as to not yield my work publicly, but I had to classify something into 3 categories. I asked a simple llm to label it, it was 95% accurate. taking the min distance from the word embeddings to the mean category embeddings was about 96%. When I gave gave the LLM the embedding prediction, the LLM was 98% accurate. There were issues an embedding model might not do well on where as the LLM could handle. for example: These were camel case words, like WoodPecker, AquafinaBottle, and WoodStock (I changed the words to not reveal private data).
WoodPecker and WoodStock would end up with close embedding values because the word Wood dominated the embedding values, but these were supposed to go into 2 different categories. |
|
When faced with a similar challenge we developed a custom tokenizer, pretrained BERT base model[0], and finally a SPLADE-esque sparse embedding model[1] on top of that.
[0] - https://huggingface.co/atomic-canyon/fermi-bert-1024
[1] - https://huggingface.co/atomic-canyon/fermi-1024