
	by marcinzm 1374 days ago
	Aren't typos just a question of how you generate your vectors/embeddings? I'd be surprised if a transformer with a character-level tokenizer, trained on a representative source of data (i.e., one that includes typos), couldn't make sense of typos.
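
To make the intuition concrete, here is a minimal sketch of why character-level representations tend to tolerate typos. It is not a transformer: it swaps in plain character-trigram count vectors and cosine similarity as a stand-in, and the helper names (`char_ngrams`, `cosine`) and the sample strings are assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams (hypothetical stand-in for a
    character-level embedding; a real model would learn its vectors)."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


query = char_ngrams("transformer embeddings")
typo = char_ngrams("transfomer embedings")   # two typos
unrelated = char_ngrams("keyword search")

# The misspelled string shares most of its trigrams with the original,
# so it stays far closer than an unrelated phrase does.
print(cosine(query, typo), cosine(query, unrelated))
```

A single dropped letter only disturbs the few n-grams that overlap it, which is roughly why sub-word or character-level inputs degrade gracefully where whole-word lookups fail outright.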

2 comments

evrydayhustling 1374 days ago

Can confirm. We use sentence-level transformer embeddings for (vector) search, clustering, and classification tasks. As an old-school ML guy, I've been amazed at how robust they are to typos, slang, punctuation, etc.

However, I'm sure there are still applications where you don't have access to a robust embedding for your domain but can apply other techniques to deal with that domain's noise.

O__________O 1374 days ago

Here is a decent intro to sentence-level transformers and embeddings:

https://www.pinecone.io/learn/sentence-embeddings/

dustincoates 1374 days ago

Yes, good point. I still believe that net-net you're going to get better results on typos with a keyword-based search, but I didn't mean to imply that vector searching won't handle typos at all.
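
For comparison, a rough sketch of how a keyword-based search can absorb typos by fuzzy-matching each query term against the indexed vocabulary. This uses Python's standard `difflib` as a stand-in for whatever edit-distance matching a real search engine applies; the function name `fuzzy_keyword_search`, the toy inverted index, and the 0.8 cutoff are all assumptions for illustration.

```python
import difflib


def fuzzy_keyword_search(query, index, cutoff=0.8):
    """Return the ids of documents whose indexed terms approximately
    match any query term (hypothetical helper, not a real engine's API)."""
    hits = set()
    vocabulary = list(index)
    for term in query.lower().split():
        # Take the single closest vocabulary term above the similarity cutoff.
        for match in difflib.get_close_matches(term, vocabulary, n=1, cutoff=cutoff):
            hits |= index[match]
    return hits


# Toy inverted index: term -> set of document ids (made-up data).
index = {
    "transformer": {1, 2},
    "embedding": {2},
    "tokenizer": {3},
}

print(fuzzy_keyword_search("transfomer embeding", index))
```

The trade-off in this sketch is the cutoff: set it too low and unrelated terms start matching, set it too high and genuine typos are missed.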