| HN Mirror



	by leobg 709 days ago
	The example could be handled with no machine learning at all. Just use a bag of words comparison with a subword tokenizer. And if you do need embeddings (to map synonyms/topics), fastText is faster, cheaper and runs locally. For hard cases, you can feed the source/target schemas to gpt-4o once to create a map - and then apply that one map to all instances.
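A minimal sketch of the bag-of-words idea, not a definitive implementation. It assumes the task is matching field names between a source and a target schema, and it uses padded character trigrams as a stand-in for a real subword tokenizer. The function names (`subword_bag`, `cosine`, `build_map`, `apply_map`) and the sample field names are hypothetical. The map is built once and then reused for every record, which is the same pattern you would follow if the map came from a one-off LLM call instead (that call is not shown here).

```python
import math
import re
from collections import Counter


def subword_bag(name: str, n: int = 3) -> Counter:
    """Bag of character n-grams for a field name (illustrative helper)."""
    # Split snake_case, camelCase and ALLCAPS names into words.
    words = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", name)
    bag = Counter()
    for word in words:
        # Pad each word so prefixes and suffixes get their own n-grams.
        padded = f"<{word.lower()}>"
        for i in range(max(1, len(padded) - n + 1)):
            bag[padded[i:i + n]] += 1
    return bag


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram bags."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


def build_map(source_fields, target_fields):
    """Build the source -> target field map once.

    Always picks the closest target, even when the similarity is zero;
    a real pipeline would likely want a minimum-similarity threshold.
    """
    targets = {t: subword_bag(t) for t in target_fields}
    mapping = {}
    for s in source_fields:
        s_bag = subword_bag(s)
        mapping[s] = max(targets, key=lambda t: cosine(s_bag, targets[t]))
    return mapping


def apply_map(record: dict, mapping: dict) -> dict:
    """Rename the keys of one record using the prebuilt map."""
    return {mapping[key]: value for key, value in record.items()}


if __name__ == "__main__":
    mapping = build_map(["first_name", "zip_code"], ["postalCode", "firstName"])
    print(apply_map({"first_name": "Ada", "zip_code": "12345"}, mapping))
```

This only catches surface overlap between names. A pair like `zip` and `postal` shares no trigrams, so synonyms are where embeddings or a handwritten or generated map would have to take over.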

2 comments

riku_iki 709 days ago

> fastText is faster, cheaper and runs locally

The question is whether the quality will be acceptable.


flysand7 709 days ago

The question is whether embeddings produced by a machine learning algorithm will have acceptable quality too. With a library I presume that the quality is at least predictable. I personally have less trust in machine learning, though.


riku_iki 709 days ago

> The question is whether embeddings produced by a machine learning algorithm will have acceptable quality too

There are tons of benchmarks and results demonstrating that embeddings from language models are superior to word2vec in (almost) all scenarios.


srean 709 days ago

BTW, bag-of-words models were themselves considered ML not too long ago.
