by leobg 709 days ago
The example could be handled with no machine learning at all. Just use a bag-of-words comparison with a subword tokenizer. And if you do need embeddings (to map synonyms/topics), fastText is faster, cheaper and runs locally. For hard cases, you can feed the source/target schemas to gpt-4o once to create a map, and then apply that one map to all instances.
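A minimal sketch of the bag-of-words idea, assuming the task is matching field names between two schemas. Character trigrams stand in for a real subword tokenizer here, and the field names, the `threshold` value and the helper names (`subword_bag`, `build_field_map`, `apply_map`) are all made up for illustration:

```python
from collections import Counter
from math import sqrt


def subword_bag(text, n=3):
    """Bag of character n-grams, a crude stand-in for a subword tokenizer."""
    # Normalise case and separators so "first_name" and "firstName" look alike.
    cleaned = "".join(ch for ch in text.lower() if ch.isalnum())
    padded = f"<{cleaned}>"  # boundary markers, in the style fastText uses
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def build_field_map(source_fields, target_fields, threshold=0.3):
    """Build the source -> target map once; threshold is an arbitrary guess."""
    target_bags = {t: subword_bag(t) for t in target_fields}
    mapping = {}
    for s in source_fields:
        bag = subword_bag(s)
        best = max(target_fields, key=lambda t: cosine(bag, target_bags[t]))
        if cosine(bag, target_bags[best]) >= threshold:
            mapping[s] = best
    return mapping


def apply_map(record, mapping):
    """Apply the precomputed map to one record, dropping unmapped keys."""
    return {mapping[k]: v for k, v in record.items() if k in mapping}


# Hypothetical schemas, not taken from the article under discussion.
source = ["first_name", "LastName", "email_address"]
target = ["firstName", "last_name", "email"]
field_map = build_field_map(source, target)
```

The map is computed once and `apply_map` is then a plain dictionary lookup per record, which is the same "build one map, apply it to all instances" shape as the gpt-4o fallback. Synonyms with no shared characters (say, "surname" vs "last_name") would not match this way; that is where embeddings or the one-off LLM call come in.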
2 comments

> fastText is faster, cheaper and runs locally

The question is whether the quality will be acceptable.

The question is whether embeddings produced by a machine learning algorithm will have acceptable quality too. With a library, I presume the quality is at least predictable. Personally, I have less trust in machine learning, though.

> The question is whether embeddings produced by a machine learning algorithm will have acceptable quality too

There are tons of benchmarks and results demonstrating that embeddings from language models are superior to word2vec in (almost) all scenarios.

BTW, bag-of-words models were themselves considered ML not too long ago.