Hacker News
by minimaxir 546 days ago
The MTEB benchmark was never that great, since embeddings are used for domain-specific tasks (e.g. search/clustering) that can't really be represented well in a generalized test, even more so than LLM next-token-prediction benchmarks, which aren't great either.

As with all LLM models and their subproducts, the only way to ensure good results is to test yourself, ideally with less subjective, real-world feedback metrics.

1 comment

> As with all LLM models and their subproducts, the only way to ensure good results is to test yourself, ideally with less subjective, real-world feedback metrics.

This is excellent advice. Sadly, very few people/organizations implement their own evaluation suites.

It doesn't make much sense to put data infrastructure in production without first evaluating its performance (IOPS, uptime, scalability, etc.) on internal workloads; it is no different for embedding models, or models in general for that matter.
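
For what it's worth, a minimal in-house check doesn't have to be much code. Here's a rough sketch, assuming you've already embedded your own queries and documents with a candidate model and have one known-relevant document per query (e.g. from click logs). The names `cosine` and `recall_at_k` and the data layout are hypothetical and not from any particular library; it just measures recall@k with cosine similarity.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recall_at_k(query_vecs, doc_vecs, relevant_idx, k=5):
    """Fraction of queries whose known-relevant document ranks in the top k.

    query_vecs:   list of query embeddings
    doc_vecs:     list of document embeddings
    relevant_idx: relevant_idx[i] is the index into doc_vecs of the document
                  labeled relevant for query i (assumed: one label per query)
    """
    hits = 0
    for q, rel in zip(query_vecs, relevant_idx):
        # Rank all documents by similarity to this query, best first.
        ranked = sorted(
            range(len(doc_vecs)),
            key=lambda i: cosine(q, doc_vecs[i]),
            reverse=True,
        )
        if rel in ranked[:k]:
            hits += 1
    return hits / len(query_vecs)

# Toy vectors standing in for real model output.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.1, 0.9]]
score = recall_at_k(queries, docs, relevant_idx=[0, 1], k=1)
```

Run the same labeled set through each candidate model and compare the scores; the labels come from your own workload, so the number reflects your data rather than a leaderboard's.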