by minimaxir
546 days ago
The MTEB benchmark was never that great, since embeddings are used for specific, domain-dependent tasks (e.g. search/clustering) that can't really be represented well in a generalized test, even more so than LLM next-token-prediction benchmarks, which aren't great either. As with all LLMs and their subproducts, the only way to ensure good results is to test them yourself, ideally with less subjective, real-world feedback metrics.
This is excellent advice. Sadly, very few people/organizations implement their own evaluation suites.
It doesn't make much sense to put data infrastructure into production without first evaluating its performance (IOPS, uptime, scalability, etc.) on internal workloads; it is no different for embedding models, or for models in general for that matter.
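
For what it's worth, an in-house evaluation suite doesn't have to be big. Below is a rough sketch, not a definitive implementation, of a retrieval check that scores any embedding function on your own labelled query/document pairs using recall@k and mean reciprocal rank. The names `evaluate_retrieval` and `toy_embed` are made up for illustration, and `toy_embed` is only a hashed bag-of-words stand-in that you would swap for the model you are actually considering.

```python
import math
import zlib
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def evaluate_retrieval(
    embed: Callable[[str], Vector],
    docs: Dict[str, str],       # doc_id -> document text
    queries: Dict[str, str],    # query_id -> query text
    relevant: Dict[str, str],   # query_id -> the one doc_id judged relevant
    k: int = 3,
) -> Dict[str, float]:
    """Score an embedding function on internal, labelled data.

    Assumption: each query has exactly one relevant document.
    """
    doc_vecs = {doc_id: embed(text) for doc_id, text in docs.items()}
    hits = 0
    reciprocal_rank_sum = 0.0
    for query_id, text in queries.items():
        query_vec = embed(text)
        # Rank every document by similarity to the query, best first.
        ranked = sorted(
            doc_vecs,
            key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
            reverse=True,
        )
        rank = ranked.index(relevant[query_id]) + 1
        if rank <= k:
            hits += 1
        reciprocal_rank_sum += 1.0 / rank
    n = len(queries)
    return {"recall_at_k": hits / n, "mrr": reciprocal_rank_sum / n}


def toy_embed(text: str, dim: int = 64) -> List[float]:
    """Hypothetical stand-in model: hashed bag-of-words counts."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return vec
```

To compare candidates, call `evaluate_retrieval` once per model with the same `docs`, `queries`, and `relevant` mappings (ideally pulled from real search logs or click data rather than written by hand) and look at the numbers side by side.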