Hacker News new | ask | show | jobs
by abhgh 859 days ago
That's probably the intent, but I don't know if this actually achieves this (I have another comment that's about the use of bayesopt here). But even if it did, bayesopt operates sequentially (it's a Sequential Model-based Optimizer or SMBO) and so the trajectory of queries different LLMs evaluate would be different. Unless there is something to correct this cascading bias I don't know if you could use this to compare LLMs. Or obtain a score that's comparable to standard reported numbers.

On a different note, if all we want is a diverse set of representative samples (based on embeddings), there are algorithms like DivRank that do that quite well.