Hacker News
by observationist 864 days ago
This, exactly - what is meant by evaluate in this context? Is this more efficient inference using approximation, so you can create novel generations, or is it some test of model attributes?

What the OP is doing here is completely opaque to the rest of us.

Fair question.

"Evaluate" refers to the phase after training, where you check how well the trained model actually performs.

Usually the flow goes training -> evaluation -> deployment (what you called inference). This project is aimed at evaluation. Evaluation can be slow (it might even be slower than training if you're finetuning on a small domain-specific subset)!

There are [quite](https://github.com/microsoft/promptbench) [a](https://github.com/confident-ai/deepeval) [few](https://github.com/openai/evals) [frameworks](https://github.com/EleutherAI/lm-evaluation-harness) working on evaluation. However, all of them are quite slow, because LLMs are slow if you don't have infinite money. [This](https://github.com/open-compass/opencompass) one tries to speed things up by parallelizing across multiple computers, but none of them takes advantage of the fact that many evaluation queries might be similar; they all evaluate on every given query. That's where this project might come in handy.
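To make the cost argument concrete, here is a minimal sketch of estimating a benchmark score from a subset of queries instead of all of them. Uniform random sampling is used purely as a stand-in and is not this project's selection method; `estimate_accuracy`, `is_correct`, and the toy `outcomes` list are hypothetical names made up for illustration.

```python
import random
from typing import Callable

def estimate_accuracy(
    is_correct: Callable[[int], bool],
    n_queries: int,
    budget: int,
    seed: int = 0,
) -> float:
    """Estimate accuracy by scoring only `budget` randomly chosen queries."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    rng = random.Random(seed)
    # Sample query indices without replacement (assumption: plain uniform
    # sampling, standing in for a smarter selection strategy).
    chosen = rng.sample(range(n_queries), budget)
    return sum(is_correct(i) for i in chosen) / budget

# Hypothetical benchmark: 1000 queries, the model is right on 70% of them.
outcomes = [i % 10 < 7 for i in range(1000)]
calls = []  # records which queries were actually scored

def is_correct(i: int) -> bool:
    calls.append(i)  # in practice, each call here would be an expensive LLM query
    return outcomes[i]

estimate = estimate_accuracy(is_correct, len(outcomes), budget=100)
full = sum(outcomes) / len(outcomes)
print(f"estimated {estimate:.2f} from {len(calls)} queries, full score {full:.2f}")
```

The estimate costs 100 model calls instead of 1000; how close it lands to the full score depends on how the subset is chosen, which is the part a smarter selection strategy would try to improve.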

Your explanations are still unclear.

I know what evaluation is, and inference, and training. Deployment means to deploy - to put a model in production. It does not mean inference. Inference means to input a prompt into a model and get the next token, or tokens as the case may be. Training and inference are closely related, since during training, inference is run and the error given by the difference between the prediction and target is backpropagated, etc.

Evaluation is running inference over a suite of tests and comparing the outcomes to some target ideal. An evaluation on the MMLU dataset, for example, lets you run inference on zero- and few-shot prompts to test the knowledge and function acquisition of your model.
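In code, that definition of evaluation looks roughly like the sketch below. It assumes exact-match scoring and uses a toy lookup table in place of a real model; `evaluate`, `toy_model`, and the three-question suite are hypothetical and not taken from MMLU or any framework.

```python
from typing import Callable, Sequence

def evaluate(
    model: Callable[[str], str],
    prompts: Sequence[str],
    targets: Sequence[str],
) -> float:
    """Run inference on every prompt and return exact-match accuracy."""
    if len(prompts) != len(targets):
        raise ValueError("prompts and targets must have the same length")
    if not prompts:
        raise ValueError("evaluation suite is empty")
    # One inference call per prompt, compared against its target answer.
    correct = sum(model(p).strip() == t.strip() for p, t in zip(prompts, targets))
    return correct / len(prompts)

# Hypothetical stand-in for an LLM: a lookup table with one wrong answer.
def toy_model(prompt: str) -> str:
    answers = {
        "2 + 2 = ?": "4",
        "Capital of France?": "Paris",
        "Largest planet?": "Saturn",  # deliberately wrong
    }
    return answers.get(prompt, "")

suite_prompts = ["2 + 2 = ?", "Capital of France?", "Largest planet?"]
suite_targets = ["4", "Paris", "Jupiter"]
accuracy = evaluate(toy_model, suite_prompts, suite_targets)
print(f"accuracy: {accuracy:.3f}")
```

The cost is one inference call per prompt, which is why a full suite gets expensive with a real model.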

So is your code using Bayesian Optimization to select a subset of a corpus, like a small chunk of the MMLU dataset, that is representative of the whole, so you can test on that subset instead of the whole thing?

This is becoming so common in AI discussions. Everyone with a real use case is opaque, or just flat out doesn't talk. The ones who are talking have toy use cases. I think it's because it's so hard to build a moat, and techniques are one of the ways to build one.
Hi, OP here. I have to disagree. You raised some interesting points, but I don't think something qualifies as a *moat* if it can be overcome just by sharing the use case. For example, we all know Google's use case is search, but no one has built a search engine as good as theirs. Their moat is in their technology and brand recognition.
Not to disagree with your argument as a whole, but Google's moat hasn't been technological for years. It comes instead from their ability to be the default search engine everywhere they can, even if that means paying Apple billions for the position.