by azinman2 864 days ago
What I don’t get from the webpage is what are you evaluating, exactly?
3 comments

This, exactly - what is meant by evaluate in this context? Is this more efficient inference using approximation, so you can create novel generations, or is it some test of model attributes?

What the OP is doing here is completely opaque to the rest of us.

Fair question.

Evaluate refers to the phase after training to check if the training is good.

Usually the flow goes training -> evaluation -> deployment (what you called inference). This project is aimed at evaluation. Evaluation can be slow (it might even be slower than training if you're finetuning on a small domain-specific subset)!

So there are [quite](https://github.com/microsoft/promptbench) [a](https://github.com/confident-ai/deepeval) [few](https://github.com/openai/evals) [frameworks](https://github.com/EleutherAI/lm-evaluation-harness) working on evaluation. However, all of them are quite slow, because LLMs are slow if you don't have infinite money. [This](https://github.com/open-compass/opencompass) one tries to speed things up by parallelizing across multiple computers, but none of them takes advantage of the fact that many evaluation queries might be similar; they all evaluate on every given query. And that's where this project might come in handy.
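To make the "similar queries" point concrete, here is a minimal sketch of skipping near-duplicate queries before they reach the LLM. It is only an illustration under stated assumptions: `embed` is a toy bag-of-words stand-in for a real embedder, `select_representatives` and the `threshold` value are hypothetical, and the greedy filter shown here is a simpler technique than Bayesian optimization, so it should not be read as this project's actual method.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedder (assumption): bag-of-words counts instead of a real model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)  # Counter returns 0 for missing keys
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_representatives(queries: List[str], threshold: float = 0.8) -> List[str]:
    """Greedily keep a query only if it is not too similar to any kept query."""
    kept: List[str] = []
    kept_vecs: List[Counter] = []
    for query in queries:
        vec = embed(query)
        if all(cosine(vec, other) < threshold for other in kept_vecs):
            kept.append(query)
            kept_vecs.append(vec)
    return kept


queries = [
    "what is the capital of france",
    "what is the capital of france ?",  # near-duplicate of the first query
    "explain quicksort in python",
]
representatives = select_representatives(queries)
```

Only the queries in `representatives` would then be sent to the (slow) LLM; the embedding pass itself is cheap by comparison.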

Your explanations are still unclear.

I know what evaluation is, and inference, and training. Deployment means to deploy - to put a model in production. It does not mean inference. Inference means to input a prompt into a model and get the next token, or tokens as the case may be. Training and inference are closely related, since during training, inference is run and the error given by the difference between the prediction and target is backpropagated, etc.

Evaluation is running inference over a suite of tests and comparing the outcomes to some target ideal. An evaluation on the MMLU dataset lets you run inference on zero and few shot prompts to test the knowledge and function acquisition of your model, for example.
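As a rough sketch of that definition (run inference over a suite, compare each output to a target), something like the following would do. `toy_model`, the three-item suite, and the exact-match scoring rule are all assumptions made for illustration; real benchmarks such as MMLU use their own prompts and scoring.

```python
from typing import Callable, Sequence, Tuple


def evaluate(model: Callable[[str], str], suite: Sequence[Tuple[str, str]]) -> float:
    """Return the fraction of prompts whose output exactly matches the target."""
    if not suite:
        raise ValueError("empty evaluation suite")
    correct = sum(model(prompt).strip() == target for prompt, target in suite)
    return correct / len(suite)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call: a fixed lookup table."""
    answers = {"2 + 2 =": "4", "Capital of France?": "Paris"}
    return answers.get(prompt, "unknown")


suite = [
    ("2 + 2 =", "4"),
    ("Capital of France?", "Paris"),
    ("Largest planet?", "Jupiter"),  # the toy model gets this one wrong
]
accuracy = evaluate(toy_model, suite)
```

The cost of a real evaluation is dominated by the `model(prompt)` calls, which is why the number of prompts in the suite matters so much.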

So is your code using Bayesian Optimization to select a subset of a corpus, like a small chunk of the MMLU dataset, that is representative of the whole, so you can test on that subset instead of the whole thing?
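One way to read that question: does a score on a small subset track the score on the whole benchmark? Below is a minimal sketch using uniform random sampling as a baseline. This is not Bayesian optimization and not the project's selection method; the per-item outcomes are synthetic (1 for a correct answer, 0 for a wrong one), and `subset_estimate` is a hypothetical helper.

```python
import random
from typing import List


def subset_estimate(outcomes: List[int], k: int, seed: int = 0) -> float:
    """Estimate the full-benchmark accuracy from k randomly chosen items."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sample = rng.sample(outcomes, k)
    return sum(sample) / len(sample)


# Synthetic per-item results (assumption): 700 correct, 300 wrong.
full_outcomes = [1] * 700 + [0] * 300
full_score = sum(full_outcomes) / len(full_outcomes)

# Score only 100 of the 1000 items, i.e. 10% of the inference cost.
estimate = subset_estimate(full_outcomes, k=100)
error = abs(estimate - full_score)
```

A smarter selection scheme would aim to shrink `error` for the same `k` by choosing items that are representative rather than random.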

This is becoming so common in AI discussions. Everyone with a real use case is opaque, or just flat out doesn't talk. The ones who are talking have toy use cases. I think it's because it's so hard to build a moat, and techniques are one of the ways to build one.
Hi, OP here. I would kind of have to disagree here. You raised some interesting points, but I don't think something qualifies as a *moat* if it can be overcome just by sharing the use case. For example, we all know Google's use case is search, but no one has built a search engine as good as theirs. Their moat is in their technology and brand recognition.
Not to disagree with your argument as a whole, but Google's moat hasn't been technological for years. Instead, it comes from their ability to be the default search engine everywhere they can, even if they need to pay Apple billions for that position.
"Evaluation" has a pretty standard meaning in the LLM community, the same way that "unit test" does in software. Evaluations are suites of challenges presented to an LLM to measure how well it does, as a form of benchmarking.

Nobody would chime in on an article on "faster unit testing in software with..." and complain that it's not clear because "is it a history unit? a science unit? what kind of tests are those students taking!?", so I find it odd that on HN people often complain about something similar for a very popular niche in this community.

If you're interested in LLMs, the term "evaluation" should be very familiar, and if you're not interested in LLMs then this post likely isn't for you.

There’s lots to evaluate. If you’re evaluating model quality, there are many benchmarks all trying to measure different things… accuracy in translation, common sense reasoning, how well it stays on topic, can you regurgitate a reference in the prompt text, how biased is the output along a societal dimension, other safety measures, etc. I’m in the field but not an LLM researcher per se, so perhaps this is more meaningful to others, but given the post it seems useful to answer my question which was what _exactly_ is being evaluated?

In particular this is only working off the encoded sentences so it seems to me that things that involve attention etc aren’t being evaluated here.

Unit testing isn't an overloaded term. Evaluation by itself is overloaded, though "LLM evaluation" disambiguates it. I first parsed the title as 'faster inference' rather than 'faster evaluation' even being aware of what LLM evaluation is, because that's a probable path given 'show' 'faster' and 'LLM' in the context window.

That misreading could also suggest some interesting research directions. Bayesian optimization to choose some parameters which guide which subset of the neurons to include in the inference calculation? Why not.

Hi, OP here, sorry for the late reply. I am not actually "evaluating", but rather using the "side effects" of Bayesian optimization, which allow zooming in/out on regions of the latent space. Since embedders are so fast compared to LLMs, this saves time by sparing the LLM from evaluating similar queries. Hope that makes sense!
But aren’t you really just evaluating the embeddings / quality of the latent space then?