
by Aurornis 5 days ago

> The benchmark prompt was:

> Write a compact Python function that parses a unified diff and returns the changed file paths. Then explain two edge cases.

> Each benchmark generated about 128 tokens.

Generating 128 tokens is probably not enough for good benchmark results. MTP speedup depends on how often the predicted tokens are accepted. In my experience, the very early output has a higher acceptance rate, so short tests can show speedups that don't hold up over longer generations.
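To make the link between acceptance rate and speedup concrete, here is a minimal sketch. It assumes each drafted token is accepted independently with the same probability, which is a simplification and not necessarily how llama.cpp's MTP behaves in practice. The function name and the two example rates (0.9 and 0.6) are hypothetical, picked only for illustration.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification step.

    Simplifying assumption: each drafted token is accepted independently
    with probability `accept_rate`, and drafting stops at the first
    rejection. The main model always contributes one token per step.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    # 1 + p + p^2 + ... + p^k: the i-th drafted token only counts
    # if all i tokens up to and including it were accepted.
    return sum(accept_rate ** i for i in range(draft_len + 1))


if __name__ == "__main__":
    # Hypothetical rates: a high one early in the output, a lower one later.
    for rate in (0.9, 0.6):
        tokens = expected_tokens_per_step(rate, draft_len=3)
        print(f"accept_rate={rate}: ~{tokens:.2f} tokens per step")
```

If the acceptance rate drops as the output gets longer, a 128-token run would mostly sample the favourable early part, so the measured speedup would sit above what a long generation sees.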

llama.cpp includes a tool specifically for benchmarking that will sweep the arguments for you so you don't have to restart the server and send it prompts:

https://github.com/ggml-org/llama.cpp/blob/master/tools/llam...

EDIT: Also the section about downloading the models should have mentioned that llama.cpp has a "-hf" argument that will download the models for you. I appreciate the author for sharing their experience, but for beginners this might not be the best guide to use.


freerunnering 5 days ago

> I appreciate the author for sharing their experience, but for beginners this might not be the best guide to use.

Yeah, I didn't write this as a proper developer guide. My screen recording started getting loads of favourites and I started getting messages asking how I set it up, so I just threw together a quick rundown of how I set up this test.

I literally just saw the Unclothe announcement about "Double the speed" and thought "Ha. I wonder if that will get it fast enough that I'd actually be prepared to use it" and had a go at setting it up.

I'd done tests last year with things like Devstral, but they were always so slow and dumb that I didn't want to bother.

This finally hit the "wow, this is useable" level of both speed and intelligence.

Phemist 4 days ago

I wasn't familiar with Unclothe, so I had to look it up..

Are you sure you did not mean Unsloth?

threecheese 4 days ago

They likely did, and this autocorrect slip might suggest why OP is using local models :)

Phemist 4 days ago

Indeed, a clear Freudian slip. The one where you say one thing, but you mean your mother.

freerunnering 4 days ago

For some reason, every time I type "Unsloth" macOS autocorrects it to "unclothe". It did it just now, writing this reply. It's really annoying!

liuliu 5 days ago

Realistically, you need to test with a real user prompt plus a sizable system prompt (at least 1000 tokens; something in the range of 3000 tokens is probably good).

llama.cpp includes tools for that. What you want is a prefill before token generation so that generation speed is measured properly. Increasingly, measuring token generation speed at longer context (32k or 64k) is important too.
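As a rough illustration of why prefill and generation should be reported separately, here is a small sketch. The `RunTiming` type and its field names are hypothetical and do not come from llama.cpp; the timings would have to come from whatever tool you benchmark with.

```python
from dataclasses import dataclass


@dataclass
class RunTiming:
    """Timings for one benchmark run (hypothetical structure)."""
    prompt_tokens: int       # system prompt + user prompt
    prefill_seconds: float   # time spent processing the prompt
    generated_tokens: int    # tokens produced after the prompt
    decode_seconds: float    # time spent generating those tokens


def prefill_tps(run: RunTiming) -> float:
    """Prompt-processing throughput in tokens per second."""
    return run.prompt_tokens / run.prefill_seconds


def decode_tps(run: RunTiming) -> float:
    """Token-generation throughput in tokens per second."""
    return run.generated_tokens / run.decode_seconds


def naive_tps(run: RunTiming) -> float:
    """Generated tokens over total wall time, which mixes both phases."""
    return run.generated_tokens / (run.prefill_seconds + run.decode_seconds)


if __name__ == "__main__":
    # Made-up numbers, only to show how the three figures diverge.
    run = RunTiming(prompt_tokens=3000, prefill_seconds=2.0,
                    generated_tokens=128, decode_seconds=4.0)
    print(f"prefill: {prefill_tps(run):.1f} tok/s")
    print(f"decode:  {decode_tps(run):.1f} tok/s")
    print(f"naive:   {naive_tps(run):.1f} tok/s")
```

With a long prompt and a short generation, the naive figure is dominated by prefill time, which is one reason to report the two phases separately and to repeat the measurement at longer contexts.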

reactordev 5 days ago

This is akin to saying “it runs on my machine” without actually examining the problem. Sad. You’re absolutely right that 128 tokens is nothing, it’s a little more than a hello response.
