Hacker News new | ask | show | jobs
by reaslonik 220 days ago
Breaks as in contains words that grammatically work but don't make sense, mistakes the symbol | for a person, points back to things that didn't exist in the request etc. I use templates like these for explaining questions:

from

```

excerpt of text or code from some application or site

```

What is the meaning of excerpt?

Just doesen't seem to work at a useable level. Coding questions get code that runs, but almost always misses so many things that finding out what it missed and fixing those takes a lot more time than handwriting code.

>Overall it still seems extremely good for its size and I wouldn't expect anything below 30B to behave like that. I mean, it flies with 100 tok/sec even on a 1650 :D

For it's size absolutely, I've not seen 1,5B models that form even sentences right most of the time so this is miles ahead of most small models, not just to the hinted at levels the benchmarks would you have believe

1 comments

Interesting, I haven't seen it actually return nonsense yet. (Some incorrect things and getting into thinking loops, but always coherent) I'm running it on a latest llama.cpp with the bf16 gguf. What are you using?
I'm running the huggingface's .safetensors with vLLM with as little starting parameters as possible. I thought it must not be sending temp right, but after setting temp to something else I got chinese so it should be sending it.

Overall if you're memory constrained it's probably still worth to try and fiddle around with it if you can get it to work. Speedwise if you got the memory a 5090 can get ~50-100tok/s for a single query with 32B-AWQ and way more if you have something parallel like open-webui