by tmikaeld 652 days ago
I have the same experience: it hallucinates and rambles on and on about "solutions" that are unrelated to the problem.

Unfortunately, this has always been my experience with all open source code models that can be self-hosted.


It sounds like you are trying to chat with the base model when you should be using a chat model.
No, I’m using 9b-chat-q8_0 on a 4090
Turns out that Ollama on Windows will run multiple models in parallel, consuming all available VRAM and RAM. Changing it to 1 fixed the issue, and now it's working great! However, the context length for the output is very small: only 1024 tokens.
That's some really strange behavior. I don't know why that would cause poor results rather than just poor performance.

Can you configure the context size with `/set parameter num_ctx N`? On my laptop with an RTX A3000 12GB I can run `yi-coder:9b-chat` (Q4_0) with 32768 context and it produces good results quickly. That uses 11GB of VRAM so it's maxed out for this setup.
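
If you're scripting this rather than typing into the REPL, the same setting can be passed per request through Ollama's HTTP API. Here's a minimal sketch, assuming a local server on the default port 11434 and the `yi-coder:9b-chat` tag from above; the `build_request` and `generate` helper names are mine, not part of Ollama. `num_ctx` sets the context window and `num_predict` caps the number of generated tokens.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="yi-coder:9b-chat", num_ctx=32768, num_predict=None):
    """Build the JSON body for /api/generate with a custom context size."""
    options = {"num_ctx": num_ctx}
    if num_predict is not None:
        # Optional cap on how many tokens the model may generate.
        options["num_predict"] = num_predict
    return {"model": model, "prompt": prompt, "stream": False, "options": options}


def generate(prompt, **kwargs):
    """Send the request to the Ollama server and return the generated text."""
    body = json.dumps(build_request(prompt, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


# Example call (needs a running server):
# print(generate("Reverse a string in Python.", num_ctx=32768, num_predict=8192))
```

Larger `num_ctx` values cost more VRAM, so it's worth starting low and raising it until you hit your card's limit.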

Solved, see:

https://github.com/01-ai/Yi-Coder/issues/6#issuecomment-2334...

Works very well now! 65K input tokens with 8192 output tokens is no longer an issue on my 4090. (It maxes out at 22GB of VRAM.)

Awesome! Glad to hear you got it sorted out.