| HN Mirror

	by tmikaeld 652 days ago
	I have the same experience: it hallucinates and rambles on and on about "solutions" that are unrelated to the question. Unfortunately, this has been my experience with every open source code model that can be self-hosted.

Gracana 652 days ago

It sounds like you are trying to chat with the base model when you should be using a chat model.

tmikaeld 651 days ago

No, I’m using 9b-chat-q8_0 on a 4090

tmikaeld 651 days ago

Turns out that Ollama on Windows will run multiple models in parallel, consuming all available VRAM and RAM. Changing it to 1 fixed the issue, and now it's working great! However, the context length for the output is very small: only 1024 tokens.

Gracana 651 days ago

That's some really strange behavior. I don't know why that would cause poor results rather than just poor performance.

Can you configure the context size with `/set parameter num_ctx N`? On my laptop with an RTX A3000 12GB I can run `yi-coder:9b-chat` (Q4_0) with 32768 context and it produces good results quickly. That uses 11GB of VRAM so it's maxed out for this setup.
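If you are calling Ollama over HTTP rather than from the interactive prompt, the same setting can be passed per request. Here is a minimal sketch, assuming a local Ollama server on its default port (11434) and that the `yi-coder:9b-chat` tag is already pulled. The helper names `build_request` and `generate` are made up for illustration; `num_ctx` under `options` is the API counterpart of `/set parameter num_ctx N`.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="yi-coder:9b-chat", num_ctx=32768):
    """Build a non-streaming generate request with an explicit context size."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one JSON object instead of a stream
        "options": {"num_ctx": num_ctx},  # context window, in tokens
    }


def generate(prompt, **kwargs):
    """Send the request to the local Ollama server and return the reply text."""
    body = json.dumps(build_request(prompt, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Write a function that reverses a string.")` would then run with a 32768-token context. A larger `num_ctx` costs more VRAM, so pick a value your card can hold.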

tmikaeld 651 days ago

Solved, see:

https://github.com/01-ai/Yi-Coder/issues/6#issuecomment-2334...

Works very well now! 65K input tokens with 8192 output tokens is no longer an issue on my 4090. (It maxes out at 22GB of VRAM.)

Gracana 651 days ago

Awesome! Glad to hear you got it sorted out.