by JediPig 645 days ago
I tested this on my workload (SRE/DevOps/C#/Golang/C++). It started responding with nonsense to a simple request: write me a boto Python script that changes the x, y, z values.

Then I tried other questions from my past to compare... However, I believe the engineer who built the LLM just used the questions from the benchmarks.

In one instance, after an hour of use (I stopped then), it answered one question in 4 different programming languages, with answers that were in no way related to the question.


I have had the same experience: it hallucinates and rambles on and on about "solutions" that are not related.

Unfortunately, this has always been my experience with all open source code models that can be self-hosted.

It sounds like you are trying to chat with the base model when you should be using a chat model.
No, I’m using 9b-chat-q8_0 on a 4090
Turns out that Ollama on Windows will run multiple models in parallel, consuming all available VRAM and RAM. Changing it to 1 fixed the issue, now it's working great! However, the context length for the output is very small, only 1024 tokens.
That's some really strange behavior. I don't know why that would cause poor results rather than just poor performance.

Can you configure the context size with `/set parameter num_ctx N`? On my laptop with an RTX A3000 12GB I can run `yi-coder:9b-chat` (Q4_0) with 32768 context and it produces good results quickly. That uses 11GB of VRAM so it's maxed out for this setup.
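For anyone calling Ollama from a script rather than the interactive prompt, the same setting can be passed per request through the `options` field of the HTTP API. A minimal sketch, assuming a local server on the default port 11434; the helper names `build_request` and `generate` are made up for illustration, and `num_predict` (the cap on generated tokens) is only included when you ask for it:

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="yi-coder:9b-chat", num_ctx=32768, num_predict=None):
    """Build the JSON body for a non-streaming generate call (hypothetical helper)."""
    options = {"num_ctx": num_ctx}  # context window size in tokens
    if num_predict is not None:
        options["num_predict"] = num_predict  # upper bound on generated tokens
    return {"model": model, "prompt": prompt, "stream": False, "options": options}


def generate(prompt, **kwargs):
    """Send the request to the local server and return the generated text."""
    body = json.dumps(build_request(prompt, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Something like `generate("Write a boto3 script that lists S3 buckets.", num_ctx=32768)` should then behave like the `/set parameter` route, VRAM permitting.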

Solved, see:

https://github.com/01-ai/Yi-Coder/issues/6#issuecomment-2334...

Works very well now! 65K input tokens with 8192 output tokens is no longer an issue on my 4090. (It maxes out at 22GB of VRAM.)

Have you run the model in full FP16? It is possible a lot of performance is lost when running quantized versions.
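For a rough sense of what FP16 would cost in memory, here is a back-of-the-envelope sketch. It assumes roughly 9 billion parameters (taken from the "9b" in the model tag, so approximate) and GGUF-style quantization where each block of 32 weights shares one 16-bit scale. It counts weights only, ignoring the KV cache and runtime overhead:

```python
# Approximate bits per weight. The quantized figures assume blocks of
# 32 weights plus one 16-bit scale per block (an assumption, not measured).
BITS_PER_WEIGHT = {"fp16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def weight_gb(n_params, fmt):
    """Rough size of the weights alone, in GB (10^9 bytes)."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


for fmt in BITS_PER_WEIGHT:
    # 9e9 is an assumed parameter count, not an exact figure for this model.
    print(f"{fmt}: {weight_gb(9e9, fmt):.1f} GB")
```

Under these assumptions the FP16 weights alone come to about 18 GB, which would leave little of a 4090's 24 GB for a long context.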