by spockz
1 day ago
For what it's worth, I'm on a similar machine (9070XT, 5900X) and found a lot of performance improvement over ollama by compiling llama.cpp and running with --no-mmap and --perf. The context is still quite small, though. With online models I use contexts of at least 200k, which is useful for longer-running or more complicated commands. Locally I haven't gone much further than 8k. That is sufficient for small changes on small code bases, and you need condensed tool output. I haven't tried any tool that compresses the tokens yet.
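A minimal sketch of what "condensed tool output" could mean for a small local context, assuming a rough estimate of 4 characters per token (a heuristic, not a real tokenizer). The function name, the 512-token default, and the truncation marker are all hypothetical and not taken from any particular tool:

```python
def condense_tool_output(text: str, max_tokens: int = 512,
                         chars_per_token: int = 4) -> str:
    """Keep the head and tail of a long tool output and drop the middle.

    Assumption: chars_per_token is only a crude stand-in for a tokenizer.
    """
    budget = max_tokens * chars_per_token  # approximate character budget
    if len(text) <= budget:
        return text  # already small enough, pass through unchanged

    marker = "\n... [output truncated] ...\n"
    keep = budget - len(marker)
    if keep <= 0:
        # Budget too small to fit the marker; fall back to a plain cut.
        return text[:budget]

    head = keep // 2
    tail = keep - head  # always >= 1 here, so text[-tail:] is safe
    return text[:head] + marker + text[-tail:]
```

Keeping both ends is a design choice: command output often has the invocation at the top and the error or summary at the bottom, so cutting the middle tends to lose the least.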
|
1. The hardware will eventually catch up.
2. This keeps the delta between frontier models smaller.
3. We can still fine-tune and own the weights.
4. The models will be more useful, faster, and reliable.
RTX is hobbyist tier, not professional tier.
Gated cloud models from hyperscalers treat us like hobbyists in their own right.
We need equivalent scale models, but open.