by pferdone 43 days ago

I can see that, and I don't know your setup, but there are people pushing >70 t/s with MTP on a single 3090, and still >50 t/s with big contexts. 64k is not a lot for agentic coding, and IIRC 128k should be possible for you with turboquant and the like. r/LocalLLM/ and r/LocalLLaMA/ are worth a visit IMO.

EDIT: just found this recipe repo, may wanna give it a go: https://github.com/noonghunna/club-3090

EDIT-2: this can also cut a lot of the context needed for tool calling -> https://github.com/rtk-ai/rtk

3 comments

gchamonlive 42 days ago

club-3090 with llama.cpp did it. Full 262k context, usable in oh-my-pi. Still testing it, but initial results are promising.

I had to make a couple of adjustments though. After downloading the model with hf, I needed to move the mmproj-F16.gguf to the parent folder:

  $ tree /media/fast-storage/club-3090-models/qwen3.6-27b/
  /media/fast-storage/club-3090-models/qwen3.6-27b/
  ├── mmproj-F16.gguf
  └── unsloth-q3kxl
      └── Qwen3.6-27B-UD-Q3_K_XL.gguf
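
For anyone repeating this, the move can be sketched roughly as below. This is only a sketch: it assumes hf placed mmproj-F16.gguf next to the model file inside unsloth-q3kxl/, which the comment above does not spell out, and move_mmproj and MODEL_DIR are hypothetical names used for illustration.

```shell
# Sketch: move the multimodal projector file up one level so the layout
# matches the tree above. ASSUMPTION: hf put mmproj-F16.gguf inside
# unsloth-q3kxl/; adjust the source path if yours landed elsewhere.
MODEL_DIR="${MODEL_DIR:-/media/fast-storage/club-3090-models/qwen3.6-27b}"

move_mmproj() {
  # $1 is the model directory (the parent of unsloth-q3kxl/)
  src="$1/unsloth-q3kxl/mmproj-F16.gguf"
  if [ -f "$src" ]; then
    mv -- "$src" "$1/"
  else
    # Nothing to do: file is missing or was already moved.
    echo "no mmproj-F16.gguf under $1/unsloth-q3kxl, skipping" >&2
  fi
}

move_mmproj "$MODEL_DIR"
```

Running tree on the directory afterwards should show mmproj-F16.gguf at the top level, as in the listing above.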

Then, when starting the server, the container complained that llama-server wasn't a known binary, so I had to add PATH="/app:$PATH" to the entrypoint of the llama service.
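
What that PATH change does can be sketched in plain shell. This is not the repo's actual entrypoint, which isn't shown here; prepend_path is a hypothetical helper, and /app as the binary's location is taken from the fix above.

```shell
# Sketch: prepend a directory to PATH (once) so the shell can resolve
# binaries that live there. ASSUMPTION: llama-server sits in /app inside
# the container, as the PATH fix above implies.
prepend_path() {
  case ":$PATH:" in
    *":$1:"*) ;;              # already on PATH, leave it alone
    *) PATH="$1:$PATH" ;;     # otherwise put it first
  esac
}

prepend_path /app
export PATH

# Check whether the binary now resolves; warn instead of failing.
command -v llama-server || echo "llama-server not found on PATH" >&2
```

Inside the container, the command -v line should print the path to llama-server once /app is on PATH; outside it, it just prints the warning.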

The only thing that's missing is for llama to emit thinking blocks that oh-my-pi can parse, but it's running alright. That's mostly cosmetic.

pferdone 33 days ago

That's so cool, man! Congrats!

gchamonlive 43 days ago

I managed to run it with vllm successfully, but it breaks opencode on a simple "what's this repo?" task. On oh-my-pi it won't even execute because omp sends multiple system prompts. I'll try with llama.cpp later and see if it works more reliably.

gchamonlive 43 days ago

Will give more info in the post.

EDIT: thanks for the links!