|
|
|
|
|
by devashish86
3 days ago
|
|
Author here. Quick context the post doesn't quite spell out: The tool_choice="auto" failure on Qwen3-Next isn't a parser issue — the model
reasons inside <think>, decides, and never emits the tool call. No error, just
empty tool_calls. The fix was swapping the backbone from Thinking to Instruct,
not tuning any parser flag. The "load the bigger model first, size the smaller against actual residency"
playbook generalizes to anything with shared CUDA framework overhead. The ~5 GiB
framework floor shows up even at small gpu_memory_utilization values — plan
against actuals, not targets. |
|
```
(...) - Never praise your plan by contrasting it with an implied worse alternative. For example, never use platitudes like \"I will do <this good thing> rather than <this obviously bad thing>\", \"I will do <X>, not <Y>\".
- Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user's query. (...)
```
It seems the OpenAI people added that first bullet to specifically address the tendency the model has, as seen in the parent comment. The goblin stuff coincidentally appears right after in the system prompt, so in included it as a bonus.