Reasoning models spend a whole bunch of time reasoning before returning an answer. I was toying with QWQ 32B last night and ran into one question I gave it where it spent 18 minutes at 13tok/s in the <think> phase before returning a final answer. I value local compute but reasoning models aren’t terribly feasible at this speed since you don’t really need to see the first 90% of their thinking output.
ChatGPT o3 mini high thinks at about 140 tokens/s by my estimation and I sometimes wish it can return answers quicker.
Getting a simple prompt answer would take 2-3 minutes using the AMD system and forget about longer context.