| HN Mirror



	by aesthesia 4 days ago
	Thinking shouldn't be too hard to deal with: just let the model generate freely until it hits a </think> token, then do constrained decoding, right?
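
A minimal sketch of the two-phase idea described above, not tied to any real inference library. The names `two_phase_decode`, `toy_scores`, and `toy_allowed` are hypothetical, the "model" is a scripted stand-in that returns one score per token, and greedy selection is assumed for simplicity. Decoding is unconstrained until `</think>` appears; after that, every step is filtered through the set of tokens the constraint allows.

```python
from typing import Callable, Dict, List, Set

THINK_END = "</think>"


def two_phase_decode(
    score_fn: Callable[[List[str]], Dict[str, float]],
    allowed_fn: Callable[[List[str]], Set[str]],
    max_tokens: int = 32,
) -> List[str]:
    """Greedy decode: free until THINK_END, constrained afterwards."""
    tokens: List[str] = []   # everything generated so far
    answer: List[str] = []   # only the tokens generated after THINK_END
    constrained = False
    for _ in range(max_tokens):
        scores = score_fn(tokens)
        if constrained:
            allowed = allowed_fn(answer)
            # Mask out every token the constraint does not permit.
            scores = {t: s for t, s in scores.items() if t in allowed}
            if not scores:
                break  # constraint is complete; nothing more may follow
        tok = max(scores, key=scores.get)
        tokens.append(tok)
        if constrained:
            answer.append(tok)
        elif tok == THINK_END:
            constrained = True  # switch phases from the next token on
    return tokens


def toy_scores(tokens: List[str]) -> Dict[str, float]:
    """Scripted stand-in for a model: it 'wants' to answer "maybe"."""
    script = ["hmm", "maybe", THINK_END, "maybe", "<eos>"]
    scores = {"hmm": 0.0, "maybe": 0.0, THINK_END: 0.0,
              "yes": 0.2, "no": 0.1, "<eos>": 0.0}
    if len(tokens) < len(script):
        scores[script[len(tokens)]] = 1.0
    return scores


def toy_allowed(answer: List[str]) -> Set[str]:
    """Toy constraint: the answer must be "yes" or "no", then "<eos>"."""
    if not answer:
        return {"yes", "no"}
    if answer[-1] != "<eos>":
        return {"<eos>"}
    return set()


result = two_phase_decode(toy_scores, toy_allowed)
```

Here the scripted model prefers "maybe" right after `</think>`, but the constraint masks it out, so `result` ends with `"yes"` followed by `"<eos>"`. A real setup would replace `allowed_fn` with a grammar or schema matcher and `score_fn` with actual logits.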

1 comments

stymaar 4 days ago

Sure, but does llama-cpp support that?


nl 4 days ago

It does and this is how I did it.

But getting the grammar right, and making it work with the correct Jinja template so that thinking mode is enabled and the reasoning is parsed out, was a lot more work than I expected.
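
For the "parse it out" part, here is a rough sketch of splitting raw model output into reasoning and final answer. The helper `split_thinking` is hypothetical, and the `<think>`/`</think>` tag names are an assumption about the model's output format. It also tolerates a missing opening tag, since some chat templates place `<think>` in the prompt so that only the closing tag appears in the generated text.

```python
from typing import Tuple


def split_thinking(
    raw: str,
    open_tag: str = "<think>",
    close_tag: str = "</think>",
) -> Tuple[str, str]:
    """Return (reasoning, answer) from raw model output."""
    head, sep, tail = raw.partition(close_tag)
    if not sep:
        # No closing tag: treat the whole output as the answer.
        return "", raw.strip()
    head = head.strip()
    if head.startswith(open_tag):
        # Drop the opening tag when the model emitted it itself.
        head = head[len(open_tag):]
    return head.strip(), tail.strip()


reasoning, answer = split_thinking('<think>2+2 is 4</think>\n{"answer": 4}')
```

Only the `answer` part would then be checked against the grammar or schema; the reasoning can be logged or discarded.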
