HN Mirror



	by sunaookami 76 days ago
	Fiction. One of my "benchmarks" is giving the model a bunch of (self-made) text and having it simulate a 4chan thread about it. This tests tool use (calling the APIs), some skills, censorship, and general creativity. Some models refuse every new turn after reading real 4chan threads ;) Surprisingly, Claude is especially good at this, while GPT fails spectacularly and Gemini is just lazy (and barely usable, since it's constantly overloaded). Qwen (the coder model from Qwen CLI, so Qwen 3.5) is also very good, but sadly not usable in Pi (they detect and block calls outside their CLI).

1 comment

admiralrohan 76 days ago

Interesting. Are you running something like an Autoresearch loop for writing fiction? How does the agent determine whether the output is good, given that this is subjective?


sunaookami 75 days ago

I don't have any advanced setup; creative writing is always subjective. I just one-shot most of the time.
