by sunaookami 76 days ago
Fiction. One of my "benchmarks" is giving the model a bunch of (self-made) text and having it simulate a 4chan thread about it. This tests tool use (calling the APIs), some skills, censorship, and general creativity. Some models refuse every new turn after reading real 4chan threads ;) Claude is surprisingly good at this, while GPT fails spectacularly and Gemini is just lazy (and barely usable since it's constantly overloaded). Qwen (the coder model from Qwen CLI, so Qwen 3.5) is also very good but sadly not usable in Pi (they detect and block calls outside their CLI).
1 comments

Interesting. Are you running something like an Autoresearch loop for writing fiction? How will the agent determine whether the output is good, given that this is subjective?
I don't have any advanced setup; creative writing is always subjective. I just one-shot most of the time.