by nicolewhite 888 days ago
Pretty neat!

I have a question about how you intend to deal with LLM applications where the output is more creative, e.g. an app where the user input is something like "write me a story about X" and the LLM app is using a higher temperature to get more creative responses. In these cases I don't think it's possible to represent the ideal output as a single string -- it would need to be a more complicated schema, like a list of constraints for the output, e.g. that it contains certain substrings.
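To make that concrete, here is a rough sketch of what a "list of constraints" could look like. Everything in it (`Constraint`, `contains`, `max_words`, `failed_constraints`) is a made-up name for illustration, not part of any existing tool:

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Constraint:
    """A named predicate over the model's output (hypothetical schema)."""
    name: str
    check: Callable[[str], bool]


def contains(substring: str) -> Constraint:
    # Case-insensitive substring check.
    return Constraint(
        f"contains {substring!r}",
        lambda text: substring.lower() in text.lower(),
    )


def max_words(limit: int) -> Constraint:
    # Whitespace-delimited word count must not exceed the limit.
    return Constraint(
        f"at most {limit} words",
        lambda text: len(text.split()) <= limit,
    )


def failed_constraints(output: str, constraints: List[Constraint]) -> List[str]:
    """Return the names of all constraints the output violates."""
    return [c.name for c in constraints if not c.check(output)]


# Expected output for "write me a short story about a dragon and a castle",
# expressed as constraints instead of a single ideal string.
story_constraints = [contains("dragon"), contains("castle"), max_words(50)]
sample = "Once upon a time a dragon guarded a quiet village."
print(failed_constraints(sample, story_constraints))
```

The point is that a test passes when the list of failures is empty, so many different stories can satisfy the same test case.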

TIA!

2 comments

The TinyStories[1] paper has an interesting solution for how to evaluate stories. They ask GPT-4 to grade them on grammar, consistency, and creativity.

This seems like it would be extremely hard to do automatically, though.

[1] https://arxiv.org/pdf/2305.07759.pdf

Good question! We aren't really focusing on this area, but I'm willing to speculate.

I'd expect broader constraints than just substring matching. For example, if the user requests that one plot point occur before another, we should be able to (1) generate a test for that behavior and (2) use a model to check whether the request was followed.
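Purely as speculation, a model-checked ordering test might look something like the sketch below. The names `build_order_prompt` and `check_event_order` are hypothetical, and `judge` stands in for whatever function sends a prompt to a model and returns its text reply; no real API is assumed here.

```python
from typing import Callable

# A judge takes a prompt and returns the model's text reply.
# In practice this would wrap a call to an LLM; here it is an assumption.
Judge = Callable[[str], str]


def build_order_prompt(story: str, first: str, second: str) -> str:
    """Build a yes/no question asking whether `first` happens before `second`."""
    return (
        "Read the story below and answer with YES or NO only.\n"
        f"Does the event '{first}' occur before the event '{second}'?\n\n"
        f"Story:\n{story}"
    )


def check_event_order(story: str, first: str, second: str, judge: Judge) -> bool:
    """Return True if the judge says `first` occurs before `second`."""
    answer = judge(build_order_prompt(story, first, second))
    # Accept replies such as "Yes", "YES.", or "yes, it does".
    return answer.strip().upper().startswith("YES")


# Stand-in judge for demonstration; a real one would query a model.
def always_yes(prompt: str) -> str:
    return "YES"


story = "The knight found the map. Later, the knight reached the cave."
print(check_event_order(story, "finding the map", "reaching the cave", always_yes))
```

Passing the judge in as a parameter keeps the test logic separate from the model call, so the same check can run against a stub in unit tests and a real model in evaluation.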

I'd expect other tests might be useful too -- checking for things like "no generation of violent content, even if the user requests it".