by nicolewhite 888 days ago

Pretty neat!

I have a question about how you intend to deal with LLM applications where the output is more creative, e.g. an app where the user input is something like "write me a story about X" and the LLM app is using a higher temperature to get more creative responses. In these cases I don't think it's possible to represent the ideal output as a single string -- it would need to be a more complicated schema, like a list of constraints for the output, e.g. that it contains certain substrings.
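To make the "list of constraints" idea concrete, here is a rough sketch of what I have in mind. Everything in it (`Constraint`, `contains`, `max_words`, `failed_constraints`) is a hypothetical name I made up for illustration, not part of any existing tool:

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Constraint:
    """A named predicate over the model's output (hypothetical helper)."""
    name: str
    check: Callable[[str], bool]


def contains(substring: str) -> Constraint:
    # Case-insensitive substring match.
    return Constraint(
        f"contains {substring!r}",
        lambda text: substring.lower() in text.lower(),
    )


def max_words(limit: int) -> Constraint:
    # Whitespace-delimited word count must not exceed the limit.
    return Constraint(
        f"at most {limit} words",
        lambda text: len(text.split()) <= limit,
    )


def failed_constraints(output: str, constraints: List[Constraint]) -> List[str]:
    """Return the names of the constraints the output does not satisfy."""
    return [c.name for c in constraints if not c.check(output)]


story = "Once upon a time, a dragon befriended a baker in a small village."
constraints = [contains("dragon"), contains("village"), max_words(50)]
print(failed_constraints(story, constraints))  # an empty list means all passed
```

The expected output is then a set of predicates rather than one golden string, so any story that satisfies them passes regardless of temperature.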

TIA!

2 comments

robrenaud 888 days ago

The TinyStories[1] paper has an interesting solution for how to evaluate stories. They ask GPT-4 to grade them on grammar, consistency, and creativity.

This seems like it would be extremely hard to figure out how to do automatically though.

[1] https://arxiv.org/pdf/2305.07759.pdf
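The mechanics of rubric grading might look roughly like the sketch below. This is not the paper's code: the prompt wording, the "criterion: score" reply format, and the `judge` callable (standing in for whatever model API you use) are all my assumptions.

```python
import re
from typing import Callable, Dict

# Criteria mentioned above; the 1-10 scale is an assumption.
RUBRIC = ("grammar", "consistency", "creativity")


def build_grading_prompt(story: str) -> str:
    """Ask the grader for one integer score per criterion."""
    lines = [f"{criterion}: <score from 1 to 10>" for criterion in RUBRIC]
    return (
        "Grade the following story. Reply using exactly this format:\n"
        + "\n".join(lines)
        + f"\n\nStory:\n{story}"
    )


def parse_scores(reply: str) -> Dict[str, int]:
    """Extract 'criterion: N' pairs; criteria missing from the reply are skipped."""
    scores = {}
    for criterion in RUBRIC:
        match = re.search(rf"{criterion}\s*:\s*(\d+)", reply, flags=re.IGNORECASE)
        if match:
            scores[criterion] = int(match.group(1))
    return scores


def grade(story: str, judge: Callable[[str], str]) -> Dict[str, int]:
    # `judge` is a hypothetical function that sends a prompt to a model
    # and returns its text reply.
    return parse_scores(judge(build_grading_prompt(story)))
```

Passing the judge in as a plain function keeps the parsing testable with a canned reply, without calling a real model.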

maxrmk 888 days ago

Good question! We aren't really focusing on this area, but I'm willing to speculate.

I'd expect broader constraints than just substring matching. For example, if the user requests that a certain plot point in the story occur before another, we should actually be able to (1) generate a test for that behavior and (2) use a model to check if the request was followed.
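Purely as speculation, step (2) could look something like this. The names `build_order_prompt` and `check_order` are hypothetical, the YES/NO protocol is an assumption, and `judge` stands in for any function that sends a prompt to a model and returns its reply:

```python
from typing import Callable

# Hypothetical type: takes a prompt, returns the model's text reply.
Judge = Callable[[str], str]


def build_order_prompt(story: str, first: str, second: str) -> str:
    """Phrase the ordering constraint as a yes/no question for the judge."""
    return (
        "Read the story below and answer with YES or NO only.\n"
        f"Does the event {first!r} happen before the event {second!r}?\n\n"
        f"Story:\n{story}"
    )


def check_order(story: str, first: str, second: str, judge: Judge) -> bool:
    """Return True if the judge says `first` occurs before `second`."""
    answer = judge(build_order_prompt(story, first, second))
    # Anything that does not start with YES counts as a failure.
    return answer.strip().upper().startswith("YES")


# Stub judge for demonstration; a real one would call a model.
def always_yes(prompt: str) -> str:
    return "YES"


story = "The knight found the map. Later, she reached the hidden cave."
print(check_order(story, "finding the map", "reaching the cave", always_yes))
```

Treating any reply other than YES as a failure errs on the side of flagging the output for review when the judge is unsure or goes off-format.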

I'd expect other tests might be useful too -- checking for things like "no generation of violent content, even if the user requests it".