by ajhai 1206 days ago
Congrats on the launch! I'm glad to see all the tooling come up in this space.

Regarding tests, how do you evaluate the generated completions? Letting users run a set of tests against a prompt and showing the completions for visual inspection is a good start, but imho it doesn't scale once the app is in production with a large corpus of tests. Something we are exploring right now is generating a similarity/divergence score between generated completions to make this easy at scale.
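
A minimal sketch of what such a similarity/divergence score could look like, assuming plain character-level comparison with Python's standard `difflib` (the function names `pairwise_similarity` and `divergence` are hypothetical, and this is not a description of how Promptly or Vellum actually compute it):

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_similarity(completions):
    """Mean SequenceMatcher ratio over all pairs of completions.

    1.0 means every completion is identical; values near 0.0 mean the
    completions share almost no text. Hypothetical helper for illustration.
    """
    pairs = list(combinations(completions, 2))
    if not pairs:
        # Zero or one completion: nothing to disagree with.
        return 1.0
    total = sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs)
    return total / len(pairs)


def divergence(completions):
    """Complement of the mean pairwise similarity."""
    return 1.0 - pairwise_similarity(completions)


if __name__ == "__main__":
    runs = [
        "The capital of France is Paris.",
        "Paris is the capital of France.",
        "France's capital city is Paris.",
    ]
    print(f"divergence: {divergence(runs):.3f}")
```

A single number per test case like this can be tracked across prompt versions, so only the cases whose divergence jumps need visual inspection.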

Disclosure: We are building something very similar at Promptly (https://trypromptly.com), based on our experience using GPT-3 at MakerDojo.


Thanks! We totally agree that spot-checking won't scale long term. We're currently testing a feature in beta that lets you provide an "expected output" and then choose from a variety of comparison metrics (e.g. exact match, semantic similarity, Levenshtein distance) to derive a quantitative measure of output quality. The jury's still out on whether this is sufficient, but we're excited to continue pushing in this direction.
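
For illustration, here is a rough sketch of how an "expected output" comparison with selectable metrics might work. It covers only exact match and Levenshtein distance; semantic similarity would need an embedding model and is left out. The names `METRICS` and `score` are hypothetical and are not Vellum's API.

```python
def exact_match(expected, actual):
    """1.0 if the strings match after trimming whitespace, else 0.0."""
    return 1.0 if expected.strip() == actual.strip() else 0.0


def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the rolling row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(expected, actual):
    """Edit distance rescaled to [0, 1], where 1.0 means identical."""
    longest = max(len(expected), len(actual))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(expected, actual) / longest


# Hypothetical registry so a test case can pick its metric by name.
METRICS = {
    "exact_match": exact_match,
    "levenshtein": levenshtein_similarity,
}


def score(expected, actual, metric="exact_match"):
    return METRICS[metric](expected, actual)


if __name__ == "__main__":
    print(score("Paris", "Paris "))
    print(score("kitten", "sitting", metric="levenshtein"))
```

Normalizing every metric to the same 0-to-1 range makes it possible to set one pass threshold per test case regardless of which metric was chosen.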

p.s. it's cool to hear from another company that's helping expand this market!

What I think would be really interesting is to apply distance metric learning (DML) to the problem. You have users tell you which responses are good and which are bad, and use that to learn a metric that will classify responses as good or bad. One of the big challenges is that DML is typically applied to data in some vector space rather than to strings, but I would expect that using some embedding constructed from the output could work well.
Super interesting idea! We already expose UIs and APIs for supplying feedback on the quality of the output, so this could totally be possible once enough feedback has been collected. Thanks for sharing!
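
To make the DML suggestion above concrete, here is a toy sketch under strong assumptions: each response has already been turned into a small embedding vector (the 2-d vectors below are made-up stand-ins, not real embeddings), and the learned metric is just a per-dimension weight (a diagonal metric) fit with a simple contrastive update. Real DML methods are more sophisticated; all names here are hypothetical.

```python
from itertools import combinations


def sq_dist(w, x, y):
    """Weighted squared Euclidean distance under diagonal metric w."""
    return sum(wk * (a - b) ** 2 for wk, a, b in zip(w, x, y))


def learn_diagonal_metric(vectors, labels, margin=1.0, lr=0.05, epochs=200):
    """Fit per-dimension weights with a contrastive loss.

    Same-label pairs are pulled together; different-label pairs are pushed
    apart until their squared distance reaches `margin`.
    """
    dim = len(vectors[0])
    w = [1.0] * dim
    pairs = list(combinations(range(len(vectors)), 2))
    for _ in range(epochs):
        grad = [0.0] * dim
        for i, j in pairs:
            diff_sq = [(a - b) ** 2 for a, b in zip(vectors[i], vectors[j])]
            d = sum(wk * dk for wk, dk in zip(w, diff_sq))
            if labels[i] == labels[j]:
                grad = [g + dk for g, dk in zip(grad, diff_sq)]
            elif d < margin:
                grad = [g - dk for g, dk in zip(grad, diff_sq)]
        # Gradient step, clamped so the weights stay non-negative.
        w = [max(0.0, wk - lr * g / len(pairs)) for wk, g in zip(w, grad)]
    return w


def classify(w, vectors, labels, query):
    """Label of the nearest rated response under the learned metric."""
    nearest = min(range(len(vectors)),
                  key=lambda i: sq_dist(w, vectors[i], query))
    return labels[nearest]


if __name__ == "__main__":
    # Toy data: dimension 0 tracks quality, dimension 1 is noise.
    vectors = [(0.9, 0.1), (1.0, 0.8), (0.1, 0.2), (0.0, 0.9)]
    labels = ["good", "good", "bad", "bad"]
    w = learn_diagonal_metric(vectors, labels)
    print("weights:", w)
    print("prediction:", classify(w, vectors, labels, (0.8, 0.9)))
```

The point of the sketch is that user feedback decides which embedding dimensions matter: dimensions that separate good from bad responses gain weight, while dimensions that vary within a label lose it.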
Letting users pick a comparison metric of their choice is a good option till something better comes along. Good luck with Vellum!
Please remove that "text-shadow: 8px -9px 0px #ffffff;" for the "hero-title" class. It is possible to use text shadows effectively, but it is very, very easy to use them in ways that are a lot worse than not using them at all.