Hacker News
by noaflaherty 1205 days ago
Thanks! We totally agree that spot-checking won't scale long term. We're currently testing a feature in beta that lets you provide an "expected output" and then choose from a variety of comparison metrics (e.g. exact match, semantic similarity, Levenshtein distance) to derive a quantitative measure of output quality. The jury's still out on whether this is sufficient, but we're excited to continue pushing in this direction.

p.s. it's cool to hear from another company that's helping expand this market!
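
For anyone curious what comparing an output against an "expected output" might look like, here is a minimal sketch of two of the metrics mentioned above (exact match and Levenshtein distance). The function names and the normalization to a 0..1 score are assumptions made for illustration, not Vellum's actual API. Semantic similarity is left out because it needs an embedding model.

```python
def exact_match(expected: str, actual: str) -> float:
    """Return 1.0 if the strings match after trimming whitespace, else 0.0."""
    return 1.0 if expected.strip() == actual.strip() else 0.0


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(expected: str, actual: str) -> float:
    """Hypothetical normalization: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(expected), len(actual))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(expected, actual) / longest
```

Exact match is strict and cheap, while the normalized edit distance gives partial credit for near misses, which is one reason to offer several metrics rather than a single one.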

2 comments

What I think would be really interesting is to apply distance metric learning (DML) to the problem. You have users tell you which responses are good and bad, and use that to learn a metric that will classify responses as good or bad. One of the big challenges is that DML is typically applied to data in some vector space as opposed to strings, but I would expect that using some embedding constructed from the output could work well.
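
To make the idea concrete, here is a rough sketch of one very simple form of DML: learning a per-dimension weight for a squared Euclidean distance with a contrastive loss, so that same-label responses end up close and good/bad pairs end up at least a margin apart. Everything here is an assumption for illustration: the function names, the hyperparameters, and the toy 2-D "embeddings" standing in for real output embeddings. Real DML methods usually learn a full matrix or a neural embedding instead.

```python
from itertools import combinations


def weighted_sq_dist(w, x, y):
    """Squared Euclidean distance with a non-negative weight per dimension."""
    return sum(wk * (xk - yk) ** 2 for wk, xk, yk in zip(w, x, y))


def learn_diagonal_metric(embeddings, labels, margin=1.0, lr=0.1, epochs=200):
    """Fit the weights by gradient descent on a contrastive loss:
    same-label pairs are pulled together, different-label pairs are
    pushed apart until their distance reaches `margin`."""
    dim = len(embeddings[0])
    w = [1.0] * dim
    pairs = list(combinations(range(len(embeddings)), 2))
    for _ in range(epochs):
        grad = [0.0] * dim
        for i, j in pairs:
            x, y = embeddings[i], embeddings[j]
            sq = [(xk - yk) ** 2 for xk, yk in zip(x, y)]
            if labels[i] == labels[j]:
                # Similar pair: loss is the distance itself.
                for k in range(dim):
                    grad[k] += sq[k]
            elif weighted_sq_dist(w, x, y) < margin:
                # Dissimilar pair inside the margin: loss is margin - distance.
                for k in range(dim):
                    grad[k] -= sq[k]
        # Gradient step, then clip so the weights stay non-negative.
        w = [max(0.0, wk - lr * gk) for wk, gk in zip(w, grad)]
    return w


# Toy data (made up): dimension 0 separates good from bad, dimension 1 is noise.
toy_embeddings = [(1.0, 0.0), (0.9, 1.0), (0.0, 0.1), (0.1, 0.9)]
toy_labels = [1, 1, 0, 0]  # 1 = user marked "good", 0 = "bad"
toy_weights = learn_diagonal_metric(toy_embeddings, toy_labels)
```

On this toy data the learned weights should favor the first dimension and shrink the noisy one, so a new response can be judged by its learned distance to known good and bad examples.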
Super interesting idea! We already expose UIs and APIs for supplying feedback on the quality of the output, so this could totally be possible once enough feedback has been collected. Thanks for sharing!
Letting users pick a comparison metric of their choice is a good option till something better comes along. Good luck with Vellum!