Ideally, you want to start small and iterate. With Promptrepo, you can use versioning to compare model outputs across different datasets. In the test UI, we calculate confidence scores using @promptrepo/score [1], which parses OpenAI’s logprobs and shows field-level reliability. Fields with low confidence are highlighted in red, making it easy to catch signs of overfitting or data drift.
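To make the logprobs idea concrete, here is a rough sketch of how a field-level confidence could be derived from token logprobs. It is not the @promptrepo/score API and may not match how that package computes its scores. It assumes the model returned a flat JSON object and that you already have (token, logprob) pairs, such as the `token` and `logprob` values in OpenAI's `logprobs.content` entries. The names `field_confidences` and `low_confidence_fields` and the 0.5 threshold are made up for illustration.

```python
import json
import math


def field_confidences(tokens):
    """Estimate a confidence per field of a flat JSON object.

    tokens: list of (token_text, logprob) pairs whose texts concatenate
    to the JSON output. Confidence is exp(sum of logprobs) over every
    token that overlaps the field's value. Assumption: key names do not
    also appear earlier in the text as part of another value.
    """
    text = "".join(tok for tok, _ in tokens)

    # Record the character span covered by each token.
    spans = []
    pos = 0
    for tok, logprob in tokens:
        spans.append((pos, pos + len(tok), logprob))
        pos += len(tok)

    decoder = json.JSONDecoder()
    result = {}
    for key in json.loads(text):
        key_json = json.dumps(key)
        # The value starts after the colon that follows the quoted key.
        start = text.index(":", text.index(key_json) + len(key_json)) + 1
        while text[start].isspace():
            start += 1
        # raw_decode returns the index where the value ends.
        _, end = decoder.raw_decode(text, start)
        # Boundary tokens that also carry quotes or commas are included.
        total = sum(lp for s, e, lp in spans if s < end and e > start)
        result[key] = math.exp(total)
    return result


def low_confidence_fields(confidences, threshold=0.5):
    """Return the fields a UI might highlight (hypothetical threshold)."""
    return [k for k, v in confidences.items() if v < threshold]


# Hand-written example tokens, not real model output.
example_tokens = [
    ('{"', -0.01), ("name", -0.02), ('":', -0.01), (' "', -0.01),
    ("Ada", -0.05), ('",', -0.01), (' "', -0.01), ("age", -0.02),
    ('":', -0.01), (" ", -0.01), ("36", -1.2), ("}", -0.01),
]
confidences = field_confidences(example_tokens)
flagged = low_confidence_fields(confidences)  # ["age"]
```

With these made-up numbers, `name` comes out around 0.93 and `age` around 0.30, so only `age` would be flagged. Multiplying token probabilities penalizes long values, so a real scorer might normalize by token count instead.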

[1] https://github.com/ManiDoraisamy/promptrepo-score