yeah, we use an LLM for grading the free-form questions.

the flow is basically:

when practice questions are generated, the model generates the question and the reference answer together, but the user only sees the question. then on submit, a smaller model grades the learner's answer against that reference answer plus the grading criteria.
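a rough sketch of that flow, in case it helps. this is not the actual implementation: `PracticeQuestion`, `build_grading_prompt`, `grade`, and the JSON verdict shape are all made-up names for illustration, and `judge` stands in for whatever call reaches the smaller model.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class PracticeQuestion:
    """Hypothetical container; the field names are assumptions."""
    question: str          # shown to the learner
    reference_answer: str  # generated together with the question, never shown
    criteria: str          # grading criteria handed to the judge


def build_grading_prompt(item: PracticeQuestion, learner_answer: str) -> str:
    # The judge sees the reference answer and criteria; the learner never does.
    return (
        "Grade the learner answer against the reference answer.\n"
        f"Question: {item.question}\n"
        f"Reference answer: {item.reference_answer}\n"
        f"Grading criteria: {item.criteria}\n"
        f"Learner answer: {learner_answer}\n"
        'Reply with JSON only: {"correct": true or false, "feedback": "..."}'
    )


def grade(item: PracticeQuestion, learner_answer: str,
          judge: Callable[[str], str]) -> dict:
    """Run the judge and validate its structured output.

    `judge` is a placeholder: any function that takes a prompt string and
    returns the model's raw text reply.
    """
    raw = judge(build_grading_prompt(item, learner_answer))
    failure = {"correct": False, "feedback": "judge output was malformed", "error": True}
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        return failure
    if not isinstance(verdict, dict) or not isinstance(verdict.get("correct"), bool):
        return failure
    return {
        "correct": verdict["correct"],
        "feedback": str(verdict.get("feedback", "")),
        "error": False,
    }
```

passing the judge in as a plain function means the same code can be exercised with a canned reply in tests, and malformed output is flagged as an error rather than silently counted as a wrong answer.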

I benchmarked a bunch of judge models for this on a small multi-subject set, and `gpt-oss-20b` ended up being a very solid sweet spot for quality/speed/structured-output reliability. on one of the internal benchmarks it got ~98.3% accuracy over 60 grading cases, with ~1.6s p50 latency, so it feels fast enough to use live.
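for reference, those two numbers can be computed from a list of benchmark cases roughly like this. it's a minimal sketch: `summarize` and the tuple layout are assumptions, not the actual benchmark harness.

```python
from statistics import median


def summarize(results):
    """Summarize judge benchmark results.

    `results` is a list of (judge_verdict, expected_verdict, latency_seconds)
    tuples; this layout is assumed for illustration.
    """
    if not results:
        raise ValueError("no benchmark cases")
    # A case counts as a hit when the judge's verdict matches the expected one.
    hits = sum(1 for got, expected, _ in results if got == expected)
    latencies = [latency for _, _, latency in results]
    return {
        "cases": len(results),
        "accuracy": hits / len(results),
        "p50_latency_s": median(latencies),  # p50 is the median latency
    }
```

with 60 cases, a single miss works out to 59/60, which rounds to 98.3%.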

for math, it’s not just LLM grading though:

- `SymPy` for LaTeX/math expressions, so if the learner writes an equivalent answer in a different form it still gets marked correct, e.g. `(x+2)(x+3)` and `x^2 + 5x + 6` both pass. (I might remove this one, though, since an LLM could probably replace it, and it's a niche use that adds some maintenance cost.)

- tolerance-based checks for the JSXGraph board state: if you plotted x = 5.2 on the graph instead of 5.3, it's within the margin of error and passes, but you get a message about it
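both checks are small enough to sketch. assumptions: answers arrive as plain-text expressions rather than LaTeX (so SymPy's `parse_expr` is used with its implicit-multiplication and `^` transformations), and the tolerance value of 0.15 is made up. `equivalent` and `check_plotted_value` are hypothetical names, not the real ones.

```python
import math

from sympy import simplify
from sympy.parsing.sympy_parser import (
    convert_xor,
    implicit_multiplication_application,
    parse_expr,
    standard_transformations,
)

# Allow "x^2" for powers and "5x" / "(x+2)(x+3)" for implicit multiplication.
TRANSFORMS = standard_transformations + (
    convert_xor,
    implicit_multiplication_application,
)


def equivalent(learner: str, reference: str) -> bool:
    """True when the two expressions simplify to the same thing.

    Note: parse_expr evaluates its input, so real learner input would need
    sanitizing before it gets here.
    """
    a = parse_expr(learner, transformations=TRANSFORMS)
    b = parse_expr(reference, transformations=TRANSFORMS)
    return simplify(a - b) == 0


def check_plotted_value(plotted: float, expected: float, tolerance: float = 0.15):
    """Return (passed, message); the 0.15 default is an assumed margin."""
    error = abs(plotted - expected)
    if math.isclose(error, 0.0, abs_tol=1e-9):
        return True, None
    if error <= tolerance:
        # Close enough to pass, but tell the learner they were slightly off.
        return True, f"accepted, but the exact value is {expected}"
    return False, f"{plotted} is too far from the expected value"
```

so `equivalent("(x+2)(x+3)", "x^2 + 5x + 6")` passes, and plotting 5.2 against an expected 5.3 passes with a note attached.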

I also tried embedding/similarity checking early on, but it was noticeably worse on tricky answers, so I didn’t use that as the main path.