by askrzypczak
103 days ago
The IRT angle is interesting: most adaptive learning tools just do basic spaced repetition, but using Item Response Theory to estimate ability level in real time is a much more honest approach to "personalized." The JSXGraph integration for gradable math graphs is a nice touch too; that's a hard problem. Quick question: how do you handle subjects where the "right answer" is more ambiguous? Does the LLM grading struggle with open-ended questions outside of math?
The flow is basically:
When practice questions are generated, the model generates the question and the reference answer together, but the learner only sees the question. Then, on submit, a smaller model grades the learner's answer against that reference answer plus the grading criteria.
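A rough sketch of that flow, for illustration only. Every name here (`PracticeItem`, `learner_view`, `build_grading_prompt`, `grade`) is hypothetical, the prompt wording and JSON shape are assumptions, and the judge is passed in as a plain callable so no particular model API is implied:

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PracticeItem:
    question: str
    reference_answer: str  # generated together with the question
    criteria: str          # grading criteria used by the judge


def learner_view(item: PracticeItem) -> dict:
    # Only the question is exposed; the reference answer stays server-side.
    return {"question": item.question}


def build_grading_prompt(item: PracticeItem, learner_answer: str) -> str:
    # Assumed prompt layout; the real wording is not given in the thread.
    return (
        "Grade the learner answer against the reference answer.\n"
        f"Question: {item.question}\n"
        f"Reference answer: {item.reference_answer}\n"
        f"Grading criteria: {item.criteria}\n"
        f"Learner answer: {learner_answer}\n"
        'Reply with JSON only: {"correct": true or false, "feedback": "..."}'
    )


def grade(item: PracticeItem, learner_answer: str,
          judge: Callable[[str], str]) -> dict:
    # `judge` stands in for the smaller grading model: prompt in, text out.
    verdict = json.loads(judge(build_grading_prompt(item, learner_answer)))
    if not isinstance(verdict, dict) or not isinstance(verdict.get("correct"), bool):
        raise ValueError("judge reply is missing a boolean 'correct' field")
    return {
        "correct": verdict["correct"],
        "feedback": str(verdict.get("feedback", "")),
    }
```

Validating the judge's reply before trusting it is one way to deal with the structured-output reliability concern: a malformed reply raises instead of silently marking the answer wrong.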
I benchmarked a bunch of judge models for this on a small multi-subject set, and `gpt-oss-20b` ended up being a very solid sweet spot for quality, speed, and structured-output reliability. On one of the internal benchmarks it got ~98.3% accuracy over 60 grading cases, with ~1.6s p50 latency, so it feels fast enough to use live.
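For reference, 59 correct out of 60 is the only split that rounds to 98.3%. A minimal sketch of how numbers like these could be computed from benchmark runs; the `summarize` helper and the toy rows are assumptions, not the actual benchmark harness or data:

```python
from statistics import median


def summarize(runs):
    """runs: list of (judge_says_correct, expected_correct, latency_seconds)."""
    if not runs:
        raise ValueError("no benchmark runs to summarize")
    hits = sum(1 for got, want, _ in runs if got == want)
    return {
        "accuracy": hits / len(runs),
        # p50 latency is simply the median of the per-case latencies.
        "p50_latency_s": median(latency for _, _, latency in runs),
    }


# Placeholder data, not the real benchmark: 60 cases with one judge mistake.
toy_runs = [(True, True, 1.2 + 0.01 * i) for i in range(59)]
toy_runs.append((True, False, 2.0))
summary = summarize(toy_runs)
```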
For math, it's not just LLM grading, though:
- `SymPy` for LaTeX/math expressions, so if the learner writes an equivalent answer in a different form, it still gets marked correct; for example, `(x+2)(x+3)` and `x^2 + 5x + 6` can both pass. (I might remove that one, though, since an LLM could probably replace it, and it's a niche use that adds some maintenance cost.)
- tolerance-based checks for the JSXGraph board state; so if you plot x = 5.2 on the graph instead of 5.3, it's within the margin of error and passes, but you get a message about it
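Both checks can be sketched in a few lines. This is a minimal illustration, not the project's actual code: it assumes answers arrive as plain-text expressions (real LaTeX input would need a LaTeX parser first), and the helper names, the 0.15 tolerance, and the message wording are all made up for the example:

```python
import math

from sympy import simplify
from sympy.parsing.sympy_parser import (
    convert_xor,
    implicit_multiplication_application,
    parse_expr,
    standard_transformations,
)

# Treat "^" as exponentiation and accept "5x" or "(x+2)(x+3)" as products.
_TRANSFORMS = standard_transformations + (
    convert_xor,
    implicit_multiplication_application,
)


def same_expression(learner: str, reference: str) -> bool:
    """True when the two expressions are symbolically equal."""
    a = parse_expr(learner, transformations=_TRANSFORMS)
    b = parse_expr(reference, transformations=_TRANSFORMS)
    # Equal expressions have a difference that simplifies to zero.
    return simplify(a - b) == 0


def check_plotted_value(plotted: float, expected: float,
                        tolerance: float = 0.15) -> tuple[bool, str]:
    """Return (passed, message) for one coordinate read from the board."""
    error = abs(plotted - expected)
    if math.isclose(plotted, expected, abs_tol=1e-9):
        return True, "Exact."
    if error <= tolerance:
        # Close enough to pass, but tell the learner they were slightly off.
        return True, f"Accepted: off by {error:.2f}, expected {expected}."
    return False, f"Off by {error:.2f}, outside the allowed {tolerance}."
```

With these, `same_expression("(x+2)(x+3)", "x^2 + 5x + 6")` passes, and `check_plotted_value(5.2, 5.3)` passes with a note about the difference.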
I also tried embedding/similarity checking early on, but it was noticeably worse on tricky answers, so I didn’t use that as the main path.