by numpad0 333 days ago
PSA: models confusingly named "$1-distill-$2" (sometimes without the "-distill") are $2 trained on the outputs of $1, a process referred to as "distillation". They are not the other way around, and they are not the real $1.

The article lists nonexistent configurations such as "Deepseek-R1 1.5B"; those are distilled models of that kind, not Deepseek-R1 itself.
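
To make the naming rule concrete, here is a rough Python sketch that splits a "$1-distill-$2" name into its two halves. The helper name `parse_distill_name` is made up for illustration, and the sample name "DeepSeek-R1-Distill-Qwen-1.5B" is my own example of the pattern, not something taken from the article. Names that drop the "-distill" part can't be detected this way, so the function just returns None for them.

```python
import re

# Hypothetical helper, for illustration only: matches "<teacher>-distill-<student>".
# The lazy ".+?" makes the teacher part stop at the first "-distill-".
_DISTILL_NAME = re.compile(
    r"^(?P<teacher>.+?)-distill-(?P<student>.+)$", re.IGNORECASE
)


def parse_distill_name(name):
    """Return (teacher, student) if the name follows the pattern, else None.

    teacher = $1, the model whose outputs were used as training data.
    student = $2, the model that was actually trained and is being shipped.
    """
    match = _DISTILL_NAME.match(name)
    if match is None:
        # No "-distill-" in the name, so the name alone tells us nothing.
        return None
    return match.group("teacher"), match.group("student")


if __name__ == "__main__":
    # Assumed example name that follows the "$1-distill-$2" pattern.
    print(parse_distill_name("DeepSeek-R1-Distill-Qwen-1.5B"))
    # A name without "-distill" cannot be split by this sketch.
    print(parse_distill_name("Deepseek-R1 1.5B"))
```

The second element of the tuple is the model you are actually running; the first only says where the training data came from.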