by a1j9o94 63 days ago
You would only use the base model during training. This is a distillation technique.
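
To make that concrete, here is a rough sketch of what a distillation setup generally looks like, assuming the base model plays the usual "teacher" role and a separate "student" model is what gets trained and deployed. The function names, the temperature value, and the KL-divergence loss are illustrative assumptions on my part (it is the common textbook formulation), not details taken from the project being discussed.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Hypothetical helper: the teacher (base model) output is needed here,
    and only here, i.e. while training the student.
    """
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student's current prediction
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def predict(student_logits):
    """Inference path: uses the student's logits only, no teacher involved."""
    return max(range(len(student_logits)), key=lambda i: student_logits[i])
```

The point is just that `teacher_logits` shows up in the loss and nowhere in `predict`, so once training is done the base model can be dropped.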