Hacker News new | ask | show | jobs
by joemoon 998 days ago
It's definitely a very similar method but fundamentally different in that the 'Distilling step-by-step' approach is a multi-task model.

As I understand it, rather that training the smaller model to produce the CoT/rationale as a part of the (decoded) response, it actually has two output layers. One output is for the label and the other is for the rationale. The other layers are shared, which is how/why the model is able to have an improved "understanding" of which nuances matter in the labeling task.