|
|
|
|
|
by joemoon
998 days ago
|
|
It's definitely a very similar method but fundamentally different in that the 'Distilling step-by-step' approach is a multi-task model. As I understand it, rather that training the smaller model to produce the CoT/rationale as a part of the (decoded) response, it actually has two output layers. One output is for the label and the other is for the rationale. The other layers are shared, which is how/why the model is able to have an improved "understanding" of which nuances matter in the labeling task. |
|