Hacker News
by gwern 757 days ago
Yes. This has already been demonstrated by "Teaching Arithmetic to Small Transformers" ( https://arxiv.org/abs/2307.03381 ). I'm not sure what OP adds beyond demonstrating that you can do it via the embedding itself rather than the tokenization.

> We first demonstrate that conventional training data is not the most effective for arithmetic learning, and simple formatting changes can significantly improve accuracy. This leads to sharp phase transitions as a function of training data scale, which, in some cases, can be explained through connections to low-rank matrix completion. Building on prior work, we then train on chain-of-thought style data that includes intermediate step results. Even in the complete absence of pretraining, this approach significantly and simultaneously improves accuracy, sample complexity, and convergence speed. We also study the interplay between arithmetic and text data during training and examine the effects of few-shot prompting, pretraining, and model scale. Additionally, we discuss length generalization challenges.
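To make "simple formatting changes" and "chain-of-thought style data" concrete, here is a rough sketch of what such training samples could look like for addition. The function names and the exact string layouts are my own illustrations, not the paper's formats: `plain` is the conventional sample, `reversed_answer` writes the answer least significant digit first (an assumed example of a formatting change), and `scratchpad` spells out each digit sum and carry as an intermediate step.

```python
def plain(a: int, b: int) -> str:
    """Conventional sample: answer written most significant digit first."""
    return f"{a}+{b}={a + b}"


def reversed_answer(a: int, b: int) -> str:
    """Same sample with the answer's digits reversed (assumed format)."""
    return f"{a}+{b}={str(a + b)[::-1]}"


def scratchpad(a: int, b: int) -> str:
    """Chain-of-thought style sample listing each digit sum and carry.

    The layout is hypothetical; it only illustrates intermediate steps.
    """
    xs, ys = str(a)[::-1], str(b)[::-1]  # least significant digit first
    carry, digits, steps = 0, [], []
    for i in range(max(len(xs), len(ys))):
        x = int(xs[i]) if i < len(xs) else 0
        y = int(ys[i]) if i < len(ys) else 0
        total = x + y + carry
        steps.append(f"{x}+{y}+{carry}={total}")
        digits.append(str(total % 10))
        carry = total // 10
    if carry:
        digits.append(str(carry))
    answer = "".join(reversed(digits))
    return f"{a}+{b}: " + " ; ".join(steps) + f" => {answer}"


if __name__ == "__main__":
    for fmt in (plain, reversed_answer, scratchpad):
        print(fmt(128, 367))
```

The intuition for the reversed form is that the model emits digits in the same order the carries propagate, so each output token depends only on digits it has already seen or produced.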