by nikvaes
803 days ago
After trying to understand and implement some algorithms in RASP [1, 2], my takeaway was that certain functions need a certain number of transformer layers to operate. Following this logic, it seems likely that the functions learned by transformers can be spread over multiple heads. Repeating these functions might be very valuable for understanding and solving a problem, but current inference does not allow a head (or a set of subsequent heads) to be repeated. This paper does seem like a promising direction.

[1] https://arxiv.org/pdf/2106.06981.pdf
[2] https://www.youtube.com/watch?v=t5LjgczaS80