by mapleshamrock
239 days ago
Couldn't you do something like add a bidirectional encoder after your embedding lookup table to compress your text into some smaller token-count semantic space before feeding your transformer blocks, to get a similar effect?

This sort of "dynamic chunking" of low-level information, perhaps down to the level of raw bytes, into shorter sequences of meta tokens for input to some big sequence processing model is an active area of research. E.g., one neat paper exploring this direction is "Dynamic Chunking for End-to-End Hierarchical Sequence Modeling" [1], from one of the main guys behind Mamba and other major advances in state-space models.
[1] - https://arxiv.org/abs/2507.07955
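
To make the general idea concrete, here is a toy sketch of chunking a sequence of low-level vectors into fewer meta tokens. It is my own illustration, not the method from [1]: the boundary rule (start a new chunk when the cosine similarity between neighbouring vectors drops below a threshold), the mean pooling, and the names `cosine`, `dynamic_chunk` and `threshold` are all assumptions made for the example. A real system would learn the boundaries end to end rather than use a fixed threshold.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dynamic_chunk(vectors, threshold=0.5):
    """Compress a sequence of vectors into a shorter sequence of meta tokens.

    Hypothetical rule: a new chunk starts whenever a vector's similarity to
    its predecessor falls below `threshold`. Each chunk is then mean-pooled
    into a single vector.
    """
    if not vectors:
        return []
    chunks = [[vectors[0]]]
    for prev, cur in zip(vectors, vectors[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append([cur])      # dissimilar neighbour: open a new chunk
        else:
            chunks[-1].append(cur)    # similar neighbour: extend current chunk
    # Mean-pool every chunk, column by column.
    return [[sum(col) / len(chunk) for col in zip(*chunk)] for chunk in chunks]


# Five made-up "byte embeddings" that fall into three runs of similar vectors.
toy_vectors = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0], [1.0, 0.0]]
meta_tokens = dynamic_chunk(toy_vectors)
print(len(toy_vectors), "->", len(meta_tokens))  # 5 -> 3
```

The downstream model would then see `meta_tokens` instead of the full-length input, which is where the sequence-length savings come from.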