|
|
|
|
|
by achr2
238 days ago
|
|
My assumption is that there could be higher dimensional 'token' representations that can be used instead of vision tokens (though obviously not human interpretable). I wonder if there is an optimisation for compression that takes the vector space at a specific level in the network to provide the most context for minimal memory space. |
|