by vjsrinivas
100 days ago
Great work, and I love the detailed breakdown. This is kind of tangential, but it reminded me of this work: https://arxiv.org/pdf/2310.12973 (Frozen Transformers in Language Models Are Effective Visual Encoder Layers). The paper puts forward an interesting hypothesis: that these LLM-derived transformer layers can "refine" any set of learned tokens, even in different modalities. I wonder if what you're seeing here is related?