by nemjack
381 days ago
I don't think you're quite right. The author is arguing that images and text should not be processed differently at any point. Current early fusion approaches are close, but they still treat modalities differently at the level of tokenization. If I understand correctly, he would advocate for something like rendering text and processing it as if it were an image, along with other natural images. Also, I would counter and say that there is some actionable information, but it's pretty abstract. In terms of uniting modalities, he is bullish on tapping human intuition and structuralism, which should give people pointers to actual books for inspiration. In terms of modifying the learning regime, he's suggesting something like an agent-environment RL loop, not a generative model, as a blueprint. There's definitely stuff to work with here. It's not totally mature, but not at all directionless.
|
On the ‘we need to do an RL loop rather than a generative model’ point: I’d say this is the consensus position today!