by jeeceebees 1734 days ago
There is a lot of evidence that these token-based models work with multi-modal data. In fact, several groups have already proposed multi-modal transformer architectures (e.g. [1] or [2]), although I don't believe anyone has scaled them up much beyond 300M parameters yet.

If these models are shown videos of butterflies flapping their wings alongside the text description 'a butterfly flapping its wings,' why wouldn't you expect them to start relating the information coming from the different modalities?
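
To make the "token-based" part concrete, here is a toy sketch of how a video clip and its caption could end up in one token sequence that a single transformer attends over. This is only an illustration under my own assumptions, not the actual VATT or Perceiver IO pipeline: the `Token` class, `patchify`, `build_sequence`, the 2x2 patch size, and the whitespace tokenizer are all hypothetical stand-ins, and a real model would project each token into a learned embedding space.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    modality: str   # "video" or "text" (hypothetical modality tag)
    position: int   # index of the token within its own modality
    value: tuple    # raw patch pixels or a single word; a real model would embed this


def patchify(frame, patch):
    """Split a 2D frame (list of rows) into flattened, non-overlapping patches."""
    height, width = len(frame), len(frame[0])
    patches = []
    for r in range(0, height - patch + 1, patch):
        for c in range(0, width - patch + 1, patch):
            patches.append(
                tuple(frame[r + i][c + j] for i in range(patch) for j in range(patch))
            )
    return patches


def build_sequence(frames, caption, patch=2):
    """Concatenate video-patch tokens and caption tokens into one sequence."""
    tokens = []
    video_pos = 0
    for frame in frames:
        for pixels in patchify(frame, patch):
            tokens.append(Token("video", video_pos, pixels))
            video_pos += 1
    # Toy whitespace tokenizer; an assumption, not what any real model uses.
    for text_pos, word in enumerate(caption.lower().split()):
        tokens.append(Token("text", text_pos, (word,)))
    return tokens


# Two tiny 4x4 "frames" paired with the caption from the example above.
frame = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
seq = build_sequence([frame, frame], "a butterfly flapping its wings")
```

Since both modalities sit in the same sequence, self-attention can in principle relate any video patch to any caption word, which is the mechanism the question above is pointing at.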

It's definitely a challenge to get enough high-quality data to feed a 100B parameter version of such a multi-modal model, but there don't seem to be any theoretically insurmountable obstacles to this "dumb" way of giving the models more intuition.

[1] VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text, https://arxiv.org/abs/2104.11178

[2] Perceiver IO: A General Architecture for Structured Inputs & Outputs, https://arxiv.org/abs/2107.14795