Language Is Not All You Need: Aligning Perception with Language Models (arxiv.org)
87 points by craftsquick 1211 days ago
10 comments

It's not exactly clear from the paper how they've set up the training, but it appears this model has an aspect which uses a secondary model to represent images as vectors, combines them with their text captions, and then uses those text representations along with the image vectors to train the LLM. I will leave aside the question of whether a 1024-dimensional image vector and its text caption are "images".
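To make that reading concrete, here is a rough sketch of the setup as I understand it. Everything in it is a toy stand-in: `encode_image` and `embed_token` are hypothetical placeholders (not functions from the paper or the unilm repo), and the dimension is shrunk from 1024 to 4. It only shows how an image vector could sit in the same input sequence as the caption's token embeddings.

```python
from typing import List

DIM = 4  # toy size; the comment above mentions 1024-dimensional image vectors


def encode_image(pixels: List[float]) -> List[float]:
    """Hypothetical stand-in for the secondary image model: average-pool pixels into DIM values."""
    chunk = max(1, len(pixels) // DIM)
    return [sum(pixels[i * chunk:(i + 1) * chunk]) / chunk for i in range(DIM)]


def embed_token(token: str) -> List[float]:
    """Hypothetical stand-in for a learned token embedding table (deterministic toy values)."""
    total = sum(ord(c) for c in token)
    return [(total * (k + 1)) % 97 / 97.0 for k in range(DIM)]


def build_input_sequence(pixels: List[float], caption: str) -> List[List[float]]:
    """Place the image vector in front of the caption's token embeddings."""
    image_vec = encode_image(pixels)
    text_vecs = [embed_token(t) for t in caption.split()]
    return [image_vec] + text_vecs


seq = build_input_sequence([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8], "a robot on a hill")
print(len(seq), len(seq[0]))  # sequence length and per-position dimension
```

In this reading the LLM never sees pixels, only one more vector in its input sequence, which is what the question about whether these count as "images" is getting at.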

What's interesting is that it seems to actually lose information: asking it to identify the studio that made WALL-E is beyond its capabilities, while asking it to describe the image (i.e. regenerating something closer to what was fed into it) and then processing that text is successful.

The "chain-of-thought" trick in LLMs I suspect underestimates the extent to which the interviewer is carrying water for the LLM's "reasoning" ability. The interviewer has a sense of what answer they want and will ask questions that produce further results that more easily prime the model to produce it. Reasoning supposes that these steps are carried out internally, but we see claims being made of reasoning when there is an external intelligence essentially directing the generation and combination of facts.

Another curious aspect is the flattening of 2D IQ test questions into linear format, which of course misses the point of the question in being able to reason spatially instead of linearly.

They link to this repo at the top of the paper https://github.com/microsoft/unilm although I don't see anything that actually says "Kosmos".

Slightly interestingly, the last commit changes a heading from "AI" to "AGI" https://github.com/microsoft/unilm/commit/bbbb5b4b06c2dd501d...

The model is relatively small, 1.6B. I am guessing the goal is to be able to run on a user's home PC. But it would be interesting to see how much better it gets if you scale it up by a factor of 10 or 100.

Multimodal transformers are the logical leap forward. These ideas are not new; DeepMind and Google have been at this for some time. It’s yet another arms race, and those with the most diverse data will win.
The chain-of-thought prompting in section 4.5 is extremely interesting to me, but it looks like they're missing a test group: what is the performance if the image is simply described and the task is then evaluated using only the text of the description, rather than the description combined with the image?
Finally the Metaverse is taking shape. Those human friends on Facebook can now be replaced with Metafriends
The funny thing is, something like the Metaverse might actually be helpful to construct multimodal models - it could provide higher fidelity training data.
I think the models would be too biased if the training data doesn't have legs.
Meta's whole business agenda is to collect data
Can someone ELI5 why LLMs and transformers suddenly became all the rage in the AI scene?
* 2017 - the Attention Is All You Need paper proposes transformers, suggests language translation as their primary application

* 2019 - GPT-2 is presented to the public and demonstrates transformer-based LLMs and their emergent capabilities

* 2020 - GPT-3 is released and shows that throwing more compute at LLMs yields significantly better LLMs

* 2022 - ChatGPT is openly released to the public, showcasing the versatility of an LLM-based chatbot

In my experience transformers have been all the rage in the researcher/enthusiast scene since 2019. The technology has just gradually matured enough to become viable for consumer use, which is why you see the industry rushing to adopt it. ChatGPT was the watershed moment for the tech because suddenly anyone in the world could sign up for free, open a chat dialogue and start getting legible LLM output without needing to understand the tech or prompt engineering.

Not to mention without needing expertise to deploy the thing
The technology has been a while coming. Language models have long been a research area within machine learning, with recurrent models such as vanilla RNNs and LSTMs being an earlier approach, since they allow the model to process a (language) sequence of arbitrary length.

Problems/limitations of recurrent models led to other approaches being tried, using "attention" as a way to let earlier parts of a sequence impact future prediction, culminating in the 2017 "Attention is all you need" paper, which introduced the "Transformer" architecture that all these current LLMs are based on.
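For anyone who wants to see what "attention" amounts to, here is a minimal sketch of scaled dot-product attention with a causal mask, in plain Python. It is a simplification under stated assumptions: a single head, no learned query/key/value projections, and the input vectors are reused directly as queries, keys and values. Real Transformer layers add all of those.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def causal_attention(queries: List[Vector], keys: List[Vector], values: List[Vector]) -> List[Vector]:
    """Each position mixes the values at itself and earlier positions, weighted by query-key similarity."""
    d = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        # Causal mask: position i only looks at positions 0..i.
        scores = [dot(q, keys[j]) / math.sqrt(d) for j in range(i + 1)]
        weights = softmax(scores)
        out = [sum(w * values[j][k] for j, w in enumerate(weights))
               for k in range(len(values[0]))]
        outputs.append(out)
    return outputs


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy "token" vectors (made up)
for row in causal_attention(x, x, x):
    print(row)
```

The point is that every earlier position can influence the current one directly through its weight, instead of having to survive a chain of recurrent state updates.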

From there it was a matter of scale - scaling up the model and amount of data the models were trained on. Nobody knew how well this "Transformer" architecture could perform at scale, but early signs were promising enough to keep pushing to see how much better they could get. OpenAI in particular have been very aggressive in pushing this scaling up with their GPT-N (N=1/2/3..) models. They themselves expressed some surprise at the capabilities of GPT-2, leading to the much larger GPT-3 that is the basis of ChatGPT.

Both OpenAI and others had been leery of publicly releasing these very capable LLMs for fear of ways they might be misused, but finally OpenAI released GPT-3 (with a bit of human feedback polish) in the guise of the chat bot ChatGPT, which was the first time the public had seen what the tech was capable of.

The sudden impact of ChatGPT belies the incremental improvements that brought us to this point. It seems to have been largely because the public had never seen or experienced the steps that got us here, partly because of the highly accessible packaging of the tech as a web-based chat bot, and perhaps partly because it was released without much explanation from OpenAI as to what it was or how it works. They seem quite happy for the public to do what they've done and anthropomorphise it as being an AI assistant.

Transformers aren't really a wonderful architecture in the sense of a great fit between the architecture and what we know about the task. (For comparison, I think convolutional networks are.)

What makes Transformers great is:

1. Can handle long sequences without a large increase in the number of parameters to be trained.

2. Parallelize better than previous sequence models, i.e. LSTMs. If we could train LSTMs of the size and with the same training data size as current Transformers, they'd probably be just as good.
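A toy illustration of the parallelization point, with scalars instead of vectors. The weights `w` and `u` are made up, and the attention here is a stripped-down stand-in rather than a real Transformer layer. The recurrent state at step t needs the state from step t-1, so the loop has to run in order; each attention output depends only on the inputs, so positions can be computed in any order (or all at once on parallel hardware).

```python
import math
from typing import List


def recurrent_states(xs: List[float], w: float = 0.5, u: float = 1.0) -> List[float]:
    """Simple recurrence: each state depends on the previous state, so steps cannot be reordered."""
    h = 0.0
    states = []
    for x in xs:
        h = math.tanh(w * h + u * x)
        states.append(h)
    return states


def attention_output(xs: List[float], i: int) -> float:
    """Scalar causal attention for position i: depends only on the inputs xs[0..i]."""
    scores = [xs[i] * xs[j] for j in range(i + 1)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e / total * xs[j] for j, e in enumerate(exps))


xs = [0.2, -1.0, 0.7, 0.3]

# Positions computed front to back and back to front give identical results.
in_order = [attention_output(xs, i) for i in range(len(xs))]
back_to_front = {i: attention_output(xs, i) for i in reversed(range(len(xs)))}

print(in_order == [back_to_front[i] for i in range(len(xs))])
print(recurrent_states(xs))
```

That independence between positions during training is what lets the work be spread across a GPU, whereas the recurrence serializes it along the sequence.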

So maybe RWKV [1] is the next step. It parallelizes even better and seems to have no sequence limit.

[1] https://github.com/BlinkDL/RWKV-LM

Transformers were used on text to train models without needing labeled data. People realized that simply scaling the data and models meant better performance. When they scaled even further, emergent intelligence started appearing and the models were dominating every known task. Now everyone wants an LLM.
> and the models were dominating every known task

I don't think this is very accurate. How well does an LLM perform on image segmentation, for example?

I meant transformers more than LLMs. Models trained in a self-supervised way, like DINO, are actually pretty good at image segmentation.
Transformers are currently state of the art, i.e. the problems that transformers are currently solving can't be solved better by any other known technique/algorithm. That's the main reason why they are so popular at the moment. LLMs are popular right now because a recent transformer-based neural network has proven to be fun/useful, i.e. GPT-3/ChatGPT. A lot more useful than previous language models at least.
RWKV is showing that maybe RNNs can perform on par with transformers

https://github.com/BlinkDL/RWKV-LM

I believe it's because they became more capable
Because we discovered that LLMs are capable of emergent understanding and reasoning.
The chain-of-thought works quite like the "System 2" introduced in "Thinking, Fast and Slow", which is slower, more deliberative, and more logical.
Lately, the words "all you need" is all you need for a paper :p
Perception is not all you need. To have effective AI comparable to HI you need a motivation system comparable to the human sex drive.
The first image that popped to mind was the "squids" from The Matrix.