Hacker News
by lsy 1211 days ago
It's not exactly clear from the paper how they've set up the training, but it appears the model has a component that uses a secondary model to represent images as vectors, pairs those vectors with the images' text captions, and then trains the LLM on the caption text together with the image vectors. I will leave aside the question of whether a 1024-dimensional image vector plus its text caption counts as "images".

What's interesting is that it seems to actually lose information: asking it to identify the studio that made WALL-E is beyond its capabilities, while asking it to describe the image (i.e. regenerating something closer to what was fed into it) and then having it work from that text is successful.

I suspect accounts of the "chain-of-thought" trick in LLMs underestimate the extent to which the interviewer is carrying water for the LLM's "reasoning" ability. The interviewer has a sense of what answer they want and will ask questions whose results more easily prime the model to produce it. Reasoning presupposes that these steps are carried out internally, yet reasoning is being claimed in cases where an external intelligence is essentially directing the generation and combination of facts.

Another curious aspect is the flattening of 2D IQ test questions into a linear format, which of course misses the point of such questions: testing the ability to reason spatially rather than linearly.
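
To make the flattening point concrete, here is a rough sketch. It assumes a plain row-major flattening and a made-up 3x3 grid of cell labels; I don't know how the paper actually serializes its puzzles, so treat the grid, the function names, and the ordering as illustrative only. It shows that cells which are vertical neighbours in the grid end up further apart in the flat sequence than cells that merely sit on either side of a row break.

```python
# Hypothetical 3x3 puzzle grid; the labels are invented for illustration
# and are not taken from the paper.
grid = [
    ["a", "b", "c"],
    ["d", "e", "f"],
    ["g", "h", "?"],
]


def flatten(grid):
    """Row-major flattening: concatenate the rows into one sequence."""
    return [cell for row in grid for cell in row]


def linear_distance(grid, cell_a, cell_b):
    """Number of positions separating two cells in the flattened sequence."""
    flat = flatten(grid)
    return abs(flat.index(cell_a) - flat.index(cell_b))


# "c" sits directly above "f" in the grid (vertical neighbours).
vertical = linear_distance(grid, "c", "f")

# "c" and "d" are at opposite ends of different rows, yet they become
# adjacent once the rows are concatenated.
wrapped = linear_distance(grid, "c", "d")

print(flatten(grid))
print("c-f distance in sequence:", vertical)
print("c-d distance in sequence:", wrapped)
```

The column relationship is still recoverable if you know the row width, but it is no longer local in the input, which is the property a spatial puzzle is supposed to be probing.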