| I don't like how most LLM explainer articles and videos say that an LLM essentially "predicts the next word". I'm a developer, but I'm not very good at maths, and I still don't understand any of it. An LLM clearly has some "visual" capacity. You ask Gemini to build something with Canvas and it's able to reason about the shape of things. For example, I recently wanted a checkbox with a gradient flowing around its edge. It figured out it could use a radial gradient from the center of the checkbox and overlay that with a small inner div, so you only see the edge and the gradient looks like it is circling around the checkbox. How is that "predicting the next word"? I'm not saying AI is intelligent or conscious or anything like that, but the algorithm is clearly far more complex than "predicting words". What I mean is that the LLM is able to represent things in space. That's the part I don't understand. I also still don't understand the relationship between the chat-based LLM and the multimodal stuff. I think I read somewhere that when an image is generated, it is also made of tokens? |