by united893 1249 days ago
The response takes a long time to generate. The user could either sit there staring at a blank response, or start reading in real time as the response is generated.

I find it surprising that you can display any of it before the whole thing is done, since I would expect information dependencies between the start and the finish of a sentence or paragraph. I have yet to really look into how these models work; they are black boxes to me.
From what I understand, these models generate the response one word at a time. Every time you see a new word appear at the end, the model is taking into consideration the entire chat history + its own answer so far to generate that next token.
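To make that loop concrete, here is a rough sketch of token-by-token generation with streaming. The lookup table `TOY_NEXT` and the helpers `next_token` and `generate` are made up for illustration; a real model would score the next token from the whole context rather than from the last token alone.

```python
from typing import Iterator, List

# Hypothetical stand-in for a language model: a fixed lookup keyed on the
# last token only. A real model would condition on the entire context.
TOY_NEXT = {"<start>": "the", "the": "cat", "cat": "sat", "sat": "<end>"}


def next_token(context: List[str]) -> str:
    """Pick the next token given everything seen and generated so far."""
    return TOY_NEXT.get(context[-1], "<end>")


def generate(prompt: List[str], max_tokens: int = 10) -> Iterator[str]:
    """Yield tokens one at a time so the caller can display them at once."""
    context = list(prompt)
    for _ in range(max_tokens):
        token = next_token(context)
        if token == "<end>":
            return
        context.append(token)  # the new token becomes part of the next input
        yield token


def stream(prompt: List[str]) -> None:
    """Print each token as soon as it is produced, like a chat UI would."""
    for token in generate(prompt):
        print(token, end=" ", flush=True)
    print()
```

Since each token is final once it is emitted, nothing stops the UI from showing it right away; waiting for the full answer would only add delay.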
Thanks for the comment, that's so fascinating since it seems to put limitations on thinking in general. A human, for example, can imagine future possibilities concurrently while speaking and correct themselves as they go.

It doesn't seem to map well to how I put together a thought either, but admittedly I wouldn't really know how the mechanics of my brain do it. Maybe it's not so different, just with some auxiliary modules bolted on, ha.

Check out the illustrated transformer: https://jalammar.github.io/illustrated-transformer/

tl;dr: It decodes the output one word at a time, but at each step it can focus on any mix of words from the input via the attention mechanism. So in GPT, output token n can't depend on future output token n+1, but it can attend to any of the input tokens.
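A small sketch of that "can't look at future tokens" rule, assuming the prompt and the output are treated as one sequence and that `scores` is an already-computed square matrix of raw attention scores (the function name `causal_attention_weights` is made up here). Row i is softmaxed over positions 0..i only, and every later position gets weight zero.

```python
import math
from typing import List


def causal_attention_weights(scores: List[List[float]]) -> List[List[float]]:
    """Softmax each row over the current and earlier positions only."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]  # position i may attend to 0..i
        peak = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        masked = [0.0] * (len(row) - i - 1)  # future positions get no weight
        weights.append([e / total for e in exps] + masked)
    return weights
```

However large the raw score for a future position is, it never affects the result, which is why earlier tokens never need to be revised once later ones appear.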

I did not expect that. When iterating with smaller models like nanoGPT, even though the output is one token at a time, it did not feel like it would take half a second between each of them, but I guess that's what happens with billion-parameter models.