by criemen
358 days ago
Thanks for writing the article! I didn't quite get this part: "Note that during the prefill phase, all prompt tokens from a request can be processed in one batch. This is possible because the query (Q) tensors, calculated from the tokens immediately before them, are available for each prompt token position." I know that in practice prefill is much faster than inference. Would watching the 2h video from Karpathy help me understand why?
For decode, by contrast, you have to generate tokens one at a time, because the input to each step is the token produced by the step before it.
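Here is a toy sketch of the difference, in case it helps. It assumes a single attention head in NumPy with random weights; `Wq`, `Wk`, `Wv`, `prefill`, and `decode_step` are made-up names for illustration, not anything from the article or a real library. The prompt is fully known up front, so Q, K, and V for every position can be computed in one matrix multiply with a causal mask. During generation, the next input does not exist until the previous step has finished.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy model width (assumption for illustration)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def prefill(X):
    """Causal attention over all prompt positions in one batched pass."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # one matmul each, all positions
    scores = Q @ K.T / np.sqrt(d)
    future = np.triu(np.ones((len(X), len(X)), dtype=bool), k=1)
    scores[future] = -np.inf                    # a position cannot see later ones
    return softmax(scores) @ V, K, V

def decode_step(x, K_cache, V_cache):
    """Attention for a single new token, reusing the cached K and V."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    K = np.vstack([K_cache, k])
    V = np.vstack([V_cache, v])
    return softmax(q @ K.T / np.sqrt(d)) @ V, K, V

prompt = rng.standard_normal((5, d))            # 5 prompt token embeddings
batched_out, K, V = prefill(prompt)

# Feeding the same prompt one token at a time gives the same result,
# it just takes 5 small steps instead of 1 large one.
K_seq, V_seq = np.empty((0, d)), np.empty((0, d))
seq_out = []
for x in prompt:
    out, K_seq, V_seq = decode_step(x, K_seq, V_seq)
    seq_out.append(out)
seq_out = np.stack(seq_out)
print(np.allclose(batched_out, seq_out))

# Generation: each step's input is the previous step's output (a stand-in
# for "sample a token, then embed it"), so the steps cannot be batched.
K_gen, V_gen = K, V
x = batched_out[-1]
for _ in range(3):
    x, K_gen, V_gen = decode_step(x, K_gen, V_gen)
```

The batched and one-at-a-time passes over the prompt should agree up to floating-point error. The speed difference comes from the batched version being a few large matrix multiplies, which GPUs handle well, while the generation loop is forced into many small sequential ones.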