by menaerus
483 days ago
I am not gonna downvote you, but you will need to find your manners. People around you, and most certainly your colleagues, will be grateful for that. Perhaps also learn to deal with arguments and different opinions without labeling them "plain wrong" in advance. Give people around you the benefit of the doubt.

> Also in this case I admit I failed to have a theory on why your number is so off because giving out prefill numbers and claiming it's decode isn't in my book.

Maybe it's because it is not off? It's not terribly difficult to sum up all the matmul calculations and the number of bytes one needs to load and store per layer in self-attention. My number could be off a bit, but it is certainly not terribly off.
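For anyone who wants to redo the per-layer sum themselves, here is a rough sketch of that bookkeeping for one self-attention layer. It assumes Llama 3 8B's published config (hidden size 4096, 32 query heads, 8 KV heads with grouped-query attention, head dim 128) and fp16/bf16 at 2 bytes per element. The helper names `matmul_cost` and `attention_layer_cost` are hypothetical, and the simplifications are my assumptions too: no softmax/RoPE/norm FLOPs, no causal-mask savings, and every operand read or written exactly once, as in a fused kernel.

```python
# Rough per-layer self-attention cost (assumed Llama 3 8B config, fp16/bf16).
HIDDEN = 4096                   # model dimension (assumed)
N_HEADS = 32                    # query heads (assumed)
N_KV_HEADS = 8                  # key/value heads, grouped-query attention (assumed)
HEAD_DIM = HIDDEN // N_HEADS    # 128
KV_DIM = N_KV_HEADS * HEAD_DIM  # 1024
BYTES = 2                       # bytes per fp16/bf16 element


def matmul_cost(n, k, m, bytes_per_el=BYTES):
    """FLOPs and bytes moved for (n x k) @ (k x m), each operand touched once."""
    flops = 2 * n * k * m
    traffic = bytes_per_el * (n * k + k * m + n * m)
    return flops, traffic


def attention_layer_cost(new_tokens, context_len):
    """Sum FLOPs and bytes for one self-attention layer.

    new_tokens:  tokens processed in this step (prompt length for prefill, 1 for decode).
    context_len: tokens attended to, i.e. the KV-cache length including new tokens.
    """
    flops = traffic = 0
    # Q, K, V, O projections: weight matmuls shared by all new tokens.
    for k, m in [(HIDDEN, HIDDEN), (HIDDEN, KV_DIM), (HIDDEN, KV_DIM), (HIDDEN, HIDDEN)]:
        f, t = matmul_cost(new_tokens, k, m)
        flops += f
        traffic += t
    # QK^T plus the weighted sum over V, for every query head.
    flops += N_HEADS * 2 * (2 * new_tokens * HEAD_DIM * context_len)
    # K and V cache read once per layer (shared within each GQA group),
    # plus reading Q and writing the attention output.
    traffic += BYTES * 2 * context_len * KV_DIM
    traffic += BYTES * 2 * new_tokens * HIDDEN
    return flops, traffic


for label, n, ctx in [("prefill", 1000, 1000), ("decode", 1, 1000)]:
    f, t = attention_layer_cost(n, ctx)
    print(f"{label}: {f / 1e9:.2f} GFLOP, {t / 1e6:.1f} MB, {f / t:.0f} FLOP/byte")
```

The FLOPs scale with the number of new tokens while the weight bytes do not, which is why the same layer lands at very different FLOP/byte ratios for 1000 prompt tokens versus a single decoded token.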
> different opinions
I wouldn't argue with you so hard if these were just your "opinions". What you described is not an opinion. And facts can be wrong. Plainly wrong.
> Maybe it's because it is not off?
Yeah, as I said earlier, your number might be correct as an estimate for prefilling 1000 tokens on Llama 3 8B. That's not what everybody here has been calling "decode". Your number shows that prefill is compute-bound. So what?
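The prefill vs. decode point can be checked with a back-of-the-envelope roofline. This sketch uses a weights-only model: each weight is read once per forward pass (2 bytes) and contributes one multiply-add (2 FLOPs) per token, so arithmetic intensity is roughly equal to the number of tokens in the pass. The round 8e9 parameter count and the hardware figures (312 TFLOP/s, 2.0 TB/s, roughly A100-class) are assumptions; plug in your own GPU's specs. `forward_pass_bound` is a hypothetical helper name.

```python
# Back-of-the-envelope roofline for one forward pass (all constants assumed).
PARAMS = 8.0e9        # approx. Llama 3 8B parameter count (assumed round number)
BYTES_PER_PARAM = 2   # fp16/bf16 weights
PEAK_FLOPS = 312e12   # assumed accelerator peak, FLOP/s (roughly A100 bf16 dense)
MEM_BW = 2.0e12       # assumed HBM bandwidth, bytes/s


def forward_pass_bound(tokens):
    """Classify a forward pass over `tokens` new tokens as compute- or memory-bound.

    Weights-only model: ignores KV-cache and activation traffic, which
    understates memory time and so only strengthens a "memory-bound" verdict.
    """
    flops = 2 * PARAMS * tokens
    traffic = BYTES_PER_PARAM * PARAMS
    compute_time = flops / PEAK_FLOPS
    memory_time = traffic / MEM_BW
    bound = "compute" if compute_time > memory_time else "memory"
    return bound, compute_time, memory_time


for label, n in [("prefill 1000 tokens", 1000), ("decode 1 token", 1)]:
    bound, tc, tm = forward_pass_bound(n)
    print(f"{label}: {bound}-bound (compute {tc * 1e3:.2f} ms, memory {tm * 1e3:.2f} ms)")
```

Under these assumed numbers the crossover sits at PEAK_FLOPS / MEM_BW = 156 FLOP/byte, i.e. about 156 tokens per pass: a 1000-token prefill is well above it, while single-token decode is far below it and limited by streaming the weights from memory.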