by superlopuh
360 days ago
Can someone familiar with LLM performance please tell me how much this matters for overall perf? I'm interested in looking into optimizing tokenizers, but have not yet run any measurements. I would have assumed that the cost is generally dominated by matmuls, but I'm encouraged by the reception of this post in the comments.

GPU kernels typically dominate wall-clock time; the only exception might be very small models.
Thus the latency of tokenization can essentially be "hidden" by having the CPU prepare the next batch while the GPU finishes the current one.
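
To make the overlap concrete, here is a rough sketch of the idea, not how any particular framework does it. `tokenize` and `run_model` are made-up stand-ins (a whitespace split and a token count in place of a real tokenizer and a GPU forward pass), and `depth` is an arbitrary prefetch size. A background thread tokenizes upcoming batches into a bounded queue while the main loop consumes the current one.

```python
import queue
import threading


def tokenize(text):
    # Hypothetical stand-in for a real tokenizer: split on whitespace.
    return text.split()


def run_model(batch):
    # Hypothetical stand-in for the GPU forward pass:
    # returns the number of tokens in each sequence.
    return [len(tokens) for tokens in batch]


def prefetch_batches(raw_batches, depth=2):
    """Yield tokenized batches, tokenizing ahead in a background thread."""
    q = queue.Queue(maxsize=depth)  # bounded, so the CPU cannot run far ahead
    done = object()                 # sentinel marking the end of the stream

    def producer():
        for raw in raw_batches:
            q.put([tokenize(text) for text in raw])
        q.put(done)

    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = q.get()
        if item is done:
            return
        yield item


raw_batches = [["hello world", "a b c"], ["tokenizers are fast"]]

# While run_model handles one batch, the producer thread is already
# tokenizing the next one.
results = [run_model(batch) for batch in prefetch_batches(raw_batches)]
print(results)
```

With a pure-Python tokenizer the GIL would limit how much real overlap you get from a thread, so in practice this presumably relies on the tokenizer running in native code or in separate worker processes. The structure is the same either way: as long as tokenizing a batch takes less time than the GPU step, its cost does not show up in end-to-end latency.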