by kolinko
866 days ago
If I'm not mistaken, inference becomes compute-bound for parallel (batched) inference requests and for prompt processing. Also, if you have just a single model you want to optimise (and not the training), you could build an array of ASICs that each hard-wire specific matrix computations; then you don't need to read the weights from memory at all.
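A rough back-of-the-envelope sketch of why batching flips the bottleneck, under simplifying assumptions: only weight reads are counted as memory traffic (activations assumed to stay on-chip), weights are 2-byte fp16, and the hardware figures are hypothetical round numbers, not any specific chip. Each weight fetched from memory is reused once per token in the batch, so FLOPs per byte grows with batch size; prompt processing behaves like a large batch because all prompt tokens pass through the same weights at once.

```python
def arithmetic_intensity(batch: int, d_in: int, d_out: int, bytes_per_weight: int = 2) -> float:
    """FLOPs per byte of weight traffic for a (batch x d_in) @ (d_in x d_out) matmul.

    Assumption: only weight reads count as memory traffic.
    """
    flops = 2 * batch * d_in * d_out              # one multiply + one add per weight per token
    weight_bytes = d_in * d_out * bytes_per_weight  # each weight read once per matmul
    return flops / weight_bytes


def is_compute_bound(batch: int, d_in: int, d_out: int,
                     peak_flops: float, mem_bandwidth: float,
                     bytes_per_weight: int = 2) -> bool:
    """Roofline check: compute-bound once intensity reaches peak_flops / mem_bandwidth."""
    ridge_point = peak_flops / mem_bandwidth      # FLOPs per byte the chip can sustain
    return arithmetic_intensity(batch, d_in, d_out, bytes_per_weight) >= ridge_point


if __name__ == "__main__":
    # Hypothetical accelerator: 1 PFLOP/s peak, 2 TB/s memory bandwidth -> ridge at 500 FLOP/byte.
    PEAK, BW = 1e15, 2e12
    for b in (1, 64, 512):
        print(b, arithmetic_intensity(b, 4096, 4096), is_compute_bound(b, 4096, 4096, PEAK, BW))
```

With fp16 weights the intensity works out to exactly the batch size, so a single decode stream (batch 1) sits far below the ridge point and is memory-bound, while a batch or prompt of a few hundred tokens crosses it. A hard-wired ASIC array sidesteps this by removing the weight-read term altogether.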