by jdsully
849 days ago
For inference the common answer will be "no": you use the model you get, and it takes a constant amount of time to process. However, inference platforms do take shortcuts that affect accuracy. For example, llama.cpp will down-convert fp32 intermediates to 8-bit quantized values so it can do the work using 8-bit integer arithmetic. That trades away some of the computation's accuracy for performance.
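A rough sketch of why this costs accuracy: the snippet below quantizes two float vectors block by block to int8 (one float scale per block), does the multiply-accumulate in pure integers, and compares the result against a plain float dot product. The block size of 32 and the symmetric max-abs scaling are assumptions loosely modeled on llama.cpp's 8-bit block formats; this is not llama.cpp's actual kernel code, and `quantize_q8`, `dot_q8` and `dot_float` are hypothetical names for illustration.

```python
import random

BLOCK = 32  # assumption: per-block quantization with 32 values per block


def quantize_q8(block):
    """Symmetric 8-bit quantization of one block: one float scale + int8-range ints."""
    amax = max(abs(v) for v in block)
    scale = amax / 127.0 if amax > 0 else 1.0
    # Rounding to the nearest integer is where precision is lost.
    q = [max(-127, min(127, round(v / scale))) for v in block]
    return scale, q


def dot_float(a, b):
    """Reference dot product in ordinary floating point."""
    return sum(x * y for x, y in zip(a, b))


def dot_q8(a, b):
    """Dot product computed block by block with integer multiply-accumulate."""
    total = 0.0
    for i in range(0, len(a), BLOCK):
        sa, qa = quantize_q8(a[i:i + BLOCK])
        sb, qb = quantize_q8(b[i:i + BLOCK])
        acc = sum(x * y for x, y in zip(qa, qb))  # integer-only accumulation
        total += sa * sb * acc  # rescale back to float once per block
    return total


if __name__ == "__main__":
    random.seed(0)
    acts = [random.gauss(0, 1) for _ in range(256)]
    weights = [random.gauss(0, 1) for _ in range(256)]
    exact = dot_float(acts, weights)
    approx = dot_q8(acts, weights)
    print(f"float: {exact:.6f}  int8: {approx:.6f}  abs error: {abs(exact - approx):.6f}")
```

The integer path is cheaper on real hardware (wider SIMD lanes, less memory traffic), but each value is rounded to one of 255 levels per block, so the result drifts slightly from the float answer. That drift is the accuracy being traded for speed.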
[nodding repeatedly with a serious face and lot of resolve]