by jdsully 849 days ago
For inference, the common answer is "no": you use the model you get, and it takes a constant time to process.

However, the truth is that inference platforms do take shortcuts that affect accuracy. For example, llama.cpp will down-convert fp32 intermediates to 8-bit quantized values so it can do the work using 8-bit integers. This trades some of the computation's accuracy for performance.
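
To make that concrete, here is a rough sketch of the idea in plain Python. It is not llama.cpp's actual code: the function names (quantize_int8, int8_dot, float_dot) are made up for illustration, and the scheme (one absmax scale per vector of 32 values) is a simplified assumption. It quantizes two float vectors to 8-bit integers, does the dot product in integer arithmetic, rescales the result, and compares it with the plain float answer.

```python
import random


def quantize_int8(values):
    """Symmetric absmax quantization (illustrative, hypothetical helper).

    Returns (scale, ints) where each int is in [-127, 127] and
    value is approximately int * scale.
    """
    absmax = max(abs(v) for v in values)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    return scale, [round(v / scale) for v in values]


def int8_dot(a, b):
    """Dot product computed on the quantized integers, then rescaled."""
    scale_a, qa = quantize_int8(a)
    scale_b, qb = quantize_int8(b)
    acc = sum(x * y for x, y in zip(qa, qb))  # integer-only accumulation
    return acc * scale_a * scale_b


def float_dot(a, b):
    """Reference dot product in ordinary floating point."""
    return sum(x * y for x, y in zip(a, b))


rng = random.Random(0)  # fixed seed so runs are repeatable
a = [rng.uniform(-1, 1) for _ in range(32)]
b = [rng.uniform(-1, 1) for _ in range(32)]

exact = float_dot(a, b)
approx = int8_dot(a, b)
print(f"float: {exact:.6f}  int8: {approx:.6f}  abs error: {abs(exact - approx):.6f}")
```

The two results should be close but generally not identical. That small gap is the accuracy being given up; the payoff is that the inner loop only touches small integers, which hardware can usually process faster and with less memory traffic than fp32.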

1 comment

I have no freaking idea what you said in the second paragraph but I love it and it will linger in the back of my head until I understand enough to look it up.

[nodding repeatedly with a serious face and a lot of resolve]