by touisteur 1564 days ago
Hi, I'm curious what you mean by model quantization being necessary on CPU and GPU. It's not necessary by default: OpenVINO, TVM and TensorRT can run single-precision inference on most classic models quite fast. If you're reaching for very low power or ultimate perf, yeah, you can drop to fp16 (well... mixed precision) with NVIDIA tensor cores or AVX512-FP16, or bf16 with AVX512-BF16 on some Intel configs. Going to integer will give you more throughput too, but it's not necessary. Even Myriad X is supposed to handle some kind of fp16 on its SHAVE cores.

The only time I had to reach for quantized (integer) networks to do anything at all was for inference on FPGAs. Are you targeting DSP slices by default, or implementing full IEEE 754 floating point?

Are you saying that with Tensil you can run single-precision, non-quantized models at up to 2x GPU perf?

I probably misunderstood your last sentence, sorry.

Genuinely curious!

1 comment

Sorry if this was unclear - in a datacenter use case you are right, but for an edge deployment you will usually need to quantize, prune or compress your ML model to get it running as fast as you'd like on a sufficiently small CPU/GPU. Compared with running your ML model unchanged on those platforms, Tensil can run within the performance ranges listed above. You can also quantize your model and still use Tensil!
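
For anyone following along, here is a rough sketch of what "quantize" means in this thread: mapping fp32 values onto int8 with a single scale factor (symmetric, per-tensor). This is only an illustration in plain Python. The function names and the sample weights are made up, and it is not how Tensil or any of the toolchains mentioned above implement quantization.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats onto integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float values from the int8 codes."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.82, -1.27, 0.003, 0.41]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# Rounding error per value is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

The small weight (0.003) collapses to zero, which is the kind of precision loss you trade for integer throughput, and why running the model unchanged can be attractive when the hardware allows it.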