by touisteur
1564 days ago
Hi, I'm curious what you mean by model quantization being necessary on CPU and GPU. It isn't necessary by default: OpenVINO, TVM, and TensorRT can all run single-precision inference on most classic models quite fast. If you're reaching for very low power or ultimate performance, yes, you can drop to fp16 (well... mixed precision) with NVIDIA tensor cores or AVX512-FP16, or bf16 on some Intel AVX-512 configurations? Going to integer will give you more throughput too, but it's not required. Even Myriad X is supposed to handle some kind of fp16 on its SHAVE cores. The only time I had to reach for quantized (integer) networks to get anything running at all was for inference on FPGAs. Are you targeting DSP slices by default, or implementing full IEEE 754 floating point by default? Are you saying that with Tensil you can run single-precision, non-quantized models at up to 2x GPU performance? I probably misunderstood your last sentence, sorry. Genuinely curious!
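
To make the fp16-versus-integer distinction concrete, here is a minimal sketch in plain Python. It is not how OpenVINO, TVM, TensorRT, or Tensil quantize anything; the weights are made-up toy values, and the symmetric per-tensor int8 scheme and the helper names (`to_fp16`, `quantize_int8`, `dequantize_int8`) are assumptions chosen for illustration.

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def quantize_int8(values):
    """Symmetric per-tensor int8 quantization (an assumed, simplified scheme).

    Returns the integer codes and the scale that maps them back to floats.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize_int8(codes, scale):
    """Map int8 codes back to approximate float values."""
    return [c * scale for c in codes]

# Hypothetical toy weights, not taken from any real model.
weights = [0.8125, -0.3331, 0.0421, -1.27, 0.0009]

fp16_weights = [to_fp16(w) for w in weights]
codes, scale = quantize_int8(weights)
int8_weights = dequantize_int8(codes, scale)

# Worst-case absolute error introduced by each representation.
fp16_err = max(abs(a - b) for a, b in zip(weights, fp16_weights))
int8_err = max(abs(a - b) for a, b in zip(weights, int8_weights))
print(f"max abs error fp16: {fp16_err:.6f}, int8: {int8_err:.6f}")
```

On these toy values the fp16 cast loses less precision than the int8 mapping, which is roughly why a plain fp16 or mixed-precision downgrade usually needs no calibration, while integer quantization needs a scale chosen per tensor or per channel.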