Hacker News
by ipiszy 1361 days ago
tl;dr:

Meta is open sourcing AITemplate, an inference engine for both Nvidia and AMD GPUs. Code: https://github.com/facebookincubator/AITemplate.

AITemplate delivers much better perf (1.9x ~ 12.8x) compared to PyTorch eager on SOTA models, including Bert, ResNet, VIT and StableDiffusion.

AITemplate also delivers high perf numbers using AMD GPUs (MI-250). With AITemplate, MI-250 achieves 80% ~ 96% A100 perf on various ResNet / Bert / VIT models.

AITemplate uses sophisticated fusion techniques to optimize perf, including vertical, horizontal, and memory fusions.

btw, I'm one of the authors of AITemplate, happy to answer any questions.

6 comments

How does AITemplate's performance compare to state-of-the-art inference engines like TVM or ONNX Runtime? Does AITemplate optimize/quantize the network?

Edit: link for TVM https://tvm.apache.org/

AITemplate only supports fp16 data types with fp16 or fp32 accumulation right now. We are working on supporting more data types and quantization.
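
To make the fp16 vs. fp32 accumulation distinction concrete, here is a minimal CPU-side sketch. It is not AITemplate code: it simulates half precision with Python's `struct` module, and the helper names (`to_fp16`, `to_fp32`, `dot`) and the input values are made up for illustration. It keeps the inputs in fp16 and only changes the precision of the running sum.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def dot(a, b, round_acc):
    """Dot product of fp16 inputs; round_acc rounds the running sum after each add."""
    acc = 0.0
    for x, y in zip(a, b):
        # The product is computed exactly here; only the running sum is rounded.
        prod = to_fp16(x) * to_fp16(y)
        acc = round_acc(acc + prod)
    return acc

n = 4096
a = [0.01] * n
b = [1.0] * n

# Reference value computed in double precision from the same fp16 inputs.
reference = sum(to_fp16(x) * to_fp16(y) for x, y in zip(a, b))
fp16_acc = dot(a, b, to_fp16)
fp32_acc = dot(a, b, to_fp32)

print("reference:", reference)
print("fp16 accumulation:", fp16_acc, "error:", abs(fp16_acc - reference))
print("fp32 accumulation:", fp32_acc, "error:", abs(fp32_acc - reference))
```

With fp16 accumulation the running sum stops growing once each addend is smaller than half the spacing between adjacent fp16 values at that magnitude, so long reductions lose accuracy. fp32 accumulation avoids most of that at some cost in speed, which is presumably why both modes are offered.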

We don't have an official comparison between AITemplate and tvm / onnx for now, but we do have perf numbers like https://github.com/facebookincubator/AITemplate/tree/main/ex..., https://github.com/facebookincubator/AITemplate/tree/main/ex.... Feel free to run these examples on other frameworks and compare perf.
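
If you want to run your own comparison, a timing harness along these lines is one way to do it. This is a generic sketch, not part of AITemplate: `benchmark` and `workload` are hypothetical names, and `workload` is a CPU stand-in for a single inference call.

```python
import statistics
import time

def benchmark(fn, warmup=3, iters=20):
    """Return the median wall-clock time of fn() in milliseconds."""
    # Warm-up runs absorb one-time costs (caches, lazy initialization).
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)

def workload():
    # Hypothetical stand-in for one inference call.
    return sum(i * i for i in range(10_000))

median_ms = benchmark(workload)
print(f"median latency: {median_ms:.3f} ms")
```

On a GPU the calls are usually asynchronous, so the function being timed has to wait for the device to finish before returning (in PyTorch, by calling `torch.cuda.synchronize()`). Otherwise you measure launch overhead rather than execution time.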

I'd love to hear about this too: especially after running the model through an onnx optimizer, like this one [0].

[0] https://github.com/daquexian/onnx-simplifier

Thanks, that is very helpful. Do you have to train the model differently for use with AITemplate? Could it be helpful for Leela Chess Zero (LC0)? I think LC0 has a generic PyTorch backend that is several times slower than its NVIDIA-specific CUDA backend. I'm not very clueful about this stuff though.
No, you don't need to train the model differently to use it with AITemplate. Here is an intro example of running inference with AITemplate on a very simple PyTorch model: https://facebookincubator.github.io/AITemplate/tutorial/how_.... For more advanced examples, check out https://github.com/facebookincubator/AITemplate/tree/main/ex...
As @haolu7 mentioned, you could take a pre-trained model and use AITemplate to do model inference. All you need to do is rewrite the model using the AITemplate frontend and map PyTorch params to AITemplate params. Note that AITemplate has limited operator coverage compared to mature frameworks like PyTorch, so you may need to implement your own kernels if necessary (though it already supports Bert, VIT, StableDiffusion, ResNet, Detectron, and general recommendation models).
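
The "map PyTorch params" step can be pictured roughly like this. It is a plain-Python sketch, not the AITemplate API: `map_params`, the parameter names, and the naming rule (dots replaced by underscores) are assumptions for illustration. In real code the source dict would come from something like `dict(model.named_parameters())` on the PyTorch module.

```python
def map_params(pt_params, target_names):
    """Map PyTorch-style parameter names onto the names a rewritten model expects.

    Assumption for this sketch: each target name is the PyTorch name with
    '.' replaced by '_'.
    """
    mapped = {}
    for pt_name, value in pt_params.items():
        target = pt_name.replace(".", "_")
        if target not in target_names:
            raise KeyError(f"no target parameter for {pt_name!r}")
        mapped[target] = value
    # Fail loudly if the rewritten model expects weights we did not supply.
    missing = set(target_names) - set(mapped)
    if missing:
        raise KeyError(f"unmapped target parameters: {sorted(missing)}")
    return mapped

# Hypothetical parameters of a one-layer model; the values are placeholders.
pt_params = {"dense1.weight": [[0.1, 0.2]], "dense1.bias": [0.0]}
target_names = {"dense1_weight", "dense1_bias"}

mapped = map_params(pt_params, target_names)
print(sorted(mapped))
```

Checking both directions (every source weight has a target, every target got a weight) catches the common mistake of rewriting a model with a layer renamed or left out.
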
How does the performance compare with TensorRT? I didn't see any benchmarks comparing against that. I expect it to be lower for now, but I'm excited to see what the future brings.
Do you know of any good explanations of the techniques you used, for those who only touch PyTorch eager and occasionally TorchScript?
You could check the "AITemplate optimizations" section in the blog (https://ai.facebook.com/blog/gpu-inference-engine-nvidia-amd...) and https://github.com/facebookincubator/AITemplate#more-about-a.... The basic idea is to do aggressive kernel fusions.
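
For readers coming from eager mode, the idea behind vertical fusion can be shown with a toy example. This is a conceptual sketch in plain Python, not AITemplate's generated code; the function names and the choice of ops (bias add followed by ReLU) are just for illustration.

```python
def add_bias(xs, bias):
    return [x + bias for x in xs]

def relu(xs):
    return [max(x, 0.0) for x in xs]

def unfused(xs, bias):
    # Two separate "kernels": the intermediate result is written out in full
    # and then read back by the second pass.
    tmp = add_bias(xs, bias)
    return relu(tmp)

def fused(xs, bias):
    # One "kernel": each element is loaded once, and the intermediate value
    # never leaves the local expression.
    return [max(x + bias, 0.0) for x in xs]

xs = [-2.0, -0.5, 0.0, 1.5]
print(unfused(xs, 0.5))
print(fused(xs, 0.5))
```

On a GPU each unfused op is typically its own kernel launch with a round trip through global memory, so merging the ops saves both launch overhead and memory bandwidth. The results are the same; only the number of passes over the data changes.
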
Have you tested this on big models involving multi-GPU communication, or do you have any plans to?
For now it's for single GPU inference only.
How do you verify the correctness of your fusion operations?
We have a bunch of unit tests and E2E tests that compare numerical results between AITemplate and PyTorch eager.
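
A numeric comparison of that kind might look like the sketch below. It is not taken from the AITemplate test suite: `mismatches`, the tolerances, and the sample values are assumptions, with the tolerances kept loose because fp16 results rarely match a reference bit for bit.

```python
import math

def mismatches(actual, expected, rtol=1e-2, atol=1e-2):
    """Return the indices where actual and expected differ beyond tolerance."""
    if len(actual) != len(expected):
        raise ValueError("length mismatch")
    return [
        i
        for i, (a, e) in enumerate(zip(actual, expected))
        if not math.isclose(a, e, rel_tol=rtol, abs_tol=atol)
    ]

# Hypothetical outputs: a reference run, a run with small rounding
# differences, and a run with a real bug in the last element.
reference = [0.0, 0.5, 2.0]
candidate = [0.0, 0.5003, 1.9990]
broken = [0.0, 0.5, 2.5]

print(mismatches(candidate, reference))
print(mismatches(broken, reference))
```

Returning the offending indices rather than a single pass/fail flag makes it easier to see which part of a fused op went wrong.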