

by ipiszy 1361 days ago

tl;dr:

Meta is open sourcing AITemplate, an inference engine for both Nvidia and AMD GPUs. Code: https://github.com/facebookincubator/AITemplate.

AITemplate delivers much better perf (1.9x ~ 12.8x) compared to PyTorch eager on SOTA models, including Bert, ResNet, VIT and StableDiffusion.

AITemplate also delivers high perf numbers using AMD GPUs (MI-250). With AITemplate, MI-250 achieves 80% ~ 96% A100 perf on various ResNet / Bert / VIT models.

AITemplate uses sophisticated fusion techniques to optimize perf, including vertical, horizontal, and memory fusions.
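To make "vertical fusion" concrete, here is a toy Python sketch. It is not AITemplate code, and the function names are made up for illustration. The unfused version runs three passes and materializes two intermediate buffers; the fused version does the same math in one pass. A real engine would emit a single GPU kernel rather than a list comprehension.

```python
def unfused_scale_shift_relu(x, w, b):
    # Three separate "kernels", each writing a full intermediate buffer.
    scaled = [xi * wi for xi, wi in zip(x, w)]
    shifted = [si + bi for si, bi in zip(scaled, b)]
    return [max(v, 0.0) for v in shifted]


def fused_scale_shift_relu(x, w, b):
    # One "kernel": multiply, add, and ReLU per element, no intermediates.
    return [max(xi * wi + bi, 0.0) for xi, wi, bi in zip(x, w, b)]


x = [1.0, -2.0, 3.0]
w = [0.5, 0.5, -1.0]
b = [0.25, 0.0, 1.0]

# Both versions compute the same values; only the memory traffic differs.
assert unfused_scale_shift_relu(x, w, b) == fused_scale_shift_relu(x, w, b)
```

The saving comes from not writing `scaled` and `shifted` out to memory and reading them back, which is typically the bottleneck for elementwise ops on a GPU.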

btw, I'm one of the authors of AITemplate, happy to answer any questions.

6 comments

Narew 1361 days ago

How does AITemplate's performance compare to state-of-the-art inference engines like TVM or ONNX Runtime? Does AITemplate optimize/quantize the network?

Edit: link for TVM https://tvm.apache.org/


ipiszy 1360 days ago

AITemplate only supports fp16 data types with fp16 or fp32 accumulation right now. We are working on supporting more data types and quantization.
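For readers wondering why the accumulation type matters, here is a small pure-Python sketch. It emulates fp16 and fp32 rounding with the standard `struct` module; the helper names are hypothetical and this only approximates what GPU hardware does. Inputs are fp16 in both cases; only the precision of the running sum changes.

```python
import struct


def to_fp16(v):
    # Round a Python float to the nearest IEEE 754 half-precision value.
    return struct.unpack("e", struct.pack("e", v))[0]


def to_fp32(v):
    # Round a Python float to the nearest IEEE 754 single-precision value.
    return struct.unpack("f", struct.pack("f", v))[0]


def dot(a, b, round_acc):
    # fp16 inputs; the accumulator is rounded with round_acc after each add.
    acc = 0.0
    for x, y in zip(a, b):
        acc = round_acc(acc + to_fp16(x) * to_fp16(y))
    return acc


ones = [1.0] * 4096
fp16_acc = dot(ones, ones, to_fp16)  # stalls once the sum reaches 2048
fp32_acc = dot(ones, ones, to_fp32)  # reaches the exact answer, 4096
```

With fp16 accumulation the sum gets stuck at 2048, because 2049 is not representable in fp16 and rounds back down. fp32 accumulation avoids this at some cost in throughput.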

We don't have an official comparison between AITemplate and tvm / onnx for now, but we do have perf numbers like https://github.com/facebookincubator/AITemplate/tree/main/ex..., https://github.com/facebookincubator/AITemplate/tree/main/ex.... Feel free to run these examples on other frameworks and compare perf.


davidatbu 1360 days ago

I'd love to hear about this too: especially after running the model through an onnx optimizer, like this one [0].

[0] https://github.com/daquexian/onnx-simplifier


throwaway81523 1361 days ago

Thanks, that is very helpful. Do you have to train the model differently for use with AITemplate? Could it be helpful for Leela Chess Zero (LC0)? I think LC0 has a generic PyTorch backend that is several times slower than its Nvidia-specific CUDA backend. I'm not very clueful about this stuff though.


haolu7 1361 days ago

No, you don't need to train the model differently to use it with AITemplate. Here is an intro example to do inference with AITemplate with a very simple PyTorch model: https://facebookincubator.github.io/AITemplate/tutorial/how_.... For more advanced examples, check out https://github.com/facebookincubator/AITemplate/tree/main/ex...


ipiszy 1361 days ago

As @haolu7 mentioned, you could take a pre-trained model and use AITemplate to do model inference. All you need to do is rewrite the model using the AITemplate frontend and map the PyTorch params to AITemplate params. Note that AITemplate has limited operator coverage compared to mature frameworks like PyTorch, so you may need to implement your own kernels if necessary (though it already supports Bert, VIT, StableDiffusion, ResNet, Detectron, and general recommendation models).
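The "map PyTorch params to AITemplate params" step is essentially a name-matching exercise. Below is a rough sketch using plain dicts in place of real models. It assumes, purely for illustration, that the AITemplate-side names are the PyTorch names with dots replaced by underscores; `map_params` is a hypothetical helper, not part of the AITemplate API, so check the tutorial linked above for the actual convention.

```python
def map_params(pt_params, ait_names):
    # pt_params: {pytorch_name: weights}, e.g. from a PyTorch state dict.
    # ait_names: parameter names expected on the AITemplate side.
    # Assumption: AIT names are the PyTorch names with "." replaced by "_".
    by_ait_name = {name.replace(".", "_"): value for name, value in pt_params.items()}
    missing = [name for name in ait_names if name not in by_ait_name]
    if missing:
        raise KeyError(f"no PyTorch parameter found for: {missing}")
    return {name: by_ait_name[name] for name in ait_names}


# Toy stand-ins for real weights.
pt_params = {"dense1.weight": [[0.1, 0.2]], "dense1.bias": [0.0]}
ait_names = ["dense1_weight", "dense1_bias"]
mapped = map_params(pt_params, ait_names)
```

Failing loudly on a missing name is deliberate: a silently skipped weight would leave the compiled model running with uninitialized parameters.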


fooblaster 1360 days ago

How does the performance compare with TensorRT? I didn't see any benchmarks comparing against that. I expect it to be lower for now, but I'm excited to see what the future brings.


upbeat_general 1361 days ago

Do you know of any good explanations of the techniques you used, for those who only touch PyTorch eager and occasionally TorchScript?


ipiszy 1360 days ago

You could check the "AITemplate optimizations" section in the blog (https://ai.facebook.com/blog/gpu-inference-engine-nvidia-amd...), and https://github.com/facebookincubator/AITemplate#more-about-a.... The basic idea is to do aggressive kernel fusions.


papersnake 1360 days ago

Have you tested this on big models involving multi-gpu communication, or any plans?


ipiszy 1360 days ago

For now it's for single GPU inference only.


pretty_dumm_guy 1361 days ago

How do you verify the correctness of your fusion operations?


ipiszy 1360 days ago

We have a bunch of unit tests and E2E tests that compare numerical results between AITemplate and PyTorch eager.
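A check of this kind usually boils down to an elementwise tolerance comparison against the reference output. Here is a minimal pure-Python sketch of the idea. The `allclose` helper and the tolerance values are assumptions for illustration, not taken from AITemplate's test suite; fp16 results generally need looser tolerances than fp32.

```python
import math


def allclose(actual, expected, rel_tol=1e-2, abs_tol=1e-2):
    # True if every element matches within relative OR absolute tolerance.
    return len(actual) == len(expected) and all(
        math.isclose(a, e, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, e in zip(actual, expected)
    )


reference = [0.1, 1.0, 100.0]      # stand-in for PyTorch eager output
fused_ok = [0.1001, 0.999, 100.3]  # small fp16-style rounding differences
fused_bad = [0.1, 1.0, 103.0]      # a real mismatch in the last element

ok_result = allclose(fused_ok, reference)
bad_result = allclose(fused_bad, reference)
```

Here `ok_result` is `True` and `bad_result` is `False`: a 0.3% deviation passes under a 1% relative tolerance, while a 3% deviation does not.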
