Seems to be an issue on their side. E.g., for a single step of GPT2 training on a 7900 XTX [1]: tinygrad is ~440ms, PyTorch 2.4.0.dev20240513 is ~97ms, Karpathy's llm.c with ROCm is ~79ms, and llm.c with custom kernels is ~58ms.
That issue seems a month old, while the 58ms number looks 1 day old.
I've seen a lot of work go into improving performance over the last month (it's in the release announcement as well), but of course I still don't think it can compete with that number. Still, a new comparison would be cool.