There is a difference. We train with large batch sizes these days. The ANE takes up only a tiny amount of silicon and can't keep up with the large matrix multiplications that big LLMs need, whether the batch size is 1 or higher. That means it cannot saturate the RAM bandwidth, and you're better off using the much bigger GPU on the Apple die.
The submitted article also talks about training models.
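
To make the bandwidth point concrete, here is a rough roofline-style sketch. Every number in it is a made-up placeholder, not a measurement of the ANE or any Apple GPU, and the function names are mine. It assumes a plain weight-matrix multiply where each weight is read once from RAM and used for one multiply-add per batch row.

```python
def arithmetic_intensity(batch, bytes_per_weight=2.0):
    # One multiply-add (2 FLOPs) per weight per batch row; each weight is read once.
    return 2.0 * batch / bytes_per_weight


def bandwidth_utilization(peak_flops, mem_bandwidth, batch, bytes_per_weight=2.0):
    """Fraction of RAM bandwidth a compute unit can actually consume.

    Roofline model: attainable FLOP/s = min(peak compute, bandwidth * intensity).
    """
    intensity = arithmetic_intensity(batch, bytes_per_weight)
    attainable_flops = min(peak_flops, mem_bandwidth * intensity)
    bytes_per_second = attainable_flops / intensity
    return bytes_per_second / mem_bandwidth


# Hypothetical placeholder figures, chosen only to illustrate the shape of the argument.
SMALL_UNIT_FLOPS = 50e9   # assumed: a small accelerator
BIG_UNIT_FLOPS = 400e9    # assumed: a much larger GPU
RAM_BANDWIDTH = 100e9     # assumed: bytes per second, shared by both

for name, flops in [("small", SMALL_UNIT_FLOPS), ("big", BIG_UNIT_FLOPS)]:
    for batch in (1, 8):
        used = bandwidth_utilization(flops, RAM_BANDWIDTH, batch)
        print(f"{name} unit, batch {batch}: {used:.1%} of RAM bandwidth used")
```

With these placeholder numbers the small unit is compute-bound even at batch size 1, so it leaves bandwidth on the table, and raising the batch size only makes that worse. The bigger unit can pull weights as fast as RAM delivers them at batch size 1.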