The looped matrix multiply that you show is very hard to optimize for in the general case of `einsum`. The looped GEMM often shows up with permuted indices, as in `kbi,kjb->bij`. In that case, heuristics are needed to decide whether calling GEMM is worth it, because the operands may first have to be copied into a layout that GEMM can use.
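To make the permuted case concrete, here is a minimal sketch of `kbi,kjb->bij` written as a batched matrix multiply. The array names and sizes are made up for illustration; the point is only that both operands need a transpose before `np.matmul` applies, and that the transposed views are no longer contiguous in memory.

```python
import numpy as np

rng = np.random.default_rng(0)
K, B, I, J = 5, 4, 3, 6                 # arbitrary example sizes
x = rng.standard_normal((K, B, I))      # indices k, b, i
y = rng.standard_normal((K, J, B))      # indices k, j, b

# Reference result straight from einsum.
ref = np.einsum('kbi,kjb->bij', x, y)

# Same contraction as a batched GEMM: move the batch index b to the front
# and put the contracted index k in the inner position.
x_bik = x.transpose(1, 2, 0)            # view with indices b, i, k
y_bkj = y.transpose(2, 0, 1)            # view with indices b, k, j
out = np.matmul(x_bik, y_bkj)           # result has indices b, i, j

# The transposed views are strided, not C-contiguous, so a BLAS-backed
# routine may have to copy them before it can run.
needs_copy = not x_bik.flags.c_contiguous
```

Whether the copies pay off depends on the sizes involved, which is why a general-purpose `einsum` has to guess.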
`optimize=True` is generally best when there are more than two tensors in the expression.
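As a rough illustration of the multi-tensor case, the sketch below asks `np.einsum_path` for a contraction order on a three-matrix chain and then reuses that path. The shapes are assumptions chosen so that the contraction order matters; they are not taken from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((8, 200))
b = rng.standard_normal((200, 8))
c = rng.standard_normal((8, 200))

expr = 'ij,jk,kl->il'

# einsum_path returns the chosen contraction order and a readable report.
path, report = np.einsum_path(expr, a, b, c, optimize='greedy')
print(report)

# Reuse the precomputed path so the search is not repeated on every call.
result = np.einsum(expr, a, b, c, optimize=path)
```

With only two operands there is no contraction order to choose, so any difference from `optimize=True` has to come from how the single contraction is dispatched.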
The optimizer clearly tries to improve the performance, but in many cases, it doesn't seem to change anything. Let's simply multiply some matrices:
I can do the product directly or with a naive `einsum`. But even with optimization, I see no improvement. I'm not sure if I'm doing something wrong.
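The original snippets are not shown here, so the following is only a hypothetical way to set up such a comparison. `time_einsum`, the matrix size, and the three-matrix expression are all assumptions made for illustration. It checks that both calls agree and returns the best wall-clock time for each.

```python
import timeit
import numpy as np

def time_einsum(n=48, repeats=5):
    """Time a three-matrix chain with and without optimize=True (hypothetical helper)."""
    rng = np.random.default_rng(0)
    a, b, c = (rng.standard_normal((n, n)) for _ in range(3))
    expr = 'ij,jk,kl->il'

    # Both variants must give the same numbers before timing means anything.
    plain = np.einsum(expr, a, b, c)
    opt = np.einsum(expr, a, b, c, optimize=True)
    assert np.allclose(plain, opt)

    # Best of several single runs for each variant.
    t_plain = min(timeit.repeat(lambda: np.einsum(expr, a, b, c),
                                number=1, repeat=repeats))
    t_opt = min(timeit.repeat(lambda: np.einsum(expr, a, b, c, optimize=True),
                              number=1, repeat=repeats))
    return {'plain': t_plain, 'optimized': t_opt}

timings = time_einsum()
print(timings)
```

Timings will vary with matrix size, NumPy version, and the BLAS build, so it is worth trying a few sizes before drawing conclusions.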