I understand the achievement, but can't square it with my belief that uniform systolic arrays will prove to be the best general-purpose compute engine for neural networks. Those are almost trivial to route, by nature, since each cell only talks to its nearest neighbors.
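Imagine a bit-level systolic array: just a sea of LUTs, each with a latch on its output. Color the grid like a checkerboard and clock the two colors on alternating phases, so every cell only ever reads neighbors that are holding still. That's the magic of graph coloring: clocking everything in 2 phases removes all timing concerns.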
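To make the two-phase idea concrete, here is a rough Python sketch of how I picture it, not a description of any real chip. Everything in it is my own assumption for illustration: the names `neighbors`, `step_phase`, `tick`, and `random_grid`, the 4-input LUT stored as a 16-bit truth table, and the wraparound (torus) edges with even dimensions so the checkerboard coloring stays valid.

```python
import random

def neighbors(r, c, rows, cols):
    # N, E, S, W neighbors with wraparound. With even rows and cols,
    # every neighbor has the opposite checkerboard color (r + c) % 2.
    return [((r - 1) % rows, c), (r, (c + 1) % cols),
            ((r + 1) % rows, c), (r, (c - 1) % cols)]

def step_phase(state, luts, phase, order=None):
    # Update only the cells whose color matches this phase, in place.
    # Their inputs all come from the other color, which is latched.
    rows, cols = len(state), len(state[0])
    cells = [(r, c) for r in range(rows) for c in range(cols)
             if (r + c) % 2 == phase]
    if order is not None:
        cells = order(cells)  # hypothetical hook: visit cells in any order
    for r, c in cells:
        idx = 0
        for bit, (nr, nc) in enumerate(neighbors(r, c, rows, cols)):
            idx |= state[nr][nc] << bit
        # Look up the 4-bit input pattern in the cell's truth table.
        state[r][c] = (luts[r][c] >> idx) & 1
    return state

def tick(state, luts, order=None):
    # One full clock: phase 0 cells update, then phase 1 cells.
    step_phase(state, luts, 0, order)
    step_phase(state, luts, 1, order)
    return state

def random_grid(rows, cols, bits, seed=0):
    # Helper for experiments: a grid of random values `bits` wide.
    rng = random.Random(seed)
    return [[rng.getrandbits(bits) for _ in range(cols)] for _ in range(rows)]
```

The point of the sketch is that within a phase the order in which cells are visited doesn't matter, because no cell reads a value that changes during that phase. That is the software analogue of having no timing races between neighbors.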
GPUs still treat memory as separate from compute; they just have wider bottlenecks than CPUs.
I think the next step is arrays of memory-based compute.