These seem power- and density-optimized. This sort of custom hardware is all about supply chains and getting a lot of chips deployed everywhere, which favors the inference use case.
For large training jobs, it is more about turnaround time; running hideously expensive GPUs that suck down huge amounts of power is fine.
It looks rather general-purpose (for ML tasks) to me:
Each PE is equipped with two processor cores (one of them equipped with the vector extension) and a number of fixed-function units that are optimized for performing critical operations, such as matrix multiplication, accumulation, data movement, and nonlinear function calculation. The processor cores are based on the RISC-V open instruction set architecture (ISA) and are heavily customized to perform necessary compute and control tasks.