Memory bandwidth is important too. The Knights Landing processors have 16GB of on-chip memory that gives the cores significantly higher bandwidth than you'd get with DDR4; for some algorithms, the additional memory bandwidth has more impact on runtime than raw compute performance does.
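To put rough numbers on that, here is a minimal roofline-style sketch in Python. The hardware figures are round values I'm assuming for illustration (about 3 TFLOPS double-precision peak, about 90 GB/s for DDR4, about 400 GB/s for the on-package memory), not measurements, and `attainable_gflops` is just a name made up for this example.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Roofline bound: the lower of the compute limit and the memory limit."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)

# Assumed round numbers, for illustration only.
PEAK_GFLOPS = 3000.0   # double-precision peak
DDR4_GBS = 90.0        # DDR4 bandwidth
MCDRAM_GBS = 400.0     # on-package memory bandwidth

# STREAM-triad-like kernel a[i] = b[i] + s * c[i]:
# 2 flops per element, 3 doubles (24 bytes) moved per element.
triad_intensity = 2 / 24

ddr4 = attainable_gflops(PEAK_GFLOPS, DDR4_GBS, triad_intensity)
mcdram = attainable_gflops(PEAK_GFLOPS, MCDRAM_GBS, triad_intensity)

print(f"DDR4 bound:   {ddr4:.1f} GFLOPS")
print(f"MCDRAM bound: {mcdram:.1f} GFLOPS")
print(f"Speedup from bandwidth alone: {mcdram / ddr4:.1f}x")
```

Under these assumptions the kernel never gets anywhere near the compute peak on either memory, so the faster memory sets the runtime, not the FLOPS figure.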
The optional 16GB L3 is on separate chips, but it's colocated inside the same chip package. MCMs (multi-chip modules) of this kind have been used in the semiconductor industry since the 70s. Recent examples include AMD Xenos in the Xbox 360, the Wii U CPU, and IBM POWER chips.
Direct addressing is the preferred configuration. Only if your existing code's working set does not fit in MCDRAM does the cache configuration make sense.
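As a rule of thumb, that decision could be sketched like this in Python. The function name and the plain size comparison are my own simplification for illustration; a real decision would also account for memory taken by the OS and other processes.

```python
MCDRAM_BYTES = 16 * 2**30  # 16GB of MCDRAM, as discussed above

def suggest_memory_mode(working_set_bytes, mcdram_bytes=MCDRAM_BYTES):
    """Suggest 'flat' (direct addressing) if the working set fits in MCDRAM,
    otherwise 'cache'. Hypothetical helper, not part of any Intel tool."""
    if working_set_bytes <= mcdram_bytes:
        return "flat"
    return "cache"

print(suggest_memory_mode(8 * 2**30))   # 8GB working set
print(suggest_memory_mode(64 * 2**30))  # 64GB working set
```

The first call suggests direct addressing because 8GB fits; the second falls back to the cache configuration because 64GB does not.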
It might sound pedantic on my part, but 'it can act as a cache' is very different in practice from 'it is a cache'.