>> Except most M0 cores will have 1 or 2 cycle internal flash memory reads, and this has slow external flash.
With a 16K cache. I'm not sure how you ensure consistent performance though - make sure your code all fits in 16K, but is that enough? And what about the first time through?
Even on ARM with built-in flash you can't ensure that easily. The only way to do so is to copy the code to RAM and run it from there (most of the M0-M4 devices I've seen don't have an icache). This is because of the way the flash ends up being read (I can't remember the correct term, but I think it's wait states): the flash is slower than the core clock, so the processor ends up stalling for a variable number of cycles while it waits for the flash to deliver the next line.
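For what it's worth, here is a minimal sketch of the "run it from RAM" approach with GCC. The section name `.ramfunc` is an assumption: it only works if your linker script places that section in SRAM and your startup code copies it out of flash before `main()`, and vendor toolchains name it differently. The `RAMFUNC` macro and the `checksum` routine are made up for illustration. On a non-ARM host build the macro expands to nothing, so the same file still compiles for testing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: ".ramfunc" is a section that the linker script locates in SRAM
 * and that the startup code copies from flash at boot. The name is
 * hypothetical; check your vendor's linker script for the real one. */
#if defined(__GNUC__) && defined(__arm__)
#define RAMFUNC __attribute__((section(".ramfunc"), noinline))
#else
#define RAMFUNC /* host build: ordinary function, no special placement */
#endif

/* Hypothetical timing-critical routine. Once it executes from SRAM, its
 * instruction fetches no longer depend on flash wait states. */
RAMFUNC uint32_t checksum(const uint8_t *buf, size_t len)
{
    /* Precondition check; compiled out when NDEBUG is defined. */
    assert(buf != NULL || len == 0);

    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        sum += buf[i];
    }
    return sum;
}
```

Anything the function calls has to live in RAM too, otherwise you're back to fetching from flash in the middle of the timed section.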
A 16K cache is likely enough to ensure stable performance for any given function and any tight loops you're using, but it probably won't hold the entire program. You'll still get misses that cause slowdowns, though they probably won't be very noticeable unless you're trying to guarantee timing across large functions.
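To put rough numbers on the "first time through" question, here is a toy model you can run on a PC. It assumes a direct-mapped 16 KiB instruction cache with 32-byte lines. Both parameters are guesses, since the thread doesn't say how the real cache is organised, and the names `icache_t`, `icache_reset` and `run_pass` are invented for this sketch. It counts misses for one straight-line pass over a block of code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* ASSUMED cache geometry; the real part may differ. */
#define CACHE_BYTES 16384u
#define LINE_BYTES  32u
#define NUM_LINES   (CACHE_BYTES / LINE_BYTES)

typedef struct {
    uint32_t tag[NUM_LINES];   /* which memory line occupies each slot */
    uint8_t  valid[NUM_LINES]; /* 0 until the slot is first filled */
} icache_t;

/* Empty the cache, as after reset. */
void icache_reset(icache_t *c)
{
    memset(c, 0, sizeof *c);
}

/* Simulate one sequential pass of instruction fetches over
 * [base, base + len) and return the number of cache misses. */
unsigned run_pass(icache_t *c, uint32_t base, uint32_t len)
{
    assert(base % LINE_BYTES == 0 && len % LINE_BYTES == 0);

    unsigned misses = 0;
    for (uint32_t addr = base; addr < base + len; addr += LINE_BYTES) {
        uint32_t line  = addr / LINE_BYTES;
        uint32_t index = line % NUM_LINES; /* direct-mapped slot */
        uint32_t tag   = line / NUM_LINES;

        if (!c->valid[index] || c->tag[index] != tag) {
            /* Miss: fetch the line from flash and replace the slot. */
            c->valid[index] = 1;
            c->tag[index] = tag;
            misses++;
        }
    }
    return misses;
}
```

Under these assumptions, an 8 KiB loop misses 256 times on the first pass and not at all on later passes, so only the first iteration is slow. A 32 KiB loop misses 1024 times on every pass, because the second half keeps evicting the first. A real cache with some associativity would do better than this worst case.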
Most M0 chips I've seen do have a cache; it just sits on the other side of the AHB matrix. It's more integrated into the flash controller than the CPU core.