Performance isn't actually the objective of the Pi cluster; the people using it have a real supercomputer next door. It's a testbed so they can validate programs before transferring them to the expensive supercomputer.
I would imagine going from a 10-node to a 100-node system is more complicated overall than going from 32-bit to 64-bit. Sure, the instructions change, but that should basically all be abstracted away by the toolchain. However, job management, allocation, data logistics, queues, cache invalidation, bottlenecks, etc. are all key issues that compound non-linearly with scale.
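To put a rough number on "non-linearly", here is a minimal sketch that counts distinct node pairs, assuming a worst-case all-to-all communication pattern. That pattern and the `pairwise_links` helper are my own illustration, not something I know about how this cluster is actually wired or scheduled.

```python
def pairwise_links(nodes: int) -> int:
    """Count distinct node pairs, assuming every node may talk to every other (hypothetical model)."""
    return nodes * (nodes - 1) // 2


small = pairwise_links(10)   # pairs in a 10-node system
large = pairwise_links(100)  # pairs in a 100-node system

# 10x the nodes gives 110x the potential pairs under this assumption.
print(small, large, large / small)  # prints: 45 4950 110.0
```

Real schedulers and interconnects avoid full all-to-all traffic, so treat this as an upper bound on how coordination can grow, not a measurement.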