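They're making progress. At 5000 FPS, it's not surprising that they can recover audio. But from 60 FPS, that's striking. That works because some imagers don't take the whole frame at once: they read it out row by row (a rolling shutter), so each row samples the scene at a slightly different time.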
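To put a rough number on that, here is a back-of-the-envelope sketch. It assumes an idealized sensor whose row readout is spread evenly over the whole frame period with no gap between frames, which real cameras generally don't do, so treat the result as an upper bound. The function name and the 1080-row figure are made up for illustration.

```python
def rolling_shutter_row_rate(fps, rows):
    """Rows read out per second by an idealized rolling-shutter sensor.

    Assumption: readout fills the entire frame period, so each row is
    one time sample and there is no dead time between frames.
    """
    if fps <= 0 or rows <= 0:
        raise ValueError("fps and rows must be positive")
    return fps * rows


def row_interval_seconds(fps, rows):
    """Time between consecutive row readouts under the same assumption."""
    return 1.0 / rolling_shutter_row_rate(fps, rows)
```

Under those assumptions, `rolling_shutter_row_rate(60, 1080)` comes out to 64800 row samples per second, which is why a 60 FPS video can in principle carry audio-band information.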
Almost all consumer cameras have rolling shutters. In fact, for this experiment, the crappier the camera, the better, as the rolling-shutter effect is less pronounced in a lot of higher-end cameras. I suspect they might even get better sound recovery from a GoPro than from a DSLR.
Would it not be hard to interleave the first frame of these videos given different starting times and angles (ignoring camera movement)? It should be easy if the videos have synchronized timestamps, but that might not always be the case.
Any in-frame motion probably allows you to align the videos to the nearest frame after the fact. This is existing technology, and it gives you timestamp-to-frame alignment.
If you are reconstructing sound, you can now fuzz the time alignments to give the maximum signal for the maximum time (non-correlation will damp to random noise quickly). This allows you to pairwise reconstruct time alignments.
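A minimal sketch of that "fuzz the alignment" step for one pair of signals, assuming you already have two recovered 1-D signals at the same sample rate. It is a brute-force search over integer lags that keeps the lag with the highest average product; the name `best_lag` and the `max_lag` parameter are invented here, and a real pipeline would likely use an FFT-based cross-correlation and sub-sample refinement instead.

```python
def best_lag(a, b, max_lag):
    """Return the integer lag in [-max_lag, max_lag] that best aligns b to a.

    For each candidate lag, a[n] is compared with b[n + lag] over the
    overlapping samples, and the lag with the largest mean product wins.
    Misaligned signals average out toward zero, so the peak marks the
    alignment.
    """
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        total, count = 0.0, 0
        for n in range(len(a)):
            m = n + lag
            if 0 <= m < len(b):
                total += a[n] * b[m]
                count += 1
        if count == 0:
            continue  # no overlap at this lag
        score = total / count
        if score > best_score:
            best, best_score = lag, score
    return best
```

Run this on each pair of recordings to get pairwise offsets, which you can then reconcile into one common timeline. The cost is O(len(a) * max_lag) per pair, so it fits the "not cheap, not real-time" caveat.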
At that point, you put them all together and run your detailed analysis.
Now, I didn't say this was EASY. :) Or cheap. Or real-time.