TL;DR - sound is a physical phenomenon: pressure waves push on objects, most visibly on wide, thin, lightweight things like bags. A fast enough video camera with image enhancement can "see" the object vibrating, and that motion can then be translated back into the sound.
Not long ago there was a spate of HN articles about apps that could measure your heart rate via the camera (it watches for & measures subtle changes in your skin color which occur during the pulse cycle). This is exactly the same idea, just with a much faster "pulse".
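A toy sketch of that shared idea, not the researchers' method or any app's: reduce each frame to one number, treat those numbers as samples taken at the frame rate, and look for the strongest frequency. Everything here is an assumption for illustration: the function names are made up, the frames are synthetic, and mean brightness stands in for the far subtler motion analysis a real system would need.

```python
import math

def frame_signal(frames):
    """Reduce each frame (a 2-D list of pixel intensities) to its mean brightness."""
    return [sum(map(sum, f)) / (len(f) * len(f[0])) for f in frames]

def dominant_frequency(samples, fps):
    """Return the strongest non-DC frequency in Hz, using a naive DFT."""
    n = len(samples)
    mean = sum(samples) / n
    centered = [s - mean for s in samples]  # remove the constant (DC) level
    best_bin, best_power = 0, 0.0
    for k in range(1, n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(centered))
        im = sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(centered))
        power = re * re + im * im
        if power > best_power:
            best_bin, best_power = k, power
    return best_bin * fps / n  # bin index -> Hz

def synthetic_frames(freq_hz, fps, n_frames, amplitude=0.5, size=4):
    """Hypothetical test input: flat frames whose brightness wobbles at freq_hz."""
    frames = []
    for i in range(n_frames):
        v = 128 + amplitude * math.sin(2 * math.pi * freq_hz * i / fps)
        frames.append([[v] * size for _ in range(size)])
    return frames

# A slow "pulse": 1.2 Hz (72 bpm) filmed at an ordinary 30 fps for 10 s.
pulse_hz = dominant_frequency(frame_signal(synthetic_frames(1.2, 30, 300)), 30)

# A fast "pulse": a 440 Hz tone needs a much higher frame rate (2200 fps here).
tone_hz = dominant_frequency(frame_signal(synthetic_frames(440, 2200, 220)), 2200)
```

The only thing that changes between the two cases is the frame rate: a camera can only capture frequencies below half its frame rate, which is why the heart rate works at 30 fps and audio needs a high-speed camera.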
I expect the researchers will next turn to the "rolling shutter" (a "that's not a bug, it's a feature!" of cell phone cameras): since it reads the sensor row by row, each row samples a slightly different instant, which should let them extract the audio info without the need for high-framerate cameras. atomatica found a perfect example: http://youtu.be/TKF6nFzpHBU?t=10s