Wow - how does it do that?! Image recognition techniques alone (as far as I know) couldn't do that, right? So how does it know how far away something is?
My guess is that they estimate one or more dominant scene planes from the sparse triangulated feature points and recover metric scale by incorporating accelerometer measurements.
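To make the scale part of that guess concrete, here is a rough sketch (not how any particular app actually does it). A single camera only gives you positions up to an unknown scale factor, while the accelerometer reports acceleration in m/s^2, so you can fit the factor that makes the two agree. Everything here is an assumption for illustration: the function names are made up, motion is along one axis, gravity has already been removed from the accelerometer samples, and both signals are sampled at the same evenly spaced times.

```python
def second_differences(positions, dt):
    """Finite-difference acceleration from evenly sampled 1-D positions."""
    return [
        (positions[i - 1] - 2.0 * positions[i] + positions[i + 1]) / (dt * dt)
        for i in range(1, len(positions) - 1)
    ]


def estimate_scale(visual_positions, accel_samples, dt):
    """Least-squares scale s minimizing sum((s * a_vis - a_imu)^2).

    visual_positions: camera positions from vision, in arbitrary units.
    accel_samples: accelerometer readings in m/s^2 (gravity removed),
        one per interior sample of visual_positions.
    """
    a_vis = second_differences(visual_positions, dt)
    if len(a_vis) != len(accel_samples):
        raise ValueError("expected one accelerometer sample per interior position")
    num = sum(v * a for v, a in zip(a_vis, accel_samples))
    den = sum(v * v for v in a_vis)
    if den == 0.0:
        # No acceleration in the visual track means scale cannot be observed.
        raise ValueError("camera did not accelerate; scale is unobservable")
    return num / den


def metric_distance(unitless_distance, scale):
    """Convert a distance measured in visual units into metres."""
    return scale * unitless_distance


# Synthetic example: the camera moves with constant 0.4 m/s^2 acceleration,
# and the visual track is the same motion shrunk by an unknown factor of 2.5.
dt = 0.1
true_positions = [0.2 * (i * dt) ** 2 for i in range(11)]   # metres
visual_positions = [p / 2.5 for p in true_positions]        # arbitrary units
accel_samples = [0.4] * 9                                   # m/s^2

scale = estimate_scale(visual_positions, accel_samples, dt)
plane_distance_m = metric_distance(0.8, scale)  # 0.8 visual units to the plane
```

Once the scale is known, any distance measured in the visual reconstruction (for example, from the camera to a fitted plane) can be multiplied by it to get metres. Note that this only works while the device is actually accelerating. Real accelerometer data is also noisy and biased, so a practical system would presumably filter it rather than use a one-shot fit like this.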