"This was a few years ago so sensors may have improved"
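Apparently they're much better than they used to be. When I first tried this around 3 years ago, I had the same problems you did. Double integration amplifies the errors: a constant bias in the acceleration reading turns into a position error that grows with the square of the elapsed time. Software Kalman filters and other tricks didn't really help.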
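To give a feel for how fast the drift builds up, here is a minimal sketch that double-integrates nothing but a constant accelerometer bias using simple Euler steps. The function name `position_drift`, the 0.01 m/s^2 bias, and the 1 kHz sample rate are all made-up values for illustration, not measurements from any real sensor.

```python
def position_drift(bias, dt, duration):
    """Position error (m) from double-integrating a constant
    acceleration bias (m/s^2) with Euler steps of size dt (s)."""
    velocity = 0.0
    position = 0.0
    steps = round(duration / dt)
    for _ in range(steps):
        velocity += bias * dt      # first integration: acceleration -> velocity
        position += velocity * dt  # second integration: velocity -> position
    return position


if __name__ == "__main__":
    # Hypothetical numbers: 0.01 m/s^2 bias sampled at 1 kHz.
    for seconds in (1, 10, 60):
        drift = position_drift(bias=0.01, dt=0.001, duration=seconds)
        print(f"{seconds:>3} s -> {drift:.3f} m of drift")
```

The result tracks roughly 0.5 * bias * t**2, so ten times the duration gives about a hundred times the position error. Real sensors also have noise and a bias that wanders, which this sketch ignores.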
A lot of the newer sensor chips now do "sensor fusion"[1] on the chip itself, which means that the data they output is much more useful and requires almost no post-processing.
I assume that this, combined with higher sampling rates, means that the errors are small enough for the output to be useful over "human interaction" time scales.