I wrote up some of my experiments attempting to do what you are describing. In the post I explain why you can't simply use a 2D array of an audio file. You can find my post here:
I am by no means an expert in this area, and a few people have since told me I did a few stupid things in my analysis, but you might find it interesting.