| Disclaimer: I am in the research group at Ayasdi. What we do is simple in many ways. But before I make specific comments I'd like to point out that simplicity is a virtue, not a demerit. To pull from another area of mathematics - the derivative is a simple idea - it's just the slope of a tangent line. And yet, much of modern physics is the result of understanding and applying this idea. With simplicity, the difficulty is understanding how and why you use the technique rather than describing the technique itself. When you say "simplistic notion of topological invariance" I think you may be talking about persistent homology, which is not currently the primary product sold by Ayasdi. I disagree that homology is simplistic - it's one of the central tools in modern mathematics. In any case, we sidestep the whole notion of inferring the neighborhood structure and focus on creating meaningful (open) covers of the data. Instead of finding a neighborhood structure (which you can think of as a particular choice of covering of your data by metric balls) we create (open) covers that summarize some important aspect of the data. I mean summary in a technical sense that is beyond the scope of this comment. (I have some videos on YouTube that address this and other issues. I think the most recent is the best, but there is some important material in the earlier videos that I don't repeat in the later ones: https://www.youtube.com/results?search_query=anthony+bak I apologize for the vanity post.) I can briefly describe one technique for generating covers that we use instead of neighborhood structure, but the details of how this fits into the bigger picture are best seen from materials on the Ayasdi resources page or from the videos above. In the simplest (and most common) case we use a function on the data to get a map from the data to the real line. Using the function, we induce a cover on our data from a cover on the real line by taking inverse images of the sets in the real-line cover. 
This data cover is too coarse for most purposes, so we break large sets into smaller sets by clustering within each inverse image set. In this way we build a useful cover of the space. To finish, we calculate the "nerve" of this cover to convert our (complex, high-dimensional) space into a combinatorial object called a simplicial complex that "remembers" geometric and topological features of the original space (while forgetting others). Why this is the right thing to do is covered in the video and on the resources page. It's OK to be confused about the "why?". I don't see how this is a parameter sweep, and I suspect again that you're talking about persistent homology, not the nerve construction described here. Stepping back for a moment, topological spaces comprise a far richer set of spaces than manifolds. In my experience real-world data almost never looks like it's sampled from a manifold, and the tools we need to describe what is happening need far less rigidity than those coming from manifold learning. In particular, we want tools that make as few assumptions (manifold, homogeneity, statistical) as possible. As far as shape goes, topology has (arguably) the most relaxed notion of shape in mathematics - so the fewest assumptions are needed to study the shape of the data. As an aside, I'll mention that in the video I show an example of using topology a la Ayasdi to do manifold learning. We find a Klein bottle glued inside a sphere along a singular set. One of the reasons this is a nice example is that it was also solved using tools from manifold learning - but the methods required knowing the local structure of the singularity. None of those assumptions were used in the topological reconstruction - we didn't need to know ahead of time what we were looking for. I go through a bunch of examples in the videos of using these ideas on actual data. I believe I show some examples of telco churn data, insurance fraud, and mobile phone Parkinson's detection. 
These comprise a small selection of what you can find poking around the Ayasdi resources page but go well beyond the NKI cancer data set you refer to (although I personally find that example compelling). Finally, I also like the other approaches you mention, but I see them as complementary, not oppositional, to the topological approach - and I think generally speaking there is a fair bit of overlap between the various communities. In particular, the Hodge theoretic analysis on simplicial complexes is really nice, and Yuan Yao, who is a coauthor on the Hodge Ranking paper, was a postdoc with Gunnar Carlsson (Ayasdi cofounder); I was just visiting him in Beijing. I regularly talk with collaborators of Larry Wasserman and his graduate students. Looking further afield, Vin de Silva, one of the inventors of isomap, was also a postdoc with Gunnar Carlsson and is very active in the Topological Data community. In fact, on a technical level, as I mentioned above, the topological framework can use manifold learning techniques (such as isomap) to help create the topology used as part of the nerve construction. So the fields cross-fertilize technical results as well. Yes, like you say, there are exciting things going on, and we are constantly integrating those ideas into our product. What all of these methods share is a desire to bring a richer geometric (broadly speaking) toolset to modern data problems. I see particular value in the topological approach but support other work on bringing geometry to data. |
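For readers who want something concrete, the cover-and-nerve procedure described in the comment above (function to the real line, cover by intervals, inverse images, clustering within each inverse image, nerve) can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not Ayasdi's implementation: the names `interval_cover`, `single_linkage`, and `mapper_graph` are hypothetical, the clustering step is a plain distance-threshold single linkage standing in for whatever clustering is actually used, and only the vertices and edges (the 1-skeleton) of the nerve are computed.

```python
import math
from itertools import combinations


def interval_cover(lo, hi, n_intervals, overlap):
    """Cover [lo, hi] with equal-length intervals; neighbors overlap by
    the given fraction of an interval's length."""
    length = (hi - lo) / (n_intervals - (n_intervals - 1) * overlap)
    step = length * (1 - overlap)
    return [(lo + i * step, lo + i * step + length) for i in range(n_intervals)]


def single_linkage(indices, points, eps):
    """Split indices into connected components of the graph that joins
    points closer than eps (an assumed stand-in for any clustering method)."""
    remaining = set(indices)
    clusters = []
    while remaining:
        seed = remaining.pop()
        component, frontier = {seed}, [seed]
        while frontier:
            i = frontier.pop()
            near = {j for j in remaining
                    if math.dist(points[i], points[j]) < eps}
            remaining -= near
            component |= near
            frontier.extend(near)
        clusters.append(frozenset(component))
    return clusters


def mapper_graph(points, filter_fn, n_intervals=4, overlap=0.5, eps=0.5):
    """Return (nodes, edges): nodes are clusters of point indices, and an
    edge joins two nodes whenever their clusters share a point."""
    values = [filter_fn(p) for p in points]
    cover = interval_cover(min(values), max(values), n_intervals, overlap)
    tol = 1e-9  # guard against rounding at the interval endpoints
    nodes = []
    for a, b in cover:
        # Inverse image of one interval under the filter function.
        preimage = [i for i, v in enumerate(values) if a - tol <= v <= b + tol]
        # Refine the coarse cover set by clustering inside it.
        nodes.extend(single_linkage(preimage, points, eps))
    # 1-skeleton of the nerve: connect cover sets that intersect.
    edges = {(s, t) for s, t in combinations(range(len(nodes)), 2)
             if nodes[s] & nodes[t]}
    return nodes, edges


# Toy data: 40 points on the unit circle, filtered by x-coordinate.
circle = [(math.cos(2 * math.pi * k / 40), math.sin(2 * math.pi * k / 40))
          for k in range(40)]
nodes, edges = mapper_graph(circle, lambda p: p[0], eps=0.3)
print(len(nodes), "nodes,", len(edges), "edges")
```

On this toy input the resulting graph is a single closed loop, which is the sense in which the nerve "remembers" the shape of the circle. The interval count, overlap, and `eps` are illustrative choices; different values give coarser or finer summaries.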
The history of data analysis indicates exactly the opposite. Methods are often not shown to work theoretically until years after they are accepted as working well in real-world data analysis. Lots of popular methods may not even come with theoretical guarantees, and lots of theoretical guarantees are useless or misleading because they depend on assumptions about the data which are rarely true, or have other issues.
But I have to say you're still just sidestepping how you actually do the neighborhood learning now by calling it open covers (I just used the term neighborhood structure to keep it less jargony for this audience). How do you map the text documents mentioned in the article to real space? If you are just integrating isomaps and other standard techniques and the added value is the simplicial complex vis, that is fine, but you aren't developing any new math.
The procedure you are describing is similar to hierarchical clustering and will suffer from similar sensitivity to the selection of the initial splits and any parameters. Manifold forests, for example, are also hierarchical but use a bagged ensemble to partially address this. I'd like to see more public work from you guys on combating this sort of overfitting and sensitivity... I just picked on persistent homology because it is one of the only things topology seems to add in this area.
This stuff is really cool and has generated useful results. If you guys just want to be the main consulting company for doing manifold learning, that is great, but marketing articles like this that try to claim you're the exclusive purveyors of some new math are turning a large portion of the community away.