| HN Mirror


by abak 4393 days ago

The process I outlined is sometimes called "Mapper" and is from published literature:

http://comptop.stanford.edu/u/preprints/mapperPBG.pdf

The more general concept of calculating the nerve of a covering is standard topology material where the technique is used to create combinatorial representations of topological spaces. http://en.wikipedia.org/wiki/Nerve_of_an_open_covering
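To make the nerve construction concrete, here is a minimal sketch for a finite cover, given as a list of Python sets. The function name `nerve` and the `max_dim` cutoff are illustrative choices of mine, not anything taken from the Mapper paper or from any product: a group of cover sets spans a simplex exactly when the sets have a point in common.

```python
from itertools import combinations

def nerve(cover, max_dim=2):
    """Return the simplices of the nerve of `cover`, up to dimension `max_dim`.

    `cover` is a list of sets. A simplex is a tuple of indices into `cover`
    whose sets have a non-empty common intersection.
    """
    simplices = []
    for k in range(1, max_dim + 2):  # k sets span a (k-1)-dimensional simplex
        for idx in combinations(range(len(cover)), k):
            if set.intersection(*(cover[i] for i in idx)):
                simplices.append(idx)
    return simplices

# Three sets that overlap pairwise but have no common point.
cover = [{1, 2, 3}, {3, 4}, {4, 5, 1}]
simplices = nerve(cover, max_dim=2)
print(simplices)
```

For this cover the result has three vertices and three edges but no filled triangle, so the combinatorial object is a loop even though no single set in the cover "sees" the loop.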

TDA is a framework, not an example or a method. It applies anywhere that you can define a notion of distance or similarity along with a function (not necessarily continuous) on your data. It's hard to think of a data situation where that doesn't apply. That's why we can handle so many different data types. The method has very few assumptions or requirements.

If you are saying that for some specific example you saw somewhere (or maybe it's in the article?) we didn't tell you the metric and function I'm afraid I can't help you - I am not myself familiar with the details of the text analysis you're referring to. But generally speaking we are open about exactly what metric and function we've used.

I am not obfuscating how we create covers (what you're calling neighborhood learning) - in fact I spelled it out exactly. We pull back the cover from the real line (period!). On the real line we typically cover so that either all sets in the cover are the same size or they all contain the same number of points (in the software you pick the number of sets in the cover, their percentage overlap, and whether you want them to be the same size on the real line or to contain approximately the same number of data points). How we cover the real line is a choice and isn't part of what TDA is. There are other reasonable choices.
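The two covering choices and the pull-back can be sketched in a few lines. This is only an illustration under my own assumptions: the function names, the exact definition of "overlap" (the fraction of an interval's length shared with the next interval), and the rank-based way of balancing point counts are all hypothetical and are not the formulas from the software.

```python
import math

def uniform_cover(values, n_sets, overlap):
    """Cover [min, max] with n_sets intervals of equal length.

    `overlap` in [0, 1) is the fraction of each interval's length that it
    shares with the next interval (an assumed definition).
    """
    lo, hi = min(values), max(values)
    # n intervals of length L stepped by L*(1-overlap) must span hi - lo.
    length = (hi - lo) / (1 + (n_sets - 1) * (1 - overlap))
    step = length * (1 - overlap)
    return [(lo + i * step, lo + i * step + length) for i in range(n_sets)]

def balanced_cover(values, n_sets, overlap):
    """Cover with intervals holding roughly equal numbers of points.

    Builds a uniform cover over sorted ranks, then maps ranks back to values.
    """
    s = sorted(values)
    n = len(s)
    rank_intervals = uniform_cover([0, n - 1], n_sets, overlap)
    return [(s[int(a)], s[min(n - 1, math.ceil(b))]) for a, b in rank_intervals]

def pull_back(intervals, f_values):
    """For each interval, the set of point indices whose function value lies in it."""
    return [{i for i, v in enumerate(f_values) if a <= v <= b}
            for a, b in intervals]

# Function values with one extreme outlier.
values = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
equal_size = uniform_cover(values, n_sets=3, overlap=0.5)
equal_count = balanced_cover(values, n_sets=3, overlap=0.5)
print([len(s) for s in pull_back(equal_size, values)])
print([len(s) for s in pull_back(equal_count, values)])
```

With skewed function values like these, equal-size intervals put almost every point in one set, while the equal-count version spreads the points evenly, which is one reason both options are worth having.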

I think compared to most companies we are in fact open and transparent. The part of the software that isn't open primarily deals with how we've scaled this algorithm to deal with large datasets. The "mathy" parts are all either published or we give customers the formulas in our documentation.

I agree that in the real world we use things that aren't proven to work. Typically there is some range of applicability but we can't pin down the requirements for when a technique will work and when it won't - for instance, we can prove something for Gaussian noise but in practice it works more broadly.

We don't use advanced math to obfuscate this issue and we certainly confront this issue on a daily basis. The details of exactly where our methods apply are complicated (what function? how general a space can we use?) and to some degree unknown, but since topology itself has very flexible foundations we believe that many (most) data problems lend themselves to being understood using these methods. This is clear without needing a formal proof of applicability.

I think one of the reasons that there's confusion is that TDA doesn't neatly fit into existing analytical boxes. It's not clustering, dimensionality reduction, manifold learning or feature selection. It's different enough that when you say it sounds like hierarchical clustering I think you're missing the point of what a topological summary is. We are also not doing manifold learning (we do not need underlying manifold assumptions).

The way I think of it is as a framework for analytics. In the simplest case it takes a metric space and a real-valued function and produces a geometric summary. You can create the topology using any method you like (similarity spaces, standard metrics, manifold learning, metric learning, social network graph distance, some other method you know or invent) and you can use any function you might know or invent (mean of point coordinates, distance from separating hyperplane in an SVM, local curvature estimates, age of person, page rank of a node in a graph etc.). As long as the metric and the function are "sensible" for your data you'll extract some geometric/topological truth about your data.
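As a rough illustration of "metric space plus real-valued function in, geometric summary out", here is a toy Mapper-style sketch. Everything in it is an assumption made for the example: the names `mapper_graph` and `_components`, the hand-picked intervals, and the use of single-linkage clustering at a fixed threshold `eps` inside each pulled-back set. It is not the production algorithm and says nothing about how that one scales.

```python
import math
from itertools import combinations

def _components(members, points, dist, eps):
    """Single-linkage clusters: connected components of the eps-neighbor graph."""
    unseen = set(members)
    clusters = []
    while unseen:
        stack = [unseen.pop()]
        cluster = set(stack)
        while stack:
            i = stack.pop()
            near = {j for j in unseen if dist(points[i], points[j]) <= eps}
            unseen -= near
            cluster |= near
            stack.extend(near)
        clusters.append(frozenset(cluster))
    return clusters

def mapper_graph(points, dist, f, intervals, eps):
    """Nodes are clusters inside each pulled-back interval; edges join
    clusters that share at least one data point."""
    nodes = []
    for a, b in intervals:
        members = [i for i, p in enumerate(points) if a <= f(p) <= b]
        nodes.extend(_components(members, points, dist, eps))
    edges = [(u, v) for u, v in combinations(range(len(nodes)), 2)
             if nodes[u] & nodes[v]]
    return nodes, edges

# Twelve points on the unit circle; the function is the x-coordinate.
points = [(math.cos(math.radians(30 * k)), math.sin(math.radians(30 * k)))
          for k in range(12)]
intervals = [(-1.1, -0.4), (-0.6, 0.6), (0.4, 1.1)]  # overlapping cover of [-1, 1]
nodes, edges = mapper_graph(points, math.dist, lambda p: p[0], intervals, eps=0.6)
print(len(nodes), len(edges))  # → 4 4
```

The middle interval pulls back to two separate arcs (top and bottom of the circle), so the summary has four nodes joined in a cycle: the graph recovers the loop in the data. Swapping in a different `dist` or `f` changes the summary without changing the construction.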

It's not that we have a new method like a new kind of SVM or a different kind of neural network. We have a very generic way of extracting the (up till now) overlooked geometric/topological content of existing methods. This geometric information adds fidelity to existing methods (like all good frameworks, it not only uses but improves existing methods). This is one way to think about what Ayasdi does.

(As an aside I'll point out that higher fidelity with existing methods also means that you can build more accurate models and we are currently automating the process of model improvement using TDA).

The kinds of methods you describe (manifold forests, hierarchical clustering etc.) are all methods and for the most part you can input them into the topological framework as well. TDA doesn't fix all of your problems (parameter selection, overfitting etc.) but instead gives you more fidelity and geometric information about what you've chosen to do. We've built many standard methods into the software but allow you to extend them with your own custom ones if you choose.

In terms of "new math" or not, I guess I don't really understand the complaint. Articles on TDA are published in peer-reviewed math and science journals and are written by mathematicians (some prominent mathematicians doing and publishing on TDA: Gunnar Carlsson, Stanford Math; Robert Ghrist, UPenn Math; Shmuel Weinberger, UChicago Math; John Harer, Duke Math; Robert MacPherson, Institute for Advanced Study; Konstantin Mischaikow, Rutgers Math). People from Ayasdi publish and present results to other mathematicians who consider it new. For example, Gunnar just published a review article describing TDA in Acta Numerica: http://journals.cambridge.org/action/displayJournal?jid=ANU (subscription required). Some of this he invented, some others published elsewhere, but people in the community consider it new.

Using topology to understand point cloud data is new - methods aren't copied from existing mathematical literature even if we are inspired by results in geometry, topology, algebraic topology, and algebraic geometry. Just because we work by analogy doesn't make this not new math - in fact - I would argue that's how most "new math" is created across all of mathematics.

As a postscript I would add that we also come up with new metrics and functions to solve problems we come across. But this is not the core of what TDA is.

Edit: Added point on data types in 3rd paragraph.