|
I am the person behind the original NTCIR math datasets, and probably behind most of the research-produced MathML out there. We recently reported that we have converted more than 350 million formulas from arXiv to MathML, together with the rest of the papers as HTML5. As someone who has stared at arXiv TeX/LaTeX for years, I can testify that you don't want to be looking at TeX math in actual LaTeX documents: there is a lot more going on in there than the toy formula syntax used on the web. And as someone who has also worked on math search engines and math-rich NLP for a few years, I find that complaining about having a structured, machine-parseable representation for mathematics and wanting TeX instead sounds naive. On one hand, the MathML formulas in the datasets can already preserve the source TeX (the TeX annotations may even be there, I can't remember right now), should you need it directly. On the other hand, you can use any structured method, such as those used by content-based search engines like MathWebSearch, or handpick any relevant information from the MathML tree and feed it into a statistical algorithm, as done for example by the WebMIAS search engine. The most fundamental thing to understand if you're doing research on automated processing of human mathematics is that formulas are two-dimensional objects best represented as trees, be they layout trees describing the presentation, operator trees describing the content, or some hybrid tree that tries to do both (such as LaTeXML's XMath spec). |
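To make the two options above concrete, here is a minimal sketch in Python using only the standard library. It assumes the formula carries its source TeX in an `<annotation encoding="application/x-tex">` element inside `<semantics>`; whether the NTCIR files actually include that annotation is not confirmed here. The sample formula and the helper names `source_tex` and `leaf_tokens` are made up for illustration.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"

# Hypothetical sample formula: a presentation tree plus a TeX annotation.
SAMPLE = """
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <semantics>
    <mrow>
      <msup><mi>x</mi><mn>2</mn></msup>
      <mo>+</mo>
      <mi>y</mi>
    </mrow>
    <annotation encoding="application/x-tex">x^{2}+y</annotation>
  </semantics>
</math>
"""

def source_tex(math):
    """Return the TeX annotation of a <math> element, or None if absent."""
    for ann in math.iter(f"{{{MATHML_NS}}}annotation"):
        # Assumption: the source TeX is tagged with this encoding value.
        if ann.get("encoding") == "application/x-tex":
            return (ann.text or "").strip()
    return None

def leaf_tokens(math):
    """Handpick identifier, number and operator leaves in document order."""
    wanted = {"mi", "mn", "mo"}
    tokens = []
    for node in math.iter():
        local = node.tag.rsplit("}", 1)[-1]  # drop the "{namespace}" prefix
        if local in wanted and node.text:
            tokens.append((local, node.text.strip()))
    return tokens

math = ET.fromstring(SAMPLE.strip())
print(source_tex(math))
print(leaf_tokens(math))
```

The first helper recovers the TeX string if it is present; the second flattens the tree into typed tokens, which is one simple way to feed MathML into a statistical pipeline without writing a full parser.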
2. I wish the NTCIR corpus were not so difficult to download (I once sent a request for it, but no one replied). Please make it publicly accessible, just like MIaS does: https://mir.fi.muni.cz/mias/
3. My search engine (http://tkhost.github.io/opmes) actually uses a structural method, but I still gave up on MathML and parse TeX directly instead. Why? In TeX I can just omit irrelevant commands like "\color" and "\mbox" and focus only on a handful of math-related TeX commands, and the results are great. My search engine can only handle "toy formula syntax", but it may be better than MathWebSearch (https://zbmath.org/formulae/) and may even beat Tangent (http://saskatoon.cs.rit.edu/tangent/random) on long queries. With MathML, I have no idea why I should have to read its lengthy spec, and I see no reason to write a MathML parser.
The NTCIR-Math conference (and its unfriendly website) makes me unwilling to submit even a single paper.
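The idea in point 3 of skipping commands like "\color" and "\mbox" can be sketched roughly as follows. This is not the actual OPMES code, just a minimal Python illustration; the two command sets, the choice to drop the argument of "\color" but keep the contents of "\mbox", and the function names are all assumptions.

```python
import re

# Assumption: illustrative sets only, not the list any real engine uses.
DROP_WITH_ARG = {"color"}  # \color{red}: drop the command and its argument
UNWRAP = {"mbox"}          # \mbox{if}: drop the command, keep the contents

# Control words, control symbols, braces, or any other single visible character.
TOKEN = re.compile(r"\\[A-Za-z]+|\\.|[{}]|[^\\{}\s]")

def tokenize(tex):
    return TOKEN.findall(tex)

def skip_group(tokens, i):
    """Given tokens[i] == '{', return the index just past the matching '}'."""
    depth = 0
    while i < len(tokens):
        if tokens[i] == "{":
            depth += 1
        elif tokens[i] == "}":
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    return i  # unbalanced input: consume to the end

def strip_irrelevant(tex):
    """Return the token list with presentation-only commands removed."""
    tokens = tokenize(tex)
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        name = tok[1:] if tok.startswith("\\") else None
        if name in DROP_WITH_ARG:
            i += 1
            if i < len(tokens) and tokens[i] == "{":
                i = skip_group(tokens, i)
            continue
        if name in UNWRAP:
            i += 1
            continue
        out.append(tok)
        i += 1
    return out

print("".join(strip_irrelevant(r"{\color{red} x}^2 + \mbox{if } y")))
# prints: {x}^2+{if}y
```

A real TeX front end also has to deal with macros, environments and optional arguments, which is where the gap between web-style formulas and arXiv sources shows up.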