by ga6840
3721 days ago
I am building a project and doing research on math-aware search (the project is hosted at https://github.com/t-k-/the-day-after-tomorrow).
As for search engines for math, it is a pity that MathML has become the standard "input" in mainstream research.
The most famous conference on math search, NTCIR, actually publishes its main dataset/corpus in MathML. Converting MathML back into LaTeX is possible but error-prone for most moderately complex expressions (I tried it using pandoc, the Haskell tool).
This forces math-aware search engines to include a MathML parser. And since most digital math documents are still written in LaTeX,
a math search engine also needs another tool (e.g. LaTeXML) to convert LaTeX into much lengthier MathML.
As a researcher in this field, all I see is that MathML brings a lot of overhead to our lives. I think LaTeX is still the ideal way to "input" math expressions: it is human-friendly and the most commonly used math input format.
Web standards, meanwhile, should focus on "rendering" LaTeX.
I should point out that I am pretty comfortable with what MathJax provides, but if there needs to be a web standard for math,
I wish that some day the standard way to write a math expression in HTML would be something like this: <math> x = \frac{-b \pm \sqrt{b^2 - 4ac}} {2a} </math> instead of:
<math display="block">
<mrow>
<mi>x</mi>
<mo>=</mo>
<mfrac>
<mrow>
<mo>−</mo>
<mi>b</mi>
<mo>±</mo>
<msqrt>
<mrow>
<msup>
<mi>b</mi>
<mn>2</mn>
</msup>
<mo>−</mo>
<mn>4</mn>
<mi>a</mi>
<mi>c</mi>
</mrow>
</msqrt>
</mrow>
<mrow>
<mn>2</mn>
<mi>a</mi>
</mrow>
</mfrac>
</mrow>
</math>
|
As someone who has stared at arXiv TeX/LaTeX for years, I can testify that you don't want to be looking at TeX math in actual LaTeX documents; there is a lot more that goes in there beyond the toy formula syntax used on the web.
As someone who has also worked on math search engines and math-rich NLP for a few years, I find that complaining about having a structured, machine-parseable representation for mathematics and wanting TeX instead sounds naive. On the one hand, the MathML formulas in the datasets could already preserve the source TeX (the TeX annotations may even be there, I can't remember right now), should you need it directly. On the other hand, you can use structured methods, such as the ones used by content-based search engines like MathWebSearch, or handpick any relevant information from the MathML tree and feed it into a statistical algorithm, as done for example by the WebMIAS search engine.
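To make the "handpick information from the MathML tree" idea concrete, here is a minimal sketch in Python using only the standard library. It is not how MathWebSearch or WebMIAS actually work; the names `MATHML`, `LEAF_TAGS` and `leaf_tokens` are made up for illustration, and the input is a shortened fragment of the quadratic formula (with an ASCII minus sign). It just collects the leaf tokens in document order so they could be counted or fed to a statistical model.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical input: a shortened presentation-MathML fragment.
MATHML = """
<math display="block">
  <mrow>
    <mi>x</mi><mo>=</mo>
    <mfrac>
      <mrow><mo>-</mo><mi>b</mi></mrow>
      <mrow><mn>2</mn><mi>a</mi></mrow>
    </mfrac>
  </mrow>
</math>
"""

# Presentation-MathML token elements: identifiers, numbers, operators.
LEAF_TAGS = {"mi", "mn", "mo"}


def leaf_tokens(xml_text):
    """Return (tag, text) pairs for every token element, in document order."""
    root = ET.fromstring(xml_text)
    tokens = []
    for node in root.iter():
        # Drop an XML namespace prefix such as "{http://...}mi" if present.
        tag = node.tag.split("}")[-1]
        if tag in LEAF_TAGS and node.text:
            tokens.append((tag, node.text.strip()))
    return tokens


tokens = leaf_tokens(MATHML)
counts = Counter(tokens)  # bag-of-tokens features for a statistical model
print(tokens)
```

The point is only that the structured form lets you pick out identifiers, numbers and operators without writing a TeX parser first; a real system would of course keep far more structure than a flat bag of tokens.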
The most fundamental thing to understand if you're doing research on automated processing of human mathematics is that formulas are two-dimensional objects best represented as trees, be they layout trees describing the presentation, operator trees describing the content, or some hybrid tree that tries to do both (such as LaTeXML's XMath spec).
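As a rough illustration of the layout-tree view, the sketch below walks a presentation-MathML tree recursively and linearizes it back into a TeX-like string. It is a toy under stated assumptions: the function name `to_latex` is made up, only a handful of elements (`math`, `mrow`, `mi`, `mn`, `mo`, `mfrac`, `msqrt`, `msup`) are handled, and anything else raises an error, which also hints at why the reverse conversion gets error-prone on real-world formulas.

```python
import xml.etree.ElementTree as ET


def to_latex(node):
    """Linearize a small subset of presentation MathML into a TeX-like string."""
    tag = node.tag.split("}")[-1]  # ignore an XML namespace prefix, if any
    kids = [to_latex(child) for child in node]
    if tag in ("mi", "mn", "mo"):
        # Token elements are the leaves of the layout tree.
        return (node.text or "").strip()
    if tag in ("math", "mrow"):
        # Horizontal grouping: children laid out left to right.
        return " ".join(kids)
    if tag == "mfrac":
        return "\\frac{%s}{%s}" % (kids[0], kids[1])
    if tag == "msqrt":
        return "\\sqrt{%s}" % " ".join(kids)
    if tag == "msup":
        return "{%s}^{%s}" % (kids[0], kids[1])
    raise ValueError("unsupported element: " + tag)


# Hypothetical sample formula: (a + b) / sqrt(c^2)
sample = ET.fromstring(
    "<math><mfrac>"
    "<mrow><mi>a</mi><mo>+</mo><mi>b</mi></mrow>"
    "<msqrt><msup><mi>c</mi><mn>2</mn></msup></msqrt>"
    "</mfrac></math>"
)
print(to_latex(sample))
```

Each branch of the function corresponds to one kind of node in the tree, which is the sense in which the two-dimensional layout is "just" a tree; the linear TeX string is one possible serialization of it.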