by ga6840
3725 days ago
1. In the NTCIR (main) dataset, I see many cases where <m:math> does not contain an alttext attribute (and thus no TeX). I asked the LaTeXML author, Bruce Miller <bruce.miller@nist.gov>, about this, and he said LaTeXML always puts the original TeX string in the alttext attribute of <m:math>. So I assume you guys are using some outdated LaTeXML version? I really want to plead with NTCIR to ensure the original LaTeX annotation is kept in the main dataset, or to provide both MathML and LaTeX versions of the corpus so researchers can freely choose. This would allow LaTeX-only math search engines to compare results with MathML search engines. You know it is hard to convert all of the formulas back into LaTeX correctly.
2. I wish the NTCIR corpus were not so difficult to download (I once wrote a request for the NTCIR corpus, but no one replied). Please make it publicly accessible, just like MIaS does: https://mir.fi.muni.cz/mias/
3. My search engine (http://tkhost.github.io/opmes) actually uses a structural method, but I still gave up on MathML and parse TeX directly instead. Why? In TeX I can just omit irrelevant commands like "\color" and "\mbox" and focus only on a small math-related subset of TeX, and the results are great. My search engine can only handle "toy formula syntax", but maybe it is better than MathWebSearch (https://zbmath.org/formulae/) and can even beat Tangent (http://saskatoon.cs.rit.edu/tangent/random) on long queries.
But in MathML, I have no idea why I need to read its lengthy spec, and I see no reason to write a MathML parser. The NTCIR-math conference (and its unfriendly website) makes me unwilling to submit a single paper.
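To make point 3 concrete, here is a rough sketch of what "omitting irrelevant commands" could look like. It is not the actual code of my search engine; the function names and the choice to drop each ignored command together with its brace argument are assumptions made for illustration only.

```python
# Hypothetical sketch: drop selected TeX commands (and their {...} argument)
# before handing the formula to a math-only tokenizer.

IGNORED = ("\\color", "\\mbox")  # assumed list; extend as needed


def skip_group(tex, i):
    """Return the index just past the brace group that starts at tex[i] == '{'."""
    depth = 0
    while i < len(tex):
        if tex[i] == "\\" and i + 1 < len(tex):
            i += 2  # skip an escaped character or the start of a command name
            continue
        if tex[i] == "{":
            depth += 1
        elif tex[i] == "}":
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    return i  # unbalanced input: consume to the end


def strip_ignored(tex):
    """Copy tex, leaving out every ignored command and its argument."""
    out = []
    i = 0
    while i < len(tex):
        for cmd in IGNORED:
            end = i + len(cmd)
            # Match whole command names only (so \colorbox is left alone).
            if tex.startswith(cmd, i) and not tex[end:end + 1].isalpha():
                i = end
                while i < len(tex) and tex[i].isspace():
                    i += 1
                if i < len(tex) and tex[i] == "{":
                    i = skip_group(tex, i)
                break
        else:
            if tex[i] == "\\" and i + 1 < len(tex) and not tex[i + 1].isalpha():
                out.append(tex[i:i + 2])  # keep control symbols such as \\ or \{
                i += 2
            else:
                out.append(tex[i])
                i += 1
    return "".join(out)
```

For example, `strip_ignored(r"\color{red} a + \mbox{where } \frac{b}{c}")` leaves only `a + \frac{b}{c}` plus some whitespace. Note that this only works on the kind of simple TeX subset described above; it does not expand macros.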
|
2. There are annoying copyright issues with making the datasets available for public use. We're working with arXiv to resolve that, but it's out of our control for now. It's a long-standing frustration of mine that the datasets can't simply be made public.
3. You can omit anything you like from the MathML; it is in no way inferior to omitting from TeX. "but maybe it is better than MWS" - prove it: submit to NTCIR and beat everyone. Also, being better than MWS is not an argument that MWS should be denied the very data it needs to run. At the same time, you can still obtain whatever degradation you need from the presentation MathML. Refusing to recognize any claim to correctness other than your own, without any substantive proof, is not a reasonable position, and I urge you to reconsider.
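As a rough illustration of omitting things from MathML, the sketch below simplifies a parsed tree with Python's standard `xml.etree.ElementTree`. The `UNWRAP` and `DROP` sets and the tuple output format are assumptions chosen for the example, not what MWS or any NTCIR system actually does.

```python
import xml.etree.ElementTree as ET

# Assumed, illustrative choices of "irrelevant" presentation elements.
UNWRAP = {"mstyle", "mpadded"}   # drop the wrapper, keep its children
DROP = {"mphantom", "mspace"}    # drop the whole subtree


def local(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def simplify(elem):
    """Return a list of (name, text, children) tuples for elem, minus the noise."""
    name = local(elem.tag)
    if name in DROP:
        return []
    children = [node for child in elem for node in simplify(child)]
    if name in UNWRAP:
        return children
    return [(name, (elem.text or "").strip(), children)]
```

Calling `simplify(ET.fromstring(xml_string))` on a formula wrapped in a colored `mstyle` yields the same tree as the uncolored formula, which is the MathML counterpart of ignoring "\color" in TeX.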
"I have no idea why I need to read its lengthy spec, and I see no reason to write a MathML parser."
You don't need to write a parser: you can use an off-the-shelf XML/HTML5 parser and handle the MathML reliably and appropriately. In fact, you can reuse the one from any open source math search engine, MWS included. Writing a TeX parser, on the other hand, is something I will always roll my eyes at, since actual real-world TeX is not something you can "parse", or do anything with reliably, unless you have a full TeX implementation underneath. That is 1000x harder than using a parser to deal with MathML.
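For what it's worth, here is a minimal sketch of the off-the-shelf route using only Python's standard library. It assumes the document is well-formed XML (XHTML) with MathML in the usual namespace; the function name `formulas` and the choice of leaf elements are hypothetical.

```python
import xml.etree.ElementTree as ET

M = "{http://www.w3.org/1998/Math/MathML}"  # MathML namespace in ElementTree form
LEAVES = {M + "mi", M + "mn", M + "mo"}      # identifier, number, operator tokens


def formulas(xml_text):
    """Yield (alttext, tokens) for every math element in a well-formed document.

    alttext is None when the attribute is missing, which is exactly the
    situation complained about in point 1 above.
    """
    root = ET.fromstring(xml_text)
    for math in root.iter(M + "math"):
        tokens = [(el.tag[len(M):], (el.text or "").strip())
                  for el in math.iter() if el.tag in LEAVES]
        yield math.get("alttext"), tokens
```

Each formula comes back as its TeX annotation (when present) plus a flat token list, with no spec reading and no hand-written parser involved. Pages that are HTML5 rather than XML would need an HTML5 parser instead.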
Finally, whining about NTCIR's UI being imperfect as a reason not to submit is just childish.