by ga6840 3725 days ago
1. In the NTCIR (main) dataset, I see many cases where <m:math> does not contain an alttext attribute (and thus no TeX). I asked the LaTeXML author, Bruce Miller <bruce.miller@nist.gov>, about this, and he said LaTeXML always puts the original TeX string in an alttext attribute on <m:math>. So I assume you are using an outdated LaTeXML version? I really want to plead with NTCIR to ensure the original LaTeX annotation is kept in the main dataset, or to provide both MathML and LaTeX versions of the corpus so researchers can choose freely. This would allow LaTeX-only math search engines to compare results with MathML search engines. As you know, it is hard to convert all of the MathML back into LaTeX correctly.

2. I wish the NTCIR corpus were not so difficult to download (I once sent a request for the corpus, but no one replied). Please make it publicly accessible, as MIaS does: https://mir.fi.muni.cz/mias/

3. My search engine (http://tkhost.github.io/opmes) actually uses a structural method, but I still gave up on MathML and parse TeX directly instead. Why? In TeX I can simply omit irrelevant commands like "\color" and "\mbox" and focus on a handful of math-related TeX commands, and the results are great. My search engine can only handle a "toy formula syntax", but maybe it is better than MathWebSearch (https://zbmath.org/formulae/) and even beats Tangent (http://saskatoon.cs.rit.edu/tangent/random) on long queries. But in MathML, I have no idea why I need to read its lengthy spec, and I see no reason to write a MathML parser.
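To illustrate the kind of filtering described in point 3, here is a minimal Python sketch. It is not the OPMES parser; the `IGNORED` set, the function names, and the decision to drop each ignored command together with one braced argument are all assumptions made for this example.

```python
import re

# Hypothetical ignore list: each command is dropped together with one braced argument.
IGNORED = {"color", "mbox"}

# Control words (\frac), control symbols (\,), braces, or any single non-space character.
TOKEN = re.compile(r"\\[A-Za-z]+|\\.|[{}]|[^\\{}\s]")


def tokenize(tex):
    """Split a TeX math string into control sequences, braces and single characters."""
    return TOKEN.findall(tex)


def skip_group(tokens, i):
    """If tokens[i] opens a brace group, return the index just past its matching '}'."""
    if i >= len(tokens) or tokens[i] != "{":
        return i
    depth = 0
    while i < len(tokens):
        if tokens[i] == "{":
            depth += 1
        elif tokens[i] == "}":
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    return i  # unbalanced input: skip to the end


def strip_ignored(tex):
    """Return the token list with ignored commands and their arguments removed."""
    tokens = tokenize(tex)
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("\\") and tok[1:] in IGNORED:
            i = skip_group(tokens, i + 1)
            continue
        out.append(tok)
        i += 1
    return out


print(strip_ignored(r"\color{red} x^2 + \mbox{for all } y"))
```

The remaining tokens would then be handed to whatever grammar builds the expression tree. A real parser would also need to handle macros, environments and optional arguments, which this sketch ignores.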

The NTCIR-math conference (and its unfriendly website) makes me unwilling to submit a single paper.

1 comment

1. Correct, the dataset was generated back in 2013 and will probably be regenerated for the next NTCIR issue.

2. There are annoying copyright issues with making the datasets available for public use. We're working with arXiv to resolve that, but it's out of our control for now. It's a long-lasting frustration of mine that the datasets can't simply be made public.

3. You can omit anything you like from the MathML; that is in no way inferior to omitting it from TeX. "but maybe it is better than MWS" - prove it, submit to NTCIR, and beat everyone. Also, being better than MWS is not an argument that MWS should be denied the very data it needs to run. At the same time you can still obtain whatever degradation you need from the presentation MathML. Failing to recognize any claim to correctness other than your own, without any substantive proof, is not a reasonable position and I urge you to reconsider.

"I have no idea why I need to read its lengthy spec, and I see no reason to write a MathML parser."

You don't need to write a parser, you can use an off-the-shelf parser for XML/HTML5 and handle the MathML reliably and appropriately. In fact you can reuse that from any open source search engine for math, MWS included. Writing a TeX parser on the other hand is something I will always roll my eyes at, since actual real world TeX is not something you can "parse", or do anything with reliably, unless you have a full TeX implementation underneath. Which is 1000x harder than using a parser to deal with MathML.
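As a rough illustration of the off-the-shelf approach, the following Python sketch uses only the standard library's `xml.etree.ElementTree`. The `UNWRAP` set, the function names, and the sample formula are assumptions made for this example; they are not taken from MWS or from the NTCIR data.

```python
import xml.etree.ElementTree as ET

# Hypothetical choice of presentational wrappers to splice out of the tree.
UNWRAP = {"mstyle", "mpadded"}


def local_name(elem):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return elem.tag.rsplit("}", 1)[-1]


def simplify(elem):
    """Convert an element into (name, text, children), splicing out wrappers."""
    children = []
    for child in elem:
        node = simplify(child)
        if node[0] in UNWRAP:
            children.extend(node[2])  # keep the wrapper's children, drop the wrapper
        else:
            children.append(node)
    return (local_name(elem), (elem.text or "").strip(), children)


def load_formula(xml_source):
    """Return (alttext or None, simplified tree) for one <math> element."""
    root = ET.fromstring(xml_source)
    return root.get("alttext"), simplify(root)


SAMPLE = (
    '<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" alttext="x^2">'
    '<m:mstyle mathcolor="red"><m:msup><m:mi>x</m:mi><m:mn>2</m:mn></m:msup></m:mstyle>'
    '</m:math>'
)

alttext, tree = load_formula(SAMPLE)
print(alttext)
print(tree)
```

`load_formula` returns `None` for the first value when the `alttext` attribute is absent, which is also a simple way to count formulas that carry no TeX source.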

Finally, whining about NTCIR's UI being imperfect as a reason not to submit is just childish.

Thank you for answering my first two questions; now I understand NTCIR's problem.

At first I tried to compare my results (MAP, recall, precision) with those of NTCIR participants, but it took a lot of effort to get the dataset, and after that I found I could not convert MathML back into TeX with any confidence. Most importantly, the tree structure my parser generates is fine-tuned for and very dependent on TeX input. I cannot just take the MathML tree structure directly; that needs much more effort than importing an existing XML parser. Because of this, I cannot compare my results with mainstream NTCIR researchers. I definitely tried very hard, but sadly I gave up. If NTCIR can someday provide TeX data for the competition (even if a request is needed), I will be able and willing to compare my results with NTCIR participants (in order to "prove" it).

Writing a TeX parser only for math search is not that difficult. I have written one, and it parses most user-created content on math.stackexchange.com. Although I cannot convince you that I get better results, I can argue that parsing the TeX subset relevant to search is nearly effortless (if you only care about math-related TeX); I even open-sourced my search engine's TeX parser. Again, the problem is that it is not that easy to grab an XML parser and reuse it in my project. I believe a good math-aware search engine needs a tree structure very different from the one MathML represents. You get a tree by reusing the MWS parser, so WHAT? That tree is not the tree I want, and converting it takes a lot of effort. The easy way for me would be to convert MathML back into TeX (since I already build my tree from TeX), but sadly that turned out to be too complicated to be worth a shot.

Lastly, my refusal to submit a paper is more than childish complaining about NTCIR. I gave up putting unworthy, duplicated effort into implementing a MathML parser that generates the expression tree I need (this step is the most difficult part, not parsing the XML itself) and instead focused on finding another conference to publish my work. It turns out my paper (a demo) was accepted at ECIR 2016, so I am glad I did not spend too much time on NTCIR; otherwise I would have missed ECIR.