by ddumas 4334 days ago

This looks nice. What I'd really like to see, along these lines, is a Python library for automated document metadata extraction with confidence assessment, like this:

./autometa.py --author --verbose academic-paper.pdf

Author: "Edward Witten" Confidence: High (matches template "amslatex")
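To make the idea concrete, here is a minimal sketch of how such a tool might attach a confidence label to an author guess. Everything in it is hypothetical: the `Guess` class, the `guess_author` function, and the two toy templates (`byline` and `line-after-title`) are invented for illustration and are not part of textract or any existing library. It also assumes the first page of the document has already been extracted to plain text.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Two to four capitalized words, e.g. "Edward Witten" (a deliberately crude heuristic).
NAME = r"[A-Z][A-Za-z.'-]*(?: [A-Z][A-Za-z.'-]*){1,3}"

# Hypothetical template 1: an explicit "By ..." or "Author: ..." line.
BYLINE = re.compile(rf"^(?:By|Author:)\s+({NAME})\s*$", re.MULTILINE)
# Hypothetical template 2: a line that consists of nothing but a name.
NAME_ONLY = re.compile(rf"^{NAME}$")


@dataclass
class Guess:
    value: str
    confidence: str  # "High" or "Low" in this sketch
    template: str    # name of the rule that produced the guess

    def report(self) -> str:
        return (f'Author: "{self.value}" Confidence: {self.confidence} '
                f'(matches template "{self.template}")')


def guess_author(first_page_text: str) -> Optional[Guess]:
    """Return the best author guess for the given text, or None."""
    match = BYLINE.search(first_page_text)
    if match:
        # An explicit byline is strong evidence.
        return Guess(match.group(1), "High", "byline")

    lines = [ln.strip() for ln in first_page_text.splitlines() if ln.strip()]
    if len(lines) >= 2 and NAME_ONLY.match(lines[1]):
        # A bare name right after the title is only weak evidence.
        return Guess(lines[1], "Low", "line-after-title")

    return None


if __name__ == "__main__":
    guess = guess_author("A Paper Title\nBy Edward Witten\n\nAbstract ...")
    print(guess.report() if guess else "No author found")
```

A real tool would need many more templates and a better notion of confidence than two fixed labels, but the shape (try the rules from strongest to weakest and report which one fired) is the part I care about.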

3 comments

deanmalmgren 4334 days ago

I thought about the metadata thing but decided to exclude it from the earliest versions of textract to keep things simple. If you'd like to see it in there and have a good example of how you'd like to use metadata, please feel free to open an issue on the issue tracker: https://github.com/deanmalmgren/textract/issues/


kalkin 4334 days ago

As far as I have been able to tell, the public state of the art in academic paper metadata parsing is Grobid: https://github.com/kermitt2/grobid

It's not quite as simple a command-line interface as you suggest, but it's not too hard to set up, and it's pretty impressive. Now if only Google Scholar would open-source whatever they use...


emillon 4334 days ago

For video files, guessit does something similar using only the file name:

http://guessit.readthedocs.org/
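To give a flavor of what guessing from the file name alone means, here is a rough standard-library sketch. It is not guessit's implementation or API: the function `guess_from_filename`, the two regular expressions, and the sample file names are all made up for illustration, and a real library handles far more naming conventions.

```python
import re

# Hypothetical pattern: "<title> S02E05 ..." with dots, spaces, underscores or dashes as separators.
EPISODE = re.compile(
    r"^(?P<title>.+?)[ ._-]+[Ss](?P<season>\d{1,2})[Ee](?P<episode>\d{1,2})\b"
)
# Hypothetical pattern: "<title> 2009 ..." where the year starts with 19 or 20.
MOVIE = re.compile(r"^(?P<title>.+?)[ ._(-]+(?P<year>(?:19|20)\d{2})\b")


def clean(title: str) -> str:
    """Turn separator characters into spaces and trim the result."""
    return re.sub(r"[._]+", " ", title).strip()


def guess_from_filename(filename: str) -> dict:
    """Guess basic video metadata from a file name, without opening the file."""
    stem = filename.rsplit(".", 1)[0]  # drop the extension, if any

    match = EPISODE.match(stem)
    if match:
        return {
            "type": "episode",
            "title": clean(match["title"]),
            "season": int(match["season"]),
            "episode": int(match["episode"]),
        }

    match = MOVIE.match(stem)
    if match:
        return {
            "type": "movie",
            "title": clean(match["title"]),
            "year": int(match["year"]),
        }

    return {"type": "unknown", "title": clean(stem)}


if __name__ == "__main__":
    for name in ["Some.Show.S02E05.720p.mkv", "A.Movie.2009.1080p.mp4", "notes.txt"]:
        print(name, "->", guess_from_filename(name))
```

The episode pattern is tried first because an episode file name can also contain a year, and the more specific match should win.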
