Show HN: Papermint – Use a simple description to extract key phrases from PDFs | HN Mirror


Show HN: Papermint – Use a simple description to extract key phrases from PDFs (papermintai.com)

4 points by dabs 779 days ago

Hi all! My friend and I (MIT alumni and longtime engineers) created Papermint (papermintai.com), a web application that allows you to easily extract multiple key phrases from any PDF document that has searchable text.

No consultation needed: you can click the link and start experimenting.

What are key phrases? Phrases in a document that fall under some "type" or "category". For example, with a restaurant menu, key phrases could be "item names", "item prices", "item descriptions", etc.

Easy to use: Papermint does not require annotating multiple documents or writing complex rules to extract data. To achieve good extraction accuracy, all Papermint needs is the name and description of your key phrase. E.g. "item names": "Names of items that can be ordered in a restaurant menu, e.g. 'Slice of Pizza', 'Coke', etc."
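
To make the name-plus-description idea concrete, here is a minimal sketch of how such a definition could be represented and turned into a zero-shot instruction. The post does not describe Papermint's actual input format or internals, so `KeyPhraseSpec` and `build_prompt` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyPhraseSpec:
    """Hypothetical container for one key phrase type: a name plus a description."""
    name: str
    description: str


def build_prompt(specs, document_text):
    """Render the specs and the document text into one zero-shot instruction."""
    lines = ["Extract every phrase of the following types from the document."]
    for spec in specs:
        lines.append(f"- {spec.name}: {spec.description}")
    lines.append("")
    lines.append("Document:")
    lines.append(document_text)
    return "\n".join(lines)


# The example definition from the post, applied to a toy two-line menu.
item_names = KeyPhraseSpec(
    name="item names",
    description="Names of items that can be ordered in a restaurant menu, "
                "e.g. 'Slice of Pizza', 'Coke', etc.",
)
prompt = build_prompt([item_names], "Slice of Pizza 3.50\nCoke 1.25")
print(prompt)
```

The point of the sketch is that the user supplies only a name and a description; no annotated documents or extraction rules appear anywhere in the input.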

How is this different from existing tools? Current tools cannot extract all key phrases when their number varies across documents. They also do not work very well with document types they don’t directly support without a lot of manual annotation.

Please reach out to me with any questions or feedback. I'm excited to hear how you use it!

1 comments

throwaway888abc 779 days ago

But you can do this with ChatGPT? What am I missing here? (confused)

dabs 778 days ago

Good question!

ChatGPT certainly makes it easier to implement a prototype that works quite well for documents meeting certain conditions. For example, it does pretty well with restaurant menus out of the box because the entities extracted tend to have fairly unique text.

However, with documents that have a lot of repetition or complex tabular structures, even the latest ChatGPT isn't enough. It struggles to capture the structure of the table in its output, and it struggles when the same text appears in several places in the document.

This is where a hybrid system can yield a much better end product: one that keeps the zero-shot strengths of ChatGPT but also leverages strong priors and conditioning from well-chosen heuristics.
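
As a rough illustration of what one such heuristic could look like (an assumption for the sake of example, not a description of Papermint's implementation), the sketch below takes phrases a model has returned in reading order and maps each one to a distinct position in the document. The hypothetical `align_in_order` function addresses the repeated-text problem: the second "1.25" is matched to the second occurrence instead of the first.

```python
def align_in_order(document_text, phrases):
    """Map each phrase to a character offset, scanning left to right.

    Assumes the phrases are listed in reading order, so a repeated phrase
    is matched to its next unused occurrence rather than always the first.
    Returns a list of (phrase, offset) pairs; offset is None if not found.
    """
    results = []
    cursor = 0
    for phrase in phrases:
        offset = document_text.find(phrase, cursor)
        if offset == -1:
            # Leave the cursor where it is so later phrases can still match.
            results.append((phrase, None))
            continue
        results.append((phrase, offset))
        cursor = offset + len(phrase)
    return results


document = "Coke 1.25\nDiet Coke 1.25\nWater 1.00"
# Stand-in for model output: prices in reading order (hypothetical data).
llm_prices = ["1.25", "1.25", "1.00"]
aligned = align_in_order(document, llm_prices)
print(aligned)  # [('1.25', 5), ('1.25', 20), ('1.00', 31)]
```

The reading-order assumption is the "prior" here: the model supplies the strings, and a simple positional rule decides which occurrence each string refers to.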

Currently the implementation is more on the LLM-heavy side, but our plan is to iterate and add more of these heuristics to get a tool that is more robust across different document types.