Hacker News
by casvc 3157 days ago
Thanks for asking. The main difference is a focus on depth instead of breadth: instead of a multitude of possible output formats, we support only a few (PDF/HTML/TXT/IMG), but with some added features. Just a few examples:
- bulk search and autoredaction (marking / blacking out parts of documents that match certain queries)
- signature and handwriting detection
- tokenization (for TXT output)
- language detection (for TXT/PDF output)
- named entity detection (for TXT/PDF output)
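
To make the autoredaction idea concrete, here is a minimal sketch for plain TXT output only. It is an illustration, not the product's actual implementation: the `redact` function, its parameters, and the literal, case-insensitive query matching are all assumptions made for the example.

```python
import re

def redact(text, queries, mask="█"):
    """Black out every case-insensitive match of any query string.

    Hypothetical helper for illustration only; returns the redacted
    text plus the (start, end) character spans that were masked.
    """
    # Try longer queries first so overlapping terms prefer the longest match.
    terms = sorted((q for q in queries if q), key=len, reverse=True)
    if not terms:
        return text, []
    # Queries are treated as literal strings, not regular expressions.
    pattern = re.compile("|".join(re.escape(q) for q in terms), re.IGNORECASE)
    spans = [m.span() for m in pattern.finditer(text)]
    # Replace each match with a mask of the same length to preserve layout.
    redacted = pattern.sub(lambda m: mask * len(m.group()), text)
    return redacted, spans

redacted, spans = redact(
    "Contact John Smith at john@example.com",
    ["john smith", "john@example.com"],
)
print(redacted)
print(spans)
```

Keeping the mask the same length as the match means the character offsets stay valid, so the returned spans could in principle be mapped back onto a PDF or HTML rendering of the same text.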

Potential customers are people developing systems for GDPR (data protection), fraud detection, eDiscovery and content management.


If you are doing some kind of intensive annotation, probably the most important thing is having an output format that supports the annotation you want to do -- not necessarily supporting every format.

I have been thinking about universal annotation and the formats that I find the most interesting are PDF (because so much content exists in PDF) and HTML (open, easy to work with.)

You are absolutely right - we are thinking along the same lines. The only reason we offer TXT/IMG as output formats next to PDF/HTML is that some people will have their own composite document formats, and they can build those out of TXT/IMG.