
by konfuzio 1929 days ago

Hi Walter,

thanks for your questions! We have updated the post and included the answers to your questions.

- Of course, this step can be omitted if the documents already contain embedded text. However, OCR is often still necessary, for example to read tables or scanned documents. In our software, users can decide per project whether to use the embedded text, Tesseract, or a commercial OCR engine.

- By page segmentation, also called layout analysis, we mean the division of a document into separate parts.

- This is done with a model we trained ourselves, because we couldn't achieve the results we needed with off-the-shelf software like Tesseract or ABBYY FineReader.
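
To make the per-project choice in the first point concrete, here is a minimal sketch. It is only an illustration, not our actual code: the names `TextSource` and `pick_text_source` are hypothetical, and falling back to Tesseract when a document has no embedded text is an assumption made for the example.

```python
from enum import Enum


class TextSource(Enum):
    """Hypothetical per-project options for where page text comes from."""
    EMBEDDED = "embedded"              # text layer already stored in the file
    TESSERACT = "tesseract"            # open-source OCR
    COMMERCIAL_OCR = "commercial_ocr"  # any commercial OCR engine


def pick_text_source(project_setting, has_embedded_text,
                     fallback=TextSource.TESSERACT):
    """Return the text source to use for one document.

    Assumption: if the project prefers embedded text but the document
    (e.g. a scan) has none, we fall back to an OCR engine.
    """
    if project_setting is TextSource.EMBEDDED and not has_embedded_text:
        return fallback
    return project_setting
```

For a scanned document in a project set to embedded text, `pick_text_source(TextSource.EMBEDDED, has_embedded_text=False)` would return the fallback OCR option; in every other case the project setting is used unchanged.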

1 comment

WalterGR 1927 days ago

Thanks for the follow-up.

Did you see this, posted earlier today? It looks like the actual data isn't available yet, however.

https://news.ycombinator.com/item?id=26339769

Wit: Wikipedia-Based Image Text Dataset (github.com/google-research-datasets)
