| HN Mirror


by hugodutka 700 days ago

I used this approach extensively over the past couple of months with GPT-4 and GPT-4o while building https://hotseatai.com. Two things that helped me:

1. Prompt with examples. I included an example image together with its expected transcription as part of the prompt. This reduced the number of transcription mistakes GPT made.

2. Confidence score. I extracted the embedded text from the PDF and compared the frequency of character triples in the source text and GPT’s output. If there was a significant difference (less than 90% overlap) I would log a warning. This helped detect cases when GPT omitted entire paragraphs of text.
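A minimal sketch of the confidence check in point 2, not the author's actual code. The normalization (lowercasing, dropping non-alphanumerics), the multiset-intersection definition of "overlap", and the names `trigram_overlap` and `check_page` are assumptions made for illustration; only the 90% threshold and the warning come from the comment.

```python
import logging
from collections import Counter


def trigram_counts(text: str) -> Counter:
    """Count overlapping character triples in normalized text (assumed normalization)."""
    cleaned = "".join(ch for ch in text.lower() if ch.isalnum())
    return Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))


def trigram_overlap(source_text: str, model_output: str) -> float:
    """Fraction of the source's trigram occurrences that also occur in the output."""
    source = trigram_counts(source_text)
    output = trigram_counts(model_output)
    total = sum(source.values())
    if total == 0:
        return 1.0  # no embedded text to compare against
    shared = sum((source & output).values())  # multiset intersection
    return shared / total


def check_page(source_text: str, model_output: str, threshold: float = 0.9) -> float:
    """Log a warning when the overlap falls below the threshold (90% in the comment)."""
    score = trigram_overlap(source_text, model_output)
    if score < threshold:
        logging.warning("Low trigram overlap: %.2f (possible omitted text)", score)
    return score
```

Because the score is measured against the source's trigrams, a dropped paragraph lowers it sharply, while small formatting differences in the output barely move it.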

3 comments

themanmaran 699 days ago

One option we've been testing is the `maintainFormat` mode. It tries to return the markdown in a consistent format by passing the output of the prior page in as additional context for the next page. This is especially useful if you've got tables that span pages. The flow is pretty much:

- Request #1 => page_1_image

- Request #2 => page_1_markdown + page_2_image

- Request #3 => page_2_markdown + page_3_image
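The request flow above could be sketched as a sequential loop. This is an illustration only, not the library's implementation: `transcribe_pages` and the `transcribe` callable (which stands in for whatever sends one page image plus optional prior markdown to the model) are hypothetical names.

```python
from typing import Callable, List, Optional, Sequence


def transcribe_pages(
    page_images: Sequence[bytes],
    transcribe: Callable[[bytes, Optional[str]], str],
) -> List[str]:
    """Transcribe pages in order, feeding each page's markdown into the next request."""
    pages_markdown: List[str] = []
    previous_markdown: Optional[str] = None  # the first request has no prior context
    for image in page_images:
        markdown = transcribe(image, previous_markdown)
        pages_markdown.append(markdown)
        previous_markdown = markdown  # only the immediately preceding page is passed on
    return pages_markdown
```

One consequence of this design is that pages can no longer be processed in parallel, since each request depends on the result of the one before it.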


sidmitra 699 days ago

>frequency of character triples

What are character triples? Are they trigrams?


hugodutka 699 days ago

I think so. I'd normalize the text first: lowercase it and remove all non-alphanumeric characters. E.g., for the phrase "What now?" I'd create these trigrams: wha, hat, atn, tno, now.
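The normalization and trigram step described here might look like this in Python. The function name `trigrams` is made up for the example, and `str.isalnum` is one assumed reading of "non-alphanumeric".

```python
def trigrams(text: str) -> list:
    """Lowercase, strip non-alphanumeric characters, and return overlapping 3-character windows."""
    cleaned = "".join(ch for ch in text.lower() if ch.isalnum())
    return [cleaned[i:i + 3] for i in range(len(cleaned) - 2)]


print(trigrams("What now?"))  # ['wha', 'hat', 'atn', 'tno', 'now']
```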


nbbaier 698 days ago

> I extracted the embedded text from the PDF

What did you use to extract the embedded text during this step? Was it something other than another OCR tool?


hugodutka 697 days ago

PyMuPDF, a PDF library for Python.
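For reference, a short sketch of pulling the embedded text layer with PyMuPDF. The function name `extract_embedded_text` is hypothetical, and the comment does not say how the library was actually called; this uses only `fitz.open` and `Page.get_text`.

```python
def extract_embedded_text(pdf_path: str) -> list:
    """Return the embedded text of each page, in page order."""
    # Imported inside the function so the module still loads without PyMuPDF installed.
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        # get_text() with no arguments returns the page's plain text.
        return [page.get_text() for page in doc]
```

Pages that are pure scans have no text layer, so they come back as empty strings and would need a different reference for comparison.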


jimmySixDOF 695 days ago

A different approach from vanilla OCR/parsing seems to be ColPali [1], which combines a purpose-built small vision model with ColBERT-style indexing for retrieval. So, if search is the intended use case, it can skip the whole OCR step entirely.

[1] https://huggingface.co/blog/manu/colpali
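To illustrate what "ColBERT type indexing" means in practice, here is a toy sketch of ColBERT-style late-interaction (MaxSim) scoring: each query-token vector is matched to its best-scoring page-patch vector and the maxima are summed. The vectors below are invented, and a real setup would get them from the model rather than by hand; `maxsim_score` is a name chosen for this example.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], page_vecs: Sequence[Vector]) -> float:
    """Sum, over query vectors, of the best dot product against any page vector."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


# Toy 2-D embeddings: two query tokens and two candidate pages with two patches each.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0]]
page_b = [[1.0, 0.0], [0.5, 0.5]]

ranked = sorted(
    {"page_a": page_a, "page_b": page_b}.items(),
    key=lambda item: maxsim_score(query, item[1]),
    reverse=True,
)
print([name for name, _ in ranked])  # ['page_a', 'page_b']
```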
