by gillesjacobs
296 days ago
Extracting structure and elements from HTML should be trivial, and there are probably multiple libraries for it in your programming language of choice. Be happy you have machine-readable semantic documents; that's the best-case scenario in NLP. I used to convert the chunks to Markdown because it was more token-efficient and LLMs are often heavily preference-trained on Markdown, but I'm not sure that still matters with current input pricing and LLM performance gains.

If you have scanned documents, last I checked Gemini Flash was very good cost/performance-wise for document extraction. Mistral OCR claims better performance in their own benchmarks, but people I know who have used it, and other benchmarks, beg to differ. Personally I use Azure Document Intelligence a lot for its bounding-box feature, but Gemini Flash apparently has this covered too.

https://getomni.ai/blog/ocr-benchmark

Sidenote: What you want for RAG is not OCR in the sense of just extracting text. The task for RAG preprocessing is typically called Document Layout Analysis or End-to-End Document Parsing/Extraction. Good RAG is multimodal and aware of semantic document structure and layout, so your pipeline needs to extract and recognize text sections, headers/footers, images, and tables. When working with PDFs, you want accurate bounding boxes in your metadata so you can point your users to the retrieved sources, etc.
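To make the HTML case from the start of this comment concrete, here is a rough sketch of turning HTML into Markdown chunks using only Python's standard-library `html.parser`. The names `MarkdownChunker` and `html_to_chunks` are made up for illustration, and it only handles headings, paragraphs, and list items; a real pipeline would likely use a dedicated library and also cover tables, images, and nested structures.

```python
from html.parser import HTMLParser


class MarkdownChunker(HTMLParser):
    """Collects headings, paragraphs, and list items as Markdown chunks.

    Hypothetical helper for illustration; inline tags such as <b> or <a>
    are flattened to plain text, and tables/images are ignored.
    """

    HEADINGS = {"h1": "#", "h2": "##", "h3": "###", "h4": "####"}
    BLOCKS = {"p", "li"} | set(HEADINGS)

    def __init__(self):
        super().__init__()
        self.chunks = []   # list of {"type": tag, "markdown": text}
        self._tag = None   # block-level tag currently being collected
        self._buf = []     # text fragments seen inside that block

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCKS:
            self._tag, self._buf = tag, []

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return  # inline or unrelated closing tag, keep collecting
        text = " ".join("".join(self._buf).split())  # normalize whitespace
        if text:
            if tag in self.HEADINGS:
                markdown = f"{self.HEADINGS[tag]} {text}"
            elif tag == "li":
                markdown = f"- {text}"
            else:
                markdown = text
            self.chunks.append({"type": tag, "markdown": markdown})
        self._tag = None


def html_to_chunks(html):
    """Parse an HTML string and return its block-level Markdown chunks."""
    parser = MarkdownChunker()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Calling `html_to_chunks("<h2>Setup</h2><p>Install <b>it</b>.</p>")` yields one chunk per block, each carrying its element type as metadata, which is the kind of structure you would then index for retrieval.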