by hubraumhugo
718 days ago
Do you also see the ingestion process as the key challenge for many RAG systems to avoid "garbage in, garbage out"? How does R2R handle accurate data extraction for complex and diverse document types? We have a customer who has hundreds of thousands of unstructured and diverse PDFs (containing tables, forms, checkmarks, images, etc.), and they need to accurately convert these PDFs into markdown for RAG usage. Traditional OCR approaches fall short in many of these cases, so we've started using a combined multimodal LLM + OCR approach that has led to promising accuracy and consistency at scale (ping me if you want to give this a try). The RAG system itself is not a big pain point for them, but the accurate and efficient extraction and structuring of the data is.
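For anyone wondering what a combined multimodal LLM + OCR setup could look like in code, here is a minimal sketch. It is only one possible shape, not the pipeline described above: the routing rule (use the OCR text when its confidence is high, otherwise hand the page image plus the OCR text as a hint to a multimodal LLM), the `OcrResult` type, the `min_confidence` threshold, and both callables are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class OcrResult:
    text: str
    confidence: float  # hypothetical: mean OCR confidence in [0, 1]


# Hypothetical interfaces: plug in a real OCR engine and a multimodal LLM client.
OcrFn = Callable[[bytes], OcrResult]
LlmFn = Callable[[bytes, str], str]  # (page image, OCR text hint) -> markdown


def page_to_markdown(image: bytes, ocr: OcrFn, llm: LlmFn,
                     min_confidence: float = 0.9) -> str:
    """Convert one page image to markdown, using the LLM only when OCR is unsure."""
    result = ocr(image)
    if result.confidence >= min_confidence:
        # Clean page: the cheap OCR output is trusted as-is.
        return result.text
    # Tables, forms, checkmarks, etc. tend to land here: the LLM sees the
    # image and gets the raw OCR text as a grounding hint.
    return llm(image, result.text)


def document_to_markdown(pages: Iterable[bytes], ocr: OcrFn, llm: LlmFn,
                         min_confidence: float = 0.9) -> str:
    """Convert every page and join the results with blank lines."""
    parts = [page_to_markdown(p, ocr, llm, min_confidence) for p in pages]
    return "\n\n".join(parts)
```

Passing the OCR engine and the LLM in as plain callables keeps the routing logic testable with stubs and avoids tying the sketch to any particular vendor API.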
Try LLMWhisperer [1] on documents with complex layouts -> https://pg.llmwhisperer.unstract.com/
If anyone wants to cover RAG end to end, from loading data from the source through extraction to sending the processed data to a destination/API, try Unstract [2] (it is open source).
[1] https://unstract.com/llmwhisperer/
[2] https://github.com/Zipstack/unstract