LLMWhisperer has some nice tooling here: it can fall back to OCR as well as force text extraction, so it handles both scanned documents and documents whose text is preserved as actual text.
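
For anyone wondering what that kind of fallback looks like in general, here is a rough sketch of the pattern. This is not LLMWhisperer's API. `extract_text`, `read_text_layer`, `run_ocr`, `min_chars`, and `force_ocr` are all hypothetical names I made up for illustration, and the two extractors are passed in as plain callables so you can plug in whatever PDF and OCR libraries you actually use.

```python
from typing import Callable, Tuple


def extract_text(
    path: str,
    read_text_layer: Callable[[str], str],  # hypothetical: reads embedded text
    run_ocr: Callable[[str], str],          # hypothetical: OCRs rendered pages
    min_chars: int = 20,                    # assumed threshold for "has real text"
    force_ocr: bool = False,
) -> Tuple[str, str]:
    """Return (text, method), where method is "text" or "ocr".

    Tries the embedded text layer first and falls back to OCR when the
    layer is missing or too short, which is typical of scanned documents.
    """
    if not force_ocr:
        text = read_text_layer(path).strip()
        if len(text) >= min_chars:
            return text, "text"
    # Either OCR was forced or the text layer was effectively empty.
    return run_ocr(path).strip(), "ocr"


if __name__ == "__main__":
    # Stub extractors standing in for real libraries.
    scanned = lambda p: ""  # a scanned file has no text layer
    ocr = lambda p: "text recovered by OCR"
    print(extract_text("scan.pdf", scanned, ocr))
```

The threshold check matters in practice because some scanned PDFs carry a few stray characters in their text layer, so "non-empty" alone is not a reliable signal.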