|
I’m excited to share Content Extractor with Vision LLM, an open-source Python tool I’ve been working on. It extracts content (text and images) from documents (PDF, DOCX, PPTX) and generates detailed image descriptions using Vision Language Models like Ollama’s llama3.2-vision and OpenAI GPT-4 Vision. The output is clean, structured Markdown, useful for tasks like knowledge management, archiving, or preprocessing content for other AI models. Key Features: Multi-format support: PDF, DOCX, PPTX.
Flexible processing:
Text + Images: Extract text and images separately.
Page as Image: Preserve layouts as high-res images (300 DPI).
Image descriptions via local (Ollama) or cloud-based models (OpenAI).
Modular design (SOLID principles).
Simple CLI and detailed logging.
Tech Stack: Python 3.12
Document processing libraries: PyMuPDF, python-docx, python-pptx
Vision Language Models: Ollama llama3.2-vision, OpenAI GPT-4 Vision
How to Try It: Clone the repo:
bash
Copy code
git clone https://github.com/MDGrey33/content-extractor-with-vision.gi...
cd content-extractor-with-vision
Install dependencies using Poetry.
Start the Ollama server and pull the llama3.2-vision model:
bash
Copy code
ollama serve
ollama pull llama3.2-vision
Run the tool:
bash
Copy code
poetry run python main.py --source ./example_folder --type pdf
What I’d Love to Hear: Feedback on design, features, or use cases.
Suggestions for improving modularity or adding functionality.
Contributions (e.g., testing, documentation, new features).
GitHub Repository: Content Extractor with Vision LLM Looking forward to your thoughts, ideas, or any issues you encounter! Cheers,
Roland Abou Younes |