by MDGrey33 536 days ago
I’m excited to share Content Extractor with Vision LLM, an open-source Python tool I’ve been working on. It extracts content (text and images) from documents (PDF, DOCX, PPTX) and generates detailed image descriptions using Vision Language Models like Ollama’s llama3.2-vision and OpenAI GPT-4 Vision.

The output is clean, structured Markdown, useful for tasks like knowledge management, archiving, or preprocessing content for other AI models.
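
For readers curious what the image-description step might look like against a local model, here is a minimal sketch, not the tool's actual code. It assumes Ollama's default local endpoint (`http://localhost:11434/api/generate`) and its documented `images` field, which takes base64-encoded image data. The function names and the prompt text are hypothetical.

```python
import base64
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_description_request(
    image_bytes: bytes,
    model: str = "llama3.2-vision",
    prompt: str = "Describe this image in detail.",  # hypothetical prompt
) -> dict:
    """Build the JSON payload for a single, non-streaming description request."""
    return {
        "model": model,
        "prompt": prompt,
        # Ollama expects images as a list of base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def describe_image(image_bytes: bytes) -> str:
    """Send the payload to Ollama and return the generated description."""
    body = json.dumps(build_description_request(image_bytes)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        # With "stream": False the reply is one JSON object with a "response" field.
        return json.loads(response.read())["response"]
```

Calling `describe_image(open("figure.png", "rb").read())` would return the description text, provided the Ollama server is running and the model has been pulled. The returned string could then be written into the Markdown output next to the extracted text.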

Key Features:

- Multi-format support: PDF, DOCX, PPTX.
- Flexible processing:
  - Text + Images: Extract text and images separately.
  - Page as Image: Preserve layouts as high-res images (300 DPI).
- Image descriptions via local (Ollama) or cloud-based models (OpenAI).
- Modular design (SOLID principles).
- Simple CLI and detailed logging.

Tech Stack:

- Python 3.12
- Document processing libraries: PyMuPDF, python-docx, python-pptx
- Vision Language Models: Ollama llama3.2-vision, OpenAI GPT-4 Vision

How to Try It:

1. Clone the repo:

     git clone https://github.com/MDGrey33/content-extractor-with-vision.gi...
     cd content-extractor-with-vision

2. Install dependencies using Poetry.

3. Start the Ollama server and pull the llama3.2-vision model:

     ollama serve
     ollama pull llama3.2-vision

4. Run the tool:

     poetry run python main.py --source ./example_folder --type pdf

What I’d Love to Hear:

- Feedback on design, features, or use cases.
- Suggestions for improving modularity or adding functionality.
- Contributions (e.g., testing, documentation, new features).

GitHub Repository: Content Extractor with Vision LLM
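
As a side note on the "Page as Image" mode listed under Key Features: PDF page sizes are measured in points (72 per inch), so rendering at 300 DPI means scaling each page by 300/72. The sketch below only shows that arithmetic. The helper names are hypothetical and this is not taken from the repository.

```python
POINTS_PER_INCH = 72  # PDF user-space units per inch


def render_scale(dpi: int = 300) -> float:
    """Zoom factor that maps PDF points to pixels at the given DPI."""
    return dpi / POINTS_PER_INCH


def pixel_size(width_pt: float, height_pt: float, dpi: int = 300) -> tuple[int, int]:
    """Pixel dimensions of a page of the given size in points, rendered at `dpi`."""
    return (
        round(width_pt * dpi / POINTS_PER_INCH),
        round(height_pt * dpi / POINTS_PER_INCH),
    )


# US Letter is 612 x 792 points (8.5 x 11 inches).
print(pixel_size(612, 792))  # (2550, 3300)
```

With PyMuPDF, a scale like this is typically applied by passing `fitz.Matrix(scale, scale)` to `page.get_pixmap(matrix=...)`. Whether the tool does it this way is an assumption.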

Looking forward to your thoughts, ideas, or any issues you encounter!

Cheers, Roland Abou Younes