Content Extractor – Open-Source Tool for Document and Image Analysis (github.com)
3 points by MDGrey33 536 days ago
4 comments

Hi everyone! I'm excited to announce PyVisionAI, an evolution of the project formerly known as Content Extractor with Vision LLM. It's now installable with pip or Poetry. It's a Python library and CLI tool that extracts text and images from documents and describes images using Vision Language Models.

Key Features:

- Dual functionality: use it as a CLI tool or integrate it as a Python library.
- File extraction: process PDF, DOCX, and PPTX files to extract text and images.
- Image descriptions: generate descriptions with local models (Ollama's llama3.2-vision) or cloud models (OpenAI GPT-4 Vision).
- Markdown output: save results as neatly formatted Markdown files.

Quick Start:

Install via pip:

  pip install pyvisionai

Extract content from a file:

  file-extract -t pdf -s path/to/file.pdf -o output_dir

Describe an image:

  describe-image -i path/to/image.jpg

Repo & Contribution:

GitHub: PyVisionAI https://github.com/MDGrey33/pyvisionai

Whether you're working with complex documents or image-rich data, PyVisionAI simplifies the process. Try it out and share your feedback. I'd love to hear your thoughts!
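If you want to drive the CLI from a script (for example, to batch a folder of documents), here is a minimal sketch that shells out with the same `-t`/`-s`/`-o` flags shown in the Quick Start. It assumes `pyvisionai` is installed so `file-extract` is on your PATH. It also assumes that `-t` accepts `docx` and `pptx` the same way it accepts `pdf`. The helpers `build_extract_cmd` and `run_extract` are hypothetical and not part of PyVisionAI's own Python API, which this sketch deliberately doesn't guess at.

```python
import shutil
import subprocess
from pathlib import Path


def build_extract_cmd(source: Path, file_type: str, output_dir: Path) -> list[str]:
    """Build argv for `file-extract`, using only the -t/-s/-o flags from the Quick Start."""
    return ["file-extract", "-t", file_type, "-s", str(source), "-o", str(output_dir)]


def run_extract(source: Path, output_dir: Path) -> None:
    """Run file-extract on one file, inferring -t from the file extension (an assumption)."""
    if shutil.which("file-extract") is None:
        raise RuntimeError("file-extract not found; install it with `pip install pyvisionai`")
    file_type = source.suffix.lstrip(".").lower()  # e.g. "pdf", "docx", "pptx"
    output_dir.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if the CLI exits with a non-zero status
    subprocess.run(build_extract_cmd(source, file_type, output_dir), check=True)
```

Usage would be something like `for doc in Path("docs").glob("*.pdf"): run_extract(doc, Path("output_dir"))`. Shelling out keeps the script decoupled from the library's internals, at the cost of a process per file.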
I’m excited to share Content Extractor with Vision LLM, an open-source Python tool I’ve been working on. It extracts content (text and images) from documents (PDF, DOCX, PPTX) and generates detailed image descriptions using Vision Language Models like Ollama’s llama3.2-vision and OpenAI GPT-4 Vision.
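For the local path, Ollama serves an HTTP API on `localhost:11434`, and its `/api/generate` endpoint accepts base64-encoded images for vision models. Below is a minimal standard-library sketch that asks llama3.2-vision to describe one image. It illustrates talking to Ollama directly. It is not how Content Extractor itself is implemented. The prompt text and the helper names `build_payload` and `describe_image` are assumptions.

```python
import base64
import json
import urllib.request
from pathlib import Path

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_payload(image_bytes: bytes, prompt: str = "Describe this image in detail.") -> dict:
    """Request body for /api/generate; vision models take base64 strings in `images`."""
    return {
        "model": "llama3.2-vision",
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # ask for one JSON object instead of streamed chunks
    }


def describe_image(path: Path, timeout: float = 300.0) -> str:
    """POST an image to a running `ollama serve` and return the generated description."""
    body = json.dumps(build_payload(path.read_bytes())).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With `ollama serve` running and the model pulled, `print(describe_image(Path("photo.jpg")))` should print the model's description. The generous timeout is there because first-time model loading can be slow.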

The output is clean, structured Markdown, useful for tasks like knowledge management, archiving, or preprocessing content for other AI models.
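The post doesn't show the exact Markdown layout the tool emits. The following is a purely hypothetical illustration of how extracted page text and image descriptions might be combined into one Markdown document. The `Page` dataclass and `to_markdown` function are made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """One extracted page: its text plus descriptions of images on it (hypothetical model)."""
    number: int
    text: str
    image_descriptions: list[str] = field(default_factory=list)


def to_markdown(title: str, pages: list[Page]) -> str:
    """Render a title, one section per page, and each image description as a blockquote."""
    lines = [f"# {title}", ""]
    for page in pages:
        lines += [f"## Page {page.number}", "", page.text.strip(), ""]
        for i, desc in enumerate(page.image_descriptions, start=1):
            lines += [f"> **Image {i}:** {desc.strip()}", ""]
    return "\n".join(lines)
```

Keeping descriptions next to the page they came from is what makes the output easy to chunk for downstream AI models.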

Key Features:

- Multi-format support: PDF, DOCX, PPTX.
- Flexible processing:
  - Text + Images: extract text and images separately.
  - Page as Image: preserve layouts as high-res images (300 DPI).
- Image descriptions via local (Ollama) or cloud-based (OpenAI) models.
- Modular design (SOLID principles).
- Simple CLI and detailed logging.

Tech Stack:

- Python 3.12
- Document processing libraries: PyMuPDF, python-docx, python-pptx
- Vision Language Models: Ollama llama3.2-vision, OpenAI GPT-4 Vision

How to Try It:

1. Clone the repo:

  git clone https://github.com/MDGrey33/content-extractor-with-vision.gi...
  cd content-extractor-with-vision

2. Install dependencies using Poetry (poetry install).

3. Start the Ollama server, then, in another terminal, pull the llama3.2-vision model:

  ollama serve
  ollama pull llama3.2-vision

4. Run the tool:

  poetry run python main.py --source ./example_folder --type pdf

What I'd Love to Hear:

- Feedback on design, features, or use cases.
- Suggestions for improving modularity or adding functionality.
- Contributions (e.g., testing, documentation, new features).

GitHub Repository: Content Extractor with Vision LLM

Looking forward to your thoughts, ideas, or any issues you encounter!

Cheers, Roland Abou Younes

I wonder if you could make a workflow like this run well completely locally. I might try to build it. I recently saw a fully local, open-source tool that helps categorize your photos and screenshots… https://www.reddit.com/r/ollama/s/dFAnG5y5G8
Hey, this one can do it totally locally with llama3.2-vision, and the results are pretty decent. In this case we can build fully local workflows with Ollama pretty easily.
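For a fully local setup, you can confirm that Ollama is running and that llama3.2-vision has been pulled by querying Ollama's `/api/tags` endpoint, which lists the locally available models. Here's a minimal standard-library sketch. The helper names `has_model` and `local_model_ready` are hypothetical.

```python
import json
import urllib.request

TAGS_URL = "http://localhost:11434/api/tags"  # lists models pulled into the local Ollama


def has_model(tags: dict, wanted: str = "llama3.2-vision") -> bool:
    """True if `wanted` appears in an /api/tags response, with or without a :tag suffix."""
    for model in tags.get("models", []):
        name = model.get("name", "")
        if name == wanted or name.split(":", 1)[0] == wanted:
            return True
    return False


def local_model_ready(wanted: str = "llama3.2-vision") -> bool:
    """Query the local server; False if it isn't reachable (e.g. `ollama serve` not running)."""
    try:
        with urllib.request.urlopen(TAGS_URL, timeout=5) as resp:
            return has_model(json.loads(resp.read()), wanted)
    except OSError:  # URLError and socket timeouts are both OSError subclasses
        return False
```

Running this check before a batch job gives a clear early failure instead of a connection error halfway through a folder.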
It's a library on PyPI now, and in true Python fashion it's called pyvisionai: https://github.com/MDGrey33/pyvisionai. It features both a library and a CLI for quick access, plus a stable set of integration tests to guard against breaking changes.
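For the photo/screenshot use case above, here is a minimal sketch that runs the `describe-image` CLI over a whole folder. It uses only the `-i` flag shown in the post and rests on a few assumptions. First, it assumes the CLI prints its description to stdout. Second, the post doesn't say which model the CLI uses by default, so for a strictly local run, check the CLI's help for how to select llama3.2-vision. The image-suffix set and all helper names here are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}  # assumed set; adjust as needed


def is_image(path: Path) -> bool:
    """Case-insensitive suffix check against IMAGE_SUFFIXES."""
    return path.suffix.lower() in IMAGE_SUFFIXES


def build_describe_cmd(image: Path) -> list[str]:
    """argv for `describe-image`, using only the -i flag shown in the post."""
    return ["describe-image", "-i", str(image)]


def describe_folder(folder: Path) -> dict[Path, str]:
    """Describe every image under `folder`; assumes the CLI prints its result to stdout."""
    if shutil.which("describe-image") is None:
        raise RuntimeError("describe-image not found; install it with `pip install pyvisionai`")
    results: dict[Path, str] = {}
    for image in sorted(p for p in folder.rglob("*") if p.is_file() and is_image(p)):
        done = subprocess.run(build_describe_cmd(image), capture_output=True, text=True, check=True)
        results[image] = done.stdout.strip()
    return results
```

The returned mapping of path to description is a natural input for a categorization step, such as keyword matching or a second local LLM prompt.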