by novaRom 447 days ago
> Built With: DocLayout-YOLO, Google Vision API, Gemini Pro Vision, MathPix OCR, OpenAI API, OpenCV, and more.

the whole pipeline is not open source

1 comment

Yep, some components currently rely on external APIs (e.g. OpenAI, MathPix), primarily for stability and ease of deployment during the early release. I’m planning to support fully local inference in the future to remove the dependency on API keys.

The local pipeline would include:

• Tesseract or TrOCR for general OCR

• Pix2Struct, Donut, or DocTR for document structure understanding

• OpenAI CLIP for image-text semantic alignment

• Gemma / Phi / LLaMA / Mistral for downstream reasoning tasks

The goal is to make the system fully self-hostable for offline and private use.
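To make the plan above concrete, here is a rough sketch of how a per-stage backend choice could be checked for "fully local" status. This is not the project's actual code: the stage names (`ocr`, `layout`, `alignment`, `reasoning`), the `PipelineConfig` class, and the lowercase backend labels are all hypothetical. Only the model names come from the list above.

```python
from dataclasses import dataclass
from typing import List

# Planned local backends per stage, taken from the list above.
# The stage names and lowercase labels are assumptions for illustration.
LOCAL_BACKENDS = {
    "ocr": {"tesseract", "trocr"},
    "layout": {"pix2struct", "donut", "doctr"},
    "alignment": {"clip"},
    "reasoning": {"gemma", "phi", "llama", "mistral"},
}


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical config: one backend label per pipeline stage."""
    ocr: str
    layout: str
    alignment: str
    reasoning: str

    def remote_stages(self) -> List[str]:
        """Return stages whose backend is not in the planned local set."""
        return [
            stage
            for stage, local_names in LOCAL_BACKENDS.items()
            if getattr(self, stage).lower() not in local_names
        ]

    def is_self_hostable(self) -> bool:
        """True when every stage uses one of the planned local backends."""
        return not self.remote_stages()


# A mixed setup: two stages still call external APIs.
mixed = PipelineConfig(ocr="mathpix", layout="doctr",
                       alignment="clip", reasoning="openai")

# A fully local setup, as described in the plan.
local = PipelineConfig(ocr="tesseract", layout="donut",
                       alignment="clip", reasoning="mistral")

print(mixed.remote_stages())
print(local.is_self_hostable())
```

A check like this could run at startup to refuse network calls in offline mode, but how the real pipeline wires its stages together is not stated in this thread.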