Yep — some components currently rely on external APIs (e.g. OpenAI, MathPix),
primarily for stability and ease of deployment during early release.
But I’m planning to support fully local inference in the future to eliminate API key dependency.
The local pipeline would include:
• Tesseract or TrOCR for general OCR
• Pix2Struct, Donut, or DocTR for document structure understanding
The local pipeline would include:
• Tesseract or TrOCR for general OCR
• Pix2Struct, Donut, or DocTR for document structure understanding
• OpenAI CLIP for image-text semantic alignment
• Gemma / Phi / LLaMA / Mistral for downstream reasoning tasks
Goal is to make the system fully self-hostable for offline and private use.