Hacker News
by r3gal08 85 days ago
How are you handling the data extraction? Is it a multimodal VLM (OCR+LLM) or a standard OCR engine feeding a separate LLM? I’ve been hitting a wall trying to understand how this is viable. The compute overhead for real-time analysis at scale seems massive without a serious backend. How are you managing the capture frequency?
1 comment

Hi, while vision is going to be a part of it, it's hard to scale on both the server and the client (it's resource-intensive, and the battery drains faster on the client). We hook deeper into the OS layer with accessibility APIs, AppleScript, and other ways to get raw text. This also lets us build a privacy-friendly app with granular data controls for the user.

> compute overhead for real-time analysis at scale seems massive without a serious backend.

You're still right about this part, though, and we do have a serious backend.
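
For readers wondering what "raw text first, vision as a fallback" might look like, here is a minimal sketch. It is not the app's actual code. The names `run_applescript`, `first_text`, and `FRONT_APP_SCRIPT` are hypothetical, and the script shown only fetches the frontmost app's name as a stand-in for a real text query. The only real tool assumed is macOS's `osascript` command-line utility.

```python
import subprocess
import sys
from typing import Callable, Optional, Sequence

# Stand-in AppleScript (assumption): asks System Events for the name of the
# frontmost application. A real extractor would query window text instead.
FRONT_APP_SCRIPT = (
    'tell application "System Events" to get name of '
    "first application process whose frontmost is true"
)


def run_applescript(script: str, timeout: float = 2.0) -> Optional[str]:
    """Run a script through the osascript CLI; return None on any failure."""
    if sys.platform != "darwin":
        return None  # osascript only exists on macOS
    try:
        result = subprocess.run(
            ["osascript", "-e", script],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    text = result.stdout.strip()
    return text if result.returncode == 0 and text else None


def first_text(sources: Sequence[Callable[[], Optional[str]]]) -> Optional[str]:
    """Try each source in order and return the first non-empty text.

    Sources are expected to be ordered cheapest first, so an expensive
    fallback (for example OCR or a vision model) is never called when a
    cheaper source already produced text.
    """
    for source in sources:
        text = source()
        if text:
            return text
    return None
```

A caller would list cheap OS-level sources first and put any vision-based extractor last, e.g. `first_text([lambda: run_applescript(FRONT_APP_SCRIPT), ocr_fallback])`, where `ocr_fallback` is a placeholder for whatever vision path exists. Because the loop stops at the first hit, the costly path only runs when the cheap ones return nothing.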