| Hi HN,
We launched OCR search as part of our document search engine. To demonstrate the OCR capability, we scanned lease documents available from the General Services Administration (GSA) and performed OCR on all lease documents available for Washington state. The documents are then indexed and loaded into our search engine. We can perform both regular and zonal OCR on documents; this demo does not use zonal OCR. We can perform OCR on thousands of documents within one hour. The system is scalable: all it needs is more workers for OCR and indexing. Check out the demo and the fully functional search engine at:
https://ocr-search.joyspace.ai. We have created a YouTube video (https://www.youtube.com/watch?v=7EG9TPysBpU) that explains this in more detail. We see high OCR accuracy. You can search for any word, number, or obscure character, and you can also search table data within documents. While this demo is for scanned documents, we support HTML, PDF (regular and scanned), RTF, DOC, DOCX, CSV, Excel, JSON, and other standard document formats for search. You can get access to our APIs if you are interested in building a search experience into your application. We have highly available APIs for the search engine. You can visit https://www.joyspace.ai to get access to our search engine. We manage the indexing and search pipelines for you. Happy to answer any questions about OCR and search. |
I have done some work on PDFs before, and I know that extracting information from PDFs is hard.
Kudos to you for building a search for scanned PDFs.
Do I have to manage chunking for the search engine?
You mentioned APIs. Do you support multiple clouds? For example, I have some data in Dropbox, S3, GDrive, and R2. Will I be able to connect all of these clouds?
Can you tell me more about data security?
Either way, it looks impressive for data engineering and ML pipelines.