| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by kordlessagain 692 days ago
	I do this with https://mitta.ai by using a Playwright container that does a callback to a pipeline that uses either meta data from the PDF or sends it to an EasyOCR deployment on a GPU instance on Google for text extraction. Then I use a custom chunker and instructor/xl embeddings. All of that code is Open Source, and works well for most sites. Some sites block Google IPs, but the Playwright container can run locally, so should be able to work around it with some minimal effort.