I do this with https://mitta.ai: a Playwright container does a callback to a pipeline, which either uses metadata from the PDF or sends it to an EasyOCR deployment on a GPU instance on Google for text extraction. Then I use a custom chunker and instructor-xl embeddings.
All of that code is open source and works well for most sites. Some sites block Google IPs, but the Playwright container can run locally, so you should be able to work around that with minimal effort.
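For anyone who wants a feel for the "use what the PDF already has, otherwise OCR, then chunk" flow described above, here is a minimal sketch. It is not the mitta.ai code: `extract_text`, `chunk_words`, the `min_chars` threshold, and the word-window sizes are all hypothetical names and values picked for illustration, and the OCR step is passed in as a plain callable rather than a real EasyOCR call.

```python
from typing import Callable, List


def extract_text(embedded_text: str, ocr: Callable[[], str], min_chars: int = 20) -> str:
    """Prefer text already present in the PDF; fall back to OCR otherwise.

    `ocr` is a hypothetical callable standing in for a remote OCR service.
    It is only invoked when the embedded text is too short to be useful.
    """
    text = embedded_text.strip()
    if len(text) >= min_chars:
        return text
    return ocr().strip()


def chunk_words(text: str, size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into windows of `size` words.

    Each window shares `overlap` words with the previous one, so a sentence
    cut at a boundary still appears whole in at least one chunk.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

Something like `chunk_words(extract_text(pdf_text, run_ocr))` would then feed the embedding step, where `pdf_text` and `run_ocr` are whatever your own extraction layer provides.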
Give it URLs or domains, and it will crawl them, extract their content, embed it in a vector database, and give you an endpoint you can query for RAG or semantic search.
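As a rough picture of what such a query endpoint does under the hood, here is a toy sketch of nearest-neighbour lookup by cosine similarity. The `cosine` and `top_k` helpers and the in-memory `index` dict are hypothetical stand-ins: a real setup would get vectors from an embedding model and store them in a vector database rather than a Python dict.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          index: Dict[str, Sequence[float]],
          k: int = 3) -> List[Tuple[str, float]]:
    """Return the k chunk ids most similar to the query, best match first."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-d "embeddings" keyed by chunk id, purely for illustration.
index = {"pricing": [1.0, 0.0], "careers": [0.0, 1.0], "plans": [1.0, 1.0]}
results = top_k([1.0, 0.0], index, k=2)
```

For RAG, the text of the returned chunk ids would be pasted into the prompt; for plain semantic search, the ranked ids are the answer.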
> Tech stack is a mix of serverless Laravel, with Cloudflare and AWS functions, and some Pinecone for vector storage. Still experimenting on a few things but don't want to over-engineer unless I know where I'm going.