Hacker News
by cranberryturkey 693 days ago
how does this work?
3 comments

I do this with https://mitta.ai, using a Playwright container that calls back into a pipeline. The pipeline either uses metadata from the PDF or sends it to an EasyOCR deployment on a GPU instance on Google for text extraction. Then I use a custom chunker and instructor-xl embeddings.

All of that code is open source and works well for most sites. Some sites block Google IPs, but the Playwright container can run locally, so you should be able to work around that with minimal effort.
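
For anyone trying to picture the extraction and chunking steps described above, here is a minimal sketch. It is not the mitta.ai code: `extract_text`, `chunk_words`, the `min_chars` threshold, and the chunk sizes are all hypothetical names and values chosen for illustration, and the OCR step is passed in as a plain callable rather than a real EasyOCR call.

```python
from typing import Callable, List, Optional


def extract_text(embedded_text: Optional[str],
                 ocr: Callable[[], str],
                 min_chars: int = 20) -> str:
    """Prefer text already present in the PDF; fall back to OCR otherwise.

    `ocr` is a stand-in (assumption) for whatever calls the OCR service.
    """
    if embedded_text and len(embedded_text.strip()) >= min_chars:
        return embedded_text.strip()
    return ocr()


def chunk_words(text: str, size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into word windows of `size` that share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    text = extract_text(None, ocr=lambda: "text recovered by the OCR fallback")
    for chunk in chunk_words(text, size=4, overlap=1):
        print(chunk)
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, which tends to help retrieval later. Each chunk would then go to the embedding model.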

Give it URLs or domains, and it will crawl them, extract their content, embed it in a vector database, and give you an endpoint that you can query for RAG or semantic search.
But how does it work in the background? What's the tech stack?
In another comment:

> Tech stack is a mix of serverless Laravel, with Cloudflare and AWS functions, and some Pinecone for vector storage. Still experimenting on a few things but don't want to over-engineer unless I know where I'm going.
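
To make the "vector storage plus query endpoint" part concrete, here is a rough sketch of what such a store does when queried. It is an assumption-heavy stand-in, not the product's code and not the Pinecone API: `toy_embed` is a hashed bag-of-words placeholder for a real embedding model, and `TinyVectorIndex` is a hypothetical in-memory class.

```python
import hashlib
import math
from typing import List, Tuple

DIM = 256  # arbitrary vector size for this toy example


def toy_embed(text: str) -> List[float]:
    """Hash each token into a bucket and L2-normalize (placeholder embedding)."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


class TinyVectorIndex:
    """In-memory stand-in for a hosted vector database."""

    def __init__(self) -> None:
        self._rows: List[Tuple[str, List[float]]] = []

    def add(self, text: str) -> None:
        self._rows.append((text, toy_embed(text)))

    def query(self, question: str, top_k: int = 3) -> List[Tuple[float, str]]:
        """Return up to top_k (cosine similarity, text) pairs, best first."""
        q = toy_embed(question)
        scored = [(sum(a * b for a, b in zip(q, v)), text)
                  for text, v in self._rows]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


if __name__ == "__main__":
    index = TinyVectorIndex()
    index.add("pinecone stores vectors")
    index.add("laravel serves pages on cloudflare")
    for score, text in index.query("where are vectors stored", top_k=2):
        print(f"{score:.3f}  {text}")
```

In a RAG setup, the texts returned by `query` would be pasted into the prompt as context. A hosted store does the same nearest-neighbor lookup, only with approximate search so it scales past what a linear scan can handle.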

There are a few ways. I built something similar, huckai.com, on top of vectara.com. They have open-sourced their versions: https://github.com/vectara

You can also do this on AWS now fairly easily. https://medium.com/data-reply-it-datatech/how-to-build-a-cus...

The lablab.ai Discord community is a pretty good place to learn how this product category is evolving.