| HN Mirror



	by cranberryturkey 693 days ago
	how does this work?

3 comments

kordlessagain 692 days ago

I do this with https://mitta.ai by using a Playwright container that calls back to a pipeline, which either uses metadata from the PDF or sends it to an EasyOCR deployment on a GPU instance on Google for text extraction. Then I use a custom chunker and instructor/xl embeddings.

All of that code is open source, and it works well for most sites. Some sites block Google IPs, but the Playwright container can run locally, so it should be possible to work around that with minimal effort.
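
For anyone wondering what the chunking step might look like, here is a minimal sketch, assuming a simple overlapping word window. This is not the mitta.ai chunker (which is custom and not described here); the function name `chunk_text` and the `max_words` and `overlap` parameters are made up for illustration.

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 40) -> list[str]:
    """Split extracted text into overlapping word windows.

    Hypothetical helper: a stand-in for a 'custom chunker', not the real one.
    Overlap keeps some shared context between neighbouring chunks so a
    sentence cut at a boundary still appears whole in one of them.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")

    words = text.split()
    step = max_words - overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks
```

Each chunk would then be passed to whatever embedding model you use. Sensible sizes depend on the model's input limit, so treat the defaults above as placeholders.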


tompec 693 days ago

Give it URLs or domains, and it will crawl and extract their content, embed them in a vector database, and give you an endpoint that you can then query when doing RAG stuff or semantic search.
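
To make the embed-then-query idea concrete, here is a toy sketch, not the actual service. It assumes word counts as a stand-in for a real embedding model and an in-memory list as a stand-in for a vector database; `embed`, `cosine`, and `TinyIndex` are hypothetical names, and the crawling step is skipped entirely (page text is passed in directly).

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a real system would call a model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TinyIndex:
    """In-memory stand-in for a vector database."""

    def __init__(self) -> None:
        self._rows: list[tuple[str, str, Counter]] = []

    def add(self, url: str, text: str) -> None:
        # Store the source URL, the raw text, and its vector together.
        self._rows.append((url, text, embed(text)))

    def query(self, question: str, top_k: int = 3) -> list[tuple[str, str]]:
        # Score every stored row against the question, best match first.
        q = embed(question)
        scored = [(cosine(q, vec), url, text) for url, text, vec in self._rows]
        scored.sort(key=lambda row: row[0], reverse=True)
        return [(url, text) for score, url, text in scored[:top_k] if score > 0]


index = TinyIndex()
index.add("https://example.com/a", "Pinecone is a vector database for embeddings")
index.add("https://example.com/b", "Laravel is a PHP web framework")
results = index.query("which vector database")
```

In a RAG setup, the texts returned by `query` would be pasted into the prompt as context. Swapping the word counts for real embeddings changes the quality of the matches, not the overall shape of the flow.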


xiconfjs 693 days ago

But how does it work in the background? What's the tech stack?


ramon156 693 days ago

In another comment:

> Tech stack is a mix of serverless Laravel, with Cloudflare and AWS functions, and some Pinecone for vector storage. Still experimenting on a few things but don't want to over-engineer unless I know where I'm going.


sunir 690 days ago

There are a few ways. I built something similar, huckai.com, on top of vectara.com. They have open-sourced their versions: https://github.com/vectara

You can also do this on AWS now fairly easily. https://medium.com/data-reply-it-datatech/how-to-build-a-cus...

The lablab.ai Discord community is a pretty good place to learn how this product category is evolving.
