by simplecto
648 days ago
One of my projects is a virtual agency of multiple LLMs for a variety of back-office services (copywriting, copy-editing, social media, job ads, etc.). We ingest your data wherever you point our crawlers and then clean it for use in RAG pipelines or chained LLMs. One library we like a lot is Trafilatura [1]. It does a great job of taking a full HTML page and returning the most semantically relevant parts. It works well for LLM work as well as for generating embeddings for vector search and other downstream uses.

[1] https://trafilatura.readthedocs.io/en/latest/
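For anyone who wants to try the cleaning step described above, here is a minimal sketch. It is not the commenter's actual pipeline. It assumes Trafilatura is installed (`pip install trafilatura`) and calls `trafilatura.extract` on raw HTML; the helper names `clean_html` and `strip_tags_fallback` are made up for illustration, and the fallback is a crude standard-library tag stripper, not a replacement for what Trafilatura does.

```python
from html.parser import HTMLParser

try:
    import trafilatura  # third-party: pip install trafilatura
except ImportError:
    trafilatura = None  # keep the sketch runnable without the library


class _TextOnly(HTMLParser):
    """Crude stand-in: collects text, skipping obvious non-content tags."""

    SKIP = {"script", "style", "nav", "header", "footer"}

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many skipped tags we are currently inside
        self.chunks = []  # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def strip_tags_fallback(html):
    """Hypothetical helper: naive text extraction using only the stdlib."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def clean_html(html):
    """Hypothetical helper: prefer Trafilatura, fall back to the naive stripper."""
    if trafilatura is not None:
        text = trafilatura.extract(html)  # returns None if nothing is extracted
        if text:
            return text
    return strip_tags_fallback(html)
```

To run it against a live page, `trafilatura.fetch_url(url)` can download the HTML first, and the cleaned text from `clean_html` can then go to whatever chunking, embedding, or prompting step comes next.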
|
I use it nearly hourly for my HN summarizer HackYourNews (https://hackyournews.com).