| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by simplecto 648 days ago

One of my projects is a virtual agency of multiple LLMs for a variety of back-office services (copywriting, copy-editing, social media, job ads, etc).

We ingest your data wherever you point our crawlers and then clean it for work working in RAGs or chained LLMs.

One library we like a lot is Trafilatura [1]. It does a great job of taking the full HTML page and returning the most semantically relevant parts.

It works well for LLM work as well as generating embeddings for vectors and downstream things.

[1] - https://trafilatura.readthedocs.io/en/latest/

2 comments

ukuina 648 days ago

+1 for Trafilatura. Simple, no fuss, and rarely breaks.

I use it nearly hourly for my HN summarizer HackYourNews (https://hackyournews.com).

link

nickpsecurity 648 days ago

The paper is good, too, for understanding how it works. The author also mentions many related tools in it.

https://aclanthology.org/2021.acl-demo.15/

link