by benhoff 460 days ago
I used this recently to download websites. I stuffed the pages into a SQLite db, processed them with Mozilla's Readability library, and then used the result with an LLM to ask questions about the pages themselves.

It was helpful to take each step in chunks, as I didn't have a complete processing pipeline when I started.
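For anyone curious what "each step in chunks" can look like, here is a minimal sketch, not my actual code. It assumes a single hypothetical `pages` table where raw HTML is stored first and the extracted text is filled in later, so each stage can be rerun on its own. `crude_text` is only a stand-in for a Readability-style extractor; all table, column, and function names are made up for illustration.

```python
import sqlite3
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Crude stand-in for a Readability-style extractor: keeps visible text only."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def crude_text(html):
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def init_db(conn):
    # Hypothetical schema: content stays NULL until the extraction stage runs.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, html TEXT NOT NULL, content TEXT)"
    )


def store_page(conn, url, html):
    # Stage 1: save raw HTML; re-storing a URL resets its extracted text.
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, html, content) VALUES (?, ?, NULL)",
        (url, html),
    )
    conn.commit()


def extract_pending(conn, extractor=crude_text):
    # Stage 2: only touch rows that have not been processed yet.
    rows = conn.execute(
        "SELECT url, html FROM pages WHERE content IS NULL"
    ).fetchall()
    for url, html in rows:
        conn.execute(
            "UPDATE pages SET content = ? WHERE url = ?", (extractor(html), url)
        )
    conn.commit()
    return len(rows)
```

Because the extraction stage only looks at rows where `content` is NULL, it can be run again after each new batch of downloads without redoing earlier work. A later LLM stage could follow the same pattern with its own nullable column.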

I did wonder whether there was an easier or better way to do this: I would probably have preferred to get the sitemap, pass it to an LLM, and then download only the selected HTML pages rather than the entire website.
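A rough sketch of the sitemap idea, assuming the site publishes a standard sitemaps.org XML file. `sitemap_urls` and `choose_urls` are hypothetical names, and `chooser` is a placeholder for whatever LLM call would pick the pages; no real LLM API is shown here.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org 0.9 schema.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text):
    """Return every <loc> value from a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(SITEMAP_NS + "loc") if el.text]


def choose_urls(urls, chooser):
    """Keep only the URLs the chooser returns, ignoring any it invents."""
    known = set(urls)
    return [u for u in chooser(urls) if u in known]
```

Filtering the chooser's answer against the known list is a cheap guard in case the model returns URLs that were never in the sitemap.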

1 comment

But the sitemap could be incomplete, couldn't it?
True, I guess that's the advantage of HTTrack.

I guess for my use case, it would be better to reuse the parsing that HTTrack does, get all the URLs, and pass them to an LLM to selectively grab files.
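I don't know of a way to get just the URL list out of HTTrack itself, so as a rough approximation of that "get all the URLs" step, here is a sketch that collects same-host links from one page using only the Python standard library. This is not HTTrack's parser, it only looks at `<a href>` attributes, and `page_links` is a name I made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse


class _Links(HTMLParser):
    """Collects raw href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def page_links(base_url, html):
    """Return absolute same-host http(s) links in one page, de-duplicated in order."""
    parser = _Links()
    parser.feed(html)
    parser.close()
    host = urlparse(base_url).netloc
    seen, out = set(), []
    for href in parser.hrefs:
        # Resolve relative links and drop #fragments so duplicates collapse.
        url, _fragment = urldefrag(urljoin(base_url, href))
        parts = urlparse(url)
        if parts.scheme in ("http", "https") and parts.netloc == host and url not in seen:
            seen.add(url)
            out.append(url)
    return out
```

Running this over pages as they are discovered would give a URL list that does not depend on the sitemap being complete, and that list could then go to the model for selection.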