Basically, I took a start URL for the crawl, and my program would load the page in Firefox using thirtyfour, then extract all links from the page and apply some basic rules to keep track of which ones to visit and in which order. I had a Squid proxy configured to save all traffic that passed through it.
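A minimal sketch of what the "which ones to visit and in which order" bookkeeping could look like, assuming a plain breadth-first order and fragment-insensitive de-duplication. The `Frontier` type and its rules are hypothetical illustrations, not the original program's code, and the browser side (thirtyfour) is left out entirely:

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical crawl frontier: a FIFO queue of URLs still to visit,
/// plus a set of every URL that has ever been queued.
struct Frontier {
    queue: VecDeque<String>,
    seen: HashSet<String>,
}

impl Frontier {
    /// Start a crawl from a single start URL.
    fn new(start_url: &str) -> Self {
        let mut frontier = Frontier {
            queue: VecDeque::new(),
            seen: HashSet::new(),
        };
        frontier.push(start_url);
        frontier
    }

    /// Queue a URL unless it has been queued before.
    /// Returns true if the URL was added.
    fn push(&mut self, url: &str) -> bool {
        // Drop the fragment so "page#a" and "page#b" count as one page.
        let key = url.split('#').next().unwrap_or(url).to_string();
        if key.is_empty() || !self.seen.insert(key.clone()) {
            return false;
        }
        self.queue.push_back(key);
        true
    }

    /// Next URL to load, in the order the URLs were discovered.
    fn pop(&mut self) -> Option<String> {
        self.queue.pop_front()
    }
}
```

The crawl loop would then call `pop()`, load that URL in the browser, and `push()` every link found on the page until the queue is empty.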
It worked ok-ish. I only really stopped that project because of a hardware malfunction.
The main annoyance that I didn’t get around to solving was being smarter about not trying to load non-HTML content that had already been loaded as part of the page. Because of the way I extracted links from the page, I also extracted the URLs of JS, CSS, etc. that were referenced.
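One rough way to attack that problem is to skip URLs whose path ends in a typical sub-resource extension before queueing them. This is only a heuristic sketch: the function name and the extension list are assumptions, it expects absolute URLs, and it will miss assets served from extension-less paths:

```rust
/// Hypothetical list of extensions treated as page sub-resources.
const ASSET_EXTENSIONS: &[&str] = &[
    "js", "css", "png", "jpg", "jpeg", "gif", "svg", "ico", "woff", "woff2",
];

/// Returns true if the URL's path ends in a known asset extension.
fn looks_like_asset(url: &str) -> bool {
    // Ignore the query string and fragment.
    let url = url
        .split(|c: char| c == '?' || c == '#')
        .next()
        .unwrap_or(url);

    // Skip scheme and host so "example.js" as a hostname is not flagged.
    let path = match url.find("://") {
        Some(i) => {
            let rest = &url[i + 3..];
            rest.find('/').map_or("", |j| &rest[j..])
        }
        None => url,
    };

    // Compare the extension of the last path segment, case-insensitively.
    let last_segment = path.rsplit('/').next().unwrap_or("");
    match last_segment.rsplit_once('.') {
        Some((stem, ext)) if !stem.is_empty() => ASSET_EXTENSIONS
            .iter()
            .any(|known| known.eq_ignore_ascii_case(ext)),
        _ => false,
    }
}
```

A stricter alternative would be to collect only anchor `href` values when extracting links, so that script and stylesheet references never enter the queue in the first place.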