Hacker News
by roenxi 440 days ago
I've written some unfathomably bad web crawlers in the past. Indeed, web crawlers might be the most natural magnet for bad coding and eye-twitchingly questionable architectural practices I know of. While it likely isn't the major factor here, I can attest that there are coders who see pages-articles-multistream.xml.bz2 and then reach for a wget + HTML parser combo.

If you don't live and breathe Wikipedia, it is going to soak up a lot of time figuring out Wikipedia's XML format and markup language, not to mention re-learning how to parse XML. HTTP requests and bashing through the HTML are all everyday web skills and familiar scripting that is more reflexive and well understood. The right way would probably be much easier, but figuring it out will take too long.
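For what it's worth, the XML side can be smaller than it sounds. A minimal sketch using only the Python standard library, assuming the dump follows the usual MediaWiki export layout (a `<page>` element holding a `<title>` and a `<revision>` with a `<text>` child). The function names are made up for illustration, and the namespace is stripped rather than hard-coded because the export schema version varies:

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream."""
    title, text = None, None
    # iterparse reads incrementally, so the whole dump never sits in memory.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # drop the finished page's children to keep memory low


def iter_dump(path):
    """Open a .xml.bz2 dump (multi-stream is fine) and yield its pages."""
    with bz2.open(path, "rb") as stream:
        yield from iter_pages(stream)
```

Something like `for title, text in iter_dump("pages-articles-multistream.xml.bz2"): ...` would then walk the dump. The `text` you get back is still wiki markup, so the "learn the markup language" cost stays.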

Although that is all pre-ChatGPT logic. Now I'd start by asking it to solve my problem.

3 comments

You don't even need to deal with any XML formats or anything. They publish a complete dataset on Hugging Face that takes just a few lines to load in your Python training script:

https://huggingface.co/datasets/wikimedia/wikipedia
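Roughly what those few lines could look like, as a sketch rather than the official recipe. It assumes the third-party `datasets` package is installed, and the config name `20231101.en` is an assumption about which snapshot is listed on the dataset card, so check the card for current ones. The wrapper function name is made up:

```python
def iter_wikipedia_articles(config="20231101.en", limit=3):
    """Yield (title, text) for the first `limit` articles of one snapshot."""
    # Imported lazily so this file loads even without `pip install datasets`.
    from datasets import load_dataset

    # streaming=True iterates over the data instead of downloading all of it.
    dataset = load_dataset(
        "wikimedia/wikipedia", config, split="train", streaming=True
    )
    for index, row in enumerate(dataset):
        if index >= limit:
            break
        yield row["title"], row["text"]
```

Calling `for title, text in iter_wikipedia_articles(): print(title)` needs network access the first time, but it hits Hugging Face's hosting rather than Wikipedia's web servers.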

To be a "good" web crawler, you have to go beyond "not bad coding". If you just write the natural "fetch page, fetch next page, retry if it fails" loop, notably missing any sort of wait between fetches, so that you fetch as quickly as possible, you are already a pest. You don't even need multiple threads or machines to be a pest; a single machine on a home connection fetching pages as quickly as it can is already a pest to a website with heavy backend computation or DB demands. Do an equally naive "run on a couple dozen threads" upgrade to your code and you expand the blast radius of your pestilence out to even more web sites.
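The smallest fix to that natural loop is a wait between fetches. A minimal sketch using only the standard library; the one-second default is an arbitrary assumption, not a recommended value, and the function names are made up. The `fetch`, `sleep`, and `clock` parameters exist only so the loop can be exercised without touching the network:

```python
import time
import urllib.request


def fetch_page(url, timeout=10.0):
    """Fetch one URL and return its body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def crawl_politely(urls, delay=1.0, fetch=fetch_page,
                   sleep=time.sleep, clock=time.monotonic):
    """Fetch URLs one at a time, leaving `delay` seconds between requests."""
    results = {}
    next_allowed = clock()
    for url in urls:
        wait = next_allowed - clock()
        if wait > 0:
            sleep(wait)
        try:
            results[url] = fetch(url)
        except OSError as exc:
            # Record the failure; do not retry immediately in a tight loop.
            results[url] = exc
        # The gap is measured from the end of the request, so slow
        # responses do not shorten the pause the server gets.
        next_allowed = clock() + delay
    return results
```

This is single-threaded on purpose. Spreading it over a couple dozen threads without a shared per-host limit brings back the problem described above.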

Being a truly good web crawler takes a lot of work, and being a polite web crawler takes yet more different work.

And then, of course, you add the bad coding practices on top of it, ignoring robots.txt or using robots.txt as a list of URLs to scrape (which can be either deliberate or accidental), hammering the same pages over and over, preferentially "retrying" the very pages that are timing out because you found the page that locks the DB for 30 seconds in a hard query that even the website owners themselves didn't know was possible until you showed them by taking down the rest of their site in the process... it just goes downhill from there. Being "not bad" is already not good enough and there's plenty of "bad" out there.
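Two of those failure modes are cheap to avoid. A sketch, assuming a made-up user agent name and made-up backoff numbers: `urllib.robotparser` from the standard library answers "may I fetch this?", and a failing URL gets retried later and less often instead of sooner and more often:

```python
import urllib.robotparser

USER_AGENT = "example-crawler"  # hypothetical name, pick your own


def load_rules(robots_txt):
    """Build a rule set from the text of a robots.txt file."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser, url, user_agent=USER_AGENT):
    """True if robots.txt permits this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


def retry_delay(failures, base=5.0, cap=3600.0, max_failures=4):
    """Seconds to wait before retrying a URL that has failed `failures` times.

    Returns None once the URL has used up its retry budget, so a page
    that keeps timing out is dropped rather than hammered.
    """
    if failures >= max_failures:
        return None
    return min(cap, base * 2 ** failures)
```

The parser also exposes `crawl_delay(user_agent)`, which returns the site's `Crawl-delay` value when one is declared and `None` otherwise. Treating the Disallow entries as a to-do list is exactly the opposite of what `allowed` is for.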

I think most crawlers inevitably tend to turn into spaghetti code because of the number of weird corner cases you need to deal with.

Crawlers are also incredibly difficult to test in a comprehensive way. No matter what test scenarios you come up with, there are a hundred more weird cases in the wild. (e.g. there's a world of difference between a server taking a long time to respond to a request, and a server sending headers quickly but taking a long time to send the body)
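The slow-body case is a good example of why. A socket timeout applies to each individual read, so a server that trickles a few bytes at a time may never trip it. One way to handle that, sketched here with made-up names and arbitrary limits, is an overall deadline on reading the body. It assumes the response object has a `read1` method, which `http.client.HTTPResponse` and `io.BytesIO` both do:

```python
import time


class BodyTooSlow(Exception):
    """Raised when a response body is not finished before the deadline."""


def read_body(resp, deadline_seconds=30.0, chunk_size=65536,
              clock=time.monotonic):
    """Read a response body, giving up if it takes too long overall."""
    start = clock()
    chunks = []
    while True:
        # read1 returns whatever one underlying read provides, so the
        # deadline below is checked often even when bytes arrive slowly.
        chunk = resp.read1(chunk_size)
        if not chunk:
            return b"".join(chunks)
        chunks.append(chunk)
        if clock() - start > deadline_seconds:
            raise BodyTooSlow(f"body not finished after {deadline_seconds}s")
```

The deadline is only checked between reads, so each single read still relies on the ordinary socket timeout. Passing in a fake `clock` makes this particular corner case testable without a deliberately slow server.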

I thrive on these kinds of moving-target challenges. But nobody will hire.
You'd probably ask ChatGPT to write you a crawler for Wikipedia, without thinking to ask whether there's a better way to get Wikipedia info. So that download would be missed, because how and what we ask AI remains very important. This is not new: googling skills were already known to be important, and even philosophers recognized that asking good questions is crucial.