
Sorry if it sounded empty; there is a reason why I didn't include examples. I'm not really saying "don't use libraries", just that you should understand the problem first before reaching for an easy solution. To be honest, I've done all my scraping in PHP/Perl over the years. Only recently have I started to look into other options such as Python and Node.js (hence looking at this thread).

I don't claim that my scrapers are better off because they are written from scratch, but they do the job that I want them to do. If I find a target that has a "quirk", I write handling for it into my classes so it's covered then and in later instances. The real point of doing it this way is knowing what the scraper is doing, rather than what it might do. When you're scraping, you're walking a fine line. Targets may be fine with you scraping them, but as soon as your scraper freaks out and starts hammering the site, you're in trouble (even worse if you end up doing damage to the target).
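To make the "hammering" point concrete, here is a rough sketch in Python (not my actual PHP/Perl code) of a wrapper that keeps a minimum delay between requests and refuses to continue after a few consecutive failures. The names `PoliteFetcher` and `TooManyFailures` and the default numbers are made up for illustration; `fetch` is whatever function you already use to download a page.

```python
import time


class TooManyFailures(Exception):
    """Raised when the scraper should stop instead of trying again."""


class PoliteFetcher:
    """Hypothetical wrapper: spaces out requests and stops after repeated errors."""

    def __init__(self, fetch, min_delay=2.0, max_failures=3,
                 sleep=time.sleep, clock=time.monotonic):
        self.fetch = fetch                # callable taking a URL, returning the body
        self.min_delay = min_delay        # seconds between requests (assumed value)
        self.max_failures = max_failures  # consecutive errors tolerated (assumed value)
        self.sleep = sleep
        self.clock = clock
        self.failures = 0
        self.last_request = None

    def get(self, url):
        # Refuse to touch the target once the failure budget is used up.
        if self.failures >= self.max_failures:
            raise TooManyFailures(f"stopped after {self.failures} consecutive failures")

        # Wait until at least min_delay has passed since the previous request.
        if self.last_request is not None:
            wait = self.min_delay - (self.clock() - self.last_request)
            if wait > 0:
                self.sleep(wait)
        self.last_request = self.clock()

        try:
            body = self.fetch(url)
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # a success resets the counter
        return body
```

The counter only resets on success, so a target that keeps erroring gets left alone until a human looks at it and creates a fresh fetcher.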

I'm not saying that 3rd party libraries are prone to doing this; it's more that if you forget to set an option or handle an exception, you might screw yourself. If you wrote the scraper, it's your own fault for not handling the issue properly. If you used a 3rd party library and a bug in the library caused the issue, you can't really go after its authors, right?
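As an example of the "forgot an option, forgot an exception" trap, here is a small sketch using Python's standard `urllib`. The helper name `fetch_or_none` and the 10 second default are my own choices for illustration; the point is just that the timeout is set explicitly and the failure cases are handled rather than left to blow up somewhere else.

```python
import urllib.error
import urllib.request


def fetch_or_none(url, timeout=10):
    """Return the response body as bytes, or None if the request failed.

    The timeout is passed explicitly; without it, urlopen falls back to the
    global socket default, which is normally no timeout at all.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (404, 500, ...).
        print(f"HTTP {err.code} for {url}")
    except urllib.error.URLError as err:
        # DNS failure, refused connection, unsupported scheme, ...
        print(f"Could not reach {url}: {err.reason}")
    except TimeoutError:
        # A read that exceeded the timeout can surface as a bare TimeoutError.
        print(f"Timed out reading {url}")
    return None
```

Returning `None` is only one way to do it; the important part is that every failure path is one you chose.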

This all comes back to understanding your target, and to understand it, you need some knowledge of how it all works.

In response to your questions: I do a lot of things manually when setting up the scrapers. I don't import the data into any sort of DOM (I'm watching memory), and because of that I'm not really concerned about encoding (for the record, I'm generally dealing with UTF-8 and Shift_JIS only) or broken HTML. I do a general check over the source to see if the layout has changed. If it has, the scraper exits gracefully, sends me update notifications on what changed, then puts itself out of action until I reset it. If it's a mission-critical scraper, let's just say that I have a myriad of alerts that are sent to me. It's probably not the best way of doing things, but it works for me.
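Roughly, the layout check and the "out of action until I reset it" part could look like this in Python. This is a sketch of the idea, not my real code: the marker strings, the lock file, and the `run_once` function are all hypothetical, and `notify` stands in for whatever alerting you use.

```python
from pathlib import Path

# Hypothetical snippets of raw markup the scraper depends on. No DOM is built;
# this is a plain substring check on the page source.
EXPECTED_MARKERS = ('<div id="listing">', '<span class="price">')


def missing_markers(html, markers=EXPECTED_MARKERS):
    """Return the expected markers that no longer appear in the source."""
    return [m for m in markers if m not in html]


def run_once(html, lock_file, notify=print):
    """Check the layout before scraping; disable the scraper if it changed."""
    lock_file = Path(lock_file)

    # A previous run found a problem: stay out of action until the file is removed.
    if lock_file.exists():
        return "disabled"

    missing = missing_markers(html)
    if missing:
        # Record what changed, send the alert, and stop without scraping.
        lock_file.write_text("\n".join(missing), encoding="utf-8")
        notify(f"Layout changed, missing markers: {missing}")
        return "layout_changed"

    # ... the actual extraction would go here ...
    return "ok"
```

Resetting is just deleting the lock file once the scraper has been updated for the new layout.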

Sorry if I was vague; I probably should have put some sort of rant-detection on my mouth. If I didn't answer something specifically, it's not that I was ignoring it. It probably just fell into the "I don't trust it so I don't use it" category. Again, I'm not advocating that people shouldn't use 3rd party libraries, just that you should at least know what you are doing before you do.