by kysol 4843 days ago
Not to rain on the parade of this post (I'm in support of more people learning to scrape, and more services out there giving us easier to access data). I'm someone who loves web scraping, but I'm also someone who believes that if you don't know what the library is doing, you shouldn't be using it.

You can give a brief overview of how to use it and what to look for in the page to extract from, but you're giving a very simple cheat sheet to people that may not understand HTML (trust me they exist... unfortunately). As soon as your example breaks, or they reach a limitation with the library, they are going to throw their arms in the air and deem the library broken, or the task impossible to do because the example said it would work. The only reason I'm writing this is that I know of these sorts of people, I deal with them on a regular basis, and I have to explain to them every time to look at what they are doing on a lower level to get a better understanding of their problem to find the solution.

These sorts of people will stumble across this article after their bosses told them "We need to pull Company X's product information into our sales screens so that we can compare the competition's prices while making our price adjustments". Not having a clue how to do that, they will Google for it and find this article. With no experience, and a boss behind them, they will just blindly use it and pray that it works, but due to their inexperience with the subject at hand they will fail.

Sorry to be so negative, I just had to say that. It's the same as any other tutorial out there; it's just that scraping is something where I feel you need to know what you're doing before you do it.

Personally, I have written my own scrapers from scratch (or using libraries I have written over time to make certain aspects less painful) for years. I know, I know, there is a myriad of ready-to-go libraries out there that will do the same thing, and probably better, for me, but where's the challenge? Sure, if you're time restricted, then go forth and grab a library and start scraping, but please at least try to understand what you are doing at a lower level.

1 comments

I'm definitely not advocating for people not understanding the problem they want solved.

That said, your post sounds empty. Can you elaborate on why your own scrapers that you write from scratch make it all better? How do your scrapers deal with encoding detection, broken HTML, content prioritization and so forth?

I don't like the current options we've got in pythonland, but just writing: "this sucks, so I write my own" sounds like an ego trip. Can you describe in detail what BeautifulSoup (or lxml which is usually a better option) is doing wrong at the lower level and how your scripts are making it better?

Sorry if it sounded empty, there is a reason why I didn't include examples. I'm not really saying "don't use libraries", more just that you should understand the problem first before looking for an easy solution. To be honest, I've done all my scraping in PHP/Perl over the years. Only recently have I started to look into other options such as Python and NodeJS (hence looking at this thread).

I don't claim that my scrapers are better off because they are written from scratch, but they do the job that I want them to do. If I find a target that has a "quirk" I write that into my classes to be used then and in later instances. The real point of doing it this way is more about knowing what the scraper is doing, rather than what it might do. When you're scraping, you're walking a fine line. Targets may be fine with you doing it to them, but as soon as your scraper freaks out and starts hammering the site, you're in trouble (even worse if you end up doing damage to the target).
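To make the "hammering" failure mode concrete, here is a minimal Python sketch of one way to guard against it: cap the number of attempts and wait longer between each one. This is only an illustration, not the code described in this thread; `polite_fetch`, `GiveUp`, and the `fetch` callable are hypothetical names, and the delay values are arbitrary assumptions.

```python
import time


class GiveUp(Exception):
    """Raised when the scraper stops retrying instead of hammering the target."""


def polite_fetch(fetch, url, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying a bounded number of times with growing pauses.

    fetch is any callable that returns the page source or raises on failure
    (hypothetical; plug in whatever HTTP client you actually use).
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception as exc:
            last_error = exc
            # Back off exponentially, but do not sleep after the final attempt.
            if attempt < max_attempts - 1:
                sleep(base_delay * (2 ** attempt))
    # Stop completely rather than looping forever against a struggling site.
    raise GiveUp(f"gave up on {url} after {max_attempts} attempts") from last_error
```

The point of the hard cap is that a misbehaving run fails loudly once, instead of retrying in a tight loop against the target.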

I'm not saying that 3rd party libraries are prone to doing this, more so if you forget to set an option or handle an exception, you might screw yourself. If you wrote the scraper it's your own fault for not handling the issue properly. If you used a 3rd party library and the library bugged out causing the issue, you can't really go after the writers, right?

This all comes back to understanding your target, and to understand them, you need some form of knowledge on how it all works.

In response to your questions - I do a lot of things manually when setting up the scrapers. I don't import the data into any sort of DOM (to keep memory usage down), and because of that I'm not really concerned about encoding (for the record I'm generally dealing with UTF-8 and Shift_JIS only) or broken HTML (I do a general check over the source to see if the layout has changed. If it has, the scraper exits gracefully, sends me update notifications on what changed, then puts itself out of action until I reset it. If it's a mission-critical scraper, let's just say that I have a myriad of alerts that are sent to me). It's probably not the best way of doing things but it works for me.
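For anyone wondering what a check like that could look like without a DOM, here is a rough Python sketch under my own assumptions (the scrapers described above are PHP/Perl and were not shown). It looks for expected marker strings in the raw source, and if any are missing it writes a lock file, sends one notification, and refuses to run again until the lock is deleted. The marker strings, `run_scraper`, `notify`, and `extract` are all hypothetical.

```python
from pathlib import Path

# Hypothetical strings that must appear in the raw page source if the
# layout is unchanged. Real markers depend entirely on the target.
EXPECTED_MARKERS = ['<div id="product-list">', '<span class="price">']


def missing_markers(source, markers=EXPECTED_MARKERS):
    """Return the expected markers that are absent from the raw source."""
    return [marker for marker in markers if marker not in source]


def run_scraper(source, lock_file, notify, extract):
    """Extract data from source, or disable the scraper if the layout changed.

    notify and extract are caller-supplied callables (assumptions here):
    notify(message) sends an alert, extract(source) does the real parsing.
    """
    lock = Path(lock_file)
    if lock.exists():
        # Out of action until someone inspects the target and removes the lock.
        return None
    missing = missing_markers(source)
    if missing:
        # Record what changed, alert once, and stop without touching the data.
        lock.write_text("\n".join(missing), encoding="utf-8")
        notify(f"layout changed, missing markers: {missing}")
        return None
    return extract(source)
```

Plain substring checks are crude, but they keep memory use low and fail closed: a layout change stops the scraper instead of letting it extract garbage.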

Sorry if I was vague, I probably should have put some sort of rant-detection on my mouth. If I didn't answer something specifically, it's not that I was ignoring it, it probably just fell into the "I don't trust it so I don't use it" category. Again, not advocating that people shouldn't use 3rd party libraries, just that you should at least know what you are doing before you do.