|
Not to rain on the parade of this post: I'm in support of more people learning to scrape, and of more services giving us easier-to-access data. I'm someone who loves web scraping, but I'm also someone who believes that if you don't know what the library is doing, you shouldn't be using it. You can give a brief overview of how to use it and what to look for in the page to extract from, but you're giving a very simple cheat sheet to people who may not understand HTML (trust me, they exist... unfortunately). As soon as your example breaks, or they hit a limitation of the library, they are going to throw their arms in the air and declare the library broken, or the task impossible, because the example said it would work.
The only reason I'm writing this is that I know these sorts of people. I deal with them on a regular basis, and every time I have to tell them to look at what they are doing at a lower level, so they understand their problem well enough to find the solution. These sorts of people will stumble across this article after their boss tells them, "We need to pull Company X's product information into our sales screens so that we can compare the competition's prices while making our price adjustments." Not having a clue how to do that, they will Google for it and find this article. With no experience, and a boss behind them, they will just blindly use it and pray that it works, but because of their inexperience with the subject they will fail.
Sorry to be so negative, I just had to say that. It's the same with any other tutorial out there; it's just that scraping is something where I feel you need to know what you're doing before you do it. Personally, I have written my own scrapers from scratch (or using libraries I have written over time to make certain aspects less painful) for years.
I know, I know, there is a myriad of ready-to-go libraries out there that will do the same thing, and probably better than I can, but where's the challenge? Sure, if you're pressed for time, then go forth, grab a library, and start scraping, but please at least try to understand what you are doing at a lower level. |
That said, your post sounds empty. Can you elaborate on why the scrapers you write from scratch make it all better? How do your scrapers deal with encoding detection, broken HTML, content prioritization, and so forth?
I don't like the current options we've got in Python land either, but just writing "this sucks, so I write my own" sounds like an ego trip. Can you describe in detail what BeautifulSoup (or lxml, which is usually a better option) is doing wrong at the lower level, and how your scripts do it better?
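To make the question concrete, here is a rough sketch of what two of those concerns (encoding detection and broken HTML) look like when you only use the standard library. The names `decode_html` and `PriceExtractor`, and the `class="price"` markup, are made up for illustration; this is not how BeautifulSoup or lxml work internally, and the fallback order of encodings is just an assumption.

```python
from html.parser import HTMLParser


def decode_html(raw, declared=None):
    """Decode bytes: try the declared charset first, then common fallbacks.

    The fallback order (utf-8, then cp1252) is an assumption for this sketch.
    """
    for encoding in (declared, "utf-8", "cp1252"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (LookupError, UnicodeDecodeError):
            continue
    # latin-1 maps every possible byte, so this last resort never raises.
    return raw.decode("latin-1")


class PriceExtractor(HTMLParser):
    """Collect the text directly inside elements with class="price".

    Naive on purpose: any new start tag or any end tag ends the current
    price, so unclosed tags are tolerated but nested markup is not.
    """

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        self._in_price = "price" in classes

    def handle_endtag(self, tag):
        self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())


# Hypothetical input: cp1252 bytes, unquoted attribute, unclosed <li> and <span>.
raw = b"<ul><li><span class=price>\xa39.99<li><span class='price'>\xa312.50</span></ul>"

extractor = PriceExtractor()
extractor.feed(decode_html(raw))
extractor.close()
print(extractor.prices)
```

Even this toy version has to guess the encoding and decide what an unclosed tag means, which is exactly the work the libraries already do. So I'd like to hear which of those decisions they get wrong.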