by loudthing 1481 days ago
The main perpetrator I've come across (in, I'll admit, my limited experience with scraping) was Google services. While playing around with Beautiful Soup, it seemed obvious to me at first that the most valuable resource I could scrape was Google search, followed by Gmail and Google Finance. I had no luck using BS with any of these services.

Largely this taught me to be creative with my data sources. For example, I built a virtual weather vane powered by a Raspberry Pi that would scrape my local airport's website to get wind direction data, then turn the vane via a servo to the correct direction. So my takeaway from this project was that scraping isn't as straightforward as one would think; there's more of an art to figuring out where to get the information you want.
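To make the weather vane idea concrete, here is a rough Python sketch of the "wind direction in, servo angle out" step. It assumes the airport page exposes a METAR-style wind group such as "27015KT" (I don't know what the actual page looked like), and the function names and the 180-degree servo range are made up for illustration. Fetching the page and driving the servo are left out.

```python
import re
from typing import Optional

# Assumption: the scraped page text contains a METAR-style wind group,
# e.g. "27015KT" (wind from 270 degrees at 15 knots) or "27015G25KT" (with gusts).
WIND_GROUP = re.compile(r"\b(\d{3})(\d{2,3})(?:G\d{2,3})?KT\b")


def parse_wind_direction(page_text: str) -> Optional[int]:
    """Return the wind direction in degrees, or None if no wind group is found.

    Variable winds ("VRB05KT") have no single direction and yield None.
    """
    match = WIND_GROUP.search(page_text)
    if match is None:
        return None
    return int(match.group(1)) % 360


def servo_angle(direction_deg: int, servo_range_deg: int = 180) -> float:
    """Scale a 0-359 degree compass direction onto a hypothetical servo range."""
    return (direction_deg % 360) * servo_range_deg / 360
```

With a standard hobby servo that only sweeps 180 degrees, the vane would need 2:1 gearing to cover the full compass, which is why the direction is scaled rather than passed through directly.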

1 comment

Agreed. Most public websites are not trying to force visitors to use a certain web browser by selectively denying access. No public website should be the sole, exclusive source of its data because if the data is public then by definition the data can be copied by any other website. As such, chances are the data can be found in multiple locations and at least one will not be trying to force the use of a certain web browser.

Am I correct that the examples listed here are (a) www.google.com, (b) mail.google.com and (c) www.google.com/finance/? I have no trouble extracting data from these examples.[FN1] I do not use a graphical web browser to make HTTP requests, nor do I use Python or Beautiful Soup. A cookie is required for mail.google.com, in lieu of a password, but the cookie can be saved and will work for years.
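For anyone who does want to try the saved-cookie approach from Python, a minimal sketch with only the standard library might look like the following. This is not the tooling described above; the cookie string is a placeholder for whatever was saved from a logged-in session, the User-Agent value is invented, and the function only builds the request without sending it.

```python
import urllib.request


def build_request(url: str, cookie_header: str) -> urllib.request.Request:
    """Build a GET request that carries a previously saved Cookie header.

    cookie_header is the raw "name=value; name2=value2" string; which
    cookies a given site actually requires is not specified here.
    """
    headers = {
        "Cookie": cookie_header.strip(),
        "User-Agent": "example-client/0.1",  # hypothetical identifier
    }
    return urllib.request.Request(url, headers=headers)


# Usage (not executed here):
#   req = build_request("https://mail.google.com/", saved_cookie_string)
#   with urllib.request.urlopen(req) as resp:
#       html = resp.read()
```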

1. Of course, Google Web Search is crippled. Using a basic HTTP client, e.g., no cookies, JavaScript, FLoC, etc., one cannot retrieve more than 250-300 results total. Searching "too fast" will draw a temporary IP block. This "search engine" is designed for advertising, not discovery. Advertisers compete for space at the top of the first page of results. Popular websites are prioritised, potentially making them even more popular. Websites that "rank"[FN2] too low in a search are not discoverable as they have no value for advertising. An index of public websites and public data is treated as proprietary and secret. Google actively tries to prevent anyone from copying even a small portion of it.

2. Google makes it impossible to sort results by URL, date, or even number of keyword/string hits in the page. Results are ordered according to a secret algorithm, designed for advertising.

2. Or sort by <title>.
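Once results have been retrieved, the kinds of ordering mentioned in the footnotes can be done locally. A rough Python sketch, assuming each result has already been reduced to a URL, a title and the page text (the `Result` type and function names are hypothetical, and sorting by date is omitted because it would need a date field the page may not provide):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Result:
    url: str
    title: str
    text: str  # page body, used only for counting keyword hits


def keyword_hits(result: Result, keyword: str) -> int:
    """Count case-insensitive occurrences of keyword in the page text."""
    return result.text.lower().count(keyword.lower())


def sort_results(results: List[Result], key: str = "url", keyword: str = "") -> List[Result]:
    """Return a new list ordered by URL, by title, or by keyword hits (most first)."""
    if key == "url":
        return sorted(results, key=lambda r: r.url)
    if key == "title":
        return sorted(results, key=lambda r: r.title.lower())
    if key == "hits":
        return sorted(results, key=lambda r: keyword_hits(r, keyword), reverse=True)
    raise ValueError(f"unknown sort key: {key}")
```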