by loudthing
1481 days ago
The main perpetrator I've come across (in, I'll admit, my limited experience with scraping) was Google services. It seemed somewhat obvious to me initially, while playing around with Beautiful Soup, that the most valuable resource I could use for scraping was Google search, followed by Gmail and Google Finance. I had no luck using BS with any of these services. Largely this taught me to be creative with my data sources. For example, I built a virtual weather vane powered by a Raspberry Pi that would scrape my local airport's website to get wind direction data, then turn the vane via a servo to the correct direction. So my takeaway from this project was that scraping isn't as straightforward as one would think; there's more of an art to figuring out where to get the information you want.

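For anyone curious what the weather vane pipeline might look like, here is a rough sketch of the two pure steps: pulling a wind bearing out of a page and mapping it to a servo angle. Everything specific here is an assumption, not the original project's code: the `wind-dir` element id, the sample HTML, and the 180-degree servo geared 2:1 to the vane are all made up for illustration. It uses the standard-library `html.parser` rather than Beautiful Soup so that it runs without extra installs.

```python
import re
from html.parser import HTMLParser

# Hypothetical page fragment; a real airport site will be laid out differently.
SAMPLE_HTML = '<table><tr><td>Wind</td><td id="wind-dir">270 deg</td></tr></table>'


class WindParser(HTMLParser):
    """Collects the text inside the element whose id is "wind-dir" (assumed id)."""

    def __init__(self):
        super().__init__()
        self._capture = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("id") == "wind-dir":
            self._capture = True

    def handle_endtag(self, tag):
        # Stop capturing at the next closing tag.
        self._capture = False

    def handle_data(self, data):
        if self._capture:
            self.text += data


def wind_direction_from_html(html):
    """Return the wind bearing in degrees (0-359), or None if it is not found."""
    parser = WindParser()
    parser.feed(html)
    parser.close()
    match = re.search(r"\d{1,3}", parser.text)
    return int(match.group()) % 360 if match else None


def servo_angle(bearing_deg, servo_range=180):
    """Scale a 0-359 compass bearing onto a servo's travel (assumes 2:1 gearing)."""
    return round(bearing_deg / 360 * servo_range)


bearing = wind_direction_from_html(SAMPLE_HTML)
print(bearing, servo_angle(bearing))
```

Fetching the page and driving the servo pins are left out on purpose, since both depend on the particular site and hardware.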
Am I correct that the examples listed here are (a) www.google.com, (b) mail.google.com and (c) www.google.com/finance/? I have no trouble extracting data from these examples.[FN1] I do not use a graphical web browser to make HTTP requests, nor do I use Python or Beautiful Soup. A cookie is required for mail.google.com, in lieu of a password, but the cookie can be saved and will work for years.
1. Of course, Google Web Search is crippled. Using a basic HTTP client, i.e., one without cookies, JavaScript, FLoC, etc., one cannot retrieve more than 250-300 results total. Searching "too fast" will draw a temporary IP block. This "search engine" is designed for advertising, not discovery. Advertisers compete for space at the top of the first page of results. Popular websites are prioritised, potentially making them even more popular. Websites that "rank"[FN2] too low in a search are not discoverable, as they have no value for advertising. An index of public websites and public data is treated as proprietary and secret. Google actively tries to prevent anyone from copying even a small portion of it.
2. Google makes it impossible to sort results by URL, date, or even number of keyword/string hits in the page. Results are ordered according to a secret algorithm, designed for advertising.
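Since requesting "too fast" is what draws the temporary block, one common way to stay under whatever the limit is would be to enforce a minimum gap between requests. A minimal sketch follows; the `Throttle` class is a hypothetical helper, and the 5-second interval is a guess, as the actual threshold is not known here.

```python
import time


class Throttle:
    """Enforces a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the logic can be tested without real delays
        self._sleep = sleep
        self._last = None        # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to respect min_interval; return the delay used."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay


# Assumed spacing of 5 seconds; call wait() before each HTTP request.
throttle = Throttle(min_interval=5.0)
throttle.wait()  # the first call never sleeps
```

Using a monotonic clock avoids surprises if the system time is adjusted between requests.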