by 1vuio0pswjnm7 1481 days ago
Agreed. Most public websites are not trying to force visitors to use a certain web browser by selectively denying access. No public website should be the sole, exclusive source of its data because if the data is public then by definition the data can be copied by any other website. As such, chances are the data can be found in multiple locations and at least one will not be trying to force the use of a certain web browser.

Am I correct that the examples listed here are (a) www.google.com, (b) mail.google.com and (c) www.google.com/finance/? I have no trouble extracting data from these examples.[FN1] I do not use a graphical web browser to make HTTP requests, nor do I use Python or BeautifulSoup. A cookie is required for mail.google.com, in lieu of a password, but the cookie can be saved and will work for years.
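For readers who want to see the saved-cookie idea in concrete form, here is a minimal sketch. It is an illustration only and not the client described above, which is not named and is explicitly not Python. The function name `build_request`, the file name `cookie.txt` and the User-Agent string are all hypothetical, and the sketch assumes the cookie was saved as a single "name=value; name=value" string.

```python
import urllib.request


def build_request(url: str, cookie: str) -> urllib.request.Request:
    """Build a GET request that sends a previously saved cookie string.

    `cookie` is assumed to be the raw value of a Cookie header,
    e.g. "name1=value1; name2=value2", saved from an earlier session.
    """
    headers = {
        "Cookie": cookie.strip(),             # drop the trailing newline from the file
        "User-Agent": "example-client/0.1",   # hypothetical identifier
    }
    return urllib.request.Request(url, headers=headers)


# Usage (needs network access and a valid cookie, so it is left commented out):
# with open("cookie.txt") as f:
#     req = build_request("https://mail.google.com/", f.read())
# with urllib.request.urlopen(req) as resp:
#     html = resp.read()
```

No password is sent at any point; the request authenticates only with whatever the saved cookie carries, so it works for as long as the server keeps honouring that cookie.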

1. Of course, Google Web Search is crippled. Using a basic HTTP client, e.g., no cookies, JavaScript, FLoC, etc., one cannot retrieve more than 250-300 results total. Searching "too fast" will draw a temporary IP block. This "search engine" is designed for advertising, not discovery. Advertisers compete for space at the top of the first page of results. Popular websites are prioritised, potentially making them even more popular. Websites that "rank"[FN2] too low in a search are not discoverable as they have no value for advertising. An index of public websites and public data is treated as proprietary and secret. Google actively tries to prevent anyone from copying even a small portion of it.

2. Google makes it impossible to sort results by URL, date, or even number of keyword/string hits in the page. Results are ordered according to a secret algorithm designed for advertising.
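To make concrete what user-controlled ordering could look like, here is a small sketch that sorts a locally held list of results by URL, title, date, or keyword hits. It does not reflect how any search engine works internally. The `Result` record, its fields, and the `sort_results` and `keyword_hits` helpers are hypothetical names, and the sketch assumes the page text and a publication date have already been retrieved for each result.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Result:
    """One search result held locally (hypothetical record layout)."""
    url: str
    title: str
    published: date
    text: str  # page text, assumed to be already fetched


def keyword_hits(result: Result, keyword: str) -> int:
    """Count case-insensitive, non-overlapping occurrences of keyword in the page text."""
    return result.text.lower().count(keyword.lower())


def sort_results(results: list[Result], key: str, keyword: str = "") -> list[Result]:
    """Return a new list ordered by a criterion the user picks."""
    if key == "url":
        return sorted(results, key=lambda r: r.url)
    if key == "title":
        return sorted(results, key=lambda r: r.title.lower())
    if key == "date":
        # Newest first.
        return sorted(results, key=lambda r: r.published, reverse=True)
    if key == "hits":
        # Most keyword occurrences first.
        return sorted(results, key=lambda r: keyword_hits(r, keyword), reverse=True)
    raise ValueError(f"unknown sort key: {key}")
```

Each ordering is a transparent, reproducible function of the data itself, in contrast to a ranking the user cannot inspect.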

2. Or sort by <title>.