As someone who lives in Iran, this is sad but not news. I have gotten used to seeing half the websites blocked by my government (Facebook, YouTube, Flickr, WordPress, etc.) and the other half by your government (Java SE or anything else from Oracle, Google Code, Google Play Store, anything from Xilinx, etc.). If one of my favorite websites were blocked, I might have considered not using it anymore. When virtually all websites are blocked, I can either not use the internet or find a way around it. Of course I chose the second option. Most Iranians have been using proxies and VPNs for the past few years. This blockage would not affect us much.
P.S. Please stop using Google Code.
Edit: Also App Engine. Udacity has been inaccessible to Iranians since the beginning because they use App Engine for hosting. This is what I get when I try to access Udacity: http://i.imgur.com/zUecPHk.png
P.P.S. I am curious what percentage of the internet is blocked in Iran. When you try to access a blocked website, the censorship system shows a page explaining that the website is blocked, along with some links to Iranian websites. Is it possible to write a script to scan the whole internet (or at least the popular websites) and determine which ones are blocked? Here is what I get when I try to access YouTube: http://git.io/HG3nsQ
I have two questions:
1. Where can I find a list of all domain names, top 1000, top 100000?
2. Is it possible to conclusively determine censorship from headers only or do I have to load the whole page and compare HTML code with a sample? Bandwidth is very expensive here.
If you can find or make a list of websites you want to scan, you can script it. The biggest problem is doing it in a way that doesn't bring you to the attention of those doing the blocking.
1. Where can I find a list of all domain names, top 1000, top 100000?
Alexa's "top 1,000,000" list (~10.2 MB download) is at http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
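
Once the zip is downloaded (one time only, given the bandwidth cost), trimming it to the top N could look roughly like the sketch below. It assumes the archive holds a single file named top-1m.csv with one "rank,domain" pair per line; the function name `top_domains` and the `member` parameter are made up for illustration.

```python
import csv
import io
import zipfile


def top_domains(zip_bytes, limit, member="top-1m.csv"):
    """Return the first `limit` domains from a zipped "rank,domain" CSV.

    Assumption: rows look like "1,google.com" and are already sorted by rank.
    """
    domains = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(member) as raw:
            reader = csv.reader(io.TextIOWrapper(raw, encoding="utf-8", newline=""))
            for row in reader:
                if len(domains) >= limit:
                    break  # stop early so the rest of the file is never parsed
                if len(row) < 2:
                    continue  # skip blank or malformed lines
                domains.append(row[1].strip())
    return domains
```

For example, reading a local copy with `open("top-1m.csv.zip", "rb").read()` and calling `top_domains(data, 1000)` would give the top 1000 hostnames to scan.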
2. Is it possible to conclusively determine censorship from headers only or do I have to load the whole page and compare HTML code with a sample? Bandwidth is very expensive here.
It depends on the method used to block you from visiting a website.
If DNS-based blocking is used, very small DNS lookups can tell you whether or not a website is blocked: the hostnames of blocked websites will probably all resolve to the same IP address. (You can check this with "nslookup www.website.com" on Windows or "host www.website.com" on Linux, OS X, etc.) If this method works, it's probably the best way, because DNS requests are less likely to be logged than HTTP requests, and DNS requests and responses are small.
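
A rough Python sketch of the DNS idea: resolve each hostname and report any IP address that several hostnames share. The function names and the `min_shared` threshold are my own placeholders, and the check is only a heuristic, since unrelated sites on the same CDN or shared host also resolve to one address. Resolving a few sites you already know are blocked first would tell you which shared IP to look for.

```python
import socket
from collections import defaultdict


def resolve(hostname):
    """Return one IPv4 address for hostname, or None if the lookup fails."""
    try:
        return socket.gethostbyname(hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def suspected_block_ips(hostnames, resolver=resolve, min_shared=3):
    """Group hostnames by resolved IP and keep IPs shared by >= min_shared names.

    `resolver` can be swapped out (for example, for a lookup against a
    specific DNS server); by default the system resolver is used.
    """
    by_ip = defaultdict(list)
    for name in hostnames:
        ip = resolver(name)
        if ip is not None:
            by_ip[ip].append(name)
    return {ip: names for ip, names in by_ip.items() if len(names) >= min_shared}
```

Calling `suspected_block_ips(["facebook.com", "youtube.com", "flickr.com"])` from inside the filtered network would return a single entry if all three are redirected to the same block-page address.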
If the blocking uses a transparent proxy instead of forged DNS records, you could use HTTP HEAD requests and match against the "Server" header in the reply:
The software listed in that "Server" header is terribly old, and I doubt you'll find any other web server on the Internet with that exact combination of software versions. So that could be a way to identify the server serving the "website blocked" page without downloading entire pages, but it might draw attention to you if you do it for thousands of websites.
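
The HEAD-request check might be sketched as follows, using Python's standard `http.client`. The exact "Server" string of the block page is not reproduced here, so `block_signature` is a placeholder you would fill in from the header you actually see on a known-blocked site; both function names are invented for this example.

```python
import http.client


def fetch_server_header(host, port=80, timeout=10.0):
    """Send a HEAD request for "/" and return the Server header (None if absent).

    HEAD responses carry headers only, so no page body is downloaded.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("HEAD", "/")
        response = conn.getresponse()
        return response.getheader("Server")
    finally:
        conn.close()


def looks_blocked(server_header, block_signature):
    """True if the Server header matches the block page's signature string.

    The comparison ignores case and surrounding whitespace; a missing
    header is treated as "not the block page".
    """
    if server_header is None:
        return False
    return server_header.strip().lower() == block_signature.strip().lower()
```

In use, `looks_blocked(fetch_server_header("www.youtube.com"), signature)` would be called once per hostname, ideally with a pause between requests, since thousands of rapid requests are what could draw attention.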