by rwg 4523 days ago
Is it possible to write a script to scan all the internet (or at least the popular websites) and determine which ones are blocked?

If you can find or make a list of websites you want to scan, you can script it. The biggest problem is doing it in a way that doesn't bring you to the attention of those doing the blocking.

1. Where can I find a list of all domain names, top 1000, top 100000?

Alexa's "top 1,000,000" list (~10.2 MB download) is at http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
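To get a "top 1000" or "top 100000" out of that file, you only need the first N rows. Here is a rough sketch, assuming the unzipped CSV is plain "rank,domain" lines (e.g. "1,google.com"); the function name top_domains is just something I made up for illustration.

```python
import csv
import io


def top_domains(csv_text, limit):
    """Return the first `limit` domains from 'rank,domain' CSV text.

    Assumes each row looks like '1,google.com'; rows with fewer than
    two columns are skipped.
    """
    domains = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(domains) >= limit:
            break
        if len(row) < 2:
            continue  # blank or malformed line
        domains.append(row[1].strip().lower())
    return domains


if __name__ == "__main__":
    sample = "1,google.com\n2,facebook.com\n3,youtube.com\n"
    print(top_domains(sample, 2))
```

Download and unzip the list once, then work from the local copy so you are not fetching 10 MB repeatedly over expensive bandwidth.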

2. Is it possible to conclusively determine censorship from headers only or do I have to load the whole page and compare HTML code with a sample? Bandwidth is very expensive here.

It depends on the method used to block you from visiting a website.

If DNS-based blocking is used, you can use very small DNS lookups to identify whether or not a website is blocked — all of the hostnames of blocked websites will probably resolve to the same IP address. (You can check this with "nslookup www.website.com" in Windows or "host www.website.com" on Linux, OS X, etc.) If this method works, it's probably the best way — DNS requests are less likely to be logged than HTTP requests, and DNS requests and responses are small.
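If you want to script the DNS check instead of running nslookup by hand, something like the following might work. It is only a sketch: it resolves each hostname with the system resolver, groups hostnames by the address they resolve to, and flags any address shared by several hostnames. The function names and the threshold of 3 are my own assumptions, and large sites that share a CDN can also land on the same address, so treat a hit as a hint rather than proof.

```python
import socket
from collections import defaultdict


def group_by_address(hostnames, resolve=socket.gethostbyname):
    """Map each resolved IPv4 address to the hostnames that returned it.

    Hostnames that fail to resolve are collected under the key None.
    `resolve` is injectable so the logic can be tried without network access.
    """
    groups = defaultdict(list)
    for name in hostnames:
        try:
            address = resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            address = None
        groups[address].append(name)
    return dict(groups)


def suspect_addresses(groups, threshold=3):
    """Return addresses shared by at least `threshold` hostnames.

    The threshold is an arbitrary assumption; tune it for your list.
    """
    return {
        address: names
        for address, names in groups.items()
        if address is not None and len(names) >= threshold
    }


if __name__ == "__main__":
    groups = group_by_address(["www.example.com", "www.example.org"])
    print(suspect_addresses(groups, threshold=2))
```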

If the blocking uses a transparent proxy instead of forged DNS records, you could use HTTP HEAD requests and match against the "Server" header in the reply:

    Server: Apache/2.2.12 (Unix) mod_ssl/2.2.12 OpenSSL/0.9.7d mod_wsgi/3.2 mod_perl/1.29 PHP/4.4.1
The software listed in that "Server" header is terribly old, and I doubt you'll find any other web server on the Internet with that exact combination of software versions. So that could be a way to identify the server serving the "website blocked" page without downloading entire pages, but it might draw attention to you if you do it for thousands of websites.
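A minimal sketch of the HEAD approach, using only Python's standard http.client module. It assumes the block page is served over plain HTTP on port 80 and that an exact match on the "Server" string above is a reliable fingerprint; both are assumptions you would need to check from your own connection. The function names are made up for illustration.

```python
import http.client

# Fingerprint taken from the "Server" header quoted above.
BLOCK_PAGE_SERVER = (
    "Apache/2.2.12 (Unix) mod_ssl/2.2.12 OpenSSL/0.9.7d "
    "mod_wsgi/3.2 mod_perl/1.29 PHP/4.4.1"
)


def fetch_server_header(host, timeout=10):
    """Send a HEAD request for '/' and return the Server header, or None."""
    conn = http.client.HTTPConnection(host, 80, timeout=timeout)
    try:
        conn.request("HEAD", "/")
        return conn.getresponse().getheader("Server")
    finally:
        conn.close()


def looks_blocked(server_header):
    """True if the header exactly matches the block-page fingerprint."""
    return server_header is not None and server_header.strip() == BLOCK_PAGE_SERVER


if __name__ == "__main__":
    header = fetch_server_header("www.example.com")
    print(header, looks_blocked(header))
```

A HEAD reply carries headers only, so each check costs a few hundred bytes rather than a full page. It is still one logged HTTP request per site, so the warning about drawing attention applies here too.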

"...The biggest problem is doing it in a way that doesn't bring you to the attention of those doing the blocking...."

I think this is a HUGE issue that should not be taken lightly. A guy scanning certain websites from Iran IS going to attract some attention no matter how benign his motives. It just won't be taken lightly. That attention can land you on lists you don't want to be on.

I'm not saying that I don't sympathize with his/her situation... I just think that certain actions can be viewed by people with a security mindset as hostile. Indeed, it may only increase the number of sites being blocked, as well as SEVERELY restrict his/her ability to travel without being arrested. And if you attract enough of the right attention... you may find that being arrested is the least of your worries.

And all of this doesn't even take into account what Iranian authorities may do from their end.

Advice like this, given on a public forum via easily identifiable pseudonyms, should be taken with a BIG grain of salt.

Having lived the first two decades of my life in Iran, and naturally having had to circumvent network blocking there, I can tell you that's not how they work. Most of the blocking they do is targeted at the masses, and most people actually do circumvent it. People who circumvent the blocking do not generally face persecution, as that's basically 100% of Internet users.
I was referring, mostly, to what American authorities would think of an Iranian IP address port scanning web servers. That will get the attention of American authorities... and not in a good way.

You just don't go port scanning and probing willy-nilly in the US. That's DOUBLY true if you are port scanning and probing sites that the US government has blocked... AND you are doing it from inside Iran.

You're just BEGGING for Homeland Security to take a closer look at you. It's very foolish.

You may know your Government... but I know mine. I can tell you that an Iranian probing sites whose access from Iran is blocked by the American government for security reasons... that's not bright. Authorities here will not take kindly to it.

Exactly.

گر حکم شود که مست گیرند

در شهر هر آنکه هست گیرند

Sorry, I couldn't help citing this particular piece of Persian poetry. Trust me, it's relevant.

"If they tell you to get drunk, everyone in the city is the boss"?
The literal translation would be something along the lines of "if they rule to arrest drunk people, they'd have to arrest everyone in the city."
Thanks. They do not use DNS-based blocking. I will try using the HEAD method if I find a way to do it anonymously.
If you have the option, writing it to look for instances of US blocking, so that it only incidentally finds local censorship, may give you some ass-coverage.
Be careful. I don't think there would be a good way to do this anonymously without distributing the workload.