Hacker News
by 1vuio0pswjnm7 2101 days ago
Not every domain registry's zone files are available through CZDS, unfortunately.

Not every domain listed in a zone file represents a "website".

Choosing a random domain from a zone file and prefixing it with "http://" and having PHP send a GET request certainly does not have a 100% chance of returning a web page.

(Might be interesting to calculate the probability.)
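A rough way to estimate that probability would be to probe a random sample of names and count how many answer. This is only a sketch: `probe` and `hit_rate` are made-up names, "success" is assumed to mean any 2xx/3xx status over plain HTTP within 5 seconds, and the input is assumed to be one domain per line.

```shell
# probe NAME: succeed if http://NAME answers with a 2xx or 3xx status.
# Assumption: a 5-second timeout and plain HTTP only.
probe() {
    code=$(curl -s -o /dev/null -m 5 -w '%{http_code}' "http://$1")
    case "$code" in
        2*|3*) return 0 ;;
        *)     return 1 ;;
    esac
}

# hit_rate: read one domain per line on stdin, print "hits total".
# PROBE can name a different check command (handy for dry runs).
hit_rate() {
    hits=0
    total=0
    while read -r name; do
        [ -n "$name" ] || continue          # skip blank lines
        total=$((total + 1))
        if "${PROBE:-probe}" "$name" </dev/null; then
            hits=$((hits + 1))
        fi
    done
    echo "$hits $total"
}
```

With GNU `shuf` available, something like `shuf -n 1000 domains.txt | hit_rate` would give hits and sample size for 1000 random names (`domains.txt` being whatever list you extracted from the zone file).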

Seems like the author is not even filtering out A records corresponding to the NS entries in the zone files, e.g., something like a.ns.domain.tld

Sending a GET request to such subdomains is obviously not going to return a web page.
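One way to avoid the glue records would be to keep only the owner names of NS records, which drops the nameserver A records by construction. A minimal sketch, assuming the zone file has one record per line in the order name, TTL, class, type, rdata; `delegated_names` is a made-up name, not anything the author uses.

```shell
# delegated_names [FILE...]: print the unique owner names of NS records.
# Assumption: whitespace-separated fields "name ttl class type rdata".
# Glue A/AAAA records (e.g. a.ns.domain.tld) never match, so they are dropped.
# Note: the zone apex has NS records too, so the TLD itself will show up once.
delegated_names() {
    awk 'tolower($3) == "in" && tolower($4) == "ns" {
             sub(/\.$/, "", $1)      # strip the trailing root dot
             print tolower($1)
         }' "$@" | sort -u
}
```

Usage would be `delegated_names com.zone > domains.txt`, or piping a decompressed zone file into it on stdin.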

As for clicking the button over 200 million times (assuming the total domains listed in the zone files is 200 million), that might violate the ICANN Zone File Access Agreement. Unless the terms have changed, one of the restrictions used to be against redistributing the data. This project would not be redistribution of the IP address data, but if the user logs the names, there's an argument it could be redistribution of the name data.

To "click the button" once from the command line

   curl -s https://theinternetportal.net/php/button.php
   echo

It's true that this doesn't list all the websites that are registered, nor do all the domains lead to a working website. However, I think that most of the invalid websites are not caused by NS entries. As for the Zone File Access Agreement, it prohibits uses that allow access to a significant portion of the data. An immense amount of time would have to be spent scraping data to get any portion that could be considered significant.
Also, there are alternative, publicly accessible ways to get most of this public zone file data now, so I am not sure that restriction in the access agreement is anything more than an historical artifact at this point.

You could use publicly available scan data for ports 80 and 443 to pare down the list of "websites".
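For illustration, the paring-down step could be a simple join. This assumes two inputs you would have to prepare yourself: a file of IPs seen with port 80 or 443 open (one per line) and a file of "name ip" pairs from resolving the zone file names. The function name and file layout are hypothetical.

```shell
# with_open_port OPEN_IPS PAIRS: print names whose IP appears in OPEN_IPS.
# Assumption: OPEN_IPS has one IP per line; PAIRS has "name ip" per line.
with_open_port() {
    awk 'FILENAME == ARGV[1] { open[$1] = 1; next }   # load the scan results
         ($2 in open)        { print $1 }             # keep names that match
        ' "$1" "$2"
}
```

Something like `with_open_port open_ips.txt resolved.txt > candidates.txt` would then leave only names that at least have a listening web port behind them.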

The goal of exposing the non-popular web is worthwhile.

You could port scan the entire IPv4 address space (minus all reserved addresses), send a GET request to every host that responds, and filter for valid HTML. It would take no more than 5 hours on a shitty PC, and a lot less if you get a small AWS instance.
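For a sense of scale, here is a back-of-the-envelope check of the probe rate that timeline implies. It uses all 2^32 addresses as an upper bound (subtracting the reserved ranges lowers the figure somewhat) and counts one probe per address for a single port; `required_pps` is just an illustrative name.

```shell
# required_pps HOURS: probes per second needed to touch every IPv4
# address once within HOURS hours (integer division, upper bound).
required_pps() {
    hours=$1
    echo $(( 4294967296 / (hours * 3600) ))
}
```

`required_pps 5` prints 238609, so the 5-hour figure assumes the machine and its uplink can sustain roughly 240k probes per second per port.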
Most non-major sites are on shared hosting. Without a host name, you won't get anything useful unfortunately.
Most major sites are on shared hosting. (Sadly)
Thanks, I appreciate the feedback!