I've always been curious about how search engines seed their scanning and indexing programs. How do you know which domains, IPs, etc. to start scanning, and where is the origin?
I've changed the crawler design a couple of times, but the principle for growing the set of sites to be crawled is to look for sites that are (in some sense) adjacent to domains that were found to be good.
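To make the "adjacent to good domains" idea concrete, here is a minimal sketch of one way such a frontier could grow. It is not the actual crawler: the `expand_frontier` function, the `min_votes` threshold, and reading "adjacent" as "linked from several known-good domains" are all assumptions made for illustration.

```python
from collections import Counter


def expand_frontier(good_domains, links, min_votes=2):
    """Return unknown domains that at least `min_votes` good domains link to.

    `links` maps a domain to the domains it links to. The voting rule is a
    hypothetical stand-in for whatever adjacency measure a real crawler uses.
    """
    votes = Counter()
    for src in good_domains:
        # Count each linking domain once, however many links it contains.
        for dst in set(links.get(src, ())):
            if dst not in good_domains:
                votes[dst] += 1
    return sorted(d for d, n in votes.items() if n >= min_votes)


links = {
    "a.example": ["b.example", "c.example"],
    "d.example": ["c.example", "e.example"],
}
good = {"a.example", "d.example"}
print(expand_frontier(good, links))  # ['c.example']
```

Raising `min_votes` makes the crawl more conservative, at the cost of discovering new sites more slowly.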
This is only possible as long as the index knows about the domain. Yours are, but if a domain isn't, anyone can shoot me an email or something and I can poke it into the database.
The limitation for known domains is in place to avoid abuse.
1) How many pages are in your index?
2) How do you do indexing and retrieval? Do you build a word index by document and find documents that match all words in the query?
Is there a domain list if I wanted to crawl the hosts myself? I see you have the raw crawl data, which is appreciated, but a raw domain list would be cool.
I guess technically that could be arranged. Although I don't want everyone to run their own crawler. It would annoy a lot of webmasters and end up with even more hurdles to be able to run a crawler. Better to share the data if possible.
It wouldn't. But such islands are typically not very interesting either. The context of who links to a domain is very important for a search engine for many tasks, not just discovery.
Very cool. The reason I ask is that, at first glance, the header "Search the Internet" implies to me that you are searching the entire internet. It sounds like a more appropriate header would be "Search the obscure Internet".
To be fair, no search engine lets you search the entire Internet, not even Google.
The Internet arguably doesn't even have a size. You can construct a website like n.example.com/m that links to '(n+1).example.com/m' and 'n.example.com/(m+1)', for each m and n between 0 and 1e308.
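A small sketch of that construction, to show why a crawler can never finish such a site. The function names and the `https://` scheme are assumptions for illustration. Each page is generated on demand from its own coordinates, so the server stores nothing, yet the set of reachable pages keeps growing with every extra link hop.

```python
def outgoing_links(n, m):
    """Links served by the page at n.example.com/m."""
    return [
        f"https://{n + 1}.example.com/{m}",
        f"https://{n}.example.com/{m + 1}",
    ]


def pages_within(depth):
    """Count distinct pages reachable from 0.example.com/0 in `depth` hops."""
    seen = {(0, 0)}
    frontier = [(0, 0)]
    for _ in range(depth):
        next_frontier = []
        for n, m in frontier:
            for page in ((n + 1, m), (n, m + 1)):
                if page not in seen:
                    seen.add(page)
                    next_frontier.append(page)
        frontier = next_frontier
    return len(seen)


print(outgoing_links(0, 0))
print([pages_within(d) for d in range(4)])  # [1, 3, 6, 10]
```

The count grows quadratically with crawl depth, so any practical crawler needs a per-site budget or some other cutoff.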
I did it! For every two numbers, calc.shpakovsky.ru has a static(-looking) webpage showing their sum (or difference, etc.), together with links to several other pages. The only limitation I know of is the 4k URL length. Interestingly enough, major search engines are rather smart about it and cooled down their indexing efforts after some time. I guess I'm not the first one to make such a website.
I remember reading somewhere that Google used dmoz (https://en.wikipedia.org/wiki/DMOZ) as the seed for their crawler. Not sure if it's true though...
I've actually sort of solved this recently. Marginalia's ranking algorithm is a modified PageRank that instead of links uses website adjacencies[1].
It can rank websites even if they aren't indexed, based on who is linking to them.
Vanilla PageRank can't do this very well. Domains that aren't indexed don't have (known) outgoing links, so they end up in the periphery of the rank. There are some tricks to get these to not mess up the algorithm completely, but they basically all rank poorly. That's even without considering all the well-known tricks for manipulating vanilla PageRank. The modified version seems very robust with regard to both problems.
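For reference, here is a minimal sketch of textbook PageRank with one of the common tricks for nodes that have no known outgoing links: spreading their rank evenly over all nodes each iteration. This is the vanilla algorithm, not Marginalia's modified version, and the example graph and domain names are made up.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    `links` maps a node to the list of nodes it links to. Nodes that only
    appear as link targets (e.g. domains that were never crawled) have no
    known outgoing links and are treated as dangling.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by dangling nodes is redistributed uniformly.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        rank = new_rank
    return rank


graph = {
    "a.example": ["b.example", "unindexed.example"],
    "b.example": ["a.example"],
}
ranks = pagerank(graph)
for domain, score in sorted(ranks.items()):
    print(domain, round(score, 3))
```

Here `unindexed.example` still receives a score from its single inbound link, but it contributes nothing specific back to the graph, which is the weakness described above.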
Not exactly this, but close enough: https://memex.marginalia.nu/links/bookmarks.gmi