by cloudyporpoise 1165 days ago
I've always been curious about how search engines seed their scanning and indexing programs. How do you know which domains, IPs, etc. to start scanning, and where is the origin?
4 comments

It's basically seeded with my personal bookmark list. Like a few dozen links.

Not exactly this, but close enough: https://memex.marginalia.nu/links/bookmarks.gmi

I've changed the crawler design a couple of times, but the principle for growing the set of sites to be crawled is to look for sites that are (in some sense) adjacent to domains that were found to be good.
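The principle above could be sketched roughly like this. It is only an illustration, not the actual crawler: `expand_frontier`, `link_graph`, and the `min_votes` threshold are hypothetical names, and "adjacent" is simplified here to "linked from several domains already judged good".

```python
from collections import Counter


def expand_frontier(good_domains, link_graph, known, min_votes=2):
    """Return unknown domains linked from at least `min_votes` good domains.

    link_graph maps a domain to the set of domains it links to (hypothetical
    data, in practice it would come from earlier crawls).
    """
    votes = Counter()
    for domain in good_domains:
        for neighbour in link_graph.get(domain, ()):
            if neighbour not in known:
                votes[neighbour] += 1
    return sorted(d for d, n in votes.items() if n >= min_votes)


# Seed set: a handful of hand-picked domains, like a bookmark list.
link_graph = {
    "a.example": {"b.example", "c.example"},
    "d.example": {"c.example", "e.example"},
}
good = {"a.example", "d.example"}
known = set(good)

print(expand_frontier(good, link_graph, known))  # ['c.example']
```

Repeating this step with the newly accepted domains added to the good set is one way a few dozen seed links could grow into a much larger crawl set.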

May I submit my sites to your index? I think they'd be a good fit for the index.

https://www.thran.uk and https://wmw.thran.uk

You can add them yourself :-)

https://search.marginalia.nu/site/www.thran.uk

https://search.marginalia.nu/site/wmw.thran.uk

This is only possible if the index already knows about the domain. Yours are known, but if they weren't, anyone can shoot me an email or something and I can poke them into the database.

The restriction to known domains is in place to avoid abuse.

Thanks!
1) How many pages are in your index? 2) How do you do indexing and retrieval? Do you build a word index by document and find documents that match all words in the query?
1) At this moment about 70 million documents. I've had it at about 110 million, dunno what the actual limit is.

2) Yes. Everything is in-house.

> Do you build a word index by document and find documents that match all words in the query?

Yeah. It's actually got three indices:

* One is a forward index with `document id -> document metadata`

* One is a priority term index with `term -> document id`.

* One is a full index with `term -> (document, term metadata)`

They're all based on static b-trees.
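To make the three mappings concrete, here is a toy sketch. Plain Python dicts stand in for the static b-trees, the sample documents and the per-term count used as "term metadata" are made up, and consulting the priority index before the full index is an assumption about how such a layout might be queried, not a description of the real engine.

```python
from dataclasses import dataclass


@dataclass
class DocMeta:
    url: str
    title: str


# Forward index: document id -> document metadata
forward = {
    1: DocMeta("https://a.example/", "Static b-trees"),
    2: DocMeta("https://b.example/", "Crawler notes"),
    3: DocMeta("https://c.example/", "B-tree crawler"),
}

# Priority term index: term -> document ids (hypothetical: important terms only)
priority = {
    "b-tree": [3],
    "crawler": [2, 3],
}

# Full index: term -> (document id, term metadata); metadata is a made-up count
full = {
    "b-tree": [(1, 4), (3, 2)],
    "crawler": [(2, 5), (3, 1)],
    "static": [(1, 3)],
}


def match_all(lookup, terms):
    """Ids of documents that contain every term, according to `lookup`."""
    postings = [set(lookup(t)) for t in terms]
    return set.intersection(*postings) if postings else set()


def search(terms):
    """AND query: priority-index hits first, then the remaining full-index hits."""
    first = match_all(lambda t: priority.get(t, []), terms)
    rest = match_all(lambda t: [d for d, _ in full.get(t, [])], terms) - first
    return [forward[d].url for d in sorted(first) + sorted(rest)]


print(search(["b-tree", "crawler"]))  # ['https://c.example/']
print(search(["b-tree"]))             # ['https://c.example/', 'https://a.example/']
```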

Is there a domain list if I wanted to crawl the hosts myself? I see you have the raw crawl data, which is appreciated, but a raw domain list would be cool.
I guess technically that could be arranged. Although I don't want everyone to run their own crawler. It would annoy a lot of webmasters and end up with even more hurdles to be able to run a crawler. Better to share the data if possible.
So if there was a new domain, unlinked by anything - this wouldn't find it?
It wouldn't. But such islands are typically not very interesting either. The context of who links to a domain is very important for a search engine for many tasks, not just discovery.
Very cool. The reason I ask is that, at first glance, the header "Search the Internet" implies to me that you are searching the entire internet. It sounds like a more appropriate header would be "Search the obscure Internet".
To be fair, no search engine lets you search the entire Internet, not even Google does this.

The Internet arguably doesn't even have a size. You can construct a website where each page n.example.com/m links to (n+1).example.com/m and n.example.com/(m+1), for each m and n between 0 and 1e308.
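A small simulation of that construction, as a sketch: pages are represented as `(n, m)` pairs instead of real URLs, and `crawl` is a hypothetical breadth-first crawler with a page budget. The point is that the queue of undiscovered pages keeps growing no matter how large the budget is.

```python
from collections import deque


def outlinks(n, m):
    """Page n.example.com/m links to (n+1).example.com/m and n.example.com/(m+1)."""
    return [(n + 1, m), (n, m + 1)]


def crawl(budget):
    """Breadth-first crawl from page (0, 0); returns (pages visited, pages still queued)."""
    seen = {(0, 0)}
    queue = deque([(0, 0)])
    visited = 0
    while queue and visited < budget:
        page = queue.popleft()
        visited += 1
        for nxt in outlinks(*page):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return visited, len(queue)


print(crawl(10))  # (10, 5)
```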

I did it! For every two numbers, calc.shpakovsky.ru has a static(-looking) webpage showing their sum (or difference, etc.), together with links to several other pages. The only limitation I know of is the 4k URL length. Interestingly enough, major search engines are rather smart about it and cooled down their indexing efforts after some time. I guess I'm not the first one to make such a website.
I remember reading somewhere that Google used DMOZ (https://en.wikipedia.org/wiki/DMOZ) as a seed for their crawler. Not sure if it's true though...
That may be a much easier question to answer than discovery.

How do you discover relevant new domains?

I've actually sort of solved this recently. Marginalia's ranking algorithm is a modified PageRank that instead of links uses website adjacencies[1].

It can rank websites even if they aren't indexed, based on who is linking to them.

Vanilla PageRank can't do this very well. Domains that aren't indexed don't have (known) outgoing links, so they end up in the periphery of the ranking. There are some tricks to get these to not mess up the algorithm completely, but they basically all rank poorly. That's even without considering all the well-known tricks for manipulating vanilla PageRank. The modified version seems very robust with regard to both problems.

[1] https://memex.marginalia.nu/log/73-new-approach-to-ranking.g...
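For illustration only, here is one way a PageRank-style ranking over adjacencies rather than links might look. Everything specific is an assumption and not taken from the linked post: adjacency is taken to be the cosine similarity between the sets of domains linking to two sites, and `adjacency_rank`, `inbound`, and the damping value are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sets of inbound-linking domains."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def adjacency_rank(inbound, damping=0.85, iterations=50):
    """Power iteration where rank flows along similarity weights, not links."""
    domains = sorted(inbound)
    count = len(domains)
    weights = {
        d: {o: cosine(inbound[d], inbound[o]) for o in domains if o != d}
        for d in domains
    }
    totals = {d: sum(w.values()) for d, w in weights.items()}
    rank = {d: 1.0 / count for d in domains}
    for _ in range(iterations):
        new = {d: (1.0 - damping) / count for d in domains}
        for d in domains:
            if totals[d] == 0:
                # No adjacent sites at all: spread this rank uniformly.
                for o in domains:
                    new[o] += damping * rank[d] / count
            else:
                for o, w in weights[d].items():
                    new[o] += damping * rank[d] * w / totals[d]
        rank = new
    return rank


# "unindexed.example" was never crawled, so its outgoing links are unknown,
# but the domains linking to it are known from other sites' pages.
inbound = {
    "indexed.example": {"x.example", "y.example"},
    "unindexed.example": {"x.example", "y.example"},
    "lonely.example": {"z.example"},
}
ranks = adjacency_rank(inbound)
print(max(ranks, key=ranks.get) != "lonely.example")  # True
```

In this toy data the uncrawled domain ends up ranked the same as the indexed one it shares inbound links with, which is the property described above: only who links to a site matters, not what the site itself links to.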

Start with Common Crawl and go from there.