| HN Mirror



	by cloudyporpoise 1165 days ago
	I've always been curious about how search engines seed their scanning and indexing programs. How do you know which domains, IPs, etc. to start scanning, and where is the origin?

4 comments

marginalia_nu 1165 days ago

It's basically seeded with my personal bookmark list. Like a few dozen links.

Not exactly this, but close enough: https://memex.marginalia.nu/links/bookmarks.gmi

I've changed the crawler design a couple of times, but the principle for growing the set of sites to be crawled is to look for sites that are (in some sense) adjacent to domains that were found to be good.
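A minimal sketch of that growth principle, assuming "adjacent" simply means "linked to by a domain already judged good". The function name, the `links` mapping, and the vote-counting rule are all hypothetical illustrations, not the crawler's actual design:

```python
from collections import Counter

def next_domains(good, links, known, limit=10):
    """Rank not-yet-known domains by how many good domains link to them.

    good  -- set of domains already judged to be good (hypothetical input)
    links -- dict mapping a domain to the set of domains it links to
    known -- set of domains the crawler already knows about
    """
    votes = Counter()
    for src in good:
        for dst in links.get(src, ()):
            if dst not in known:
                votes[dst] += 1
    # Most votes first; ties are broken alphabetically so the order is stable.
    ordered = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [domain for domain, _ in ordered[:limit]]

# Toy data: two good seed domains and one known-but-bad domain.
links = {
    "a.example": {"c.example", "d.example"},
    "b.example": {"c.example"},
    "spam.example": {"e.example"},
}
good = {"a.example", "b.example"}
known = {"a.example", "b.example", "spam.example"}
candidates = next_domains(good, links, known)
```

Here `e.example` never becomes a candidate because only a non-good domain links to it, which is the point of seeding from a trusted list.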


HeckFeck 1165 days ago

May I submit my sites to your index? I think they'd be a good fit for the index.

https://www.thran.uk and https://wmw.thran.uk


marginalia_nu 1165 days ago

You can add them yourself :-)

https://search.marginalia.nu/site/www.thran.uk

https://search.marginalia.nu/site/wmw.thran.uk

This is only possible as long as the index already knows about the domain. Yours are known, but if they weren't, anyone can shoot me an email or something and I can poke them into the database.

The restriction to known domains is in place to avoid abuse.


HeckFeck 1165 days ago

Thanks!


gregw134 1165 days ago

1) How many pages are in your index? 2) How do you do indexing and retrieval? Do you build a word index by document and find documents that match all words in the query?


marginalia_nu 1165 days ago

1) At this moment about 70 million documents. I've had it at about 110 million, dunno what the actual limit is.

2) Yes. Everything is in-house.

> Do you build a word index by document and find documents that match all words in the query?

Yeah. It's actually got three indices:

* One is a forward index with `document id -> document metadata`

* One is a priority term index with `term -> document id`.

* One is a full index with `term -> (document, term metadata)`

They're all based on static b-trees.
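A rough sketch of how three such indices could fit together, using Python dicts and sorted lists as stand-ins for the static b-trees. The data, the field names, and the way the priority index is used for ordering are all assumptions made for illustration, not Marginalia's actual layout:

```python
from bisect import bisect_left

# Forward index: document id -> document metadata (toy data).
forward_index = {
    1: {"url": "https://a.example/"},
    2: {"url": "https://b.example/"},
    3: {"url": "https://c.example/"},
}

# Priority term index: term -> sorted document ids (assumed: "important" hits only).
priority_index = {"btree": [1, 3], "crawler": [2, 3]}

# Full index: term -> sorted list of (document id, term metadata).
full_index = {
    "btree": [(1, {"count": 4}), (3, {"count": 1})],
    "crawler": [(2, {"count": 7}), (3, {"count": 2})],
    "static": [(1, {"count": 2})],
}

def contains(sorted_ids, doc_id):
    """Binary search, the kind of lookup a sorted static structure allows."""
    i = bisect_left(sorted_ids, doc_id)
    return i < len(sorted_ids) and sorted_ids[i] == doc_id

def search(terms):
    """Return URLs of documents that contain every query term."""
    postings = [[doc for doc, _ in full_index.get(t, [])] for t in terms]
    if not postings:
        return []
    postings.sort(key=len)  # scan the shortest list, probe the others
    shortest, rest = postings[0], postings[1:]
    hits = [d for d in shortest if all(contains(p, d) for p in rest)]
    # Assumption: documents found in the priority index for all terms come first.
    first = [d for d in hits
             if all(contains(priority_index.get(t, []), d) for t in terms)]
    later = [d for d in hits if d not in first]
    return [forward_index[d]["url"] for d in first + later]
```

For example, `search(["btree", "crawler"])` intersects the two posting lists and resolves the surviving document id through the forward index.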


abracadaniel 1165 days ago

Is there a domain list if I wanted to crawl the hosts myself? I see you have the raw crawl data, which is appreciated, but a raw domain list would be cool.


marginalia_nu 1165 days ago

I guess technically that could be arranged, although I don't want everyone to run their own crawler. It would annoy a lot of webmasters and lead to even more hurdles to running a crawler. Better to share the data if possible.


cloudyporpoise 1165 days ago

So if there was a new domain, unlinked by anything - this wouldn't find it?


marginalia_nu 1165 days ago

It wouldn't. But such islands are typically not very interesting either. The context of who links to a domain is very important for a search engine for many tasks, not just discovery.


cloudyporpoise 1165 days ago

Very cool. The reason I ask is that, at first glance, the header "Search the Internet" implies to me that you are searching the entire internet. It sounds like a more appropriate header would be "Search the obscure Internet".


marginalia_nu 1165 days ago

To be fair, no search engine lets you search the entire Internet, not even Google does this.

The Internet arguably doesn't even have a size. You can construct a website like n.example.com/m, where each page links to '(n+1).example.com/m' and 'n.example.com/(m+1)', for each m and n between 0 and 1e308.
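A small sketch of why such a site swamps a naive crawler. The `outlinks`, `url`, and `crawl` helpers are hypothetical, and the 1e308 upper bound is ignored since no crawl budget would ever reach it:

```python
from collections import deque

def outlinks(n, m):
    """The two links on the hypothetical page n.example.com/m."""
    return [(n + 1, m), (n, m + 1)]

def url(n, m):
    return f"https://{n}.example.com/{m}"

def crawl(max_pages):
    """Breadth-first crawl from 0.example.com/0 with a fixed page budget.

    Returns (pages fetched, pages still waiting in the queue).
    """
    seen = {(0, 0)}
    queue = deque([(0, 0)])
    fetched = 0
    while queue and fetched < max_pages:
        page = queue.popleft()
        fetched += 1  # stands in for actually downloading url(*page)
        for link in outlinks(*page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched, len(queue)
```

However large the budget passed to `crawl`, the queue is never empty when it returns, so there is no point at which the site has been "fully" crawled.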


Lex-2008 1165 days ago

I did it! For every two numbers, calc.shpakovsky.ru has a static(-looking) webpage showing their sum (or difference, etc.), together with links to several other pages. The only limitation I know of is the 4k URL length. Interestingly enough, major search engines are rather smart about it and cooled down their indexing efforts after some time. I guess I'm not the first one to make such a website.


gertgoeman 1165 days ago

I remember reading somewhere that Google used DMOZ (https://en.wikipedia.org/wiki/DMOZ) as the seed for their crawler. Not sure if it's true though...


djoldman 1165 days ago

That may be a much easier question to answer than discovery.

How do you discover relevant new domains?


marginalia_nu 1165 days ago

I've actually sort of solved this recently. Marginalia's ranking algorithm is a modified PageRank that uses website adjacencies[1] instead of links.

It can rank websites even if they aren't indexed, based on who is linking to them.

Vanilla PageRank can't do this very well. Domains that aren't indexed don't have (known) outgoing links, so they end up in the periphery of the ranking. There are some tricks to keep these from messing up the algorithm completely, but they basically all rank poorly. That's even without considering all the well-known tricks for manipulating vanilla PageRank. The modified version seems very robust with regard to both problems.

[1] https://memex.marginalia.nu/log/73-new-approach-to-ranking.g...
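To make the idea concrete, here is a toy sketch of a PageRank-style iteration over similarity weights rather than links. It assumes "adjacency" means cosine similarity between the sets of domains linking to each site, which is only one plausible reading; the data, function names, and damping value are illustrative and not taken from Marginalia:

```python
from math import sqrt

# Hypothetical data: domain -> set of domains that link to it.
inbound = {
    "a.example": {"x.example", "y.example"},
    "b.example": {"x.example", "y.example", "z.example"},
    "c.example": {"z.example"},
    "new.example": {"x.example", "y.example"},  # never crawled, only linked to
}

def cosine(s, t):
    """Cosine similarity of two sets viewed as binary vectors."""
    if not s or not t:
        return 0.0
    return len(s & t) / sqrt(len(s) * len(t))

def similarity_rank(inbound, damping=0.85, iterations=50):
    domains = sorted(inbound)
    n = len(domains)
    # Edge weights come from shared inbound linkers, not from outgoing links.
    weights = {d: {e: cosine(inbound[d], inbound[e]) for e in domains if e != d}
               for d in domains}
    rank = {d: 1.0 / n for d in domains}
    for _ in range(iterations):
        nxt = {d: (1.0 - damping) / n for d in domains}
        for d in domains:
            total = sum(weights[d].values())
            if total == 0:
                # A domain similar to nothing spreads its rank evenly.
                for e in domains:
                    nxt[e] += damping * rank[d] / n
            else:
                for e, w in weights[d].items():
                    nxt[e] += damping * rank[d] * w / total
        rank = nxt
    return rank

rank = similarity_rank(inbound)
```

In this toy graph `new.example` has no known outgoing links at all, yet it ends up with the same rank as `a.example` because the same sites link to both.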


ddorian43 1165 days ago

Start with Common Crawl and go from there.
