HN Mirror


by marginalia_nu 1165 days ago

It's basically seeded with my personal bookmark list. Like a few dozen links.

Not exactly this, but close enough: https://memex.marginalia.nu/links/bookmarks.gmi

I've changed the crawler design a couple of times, but the principle for growing the set of sites to be crawled is to look for sites that are (in some sense) adjacent to domains that were found to be good.
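
A rough sketch of that growth principle, purely as an illustration. The comment doesn't say how adjacency is measured, so "linked from a known-good domain" is an assumption here, and `rank_candidates`, `links`, and `good` are hypothetical names:

```python
from collections import Counter

def rank_candidates(links, good):
    """Rank unknown domains by how many known-good domains link to them.

    links: dict mapping a domain to the set of domains it links to.
    good:  set of domains already judged good (e.g. a bookmark list).
    Both structures are assumptions made for this sketch.
    """
    counts = Counter()
    for src in good:
        for dst in links.get(src, ()):
            if dst not in good:
                counts[dst] += 1
    # Most-linked-from-good first; ties broken alphabetically for determinism.
    return sorted(counts, key=lambda d: (-counts[d], d))
```

Domains that no good domain links to never show up as candidates, so crawling only spreads outward from the seed set.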

3 comments

HeckFeck 1165 days ago

May I submit my sites to your index? I think they'd be a good fit for the index.

https://www.thran.uk and https://wmw.thran.uk


marginalia_nu 1165 days ago

You can add them yourself :-)

https://search.marginalia.nu/site/www.thran.uk

https://search.marginalia.nu/site/wmw.thran.uk

This is only possible if the index already knows about the domain. Yours are known; for domains that aren't, anyone can shoot me an email or something and I can poke them into the database.

The limitation for known domains is in place to avoid abuse.


HeckFeck 1165 days ago

Thanks!


gregw134 1165 days ago

1) How many pages are in your index? 2) How do you do indexing and retrieval? Do you build a word index by document and find documents that match all words in the query?


marginalia_nu 1165 days ago

1) At this moment about 70 million documents. I've had it at about 110 million, dunno what the actual limit is.

2) Yes. Everything is in-house.

> Do you build a word index by document and find documents that match all words in the query?

Yeah. It's actually got three indices:

* One is a forward index with `document id -> document metadata`

* One is a priority term index with `term -> document id`.

* One is a full index with `term -> (document, term metadata)`

They're all based on static b-trees.
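
A toy sketch of how three such indices could fit together. Plain dicts stand in for the static b-trees, and every name here is hypothetical. What counts as a "priority" term isn't stated above, so title words are used as an assumption:

```python
from dataclasses import dataclass, field

@dataclass
class ToyIndex:
    forward: dict = field(default_factory=dict)   # document id -> document metadata
    priority: dict = field(default_factory=dict)  # term -> set of document ids
    full: dict = field(default_factory=dict)      # term -> {document id: term metadata}

    def add(self, doc_id, title, body):
        self.forward[doc_id] = {"title": title}
        # Assumption: terms in the title count as priority terms.
        for term in set(title.lower().split()):
            self.priority.setdefault(term, set()).add(doc_id)
        # Full index keeps per-document term metadata (here: word positions).
        for pos, term in enumerate((title + " " + body).lower().split()):
            meta = self.full.setdefault(term, {}).setdefault(doc_id, {"positions": []})
            meta["positions"].append(pos)

    def search(self, query):
        terms = query.lower().split()
        if not terms:
            return []
        # Keep only documents that contain every query term.
        matches = set(self.full.get(terms[0], {}))
        for term in terms[1:]:
            matches &= set(self.full.get(term, {}))
        # Documents where all terms are priority terms sort first.
        def all_priority(doc_id):
            return all(doc_id in self.priority.get(t, ()) for t in terms)
        return sorted(matches, key=lambda d: (not all_priority(d), d))
```

After a search, `forward[doc_id]` is where you'd look up the metadata needed to render each result.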


abracadaniel 1165 days ago

Is there a domain list if I wanted to crawl the hosts myself? I see you have the raw crawl data, which is appreciated, but a raw domain list would be cool.


marginalia_nu 1165 days ago

I guess technically that could be arranged. Although I don't want everyone to run their own crawler. It would annoy a lot of webmasters and end up with even more hurdles to be able to run a crawler. Better to share the data if possible.


cloudyporpoise 1165 days ago

So if there was a new domain, unlinked by anything - this wouldn't find it?


marginalia_nu 1165 days ago

It wouldn't. But such islands are typically not very interesting either. The context of who links to a domain is very important for a search engine for many tasks, not just discovery.


cloudyporpoise 1165 days ago

Very cool. The reason I ask is that, at first glance, the header "Search the Internet" implies to me that you are searching the entire internet. It sounds like a more appropriate header would be "Search the obscure Internet".


marginalia_nu 1165 days ago

To be fair, no search engine lets you search the entire Internet, not even Google.

The Internet arguably doesn't even have a size. You can construct a website like `n.example.com/m` that links to `(n+1).example.com/m` and `n.example.com/(m+1)`, for each m and n between 0 and 1e308.
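
To make the construction concrete, here's a small sketch. The function names are made up, and it skips the upper bound on n and m since Python integers don't overflow:

```python
def outgoing_links(n, m, domain="example.com"):
    """The two links on page n.<domain>/m, following the construction above."""
    return [f"{n + 1}.{domain}/{m}", f"{n}.{domain}/{m + 1}"]

def pages_within(clicks):
    """Count distinct pages reachable from 0.example.com/0 in at most `clicks` clicks."""
    frontier = {(0, 0)}
    seen = set(frontier)
    for _ in range(clicks):
        # Each page (n, m) links to (n + 1, m) and (n, m + 1).
        frontier = {p for (n, m) in frontier for p in ((n + 1, m), (n, m + 1))} - seen
        seen |= frontier
    return len(seen)
```

The count grows as (k + 1)(k + 2) / 2 for k clicks, so a crawler that follows every link never runs out of new pages.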


Lex-2008 1165 days ago

I did it! For every two numbers, calc.shpakovsky.ru has a static(-looking) webpage showing their sum (or difference, etc.), together with links to several other pages. The only limitation I know of is the 4k URL length. Interestingly enough, major search engines are rather smart about it and cooled down their indexing efforts after some time. Guess I'm not the first one to make such a website.


marginalia_nu 1165 days ago

Haha, nice! Crawler traps are quite an old phenomenon. They've been around since before Google.

Dunno about the others, but my crawler has a set depth it will crawl. It'll BFS for like 1000-10000 documents depending on some factors.
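
A minimal sketch of a BFS with a document budget, not the actual crawler code. `bounded_bfs`, `get_links`, and `max_docs` are hypothetical names, and the factors that pick the budget aren't modelled:

```python
from collections import deque

def bounded_bfs(start, get_links, max_docs=1000):
    """Visit pages breadth-first from `start`, stopping after `max_docs` documents.

    get_links: callable returning the outgoing links of a page (assumed to
    stand in for fetching and parsing the document).
    """
    seen = {start}
    queue = deque([start])
    visited = []
    while queue and len(visited) < max_docs:
        page = queue.popleft()
        visited.append(page)
        for nxt in get_links(page):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return visited
```

With a budget like this, a trap site costs at most `max_docs` fetches no matter how many pages it pretends to have.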
