by marginalia_nu 1165 days ago
It's basically seeded with my personal bookmark list. Like a few dozen links.

Not exactly this, but close enough: https://memex.marginalia.nu/links/bookmarks.gmi

I've changed the crawler design a couple of times, but the principle for growing the set of sites to be crawled is to look for sites that are (in some sense) adjacent to domains that were found to be good.
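A minimal sketch of that adjacency principle, not the actual crawler code. It assumes a hypothetical `link_graph` mapping each domain to the domains it links to, and treats "adjacent" as "linked from a known-good domain", which is only one possible reading of "in some sense":

```python
from collections import Counter


def adjacent_candidates(link_graph, good_domains, min_links=1):
    """Rank not-yet-known domains by how many known-good domains link to them.

    link_graph and min_links are illustrative assumptions, not part of
    any real crawler's interface.
    """
    counts = Counter()
    for domain in good_domains:
        for target in link_graph.get(domain, ()):
            if target not in good_domains:
                counts[target] += 1
    # most_common() orders by descending count, so better-connected
    # candidates come first.
    return [d for d, n in counts.most_common() if n >= min_links]
```

Raising `min_links` would make the frontier more conservative, since a domain then needs several good neighbours before it gets crawled.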


May I submit my sites to your index? I think they'd be a good fit for the index.

https://www.thran.uk and https://wmw.thran.uk

You can add them yourself :-)

https://search.marginalia.nu/site/www.thran.uk

https://search.marginalia.nu/site/wmw.thran.uk

This is only possible as long as the index knows about the domain. Yours are known, but if they weren't, anyone can shoot me an email or something and I can poke them into the database.

The limitation for known domains is in place to avoid abuse.

Thanks!
1) How many pages are in your index? 2) How do you do indexing and retrieval? Do you build a word index by document and find documents that match all words in the query?
1) At this moment about 70 million documents. I've had it at about 110 million, dunno what the actual limit is.

2) Yes. Everything is in-house.

(Do you build a word index by document and find documents that match all words in the query?)

Yeah. It's actually got three indices:

* One is a forward index with `document id -> document metadata`

* One is a priority term index with `term -> document id`.

* One is a full index with `term -> (document, term metadata)`

They're all based on static b-trees.
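To make the three-index layout concrete, here is a toy sketch. It uses plain Python dicts rather than static b-trees, and several details are assumptions for illustration only: the `ToyIndex` name, treating title words as the "priority" terms, and using word positions as the term metadata.

```python
from dataclasses import dataclass, field


@dataclass
class ToyIndex:
    forward: dict = field(default_factory=dict)   # document id -> document metadata
    priority: dict = field(default_factory=dict)  # term -> set of document ids
    full: dict = field(default_factory=dict)      # term -> {document id: term metadata}

    def add(self, doc_id, title, words):
        self.forward[doc_id] = {"title": title}
        # Assumption: term metadata is the list of positions of the term.
        for pos, word in enumerate(words):
            self.full.setdefault(word, {}).setdefault(doc_id, []).append(pos)
        # Assumption: title words are what counts as "priority" terms.
        for word in title.lower().split():
            self.priority.setdefault(word, set()).add(doc_id)

    def search(self, query):
        terms = query.lower().split()
        if not terms:
            return []
        # Keep only documents that contain every query term.
        hits = set.intersection(*(set(self.full.get(t, {})) for t in terms))

        def priority_hits(doc_id):
            return sum(doc_id in self.priority.get(t, ()) for t in terms)

        # Documents matching more priority terms sort first; ties by id.
        return sorted(hits, key=lambda d: (-priority_hits(d), d))
```

A result id can then be looked up in `forward` to get the metadata needed to render it.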

Is there a domain list if I wanted to crawl the hosts myself? I see you have the raw crawl data, which is appreciated, but a raw domain list would be cool.
I guess technically that could be arranged. Although I don't want everyone to run their own crawler. It would annoy a lot of webmasters and lead to even more hurdles for anyone trying to run a crawler. Better to share the data if possible.
So if there was a new domain, unlinked by anything - this wouldn't find it?
It wouldn't. But such islands are typically not very interesting either. The context of who links to a domain is very important for a search engine for many tasks, not just discovery.
Very cool. Reason I ask is that, at first glance, the header "Search the Internet" implies to me that you are searching the entire internet. It sounds like a more appropriate header would be "Search the obscure Internet".
To be fair, no search engine lets you search the entire Internet, not even Google does this.

The Internet arguably doesn't even have a size. You can construct a website that's like 'n.example.com/m' which links to '(n+1).example.com/m' and 'n.example.com/(m+1)', for each m and n between 0 and 1e308.
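A small sketch of that construction, with hypothetical helper names (`outlinks`, `pages_within`). Nothing has to be stored: each page's links are computed from its own address, which is why such a site has no meaningful size.

```python
def outlinks(n, m):
    """Links on the page n.example.com/m, as described above."""
    return [f"{n + 1}.example.com/{m}", f"{n}.example.com/{m + 1}"]


def pages_within(clicks):
    """Count distinct pages reachable from 0.example.com/0 in at most `clicks` clicks."""
    frontier = {(0, 0)}
    seen = set(frontier)
    for _ in range(clicks):
        # Follow both links from every page on the current frontier.
        frontier = {
            page
            for (n, m) in frontier
            for page in ((n + 1, m), (n, m + 1))
        } - seen
        seen |= frontier
    return len(seen)
```

The reachable set keeps growing with every extra click, so a crawler needs some external stopping rule.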

I did it! For every two numbers, calc.shpakovsky.ru has a static(-looking) webpage showing their sum (or difference, etc.), together with links to several other pages. The only limitation I know of is the 4k URL length. Interestingly enough, major search engines are rather smart about it and cooled down their indexing efforts after some time. Guess I'm not the first one to make such a website.
Haha, nice! Crawler traps are quite an old phenomenon. Been around since before Google.

Dunno about the others, but my crawler has a set depth it will crawl. It'll BFS for like 1000-10000 documents depending on some factors.
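A minimal sketch of a budgeted breadth-first crawl, not the real crawler. `get_links` is a hypothetical callback that returns the outgoing links of a URL, and `max_documents` stands in for the 1000-10000 document budget mentioned above:

```python
from collections import deque


def crawl_bfs(start, get_links, max_documents=1000):
    """Visit pages breadth-first, stopping once the document budget is spent."""
    seen = {start}
    queue = deque([start])
    fetched = []
    while queue and len(fetched) < max_documents:
        url = queue.popleft()
        fetched.append(url)
        for link in get_links(url):
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                queue.append(link)
    return fetched
```

Because the budget caps the number of fetched documents, the loop terminates even on an endlessly linking site, and breadth-first order means the pages closest to the start are the ones that get fetched.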