by marginalia_nu 1543 days ago
All of this is a long series of solvable problems. I should know, I've dabbled in solving most of them. This is why I suggest actually taking a stab at it before you dismiss it as impossible.

There are some problems that aren't as big as they seem. Parts of an SPA can't be reliably linked to anyway even if you find interesting text there, so you can just leave them out of the index.

Likewise, there isn't as great a need to keep a fresh index as it may seem. The odds that a document has changed since your last crawl are proportional to how frequently it changes. This is a bit of a paradox: even if you crawl really aggressively, the most frequently changing documents will still always be out of date. Most documents are relatively stable over time. You can actually use how often you see changes to a document or website to modulate how often you crawl it.
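
As a rough illustration of modulating crawl frequency by observed changes, here is a minimal Python sketch. The function name, the multipliers (0.5 and 1.5) and the 1 to 90 day bounds are all assumptions made up for the example, not details of any particular crawler.

```python
def next_crawl_interval(current_days: float, changed: bool,
                        min_days: float = 1.0, max_days: float = 90.0) -> float:
    """Return the number of days to wait before the next crawl.

    Hypothetical policy: halve the interval when the document changed since
    the last visit, grow it by 50% when it did not, and clamp the result.
    """
    factor = 0.5 if changed else 1.5
    return max(min_days, min(max_days, current_days * factor))


# A page that keeps changing converges on the minimum interval,
# while a stable page drifts toward the maximum.
interval = 16.0
for _ in range(6):
    interval = next_crawl_interval(interval, changed=True)
```

With this kind of rule, stable documents quickly stop consuming crawl budget, and the documents that change on every visit are capped at the minimum interval rather than being chased indefinitely.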

The bad HTML is quite manageable. You really just need to flatten the document to get at the visible text. Even with really broken formatting, that's manageable.
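
A minimal sketch of what flattening could look like, using Python's standard `html.parser`, which tolerates unclosed and mismatched tags. The class and function names are hypothetical, and the list of skipped tags is an assumption about what counts as non-visible content.

```python
from html.parser import HTMLParser


class TextFlattener(HTMLParser):
    """Collects text nodes and ignores the tag structure entirely."""

    # Tags whose contents are not visible text (an assumed, incomplete list).
    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        # Join chunks with spaces so adjacent blocks do not run together,
        # then collapse all whitespace runs to a single space.
        return " ".join(" ".join(self._chunks).split())


def flatten(html: str) -> str:
    parser = TextFlattener()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()
```

Because nothing here depends on tags being balanced, input like `<p>Hello <b>world<p>again</div>` still yields usable text. The trade-off of joining with spaces is that a word split by inline markup ends up as two tokens.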

The storage demands are also not as bad as you might think (most documents are tiny, under 10 KB), and there are ways to lessen the blow on top of that. Both text and indexes can compress extremely well. Since you're paying for disk access by the block, you might as well cram more stuff into a block.
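
To give a sense of why indexes compress so well, here is a small Python sketch of one common technique: storing a sorted list of document IDs as gaps, with each gap written as a variable-length integer. This is a generic illustration and not necessarily the scheme used in the index described here; the function names are made up for the example.

```python
def encode_postings(doc_ids):
    """Delta + varint encode a strictly increasing list of non-negative IDs."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        gap = doc_id - prev
        prev = doc_id
        # Emit 7 bits per byte, low bits first; the high bit means "more follows".
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode_postings(data):
    """Inverse of encode_postings."""
    doc_ids = []
    current = 0
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            current += gap
            doc_ids.append(current)
            gap = 0
            shift = 0
    return doc_ids


ids = [3, 7, 300, 301, 100000]
packed = encode_postings(ids)
```

Here the five IDs take 8 bytes instead of the 20 they would need as fixed 32-bit integers, so more of a posting list fits into each disk block that gets read.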

Most of the crawling concerns, in general, can be gotten around by starting off with Common Crawl (even if I do my own crawling, which also is finicky but manageable).

> This is relatively straightforward for a limited search and document space up to a few million entries in your DB. A few million documents should be doable with off the shelf parts.

Right, so shouldn't the question be how to find the documents that are even candidates for being search results? Most documents are not ever going to be relevant to any query ever. Get rid of that noise and your hardware goes a lot longer.

I'm running a search engine on consumer hardware out of my living room that can index 100 million documents. Go a bit higher budget than a consumer PC, and you've got 5 billion. That goes a long way.

> Get rid of that noise and your hardware goes a lot longer.

What qualifies? What defines signal, what noise? I agree, that a lot (probably nearly all) pages will receive very, very little traffic/search requests. But are these therefore not relevant?

> I'm running a search engine on consumer hardware out of my living room that can index 100 million documents.

That's extremely cool. I would love to know more. To me an impressive feat already.

I think I was editing the comment while you were replying. Sorry about that. I was just adding to it though, didn't really rug pull on your response so I think it's fine.

> What qualifies? What defines signal, what noise? I agree, that a lot (probably nearly all) pages will receive very, very little traffic/search requests. But are these therefore not relevant?

Now this is a proper difficult problem with (probably) fairly subjective answers. I do however think it's something that warrants serious investigation. It's probably a decent candidate for a machine learning model combined with some manual tweaking for sites similar to Wikipedia or GitHub that have absurd amounts of parallel historical content.

Developing heuristics for this is a bit of a hobby horse of mine. It feels tantalizingly almost doable with just a little bit more resources and time than I have.

> That's extremely cool. I would love to know more. To me an impressive feat already.

Yeah it's at <https://search.marginalia.nu/>. I've built all the software myself from scratch in Java[1], and I'm doing my own crawling and indexing. The machine it's on is a Ryzen 3900X with 128 GB RAM. Most of the index is on a single 1 TB consumer grade SSD.

I do use a MariaDB database for some metadata, but I think it will have to go as its hardware demands are becoming a serious bottleneck.

[1] I should say, regarding the index, that it's built in Java despite the drawbacks; this is approaching sunk cost at this point. Building a search engine index is not something Java is at all suitable for; its limited low-level I/O capabilities are incredibly handicapping.

> Building a search engine index is not something Java is at all suitable for

Worth pointing out that Lucene/Solr, the biggest open source player, is also Java!

This is some of the nonsense you are dealing with implementing a search index in Java:

* You can only allocate on-heap arrays of 2 billion items.

* On-heap arrays have a massive size overhead in terms of GC book-keeping.

* You can only allocate off-heap memory 2 GB at a time.

* This also goes for memory mapped areas.

* You have no control over the lifecycle of mapped memory and off-heap memory. They get cleared if and when the GC feels like it.

* You have no madvise capabilities

* The language barely acknowledges unsigned types

> I [...] didn't really rug pull on your response so I think it's fine.

No you didn't. All good. And I learned a lot from the extended answer. So I am thankful for the explanation.

> Developing heuristics for this is a bit of a hobby horse of mine. It feels tantalizingly almost doable with just a little bit more resources and time than I have.

I can totally understand the feeling. There are quite a few things that I'd like to go deeper into either at work or in private. But alas time.

> Now this is a proper difficult problem with (probably) fairly subjective answers.

I agree. And I don't have answers ready. A lot boils down to preference. Personally, for example, I prefer written content over video, except in a few areas where I like (some) explanatory videos. To me it comes down to the question of how easily I can skim the content when I am looking for an answer.

On the other hand - for deep immersion into a topic I use multiple media formats.

In terms of web search I sadly nowadays need to sift through a lot of seo-fied content that is there either to build a (personal) brand or to attract clicks for advertising revenue/affiliate revenue.

So in principle I agree with you on the noise problem. Still, I also believe that there are really great gems to be found in the long tail. I still feel like I came late to the party, but when I started out on the web in '97 there were so many lovely, quirky sites. So many places that people had put a lot of time, energy and thought into. And sites so packed full of information that I came away not only with more knowledge, but in awe that somebody would give this knowledge away for free.

There also were quite a number of horrible sites (my first ones probably included). So there was a noise vs. signal problem back then. Maybe not to the same extent as today, though.

> The machine it's on is a Ryzen 3900X with 128 GB RAM. Most of the index is on a single 1 TB consumer grade SSD.

Call me impressed. Sounds absolutely cool.

So even with a raid setup for redundancy this is doable.

May I ask how you decide to add new content? Do you follow links? Do you use other search engines' results as a starting point?

I could probably shoot many more questions, but don't want to be a nuisance.

Thanks for your time already.

> May I ask how you decide to add new content? Do you follow links? Do you use other search engines' results as a starting point?

I initially did basically a DFS-walk originating at a few websites I liked, with some filtering criteria that deprioritized websites that didn't look too interesting. Now that I have a fairly comprehensive mapping of the space I want to index, I use a few factors like frequent outbound links from highly ranking domains to inform which new sites to index.
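
A toy Python sketch of the "outbound links from highly ranking domains" idea: score each not-yet-indexed domain by the summed rank of the domains linking to it. The data structures, the function name and the scoring rule are assumptions for illustration only; the actual factors used are not specified beyond the description above.

```python
from collections import defaultdict


def rank_candidates(outbound_links, domain_rank, known):
    """Order unknown domains by the total rank of the domains linking to them.

    outbound_links: dict mapping a source domain to the domains it links to
    domain_rank:    dict mapping a domain to a rank score (higher is better)
    known:          set of domains that are already indexed
    """
    scores = defaultdict(float)
    for source, targets in outbound_links.items():
        weight = domain_rank.get(source, 0.0)
        for target in set(targets):  # count each source at most once per target
            if target not in known:
                scores[target] += weight
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


links = {
    "a.example": ["new1.example", "new2.example", "b.example"],
    "b.example": ["new2.example", "new2.example"],
    "c.example": ["new3.example"],
}
ranks = {"a.example": 0.9, "b.example": 0.6, "c.example": 0.1}
candidates = rank_candidates(links, ranks, known={"a.example", "b.example", "c.example"})
```

In this made-up data, `new2.example` comes first because two well-ranked domains link to it, while `new3.example` comes last because its only inbound link is from a low-ranked domain.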

> I could probably shoot many more questions, but don't want to be a nuisance.

No worry, I love to talk about this stuff.

You should try looking at people's profiles on HN - just click on the username.

Why? I don't change my reply based on the author. I reply to a statement to the best of my knowledge regardless of the author behind it.

And I learned already a lot in this thread after the explanations unfolded.

The initial statement sounded exactly like the armchair "experts" one so often encounters. Actually, this was the first time in a long while that there was a person with substantial experience in the problem space behind such a statement.

> The initial statement sounded exactly like the armchair "experts" one so often encounters.

Maybe - but [marginalia_nu](https://news.ycombinator.com/user?id=marginalia_nu) isn't an armchair expert - they've actually implemented their own publicly available search engine - which is linked in their profile.

I didn't say they are. Only that the initial comment sounded like that. And in the thread above we discussed a bit about their achievements. I really liked it and learned a lot.

I'm certainly not so full of myself to demand some sort of special treatment on the Internet :P

Out of curiosity, how much disk space does your index currently use, and what's the storage hardware (SSD or spinning rust)?

The reverse index is 180 GB, on an SSD. I do think using SSDs is a major part of why this is possible on consumer hardware. I'd need a lot of spinning rust to get the sub-100ms response times I can get when the index is warmed up.

Should be said I do wear through this SSD at a pretty alarming rate. I'm at 193 TBW on this disk since I started using it as an index less than a year ago.
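
For a sense of scale, the write rate implied by that figure can be worked out in a few lines of Python. The 365-day service time is an upper bound taken from "less than a year", so the real rate is somewhat higher, and the 600 TBW endurance rating is a placeholder assumption, not the rating of the actual drive.

```python
def daily_write_rate_tb(total_written_tb: float, days_in_service: float) -> float:
    """Average terabytes written per day over the drive's service time."""
    return total_written_tb / days_in_service


def days_until_rated_endurance(total_written_tb: float, days_in_service: float,
                               rated_tbw: float) -> float:
    """Days left until the rated TBW is reached if the write rate stays constant."""
    rate = daily_write_rate_tb(total_written_tb, days_in_service)
    return max(0.0, (rated_tbw - total_written_tb) / rate)


rate = daily_write_rate_tb(193, 365)  # at least about half a terabyte per day
remaining = days_until_rated_endurance(193, 365, rated_tbw=600)  # 600 is hypothetical
```

Reaching a rated endurance figure does not mean the drive fails at that point; it is only the level of wear the manufacturer specifies.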

I do have a bunch of mechanical drives I use for archiving and as intermediate working areas as well, but the index itself is on an SSD.

Thanks - I'd be keen to try this at some point, if anything just for personal usage. I've got more than enough hardware CPU & RAM-wise, if all it takes is getting a few TBs worth of solid-state storage it seems like a no-brainer.