by marginalia_nu 1543 days ago

I think I was editing the comment while you were replying. Sorry about that. I was just adding to it though, didn't really rug pull on your response so I think it's fine.

> What qualifies? What defines signal, what noise? I agree, that a lot (probably nearly all) pages will receive very, very little traffic/search requests. But are these therefore not relevant?

Now this is a proper difficult problem with (probably) fairly subjective answers. I do however think it's something that warrants serious investigation. It's probably a decent candidate for a machine learning model combined with some manual tweaking for sites like Wikipedia or GitHub that have absurd amounts of parallel historical content.

Developing heuristics for this is a bit of a hobby horse of mine. It feels tantalizingly almost doable with just a little bit more resources and time than I have.

> That's extremely cool. I would love to know more. To me an impressive feat already.

Yeah it's at <https://search.marginalia.nu/>. I've built all the software myself from scratch in Java[1], and I'm doing my own crawling and indexing. The machine it's on is a Ryzen 3900X with 128 GB RAM. Most of the index is on a single 1 TB consumer grade SSD.

I do use a MariaDB database for some metadata, but I think it will have to go as its hardware demands are becoming a serious bottleneck.

[1] I should say, regarding the index, that this is despite using Java, not because of it. It approaches sunk cost at this point. Building a search engine index is not something Java is at all suitable for; its limited low-level I/O capabilities are incredibly handicapping.

nojs 1543 days ago

> Building a search engine index is not something Java is at all suitable for

Worth pointing out that Lucene/Solr, the biggest open source player, is also Java!

marginalia_nu 1543 days ago

This is some of the nonsense you are dealing with implementing a search index in Java:

* You can only allocate on-heap arrays of 2 billion items.

* On-heap arrays have a massive size overhead in terms of GC book-keeping.

* You can only allocate off-heap memory 2 GB at a time.

* This also goes for memory mapped areas.

* You have no control over the lifecycle of mapped memory and off-heap memory. They get cleared if and when the GC feels like it.

* You have no madvise capabilities

* The language barely acknowledges unsigned types
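A minimal sketch of the usual workaround for the 2 GB mapping limit, not necessarily what Marginalia does: map the file as several smaller `MappedByteBuffer` chunks and address them with a `long` index. The class name `ChunkedMappedFile` and its methods are hypothetical; only `FileChannel.map` and `Byte.toUnsignedInt` are standard library calls. It also shows the kind of helper you end up needing to read a byte as unsigned.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper: presents one large file as a long-indexed byte array. */
final class ChunkedMappedFile {
    private final MappedByteBuffer[] chunks;
    private final int chunkSize;
    private final long size;

    ChunkedMappedFile(Path path, long size, int chunkSize) throws IOException {
        if (size <= 0 || chunkSize <= 0) {
            throw new IllegalArgumentException("size and chunkSize must be positive");
        }
        this.size = size;
        this.chunkSize = chunkSize;
        // Round up so the last, possibly shorter, chunk is included.
        int count = Math.toIntExact((size + chunkSize - 1) / chunkSize);
        this.chunks = new MappedByteBuffer[count];
        try (FileChannel channel = FileChannel.open(path,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            for (int i = 0; i < count; i++) {
                long offset = (long) i * chunkSize;
                long length = Math.min(chunkSize, size - offset);
                // A single map() call is limited to Integer.MAX_VALUE bytes,
                // so each chunk gets its own mapping.
                chunks[i] = channel.map(FileChannel.MapMode.READ_WRITE, offset, length);
            }
        }
        // The mappings stay valid after the channel is closed. There is no
        // public unmap: they are released whenever the GC collects the buffers.
    }

    byte get(long index) {
        checkIndex(index);
        return chunks[(int) (index / chunkSize)].get((int) (index % chunkSize));
    }

    void put(long index, byte value) {
        checkIndex(index);
        chunks[(int) (index / chunkSize)].put((int) (index % chunkSize), value);
    }

    /** Java bytes are signed; widen to int to read the value as 0..255. */
    int getUnsigned(long index) {
        return Byte.toUnsignedInt(get(index));
    }

    long size() {
        return size;
    }

    private void checkIndex(long index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
    }
}
```

In real use the chunk size would be something like `1 << 30`; every access then pays for a division and a modulo, and multi-byte reads that straddle a chunk boundary need extra handling that this sketch leaves out.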

sdoering 1542 days ago

> I [...] didn't really rug pull on your response so I think it's fine.

No you didn't. All good. And I learned a lot from the extended answer. So I am thankful for the explanation.

> Developing heuristics for this is a bit of a hobby horse of mine. It feels tantalizingly almost doable with just a little bit more resources and time than I have.

I can totally understand the feeling. There are quite a few things that I'd like to go deeper into either at work or in private. But alas time.

> Now this is a proper difficult problem with (probably) fairly subjective answers.

I agree. And I don't have answers ready. A lot boils down to preference. Personally, for example, I prefer written content over video, except in a few areas where I like (some) explanatory videos. To me it comes down to the question of how easily I can skim the content when I am looking for an answer.

On the other hand - for deep immersion into a topic I use multiple media formats.

In terms of web search, I sadly nowadays need to sift through a lot of SEO-fied content that is there either to build a (personal) brand or to attract clicks for advertising or affiliate revenue.

So in principle I agree with you on the noise problem. Still, I also believe that there are real gems to be found in the long tail. I still feel like I came late to the party, but when I started out on the web in '97 there were so many lovely, quirky sites. So many places that people had put a lot of time, energy and thought into. And sites so packed full of information that I came away not only with more knowledge, but in awe that somebody would give this knowledge away for free.

There were also quite a number of horrible sites (my first ones probably included). So there was a noise vs. signal problem back then. Maybe not to the extent there is today, though.

> The machine it's on is a Ryzen 3900X with 128 GB RAM. Most of the index is on a single 1 TB consumer grade SSD.

Call me impressed. Sounds absolutely cool.

So even with a RAID setup for redundancy this is doable.

May I ask how you decide to add new content? Do you follow links? Do you use other search engines' results as a starting point?

I could probably shoot many more questions, but don't want to be a nuisance.

Thanks for your time already.

marginalia_nu 1542 days ago

> May I ask how you decide to add new content? Do you follow links? Do you use other search engines' results as a starting point?

I initially did basically a DFS-walk originating at a few websites I liked, with some filtering criteria that deprioritized websites that didn't look too interesting. Now that I have a fairly comprehensive mapping of the space I want to index, I use a few factors like frequent outbound links from highly ranking domains to inform which new sites to index.
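To make the second phase concrete, here is a rough sketch of what "frequent outbound links from highly ranking domains" could look like as a selection rule. It is an illustration under assumptions, not the actual Marginalia code: the `CrawlFrontier` class, its method names, and the idea of passing in a precomputed "highly ranked" flag are all hypothetical.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical frontier: ranks unindexed domains by links from trusted domains. */
final class CrawlFrontier {
    // Candidate domain -> number of links seen from highly ranked domains.
    private final Map<String, Integer> inboundFromRanked = new HashMap<>();
    private final Set<String> indexed = new HashSet<>();

    /** Marks a domain as already indexed so it is never proposed again. */
    void markIndexed(String domain) {
        indexed.add(domain);
        inboundFromRanked.remove(domain);
    }

    /** Records one outbound link; only links from highly ranked domains count. */
    void recordLink(String fromDomain, String toDomain, boolean fromIsHighlyRanked) {
        if (!fromIsHighlyRanked
                || fromDomain.equals(toDomain)
                || indexed.contains(toDomain)) {
            return;
        }
        inboundFromRanked.merge(toDomain, 1, Integer::sum);
    }

    /** Returns up to {@code limit} candidates, most linked first, ties by name. */
    List<String> nextCandidates(int limit) {
        Comparator<Map.Entry<String, Integer>> byCountDescending = (a, b) -> {
            int c = Integer.compare(b.getValue(), a.getValue());
            return c != 0 ? c : a.getKey().compareTo(b.getKey());
        };
        return inboundFromRanked.entrySet().stream()
                .sorted(byCountDescending)
                .limit(limit)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

A real crawler would combine this count with other factors and with the filtering criteria mentioned above; the sketch only shows the link-counting part.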

> I could probably shoot many more questions, but don't want to be a nuisance.

No worries, I love to talk about this stuff.
