by sdoering
1543 days ago
> Get rid of that noise and your hardware goes a lot longer.
What qualifies? What defines signal, what noise? I agree, that a lot (probably nearly all) pages will receive very, very little traffic/search requests. But are these therefore not relevant?
> I'm running a search engine on consumer hardware out of my living room that can index 100 million documents.
That's extremely cool. I would love to know more. To me an impressive feat already.
> What qualifies? What defines signal, what noise? I agree, that a lot (probably nearly all) pages will receive very, very little traffic/search requests. But are these therefore not relevant?
Now this is a proper difficult problem with (probably) fairly subjective answers. I do, however, think it's something that warrants serious investigation. It's probably a decent candidate for a machine learning model combined with some manual tweaking for sites like Wikipedia or GitHub that have absurd amounts of parallel historical content.
Developing heuristics for this is a bit of a hobby horse of mine. It feels tantalizingly almost doable with just a little bit more resources and time than I have.
> That's extremely cool. I would love to know more. To me an impressive feat already.
Yeah, it's at <https://search.marginalia.nu/>. I've built all the software myself from scratch in Java[1], and I'm doing my own crawling and indexing. The machine it's on is a Ryzen 3900X with 128 GB RAM. Most of the index is on a single 1 TB consumer-grade SSD.
I do use a MariaDB database for some metadata, but I think it will have to go, as its hardware demands are becoming a serious bottleneck.
[1] I should say, regarding the index, that I'm using Java despite its drawbacks. This is approaching sunk cost at this point. Building a search engine index is not something Java is at all suitable for; its limited low-level I/O capabilities are incredibly handicapping.