| I think I was editing the comment while you were replying. Sorry about that. I was just adding to it though, didn't really rug pull on your response, so I think it's fine. > What qualifies? What defines signal, what noise? I agree, that a lot (probably nearly all) pages will receive very, very little traffic/search requests. But are these therefore not relevant? Now this is a properly difficult problem with (probably) fairly subjective answers. I do however think it's something that warrants serious investigation. It's probably a decent candidate for a machine learning model combined with some manual tweaking for sites similar to Wikipedia or GitHub that have absurd amounts of parallel historical content. Developing heuristics for this is a bit of a hobby horse of mine. It feels tantalizingly almost doable with just a little more time and resources than I have. > That's extremely cool. I would love to know more. To me an impressive feat already. Yeah, it's at <https://search.marginalia.nu/>. I've built all the software myself from scratch in Java[1], and I'm doing my own crawling and indexing. The machine it's on is a Ryzen 3900X with 128 GB of RAM. Most of the index is on a single 1 TB consumer-grade SSD. I do use a MariaDB database for some metadata, but I think it will have to go, as its hardware demands are becoming a serious bottleneck. [1] That's despite Java, I should say, as far as the index goes. This is approaching sunk cost at this point. Building a search engine index is not something Java is at all suitable for; its limited low-level I/O capabilities are an incredible handicap. |
Worth pointing out that Lucene/Solr, the biggest open source player, is also Java!