| On a slightly related note- I've been thinking about building a home-local "mini-Google" that indexes maybe 1,000 websites. In practice, I rarely need more than a handful of sites for my searches, so it seems like overkill to rely on full-scale search engines for my use case. My rough idea for architecture: - Crawler: A lightweight scraper that visits each site periodically. - Indexer: Convert pages into text and create an inverted index for fast keyword search. Could use something like Whoosh. - Storage: Store raw HTML and text locally, maybe compress older snapshots. - Search Layer: Simple query parser to score results by relevance, maybe using TF-IDF or embeddings. I would do periodic updates and build a small web UI to browse. Anyone tried it or are there similar projects? |
Which was very encouraging to me, because it implies that indexing the Actually Important Web Pages might even be possible for a single person on their laptop.
Wikipedia, for comparison, is only ~20GB compressed. (And most of that isn't even relevant to my interests: the Wikipedia articles related to stuff I'd ever ask about are probably ~200MB tops.)
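To get a feel for how small the "Indexer" and "Search Layer" pieces from the quoted comment could be, here is a minimal sketch of an in-memory inverted index with TF-IDF scoring. It uses only the Python standard library rather than Whoosh, and the names `TinyIndex` and `tokenize` are made up for illustration. It leaves out the crawler, storage, and web UI entirely, and assumes pages have already been converted to plain text.

```python
import math
import re
from collections import Counter, defaultdict

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


class TinyIndex:
    """Hypothetical in-memory inverted index: term -> {doc_id: term count}."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.doc_lengths = {}  # doc_id -> number of tokens in the document

    def add(self, doc_id, text):
        """Index one document that is already plain text."""
        tokens = tokenize(text)
        self.doc_lengths[doc_id] = len(tokens)
        for term, count in Counter(tokens).items():
            self.postings[term][doc_id] = count

    def search(self, query, limit=10):
        """Return up to `limit` (doc_id, score) pairs, best match first."""
        n_docs = len(self.doc_lengths)
        scores = defaultdict(float)
        for term in set(tokenize(query)):
            docs = self.postings.get(term)
            if not docs:
                continue  # term appears in no indexed document
            # Smoothed inverse document frequency: rarer terms weigh more.
            idf = math.log(1 + n_docs / len(docs))
            for doc_id, count in docs.items():
                # Term frequency normalized by document length.
                tf = count / self.doc_lengths[doc_id]
                scores[doc_id] += tf * idf
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:limit]


index = TinyIndex()
index.add("a", "Python search engine tutorial")
index.add("b", "Cooking pasta at home")
index.add("c", "Build a search index in Python with an inverted index")
results = index.search("python search")
print([doc_id for doc_id, _ in results])
```

Documents "a" and "c" both match the query, and "a" ranks first because the query terms make up a larger share of its text. For a thousand sites you would presumably want the postings on disk (which is roughly what a library like Whoosh handles for you), but the core data structure stays the same.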