by coffeecoders 261 days ago
On a slightly related note-

I've been thinking about building a home-local "mini-Google" that indexes maybe 1,000 websites. In practice, I rarely need more than a handful of sites for my searches, so it seems like overkill to rely on full-scale search engines for my use case.

My rough idea for architecture:

- Crawler: A lightweight scraper that visits each site periodically.

- Indexer: Convert pages into text and create an inverted index for fast keyword search. Could use something like Whoosh.

- Storage: Store raw HTML and text locally, maybe compress older snapshots.

- Search Layer: Simple query parser to score results by relevance, maybe using TF-IDF or embeddings.
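For anyone wondering what the indexer and search layer boil down to at this scale, here is a minimal sketch in plain Python rather than Whoosh, so no library API is assumed. The names `TinyIndex`, `tokenize`, and the doc ids are made up for illustration, and the scoring is a simple TF-IDF variant, not a tuned ranking function.

```python
import math
import re
from collections import Counter, defaultdict

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


class TinyIndex:
    """Hypothetical in-memory inverted index with TF-IDF scoring."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: term count}
        self.doc_lengths = {}              # doc_id -> number of tokens

    def add(self, doc_id, text):
        tokens = tokenize(text)
        self.doc_lengths[doc_id] = len(tokens)
        for term, count in Counter(tokens).items():
            self.postings[term][doc_id] = count

    def search(self, query, limit=10):
        n_docs = len(self.doc_lengths)
        scores = defaultdict(float)
        for term in set(tokenize(query)):
            docs = self.postings.get(term, {})
            if not docs:
                continue
            # Rarer terms get a higher weight (smoothed IDF).
            idf = math.log(1 + n_docs / len(docs))
            for doc_id, count in docs.items():
                # Term frequency normalized by document length.
                tf = count / self.doc_lengths[doc_id]
                scores[doc_id] += tf * idf
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:limit]


index = TinyIndex()
index.add("a", "rust borrow checker explained")
index.add("b", "python generators explained")
print(index.search("rust explained"))
```

For 1,000 sites this would probably need to live on disk rather than in a dict, which is where something like Whoosh or SQLite full-text search would come in.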

I would do periodic updates and build a small web UI to browse.

Anyone tried it or are there similar projects?

11 comments

Have you ever looked at Common Crawl dumps? I did a bit of data mining and holy cow is 99.99% of the web crap. Spam, porn, ads, flame wars, random blogs by angsty teens... I understand it has historical and cultural value — and maybe literary value, in a Douglas Coupland kind of way — but for my purposes, there was very little here that I considered of interest.

Which was very encouraging to me, because it implies that indexing the Actually Important Web Pages might even be possible for a single person on their laptop.

Wikipedia, for comparison, is only ~20GB compressed. (And even most of that is not relevant to my interests, e.g. the Wikipedia articles related to stuff I'd ever ask about are probably ~200MB tops.)

YaCy (https://yacy.net) can do all this, I think. Cloudflare might block your IP pretty soon though if you try to crawl.
Have you ever tried https://marginalia-search.com ? I love it.
Drew DeVault tried building something similar to this under the name SearchHut, but the project was abandoned [1]. I tried hacking on it a while ago (since it's built on Postgres and a bit of Go), but I ran out of steam trying to understand the Postgres RUM extension.

[1]: https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

Perhaps not quite solving your problem, but I have a handful of domain-specific Google CSE (Custom Search Engine) that limit the results to predefined websites. I summon them from Alfred with short keywords when I'm doing interest-specific searches. https://blog.gingerbeardman.com/2021/04/20/interest-specific...
Yep. Built a crawler, an indexer/query processor, and an engine responsible for merging/compacting indexes.

Crawling was tricky. Something like stackoverflow will stop returning pages when it detects that you're crawling, much sooner than you'd expect.
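A rough sketch of the usual mitigation: honor robots.txt and enforce a per-host delay. It uses the standard library's `urllib.robotparser`. `PolitenessGate`, the `homelab-crawler` user agent, and the 5-second default are assumptions for illustration. This is no guarantee against being blocked, since sites can apply their own detection regardless.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class PolitenessGate:
    """Hypothetical helper: tracks robots.txt rules and per-host timing."""

    def __init__(self, user_agent="homelab-crawler", default_delay=5.0):
        self.user_agent = user_agent
        self.default_delay = default_delay
        self.robots = {}        # host -> RobotFileParser
        self.next_allowed = {}  # host -> earliest time for the next request

    def load_robots(self, host, robots_txt):
        """Parse robots.txt text that the crawler has already downloaded."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.robots[host] = parser

    def allowed(self, url):
        parser = self.robots.get(urlparse(url).netloc)
        return parser is None or parser.can_fetch(self.user_agent, url)

    def delay_for(self, host):
        """Use the larger of our default and the site's Crawl-delay."""
        parser = self.robots.get(host)
        declared = parser.crawl_delay(self.user_agent) if parser else None
        return max(self.default_delay, float(declared or 0))

    def seconds_to_wait(self, url, now=None):
        host = urlparse(url).netloc
        now = time.monotonic() if now is None else now
        return max(0.0, self.next_allowed.get(host, 0.0) - now)

    def record_fetch(self, url, now=None):
        host = urlparse(url).netloc
        now = time.monotonic() if now is None else now
        self.next_allowed[host] = now + self.delay_for(host)


gate = PolitenessGate()
gate.load_robots("example.com", "User-agent: *\nDisallow: /private/\nCrawl-delay: 10\n")
print(gate.allowed("https://example.com/private/x"))
print(gate.delay_for("example.com"))
```

The crawler loop would call `allowed()` before queueing a URL, sleep for `seconds_to_wait()`, fetch, then call `record_fetch()`.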

I think an exhaustive, searchable index of just what I've browsed would be useful a lot of the time, though I suppose a refresh feature would be useful too.
You could take a look at the leaked Yandex source code from a few years ago. I'd expect their architecture to be decent enough.
Where?
I'm not sure if linking to those files is allowed by HN, and it could potentially expose me to lawsuits.

However, searching for "Yandex git sources magnet link" might help.

Reminds me of building an Obsidian vault with all the content in Markdown form. There are also plugins that show vault results when doing a Google search, so notes within your vault show up before external websites.
Kind of. I made ainews247.org that crawls certain sites and filters content so it's AI specific and valuable. I think it's a really good idea.
With LLMs why do you even need a mini-Google?
For my LLM to use! I want sources, excerpts, cross-referencing...