HN Mirror

by coffeecoders 308 days ago

On a slightly related note:

I've been thinking about building a home-local "mini-Google" that indexes maybe 1,000 websites. In practice, I rarely need more than a handful of sites for my searches, so it seems like overkill to rely on full-scale search engines for my use case.

My rough idea for architecture:

- Crawler: A lightweight scraper that visits each site periodically.

- Indexer: Convert pages into text and create an inverted index for fast keyword search. Could use something like Whoosh.

- Storage: Store raw HTML and text locally, maybe compress older snapshots.

- Search Layer: Simple query parser to score results by relevance, maybe using TF-IDF or embeddings.
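
The Indexer and Search Layer items above could be sketched roughly like this. It is a stdlib-only toy rather than Whoosh, and every name in it (`TinyIndex`, `tokenize`, the doc ids) is made up for illustration. It assumes pages have already been converted to plain text, and it uses one simple TF-IDF variant among many.

```python
import math
import re
from collections import Counter, defaultdict

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


class TinyIndex:
    """Hypothetical in-memory inverted index with TF-IDF ranking."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: term count}
        self.doc_len = {}                  # doc_id -> number of tokens

    def add(self, doc_id, text):
        tokens = tokenize(text)
        self.doc_len[doc_id] = len(tokens)
        for term, count in Counter(tokens).items():
            self.postings[term][doc_id] = count

    def search(self, query, limit=10):
        n_docs = len(self.doc_len)
        scores = defaultdict(float)
        for term in set(tokenize(query)):
            docs = self.postings.get(term)
            if not docs:
                continue  # term appears in no document
            # Rarer terms get a higher weight (smoothed IDF, an assumption).
            idf = math.log(1 + n_docs / len(docs))
            for doc_id, count in docs.items():
                tf = count / self.doc_len[doc_id]  # length-normalized count
                scores[doc_id] += tf * idf
        return sorted(scores, key=scores.get, reverse=True)[:limit]


idx = TinyIndex()
idx.add("a", "rust borrow checker explained")
idx.add("c", "rust rust rust async")
print(idx.search("rust"))
```

For around 1,000 sites this probably fits in memory, but persisting the postings to disk and updating them incrementally would be the next thing to work out.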

I would do periodic updates and build a small web UI to browse.

Has anyone tried this, or are there similar projects?

11 comments

andai 308 days ago

Have you ever looked at Common Crawl dumps? I did a bit of data mining and holy cow is 99.99% of the web crap. Spam, porn, ads, flame wars, random blogs by angsty teens... I understand it has historical and cultural value — and maybe literary value, in a Douglas Coupland kind of way — but for my purposes, there was very little here that I considered of interest.

Which was very encouraging to me, because it implies that indexing the Actually Important Web Pages might even be possible for a single person on their laptop.

Wikipedia, for comparison, is only ~20GB compressed. (And even most of that is not relevant to my interests, e.g. the Wikipedia articles related to stuff I'd ever ask about are probably ~200MB tops.)

harias 308 days ago

YaCy (https://yacy.net) can do all this, I think. Cloudflare might block your IP pretty soon if you try to crawl, though.

fabiensanglard 308 days ago

Have you ever tried https://marginalia-search.com ? I love it.

UltimateEdge 307 days ago

Drew DeVault tried building something similar to this under the name SearchHut, but the project was abandoned [1]. I tried hacking on it a while ago (since it's built on Postgres and a bit of Go), but I ran out of steam trying to understand the Postgres RUM extension.

[1]: https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

msephton 307 days ago

Perhaps not quite solving your problem, but I have a handful of domain-specific Google CSE (Custom Search Engine) that limit the results to predefined websites. I summon them from Alfred with short keywords when I'm doing interest-specific searches. https://blog.gingerbeardman.com/2021/04/20/interest-specific...

mrkeen 307 days ago

Yep. Built a crawler, an indexer/query processor, and an engine responsible for merging/compacting indexes.

Crawling was tricky. Something like Stack Overflow will stop returning pages when it detects that you're crawling, much sooner than you'd expect.
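
One common way to reduce (not eliminate) that kind of blocking is to space out requests per host. A minimal sketch, assuming a fixed delay between hits to the same host. The `HostThrottle` name and the 5-second default are invented for illustration and are not anything a particular site documents.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Hypothetical per-host rate limiter for a small crawler."""

    def __init__(self, min_delay=5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay  # seconds between requests to one host (assumed value)
        self.clock = clock          # injectable so the logic can be tested without waiting
        self.sleep = sleep
        self.next_allowed = {}      # host -> earliest time of the next request

    def wait(self, url):
        """Block until the URL's host may be hit again; return the seconds slept."""
        host = urlsplit(url).netloc
        now = self.clock()
        ready_at = self.next_allowed.get(host, now)
        delay = max(0.0, ready_at - now)
        if delay:
            self.sleep(delay)
        # Schedule the next slot for this host after the current request.
        self.next_allowed[host] = max(now, ready_at) + self.min_delay
        return delay
```

Calling `throttle.wait(url)` before each fetch keeps different hosts independent, so a crawl over many sites still makes progress. Checking `robots.txt` first (the stdlib has `urllib.robotparser`) would be a sensible companion step.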

_flux 307 days ago

I think a lot of the time an exhaustive, searchable index of just what I've browsed would be useful, though I suppose a refresh feature would help too.

matsz 308 days ago

You could take a look at the leaked Yandex source code from a few years ago. I'd expect their architecture to be decent enough.

Where?

I'm not sure if linking to those files is allowed by HN, and it could potentially expose me to lawsuits.

However, searching for "Yandex git sources magnet link" might help.

bryanhogan 307 days ago

Reminds me of building an Obsidian vault with all the content in markdown form. There are also plugins to show vault results when doing a Google search, making notes within your vault show up before external websites.

computerex 307 days ago

Kind of. I made ainews247.org, which crawls certain sites and filters content so it's AI-specific and valuable. I think it's a really good idea.

toephu2 308 days ago

With LLMs why do you even need a mini-Google?

andai 308 days ago

For my LLM to use! I want sources, excerpts, cross-referencing...
