by nox21125 29 days ago
Yeah, I realized after making the comment that Paged Out articles are only one page, but that should still work. I'll probably make a page and also promote it through the Community Ads.

Your storage estimates are a lot lower than what I’m seeing on my setup. I think the main reason is that Slick stores way more than just extracted text and a basic inverted index. Most of my indices contain a huge amount of metadata, structured fields, and semantic search data.

For example, nearly all of my major indices use 384-dimensional BERT embeddings with Lucene/Elasticsearch HNSW vector indexing, which adds a pretty significant amount of overhead. I’m also storing metadata, schema information, image/video fields, social tags, ranking signals, and multiple text representations.
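
To put a rough number on the vector part, here's a back-of-the-envelope sketch. It assumes one float32 (4-byte) vector per document with no quantization, and it ignores the HNSW graph links entirely, so treat it as a lower bound rather than what Elasticsearch actually writes to disk. The function name is just for illustration.

```python
# Rough lower bound on raw embedding storage.
# Assumptions (not measured): one vector per document, float32 values,
# no quantization, HNSW graph overhead not included.

def raw_vector_bytes(num_docs: int, dims: int = 384, bytes_per_value: int = 4) -> int:
    """Bytes needed to store the raw vectors alone."""
    return num_docs * dims * bytes_per_value

per_doc = raw_vector_bytes(1)                  # one 384-dim float32 vector
web_index = raw_vector_bytes(2_400_000)        # the 2.4 million document web index
one_billion = raw_vector_bytes(1_000_000_000)  # hypothetical billion-page index

print(f"per document:   {per_doc} bytes")
print(f"2.4M documents: {web_index / 10**9:.2f} GB")
print(f"1B documents:   {one_billion / 10**12:.3f} TB")
```

So under those assumptions the raw vectors are about 1.5 KB per document, and more if a document carries several embeddings.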

My web index alone is already around 55GB for only 2.4 million documents, and the other major indices combined add another 100+ GB on top of that. The vector data alone is probably going to become enormous at larger scales.

So I think the 13TB estimate for a billion pages is probably realistic for a much leaner BM25-style setup using mostly extracted text and a simpler index, but for my current architecture it’ll probably end up quite a bit higher unless I heavily optimize storage later on.
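
Here's the quick arithmetic behind that comparison, as a sketch. It assumes decimal units (1 GB = 10^9 bytes) and that per-document size stays constant as the index grows, which is a simplification; real indices won't scale perfectly linearly. The helper names are made up for the example.

```python
# Naive linear extrapolation of index size from the figures in this thread.
# Assumptions: decimal units, constant bytes-per-document at any scale.

GB = 10**9
TB = 10**12

def bytes_per_doc(index_bytes: float, num_docs: int) -> float:
    """Average on-disk bytes per indexed document."""
    return index_bytes / num_docs

def projected_bytes(index_bytes: float, num_docs: int, target_docs: int) -> float:
    """Scale the current per-document cost up to target_docs."""
    return bytes_per_doc(index_bytes, num_docs) * target_docs

# Web index: about 55 GB for 2.4 million documents.
current_per_doc = bytes_per_doc(55 * GB, 2_400_000)
projected = projected_bytes(55 * GB, 2_400_000, 1_000_000_000)

# The 13 TB estimate for a billion pages, expressed per page.
lean_per_doc = bytes_per_doc(13 * TB, 1_000_000_000)

print(f"current:   {current_per_doc / 1000:.1f} KB per document")
print(f"lean:      {lean_per_doc / 1000:.1f} KB per document")
print(f"projected: {projected / TB:.1f} TB for a billion documents")
```

That works out to roughly 23 KB per document on my web index versus 13 KB per page in the leaner estimate, before counting any of the other indices.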

CommonCrawl seems like a good idea, so I may try it out and see how well it works. If I can fix the bugs in my crawler and upgrade my setup, though, I should be able to crawl and filter much more effectively.

1 comment

Gotcha, it’ll be interesting to see how it progresses.
Thanks again for the support!