by n1xis10t 29 days ago
Oh, the rules are that Paged Out articles are only one page, so a longer article would have to go somewhere else, like 2600.

The collection of about 1.073 million pages of extracted text that I have takes up about 4.8 GiB spread across 15 files (compressed, they’re only 2.1 GiB). So if you just downloaded them until your hard drive filled up you’d have about 107 million pages, and you’d need something like 5 TiB for a billion pages. These are the WET files from CC, which are extracted text only. I know the WARC files are made so that if you know the correct byte offset, you can pull out individual documents without decompressing the whole file, but I’m not sure if the WET ones work the same way. If they do, your pile of text could stay compressed at a bit less than half the size and still be usable with an index.
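Here’s a minimal sketch of the mechanism behind that byte-offset trick, on synthetic data rather than real Common Crawl files. It assumes each record is stored as its own gzip member and that some index gives you an (offset, length) pair per record; `build_archive` and `read_record` are made-up names for illustration. It doesn’t settle whether the WET files can be addressed this way.

```python
import gzip

def build_archive(records):
    """Concatenate one gzip member per record; return (blob, offsets)."""
    blob = bytearray()
    offsets = []  # one (offset, length) pair per record
    for rec in records:
        member = gzip.compress(rec)
        offsets.append((len(blob), len(member)))
        blob += member
    return bytes(blob), offsets

def read_record(blob, offset, length):
    """Decompress only the member stored at blob[offset:offset + length]."""
    return gzip.decompress(blob[offset:offset + length])

blob, offsets = build_archive([b"doc one", b"doc two", b"doc three"])
print(read_record(blob, *offsets[1]))  # only the second record is decompressed
```

With a file on disk or behind HTTP, the slice would become a seek or a range request, so the rest of the file never has to be read.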

I don’t know how much more space an index of the data would take up, but I think it really depends on how complicated it is. If the index is super basic, like “give me a keyword and I’ll give you a list of docs that it appears in”, then I think the index should be smaller than the text collection. Since you use embeddings and stuff, I don’t know how big yours would be.
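To make the “super basic” case concrete, here’s a rough sketch of a keyword-to-docs index. The tokenizer, the sample documents, and the function names are all invented for illustration, and a real engine would add ranking, compression, and on-disk storage.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each keyword to the set of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return the doc ids that contain every keyword in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    postings = [index.get(t, set()) for t in tokens]
    return set.intersection(*postings)

docs = {
    1: "Common Crawl publishes WET files",
    2: "WET files hold extracted text",
    3: "Marginalia keeps a reverse index",
}
index = build_index(docs)
print(sorted(search(index, "wet files")))  # [1, 2]
```

Each keyword is stored once with a list of doc ids instead of once per occurrence, which is part of why an index like this can end up smaller than the text it covers.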

Marginalia search has about 1 billion pages, and when someone asked how big the index is on disk, he said this: “16 TB for the unprocessed crawl data (compressed). 7.7 TB for the files that actually constitute the index (positions data, reverse index)”. I’m guessing that the “unprocessed crawl data” is raw HTML, and that’s why it’s significantly larger than my estimate based on Blekko-era Common Crawl extracted text.

So with an uncompressed pile of extracted text and a Marginalia-style index, one billion pages would be about 13 TB on disk. He says “positions data” though, so I think that means that the locations of keywords in documents are part of the index. Probably the original extracted text and the position data don’t both need to be there (and they’re probably about the same size), so you would just pick between having the original documents and needing to use compute to find the keyword positions for ranking, or having the keyword positions for ranking and needing to use compute to reconstruct the original documents (if needed). So if you pick one instead of having both, the whole thing probably just takes up about 7.7 TB.
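For anyone who wants to check the arithmetic, here’s a quick sketch using only the numbers quoted in this thread. It assumes the bytes-per-page ratio of my 1.073 million page sample holds at a billion pages, which is a guess and not a measurement.

```python
GIB = 2**30   # binary unit, used for the text sample
TB = 10**12   # decimal unit, used in the Marginalia quote

SAMPLE_PAGES = 1_073_000
SAMPLE_TEXT_BYTES = 4.8 * GIB
MARGINALIA_INDEX_BYTES = 7.7 * TB  # quoted for about 1 billion pages

def text_bytes_for(pages):
    """Scale the sample's uncompressed text size linearly to `pages`."""
    return SAMPLE_TEXT_BYTES / SAMPLE_PAGES * pages

pages = 1_000_000_000
text_tb = text_bytes_for(pages) / TB
index_tb = MARGINALIA_INDEX_BYTES / TB

print(f"text only:    {text_tb:.1f} TB")
print(f"index only:   {index_tb:.1f} TB")
print(f"text + index: {text_tb + index_tb:.1f} TB")
```

The text comes out a little under 5 TB, so text plus index lands around 12.5 TB, which is where the “about 13 TB” comes from.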

Oh also, downloading these files from the Common Crawl should go really fast. One file has about 73,000 documents in it, and takes up around 141 MiB (in its compressed form, but that’s the form it’ll be downloaded in).
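A rough sketch of what that means at scale, assuming every file looks like the ones I have. The 100 Mbit/s link speed is a made-up number, so plug in your own.

```python
import math

MIB = 2**20
TIB = 2**40
DOCS_PER_FILE = 73_000
FILE_BYTES = 141 * MIB  # compressed size of one file

def files_needed(pages):
    """Number of files required to reach `pages` documents."""
    return math.ceil(pages / DOCS_PER_FILE)

def download_seconds(n_files, mbit_per_s):
    """Transfer time for n_files at a sustained link speed in Mbit/s."""
    return n_files * FILE_BYTES * 8 / (mbit_per_s * 1_000_000)

n = files_needed(1_000_000_000)
link_mbit = 100  # hypothetical bandwidth
print(f"files needed:   {n}")
print(f"total download: {n * FILE_BYTES / TIB:.2f} TiB")
print(f"per file:       {download_seconds(1, link_mbit):.0f} s")
print(f"everything:     {download_seconds(n, link_mbit) / 3600:.0f} h")
```

This ignores request overhead and any rate limiting on the Common Crawl side, so treat the times as a lower bound.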

These wouldn’t get you recent stuff of course, but they would make the index way bigger, so the quality would go up even though the content would be dated. It would be like resurrecting Blekko. For context, Greg Lindahl said that their largest index was 4 billion pages, but that their crawl frontier was much larger.

Here’s another idea: Download tons of old stuff from the Common Crawl / Blekko, but only keep and index the pages that are inaccessible today. This would make your search engine as competitive [edit: probably complementary is a better word] as possible, because it draws from resources that the other engines don’t have. I’m pretty sure the standard is to prune 404s from search indexes, which seems very silly to me because cached page content can be served, or a link to the Wayback Machine can be given. I suppose there are a couple of partial exceptions, because Kagi, Brave, and either yep.com or Yandex will give some results from the Wayback Machine, but I imagine this is a very small part of what they have.
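A minimal sketch of that filter, assuming “inaccessible” means the URL is unreachable or answers a HEAD request with 404 or 410. That definition and all three function names are mine, and it won’t catch soft 404s like dead pages that redirect to a homepage.

```python
import urllib.error
import urllib.request

def probe_status(url, timeout=10):
    """Return the HTTP status for a HEAD request, or None if unreachable."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError, ValueError):
        return None  # DNS failure, refused connection, timeout, bad URL

def is_inaccessible(status):
    """Assumption: unreachable, 404 and 410 all count as gone."""
    return status is None or status in (404, 410)

def keep_dead_pages(urls, probe=probe_status):
    """Keep only the URLs that can no longer be fetched today."""
    return [u for u in urls if is_inaccessible(probe(u))]
```

The `probe` argument is there so the filter can be tried against canned statuses before pointing it at the live web, for example `keep_dead_pages(urls, probe=known_statuses.get)` with a dict of previously recorded results.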

1 comments

Yeah, I realized after making the comment that Paged Out articles are only one page, but that should still work. I'll probably make a page, and also use the Community Ads to promote as well.

Your storage estimates are a lot lower than what I’m seeing on my setup. I think the main reason is that Slick stores way more than just extracted text and a basic inverted index. Most of my indices contain a huge amount of metadata, structured fields, and semantic search data.

For example, nearly all of my major indices use 384-dimensional BERT embeddings with Lucene/Elasticsearch HNSW vector indexing, which adds a pretty significant amount of overhead. I’m also storing metadata, schema information, image/video fields, social tags, ranking signals, and multiple text representations.

Just my web index alone is already around 55 GB for only 2.4 million documents, and the other major indices combined add another 100+ GB on top of that. The vector data alone is probably going to become enormous at larger scales.
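As a back-of-the-envelope sketch of how that scales: this assumes decimal gigabytes, a linear per-document footprint, and uncompressed float32 vectors, none of which is measured here. The HNSW graph itself adds overhead on top of the raw vectors that this does not count.

```python
GB = 10**9
TB = 10**12

WEB_INDEX_BYTES = 55 * GB
WEB_INDEX_DOCS = 2_400_000
EMBEDDING_DIMS = 384
BYTES_PER_FLOAT = 4  # assumption: float32 vectors, no quantization

bytes_per_doc = WEB_INDEX_BYTES / WEB_INDEX_DOCS
raw_vector_bytes = EMBEDDING_DIMS * BYTES_PER_FLOAT

def projected_tb(per_doc_bytes, docs):
    """Linear extrapolation of a per-document footprint, in TB."""
    return per_doc_bytes * docs / TB

billion = 1_000_000_000
print(f"per document today:     {bytes_per_doc / 1000:.1f} kB")
print(f"raw vector per doc:     {raw_vector_bytes} bytes")
print(f"web index at 1B docs:   {projected_tb(bytes_per_doc, billion):.1f} TB")
print(f"raw vectors at 1B docs: {projected_tb(raw_vector_bytes, billion):.2f} TB")
```

That is roughly 23 kB per document today, of which the raw embedding is only about 1.5 kB, so most of the footprint would be coming from the other fields and index structures.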

So I think the 13 TB estimate for a billion pages is probably realistic for a much leaner BM25-style setup using mostly extracted text and a simpler index, but for my current architecture it’ll probably end up quite a bit higher unless I heavily optimize storage later on.

Common Crawl seems like a good idea, so I may try playing around with it to see how it is. If I can fix the bugs in my crawler and upgrade my setup, though, I should be able to crawl and filter much better.

Gotcha, it’ll be interesting to see how it progresses.
Thanks again for the support!