|
That's funny, it must be. Paged Out seems really cool, and if I do write for it I’m probably going to need much more than one page. They also have a community ads program, which sounds nice. 2600 too, but like you said, I’m probably going to wait until I have a more reasonable index size. My disk space situation is very bad. Like I said, I’m running on a Beelink EQR5 with only 500 GB of storage. If my estimates are right, which they probably aren’t, it would take around 5 TB for 50 million documents. So maybe 150 TB or so should be enough for a billion. Crawling speed is also a limiting factor, although not because of network speed. My internet is fine, but my crawler currently only crawls at around 1 million pages a day. It should have crawled around 30 million pages by now since it’s been running for over a month, but when I check the batches it’s produced, it’s probably closer to 2 million max, so there’s clearly a major issue somewhere. Along with that bug, I also need to make the batch processor much faster. It processes documents and also adds embeddings using BERT, which takes up a significant amount of time. So it doesn’t index 1 million a day, maybe only 30k/day, which is obviously something I really need to improve. If the project keeps growing, I’ll probably eventually move to something much better than my current setup. |
The collection of about 1.073 million pages in extracted text form that I have takes up about 4.8 GiB spread across 15 files (but compressed they’re only 2.1 GiB), so if you just kept downloading them uncompressed until your hard drive filled up you’d have about 107 million pages, and you’d need something like 5 TiB for a billion pages. These are the WET files from Common Crawl (CC), which are extracted text only. I know the WARC files are made so that if you know the correct offset in bytes, you can take out individual documents without decompressing the whole file, but I’m not sure if the WET ones work the same way. If they do, you could keep the text compressed, at a bit less than half the size, and still pull individual documents out of it by offset from an index.
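Here is a minimal sketch of why the offset trick works, using only Python's standard `gzip` module and made-up records (not real Common Crawl data). It assumes each record is compressed as its own gzip member and the members are simply concatenated, which is how the WARC files are laid out; whether the WET files follow the same layout is exactly the part I haven't verified. The names `records`, `offsets` and `read_record` are just for illustration.

```python
import gzip

# Synthetic stand-ins for extracted-text records (not real Common Crawl data).
records = [
    b"doc-0: first page of extracted text\n",
    b"doc-1: second page of extracted text\n",
    b"doc-2: third page of extracted text\n",
]

# Compress every record as its own gzip member and concatenate them,
# remembering where each member starts and how long it is.
blob = b""
offsets = []  # one (offset, length) pair per record
for rec in records:
    member = gzip.compress(rec)
    offsets.append((len(blob), len(member)))
    blob += member


def read_record(data: bytes, offset: int, length: int) -> bytes:
    """Decompress a single record without touching the rest of the file."""
    return gzip.decompress(data[offset:offset + length])


# Random access to the middle record only.
off, length = offsets[1]
middle = read_record(blob, off, length)
print(middle)

# The concatenation is still a valid gzip stream when read front to back.
everything = gzip.decompress(blob)
print(everything == b"".join(records))
```

With a real file you would seek to the offset (or send an HTTP range request) instead of slicing an in-memory blob, and the `(offset, length)` pairs would come from whatever index you keep alongside the text.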
I don’t know how much more space an index of the data would take up, but I think it really depends on how complicated it is. If the index is super basic, like “give me a keyword and I’ll give you a list of docs that it appears in”, then I think the index should be smaller than the text collection. Yours also stores embeddings and so on, so I don’t know how big it would be.
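To make the "super basic" case concrete, here is a toy sketch of that kind of index in Python. The documents, the whitespace tokenizer and the function names are all made up for illustration; a real engine would normalize text properly and store the lists in a compact on-disk format.

```python
from collections import defaultdict

# Tiny made-up collection: doc id -> extracted text.
docs = {
    0: "the quick brown fox",
    1: "the lazy dog",
    2: "quick thinking saves the dog",
}


def build_index(documents):
    """Map each keyword to the set of doc ids it appears in."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def lookup(index, keyword):
    """Return the sorted list of doc ids containing the keyword."""
    return sorted(index.get(keyword.lower(), ()))


index = build_index(docs)
print(lookup(index, "quick"))  # [0, 2]
print(lookup(index, "dog"))    # [1, 2]
print(lookup(index, "cat"))    # []
```

Each keyword is stored once, followed by a list of small integers, which is the intuition for why this kind of index can end up smaller than the text it covers.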
Marginalia Search has about 1 billion pages, and when someone asked how big the index is on disk, its developer said this: “16 TB for the unprocessed crawl data (compressed). 7.7 TB for the files that actually constitute the index (positions data, reverse index)”. I’m guessing that the “unprocessed crawl data” is raw HTML, and that’s why it’s significantly larger than my estimate, which is based on Blekko-era Common Crawl extracted text.
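For anyone who wants to redo the arithmetic, this is roughly how I'd extrapolate from my sample to a billion pages and put it next to the Marginalia figures. The only inputs are the numbers quoted in this thread; the variable names are mine, and it assumes the average page size in my 15 files is representative, which it may not be.

```python
GIB = 2**30   # bytes in a gibibyte
TB = 10**12   # bytes in a (decimal) terabyte

# Sample quoted above: ~1.073 million pages in 15 WET files.
sample_pages = 1_073_000
sample_text_bytes = 4.8 * GIB   # uncompressed extracted text
sample_gz_bytes = 2.1 * GIB     # same files, compressed

bytes_per_page = sample_text_bytes / sample_pages
gz_bytes_per_page = sample_gz_bytes / sample_pages

# Extrapolate to a billion pages, in decimal TB to match the Marginalia quote.
target_pages = 1_000_000_000
text_tb = bytes_per_page * target_pages / TB
gz_tb = gz_bytes_per_page * target_pages / TB

# Figures quoted for Marginalia (about 1 billion pages).
marginalia_crawl_tb = 16.0   # unprocessed crawl data, compressed
marginalia_index_tb = 7.7    # positions data + reverse index

print(f"text per page:            {bytes_per_page:,.0f} bytes")
print(f"1B pages, uncompressed:   {text_tb:.1f} TB")
print(f"1B pages, compressed:     {gz_tb:.1f} TB")
print(f"crawl data vs. gz text:   {marginalia_crawl_tb / gz_tb:.1f}x larger")
```

The uncompressed text comes out a little under 5 TB for a billion pages, so the compressed crawl data Marginalia keeps is several times larger than the same number of pages as compressed extracted text, which fits the guess that it is raw HTML.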
So with an uncompressed pile of extracted text and a Marginalia-style index, one billion pages would be about 13 TB on disk. He says “positions data” though, so I think that means that the locations of keywords in documents are part of the index. The original extracted text and the position data probably don’t both need to be there (and they’re probably about the same size), so you would just pick between keeping the original documents and spending compute to find the keyword positions for ranking, or keeping the keyword positions for ranking and spending compute to reconstruct the original documents (if needed). So if you pick one instead of having both, the whole thing probably just takes up about 7.7 TB.
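A toy sketch of that trade-off in Python, just to show the two directions are interchangeable in principle. This is not how Marginalia actually stores positions (I don't know its format); the whitespace tokenizer and all the names here are assumptions for illustration.

```python
from collections import defaultdict


def build_positional_index(documents):
    """Map each word to {doc_id: [positions where it occurs]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for pos, word in enumerate(text.split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def reconstruct(index, doc_id):
    """Option A: keep only positions, rebuild the token sequence on demand."""
    slots = {}
    for word, postings in index.items():
        for pos in postings.get(doc_id, []):
            slots[pos] = word
    return " ".join(slots[i] for i in range(len(slots)))


def positions_by_scanning(text, keyword):
    """Option B: keep only the text, find positions at query time."""
    return [i for i, word in enumerate(text.split()) if word == keyword]


docs = {0: "to be or not to be", 1: "be quick"}
index = build_positional_index(docs)

print(index["be"][0])                        # [1, 5]
print(positions_by_scanning(docs[0], "be"))  # [1, 5]
print(reconstruct(index, 0))                 # to be or not to be
```

The catch is that reconstruction only gives back whatever the tokenizer kept. If the index lowercases, strips punctuation or drops stop words, the rebuilt document is an approximation, so "pick one" only works cleanly if the positions cover every token.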
Oh also, downloading these files from the Common Crawl should go really fast. One file has about 73,000 documents in it, and takes up around 141 MiB (in its compressed form, but that’s the form it’ll be downloaded in).
These wouldn’t get you recent stuff, of course, but they would make the index way bigger, so the quality would go up even though the results would be dated. It would be like resurrecting Blekko. For context, Greg Lindahl said that their largest index was 4 billion pages, but that their crawl frontier was much larger.
Here’s another idea: Download tons of old stuff from the Common Crawl / Blekko, but only keep and index the pages that are inaccessible today. This would make your search engine as competitive [edit: probably complementary is a better word] as possible, because it draws from resources that the other engines don’t have. I’m pretty sure the standard is to prune 404s from search indexes, which seems very silly to me because cached page content can be served, or a link to the Wayback Machine can be given. I suppose there are a couple of partial exceptions, because Kagi, Brave, and either yep.com or Yandex will give some results from the Wayback Machine, but I imagine this is a very small part of what they have.
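A rough sketch of what the "only keep what's gone" filter could look like in Python. Everything here is an assumption on my part: treating 404, 410 and unreachable hosts as "inaccessible" is just one possible definition, the function names are made up, and the demo uses a fake status table with placeholder URLs so it runs without touching the network.

```python
import urllib.error
import urllib.request

# Status codes treated as "the page is gone" (an assumption, tune to taste).
GONE = {404, 410}


def http_status(url, timeout=10.0):
    """Return the HTTP status for url, or None if the host can't be reached."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code   # server answered, but with an error status
    except OSError:
        return None       # DNS failure, refused connection, timeout, ...


def is_gone(status):
    """True if the status means the live page is no longer available."""
    return status is None or status in GONE


def keep_only_dead(archived_pages, status_of=http_status):
    """Keep archived pages whose live URL is no longer accessible."""
    return {
        url: text
        for url, text in archived_pages.items()
        if is_gone(status_of(url))
    }


# Demo with placeholder URLs and a fake status lookup (no network access).
archive = {
    "http://alive.example/": "still online",
    "http://deleted.example/post": "page was removed",
    "http://vanished.invalid/": "domain no longer resolves",
}
fake_statuses = {
    "http://alive.example/": 200,
    "http://deleted.example/post": 404,
}
dead = keep_only_dead(archive, status_of=fake_statuses.get)
print(sorted(dead))
```

In practice a single failed request isn't proof a page is dead, and plenty of dead pages come back as 200 with a parking page or a "not found" message, so you would probably want retries and some content check on top of this.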