| HN Mirror


by n1xis10t 30 days ago

You’re Canadian? That’s pretty hilarious, I am too. It must be something they put in the Timbits.

For promotion, I’d recommend picking the most technically interesting part of your implementation, something that’s really clever, and then making a one-page writeup about it for Paged Out magazine (https://pagedout.institute/). They regularly have interesting stuff to read, and they have a pretty decent number of readers. You could also write something longer and send it in to 2600 magazine; they’d probably be interested even if it was just an overview of the project.

Maybe the engine should be bigger first though so people are more enthused when they try it. I think 1 billion pages is around where a search engine starts to seem more normal: that’s about how much Marginalia has. How much space on disk does your index take up right now? Would you say the bottleneck is more the hard drive space or the crawling speed?


nox21125 29 days ago

That's funny, it must be. Paged Out seems really cool, and I'm probably going to write much more than one page if I do. They also have a community ads program which sounds nice. 2600 too, but like you said, I’m probably going to wait until I have a more reasonable index size.

My disk space situation is very bad. Like I said, I’m running on a Beelink EQR5 with only 500GB of storage. If my estimates are right, which they probably aren’t, it would take around 5TB for 50 million documents. So maybe 150TB or so should be enough for a billion.

Crawling speed is also a limiting factor, although not because of network speed. My internet is fine, but my crawler currently only crawls at around 1 million pages a day. It should have crawled around 30 million pages by now since it’s been running for over a month, but when I check the batches it’s produced, it’s probably closer to 2 million max, so there’s clearly a major issue somewhere.

Along with that bug, I also need to make the batch processor much faster. It processes documents and also adds embeddings using BERT, which takes up a significant amount of time. So it doesn’t index 1 million documents a day, maybe only 30k/day, which is obviously something I really need to improve.
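To put those rates side by side, here is a rough back-of-the-envelope sketch. It assumes both rates stay constant and uses only the figures quoted in this comment; `days_needed` is an illustrative helper, not part of the actual crawler.

```python
def days_needed(documents: int, per_day: int) -> float:
    """Days to get through `documents` at a constant daily rate."""
    return documents / per_day

crawl_rate = 1_000_000   # pages/day the crawler is supposed to manage
index_rate = 30_000      # documents/day the batch processor manages

# Time to index the ~2 million pages already sitting in batches.
backlog_days = days_needed(2_000_000, index_rate)

# Time to crawl a billion pages at the nominal crawl rate.
billion_crawl_days = days_needed(1_000_000_000, crawl_rate)

print(f"{backlog_days:.0f} days to index 2M docs")
print(f"{billion_crawl_days:.0f} days to crawl 1B pages")
```

At these rates the batch processor, not the crawler, would be the slower stage by a wide margin.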

If the project keeps growing, I’ll probably eventually move to something much better than my current setup.


n1xis10t 29 days ago

Oh, the rules are that Paged Out articles are only one page, so a longer article would have to go somewhere else, like 2600.

The collection of about 1.073 million pages in extracted text form that I have takes up about 4.8 GiB spread across 15 files (but compressed they’re only 2.1 GiB), so if you were just downloading them until your hard drive filled up you’d have about 107 million pages, and you’d need something like 5TiB for a billion pages. These are the WET files from CC, which are extracted text only. I know the WARC files are made so that if you know the correct offset in bytes, you can take out individual documents without decompressing the whole file, but I’m not sure if the WET ones work the same way. If they do, your pile of text could be a bit less than half the size and still usable with an index.
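The offset trick works for WARC files because each record is written as its own gzip member, so a slice of the file can be decompressed on its own. Below is a minimal local sketch of that idea using in-memory bytes. The helper names `write_records` and `read_record` are made up for illustration, and whether WET files can be sliced the same way is left open here, as above.

```python
import gzip
import io

def write_records(records: list[bytes]) -> tuple[bytes, list[tuple[int, int]]]:
    """Concatenate one gzip member per record; return the blob and (offset, length) pairs."""
    blob = io.BytesIO()
    offsets = []
    for record in records:
        member = gzip.compress(record)
        offsets.append((blob.tell(), len(member)))
        blob.write(member)
    return blob.getvalue(), offsets

def read_record(blob: bytes, offset: int, length: int) -> bytes:
    """Decompress a single record without touching the rest of the file."""
    return gzip.decompress(blob[offset:offset + length])

records = [b"doc one", b"doc two", b"doc three"]
blob, offsets = write_records(records)

# Pull out only the second document using its stored offset and length.
print(read_record(blob, *offsets[1]))
```

With a real archive, the offset and length would come from an external index, and the slice could be fetched with an HTTP range request instead of from a local buffer.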

I don’t know how much more space an index of the data would take up, but I think it really depends on how complicated it is. If the index is super basic, like “give me a keyword and I’ll give you a list of docs that it appears in”, then I think the index should be smaller than the text collection. You use embeddings and stuff, so I don’t know how big it would be.
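For reference, the "super basic" kind of index described above can be sketched in a few lines. This is a toy illustration, assuming whitespace tokenization and in-memory storage; it is not how Slick or Marginalia store their indexes, and `build_index` is a hypothetical name.

```python
from collections import defaultdict

def build_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map each lowercase token to the sorted list of doc ids containing it."""
    postings: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            postings[token].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

docs = {
    1: "open source search engine",
    2: "search the old web",
    3: "common crawl archive",
}
index = build_index(docs)

# "Give me a keyword and I'll give you a list of docs that it appears in."
print(index.get("search", []))
```

Because each distinct word is stored once with a list of document ids, an index like this tends to be smaller than the text it covers. Positions, embeddings, and metadata are what make it grow.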

Marginalia search has about 1 billion pages, and when someone asked how big the index is on disk, he said this: “16 TB for the unprocessed crawl data (compressed). 7.7 TB for the files that actually constitute the index (positions data, reverse index).” I’m guessing that the “unprocessed crawl data” is raw HTML, and that’s why it’s significantly larger than my Blekko-era Common Crawl extracted text based estimate.

So with an uncompressed pile of extracted text and a Marginalia style index, one billion pages would be about 13 TB on disk. He says “positions data” though, so I think that means that the locations of keywords in documents are part of the index. Probably the original extracted text and the position data don’t both need to be there (and they’re probably about the same size), so you would just pick between having the original documents and needing to use compute to find the keyword positions for ranking, or having the keyword positions for ranking and needing to use compute to reconstruct the original documents (if needed). So if you pick one instead of having both, the whole thing probably just takes up about 7.7 TB.
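The arithmetic behind these estimates can be reproduced with a short sketch. It assumes size scales linearly with page count, takes the drive to be the 500 GB one mentioned earlier in the thread, and loosely mixes GiB sample sizes with decimal TB totals; the variable names are just for illustration.

```python
GIB = 1024 ** 3
TB = 10 ** 12

# Sample from the thread: ~1.073 million pages of extracted text.
sample_pages = 1_073_000
sample_raw_bytes = 4.8 * GIB   # uncompressed
sample_gz_bytes = 2.1 * GIB    # compressed

raw_per_page = sample_raw_bytes / sample_pages
gz_per_page = sample_gz_bytes / sample_pages

# Assumed: a 500 GB drive filled with nothing but uncompressed text.
disk_bytes = 500 * 10 ** 9
pages_on_disk = int(disk_bytes // raw_per_page)

billion = 1_000_000_000
raw_text_tb = raw_per_page * billion / TB
gz_text_tb = gz_per_page * billion / TB

# Index size quoted for Marginalia at about a billion pages.
index_tb = 7.7
text_plus_index_tb = raw_text_tb + index_tb

print(f"{raw_per_page:.0f} bytes/page uncompressed")
print(f"{pages_on_disk / 1e6:.0f} million pages fit on the drive")
print(f"{raw_text_tb:.1f} TB raw text, {gz_text_tb:.1f} TB compressed")
print(f"{text_plus_index_tb:.1f} TB text plus index")
```

Under these assumptions the uncompressed text comes out a little under 5 TB per billion pages, and text plus index lands around 12.5 TB, in line with the rough figures above.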

Oh also, downloading these files from the Common Crawl should go really fast. One file has about 73,000 documents in it, and takes up around 141 MiB (in its compressed form, but that’s the form it’ll be downloaded in).

These wouldn’t get you recent stuff, of course, but they would make the index way bigger, so the quality would go up even though the content would be dated. It would be like resurrecting Blekko. For context, Greg Lindahl said that their largest index was 4 billion pages, but that their crawl frontier was much larger.

Here’s another idea: Download tons of old stuff from the Common Crawl / Blekko, but only keep and index the pages that are inaccessible today. This would make your search engine as competitive [edit: probably complementary is a better word] as possible, because it draws from resources that the other engines don’t have. I’m pretty sure the standard is to prune 404s from search indexes, which seems very silly to me because cached page content can be served, or a link to the Wayback Machine can be given. I suppose there are a couple of partial exceptions, because Kagi, Brave, and either yep.com or Yandex will give some results from the Wayback Machine, but I imagine this is a very small part of what they have.
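A minimal sketch of that filter, assuming a page counts as "inaccessible" when a fresh request returns 404 or 410 or gets no response at all. That definition is a judgment call, and `check_status` is a hypothetical callable standing in for whatever HTTP client does the live check.

```python
from typing import Callable, Iterable, Optional

# Assumption: these statuses mean the page is gone for good.
DEAD_STATUSES = {404, 410}

def is_inaccessible(status: Optional[int]) -> bool:
    """None means the request failed outright (DNS error, timeout, ...)."""
    return status is None or status in DEAD_STATUSES

def keep_dead_pages(
    urls: Iterable[str],
    check_status: Callable[[str], Optional[int]],
) -> list[str]:
    """Return only the archived URLs that can no longer be fetched today."""
    return [url for url in urls if is_inaccessible(check_status(url))]

# Stand-in for a live check, so the sketch runs without network access.
fake_statuses = {
    "http://example.com/alive": 200,
    "http://example.com/gone": 404,
    "http://example.com/removed": 410,
}
dead = keep_dead_pages(
    list(fake_statuses) + ["http://vanished.example/"],
    lambda url: fake_statuses.get(url),
)
print(dead)
```

A real version would probably want to retry before declaring a page dead, since a single timeout does not mean the site is gone.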


nox21125 29 days ago

Yeah, I realized after making the comment that Paged Out articles are only one page, but that should still work. I'll probably write a page and also use the Community Ads to promote it.

Your storage estimates are a lot lower than what I’m seeing on my setup. I think the main reason is that Slick stores way more than just extracted text and a basic inverted index. Most of my indices contain a huge amount of metadata, structured fields, and semantic search data.

For example, nearly all of my major indices use 384-dimensional BERT embeddings with Lucene/Elasticsearch HNSW vector indexing, which adds a pretty significant amount of overhead. I’m also storing metadata, schema information, image/video fields, social tags, ranking signals, and multiple text representations.

Just my web index alone is already around 55GB for only 2.4 million documents, and the other major indices combined add another 100+ GB on top of that. The vector data alone is probably going to become enormous at larger scales.
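As a rough sanity check on how this scales, here is a sketch that extrapolates the web index linearly and works out the raw size of the embeddings alone. It assumes one 384-dimensional vector per document stored as 32-bit floats, and it ignores HNSW graph overhead and any compression or quantization; those are assumptions, not details of the actual setup.

```python
TB = 10 ** 12

# Figures quoted above for the web index.
web_index_bytes = 55 * 10 ** 9
web_docs = 2_400_000

per_doc_bytes = web_index_bytes / web_docs
billion_docs_tb = per_doc_bytes * 1_000_000_000 / TB

# Raw embedding storage, assuming float32 and one vector per document.
dims = 384
bytes_per_float = 4
raw_vector_bytes = dims * bytes_per_float
vectors_per_billion_tb = raw_vector_bytes * 1_000_000_000 / TB

print(f"{per_doc_bytes / 1000:.1f} KB per document in the web index")
print(f"{billion_docs_tb:.1f} TB for a billion documents at that rate")
print(f"{raw_vector_bytes} bytes per vector, {vectors_per_billion_tb:.2f} TB per billion")
```

On these assumptions the raw vectors are only a small slice of the per-document footprint, which suggests most of the space goes to the other stored fields and index structures.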

So I think the 13TB estimate for a billion pages is probably realistic for a much leaner BM25-style setup using mostly extracted text and a simpler index, but for my current architecture it’ll probably end up quite a bit higher unless I heavily optimize storage later on.

Common Crawl seems like a good idea, so I may try playing around with it to see how it is. If I can fix the bugs in my crawler though, and upgrade my setup, I should be able to crawl and filter much better.


n1xis10t 29 days ago

Gotcha, it’ll be interesting to see how it progresses.


nox21125 29 days ago

Thanks again for the support!
