HN Mirror


by n1xis10t 32 days ago

Very cool, I subscribed to the newsletter. I’ve experimented with retrieval and ranking across a sample of a million pages from the early days of the Common Crawl (around 2014), and I was surprised by how many of them seemed high quality. The CTO of CC tells me it’s because most of the early URLs were donated by Blekko, an old search engine that he used to work for. I don’t know what the quality of recent CC stuff is like, but I think it would be fun to supplement an index with this older data, especially because you’d get a lot of pages that are 404s now (but you could deliver the extracted text to the user, or link to a temporally nearby snapshot from the Wayback Machine).

Another fun thing to consider is making a metasearch engine that functions like MetaCrawler used to: it gets all (or a bunch of) the available results from all the source engines, then actually fetches and extracts the text from the linked pages, and then matches the query and ranks the pages independently of what the source engines did. If you’d like to do that, I would recommend adapting the source code of 4get.ca (at least for the scrapers), because the guy who writes it is rather talented at coming up with and maintaining workarounds.
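To make the "rank independently of the source engines" step concrete, here is a minimal sketch. It assumes the pages have already been fetched and reduced to plain text, and it uses BM25 as a stand-in scoring function (nothing here says that is what MetaCrawler or 4get.ca do). The names `tokenize` and `bm25_rank` are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query, pages, k1=1.5, b=0.75):
    """Rank pages (dict of url -> extracted text) against a query with BM25.

    The order in which the source engines returned the URLs is ignored.
    Returns a list of (url, score) pairs, best first.
    """
    docs = {url: tokenize(text) for url, text in pages.items()}
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs or 1.0

    # Document frequency: in how many pages does each term appear?
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    query_terms = set(tokenize(query))
    scored = []
    for url, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scored.append((url, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical merged results from several source engines.
pages = {
    "a": "common crawl search engine index",
    "b": "recipes for cake",
    "c": "search engine",
}
ranking = bm25_rank("search engine", pages)
print([url for url, _ in ranking])
```

Fetching and text extraction are left out on purpose; the only point is that the final order comes from the page text and not from the upstream engines.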

If you monetize this, I’d be interested in working for you. I know Python, HTML, CSS, am familiar with JavaScript, and have a lot of experimental (and successful!) experience with ranking web results.

Also, you might be interested in reading this article (from 2600 magazine) about disappearing search engines: https://archive.org/details/search-timeline

In addition to the things in that article, there was a search engine for Discord (“Searchcord”) that went away less than a week after it was announced here (on HN), and there is this recent blog post which lists search engines with independent indexes, a painfully large number of which went away with no announcement: https://seirdy.one/posts/2021/03/10/search-engines-with-own-...

The author of the 2600 article doesn’t really get into theories about why search engines disappear, but it certainly seems like a lot of them do. I’m curious to know if they disappear for random different reasons, or if it’s just really difficult to make and maintain a search project, or if there’s some other common reason. If you suddenly feel disinclined to work on this project, could you let me know why (maybe anonymously with a new email account or something)? Thanks.

1 comments

nox21125 32 days ago

Thanks! I really appreciate the detailed feedback and suggestions.

The idea of supplementing the index with older Common Crawl/Blekko-era data is definitely interesting, especially for preserving pages that are gone now. The metasearch + independent reranking concept is interesting too, but one of the main goals with Slick is staying completely independent long term.

I know that comes with much slower growth and a lot more work, but I think it's better than building on top of another search engine for 5 years and then suddenly having that engine massively change direction. I actually only recently learned that Google is planning to heavily rework Search around AI as well, which honestly reinforced my decision to keep Slick independent instead of relying heavily on another engine (https://san.com/cc/googles-shift-to-ai-powered-search-result...).

Right now I'm mainly focused on improving Slick's own crawl/index quality instead of relying too heavily on external sources.

I've taken a look at 4get.ca, which is apparently Canadian (I am too), and it's really good. Again, though, I'm not leaning too heavily into metasearch unless maintaining a fully independent index becomes unrealistic. I've already written over 15 thousand lines of code for this engine, in over a year of coding.

I've never noticed the "search engines disappearing", probably because they're disappearing. I should probably read up on that. Most likely it's because they can't afford to run the project anymore, whether it's mentally or financially. I've experienced this too. I'm actively trying to promote the search engine to get new supporters, so far to no avail.

I don't think I'll feel disinclined to work on the project any time soon, but if I ever do, I'll be sure to tell you. You are my first supporter after all.

I'm not looking for employees right now, but I appreciate the offer. I've been able to do this much on my own, and it's just uphill from here: fixing the ranking bugs I mentioned in my blog, getting more supporters so I have an incentive to get infrastructure, improving my crawler, etc.

I really appreciate the support.


n1xis10t 32 days ago

You’re Canadian? That’s pretty hilarious, I am too. It must be something they put in the Timbits.

For promotion, I’d recommend picking the most technically interesting part of your implementation, something that’s really clever, and then making a one-page writeup for Paged Out magazine about it (https://pagedout.institute/). They regularly have interesting stuff to read, and they have a pretty decent number of readers. You could write something longer and send it in to 2600 magazine too; they’d probably be interested even if it was an overview of the project.

Maybe the engine should be bigger first though so people are more enthused when they try it. I think 1 billion pages is around where a search engine starts to seem more normal: that’s about how much Marginalia has. How much space on disk does your index take up right now? Would you say the bottleneck is more the hard drive space or the crawling speed?


nox21125 32 days ago

That's funny, it must be. Paged Out seems really cool, and I'm probably going to write much more than one page if I do. They also have a community ads program which sounds nice. 2600 too, but like you said, I’m probably going to wait until I have a more reasonable index size.

My disk space situation is very bad. Like I said, I’m running on a Beelink EQR5 with only 500GB of storage. If my estimates are right, which they probably aren’t, it would take around 5TB for 50 million documents. So maybe 150TB or so should be enough for a billion.
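As a quick sanity check on that scaling, here is a back-of-the-envelope sketch. It assumes storage grows linearly with document count, which may not hold for a real index, and `scale_storage_tb` is just an illustrative helper name.

```python
def scale_storage_tb(known_tb, known_docs, target_docs):
    """Linearly extrapolate storage from a known (size, doc count) pair."""
    return known_tb * target_docs / known_docs


# Figures from the estimate above: about 5 TB for 50 million documents.
per_doc_kb = 5e12 / 50_000_000 / 1e3
billion_tb = scale_storage_tb(5, 50_000_000, 1_000_000_000)

print(f"{per_doc_kb:.0f} KB per document")
print(f"{billion_tb:.0f} TB for a billion documents")
```

Straight linear scaling lands at about 100 TB, so a 150 TB budget would leave roughly 50% headroom on top of that.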

Crawling speed is also a limiting factor, although not because of network speed. My internet is fine, but my crawler currently only crawls at around 1 million pages a day. It should have crawled around 30 million pages by now since it’s been running for over a month, but when I check the batches it’s produced, it’s probably closer to 2 million max, so there’s clearly a major issue somewhere.

Along with that bug, I also need to make the batch processor much faster. It processes documents and also adds embeddings using BERT, which takes up a significant amount of time. So it doesn’t index 1 million a day, maybe only 30k/day, which is obviously something I really need to improve.
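To put those two rates side by side, here is a rough sketch of how long each stage would take to reach a given index size. It assumes the rates stay constant and that the stages don't overlap; `days_to_process` is a hypothetical helper.

```python
def days_to_process(total_docs, docs_per_day):
    """Days needed to get through total_docs at a constant daily rate."""
    return total_docs / docs_per_day


# Rates quoted above: ~1 million pages/day crawled, ~30k pages/day indexed.
for label, rate in [("crawl", 1_000_000), ("index", 30_000)]:
    for target in (50_000_000, 1_000_000_000):
        days = days_to_process(target, rate)
        print(f"{label} @ {rate:,}/day: {target:,} docs in {days:,.0f} days")
```

At 30k/day, even 50 million documents works out to about 1,667 days (roughly four and a half years), so the batch processor looks like the tighter bottleneck of the two.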

If the project keeps growing, I’ll probably eventually move to something much better than my current setup.


n1xis10t 32 days ago

Oh, the rules are that Paged Out articles are only one page, so a longer article would have to go somewhere else like 2600.

The collection of about 1.073 million pages in extracted text form that I have takes up about 4.8 GiB spread across 15 files (but compressed they’re only 2.1 GiB), so if you were just downloading them until your hard drive filled up you’d have about 107 million pages, and you’d need something like 5TiB for a billion pages. These are the WET files from CC, which are extracted text only. I know the WARC files are made so that if you know the correct offset in bytes, you can take out individual documents without decompressing the whole file, but I’m not sure if the WET ones work the same way. If they do, your pile of text could be a bit less than half the size and still usable with an index.
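The offset trick works when every record is compressed as its own gzip member, so a byte slice is a complete gzip stream by itself. Here is a toy sketch of that idea using an in-memory archive instead of a real WARC file; `build_archive` and `read_record` are made-up names, and whether WET files are laid out the same way is still the open question from above.

```python
import gzip
import io


def build_archive(records):
    """Concatenate records as independent gzip members.

    Returns the archive bytes plus an index of (offset, length) pairs,
    one per record, like the offsets a separate index file would hold.
    """
    blob = io.BytesIO()
    index = []
    for record in records:
        member = gzip.compress(record)
        index.append((blob.tell(), len(member)))
        blob.write(member)
    return blob.getvalue(), index


def read_record(archive, offset, length):
    """Decompress a single record without touching the rest of the archive."""
    return gzip.decompress(archive[offset:offset + length])


archive, index = build_archive([b"doc one", b"doc two", b"doc three"])
offset, length = index[1]
print(read_record(archive, offset, length))
```

With a file on disk the slice would become a `seek` plus a `read` of `length` bytes, so only the one record is read and decompressed.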

I don’t know how much more space an index of the data would take up, but I think it really depends on how complicated it is. If the index is super basic, like “give me a keyword and I’ll give you a list of docs that it appears in”, then I think the index should be smaller than the text collection. You use embeddings and stuff, so I don’t know how big it would be.
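For reference, the "super basic" kind of index described there can be sketched in a few lines. This is only an illustration of the keyword-to-documents mapping, with no positions, ranking, or compression; `build_inverted_index` is a made-up name.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each keyword to the sorted list of doc ids it appears in.

    docs is a dict of doc_id -> extracted text.
    """
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in set(re.findall(r"[a-z0-9]+", text.lower())):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


docs = {
    1: "Blekko donated early URLs",
    2: "Common Crawl early snapshots",
    3: "Marginalia index",
}
index = build_inverted_index(docs)
print(index.get("early", []))  # [1, 2]
```

Each document contributes one posting per distinct word, which is part of why an index like this can plausibly come out smaller than the text it covers.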

Marginalia search has about 1 billion pages, and when someone asked how big the index is on disk, he said this: “16 TB for the unprocessed crawl data (compressed). 7.7 TB for the files that actually constitute the index (positions data, reverse index)” I’m guessing that the “unprocessed crawl data” is raw html, and that’s why it’s significantly larger than my Blekko-era Common Crawl extracted text based estimate.

So with an uncompressed pile of extracted text and a Marginalia style index, one billion pages would be about 13 TB on disk. He says “positions data” though, so I think that means that the locations of keywords in documents are part of the index. Probably the original extracted text and the position data don’t both need to be there (and they’re probably about the same size), so you would just pick between having the original documents and needing to use compute to find the keyword positions for ranking, or having the keyword positions for ranking and needing to use compute to reconstruct the original documents (if needed). So if you pick one instead of having both, the whole thing probably just takes up about 7.7 TB.
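Here is the arithmetic behind that figure as a rough sketch. It assumes the per-page size from my 1.073 million page sample holds at a billion pages, and that Marginalia's 7.7 TB index size would carry over to a different corpus, neither of which is guaranteed.

```python
GIB = 2**30
TB = 10**12

# Sample of extracted text: about 1.073 million pages in 4.8 GiB uncompressed.
sample_pages = 1_073_000
sample_bytes = 4.8 * GIB

bytes_per_page = sample_bytes / sample_pages
text_tb = bytes_per_page * 1_000_000_000 / TB

# Index size quoted for Marginalia at roughly a billion pages.
index_tb = 7.7
total_tb = text_tb + index_tb

print(f"{bytes_per_page:.0f} bytes per page")
print(f"{text_tb:.1f} TB of extracted text for a billion pages")
print(f"{total_tb:.1f} TB for text plus index")
```

That comes out to roughly 4.8 TB of text plus 7.7 TB of index, about 12.5 TB in total, which is where the "about 13 TB" comes from.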

Oh also, downloading these files from the Common Crawl should go really fast. One file has about 73,000 documents in it, and takes up around 141 MiB (in its compressed form, but that’s the form it’ll be downloaded in).

These wouldn’t get you recent stuff of course, but they would make the index size way bigger, and so the quality would go up but it would be dated. It would be like resurrecting Blekko. For context, Greg Lindahl said that their largest index was 4 billion pages, but that their crawl frontier was much larger.

Here’s another idea: Download tons of old stuff from the Common Crawl / Blekko, but only keep and index the pages that are inaccessible today. This would make your search engine as competitive [edit: probably complementary is a better word] as possible, because it draws from resources that the other engines don’t have. I’m pretty sure the standard is to prune 404s from search indexes, which seems very silly to me because cached page content can be served, or a link to the Wayback Machine can be given. I suppose there are a couple of partial exceptions, because Kagi, Brave, and either yep.com or Yandex will give some results from the Wayback Machine, but I imagine this is a very small part of what they have.
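A minimal sketch of that filter, under some assumptions: "inaccessible" is taken to mean a 404 or 410 response or an unreachable host, a single HEAD request is trusted, and the names `http_status` and `keep_inaccessible` are made up. A real crawler would also want retries, rate limiting, and robots.txt handling.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status of a HEAD request, or None if unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except OSError:
        # DNS failure, refused connection, timeout, and similar.
        return None


def keep_inaccessible(urls, get_status=http_status, dead_statuses=(404, 410)):
    """Keep only the URLs that look dead today."""
    kept = []
    for url in urls:
        status = get_status(url)
        if status is None or status in dead_statuses:
            kept.append(url)
    return kept


# Offline usage with a stand-in status lookup (no network needed):
# fake = {"http://a.example/": 200, "http://b.example/": 404}
# keep_inaccessible(fake, get_status=fake.get)
```

Passing the status lookup in as a parameter keeps the filter testable offline, and leaves room to swap in something smarter, like treating parked domains as dead.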


nox21125 31 days ago

Yeah, I realized after making the comment that Paged Out articles are only one page, but that should still work. I'll probably make a page, and also use the Community Ads to promote as well.

Your storage estimates are a lot lower than what I’m seeing on my setup. I think the main reason is that Slick stores way more than just extracted text and a basic inverted index. Most of my indices contain a huge amount of metadata, structured fields, and semantic search data.

For example, nearly all of my major indices use 384-dimensional BERT embeddings with Lucene/Elasticsearch HNSW vector indexing, which adds a pretty significant amount of overhead. I’m also storing metadata, schema information, image/video fields, social tags, ranking signals, and multiple text representations.

Just my web index alone is already around 55GB for only 2.4 million documents, and the other major indices combined add another 100+ GB on top of that. The vector data alone is probably going to become enormous at larger scales.
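For a rough sense of scale, here is a sketch of both numbers extrapolated to a billion documents. The assumptions are mine for illustration: one 384-dimensional float32 embedding per document, no quantization, HNSW graph overhead not counted, and linear growth of the web index. The helper names are hypothetical.

```python
def raw_vector_bytes(dims, num_docs, bytes_per_value=4):
    """Bytes for the raw embedding values alone (float32 by default)."""
    return dims * bytes_per_value * num_docs


def extrapolate_bytes(known_bytes, known_docs, target_docs):
    """Linearly scale an observed index size to a larger document count."""
    return known_bytes * target_docs / known_docs


BILLION = 1_000_000_000

vectors_tb = raw_vector_bytes(384, BILLION) / 1e12
web_index_tb = extrapolate_bytes(55e9, 2_400_000, BILLION) / 1e12

print(f"raw vectors: {vectors_tb:.2f} TB")
print(f"web index at current density: {web_index_tb:.1f} TB")
```

Under those assumptions the raw vectors are about 1.5 TB per billion documents, while the web index at its current density of roughly 23 KB per document would be around 23 TB, so most of the footprint would be coming from everything stored alongside the vectors.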

So I think the 13TB estimate for a billion pages is probably realistic for a much leaner BM25-style setup using mostly extracted text and a simpler index, but for my current architecture it’ll probably end up quite a bit higher unless I heavily optimize storage later on.

Common Crawl seems like a good idea, so I may try playing around with it to see how it is. If I can fix the bugs in my crawler though, and upgrade my setup, I should be able to start crawling and filtering much better.


n1xis10t 31 days ago

Gotcha, it’ll be interesting to see how it progresses.
