|
Very cool, I subscribed to the newsletter. I’ve experimented with retrieval and ranking across a sample of a million pages from the early days of the Common Crawl (around 2014) and I was surprised by how many of them seemed high quality. The CTO of CC tells me it’s because most of the early URLs were donated by Blekko, which was an old search engine that he used to work for. I don’t know what the quality of recent CC stuff is like, but I think it would be fun to supplement an index with this older data, especially because you’d get a lot of pages that are 404s now (but you could deliver the extracted text to the user, or link to a temporally nearby snapshot from the Wayback Machine).
Another fun thing to consider is making a metasearch engine that functions like MetaCrawler used to, where it gets all (or a bunch of) the available results from all the source engines, then actually fetches and extracts the text from the linked pages, and then matches the query and ranks the pages independently of how the source engines ranked them. If you’d like to do that, I would recommend adapting the source code of 4get.ca (at least for the scrapers), because the guy who writes it is rather talented at coming up with and maintaining workarounds.
If you monetize this, I’d be interested in working for you. I know Python, HTML, CSS, am familiar with JavaScript, and have a lot of experimental (and successful!) experience with ranking web results.
Also, you might be interested in reading this article (from 2600 magazine) about disappearing search engines: https://archive.org/details/search-timeline In addition to the things in that article, there was a search engine for Discord (“Searchcord”) that went away less than a week after it was announced here (on HN), and there is this recent blog post which lists search engines with independent indexes, a painfully large number of which went away with no announcement: https://seirdy.one/posts/2021/03/10/search-engines-with-own-...
The author of the 2600 article doesn’t really get into theories about why search engines disappear, but it certainly seems like a lot of them do. I’m curious to know if they disappear for random different reasons, or if it’s just really difficult to make and maintain a search project, or if there’s some other common reason. If you suddenly feel disinclined to work on this project, could you let me know why (maybe anonymously with a new email account or something)? Thanks. |
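For anyone curious what the "fetch the pages yourself and rerank" idea above could look like, here is a minimal sketch. It assumes the fetching and text extraction have already happened, so `page_text` is just a hypothetical dict of URL to extracted text. BM25 is my stand-in for the scoring step; the comment doesn't say which ranking function was used, and the function names are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def merge_results(results_by_engine):
    """Union the URLs from every source engine, keeping first-seen order."""
    seen = {}
    for urls in results_by_engine.values():
        for url in urls:
            seen.setdefault(url, None)
    return list(seen)


def bm25_rank(query, page_text, k1=1.5, b=0.75):
    """Rank URLs by BM25 over their extracted text.

    The order in which the source engines returned the URLs is ignored;
    only the page text and the query matter.
    """
    docs = {url: tokenize(text) for url, text in page_text.items()}
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / max(n_docs, 1)

    # Document frequency: in how many pages does each term appear?
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))

    query_terms = set(tokenize(query))
    scores = {}
    for url, tokens in docs.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term_freq[term] == 0:
                continue
            idf = math.log(
                1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5)
            )
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores[url] = score

    # Highest score first; ties keep the merged order.
    return sorted(scores, key=scores.get, reverse=True)
```

In use, you would call `merge_results` on whatever the scrapers return, fetch and extract each URL, and hand the resulting dict to `bm25_rank`. The point of the design is that a page's position in any source engine never enters the score.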
The idea of supplementing the index with older Common Crawl/Blekko-era data is definitely interesting, especially for preserving pages that are gone now. The metasearch + independent reranking concept is appealing too, but one of the main goals with Slick is staying completely independent long term.
I know that comes with much slower growth and a lot more work, but I think it's better than building on top of another search engine for 5 years and then suddenly having that engine massively change direction. I actually only recently learned that Google is planning to heavily rework Search around AI as well, which honestly reinforced my decision to keep Slick independent instead of relying heavily on another engine (https://san.com/cc/googles-shift-to-ai-powered-search-result...).
Right now I'm mainly focused on improving Slick's own crawl/index quality instead of relying too heavily on external sources.
I've taken a look at 4get.ca, which is apparently Canadian (so am I), and it's really good. Again, though, I'm not leaning too heavily into metasearch unless maintaining a fully independent index becomes unrealistic. I've already written over 15 thousand lines of code for this engine, over a year of coding.
I've never noticed search engines disappearing, probably because they're disappearing. I should probably read up on that. Most likely it's because the people behind them can't afford to keep running the project, whether mentally or financially. I've experienced this too. I've been actively promoting the search engine to get new supporters, so far to no avail.
I don't think I'll feel disinclined to work on the project any time soon, but if I ever do, I'll be sure to tell you. You are my first supporter after all.
I'm not looking for employees right now, but I appreciate the offer. I've been able to do this much on my own, and it's just uphill from here: fixing the ranking bugs I mentioned in my blog, getting more supporters so I have an incentive to get infrastructure, improving my crawler, etc.
I really appreciate the support.