| HN Mirror


by vonklaus 3772 days ago

> I'm skeptical. Search is a huge problem just because of the bizarre amount of resources you need to throw at it. I can't afford to build my own datacenter(s) to host my custom search system. There might be huge advances ahead in terms of storage capacity on commodity systems, I don't know, but in any case, I'm only one person crawling webpages versus millions of people (and bots!) creating them.

I have been thinking about this and have come up with some ideas. Other people would obviously contribute more, and a solution could be reached. Some of my thinking:

A service that behaves like AWS/Git/DNS/Google combined.

* A user runs the service and indexes the data it receives, and there is a central repository of information that a user can contribute to or not. Initially, a new user would either buy a crawler or a cache of data from the market and store it locally or on a private server bought as part of the service. A blockchain, or some other verification mechanism, would be used to provide access to the initial seed data, and a hash would verify the contents. The user now has a running cache of data s/he can connect to with a private DNS-like verification system. The user runs a search.

* The parameters do not return the results s/he wanted from their private store. Similar to DNS, they move up the food chain to the service provider (whoever creates this system, or one of the companies/orgs providing the service) to get more data. Here there is a centralized repo of information. This can be a market or platform. People can buy and sell data, filtering mechanisms and crawlers. People can also include all of their searches, or some of their search results, in the master crawl. This would be the "datacenter", but it can also be a platform that maps to many people's individual caches:

> So if I indexed and codified everything about the Beatles I could sell this to the market by running my own server.

> I could sell a crawler that is really really good at finding all musicians and music to the market.

> I could sell a filtering/parsing engine plugin for music guy's crawl results (or all results it is fed) that only delivers high quality FLAC audio files and converts high-enough quality MP3s to FLAC, all this but only for tracks with a saxophone.

However, fucking music guy's crawl stack doesn't have the shit I want in it.

I can buy (or write) a master crawler that goes out onto the internet and finds what I am looking for, then delivers it to my private cache and, if I am generous, codifies it in a generally accepted meta language and inserts it into master.

Obviously there is much more here but what I am talking about is distributed and optimized search.
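The seed-data check mentioned above (a hash verifies the contents of a cache bought from the market) could be sketched roughly like this. Everything here is an assumption for illustration: the function names, the choice of SHA-256, and the idea that the seller or a ledger publishes the digest somewhere the buyer trusts.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a blob as a hex string."""
    return hashlib.sha256(data).hexdigest()


def verify_seed(data: bytes, published_digest: str) -> bool:
    """Check a downloaded seed cache against the digest the seller published.

    Hypothetical helper: where the published digest comes from (a ledger,
    a signed listing, etc.) is outside the scope of this sketch.
    """
    return sha256_hex(data) == published_digest.lower()


# A made-up seed cache; a real one would be a large index file.
seed = b'{"topic": "beatles", "pages": ["example page 1", "example page 2"]}'
published = sha256_hex(seed)  # stands in for the digest listed on the market

print(verify_seed(seed, published))                 # True
print(verify_seed(seed + b"tampered", published))   # False
```

This only shows that the bytes match what was advertised; it says nothing about whether the data is any good, which is what the market side would have to sort out.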

Notes: Google has a nearly impossible job:

* It does not allow a user to provide any filtering outside of some boolean operators and human language.

* Therefore it never knows exactly what the user wants.

* Provides a general service so to some extent it is one size fits all.

* Difficult to do machine learning because it can watch you make selections but may not ever be able to tell what the deliverable was or if you were successful.

* You cannot back out of or modify the algorithm it uses to find results. Necessarily, even when it knows what you want, it is biased, because it both shows you the results and defines the algorithm. Also, in a fact I am baselessly making up right now, only 2.1% of users ever go to the 3rd page, which means that if Google is wrong, it can't know, and the problem compounds as users see the same bad pages and keep clicking them.

> I guess that's the classical problem of scaling a product to a large audience of mostly technically illiterate users.

Yes. I am not saying Google is doing a bad job. They have a nearly impossible task if they only use a search bar with natural language and zero filtering to deliver trillions of terabytes of data to millions of people. I am not sure how much easier it would be, but certainly n times easier, if filters worked.

Also, I think basically the idea of HTML is shit EXCEPT for the meta language. If not some simple JSON, the actual results need fucking tags, not just the content; then we could filter down further and better.
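To make the "results need tags" point concrete, here is a minimal sketch of filtering tagged results delivered as simple JSON. The schema (a `tags` object per result), the tag names, and the example URLs are all invented for illustration; no existing search API returns this.

```python
import json

# Hypothetical tagged results; the schema is an assumption, not a real format.
RESULTS_JSON = """
[
  {"url": "https://example.com/a",
   "tags": {"type": "audio", "format": "flac", "instrument": "saxophone"}},
  {"url": "https://example.com/b",
   "tags": {"type": "audio", "format": "mp3", "instrument": "guitar"}},
  {"url": "https://example.com/c",
   "tags": {"type": "article", "topic": "beatles"}}
]
"""


def filter_by_tags(results, **wanted):
    """Keep only results whose tags match every wanted key/value pair."""
    return [
        r for r in results
        if all(r.get("tags", {}).get(key) == value for key, value in wanted.items())
    ]


results = json.loads(RESULTS_JSON)
sax_flac = filter_by_tags(results, type="audio", format="flac", instrument="saxophone")
print([r["url"] for r in sax_flac])  # ['https://example.com/a']
```

With tags on the results themselves, the filter is an exact match on structured fields instead of a guess from natural language.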

Group annotation.

File sharing.

Bitcoin payment for content/filtering/cooperation

Running arbitrary code in a sandboxed environment like docker, not a "DOM"

Also, the silo concept is like DNS, if I didn't explain it super well. You have a cache on your computer, a cache in the cloud, access to a master cache of information (both receivables and lookups of other silos) and an optimization market for searching through data, or finding more of it if necessary.
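A rough sketch of that DNS-style lookup chain, assuming three tiers (local, cloud, master) and a hypothetical `Silo` class. Real silos would hold indexes rather than dictionaries, and the market step for finding new data is left out.

```python
class Silo:
    """One cache tier; misses are passed up to the parent tier (hypothetical design)."""

    def __init__(self, name, data=None, parent=None):
        self.name = name
        self.data = dict(data or {})
        self.parent = parent

    def lookup(self, query):
        """Return (result, name of the tier that answered), or (None, None) on a miss."""
        if query in self.data:
            return self.data[query], self.name
        if self.parent is None:
            return None, None
        result, source = self.parent.lookup(query)
        if result is not None:
            # Cache the answer on the way back down, as a DNS resolver would.
            self.data[query] = result
        return result, source


master = Silo("master", {"beatles discography": "master-result",
                         "saxophone flac": "master-sax"})
cloud = Silo("cloud", {"beatles discography": "cloud-result"}, parent=master)
local = Silo("local", parent=cloud)

print(local.lookup("beatles discography"))  # answered by the cloud tier
print(local.lookup("beatles discography"))  # now answered from the local cache
print(local.lookup("saxophone flac"))       # falls through to master
print(local.lookup("unknown"))              # (None, None)
```

A miss at the top is where the crawler market would come in: nobody has the data yet, so someone has to go out and fetch it.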

It is 100% obvious search will end up this way. Brave Software seems to sort of get this. I am hoping they realize a browser can't be decoupled from search, though, because you can't just fork Electron and put some plugins in it. They are super talented. I am hopeful. One of the systems similar to what I am suggesting is called memex-explorer. However, I have never used it, as the build is currently failing. It was originally funded by DARPA and NASA JPL, then one day all work stopped on it, and I have emailed and tweeted some of the people and orgs with no response. So, going only by my research, the description seems somewhat in line with my thinking.

The large problem of scaling is handled by the market. Search is essentially an API to call APIs that call an RSS feed, if you think about what your browser and Google are actually doing. Knowing what those APIs do is pretty fucking important.

I try to share this info with people but they all think I am fucking insane. Does this sound that far-fetched? Honest question.