by majewsky 3772 days ago
> as search deteriorates

Can you expand on why search is deteriorating? Honest question. I certainly don't see the relevance of search sinking, nor can I see any competitor in the market that could even come close to threatening Google's monopoly on search.


It is my position that search can never be decoupled from the browser, and when I say "search" in the statement you are referring to, I mean Google, as it is peerless for English-language search.

Search is in fact massively expanding as tooling and machine learning capabilities increase due to research and hardware. Similarly, Google, Apache and Elastic have many open source libraries for search, indexing, storage, caching and serving which allow for scalable architecture. Also, outside of things like crawlers, Hadoop, Solr, etc., Microsoft and Google have open sourced their JS parsing engines, and Node, Electron, Brave Browser and Node Web-Kit are built on technology that leverages this.

So, as someone who is not an information architect or data scientist, it seems like we have an ecosystem where a scaled down version of Google can be built and trained on a per-user basis and kept completely private.

I have hashed out the solution in more detail elsewhere, but on to the actual question: is search deteriorating?

* My results seem to be worse and I have much less control than before. Anecdotally, it seems as if quotes and boolean ops are respected less.

* Discovery is a huge issue that Google solved well; now we have the opposite conditions but the same problem. Back then there were very few sites and it was hard to know what content was on them. Now there is too much content.

* Without fine-grained control over my search I can't make a distinction between Information vs. Links. That is, needing a date or a well-accepted piece of content/documentation vs. finding some new apps or non-facts. DuckDuckGo is quite good for some things and Google is good for others. Sometimes you may want to eliminate all WordPress sites (many content mills are built on it) or remove Alexa links from your queries if you need to discover something.

* Time is handled badly. E.g. I have a problem with a JavaScript function and get back results from 3 years ago. Surfacing those is amazing and difficult to do, so commendable, but I need newer info because things change quickly. The same goes for news.

* Need to eliminate sites and content I don't want. NOT something like a content filter for porn or whatever, something like:

     never return results from %news-websites older than 30 days.

     never return content posted %before nov-2014

     remove links from [%Alexa-1000, %Wordpress, #TLD(.co,.co.uk)] for reputation ranking


     decrease links from [%Alexa-1000, %Wordpress, #TLD(.co,.co.uk)] by [80%] for reputation rankings
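
To make the rules above a bit more concrete, here is a minimal Python sketch of how they might be applied to a list of results on the user's side. Everything in it is an assumption for illustration: the `Result` fields, the `NEWS_SITES` and `DOWNRANKED` sets (stand-ins for `%news-websites` and the `[%Alexa-1000, %Wordpress, ...]` group) and the `apply_rules` function are hypothetical, not anything Google or another engine exposes. It implements the "decrease by 80%" variant rather than outright removal.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Result:
    url: str
    domain: str
    published: date
    score: float  # relevance score handed to us by some upstream ranker


# Hypothetical stand-ins for the %news-websites and down-ranked groups.
NEWS_SITES = {"news.example.com"}
DOWNRANKED = {"blog.example.co.uk"}


def apply_rules(results, today, cutoff=date(2014, 11, 1), penalty=0.8):
    """Drop or down-rank results according to the user's own rules."""
    kept = []
    for r in results:
        # never return content posted before the cutoff date
        if r.published < cutoff:
            continue
        # never return results from news sites older than 30 days
        if r.domain in NEWS_SITES and today - r.published > timedelta(days=30):
            continue
        # decrease reputation of listed domains by `penalty` (80% by default)
        score = r.score * (1 - penalty) if r.domain in DOWNRANKED else r.score
        kept.append(Result(r.url, r.domain, r.published, score))
    # highest remaining score first
    return sorted(kept, key=lambda r: r.score, reverse=True)
```

The point of the sketch is only that these rules are cheap to evaluate once each result carries a date and a domain; the hard part is getting trustworthy dates and site classifications in the first place.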


There are other things but so far my point has been:

* Google provides no versatile results.

* Many pieces of well-tested software would make it easy (for the right group of software engineers) to silo crawl data and parse it with a user's own parameters.

There is a way to set up this ecosystem that I have been thinking about, but to conclude:

Google is fucking awesome and really really good at what they do. Search experience is getting worse in terms of control but tooling is leagues better. Google sees this and is working on loftier goals internally (I imagine), thus it has split up into a meta-company that will work as an accelerator for growth while capitalizing on some verticals, like the Real Estate thing they are doing or the Delivery service they just announced, to stay profitable in the short term before they can achieve their end goal. Also, advertisement is an unsustainable paradigm for internet growth for many reasons.

Notes:

The DOM is super fucking horrible.

The Parsing engine is a great fix for a fucking horrid DOM.

DNS security is fucking horrible.

The next Google will be a browser & an optimization marketplace.

I don't think compiling to WebAssembly makes sense, but I could be totally wrong. I think something like Docker would provide a sandbox that would let people get performance and versatility and sidestep the whole DOM, JS-only, Apps vs. Content problem. No idea how this works on mobile though.

Wow, awesome response. Will need to let that sink in.

> So, as someone who is not an information architect or data scientist, it seems like we have an ecosystem where a scaled down version of Google can be built and trained on a per-user basis and kept completely private.

I'm skeptical. Search is a huge problem just because of the bizarre amount of resources you need to throw at it. I can't afford to build my own datacenter(s) to host my custom search system. There might be huge advances ahead in terms of storage capacity on commodity systems, I don't know, but in any case, I'm only one person crawling webpages versus millions of people (and bots!) creating them.

You implicitly address that a bit later by talking about "silo crawling", but again, I'm skeptical. The only silo structure that I can easily see is large sites with useful content like Wikipedia or StackOverflow/StackExchange, but I'm likely to come across these anyway in any given domain, and I can easily filter for these on Google today, e.g. "site:en.wikipedia.org". The more interesting and hard part is the long tail of small, sparsely interconnected websites which might contain unusual insights but which you are unlikely to come across with a silo crawler (or with Google's current UI, for that matter).

> Search experience is getting worse in terms of control

I guess that's the classical problem of scaling a product to a large audience of mostly technically illiterate users. Maybe Google is learning from Apple, whose UIs have for a long time favored ease of use over giving control to the user.

> I'm skeptical. Search is a huge problem just because of the bizarre amount of resources you need to throw at it. I can't afford to build my own datacenter(s) to host my custom search system. There might be huge advances ahead in terms of storage capacity on commodity systems, I don't know, but in any case, I'm only one person crawling webpages versus millions of people (and bots!) creating them.

I have been thinking about this and have come up with some ideas. Other people would obviously contribute more ideas and a solution could be reached. Some of my thinking:

A service that behaves like AWS/Git/DNS/Google combined.

* A user runs the service and indexes the data it receives, and there is a central repository of information that a user can contribute to or not. Initially, a new user would either buy a crawler or a cache of data from the market and store it locally or on a private server bought as part of the service. The blockchain, or some other verification mechanism, would be used to provide access to the initial seed data, and a hash would verify the contents. The user now has a running cache of data s/he can connect to with a private DNS-like verification system. User runs search.

* The parameters do not return the results s/he wanted from their private store. Similar to DNS, they move up the food chain to the service provider (whoever creates this system, or one of the companies/orgs providing the service) to get more data. Here there is a centralized repo of information. This can be a market or platform. People can buy and sell data, filtering mechanisms and crawlers. People can also include all of their searches, or some of their search results, in the master crawl. This would be the "datacenter", but it can also be a platform that maps to many people's individual caches:

> So if I indexed and codified everything about the Beatles I could sell this to the market by running my own server.

> I could sell a crawler that is really really good at finding all musicians and music to the market.

> I could sell a filtering/parsing engine plugin for music guy's crawl results (or all results it is fed) that only delivers high quality FLAC audio files and converts high-enough quality MP3s to FLAC, all this but only for tracks with a Saxophone.

However, fucking music guy's crawl stack doesn't have the shit I want in it.

I can buy (or write) a master crawler that goes out onto the internet and finds what I am looking for then delivers it to my private cache, and if I am generous, codifies it in a generally accepted meta language and inserts it into master.

Obviously there is much more here but what I am talking about is distributed and optimized search.
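
For the "a hash would verify the contents" step of buying a seed cache, a minimal sketch in Python could look like the following. It assumes the seller publishes a SHA-256 digest somewhere the buyer trusts (a blockchain entry, a signed listing, whatever); the function names and the idea of a single digest per cache are my assumptions, not part of any existing system.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the raw cache bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_seed(data: bytes, published_digest: str) -> bool:
    """True if the downloaded seed cache matches the digest the seller published."""
    return sha256_hex(data) == published_digest.strip().lower()
```

This only proves the bytes are the ones the seller listed. It says nothing about whether the crawl data inside is any good, which is what the market and reputation side would have to handle.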

Notes: Google has a nearly impossible job:

* It does not allow a user to provide any filtering outside of some boolean operators and human language.

* Therefore it never knows exactly what the user wants.

* It provides a general service, so to some extent it is one size fits all.

* It is difficult to do machine learning because it can watch you make selections but may never be able to tell what the deliverable was or whether you were successful.

* You cannot back out of or modify the algorithm it uses to find results. Necessarily, even when it knows what you want it is biased, because it both shows you the results and defines the algorithm. Also, per a stat I am baselessly making up right now, only 2.1% of users ever go to the 3rd page, which means that if Google is wrong, it can't know, and the problem compounds as users see the same bad pages and keep clicking them.

> I guess that's the classical problem of scaling a product to a large audience of mostly technically illiterate users.

Yes. I am not saying Google is doing a bad job. They have a nearly impossible task if they only use a search bar with natural language and 0 filtering to deliver trillions of terabytes of data to millions of people. I am not sure how much easier it would be, but certainly n times easier, if filters worked.

Also, I think basically the idea of HTML is shit EXCEPT for the meta language. Even if it is not some simple JSON, the actual results need fucking tags, not just the content; then we could filter down further and better.

Group annotation.

File sharing.

Bitcoin payment for content/filtering/cooperation

Running arbitrary code in a sandboxed environment like docker, not a "DOM"

Also, the silo concept is like DNS, in case I didn't explain it super well. You have a cache on your computer, a cache in the cloud, access to a master cache of information (both receivables and lookups of other silos) and an optimization market for searching through data, or finding more of it if necessary.
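
The DNS-like lookup described above can be sketched roughly like this in Python. It is a toy under stated assumptions: the `Tier` class, its in-memory dict store and exact-match queries are all hypothetical simplifications, and a real system would have indexes, crawlers and a market behind each tier.

```python
class Tier:
    """One level of the cache hierarchy: local machine, cloud, or master."""

    def __init__(self, name, store=None, parent=None):
        self.name = name
        self.store = dict(store or {})  # query -> list of result URLs
        self.parent = parent            # next tier up the food chain, if any

    def lookup(self, query):
        """Return (results, name of the tier that answered), or (None, None)."""
        if query in self.store:
            return self.store[query], self.name
        if self.parent is None:
            return None, None
        results, source = self.parent.lookup(query)
        if results is not None:
            # cache the answer on the way back down, like a DNS resolver
            self.store[query] = results
        return results, source


# Hypothetical three-level setup: local cache -> cloud cache -> master cache.
master = Tier("master", {"beatles flac": ["https://example.org/abbey-road"]})
cloud = Tier("cloud", parent=master)
local = Tier("local", parent=cloud)
```

The first `local.lookup("beatles flac")` is answered by the master tier; after that the same query is served from the local cache. A miss at every tier is where buying or writing a new crawler would come in.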

It is 100% obvious to me that search will end up this way. Brave Software seems to sort of get this. I am hoping they realize a browser can't be decoupled from search, though, because you can't just fork Electron and put some plugins in it. They are super talented. I am hopeful. One of the systems similar to what I am suggesting is called memex-explorer. However, I have never used it, as the build is currently failing. It was originally funded by DARPA and NASA JPL, then one day all work stopped on it, and I have emailed and tweeted some of the people and orgs with no response. So, going only by my research, its description seems somewhat in line with my thinking.

The large problem of scaling is handled by the market. Search is essentially an API to call APIs that call an RSS feed if you think about what your browser and google are actually doing. Knowing what those APIs do is pretty fucking important.

I try to share this info with people but they all think I am fucking insane. Does this sound that far-fetched? Honest question.