by jeffbee 2145 days ago
The way that people underestimate the scale and complexity of Google's indexing always makes me chuckle. I'm especially fond of the idea that other parties will just wantonly copy it, like for research purposes or something.

Also I'm sorry to inform you that the way the index shards are produced, arranged in memory/on disk, and the manner in which they are queried and ranked are inextricably linked. You can't make a search-neutral index format that can be queried economically.


Are you arguing that it's somehow impossible or technically unreasonable for a startup search engine to piggyback off google's search index in a similar way to how duckduckgo piggybacks off bing's search index?

And that the very idea that this could be made possible through law makes you chuckle?

I can't answer for OP but I can say it makes me chuckle too.

If the solution was purely to force Google to sell access to their index then yes it seems possible on the surface.

But as mentioned index and ranking are inextricably tied together.

Even if they weren't, no other organization is going to be able to produce search results comparable to Google using their index. You're underestimating what goes on under the hood.

So then the answer (often in these conversations) becomes to open up the ranking algos too.

The problems with that are numerous so I'll just point out some of the bigger ones:

- Arms race: Search is a constant arms race between providers and 3rd parties trying to game the system. The minute you make the algos public, gamers win the race. Search result quality returns to the way it was in the 90's and stays that way until someone else comes up with proprietary algos that work (but is that even legal at this point in our thought experiment?)

- Motivation: If search is open and you therefore can't directly profit from your efforts to improve it (because you automatically give away anything you create to competitors) where is your motivation to keep innovating?

- It's harder than you think: Truly, there's so much more going on in modern search indexing and ranking than you likely realize. The chance that some new organization (especially a gov organization) given access to Google's black box as it exists right now would be able to maintain search result quality for any significant length of time is essentially zero.

But let's imagine that it's as easy as many people think... Wouldn't the solution then be to build a public alternative rather than effectively killing what we have now?

>But as mentioned index and ranking are inextricably tied together.

I'm pretty sure they're quite extricably tied together. I'm almost certain google's engine weights the different ranking variables (e.g. page speed) differently depending upon context. Why not expose those variables to other search engines? Well, it would kill google's search engine dominance - if you're concerned with that...

Unbundling isn't technically infeasible and it would create more competition. This would help with the arms race alluded to. What if another search engine used google's index to build a more spam free index? not good for google but great for Joe public.

>Motivation: If search is open and you therefore can't directly profit from your efforts to improve it (because you automatically give away anything you create to competitors) where is your motivation to keep innovating?

Nothing saying that they can't make money from the users of their index just as Bing makes money from DDG. However, is there any reason their search engine shouldn't compete with other similar offerings? Maybe somebody out there does it better.

Well, other than "we've got an unfair advantage and we'd like to keep it please"

>It's harder than you think

Actually it's probably a LOT easier. This idea is a direct attack on google's power and the easiest response they can make is "too difficult. not possible". Not "this would fuck us in the bottom line". Simply "we can't do it, who are YOU to tell us it's possible?"

fwiw if you look through history similar reactions were made to attempts to regulate pretty much all utilities. Then it happened. This kind of response is kind of an expected part of the process. Most recently it happened in the UK when utilities and banks were told to open up API access to their data. Same claim you made.

Part two is when they tell you "it's not fair!". It's coming.

Bing's index and algos are not available to DDG, there's no comparison there. DDG uses Bing's results, they can't see how they're produced. Incidentally, Google offers a similar API.

> Actually it's probably a LOT easier

Can you support that claim?

Just the scale alone is mind boggling when it comes to search.

Then throw in natural language processing, contextual signals, hubs and authorities, content categorization (which grows ever closer to looking like actual understanding), machine learning, a host of other basic and ever-evolving quality signals that exist both independently of and interdependently with one another, the more complex signals that arise from the above, and on and on.

Search is hard. Even the most casual of Googling (or maybe Binging would be apt in this case) will provide you with endless info about how hard it is.

"Search is hard" deliberately misses the point. This isn't about whether search is hard it's about whether decoupling search engine from search index is hard - whether the APIs used by one could be used by others.

This trick you're echoing is, incidentally, used every single time a government comes looking at unbundling opportunities. I remember distinctly how Microsoft claimed it was "too hard" to decouple Office from Windows. The banks in the UK made the same claim. They were all equally ridiculous and all equally self serving. There is a lot of precedent here.

Google will pretend that it is "impossibly hard" to expose their internal APIs as well, just as every other company did. It would be surprising if they didn't.

Search is hard, yes. A lot harder than exposing APIs, isn't it?

Bing doesn't make its search index public. It has a query api, where you can provide a search query and get Bing's results on a per-query basis.

The closest comparison to what this person wants is Common Crawl, which is minuscule compared to the big players, and is a 100TB gzipped download that gets updated monthly.

Bing provides API access to its search index. IIRC the type of queries you can do are a bit more sophisticated than just "what's available via the Bing search engine" (e.g. select a market).

That's not index access. You are getting ranked results. Index access would give you the posting list for the term "moose", or the intersection of the posting lists of "moose" and "caribou", or whatever.

You can't make a neutral service that returns ranked results, because that's contradictory.
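
To make the distinction concrete, here is a minimal sketch of what "index access" means in the posting-list sense: a toy inverted index that returns unranked document ids per term, plus a merge-style intersection of two posting lists. The four documents and the names `build_index`, `intersect`, `DOCS` and `INDEX` are made up for illustration; this says nothing about how Google's or Bing's indexes are actually laid out.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to its posting list: a sorted list of doc ids."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def intersect(a, b):
    """Intersect two sorted posting lists with a linear merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


# Hypothetical toy corpus, not real data.
DOCS = {
    1: "moose and caribou share northern forests",
    2: "a moose crossed the road",
    3: "caribou herds migrate every year",
    4: "caribou and moose tracks in snow",
}
INDEX = build_index(DOCS)

print(INDEX["moose"])                               # [1, 2, 4]
print(intersect(INDEX["moose"], INDEX["caribou"]))  # [1, 4]
```

Note that nothing here orders the results by relevance. A query API like the one described above hands back the output of a ranking step that sits on top of lists like these.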

> The way that people underestimate the scale and complexity of Google's indexing always makes me chuckle.

The same approach that works for Google can also work for a public body supported by public funds. It can also use much of the same systems and maybe even personnel that Google currently employs for this purpose. I'm essentially asking for the search division of Google to be appropriated by the government.

> Also I'm sorry to inform you that the way the index shards are produced, arranged in memory/on disk, and the manner in which they are queried and ranked are inextricably linked. You can't make a search-neutral index format that can be queried economically.

I'm not asking for a search neutral index - public Google just needs to be (largely) open, transparent and accountable about its ranking algorithm. That's why the stipulation that other entities be allowed to mirror the index - they can optimize the index for their own purposes and rankings on their own hardware.

> I'm especially fond of the idea that other parties will just wantonly copy it, like for research purposes or something.

Not for research purposes - other entities will be free to copy it for purposes like sovereignty (governments), less/more censorship or to build their own systems on top of it for profit, or any other purpose.

There are currently zero governments and only about 4 commercial entities possessing a datacenter large enough to do the job.

https://en.wikipedia.org/wiki/Utah_Data_Center

> designed to store data estimated to be on the order of exabytes or larger

While google says:

> The Google Search index contains hundreds of billions of webpages and is well over 100,000,000 gigabytes in size

https://www.google.com/search/howsearchworks/crawling-indexi...

Seems like the government already has the expertise and equipment required to handle data at Google scale :)
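
For anyone comparing the two quoted figures, a quick unit conversion, assuming decimal (SI) units and treating Google's "well over 100,000,000 gigabytes" as a lower bound. The variable names are only for illustration.

```python
# Decimal (SI) byte units; an assumption, since neither quote says which it uses.
GB = 10**9
PB = 10**15
EB = 10**18

# Lower bound taken from the Google quote above.
index_bytes = 100_000_000 * GB

index_pb = index_bytes / PB  # size in petabytes
index_eb = index_bytes / EB  # size in exabytes

print(f"{index_pb:.0f} PB, or {index_eb:.1f} EB")  # 100 PB, or 0.1 EB
```

So the stated index size is at least a tenth of an exabyte. This only compares raw storage numbers.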

Having disk space is only a small part of being able to make use of something like a search index. Arguably the least difficult part.

One of your suggestions was

> That's why the stipulation that other entities be allowed to mirror the index - they can optimize the index for their own purposes and rankings on their own hardware.

And the point is that there's nobody who can do this outside of Google, Microsoft (who also does), Facebook, and Amazon.

Not to mention the problems of actually getting the data. You're at the scale of data where trucks of disks are faster data transfer than cables unless you have direct fiber backbone connections.
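
A rough back-of-the-envelope sketch of the transfer problem, using the 100,000,000 GB lower bound quoted upthread. The 10 Gbit/s and 100 Gbit/s link speeds are assumptions picked for illustration, and the calculation ignores protocol overhead and the fact that the index keeps changing while you copy it.

```python
def transfer_days(size_bytes, link_bits_per_second):
    """Days needed to push size_bytes over a fully saturated link."""
    seconds = size_bytes * 8 / link_bits_per_second
    return seconds / 86_400  # seconds per day


# Lower bound from the Google quote upthread, in decimal gigabytes.
INDEX_BYTES = 100_000_000 * 10**9

# Hypothetical sustained link speeds.
days_10g = transfer_days(INDEX_BYTES, 10 * 10**9)
days_100g = transfer_days(INDEX_BYTES, 100 * 10**9)

print(f"10 Gbit/s:  {days_10g:.0f} days")   # about 926 days
print(f"100 Gbit/s: {days_100g:.0f} days")  # about 93 days
```

Even at a sustained 100 Gbit/s, a single full copy takes about three months under these assumptions.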

99% agreement, except for the amerocentric viewpoint. I think it is likely that Baidu has the scale.

I knew I was missing someone. Yes, Baidu (who also already runs a large search index) could probably do the same thing.

A giant building full of hard drives is NOT a Google datacenter.

You're just comparing two storage numbers without taking into account anything else that running Google at a global scale requires.

The NSA also monitors most of the world's communications in near real time. A building full of hard drives is also completely useless without some sort of reasonably decent search and indexing capability, so I'm pretty sure the NSA built something for the task.

Doesn't the NSA have a bunch of massive datacenters?

Regardless of that, the datacenter can be appropriated by the government too.

No, the NSA has one facility that would be a small/medium datacenter in the big league, but only if you assume that the NSA is as efficient as Google, which is a bit of a stretch.

NSA Utah: 65 MW
Google Pryor, Oklahoma: 340 MW

Megawatts are indicative of compute load, not storage load. I can definitely believe that Google is doing more compute than NSA, but that sounds more like a difference of need, not of ability.

What do you think the query pipeline looks like?

I can assure you that it's not mapping each query down to a single-sector disk read off an inverted index.

Presumably that Google data center does a lot of compute intensive, non search related stuff - like GCP for one.

Another thing that people persistently misunderstand is the scale relationship between GCP and the rest of Google.

It also does a lot of storage intensive, non search related stuff, like Google Cloud Storage.

Every Google data center does...everything.

No tech corps should be owned by the US gov. Please.