HN Mirror


by wz1000 2145 days ago

> The way that people underestimate the scale and complexity of Google's indexing always makes me chuckle.

The same approach that works for Google can also work for a public body supported by public funds. It can also use many of the same systems and maybe even the personnel that Google currently employs for this purpose. I'm essentially asking for the search division of Google to be appropriated by the government.

> Also I'm sorry to inform you that the way the index shards are produced, arranged in memory/on disk, and the manner in which they are queried and ranked are inextricably linked. You can't make a search-neutral index format that can be queried economically.

I'm not asking for a search neutral index - public Google just needs to be (largely) open, transparent and accountable about its ranking algorithm. That's why the stipulation that other entities be allowed to mirror the index - they can optimize the index for their own purposes and rankings on their own hardware.

> I'm especially fond of the idea that other parties will just wantonly copy it, like for research purposes or something.

Not for research purposes - other entities will be free to copy it for purposes like sovereignty (governments), less/more censorship or to build their own systems on top of it for profit, or any other purpose.

2 comments

jeffbee 2145 days ago

There are currently zero governments and only about 4 commercial entities possessing a datacenter large enough to do the job.

wz1000 2145 days ago

https://en.wikipedia.org/wiki/Utah_Data_Center

> designed to store data estimated to be on the order of exabytes or larger

While Google says:

> The Google Search index contains hundreds of billions of webpages and is well over 100,000,000 gigabytes in size

https://www.google.com/search/howsearchworks/crawling-indexi...

Seems like the government already has the expertise and equipment required to handle data at Google scale :)
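
To put the two quoted figures in the same unit, here is a rough conversion sketch. It assumes decimal (SI) units and treats Google's "well over 100,000,000 gigabytes" as a lower bound only; the variable names are just for illustration.

```python
# Decimal (SI) units: 1 GB = 10**9 bytes, 1 PB = 10**15, 1 EB = 10**18.
GB = 10**9
PB = 10**15
EB = 10**18

# Lower bound quoted by Google: "well over 100,000,000 gigabytes".
index_bytes = 100_000_000 * GB

index_pb = index_bytes / PB
index_eb = index_bytes / EB

print(f"index lower bound: {index_pb:.0f} PB = {index_eb:.1f} EB")
```

So the quoted lower bound is about a tenth of an exabyte. This compares raw storage only and says nothing about serving queries from that data.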

joshuamorton 2145 days ago

Having disk space is only a small part of being able to make use of something like a search index. Arguably the least difficult part.

One of your suggestions was

> That's why the stipulation that other entities be allowed to mirror the index - they can optimize the index for their own purposes and rankings on their own hardware.

And the point is that there's nobody who can do this outside of Google, Microsoft (which already runs its own search index), Facebook, and Amazon.

Not to mention the problems of actually getting the data. You're at the scale of data where trucks of disks are faster data transfer than cables unless you have direct fiber backbone connections.
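
A back-of-the-envelope sketch of the transfer-time point, taking the 100,000,000 GB figure quoted upthread as a lower bound. The link speeds are assumptions chosen for illustration, and the calculation assumes a fully saturated link with no protocol overhead.

```python
# Index size lower bound quoted upthread: 100,000,000 GB (decimal units).
INDEX_BYTES = 100_000_000 * 10**9

SECONDS_PER_DAY = 86_400


def transfer_days(num_bytes: int, link_gbps: float) -> float:
    """Days needed to move num_bytes over a saturated link of link_gbps gigabits/s."""
    bits = num_bytes * 8
    seconds = bits / (link_gbps * 10**9)
    return seconds / SECONDS_PER_DAY


# Hypothetical sustained link speeds, in gigabits per second.
for gbps in (1, 10, 100):
    print(f"{gbps:>3} Gbit/s: {transfer_days(INDEX_BYTES, gbps):,.0f} days")
```

Even at a sustained 100 Gbit/s, a single copy takes roughly three months under these assumptions, which is why shipping disks starts to look competitive.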

jeffbee 2145 days ago

99% agreement, except for the amerocentric viewpoint. I think it is likely that Baidu has the scale.

joshuamorton 2145 days ago

I knew I was missing someone. Yes, Baidu (who also already runs a large search index) could probably do the same thing.

bananabreakfast 2145 days ago

A giant building full of hard drives is NOT a Google datacenter.

You're just comparing two storage numbers without taking into account anything else that running Google at a global scale requires.

wz1000 2145 days ago

The NSA also monitors most of the world's communications in near real time. A building full of hard drives is also completely useless without some sort of reasonably decent search and indexing capability, so I'm pretty sure the NSA built something for the task.

wz1000 2145 days ago

Doesn't the NSA have a bunch of massive datacenters?

Regardless of that, the datacenter can be appropriated by the government too.

jeffbee 2145 days ago

No, the NSA has one facility that would be a small/medium datacenter in the big league, but only if you assume that the NSA is as efficient as Google, which is a bit of a stretch.

NSA Utah: 65 MW

Google Pryor, Oklahoma: 340 MW

a1369209993 2145 days ago

Megawatts are indicative of compute load, not storage load. I can definitely believe that Google is doing more compute than NSA, but that sounds more like a difference of need, not of ability.

rossjudson 2145 days ago

What do you think the query pipeline looks like?

I can assure you that it's not mapping each query down to a single-sector disk read off an inverted index.
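
For readers unfamiliar with the term, an inverted index maps each term to the set of documents containing it. Below is a toy sketch with made-up documents and hypothetical helper names (`build_index`, `lookup`). It shows only the naive lookup the comment is contrasting against and is not a description of how Google's pipeline works.

```python
from collections import defaultdict

# Made-up documents, purely for illustration.
docs = {
    1: "public search index",
    2: "search ranking algorithm",
    3: "public data center",
}


def build_index(documents):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def lookup(index, query):
    """Return ids of documents containing every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


index = build_index(docs)
print(lookup(index, "public search"))
```

Everything a real engine does beyond this lookup, such as ranking and spelling correction, costs compute rather than just disk reads.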

a1369209993 2144 days ago

?

I think the query pipeline for NSA (relative to the scales of Google's query pipeline) looks like absence-of-query-pipeline. Hence NSA using less compute and thus (the reasoning goes) less power consumption.

wz1000 2145 days ago

Presumably that Google data center does a lot of compute intensive, non search related stuff - like GCP for one.

jeffbee 2145 days ago

Another thing that people persistently misunderstand is the scale relationship between GCP and the rest of Google.

wz1000 2145 days ago

Even if all you say is true and it is truly impossible for the government to replicate any of what Google does, the point is moot. If the government is going to appropriate Google's index, it might as well appropriate the datacenters too. Really, what's Google going to do with them once search is gone? According to you, it is the only thing they have running there.

rossjudson 2145 days ago

It also does a lot of storage intensive, non search related stuff, like Google Cloud Storage.

Every Google data center does...everything.

fomine3 2142 days ago

No tech corps should be owned by the US Gov. Please.