by wz1000 2145 days ago
> The way that people underestimate the scale and complexity of Google's indexing always makes me chuckle.

The same approach that works for Google can also work for a public body supported by public funds. It can also use much of the same systems and maybe even personnel that Google currently employs for this purpose. I'm essentially asking for the search division of Google to be appropriated by the government.

> Also I'm sorry to inform you that the way the index shards are produced, arranged in memory/on disk, and the manner in which they are queried and ranked are inextricably linked. You can't make a search-neutral index format that can be queried economically.

I'm not asking for a search neutral index - public Google just needs to be (largely) open, transparent and accountable about its ranking algorithm. That's why the stipulation that other entities be allowed to mirror the index - they can optimize the index for their own purposes and rankings on their own hardware.

> I'm especially fond of the idea that other parties will just wantonly copy it, like for research purposes or something.

Not for research purposes - other entities will be free to copy it for purposes like sovereignty (governments), less/more censorship or to build their own systems on top of it for profit, or any other purpose.


There are currently zero governments and only about 4 commercial entities possessing a datacenter large enough to do the job.
https://en.wikipedia.org/wiki/Utah_Data_Center

> designed to store data estimated to be on the order of exabytes or larger

While Google says:

> The Google Search index contains hundreds of billions of webpages and is well over 100,000,000 gigabytes in size

https://www.google.com/search/howsearchworks/crawling-indexi...

Seems like the government already has the expertise and equipment required to handle data at Google scale :)
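For what it's worth, the two quoted figures can be put on the same scale with a quick unit conversion. This is a minimal sketch, assuming decimal (SI) units and treating Google's "well over 100,000,000 gigabytes" as a lower bound; the variable names are just for illustration.

```python
# Decimal (SI) byte units; an assumption, since neither source says which convention it uses.
GB = 10**9
PB = 10**15
EB = 10**18

# Lower bound from Google's statement: "well over 100,000,000 gigabytes".
index_bytes = 100_000_000 * GB

index_pb = index_bytes / PB
index_eb = index_bytes / EB

print(f"Google index (lower bound): {index_pb:.0f} PB = {index_eb:.1f} EB")
```

So the stated lower bound for the index is about a tenth of an exabyte, against a facility described as holding "on the order of exabytes or larger".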

Having disk space is only a small part of being able to make use of something like a search index. Arguably the least difficult part.

One of your suggestions was

> That's why the stipulation that other entities be allowed to mirror the index - they can optimize the index for their own purposes and rankings on their own hardware.

And the point is that nobody outside of Google, Microsoft (which already runs its own index), Facebook, and Amazon can do this.

Not to mention the problem of actually getting the data. You're at the scale where shipping trucks full of disks moves data faster than cables do, unless you have direct fiber backbone connections.
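A rough back-of-the-envelope check on the transfer claim. This is only a sketch: the 100 PB size is the lower bound Google quotes for its index, the link speeds are illustrative assumptions rather than measurements of any real connection, and the link is assumed to be fully saturated with no protocol overhead.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(size_bytes: int, link_bits_per_sec: float) -> float:
    """Days needed to move size_bytes over a fully saturated link (no overhead)."""
    return size_bytes * 8 / link_bits_per_sec / SECONDS_PER_DAY


# 100 PB in decimal units: Google's stated lower bound for the index size.
INDEX_BYTES = 100 * 10**15

# Hypothetical link speeds, chosen only for illustration.
for label, bps in [("10 Gbit/s", 10e9), ("100 Gbit/s", 100e9), ("1 Tbit/s", 1e12)]:
    print(f"{label:>10}: {transfer_days(INDEX_BYTES, bps):8.1f} days")
```

At 10 Gbit/s this comes to roughly two and a half years of continuous transfer, which is why physically shipping disks starts to look attractive.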

99% agreement, except for the amerocentric viewpoint. I think it is likely that Baidu has the scale.
I knew I was missing someone. Yes, Baidu (who also already runs a large search index) could probably do the same thing.
A giant building full of hard drives is NOT a Google datacenter.

You're just comparing two storage numbers without taking into account anything else that running Google at a global scale requires.

The NSA also monitors most of the world's communications in near real time. A building full of hard drives is also completely useless without some sort of reasonably decent search and indexing capability, so I'm pretty sure the NSA built something for the task.
Doesn't the NSA have a bunch of massive datacenters?

Regardless of that, the datacenter can be appropriated by the government too.

No, the NSA has one facility that would be a small/medium datacenter in the big league, but only if you assume that the NSA is as efficient as Google, which is a bit of a stretch.

NSA Utah: 65 MW
Google Pryor, Oklahoma: 340 MW

Megawatts are indicative of compute load, not storage load. I can definitely believe that Google is doing more compute than NSA, but that sounds more like a difference of need, not of ability.
What do you think the query pipeline looks like?

I can assure you that it's not mapping each query down to a single-sector disk read off an inverted index.

?

I think the query pipeline for the NSA (relative to the scale of Google's query pipeline) looks like absence-of-query-pipeline. Hence the NSA using less compute and thus (the reasoning goes) less power.

Presumably that Google data center does a lot of compute-intensive, non-search-related stuff, GCP for one.
Another thing that people persistently misunderstand is the scale relationship between GCP and the rest of Google.
Even if all you say is true and it is truly impossible for the government to replicate any of what Google does, the point is moot. If the government is going to appropriate Google's index, it might as well appropriate the datacenters too. Really, what's Google going to do with them once search is gone? According to you, it is the only thing they have running there.
It also does a lot of storage-intensive, non-search-related stuff, like Google Cloud Storage.

Every Google data center does...everything.

No tech corporation should be owned by the US government. Please.