Having disk space is only a small part of being able to make use of something like a search index. Arguably the least difficult part.
One of your suggestions was
> That's why the stipulation that other entities be allowed to mirror the index - they can optimize the index for their own purposes and rankings on their own hardware.
And the point is that nobody outside of Google, Microsoft (which already does), Facebook, and Amazon can do this.
Not to mention the problem of actually getting the data. You're at the scale where trucks full of disks move data faster than cables do, unless you have direct fiber backbone connections.
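To put rough numbers on the trucks-versus-cables point, here is a back-of-the-envelope sketch. Every figure in it is an assumption picked for illustration (a 100 PB index, a sustained 10 Gbit/s link, a one-week round trip to load, drive, and unload the disks); none of them are real measurements of anyone's index or network.

```python
SECONDS_PER_DAY = 86_400


def network_days(size_bytes: float, link_bits_per_s: float) -> float:
    """Days needed to push size_bytes over a link sustained at link_bits_per_s."""
    return size_bytes * 8 / link_bits_per_s / SECONDS_PER_DAY


def truck_bits_per_s(size_bytes: float, trip_days: float) -> float:
    """Effective bandwidth of shipping size_bytes on disks in trip_days."""
    return size_bytes * 8 / (trip_days * SECONDS_PER_DAY)


# All three values below are hypothetical, chosen only to show the arithmetic.
INDEX_BYTES = 100e15      # assumed index size: 100 PB
LINK_BITS_PER_S = 10e9    # assumed sustained link speed: 10 Gbit/s
TRIP_DAYS = 7             # assumed time to copy, drive, and unload the disks

days_over_wire = network_days(INDEX_BYTES, LINK_BITS_PER_S)
truck_gbit_per_s = truck_bits_per_s(INDEX_BYTES, TRIP_DAYS) / 1e9

print(f"over the wire: about {days_over_wire:.0f} days")
print(f"truck equivalent: about {truck_gbit_per_s:.0f} Gbit/s sustained")
```

Under these assumptions the link takes well over two years, while the truck works out to more than a terabit per second sustained. Change the inputs and the break-even point moves, but you need backbone-class connectivity before the cable wins.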
The NSA also monitors most of the world's communications in near real time. A building full of hard drives is also completely useless without some sort of reasonably decent search and indexing capability, so I'm pretty sure the NSA built something for the task.