by mryan 5459 days ago
http://www.worldwidewebsize.com/ suggests Google is currently indexing around 46 billion web pages. Running a regex across that amount of data would be a lot more than 10x slower.

You say you would be happy to wait days for a result, but what incentive would Google have to run long-running regex processing tasks, without showing you any ads or gathering any useful info in the process?

I wish it would happen, but I can't see any incentive for the big players in search to do it at the moment. Like you say, so few people would use it.

1 comments

"Running a regex across that amount of data would be a lot more than 10x slower"

Would it really? I'd like to see some hard data on that.

"what incentive would Google have to run long-running regex processing tasks, without showing you any ads or gathering any useful info in the process?"

What incentive does Google have for allowing regexes to be used in searches of source code, which it already does?

It's useful, and it gets Google goodwill from its users. Plus, many of its own employees probably benefit from it.

The number of users of Google's regex code search feature is probably no greater than the number of people who'd use regexes in general search, perhaps even smaller.

As far as ads go, I'd bet the vast majority of people who use Google's code search engine run ad blockers and don't see any ads anyway. I very much doubt that Google gets much if any profit from running it. And yet they do it.

>>Would it really? I'd like to see some hard data on that.

The process would be this:

-> User submits Regex

-> Google fetches all documents in its database (46 billion documents according to mryan). If we assume 1 KB of data per document (which is probably way too small), Google has just fetched 43,869 GB of data

-> Now Google somehow iterates over said 43,869 GB (we assume we have a lot of RAM, btw) and checks whether the regex matches any of the documents

-> Search results are delivered to user (days later?)
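
To put rough numbers on the steps above, here is a back-of-envelope sketch. The document count and the 1 KB per document are the figures already used in this thread; the scan throughput and machine count passed to `scan_hours` are purely hypothetical inputs you would have to fill in yourself, not anything known about Google's hardware.

```python
# Back-of-envelope estimate of how much data a full regex scan would touch.
# All inputs are assumptions from the thread, not measured values.
NUM_DOCS = 46_000_000_000   # pages indexed, per the worldwidewebsize.com figure
BYTES_PER_DOC = 1024        # assumed 1 KB per document (probably far too small)


def total_gib(num_docs: int, bytes_per_doc: int) -> float:
    """Total corpus size in GiB (1 GiB = 1024**3 bytes)."""
    return num_docs * bytes_per_doc / 1024**3


def scan_hours(total_bytes: float, bytes_per_sec_per_machine: float, machines: int) -> float:
    """Hours needed to stream the corpus once, assuming perfect parallelism.

    bytes_per_sec_per_machine and machines are hypothetical parameters.
    """
    return total_bytes / (bytes_per_sec_per_machine * machines) / 3600


if __name__ == "__main__":
    print(round(total_gib(NUM_DOCS, BYTES_PER_DOC)), "GiB")
    # Hypothetical: one machine scanning at 100 MB/s.
    print(scan_hours(NUM_DOCS * BYTES_PER_DOC, 100e6, 1), "hours")
```

With more machines the wall-clock time drops, but the total work per query stays the same, which is the real cost here.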

I cannot give you any "hard facts", but the problem is that if you cannot build an index, you have to look at each individual document. And in Google's case the number of documents is just way too high.
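
A toy sketch of the difference, assuming a plain word-level inverted index (the three documents and the function names are made up for illustration; this is not how Google's index is actually built). A keyword query is a single lookup, while a regex with no usable index has to touch every document:

```python
import re
from collections import defaultdict

# Hypothetical miniature corpus: doc id -> text.
DOCS = {
    1: "regex search over code",
    2: "web search with ads",
    3: "index of the web",
}


def build_index(docs):
    """Map each word to the set of doc ids containing it (built once, up front)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    return index


def keyword_lookup(index, word):
    """One dictionary lookup, independent of how many documents exist."""
    return sorted(index.get(word, set()))


def regex_scan(docs, pattern):
    """No index to consult: every document must be read and tested."""
    rx = re.compile(pattern)
    return sorted(doc_id for doc_id, text in docs.items() if rx.search(text))


if __name__ == "__main__":
    index = build_index(DOCS)
    print(keyword_lookup(index, "search"))
    print(regex_scan(DOCS, r"w[aeiou]b"))
```

The lookup cost stays flat as the corpus grows; the scan cost grows with every document added.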