Hacker News new | ask | show | jobs
by smoove 5459 days ago
I guess the main problem is that you really can't build an index for regexes, you would need to apply the search regex "live" to all the documents the searchengine knows - this will not scale at all.

Also, if you let a user search for any regex, it would be really easy to overload the server, by entering very complex regexes.

1 comments

Only a small minority of users even know what regexes are, and fewer still use them.

I'm not sure that allowing regexes would put an undue burden on the search engines. But if it ever becomes an issue, the search engine could easily deal with the problem by simply slowing down the search if it contains a regex.

I'd happily wait 2x, 5x, or even 10x as long for my query to complete if I could use a regex. For some important queries for which non-regex searches are inadequate, I'd even be willing to wait hours or days, since the alternative would be not being able to perform the search at all (or returning so many false positives as to be useless).

http://www.worldwidewebsize.com/ suggests Google is currently indexing around 46 billion web pages. Running a regex across that amount of data would be a lot more than 10x slower.

You say you would be happy to wait days for a result, but what incentive would Google have to run long-running regex processing tasks, without showing you any ads or gathering any useful info in the process?

I wish it would happen, but I can't see any incentive for the big players in search to do it at the moment. Like you say, so few people would use it.

"Running a regex across that amount of data would be a lot more than 10x slower"

Would it really? I'd like to see some hard data on that.

"what incentive would Google have to run long-running regex processing tasks, without showing you any ads or gathering any useful info in the process?"

What incentive does Google have for allowing regexes to be used in searches of source code, which it already does?

It's useful, and it gets Google goodwill from its users. Plus, many of its own employees probably benefit from it.

The number of users of Google's regex code search feature is probably no greater than the number of people who'd use regexes in general search, perhaps even smaller.

As far as ads go, I'd bet the vast majority of people who use Google's code search engine run ad blockers and don't see any ads anyway. I very much doubt that Google gets much if any profit from running it. And yet they do it.

>>Would it really? I'd like to see some hard data on that.

The process would be this:

-> User submits Regex

-> Google fetches all documents in it's database (46 billion documents according to mryan) - If we assume 1kb of data per document (wich is probably way to small), google just fetched 43869 GigaByte of data

-> now google somehow iterates over said 43869Gb (we assume we have a lot of RAM btw.) and check if the regex matches any of them

-> Search results are delivered to user (days later?)

I can not give you any "hard facts", but the problem is that if you can not build an index, you have to look at each individual document. And in google's case the amount of documents is just way too high.

I guess it would be more like 10000x or more.

Standard search works because any search will contain words that decrease the target space by several orders of magnitude (and of it does not, nobody Will notice if the search engine returns almost-random documents)

For the regexes that people would want to search with, decreasing the search space is infeasible. So, you would, effectively, have to grep large parts of the Internet. Even at Google, that would take serious time. It also would be expensive.

It is possible to limit the regex space and get reasonable run times, for example by requiring the regex to contain sufficiently infrequent character sequences with \w boundary delimiters. However, I expect any such limited set would be equally well served by a Google search followed by a client-side grep.

If you disagree, I would be interested in what regexes you would allow and how you would implement searching fotpr them.

The slowdown is nowhere near what you've estimated.

To run a regular expression, you have to search through the entire corpus.

Right now, when you type a search in, it searches through its index to look only through the top documents that match those words.

My guess is that it would be more like 1000000x slower.

You could build an index that maps character pairs and triples to documents. Then, if the regex contains two or more consecutive characters without a ? that makes the characters optional, you can use such an index to limit the search space. For some queries, this would make things feasible. For example, it should make the search for xqrur.qqq(aargh)+. feasible. Problem of course, is that users will not make such searches.

I even guess that, if someone actually built an engine that handled all regular expressions and found out a way to handle DDOS attacks, people would complain that they want more (no, you cannot parse HTML with regular expressions)

You could certainly support some subset of regular expressions - e.g. prefixes or suffixes. You could also accelerate special cases by using that approach.

The fact is though, that searching a body of text with an arbitrary regular expression is an O(n) operation. And with the entire Web, O(n) is infeasible to do for every search. And it is O(n) only if you mean regular expression in the narrow technical sense. I believe that the best current algorithms for extended regular expressions (e.g. the ones in Perl) which support features like back-references are exponential in the worst case.