Hacker News new | ask | show | jobs
by Someone 5460 days ago
You could build an index that maps character pairs and triples to documents. Then, if the regex contains two or more consecutive characters without a ? that makes the characters optional, you can use such an index to limit the search space. For some queries, this would make things feasible. For example, it should make the search for xqrur.qqq(aargh)+. feasible. Problem of course, is that users will not make such searches.

I even guess that, if someone actually built an engine that handled all regular expressions and found out a way to handle DDOS attacks, people would complain that they want more (no, you cannot parse HTML with regular expressions)

1 comments

You could certainly support some subset of regular expressions - e.g. prefixes or suffixes. You could also accelerate special cases by using that approach.

The fact is though, that searching a body of text with an arbitrary regular expression is an O(n) operation. And with the entire Web, O(n) is infeasible to do for every search. And it is O(n) only if you mean regular expression in the narrow technical sense. I believe that the best current algorithms for extended regular expressions (e.g. the ones in Perl) which support features like back-references are exponential in the worst case.