| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by jiggawatts 1898 days ago

Interesting, but not quite what I had in mind.

My idea was loosely based on a code search engine Google had for a while before they killed it off.

I'm not sure exactly how they did it, but I very strongly suspected that they implemented regular expressions over database indexes.

A simple b-tree index is just a set of sorted strings. If you have a lot of sorted strings and you squint at it, you can see how it picks out common prefixes, like a tree. (You can literally store it with the prefixes factored out, and then you have a trie.)

For a fairly wide range of regular expressions, and a tree of sorted strings, you can implement a search over the b-tree. E.g.: the search "[bx](a+)foo" can be implemented via finding the range starting with "b", then the range starting with "x". For each range, find the sub-range starting with "ba", "xa", etc...

Each step takes logarithmic time, and can be done in parallel. In theory, it takes only about a hundred times longer to find a ten matches in a million strings than it would take to match a single string.

Wouldn't it be nice if you could take an algorithm such as "match regex" that normally takes a 'scalar' string and have the compiler automatically create a version that can take an array of sorted values, like the database index?

The idea is that much like how array programming languages eschew scalars and pass arrays all over the place to gain efficiency, my language would pass sets all over the place, and gain a different type of efficiency.