Hacker News new | ask | show | jobs
by shayanjm 574 days ago
At what point do we drop the term "regular" expressions altogether for stuff like this? This is going to sound pedantic since I know that most popularly-used regex implementations are themselves non-regular, but I feel like we're just piling more and more stuff on top of good-old-regexes and trying to turn the concept into a catch-all for anything that does pattern matching on text.

I guess it just feels icky that "regular expressions" has inherent meaning (i.e. can be represented entirely by a finite automaton) which has become completely diluted at this point.

That rant aside, cool paper. The idea of bridging formal language theory with modern computational tooling feels timely. I think I would've liked to see more exploration of oracle-based costs, for instance:

* What happens when oracle outputs are inconsistent/uncertain?

* What happens as oracle interactions become more computationally expensive?

2 comments

Your rant is a bit unfounded for this paper as they do actually take completely standard regular expressions (no backtracking or anything like that) and extend it with one more construct. Calling it «semantic regular expressions» seems perfectly reasonable to me. What else to call it?

As for outside of the computer science sphere (which I find is quite consistent in their terminology): I do agree that it seems like it’s a lost cause and «regex» is now synonymous for «pattern matching using this one specific syntax» :(

Regular has a meaning, and it isn't this. It's had this meaning since the 1950s. OTOH, these expressions do not generate a regular language. That's a good reason to me.

They call it "semantic regular expression" because it apparently already is a lost cause. "Regular expressions with TMs embedded" doesn't quite have the same ring to it. Nobody would see it as a regexp.

> OTOH, these expressions do not generate a regular language.

Okay sure you're technically correct here, but only because these expressions generate a subset of a regular language. The LLM can only be invoked on a substring that can expressed as a regular expression, and then it's only used to remove strings from the language. Their results are based heavily on how regular expressions work. A "semantic context-free grammar" would have different type of characteristics and behavior.

Maybe throwing in the word "extended" or "augmented" would be a bit more clear, but as I reader I definitely would expect "regular expression" to be part of the name.

Removing strings from the language is what makes it non-regular. E.g., a regular language cannot contain a^n b^n (that is: the string is only accepted when it has an identical amount of a's and b's), but it sure as hell can contain a^m b^n. Removing the strings where m != n is what makes a language context-free.
To be fair, I could see this basically allowing a form of "back reference" lookup that lets you offload the back reference to other parts. For example, `/(.); \1 = (.)/.exec("foo; foo = anything")`, but instead of doing a back reference, you could have an oracle lookup.

I haven't looked at the examples in this paper, yet. But I'm having fun imagining ways this could be used.

Nested querying is not something that is standard for regular grammars, amongst other aspects introduced in this paper that implicitly require things like memory (again, not standard for regular grammars)
>which has become completely diluted at this point.

Isn't it common in TCS to consider the class X with oracle access to decide L though?