Hacker News new | ask | show | jobs
by judofyr 574 days ago
Your rant is a bit unfounded for this paper as they do actually take completely standard regular expressions (no backtracking or anything like that) and extend it with one more construct. Calling it «semantic regular expressions» seems perfectly reasonable to me. What else to call it?

As for outside of the computer science sphere (which I find is quite consistent in their terminology): I do agree that it seems like it’s a lost cause and «regex» is now synonymous for «pattern matching using this one specific syntax» :(

2 comments

Regular has a meaning, and it isn't this. It's had this meaning since the 1950s. OTOH, these expressions do not generate a regular language. That's a good reason to me.

They call it "semantic regular expression" because it apparently already is a lost cause. "Regular expressions with TMs embedded" doesn't quite have the same ring to it. Nobody would see it as a regexp.

> OTOH, these expressions do not generate a regular language.

Okay sure you're technically correct here, but only because these expressions generate a subset of a regular language. The LLM can only be invoked on a substring that can expressed as a regular expression, and then it's only used to remove strings from the language. Their results are based heavily on how regular expressions work. A "semantic context-free grammar" would have different type of characteristics and behavior.

Maybe throwing in the word "extended" or "augmented" would be a bit more clear, but as I reader I definitely would expect "regular expression" to be part of the name.

Removing strings from the language is what makes it non-regular. E.g., a regular language cannot contain a^n b^n (that is: the string is only accepted when it has an identical amount of a's and b's), but it sure as hell can contain a^m b^n. Removing the strings where m != n is what makes a language context-free.
To be fair, I could see this basically allowing a form of "back reference" lookup that lets you offload the back reference to other parts. For example, `/(.); \1 = (.)/.exec("foo; foo = anything")`, but instead of doing a back reference, you could have an oracle lookup.

I haven't looked at the examples in this paper, yet. But I'm having fun imagining ways this could be used.

Nested querying is not something that is standard for regular grammars, amongst other aspects introduced in this paper that implicitly require things like memory (again, not standard for regular grammars)