by m463 2245 days ago
I think what regexes need is a really powerful syntax- and language-aware regex editor.

I've been using regexes for most of my career, and still struggle to get them right on first writing.

The #1 problem I run into is:

what is a literal character and what is a control character?

for example, both these are very common:

- match a parenthesis character or a period character

- use a parenthesis to group a match or use a period to match any one character

You would think I would learn it once, and be good.

but my #2 problem compounds this:

what is a literal character and what is a control character - in the language I am using?

for example I might need to escape a period to make it a literal for a regex.

If I am checking the files filexc and file.c and want to match the second, the regex I want is:

  ^.*\.c$

in perl, I could say:

  $rx = "^.*\\.c\$";    # \$ because $" is a thing (a Perl variable)
  if ($f =~ /$rx/) { ...

better would be:

  if ($f =~ /^.*\.c$/) {...
in python I would write

  m = re.search("^.*\\.c$", f)

better would be:

  m = re.search(r'^.*\.c$', f)

in a shell script, I might say:

  grep "^.*\\.c$"

EDIT: crap, I had to escape my comment because the asterisk in the regex was making my text italic
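
To make the Python case concrete, here is a minimal sketch (the variable names are only for illustration). It checks that the doubled-backslash literal and the raw string spell the same pattern, and that the pattern accepts file.c but rejects filexc.

```python
import re

# Two spellings of the same pattern text: ^.*\.c$
escaped = "^.*\\.c$"   # ordinary string: the backslash must be doubled
raw = r'^.*\.c$'       # raw string: the backslash is written once

same_pattern = escaped == raw

# The escaped period only matches a literal '.', so filexc is rejected.
matches = {name: bool(re.search(raw, name)) for name in ("filexc", "file.c")}

print(same_pattern)
print(matches)
```

The regex is identical in both spellings; only the string-literal layer differs.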
7 comments

The biggest problem I've noticed with regex is that we use it every once in a while, and once it works we move on to other things. And a few weeks/months later, we have forgotten much of it and have to relearn it all over again. Whereas you generally use your programming language (C++, C#, Java, etc.) every day, which keeps your skills sharp, regex is generally a "set it and forget it" situation for most people. And as you noted, different languages, shells, etc. implement their own flavor of regex, which can trip you up.

It's similar to SQL when you think about it. You set up a query to get the data you need and move on to other things. And every RDBMS implements its own flavor of SQL, which can complicate things.

> And a few weeks/months later, we have forgotten much of it and have to relearn it all over again.

This has been exactly my situation. I love regex because it's so powerful but damn do I hate working with it due to the relearning process.

I don't think regex is that specifically hard to learn, but you're fighting the escaping in your favorite language at the same time, so your learning is confounded.
Please do not forget that after a couple of months, when you want to make a small change, you will have forgotten the edge cases you handled when you first created the regex :D
Unit tests will help with this.
I would have to unit-test my command line commands :)
If they are critical: why not? If they aren't: you can live with missing a corner case or two. :-)
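
As a rough sketch of what such tests could look like in Python: the pattern, class, and test names below are made up for illustration, borrowing the file.c example from the top of the thread.

```python
import re
import unittest

# Hypothetical pattern under test: names ending in a literal ".c"
C_FILE = re.compile(r'^.*\.c$')


class CFilePatternTest(unittest.TestCase):
    def test_matches_dot_c(self):
        self.assertTrue(C_FILE.search("file.c"))

    def test_rejects_missing_dot(self):
        # The edge case that is easy to forget months later.
        self.assertIsNone(C_FILE.search("filexc"))

    def test_rejects_other_extension(self):
        self.assertIsNone(C_FILE.search("file.cpp"))


# Run the tests programmatically and keep the result object around.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(CFilePatternTest)
result = unittest.TextTestRunner().run(suite)
```

Each forgotten edge case becomes one small test, so a later "small change" to the pattern fails loudly instead of silently.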
I don't see the problem here. The regex itself is exactly the same, it's just that different languages have different string literal syntaxes (and some have dedicated regex syntax, thus solving the problem of double-escaping).

The only regex engine where this is a problem is Vim's, because there are characters that are special unless escaped, and characters that are normal but become special when escaped. And as if that wasn't enough, there are config options to determine which characters those are. My usual practice is to prefix all Vim regexes with \v so that all the special characters are at least consistent.

> I don't see the problem here.

I think the problem is that it's like shooting off a boat in choppy waters. You not only have to decide what you're aiming at (problem #1), you have to anticipate how the motion of the boat will affect it (problem #2).

I use RegexBuddy and it does a lot of this. Huge downside is that it's $40 and Windows only.

You can do things like write a more generic regex and then select your language (e.g. Python 2 or 3, Java, Perl, and so on) and one of a few common actions, such as "iterate over all matches in a string", and it will auto-generate a code stub for you. Whenever teammates of mine are working on a weird regex they usually email me to double check it for them. (My response is usually that they're trying to do too much with one regex, haha)
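
For reference, an "iterate over all matches in a string" stub looks roughly like this in Python. This is hand-written, not the tool's actual output, and the sample text and pattern are made up.

```python
import re

# Made-up sample input and pattern, purely for illustration.
text = "main.c util.h parse.c README"
pattern = re.compile(r'\b\w+\.c\b')

found = []
for match in pattern.finditer(text):
    # match.start() is the offset of the match, match.group() its text
    found.append((match.start(), match.group()))

print(found)
```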

Does this help?

https://regex101.com

Or an alternative open source one that you can self-host is https://github.com/gskinner/regexr (hosted at https://regexr.com/)
pretty good, for problem #1

actually, very good.

doesn't seem to address problem #2 unless I missed something

>what is a literal character and what is a control character?

I read a tip back in the Perl 5 book, that you can just escape any character if you don't remember if it has a special meaning. (You'll still get the literal character even if it didn't need escaping.)

So I basically do that a lot. Never had any issue with control characters.

In my example, if you over-escape the first period or under-escape the second period the regex will undermatch or overmatch.

for me it took a few tries to get filexc to not match and file.c to match in my example in the 3 languages.
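
A small Python sketch of that failure mode, assuming the same two filenames (the dictionary labels are mine):

```python
import re

names = ("filexc", "file.c")

patterns = {
    "correct": r'^.*\.c$',        # any prefix, then a literal ".c"
    "over-escaped": r'^\.*\.c$',  # first period is now a literal too
    "under-escaped": r'^.*.c$',   # second period is still a wildcard
}

# For each variant, list which of the two names it accepts.
results = {
    label: [n for n in names if re.search(rx, n)]
    for label, rx in patterns.items()
}
print(results)
```

The over-escaped variant accepts neither name (it only allows leading periods), while the under-escaped one accepts both.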

My comment was just about what you call "under-escap[ing] the second period."

I mean if you don't know if a ,;:"%$$@!_€|~ or any other character means something you can just write \ before that character. In other words, without thinking of whether it means something in your regex language. I don't think = means anything, but I would write \= to match an equals sign, so I don't even have to think about it. So my comment was about matching literal characters of any kind. I would have used a \ for the literal period out of habit, just because it wasn't [A-Za-z]. So my version would have been right the first time.
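
The same habit carries over to Python's re module, at least for punctuation, and re.escape automates it for a whole string. The sketch below uses made-up sample strings. Note that the trick only applies to non-alphanumeric characters, since a backslash before a letter or digit usually has its own meaning (\d, \b, \1).

```python
import re

# A backslash before punctuation yields the literal character,
# whether or not that character is special in a regex.
equals_hit = re.search(r'\=', "a=b") is not None    # '=' was never special
period_hit = re.search(r'a\.b', "axb") is not None  # '.' is no longer a wildcard

# re.escape applies the same idea to an entire string.
literal = "file.c (v1)"
escaped = re.escape(literal)
round_trip = re.fullmatch(escaped, literal) is not None

print(equals_hit, period_hit, round_trip)
```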

Do you think an extension for vscode might be worth it?
RegexBuddy comes to mind.