Hacker News new | ask | show | jobs
by ykonstant 816 days ago
Does gpt produce efficient regex? Are there any experts here that can assess the quality and correctness of gpt-generated regex? I wonder how regex responses by gpt are validated if the prompter does not have the knowledge to read the output.
3 comments

You don't have to be an expert; you should very rarely be using regexes so complex that you can't understand them.
It might not be obvious when you hit that point, bad regexes can be subtle, just see that old cloudflare postmortem.
Even simple regexs can be problematic, e.g. Gitlab RCE bug through ExifTools

https://devcraft.io/2021/05/04/exiftool-arbitrary-code-execu...

> "a\ > ""

> The second quote was not escaped because in the regex $tok =~ /(\\+)$/ the $ will match the end of a string, but also match before a newline at the end of a string, so the code thinks that the quote is being escaped when it’s escaping the newline.

...and if you can understand them then you clearly understand regex enough not to need ChatGPT to write them
I understand assembly too.
The only time you'd want to write assembly in production code would be if you need to hand roll some optimisation. So I don't really understand your point here,
That was one of my first uh oh moments with gpt. Getting code that clearly had untestable/unreadable regexen, which given the source must have meant the regex were gpt generated. So much is going to go wrong, and soon.
what does gpt say how we should validate email addresses?
Prompt:

'I'm writing a nodejs javascript application and I need a regex to validate emails in my server. Can you write a regex that will safely and efficiently match emails?'

GPT4 / Gemini Advanced / Claude 3 Sonnet

GPT4: `const emailRegex = /^[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/;` Full answser: https://justpaste.it/cg4cl

Gemini Advanced: `const emailRegex = /^[a-zA-Z0-9.!#$%&'+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)$/;` Full answer: https://justpaste.it/589a5

Claude 3: `const emailRegex = /^([a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})$/;` Full answer: https://justpaste.it/82r2v

Whereas email more or less lasts forever (mailbox contents), and has to be backwards compatible with older versions back to (at least) RFC 821/822, or those before. It also allows almost any character (when escaped at 821 level) in the host or domain part (domain names allow any byte value).

So a Internet email address match pattern has to be: "..*@..*", anything else can reject otherwise valid addresses.

That however does not account for earlier source routed addresses, not the old style UUCP bang paths. However those can probably be ignored for newly generated email.

I regularly use an email address with a "+" in the host part. When I used qmail, I often used addresses like: "foo-a/b-bar-tat@DOMAIN". Mainly for auto filtering received messages from mailing lists.

Still doesn't support internationalized domain names.
Terrible answers as far as I can tell, especially Chat got would throw out many valid email addresses.
chatgpt-4:

^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$

https://chat.openai.com/share/696f7046-7f43-4331-b12b-538566...

chatgpt-3.5:

^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$

https://chat.openai.com/share/aaa09ae8-3fd9-4df7-a417-948436...

…which both excludes addresses allowed by the RFC and includes addresses disallowed by the RFC. (For example, the RFC disallows two consecutive dots in the local-part.)
I take the descriptivist approach to email validation, rather than the prescriptivist.

I know an email has to have a domain name after the @ so I know where to send it.

I also know it has to have something before the @ so the domain’s email server knows how to handle it.

But do I care if the email server is supports sub addresses, characters outside of the commonly supported range (eg quotation marks and spaces), or even characters which aren’t part of the RFC? I do not.

If the user gives me that email, I’ll trust them. Worst case they won’t receive the verification email and will need to double check it. But it’s a lot better than those websites who try to tell me my email is invalid because their regex is too picky.

The HTML email regex validation [1] is probably the best rule to use for validating an email address in most user applications. It prohibits IP address domain literals (which the emailcore people have basically said is of limited utility [2]), and quoted strings in the localpart. Its biggest fault is allowing multiple dots to appear next to each other, which is a lot of faff to put in a regex when you already have to individually spell out every special character in atext.

[1] https://html.spec.whatwg.org/multipage/input.html#email-stat...

[2] https://datatracker.ietf.org/doc/draft-ietf-emailcore-as/

I generally agree, but the two consecutive dots (or leading/trailing dots) are an example that would very likely be a typo and that you wouldn’t particularly want to send. Similar for unbalanced quotes, angle brackets, and other grammar elements.
I wonder whether simply (regex) replacing a sequence of .'s with a single one as part of a post-processing step would be effective.
Yeah, that's about as far as I've ever been comfortable going in terms of validating email addresses too: some stuff followed by "@" followed by more stuff.

Though I guess adding a check for invalid dot patterns might be worthwhile.

What is maybe more important to note, it completely disallows the language of some 4/5 of the humanity. And partially disallows some 2/3 of the rest.
Actually pretty good response if the programmer bothers to read all of it

I'd be more emphatic that you shouldn't rely on regexes to validate emails and that this should only be used as an "in the form validation" first step to warn of user input error, but the gist is there

> This regex is *practical for most applications* (??), striking a balance between complexity and adherence to the standard. It allows for basic validation but does not fully enforce the specifications of RFC 5322, which are much more intricate and challenging to implement in a single regex pattern.

^ ("challenging"? Didn't I see that emails validation requires at least a grammar and not just a regex?)

> For example, it doesn't account for quoted strings (which can include spaces) in the local part, nor does it fully validate all possible TLDs. Implementing a regex that fully complies with the RFC specifications is impractical due to their complexity and the flexibility allowed in the specifications.

> For applications requiring strict compliance, it's often recommended to use a library or built-in function for email validation provided by the programming language or framework you're using, as these are more likely to handle the nuances and edge cases correctly. Additionally, the ultimate test of an email address's validity is sending a confirmation email to it.

Not good at all, but a little better than expected. I use + in email addresses prominently and there are so many websites who don't even allow that...
Remember to first punycode the domain part of an email address before trying to validate it, or it will not work with internationalized domain names.
Support for IDN email addresses is still patchy at best. Many systems can’t send to them; many email hosts still can’t handle being configured for them.
There really ought to be a regex repository of common use cases like these so we don't have to reinvent the wheel or dig up a random codebase that we hope is correct to copy from every time.