Hacker News
by Freak_NL 4 days ago
This is all old hat, unfortunately, and also something developers will keep getting wrong for years to come. Just shouting 'give me a regex for validating email addresses' will make an LLM like ChatGPT happily output bullshit: some overlong regex that is flawed precisely as outlined by the linked article, even though no one argues for those long, unmaintainable regexes once they've seen the light.

Ah well.

Where there is still room for improvement is in how a lot of websites partially mask email addresses. Did you ever see something like 'j*h@gmail.com'? Oh wow, that neatly leaves out John Smith's full name! Like showing only the last four digits of an IBAN or credit card.

Except for us edge cases with a personal domain, where I then get 'm*l@myfullname.nl'. So stop that. Store it next to the bit of knowledge about validating email addresses — the bits of knowledge you use to correct junior developers and senior idiots.
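To make the leak concrete, here is a minimal sketch of the kind of masking described above. The function name `naive_mask` and the rule it applies (keep the first and last character of the local part, leave the domain alone) are assumptions for illustration; no particular website's implementation is known here.

```python
def naive_mask(address: str) -> str:
    """Hypothetical masker: hides the middle of the local part only.

    Assumes the input contains an '@'. The domain is left untouched,
    which is exactly what leaks the name on a personal domain.
    """
    local, _, domain = address.rpartition("@")
    if len(local) <= 2:
        masked = "*" * len(local)
    else:
        masked = f"{local[0]}*{local[-1]}"
    return f"{masked}@{domain}"


print(naive_mask("johnsmith@gmail.com"))   # j*h@gmail.com
print(naive_mask("mail@myfullname.nl"))    # m*l@myfullname.nl
```

On a shared provider the masked part carried the identifying information; on a personal domain the unmasked part does.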

2 comments

I just tried this with Claude Opus 4.8 and I don't see any of those issues:

The first sentence says that there is no single regex that perfectly validates every technically valid email address. I think that is a good start.

It then recommends the regex used for <input type="email"> and explains that this would cover the majority of email addresses used by actual people. It also shows an improved regex that handles dot-atom local parts, quoted strings, domain names, and IPv4 domain literals, but doesn't cover things such as comments, full IPv6 literals, or internationalized addresses.

It ends with the only correct advice (in my opinion): Send a confirmation email.

Does it say 'don't bother with a regex beyond checking it contains an @ surrounded by arbitrary pieces of text?' This still sounds like it is leading developers to conclude that they should use a too complex regex and then send a confirmation email.
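For reference, the "an @ surrounded by arbitrary pieces of text" check needs no regex at all. A minimal sketch, where the function name `looks_like_email` is made up for this example and splitting on the last '@' is an assumption (so that a quoted local part containing an '@' still passes):

```python
def looks_like_email(address: str) -> bool:
    """True if there is an '@' with non-empty text on both sides.

    This only filters obvious typos; deliverability still has to be
    proven by sending a confirmation email.
    """
    local, at, domain = address.rpartition("@")
    return bool(at) and bool(local) and bool(domain)


print(looks_like_email("john@example.com"))  # True
print(looks_like_email("john.example.com"))  # False
print(looks_like_email("@example.com"))      # False
```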

Claude Sonnet says:

> A practical email regex that covers the vast majority of real-world addresses:
>
> ^[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}$

Which is still way more complex than needed (and takes effort to read), and buggy according to years of blog posts written about this topic.
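A quick way to see the bugs is to run the quoted regex against a few addresses. The test addresses below are my own picks, not taken from the thread or the linked article: an apostrophe is allowed in the local part by RFC 5322, and `xn--p1ai` is the Punycode form of an internationalized TLD, while doubled dots are not valid in an unquoted local part or in a domain name.

```python
import re

# The regex exactly as quoted above.
QUOTED_REGEX = re.compile(r"^[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}$")


def matches(address: str) -> bool:
    """Return True if the quoted regex accepts the address."""
    return QUOTED_REGEX.match(address) is not None


# Valid addresses that the regex rejects.
print(matches("o'brien@example.com"))       # False
print(matches("user@example.xn--p1ai"))     # False

# A malformed address that the regex accepts.
print(matches("john..smith@example..com"))  # True
```

So it is both too strict and too lenient, which is the usual outcome of trying to encode the address grammar in one readable pattern.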

Of course the problem is the developer asking for a regex at all, but the must-regex-email instinct seems deeply ingrained in our collective psyche.

I have no idea what other pay-to-play models say.

Masking is a nice sibling problem to validation. In both cases, the bug is assuming an email address has a predictable human structure.