| HN Mirror

by helloTree 4688 days ago

What is it with the obsession with regular expressions? They are useful, sure, but I only use them with grep or when I search for strings, and normally they are pretty basic, e.g.

$ grep -r -n --color "foo*bar" src

If I want to validate input data with the machine I just use a parser.

2 comments

ghshephard 4687 days ago

Regexes are an elegant and very powerful way to validate data in scripts in a concise and (if they aren't abused) easy-to-read fashion. There are an almost infinite number of examples, but let's say I want to verify that a field is a 64-bit hexadecimal MAC address:

   $mac =~ /^[A-Fa-f0-9]{16}$/

Gets the job done. How else could you do it so concisely, if not with a regular expression?

And, when you say, "If I want to validate input data with the machine I just use a parser." - that's pretty much what a regex engine is - a sophisticated parser, and the regular expression is the "commands" that you feed to it to parse the input text.
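
For readers who want to try the MAC check outside Perl, here is a minimal Python sketch of the same pattern. The names `MAC64_RE` and `is_mac64` are made up for illustration and are not from the comment; the only assumption is that the field is a plain string of 16 hex digits with no separators.

```python
import re

# Same character class and length as the pattern in the comment above:
# exactly 16 hexadecimal digits, no separators.
MAC64_RE = re.compile(r"[A-Fa-f0-9]{16}")


def is_mac64(field: str) -> bool:
    """Return True if `field` consists of exactly 16 hex digits."""
    # fullmatch() requires the whole string to match, so explicit
    # ^ and $ anchors are not needed here.
    return MAC64_RE.fullmatch(field) is not None
```

Using `fullmatch` rather than `match` with `$` also rejects a value with a trailing newline, which `$` alone would let through in Python.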

ghshephard 4687 days ago

Here is another one I just did tonight - I wanted to match IPv4 addresses, but didn't want to accept any octet with a leading 0 (that specifies octal format, which 99.9999% of the time is not what people want). I do want to accept a 0 if it's the octet's only digit (i.e. 3.0.2.1, 0.0.0.0, etc.)

regex_ipv4='^((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])$'

Gets the job done.

How else would you do it?

You can then build up a library of these, and use them on other projects.
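
As a rough illustration of the "library of these" idea, the pattern above can be dropped into a small Python lookup table. This is only a sketch: `PATTERNS` and `validate` are hypothetical names, and the regex string is copied unchanged from the comment.

```python
import re

# Hypothetical pattern library; the IPv4 regex is the one from the comment.
PATTERNS = {
    "ipv4": re.compile(
        r'^((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\.){3}'
        r'(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])$'
    ),
}


def validate(kind: str, value: str) -> bool:
    """Check `value` against the named pattern in PATTERNS."""
    # fullmatch() is used so that a trailing newline is rejected too;
    # Python's $ on its own would accept one.
    return PATTERNS[kind].fullmatch(value) is not None
```

With this, `validate("ipv4", "3.0.2.1")` is accepted while `validate("ipv4", "01.0.0.0")` is rejected, matching the behaviour described in the comment.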

mjhoy 4687 days ago

> How else would you do it?

Taking your question generally, I was curious to see what it might look like as a parser, since I find that regex a little hard to read. Here's an implementation with Haskell's parsec:

https://gist.github.com/mjhoy/6751909

MichaelSalib 4687 days ago

Maybe something like this:

  def is_ipv4_addr(s):
      try:
          octets = s.split('.')
          assert len(octets) == 4
          for o in octets:
              assert 0 <= int(o.lstrip(0) or '0') < 256
      except:
          return False
      return True

It is longer; on the other hand, it is easier to read and, more importantly, easier to verify for correctness.

ghshephard 4686 days ago

Would:

  1. 12 .13. 14
  089.23.45.67

Both match that? (Your general point is made though - RegExes look fine to the person that just crafted them, but are opaque to the casual observer)

clarry 4687 days ago

I think you forgot to verify that an octet doesn't have leading zeros (unless its value actually is zero).

MichaelSalib 4687 days ago

I didn't forget: the (o.lstrip(0) or '0') expression does that.

Actually, that should be o.lstrip('0')...

clarry 4686 days ago

Wrong.

  >>> is_ipv4_addr("01.0.0.0")
  True

It should reject that (i.e. return False) because the first octet contains a leading zero. But you're just stripping the zero away, ignoring its existence. For no effect, because converting with int() already ignores them for you.

Your code is also ok with bizarre inputs like "0..." :-)
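
To make the corner cases concrete, here is one possible split-based version that rejects leading zeros, empty octets, and embedded whitespace. It is a sketch, not the code from the pastebin link; the name `is_ipv4_addr_strict` is made up for illustration.

```python
def is_ipv4_addr_strict(s: str) -> bool:
    """Accept dotted-quad IPv4 text with no leading zeros in any octet."""
    octets = s.split(".")
    if len(octets) != 4:
        return False
    for o in octets:
        # Only ASCII digits are allowed: this rejects "", " 12", "+1",
        # and non-ASCII digit characters that int() would accept.
        if not (o.isascii() and o.isdigit()):
            return False
        # A leading zero is only allowed when the octet is exactly "0".
        if len(o) > 1 and o[0] == "0":
            return False
        if int(o) > 255:
            return False
    return True
```

Each rejection rule is spelled out as its own check, which is roughly the amount of logic the single regex encodes.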

Regexes really do have their strengths -- they compactly express a state machine, and you can always break the expression into parts, which will show exactly what the state machine accepts. They could also be much more readable if people bothered to break them into parts instead of typing everything out inside one long string that becomes really difficult to parse visually. There are other notations to improve readability, for example rx in emacs: http://www.emacswiki.org/emacs/rx
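
As one illustration of breaking a regex into parts, the IPv4 pattern from earlier in the thread could be written with Python's `re.VERBOSE` flag, which ignores whitespace in the pattern and allows inline comments. This is a sketch rather than anything posted in the thread; `OCTET` and `IPV4_VERBOSE` are illustrative names.

```python
import re

# One octet, with each alternative covering one range of values.
OCTET = r"""
    (?: 25[0-5]        # 250-255
      | 2[0-4][0-9]    # 200-249
      | 1[0-9]{2}      # 100-199
      | [1-9][0-9]     # 10-99
      | [0-9]          # 0-9, so a lone zero is allowed
    )
"""

# Three "octet followed by a dot" groups, then a final octet.
IPV4_VERBOSE = re.compile(
    rf"(?: {OCTET} \. ){{3}} {OCTET}",
    re.VERBOSE,
)


def is_ipv4(s: str) -> bool:
    """Return True if the whole string is a dotted quad without leading zeros."""
    return IPV4_VERBOSE.fullmatch(s) is not None
```

The accepted language is the same as the one-line version; only the layout differs, so each range of octet values can be checked against its comment.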

A seemingly simple regex can be implemented in imperative code and it might look clean and pretty until you get the logic exactly right and amend it to handle all the corner cases that are not obvious at first sight. For comparison I did the exercise in old-fashioned C (and the indentation got messed up along the way, sigh).

https://pastebin.mozilla.org/3171656

A state machine would be more appropriate in my opinion.

helloTree 4686 days ago

Ok then maybe it's just a matter of taste. I like the parser approach more:

  import Text.ParserCombinators.ReadP -- or the parser lib of your choice
  import Data.Char

  ...

  macP = count 16 (satisfy isHexDigit)

hnriot 4687 days ago

I don't think your example does what you think it does: it matches "fo", followed by zero or more "o"s, followed by "bar".

When you do understand regex, you'll be amazed at the myriad things you can do with it.
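
The difference is easy to check with a short Python sketch. It assumes the intended pattern was `foo.*bar`; the names `as_written`, `intended`, and `found` are made up for illustration. For these two patterns, Python's `re` behaves the same way grep does.

```python
import re

# Pattern from the original post: "fo", zero or more "o", then "bar".
as_written = re.compile(r"foo*bar")
# Assumed intent: "foo", any run of characters, then "bar".
intended = re.compile(r"foo.*bar")


def found(pattern: re.Pattern, line: str) -> bool:
    """Mimic grep: report whether the pattern occurs anywhere in the line."""
    return pattern.search(line) is not None
```

Here `found(as_written, "fobar")` is true but `found(as_written, "foo_bar")` is false, while `found(intended, "foo_bar")` is true.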

helloTree 4686 days ago

You are absolutely right, I forgot the '.'.
