Hacker News new | ask | show | jobs
by zzzcpan 3769 days ago
> Perl devs don't think twice about shelling out to an external binary

No, most of them do. Perl ecosystem has a killer feature called cpantesters, that allows everyone to see which modules work on which systems out of the box. You should always check cpantesters matrix before choosing a particular dependency.

> Even if regexs are not needed

They got overly complicated over the years, but they are needed. They are DSLs to make things easier when working with strings. I.e. so you wouldn't have to write 20 lines of hard to grasp code with bytes.Index(), bytes.HasSuffix(), bytes.TrimRight(), etc., like people do in Go, but a single nice regexp and therefore reduce your chances to make a mistake in that code.

3 comments

> so you wouldn't have to write 20 lines of hard to grasp code with bytes.Index(), bytes.HasSuffix(), bytes.TrimRight(), etc., like people do in Go

Go has regexps, and a very good implementation at it.

Depending on what you do and on the specific code-path, compiling and/or executing a regexp might be slower than manually parsing the string. Go standard library is pretty concerned with performance (much more than Python's or Ruby's, for instance), so it tends to avoid regexps.

It shouldn't be like that, that's the problem. Regular expressions should be compiled into a native code and be even faster than a bunch of hand written bytes.HasSuffix() combinations.
Your previous post said that they are a very useful DSL for Perl so that "people don't have to do like they do in Go".

Both Perl and Go implement regexps, and neither or them compile them to native code. So I don't get your previous comment at all.

The main difference is that, in Perl, if you ever had to write manual string parsing, it would be much much slower than using regexps as Perl is an interpreted language. So regexps are needed to perform fast string parsing. In Go, you have regexps if you want, or you can go even faster if you feel it's required.

> Both Perl and Go implement regexps, and neither or them compile them to native code. So I don't get your previous comment at all.

Ok, I'll try to explain.

People feel discouraged to use regexps in Go, because they are very slow for many typical parsing and validating cases and require extra step of compilation and all of the additional code complexity associated with that. So, people do parsing manually instead, with all of its problems. It's not that they need that performance, almost no one does, but the whole idea behind regular expressions is not working, parsing code is still bad most of the time.

You've made me curious: Is there a language out there which does this, i.e. compiles Regex down to native code which is then as fast/faster than hand-coded bytes.hasSuffix(..) calls?
I found this with a bit of searching and clicking around on Stackoverflow: https://www.colm.net/open-source/ragel/ (via http://stackoverflow.com/a/15608037).

I didn't look long enough to know if there's an easy way to convert a regular expression to Ragel syntax.

> Go has regexps, and a very good implementation at it.

In my experience, porting code from Perl to Go, Go's regexp package is vastly inferior to Perl's, in multiple areas, speed, memory, unicode handling (eg: \b works on ascii-only in Go), etc. For example, for some large regexps handling url blacklists, reduced programmatically with Perl's awesome regexp assembly tools, I had to rely on PCRE in the end, Go just could not cope with that (not even the c++ re2). I do avoid regexps, regexps are usually best avoided, and all that, but there are areas in which they are by far the best option. In those areas, I postulate, from my own experience, that Perl's implementation is king. Speed, memory usage, Unicode.

> (not even the c++ re2)

Did you try using RE2's "set" functionality?

No, I did not get that far, would've meant a larger rewrite of the ecosystem, the data files were created by other tools, already in "alternate form" [1] needing to be used by other programs as well. I stopped trying to load them with re2 (both Go and C++), after glancing over all those gigabytes of RSS, while Perl kept them in the 2-300 MB range. PCRE was a good compromise at the time, but with other tradeoffs, because C libs seem to be frowned upon in the Go community, ie. semi-official voices arguing how best to avoid them. :/ (eg: blocking inside C isn't under the gomaxprocs limit, costly overhead crossing the C boundaries, static binary troubles, less portability and so on)

#1. perl -MRegexp::Assemble -E'my @list = qw< foo fo0z bar baz >; my $rx = Regexp::Assemble->new->add( @list )->re; say $rx'

(?^:(?:fo(?:0z|o)|ba[rz]))

cpantesters looks very useful. [1]

I wonder if there's anything like that for Python and Ruby.

[1]: for example, http://cpantesters.org/author/D/DAMOG.html

Less code is generally better. But I've noticed a lot of folks still using ^ or $ when what they really mean is \A or \z
What's the difference? ^ and $ is basically all I remember from when I read Mastering Regular Expressions
\A and \z always match beginning/end of the string.

^ and $ can be changed to mean beginning/end of each line in the string with the /m flag.