by wjsetzer 2080 days ago
This is unrelated to multicore, but OCaml is a language I want to like. I wanted to learn OCaml with the make a lisp project. That is, until I realized it doesn't have Perl regex built in (yes, I have been spoiled by Python, which has practically everything in the standard library). The best way to get Perl regex was a rarely updated third-party library that was missing key features like lookahead and lookbehind.
4 comments

https://ocaml.janestreet.com/ocaml-core/109.55.00/tmp/re2/Re...

should do the trick! ocaml core is well maintained AFAIK

Here's a link to the latest documentation: https://ocaml.janestreet.com/ocaml-core/latest/doc/re2/Re2__...
S expressions are one of the most used serialization formats in OCaml. You can get pretty far relying on the standard parsers and printers. It's what I would do for a lisp in OCaml. Maybe you wanted to DIY as a learning thing? There are other options for parsing including parser combinators or using Menhir/ocamllex which are interesting in their own right.

https://dev.realworldocaml.org/data-serialization.html

https://dev.realworldocaml.org/parsing-with-ocamllex-and-men...

This isn't really a good answer for someone who wants a regex engine. People use regexes for a lot of things where a full-fledged parser would be overkill, and I'm not sure what serialization has to do with anything.
I completed the make a lisp project in Reason (alternative syntax for OCaml) a few months ago.

I used the PCRE library, which has pretty much all the features you expect, and it is actively maintained. Note: the heavy lifting is done by C libraries.

https://opam.ocaml.org/packages/pcre/

If you want to see how I integrated it with the interpreter the code is here:

https://github.com/briangorman/reason-mal/blob/master/reader...

Might be outdated (or not), but this page has tons and tons of examples, incl. lookahead: http://pleac.sourceforge.net/pleac_ocaml/patternmatching.htm...

And the GitHub page of the aforementioned library: https://mmottl.github.io/pcre-ocaml/

Were you going to parse S-expressions with regular expressions? I guess that saves you from learning how to write loops and conditionals, and what substr() is called (and what arguments it takes) in this new language, but isn't that against the point of learning the new language?
The make a lisp tutorial provides a PCRE regular expression to generate the tokens that are later fed into the reader.
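For anyone who hasn't seen that step, here is a minimal sketch of it in Python (not OCaml, to match the other code in this thread). It assumes the regexp quoted further down the thread; `TOKEN_RE` and `tokenize` are made-up names for illustration, not something taken from the tutorial.

```python
import re

# The regexp quoted in this thread, written as a raw triple-quoted string so
# the quote characters inside it need no extra escaping.
TOKEN_RE = re.compile(
    r"""[\s,]*(~@|[\[\]{}()'`~^@]|"(?:\\.|[^\\"])*"?|;.*|[^\s\[\]{}('"`,;)]*)"""
)

def tokenize(text):
    # findall returns the contents of the single capture group for each match.
    # The last alternative can match the empty string (e.g. at end of input),
    # so empty captures are dropped.
    return [tok for tok in TOKEN_RE.findall(text) if tok]
```

Calling `tokenize('(+ 1 "a b")')` gives the parens, the symbol, the number and the quoted string as separate tokens; whitespace and commas are consumed by the leading `[\s,]*` and never show up in the result.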
Now, I am a grumpy old fart, but I would suggest making a one-pass recursive descent lexer+reader. That should be trivial for the MAL lisp. Writing an OCaml recursive descent parser should be amazingly straightforward, especially since you can just tailcall the different states.
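A rough sketch of what such a one-pass reader could look like, in Python rather than OCaml to match the other code in this thread. All the names (`Symbol`, `read_form`, `read_list`, ...) are hypothetical, only lists, integers, strings and symbols are handled, and since Python doesn't eliminate tail calls the "states" here are plain mutually recursive functions.

```python
class Symbol(str):
    """A symbol; subclassing str keeps it distinguishable from string literals."""

def skip_ws(s, i):
    # Commas count as whitespace, as in the MAL tokenizer regexp.
    while i < len(s) and (s[i].isspace() or s[i] == ','):
        i += 1
    return i

def read_form(s, i=0):
    """Read one form starting at index i; return (value, next_index)."""
    i = skip_ws(s, i)
    if i >= len(s):
        raise SyntaxError("unexpected end of input")
    if s[i] == '(':
        return read_list(s, i + 1)
    if s[i] == ')':
        raise SyntaxError("unexpected ')'")
    if s[i] == '"':
        return read_string(s, i + 1)
    return read_atom(s, i)

def read_list(s, i):
    items = []
    while True:
        i = skip_ws(s, i)
        if i >= len(s):
            raise SyntaxError("expected ')'")
        if s[i] == ')':
            return items, i + 1
        item, i = read_form(s, i)
        items.append(item)

def read_string(s, i):
    chars = []
    while i < len(s) and s[i] != '"':
        if s[i] == '\\' and i + 1 < len(s):
            i += 1          # keep the escaped character verbatim (no \n translation)
        chars.append(s[i])
        i += 1
    if i >= len(s):
        raise SyntaxError('expected closing "')
    return ''.join(chars), i + 1

def read_atom(s, i):
    start = i
    while i < len(s) and not (s[i].isspace() or s[i] in '(),"'):
        i += 1
    text = s[start:i]
    try:
        return int(text), i
    except ValueError:
        return Symbol(text), i
```

There is no separate token list anywhere: each function looks at the current character, consumes what it needs, and hands back the index where the next one should continue.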
While re-using an existing regexp like

    [\s,]*(~@|[\[\]{}()'`~^@]|"(?:\\.|[^\\"])*"?|;.*|[^\s\[\]{}('"`,;)]*)
is easier than writing a tokenizer manually (keeping track of offsets and looping and stuff), writing that regexp is definitely harder than writing the tokenizer like this:

    def isspace(c):
        return c.isspace() or c == ','      # the regexp above skips commas too

    def isspecial(c):
        return c in "[]{}()'`~^@"           # matches c against []{}()'`~^@

    def isquote(c):
        return c == '"'                     # matches c against "

    def tokenize(s):
        curr, end = 0, len(s)
        while True:
            while curr < end and isspace(s[curr]):
                curr += 1

            if curr >= end:
                break

            if s[curr:curr + 2] == "~@":
                yield s[curr:curr + 2]
                curr += 2

            elif isspecial(s[curr]):
                yield s[curr]
                curr += 1

            elif isquote(s[curr]):
                start = curr
                curr += 1

                # check this condition out: you can totally support several quotes,
                # and accurately match the closing and opening ones. Imagine doing it
                # with a regexp: either duplicate it, or use some back-referencing magic

                while curr < end and s[curr] != s[start]:
                    # a backslash escapes the next character, so skip both;
                    # this also handles an escaped backslash right before the closing quote
                    curr += 2 if s[curr] == '\\' else 1
                curr = min(curr + 1, end)   # we want to include the closing quote, if there is one
                yield s[start:curr]

            elif s[curr] == ';':
                yield s[curr:]              # a comment runs to the end of the input line
                break

            else:
                start = curr
                while curr < end and not (isspace(s[curr]) or isspecial(s[curr]) or isquote(s[curr]) or s[curr] == ';'):
                    curr += 1
                yield s[start:curr]
Yeah, it's more verbose and somewhat repetitive, but on the other hand, it's way more readable, and debuggable too: with regexps, it's always a mystery which part of it exactly didn't match what you wanted or captured something you didn't want to match. Here, the loop invariants and preconditions are almost immediately obvious.
Not only that: having a generator generate the tokens means you can do it in one pass, while writing code that has the clarity of 2 passes.
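To make the "one pass, clarity of two" point concrete, here is a small sketch with a deliberately simplified token generator (parens and bare atoms only) feeding a reader. `tokens` and `read_form` are hypothetical names; the reader takes tokens from the generator one at a time, so no intermediate token list is ever built.

```python
def tokens(s):
    """Lazily yield '(', ')' and whitespace-separated atoms (simplified stand-in)."""
    curr, end = 0, len(s)
    while curr < end:
        if s[curr].isspace():
            curr += 1
        elif s[curr] in '()':
            yield s[curr]
            curr += 1
        else:
            start = curr
            while curr < end and not (s[curr].isspace() or s[curr] in '()'):
                curr += 1
            yield s[start:curr]

def read_form(toks, tok=None):
    """Build one form, pulling only as many tokens as that form needs."""
    if tok is None:
        tok = next(toks)            # raises StopIteration on empty input
    if tok == '(':
        items = []
        for tok in toks:            # shares the iterator with the recursive calls
            if tok == ')':
                return items
            items.append(read_form(toks, tok))
        raise SyntaxError("expected ')'")
    if tok == ')':
        raise SyntaxError("unexpected ')'")
    return tok

toks = tokens("(a (b c)) (d)")
first = read_form(toks)             # consumes tokens up to the matching ')'
rest = list(toks)                   # the second form was not tokenized until now
```

The tokenizer and the reader are written as two separate pieces, but at run time they interleave: the generator only advances when the reader asks for the next token.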