
	by wjsetzer 2080 days ago
	This is unrelated to multicore, but OCaml is a language I want to like. I wanted to learn OCaml with the make a lisp project. That is, until I realized it doesn't have Perl regex built in (yes, I have been spoiled by Python, which has practically everything in the standard library). The best way to get Perl regex was a rarely updated third-party library which was missing key features like lookahead and lookbehind.
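
For readers who have not met the two features named above, here is a small illustration using Python's standard `re` module. The sample strings and the helper names `called_names` and `dollar_amounts` are made up for this sketch.

```python
import re

def called_names(text):
    # Lookahead (?=...): match a word only if "(" follows, without consuming it.
    return re.findall(r"\w+(?=\()", text)

def dollar_amounts(text):
    # Lookbehind (?<=...): match digits only if "$" precedes them.
    return re.findall(r"(?<=\$)\d+", text)

print(called_names("foo(bar) baz(1)"))
print(dollar_amounts("cost $42, qty 7"))
```

In both cases the surrounding character is checked but not included in the match, which is what a library without lookaround cannot express directly.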

4 comments

jlrubin 2080 days ago

https://ocaml.janestreet.com/ocaml-core/109.55.00/tmp/re2/Re...

Should do the trick! OCaml Core is well maintained AFAIK.

laylomo2 2080 days ago

Here's a link to the latest documentation: https://ocaml.janestreet.com/ocaml-core/latest/doc/re2/Re2__...

philzook 2080 days ago

S-expressions are one of the most used serialization formats in OCaml. You can get pretty far relying on the standard parsers and printers. It's what I would do for a lisp in OCaml. Maybe you wanted to DIY as a learning thing? There are other options for parsing, including parser combinators or Menhir/ocamllex, which are interesting in their own right.

https://dev.realworldocaml.org/data-serialization.html

https://dev.realworldocaml.org/parsing-with-ocamllex-and-men...

djur 2080 days ago

This isn't really a good answer for someone who wants a regex engine. People use regexes for a lot of things where a full-fledged parser would be overkill, and I'm not sure what serialization has to do with anything.

bgorman 2080 days ago

I completed the make a lisp project in Reason (alternative syntax for OCaml) a few months ago.

I used the PCRE library, which has pretty much all the features you expect, and it is actively maintained. Note: the heavy lifting is done by C libraries.

https://opam.ocaml.org/packages/pcre/

If you want to see how I integrated it with the interpreter the code is here:

https://github.com/briangorman/reason-mal/blob/master/reader...

johnisgood 2079 days ago

Might be outdated (or not), but this page has tons and tons of examples, incl. lookahead: http://pleac.sourceforge.net/pleac_ocaml/patternmatching.htm...

And the GitHub page of the aforementioned library: https://mmottl.github.io/pcre-ocaml/

Joker_vD 2080 days ago

Were you going to parse S-expressions with regular expressions? I guess that saves you from learning how to write loops and conditionals, and what substr() is called (and what arguments it takes) in this new language, but isn't that against the point of learning a new language?

bgorman 2080 days ago

The make a lisp tutorial provides a PCRE regular expression to generate the tokens that are later fed into the reader.
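
As a rough sketch of that approach in Python (whose `re` module accepts the pattern unchanged), the tutorial's regexp, quoted further down this thread, can drive the whole tokenizer. The names `TOKEN_RE` and `tokenize` are chosen for this example and are not taken from the tutorial.

```python
import re

# The token pattern as quoted later in this thread: optional whitespace/commas,
# then one capture group holding the token itself.
TOKEN_RE = re.compile(
    r"""[\s,]*(~@|[\[\]{}()'`~^@]|"(?:\\.|[^\\"])*"?|;.*|[^\s\[\]{}('"`,;)]*)"""
)

def tokenize(source):
    # findall returns the capture group of every match. The pattern can match
    # the empty string at the end of the input, so empty tokens are dropped.
    return [tok for tok in TOKEN_RE.findall(source) if tok]

print(tokenize('(def! x "a b") ; comment'))
```

Comments come back as tokens here; a reader built on top of this would presumably skip any token that starts with `;`.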

bjoli 2080 days ago

Now, I am a grumpy old fart, but I would suggest making a one-pass recursive descent lexer+reader. That should be trivial for the MAL lisp. Writing an OCaml recursive descent parser should be amazingly straightforward, especially since you can just tail-call the different states.
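
A minimal sketch of what such a one-pass reader might look like, written in Python rather than OCaml to match the other code in this thread. It handles only lists, integers and symbols, and every function name here is hypothetical.

```python
def skip_ws(s, pos):
    # Commas count as whitespace in MAL.
    while pos < len(s) and (s[pos].isspace() or s[pos] == ","):
        pos += 1
    return pos

def read_atom(s, pos):
    start = pos
    while pos < len(s) and not (s[pos].isspace() or s[pos] in ",()"):
        pos += 1
    text = s[start:pos]
    try:
        return int(text), pos
    except ValueError:
        return text, pos          # anything that is not an integer is a symbol

def read_list(s, pos):
    # Called just after "(": read forms until the matching ")".
    items = []
    while True:
        pos = skip_ws(s, pos)
        if pos >= len(s):
            raise SyntaxError("missing ')'")
        if s[pos] == ")":
            return items, pos + 1
        item, pos = read_form(s, pos)
        items.append(item)

def read_form(s, pos=0):
    # Parse one form starting at s[pos]; return (value, next position).
    pos = skip_ws(s, pos)
    if pos >= len(s):
        raise SyntaxError("unexpected end of input")
    if s[pos] == "(":
        return read_list(s, pos + 1)
    if s[pos] == ")":
        raise SyntaxError("unexpected ')'")
    return read_atom(s, pos)

print(read_form("(+ 1 (* 2 3))")[0])
```

There is no separate token list: each function looks at the current character and either consumes it or hands off to another function, so the input is walked exactly once.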

Joker_vD 2079 days ago

While re-using an existing regexp like

    [\s,]*(~@|[\[\]{}()'`~^@]|"(?:\\.|[^\\"])*"?|;.*|[^\s\[\]{}('"`,;)]*)

is easier than writing a tokenizer manually (keeping track of offsets and looping and stuff), writing that regexp is definitely harder than writing the tokenizer like this:

    def isspace(c):        # MAL treats commas as whitespace, like [\s,] in the regexp
        return c.isspace() or c == ','

    def isspecial(c):      # matches c against []{}()'`~^@
        return c in "[]{}()'`~^@"

    def isquote(c):        # matches c against "
        return c == '"'

    def tokenize(s):
        curr, end = 0, len(s)
        while True:
            while curr < end and isspace(s[curr]):
                curr += 1

            if curr >= end:
                break

            if s[curr:curr + 2] == "~@":
                yield s[curr:curr + 2]
                curr += 2

            elif isspecial(s[curr]):
                yield s[curr]
                curr += 1

            elif isquote(s[curr]):
                start = curr
                curr += 1

                # check this condition out: you can totally support several quotes,
                # and accurately match the closing and opening ones. Imagine doing it
                # with a regexp: either duplicate it, or use some back-referencing magic

                while curr < end and s[curr] != s[start]:
                    # a backslash escapes the next character, so skip both
                    curr += 2 if s[curr] == '\\' else 1
                curr = min(curr + 1, end)   # include the closing quote, if present
                yield s[start:curr]

            elif s[curr] == ';':
                yield s[curr:]     # comment runs to the end of the (single-line) input
                break

            else:
                start = curr
                while curr < end and not (isspace(s[curr]) or isspecial(s[curr]) or isquote(s[curr])):
                    curr += 1
                yield s[start:curr]

Yeah, it's more verbose and somewhat repetitive, but on the other hand, it's way more readable, and debuggable too: with regexps, it's always a mystery which part of it exactly didn't match what you wanted or captured something you didn't want to match. Here, the loop invariants and preconditions are almost immediately obvious.

bjoli 2076 days ago

Not only that: having a generator produce the tokens means you can do it in one pass, while writing code that has the clarity of two passes.
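
A small sketch of that idea, assuming a deliberately simplified token generator (`lex`, which only knows parentheses and atoms) and a hypothetical reader `read_form` that pulls tokens from it on demand:

```python
import re

def lex(source):
    # Lazily yield parentheses and atoms; the regexp advances only when the
    # consumer asks for the next token.
    for match in re.finditer(r"[()]|[^\s()]+", source):
        yield match.group()

def read_form(tok, tokens):
    # tok is the first token of the form; tokens is the shared generator.
    if tok == ")":
        raise SyntaxError("unexpected ')'")
    if tok != "(":
        return tok
    items = []
    for inner in tokens:
        if inner == ")":
            return items
        items.append(read_form(inner, tokens))
    raise SyntaxError("missing ')'")

def read(source):
    tokens = lex(source)
    return read_form(next(tokens), tokens)

print(read("(a (b c) d)"))
```

The lexer and the reader are written as two separate stages, yet no intermediate token list is ever built: after reading one form from `lex("(a) (b)")`, the tokens of the second form are still waiting in the generator.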
