by kragen 291 days ago

I think this might depend on the language you're writing in.

Historically, at least, it's pretty verbose to define a data type in Python compared to languages that are designed more for writing compilers. Consider these definitions from my prototype Bicicleta interpreter, which is written in ML, specifically OCaml:

    type methods = NoDefs
                               (* name, body, is_positional ... *)
                   | Definition of string * bicexpr * bool * methods
    and bicexpr = Name of string
                  | Call of bicexpr * string
                  | Literal of string option * methods
                  | Derivation of bicexpr * string option * methods
                  | StringConstant of string
                  | Integer of int
                  | Float of float
                  | NativeMethod of (lookup -> bicobj)

Those ten lines of code would be ten classes in Python with an average of 1.6 attributes each. Using dataclasses or attrs, that would be 36 lines of code, and then (if you're doing it the OO way) every function that I defined on one of these OCaml types becomes a method implemented in each class that implements a particular protocol, with a copy of its argument signature in every class. (If you used namedtuple instead, it's no less code, but you write it on fewer lines.) So, for example, this function on bicexprs

    let rec freevars = function
        Name n -> stringset [n]
      | Integer _ | StringConstant _ | Float _ -> stringset ["prog"]
      | NativeMethod _ -> stringset []
      | Literal (Some selfname, methods) -> 
            StringSet.diff (freevars_methods methods) (stringset [selfname])
      | Literal (None, methods) -> freevars_methods methods
      | Derivation(object_, self, methods) ->
            StringSet.union (freevars object_) (freevars (Literal(self, methods)))
      | Call(object_, _) -> freevars object_

becomes six to eight method definitions in the different classes. (You can cut it down to six if you define an abstract base class for the constant classes.) And Literal.freevars needs an if-then-else. So that's another 20 lines of code.
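As a rough illustration of the OO version described above, here is a minimal sketch using dataclasses. It is not code from the Bicicleta interpreter. The field names (`value`, `fn`, `selfname`, `method`), the `Constant` base class, the choice to model `methods` as a tuple of `Definition` objects, and the behavior of `freevars_methods` (the union of the free variables of each method body) are all assumptions, since the original helper isn't shown.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


class Expr:
    """Base class standing in for the OCaml bicexpr type."""

    def freevars(self) -> frozenset[str]:
        raise NotImplementedError


@dataclass(frozen=True)
class Definition:
    name: str
    body: Expr
    is_positional: bool


def freevars_methods(methods: tuple[Definition, ...]) -> frozenset[str]:
    # Assumption: union of the free variables of every method body.
    return frozenset().union(*(d.body.freevars() for d in methods))


@dataclass(frozen=True)
class Name(Expr):
    name: str

    def freevars(self) -> frozenset[str]:
        return frozenset([self.name])


class Constant(Expr):
    """Hypothetical shared base for the three constant classes."""

    def freevars(self) -> frozenset[str]:
        return frozenset(["prog"])


@dataclass(frozen=True)
class StringConstant(Constant):
    value: str


@dataclass(frozen=True)
class Integer(Constant):
    value: int


@dataclass(frozen=True)
class Float(Constant):
    value: float


@dataclass(frozen=True)
class NativeMethod(Expr):
    fn: Callable[..., object]

    def freevars(self) -> frozenset[str]:
        return frozenset()


@dataclass(frozen=True)
class Literal(Expr):
    selfname: Optional[str]
    methods: tuple[Definition, ...]

    def freevars(self) -> frozenset[str]:
        # The if-then-else that replaces the two OCaml Literal cases.
        free = freevars_methods(self.methods)
        if self.selfname is not None:
            return free - {self.selfname}
        return free


@dataclass(frozen=True)
class Derivation(Expr):
    object_: Expr
    selfname: Optional[str]
    methods: tuple[Definition, ...]

    def freevars(self) -> frozenset[str]:
        literal = Literal(self.selfname, self.methods)
        return self.object_.freevars() | literal.freevars()


@dataclass(frozen=True)
class Call(Expr):
    object_: Expr
    method: str

    def freevars(self) -> frozenset[str]:
        return self.object_.freevars()
```

With the shared `Constant` base, this has six `freevars` definitions (plus the abstract one on `Expr`), and the logic of the single OCaml function ends up spread across the whole file.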

Python does support pattern-matching now, so functions like this might not be any more verbose than the ML version if you program them the same way instead of in the OO fashion. I haven't tried using Python pattern-matching, so I don't really know.

In general, though, Python is more verbose than ML-family languages for this kind of thing by a factor of about 2–4, and that's before you count the test code you need in Python to get the kind of confidence in correctness that ML's type-checking gives you with no extra code. To my knowledge, Mypy doesn't do the kinds of pattern-matching-exhaustiveness checks that ML compilers do.

I've sometimes "cheated" by trying to write code like this in Python using regular tuples rather than named tuples. You can definitely make it work, but it's a real pain to debug.
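A small hypothetical sketch of why the plain-tuple style is hard to debug. The tag strings and field order here are made up for illustration. Nothing checks a node's shape when it is built, so a mistake only surfaces later, somewhere else, with an error that doesn't point at the bad node.

```python
def freevars(expr):
    """Free variables of a tuple-encoded expression: (tag, field, ...)."""
    tag = expr[0]
    if tag == "name":
        # ("name", identifier)
        return frozenset([expr[1]])
    if tag in ("integer", "string", "float"):
        # ("integer", value) and friends
        return frozenset(["prog"])
    if tag == "call":
        # ("call", object_expr, method_name)
        return freevars(expr[1])
    raise ValueError(f"unknown node: {expr!r}")


good = ("call", ("name", "x"), "foo")
bad = ("call", "foo", ("name", "x"))  # fields swapped by mistake

print(freevars(good))

try:
    freevars(bad)
except ValueError as err:
    # The message names the inner string 'foo', not the malformed call node.
    print(err)
```

Building `bad` succeeds silently. The failure shows up one recursion level down, where the string `"foo"` is treated as a node and its first character is read as the tag.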

Quoting Andy Chu from https://andychu.net/projects/:

> Python is not the right language for [implementing] languages. I will use OCaml for subsequent projects like this.

Python does have GC and dynamic dispatch, though, and those count for a lot.

3 comments

vidarh 290 days ago

GC doesn't matter much for a simple compiler, as you either don't need to allocate much (single-pass, Wirth-style compilers that generate code inline) or most of what you allocate becomes garbage all at once at the end (the AST).

In my half-finished Ruby compiler prototype, I went a long time without adding a GC, even before I added type tagging, when every integer was still allocated on the heap. It was fine to just leak, because the compiler generally isn't long-running.