Hacker News new | ask | show | jobs
by sedachv 5052 days ago
To provide more precedents and a little history:

The first C "interpreters" I know of were for Lisp machines: Symbolics' C compiler (http://www.bitsavers.org/pdf/symbolics/software/genera_8/Use...) and Scott Burson's (hn user ScottBurson) ZetaC for TI Explorers/LMIs and Symbolics 3600s (now available under the public domain: http://www.bitsavers.org/bits/TI/Explorer/zeta-c/). Neither of them are interpreters, just "interactive" compilers like Lisp ones are.

I am writing a C to Common Lisp translator right now (https://github.com/vsedach/Vacietis). This is surprisingly easy because C is largely a small subset of Common Lisp. Pointers are trivial to implement with closures (Oleg explains how: http://okmij.org/ftp/Scheme/pointer-as-closure.txt but I discovered the technique independently around 2004). The only problem is how to deal with casting arrays of integers (or whatever) to arrays of bytes. But that's a problem for portable C software anyway. I think I'll also need a little source fudging magic for setjmp/longjmp. Otherwise the project is now where you can compile-file/load a C file just like you do a Lisp file by setting the readtable. There's a few things I need to finish with #includes, enums, stdlib and the variable-length struct hack, but that should be done in the next few weeks.

This should also extend to "compiling" C to other languages like JavaScript, without having to go through the whole "emulate LLVM or MIPS" garbage that other projects like that do. I think I figured out how to do gotos in JavaScript by using a trampoline with local CPS-rewriting, which is IMO the largest challenge for an interoperable C->JS translator.

As to how to do this for C++, don't ask me. According to the CERN people, CINT has "slightly less than 400,000 lines of code." (http://root.cern.ch/drupal/content/cint). What a joke.

3 comments

That Oleg link has me nerd-snipped.

What I can't wrap my head around is how one would implement pointer-arithmetic with these closures? C pointers are not just references to cells, but those cells are guaranteed to be contiguous (up to a certain limit, be it an VM allocation unit, say "page", or all available system unit in VM-less systems)

That is to say, C pointers are not like ML references. Along with SET and REF they also allow addition, subtraction, scaling, etc.

For closures to model C pointers, wouldn't they need to order the allocation of cells in some manner? say, big array? And if so, this could get expensive very quickly (worst-case being "modeling" of entire memory, i.e. emulation) without certifying compiler or at least some exhaustive pointer analysis.

Hope I'm wrong on this.

> What I can't wrap my head around is how one would implement pointer-arithmetic with these closures?

  (defstruct memptr
    mem
    (ptr 0))

  (defun allocate-memory (size) ;; shared by malloc and static allocation
    (make-memptr :mem (make-array size :adjustable t :initial-element 0)))

  (defstruct place-ptr
    closure)

  (defmacro vacietis.c:mkptr& (place) ;; need to deal w/function pointers
    (let ((new-value (gensym)))
      `(make-place-ptr :closure (lambda (&optional ,new-value)
                                  (if ,new-value
                                      (setf ,place ,new-value)
                                      ,place)))))

  (defun vacietis.c:deref* (ptr)
    (etypecase ptr
      (memptr (aref (memptr-mem ptr) (memptr-ptr ptr)))
      (place-ptr (funcall (place-ptr-closure ptr)))))

  (defun (setf vacietis.c:deref*) (new-value ptr)
    (etypecase ptr
      (memptr (setf (aref (memptr-mem ptr) (memptr-ptr ptr)) new-value))
      (plate-ptr (funcall (place-ptr-closure ptr) new-value))))

  (defmethod vacietis.c:+ ((x number) (y number))
    (+ x y))

  (defmethod vacietis.c:+ ((ptr memptr) (x integer))
    (make-memptr :mem (memptr-mem ptr) :ptr (+ x (memptr-ptr ptr))))

  (defmethod vacietis.c:- ((ptr1 memptr) (ptr2 memptr))
    (assert (eq (memptr-mem ptr1) (memptr-mem ptr2)) ()
            "Trying to subtract pointers from two different memory segments")
    (make-memptr :mem (memptr-mem ptr1) :ptr (- (memptr-ptr ptr1) (memptr-ptr ptr2))))
As you can see I don't do anything with type declarations, so arithmetic performance is going to be terrible. Multiple levels of pointer indirection work correctly with pointer arithmetic, since all pointers which can be legally added/subtracted are first-class objects.
The C standard basically only guarantees that pointer arithmetic works when the pointers involved all point to the same array object (it also allows for a pointer just off the end of an array object). Other pointer arithmetic or comparison is undefined behavior and an implementation can do whatever it pleases.
So I implemented this stricter definition of C pointers and it's neither interesting, nor representative of Real World uses that I know of. Need to investigate a bit more.
stricter definition of C pointers - this made me laugh, as the less-strict definition of C pointers is called undefined behavior.

This leads me to this statement based on what you said: interesting Real World C programs make use of undefined behavior

> interesting Real World C programs make use of undefined behavior

I assume you've worked with a significant amount of real world C programs? Because they surprisingly often do. The difficulty with porting many programs to 64-bit for instance, is due to relying on implementation defined behavior.

I assume you've worked with a significant amount of real world C programs?

Only embedded systems (AVR & PIC24). I have much more experience in C++, which I've used for both Desktop apps and telco server components.

The beauty of C (over C++) is that the standard is actually readable. C++ especially is a quagmire of undefined behavior. The scary thing about C/C++ is that its easy to hit undefined (or, at least, as you state, implementation defined) behavior and not even realize. Often the code looks valid, does what it looks like it does, yet is actually undefined or implementation defined and will break elsewhere.

With that said, while I don't expect everyone to have memorized the standard, I do hope most would have at least enough familiarity to avoid most cases of undefined behavior.

Emscripten in no way emulates LLVM (as far as I know it's the only C-to-JS compiler that uses LLVM, and hence must be what you're referring to!) — it compiles LLVM to JS, with no emulation, and there's currently work going on to re-implement it as a LLVM backend (it currently is an entirely separate codebase that takes LLVM bitcode in and spits JS out).
My mistake, I thought Emscripten was an LLVM VM in JS. I'll take a look at it to see what the output actually does.
Looking at issues in https://github.com/kripken/emscripten/wiki/Filesystem-Guide it seems to be there would be a big win in having an LLVM VM - you'd be able to have synchronous IO and have threads.
The only problem is how to deal with casting arrays of integers (or whatever) to arrays of bytes. But that's a problem for portable C software anyway.

Byte-wise access to objects is legal and completely portable between conforming implementations. What isn't portable are arbitrary type conversions through pointer casts as these violate the effective typing rules. Such casts may break in practice due to mis-alignment or because of aliasing behind the optimizers back.

In a way, C is a strongly typed language - the type system is just really unsound.

> Byte-wise access to objects is legal and completely portable between conforming implementations.

That makes absolutely no sense. Just think about endianness for example. Type conversions in C are extremely tricky, and in many cases are not guaranteed to be portable across different compilers even on the same architecture. There is a good explanation of what you can and cannot count on in chapter 6 of Harbison and Steele's C A Reference Manual.

There's also a perfectly good list of what you can and cannot count on in the ISO C standard, no need for secondary literature.

While the values of the bytes are not specified, the ability to get at them is, and a conforming implementation needs to provide this ability.