> Building lists out of pairs and then using them as your intermediate format creates a lot of garbage.
I wonder if a Lisp-like language which used vectors rather than pairs as its fundamental data structure might be a better fit for severely memory-constrained systems? On average, vectors take up half the memory consumption of lists.
Such a language would end up looking rather different to Lisp though. Cons cells encourage CAR/CDR and recursion. A vector-centric language would naturally lead to a more iterative programming style.
I know that, but that wasn't my point. Yes, all major Lisps support vectors; but they all use lists a lot more than they use vectors. One of Lisp's major features is "code as data", but that code is almost all lists of lists (as opposed to vectors of vectors). I was talking about a Lisp-like language in which vectors, not lists, are the primary data structure. Possibly even one which doesn't have cons cells at all, only vectors. I am suggesting such a language might possibly work better in severely memory poor environments than a more traditional Lisp would.
Then you're also aware that one of the beautiful things of Lisp is how it allows to abstract the hardware beyond the core primitives, there is nothing that prevents an optimizing Lisp compiler to map what looks like lists to vectors at the machine code level.
One reason why such optimization isn't common is most likely due to the current usage of Lisp across the industry, and the commercial implementations being a niche product.
In principle, you can save the same amount of memory with CDR-coding.
In practice though, you run into some issues (1) RPLACD/set-car! makes CDR-coding much more complicated (I suppose you can just ban it – in Racket, pairs are immutable by default, although there is also a separate mutable pair type); (2) CDR-coding generally only works if you construct the list up-front, the existence of CONS can encourage a coding style in which you don't do that; (3) to fix (2), you can force the list to be CDR-coded by duplicating it, but you have to remember to do that at right points – if you don't do it, you'll miss out on the benefits of CDR-coding, but if you do it when you don't have to you are unnecessarily harming performance; (4) since CDR-coded and non-CDR-coded lists are the same type, and indistinguishable without peeling off the covers of the implementation, it makes it harder for the programmer to address (2) and (3) correctly.
Well 1MB is more than enough; that was the maximum virtual address space size of a PDP-10 where many of the precursors to common lisp were developed. You can do a fair amount with 100kB as well. uLisp runs with 34k minimum (32k ROM + 2k RAM).
The RAM usage is tied to the size of the reachable set, plus some slack filled with garbage that depends on how you tune the garbage collection thresholds.
By today's standards, the RAM usage isn't necessarily huge.
Here is the TXR Lisp compiler recompiling stdlib/compiler.tl -> stdlib/compiler.tlo, as seen in top:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11488 kaz 20 0 17800 14804 2964 R 98.1 0.7 0:07.07 txr
^^^^^ ^^^^^
On the order of a bash session. It's a lot of RAM by 1982 standards at the institution level, and even 1992 standards at the consumer level, but today it means nothing.
You can easily see a Bash process a footprint on that order.
It could be reduced by tuning the garbage collector. One way to do that is to build for less memory use (useful for embedded). Here it is with txr rebuilt using #define CONFIG_SMALL_MEM 1 in config.h:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
12838 kaz 20 0 11964 9768 3140 R 99.0 0.5 0:10.39 txr
Bash footprints for comparison:
$ ps aux | head -1 ; ps aux | grep bash
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
kaz 1093 0.0 0.1 9288 2132 pts/2 Ss+ May15 0:01 -bash
kaz 2833 0.0 0.0 8904 1992 pts/0 Ss May15 0:00 -bash
kaz 3509 0.0 0.2 10532 4988 pts/1 Ss+ May15 0:28 -bash
kaz 7898 0.0 0.1 8968 2212 pts/3 Ss+ May20 0:00 -bash
Lists are used for everything: the compiler produces a list-based assembly code which is used from then through assembly. There is an optimizer which divides it into basic blocks, which are objects put into a graph, but the instructions still being lists. The peephole pattern matching is done on lists. The compiler does not bother using destructive append (nconc) for stitching together fragments of code; just straight garbage-generating appends. Same with most of the other rewriting that happens later.
In a computer in 1960, your compiler would be capped to the physical memory available. That would be the RAM use. The garbage collector would have to be called whenever the memory is exhausted, or else the show would stop. A successful compilation would demonstrate that the compiler needed no more memory than what the machine has. The closer its actual usage would be to the available memory, the longer it would take, due to the frequent garbage collections required to stay afloat.
I'd say that given people's expectations today, shaped by experiences with everyday software, they likely greatly overestimate how much RAM you need for Lisp compiling.
Appendix: the VM footprint number doesn't really give us a breakdown since it includes executable and shared libs mappings. I ran the second compile again with the memory-optimized build, and this time captured a pmap.
Here you can also see the full command, confirming the compile job:
You can see the 11700K fairly closely matches the earlier VIRT figure of 11964.
Anyway, look at the [ anon ] heap area: it's like 6-something megs. That's it. That's where all the dynamic Lisp stuff is. All the predefined symbols and function bindings and whatnot, and all the objects allocated during the compile job.
libz is new; I integrated libz into TXR in just the most recent release. It happens to be number 277, so I code named it (L)Z77.
I wonder if a Lisp-like language which used vectors rather than pairs as its fundamental data structure might be a better fit for severely memory-constrained systems? On average, vectors take up half the memory consumption of lists.
Such a language would end up looking rather different to Lisp though. Cons cells encourage CAR/CDR and recursion. A vector-centric language would naturally lead to a more iterative programming style.