by laurenth 695 days ago
Author here,

That's correct! Unlike Bash and other modern shells, the POSIX standard doesn't include arrays or any other data structures. We work around this limitation by using arithmetic expansion and indexed shell variables (the ones starting with `_`, as you noted) to get random memory access.
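To make the idea concrete, here is a minimal sketch of indexed variables driven by arithmetic expansion. The helper names `mem_set` and `mem_get` are made up for illustration and are not Pnut's actual runtime; the sketch only handles integer values.

```sh
#!/bin/sh
# Hypothetical helpers for illustration; not Pnut's actual runtime.

# Store integer $2 in "cell" $1 by assigning to the variable named _<index>.
# $1 and $2 are expanded before the arithmetic is evaluated, so
# `_$1 = $2` becomes e.g. `_3 = 42`.
mem_set() { : $(( _$1 = $2 )); }

# Copy cell $2 into the variable named by $1 (no eval, no subshell).
mem_get() { : $(( $1 = _$2 )); }

mem_set 3 42
idx=3
mem_set $(( idx + 1 )) $(( _$idx + 1 ))   # cell 4 = cell 3 + 1
mem_get out 4
echo "$out"   # prints 43
```

Because the index is spliced into the variable name before evaluation, any integer expression can serve as an "address", which is what gives random access without arrays.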


I experimented with something similar in the past to mimic multidimensional arrays, and depending on the implementation this can absolutely _kill_ performance. IIRC, Dash does a linear lookup of variable names, so when you create tons of variables each lookup takes longer and longer.
I hope you're not compiling C to sh for performance reasons.
It's not about performance, it's about viability. If the result is so slow that it's unusable, it doesn't matter how portable it ends up being.
We haven't found this to be an issue for Pnut. One of the metrics we use for performance is how long it takes to bootstrap Pnut: Dash takes around a minute, which is about the same as Bash. This is with Pnut allocating around 150KB of memory when compiling itself, showing that Dash can still be useful even when hundreds of KBs are allocated.

One thing we did notice is that subshells can be a bottleneck when the environment is large, and so we avoided subshells as much as possible in the runtime library. Did you observe the same in your testing?
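As a rough sketch of what avoiding subshells can look like: instead of printing a result and capturing it with `$(...)`, a function can assign its result to a variable whose name the caller passes in. The function names below are hypothetical and not taken from Pnut's runtime library.

```sh
#!/bin/sh
# Hypothetical functions for illustration only.

# Returns its result on stdout; the caller needs $(...), which
# typically runs the function in a subshell.
add_stdout() { echo $(( $1 + $2 )); }

# Returns its result by assigning to the variable named in $1,
# so everything stays in the current shell.
add_var() { : $(( $1 = $2 + $3 )); }

sum_a=$(add_stdout 1 2)   # subshell
add_var sum_b 1 2         # same result, no subshell
echo "$sum_a $sum_b"      # prints "3 3"
```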

> We haven't found this to be an issue for Pnut. One of the metrics we use for performance is how long it takes to bootstrap Pnut: Dash takes around a minute, which is about the same as Bash. This is with Pnut allocating around 150KB of memory when compiling itself, showing that Dash can still be useful even when hundreds of KBs are allocated.

Interesting. When you say "even when hundreds of KBs are allocated", do you mean this is allocating variables with large values, or tons of small variables? My case was the latter, and with that I saw a noticeable slowdown on Dash.

Simplest repro case:

  $ cat many_vars_bench.sh
  #!/bin/sh
  
  _side=500
  
  i=0
  while [ "${i}" -lt "${_side}" ]; do
    j=0
    while [ "${j}" -lt "${_side}" ]; do
      eval "matrix_${i}_${j}=$((i+j))" || exit 1
      : $(( j+=1 ))
    done
    i=$((i+1))
  done
  
  $ time bash many_vars_bench.sh
  5.60user 0.12system 0:05.78elapsed 99%CPU (0avgtext+0avgdata 57636maxresident)k
  0inputs+0outputs (0major+13020minor)pagefaults 0swaps
  
  $ time dash many_vars_bench.sh
  40.75user 0.14system 0:41.22elapsed 99%CPU (0avgtext+0avgdata 19972maxresident)k
  0inputs+0outputs (0major+4951minor)pagefaults 0swaps
Dash was ~7 times slower (41.2 s vs. 5.8 s elapsed). Increase the side of the square "matrix" for a proportionally bigger slowdown (this one uses 250003 variables).

> One thing we did notice is that subshells can be a bottleneck when the environment is large, and so we avoided subshells as much as possible in the runtime library. Did you observe the same in your testing?

Yes, launching a new process is generally expensive and so is spawning a subshell. If the shell is something like Bash (with a lot of startup/environment setup cost) then you'll feel this more than something like Dash, where the whole point was to make the shell small and snappy for init scripts: https://wiki.ubuntu.com/DashAsBinSh#Why_was_this_change_made...

In my limited testing, Bash generally came out on top for single-process performance, while Dash came out on top for scripts with more use of subshells.

I used almost the same idea, but backed by files, in my shlibs project: https://github.com/steveschnepp/shlibs
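For readers wondering what a file-backed variant might look like, here is a minimal sketch that stores one file per cell. The helper names and directory layout are assumptions made up for this example, not the API of the linked repository.

```sh
#!/bin/sh
# Hypothetical helpers; not the API of the linked shlibs repository.
store="${TMPDIR:-/tmp}/cells.$$"
mkdir -p "$store" || exit 1

# Write value $2 to the file for cell $1.
cell_set() { printf '%s\n' "$2" > "$store/$1"; }

# Read the first line of cell $2 into the variable named by $1.
cell_get() { IFS= read -r "$1" < "$store/$2"; }

cell_set 3_4 hello
cell_get val 3_4
echo "$val"   # prints hello
rm -rf "$store"
```

This trades variable lookups for filesystem calls, so its performance depends on the filesystem rather than on how the shell stores variables.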