by zimpenfish 1590 days ago
> Even though Python isn't the fastest language out there, it's likely still faster than the shell command above.

Taking these two command lines:

   jo -p name=JP object=$(jo fruit=Orange point=$(jo x=10 y=20) number=17) sunday=false >/dev/null

   python -c 'import json;print(json.dumps({"name": "JP", "object": {"fruit": "Orange", "point": {"x": 10, "y": 20}, "number": 17}, "sunday": False}))' >/dev/null
I ran 1000 iterations each for jo (x86_64, Rosetta2), python2 (x86_64, Rosetta2), jo (arm64), and python3 (arm64), with `tai64n` doing the timing:

    2022-02-05 21:25:38.357228500 start-jo-x86
    2022-02-05 21:25:45.319337500 stop-jo
    2022-02-05 21:25:45.319338500 start-python2-x86
    2022-02-05 21:26:18.876235500 stop-python2-x86
    2022-02-05 21:26:18.876235500 start-jo-arm
    2022-02-05 21:26:22.316063500 stop-jo-arm
    2022-02-05 21:26:22.316064500 start-python3-arm
    2022-02-05 21:26:40.379063500 stop-python3-arm
I make it: 7s for jo-x86, 33.5s for python2-x86, 3.5s for jo-arm, 18s for python3-arm.
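
For anyone who wants to check that arithmetic, here is a minimal Python sketch that recomputes the durations from the log above. The timestamps are copied from the log with the trailing nanosecond digits dropped (`datetime` only holds microseconds); the `LOG` layout and the `elapsed_seconds` helper are illustrative names, not part of the test script.

```python
from datetime import datetime

ITERATIONS = 1000  # each command was run 1000 times

# (start, stop) pairs copied from the tai64n log, truncated to microseconds.
LOG = {
    "jo-x86":      ("21:25:38.357228", "21:25:45.319337"),
    "python2-x86": ("21:25:45.319338", "21:26:18.876235"),
    "jo-arm":      ("21:26:18.876235", "21:26:22.316063"),
    "python3-arm": ("21:26:22.316064", "21:26:40.379063"),
}

def elapsed_seconds(start, stop):
    """Return the wall-clock seconds between two HH:MM:SS.ffffff stamps."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(stop, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()

durations = {name: elapsed_seconds(*pair) for name, pair in LOG.items()}

for name, secs in durations.items():
    per_run_ms = secs / ITERATIONS * 1000
    print(f"{name}: {secs:.2f}s total, {per_run_ms:.2f}ms per invocation")

# Ratio of Python to jo on the same architecture.
print(f"x86: {durations['python2-x86'] / durations['jo-x86']:.1f}x")
print(f"arm: {durations['python3-arm'] / durations['jo-arm']:.1f}x")
```

Both ratios come out close to 5x, and each per-invocation figure is simply the total divided by 1000.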

Test script is at https://pastebin.com/4tTVrDia


python3 is (relatively) slow to start up, and this got significantly worse with the 2->3 migration:

  $ time python3 -c ''
  real    0m0.029s

  $ time python2 -c ''
  real    0m0.010s

  $ time bash -c ''
  real    0m0.001s
Which means you probably don't want Python scripts on a busy webserver being called from classic cgi-bin (do people still use those?), or run as the -exec argument to a "find" iterating over many thousands of files. There are probably a couple more such examples. For most use cases, though, that's still fast enough.
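
If you'd rather average over several runs than trust a single `time`, a rough sketch like the following could do it. It assumes the interpreter you care about is the one running the script (`sys.executable`); `mean_startup_seconds` is a made-up helper name, and the figure includes process-spawn overhead from `subprocess`, so treat it as an upper bound.

```python
import subprocess
import sys
import time

def mean_startup_seconds(argv, runs=10):
    """Average wall-clock time to launch argv and wait for it to exit."""
    start = time.perf_counter()
    for _ in range(runs):
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
    return (time.perf_counter() - start) / runs

# Equivalent of `python3 -c ''`, using whichever interpreter runs this script.
startup = mean_startup_seconds([sys.executable, "-c", ""])
print(f"mean startup: {startup * 1000:.1f}ms")
```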
I get 14,823s for python3 and 4,667s for jo on my system.

I also wrote my own tool, xidel [1]:

    time for i in $(seq 1 $count); do xidel -se '{"name": "JP", "object": {"fruit": "Orange", "point": {"x": 10, "y": 20}, "number": 17}, "sunday": false()}' > /dev/null; done     

    
which gives me 1,575s

But if you actually want to repeat something a thousand times, you would use a loop in the query for 0,017s:

    time xidel -se 'for $i in 1 to 1000 return {"name": "JP", "object": {"fruit": "Orange", "point": {"x": 10, "y": 20}, "number": 17}, "sunday": false()}'  > /dev/null
  
  
(a python3 loop gives me 0,029s)

[1] https://videlibri.de/xidel.html

What about how long it takes a human to read and debug it once it gets beyond trivial?
Fair question. I think jo does tend to get more crufty if you're doing anything reasonably complex with multilevel structures, especially with arrays.

But jo does come into its own when you're wanting to use shell variables.

    > jo mypid=$$ set_or_not=$WEASEL

    > python -c 'import json,os;print(json.dumps({"set_or_not":os.getenv("WEASEL"), "mypid":os.getpid()}))'
I'm not disagreeing that python is slow, but why would you choose to do either in a shell script?

    $ time cat<<EOL
    {"name": "JP", "object": {"fruit": "Orange", "point": {"x": 10, "y": 20}, "number": 17}, 
    "sunday": false}  
    EOL
    {"name": "JP", "object": {"fruit": "Orange", "point": {"x": 10, "y": 20}, "number": 17}, 
    "sunday": false}

    real 0m0.002s
    user 0m0.000s
    sys  0m0.002s
> why would you choose to do either in a shell script?

In the normal case, you'd have variables interpolated in there, not static JSON. And then you run into the quoting problems that jo was created to work around...
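
To illustrate the quoting problem in one language, here is a small Python sketch. The value is a made-up example of what a shell variable might contain; pasting it into a JSON template by hand produces invalid JSON, while a real serializer (which is the job jo does in the shell) escapes it.

```python
import json

# Hypothetical variable content containing double quotes.
value = 'He said "hi"'

# Naive interpolation: paste the value straight into a JSON template.
naive = '{"set_or_not": "' + value + '"}'
try:
    json.loads(naive)
    naive_is_valid = True
except json.JSONDecodeError:
    naive_is_valid = False

# Letting a serializer build the document escapes the quotes.
safe = json.dumps({"set_or_not": value})

print("naive:", naive, "valid:", naive_is_valid)
print("safe: ", safe)
```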

Now put a thousand of those JSON objects in a list, invoking jo for every element.
That wasn't the claim made in the original post though, was it? The claim was that the Python snippet would be quicker than the jo snippet.

"Even though Python isn't the fastest language out there, it's likely still faster than the shell command above."

Which it most definitely is not - it's 5x slower.

(Probably not a huge issue in the real world if you're writing a shell script, mind, given that bash itself isn't a performance demon. But claims have to be tested.)

That’s because you’re making a false assumption about the environment prior to executing the statement.

If you are in a shell session and have to choose between executing python -c or calling jo, the latter is faster as you’ve demonstrated. But that’s not a realistic assumption.

Statements like these are almost certainly part of some combined work. The data you’re feeding to jo comes from somewhere. Its output is written somewhere.

You can’t convince me that if you’re already inside some Python script, that invoking json.dumps() is slower than calling jo from within a shell script.

At no point did I claim that launching Python AND running that json.dumps() is faster than running that shell command. I only stated that the json.dumps() is.
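
For what it's worth, the cost of `json.dumps()` on its own, with the interpreter already running, can be measured with a sketch like this. It uses the same object as the original example; the 1000-call count just mirrors the benchmark above, and the result will vary by machine.

```python
import json
import timeit

payload = {
    "name": "JP",
    "object": {"fruit": "Orange", "point": {"x": 10, "y": 20}, "number": 17},
    "sunday": False,
}

# Time 1000 serializations inside an already-running interpreter,
# so interpreter startup is not part of the measurement.
total = timeit.timeit(lambda: json.dumps(payload), number=1000)
print(f"1000 x json.dumps: {total * 1000:.2f}ms total")
```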

> if you’re already inside some Python script [...]

You're not going to shell out to `jo` and that's fine - it's not what `jo` was created for; it's explicitly a shell command to help you work around the annoyance of getting quoting right when constructing JSON from the command line (which I've had to do a lot, and I'm pretty sure many people have to).

> If you are in a shell session [and want to create JSON] ... that’s not a realistic assumption.

Of course it is. People create JSON in shell scripts all the time! That's why things like `jq` exist - because this is what people do!

I actually did that for a more realistic comparison.

Example for jo:

  docker run --rm -it debian bash
  apt update && apt install -y jo nano
  nano bash-loop.sh && chmod +x bash-loop.sh
  
  #!/bin/bash
  for ((i=0;i<1000;i++)); 
  do 
     jo -p name=JP object=$(jo fruit=Orange point=$(jo x=10 y=20) number=17) sunday=false
  done
  
  time ./bash-loop.sh >/dev/null
Example for Python 3:

  docker run --rm -it debian bash
  apt update && apt install -y python3 nano
  nano python-loop.py
  
  import json
  for i in range(1000):
    print(json.dumps({"name": "JP", "object": {"fruit": "Orange", "point": {"x": 10, "y": 20}, "number": 17}, "sunday": False}))
  
  time python3 python-loop.py >/dev/null
Versions:

  Debian GNU/Linux 11 (bullseye)
  jo 1.3
  Python 3.9.2
Results for jo:

  real    0m2.230s
  user    0m1.106s
  sys     0m1.076s
Results for Python 3:

  real    0m0.027s
  user    0m0.021s
  sys     0m0.005s
So it seems you're probably right about how per-invocation costs add up over a large number of invocations in non-trivial cases!

Note: jo pretty-prints because of the "-p" parameter, which Python doesn't do here, so this might not be a 1:1 comparison. It would be better to remove it, though when I did that the performance improvement was maybe 1%, which is not significant.

Admittedly, it would be nice to test with actually random data to make sure that nothing gets optimized away, for example by replacing one of the numbers in the JSON with a changing value such as the UNIX timestamp. But then you'd have to either prepare all of the data beforehand (to avoid differences due to using Python or one of the GNU tools to get those timestamps) or time that part of the execution separately.

Edit to explain my rationale: Why bother doing this? Because I disagree with the sibling comment:

> The claim was that the Python snippet would be quicker than the jo snippet.

In my eyes that's almost meaningless, since in practice you'll actually care about the runtimes when working with larger amounts of data or really large files. So that is what should be tested, not just the startup times, which become irrelevant in most real-world programs, except when you make a separate invocation per request, which you sometimes shouldn't do.

Edit #2: here's a lazy edit that uses the UNIX time and makes the data more dynamic, ignoring the overhead to retrieve this value, to get a ballpark figure.

Use time value for jo:

  jo -p name=JP object=$(jo fruit=Orange point=$(jo x=10 y=20) number=$(date +%s)) sunday=false
Use time value for Python 3:

  import time
  ...
  print(json.dumps({"name": "JP", "object": {"fruit": "Orange", "point": {"x": 10, "y": 20}, "number": int(time.time())}, "sunday": False}))
Results for jo:

  real    0m2.794s
  user    0m1.422s
  sys     0m1.313s
Results for Python 3:

  real    0m0.027s
  user    0m0.020s
  sys     0m0.006s
Seems like nothing changed much.

Edit #3: probably should have started with a test to verify whether the initially observed performance differences (Python being slower due to startup time) were also present.

Single iteration results for jo:

  real    0m0.003s
  user    0m0.000s
  sys     0m0.002s
Single iteration results for Python 3:

  real    0m0.022s
  user    0m0.017s
  sys     0m0.004s
Seems to also more or less match those results.
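
Putting the numbers above together gives a rough break-even point. This is only a sketch under a simple linear assumption: the jo loop costs a fixed amount per iteration, and the Python script costs a fixed startup plus a small amount per item. The timings are the ones measured above (first 1000-iteration run); the model and variable names are illustrative.

```python
# Measured wall-clock times from the Debian container runs above (seconds).
JO_1000 = 2.230      # bash loop, 1000 iterations of the nested jo command
PY_SINGLE = 0.022    # python3 script, 1 iteration
PY_1000 = 0.027      # python3 script, 1000 iterations

# Assumed linear models:
#   jo_cost(n) = jo_per_iter * n
#   py_cost(n) = py_startup + py_per_item * n
jo_per_iter = JO_1000 / 1000
py_per_item = (PY_1000 - PY_SINGLE) / 999
py_startup = PY_SINGLE - py_per_item

# Solve jo_cost(n) == py_cost(n) for n.
break_even = py_startup / (jo_per_iter - py_per_item)

print(f"jo: {jo_per_iter * 1000:.2f}ms per iteration")
print(f"python3: {py_startup * 1000:.1f}ms startup + {py_per_item * 1e6:.1f}us per item")
print(f"break-even at roughly {break_even:.1f} iterations")
```

Under those assumptions the single Python process overtakes the jo loop after roughly ten iterations, which fits both the single-iteration and the 1000-iteration results.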