by jpivarski 1648 days ago
I've been doing some tests with JSON recently, so I have some exact numbers for a particular sample. Suppose you have JSON like the following:

    MULTIPLIER = int(10e6)   # 10 million repetitions of the 3 lists below
    json_string = b"[" + b", ".join([
        b'[{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}, {"x": 3.3, "y": [1, 2, 3]}],' +
        b'[],' +
        b'[{"x": 4.4, "y": [1, 2, 3, 4]}, {"x": 5.5, "y": [1, 2, 3, 4, 5]}]'
    ] * MULTIPLIER) + b"]"
It's a complex structure with many array elements (30 million), but those elements have a common type. It's 1.4 GB of uncompressed JSON. If I convert it to a Parquet file (a natural format for data like these), it's 1.6 GB of uncompressed Parquet. It could get much smaller with compression, but since the same numbers repeat throughout the sample above, compressing it would not be a fair comparison. (Note that the JSON uses only 3 bytes per float, whereas uncompressed Parquet uses 8 bytes per float. A fairer test would generate something like the above with random numbers and then compress the Parquet.)
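
As a rough check on those sizes, here is a back-of-the-envelope sketch that computes the JSON length without building the full string and estimates a binary size. The 8-bytes-per-number figure for both floats and integers is an assumption, and the estimate ignores list offsets and file metadata, so treat it as a ballpark rather than a model of the actual Parquet layout.

```python
import json

# One repetition of the sample, exactly as in the snippet above.
unit = (
    b'[{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}, {"x": 3.3, "y": [1, 2, 3]}],'
    b'[],'
    b'[{"x": 4.4, "y": [1, 2, 3, 4]}, {"x": 5.5, "y": [1, 2, 3, 4, 5]}]'
)
MULTIPLIER = int(10e6)

# JSON size: the repetitions, the b", " separators between them, and the two outer brackets.
json_bytes = len(unit) * MULTIPLIER + 2 * (MULTIPLIER - 1) + 2

# Count the numbers in one repetition by parsing it.
records = [record for sublist in json.loads(b"[" + unit + b"]") for record in sublist]
floats_per_unit = len(records)                           # one "x" per record
ints_per_unit = sum(len(record["y"]) for record in records)

# Assumption: every number takes 8 bytes; offsets and metadata are ignored.
binary_bytes = (floats_per_unit + ints_per_unit) * MULTIPLIER * 8

print(f"JSON:   {json_bytes / 1e9:.2f} GB")
print(f"binary: {binary_bytes / 1e9:.2f} GB")
```

Under these assumptions the two numbers come out at about 1.46 GB and 1.60 GB, in the same range as the figures quoted above.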

Reading the JSON into Python dicts and lists using the standard library `json` module takes 70 seconds and uses 20 GB of RAM (for the dicts and lists, not counting the original string).
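
A measurement like this can be sketched with the standard library alone. This is not necessarily how the numbers above were obtained: it uses `time.perf_counter` and `tracemalloc`, and a much smaller multiplier (an assumption, so that it finishes quickly). Note that `tracemalloc` itself slows parsing down, so the elapsed time is only indicative.

```python
import json
import time
import tracemalloc

unit = (
    b'[{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}, {"x": 3.3, "y": [1, 2, 3]}],'
    b'[],'
    b'[{"x": 4.4, "y": [1, 2, 3, 4]}, {"x": 5.5, "y": [1, 2, 3, 4, 5]}]'
)
MULTIPLIER = 10_000  # scaled down from the 10 million used in the comment
json_string = b"[" + b", ".join([unit] * MULTIPLIER) + b"]"

# Start tracing after the input exists, so the original string is not counted.
tracemalloc.start()
start = time.perf_counter()
python_objects = json.loads(json_string)
elapsed = time.perf_counter() - start
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"input:   {len(json_string) / 1e6:.2f} MB of JSON")
print(f"parsed:  {current / 1e6:.2f} MB of Python dicts and lists")
print(f"elapsed: {elapsed:.3f} s")
```

Even at this reduced scale, the parsed objects should take up several times more memory than the JSON text they came from.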

Reading the Parquet file into an Awkward Array takes 3.3 seconds and uses 1.34 GB of RAM (for just the array).

Reading the JSON file into an Awkward Array takes 39 seconds and uses the same 1.34 GB of RAM. If you have a JSONSchema for the file, an experimental method (https://github.com/scikit-hep/awkward-1.0/pull/1165#issuecom...) would reduce the reading time to 6.0 seconds, since most of that time is spent discovering the schema of the JSON data dynamically.
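
For illustration, a JSONSchema describing the sample above might look like the following. The schema is hypothetical (it is not taken from the linked pull request), and the small `conforms` helper is only a stand-in checker for the handful of keywords used here, not part of any library. How the schema is handed to the reader is left out, since that interface is described as experimental.

```python
import json

# Hypothetical schema for the sample: lists of lists of {"x": number, "y": [integer, ...]}.
schema = {
    "type": "array",
    "items": {
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "x": {"type": "number"},
                "y": {"type": "array", "items": {"type": "integer"}},
            },
            "required": ["x", "y"],
        },
    },
}

def conforms(value, schema):
    """Check a parsed JSON value against the small schema subset used above."""
    kind = schema["type"]
    if kind == "array":
        return isinstance(value, list) and all(
            conforms(item, schema["items"]) for item in value
        )
    if kind == "object":
        return (
            isinstance(value, dict)
            and all(key in value for key in schema.get("required", []))
            and all(
                conforms(value[key], sub)
                for key, sub in schema["properties"].items()
                if key in value
            )
        )
    if kind == "number":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if kind == "integer":
        return isinstance(value, int) and not isinstance(value, bool)
    raise ValueError(f"unsupported type: {kind}")

sample = json.loads(
    b'[[{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}], [], [{"x": 4.4, "y": [1, 2, 3, 4]}]]'
)
print(conforms(sample, schema))
```

With the types declared up front like this, a reader knows what to expect at each position instead of having to infer it from the data as it goes.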

The main thing, though, is that you can compute on the already-loaded data in faster and more concise ways. For instance, if you wanted to slice the data and call a NumPy ufunc on it, like

    output = np.square(array["y", ..., 1:])
the equivalent Python would be

    output = []
    for sublist in python_objects:
        tmp1 = []
        for record in sublist:
            tmp2 = []
            for number in record["y"][1:]:
                tmp2.append(np.square(number))
            tmp1.append(tmp2)
        output.append(tmp1)
The pure-Python loop runs in 140 seconds and the array expression runs in 2.4 seconds (in the current version; in the unreleased version 2, it's 1.5 seconds).
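
To see what both forms compute, here is the loop written as a nested comprehension and applied to a single repetition of the sample. As an assumption made to keep the sketch free of third-party imports, plain `number ** 2` stands in for `np.square`.

```python
import json

# One repetition of the sample data, parsed into Python lists and dicts.
python_objects = json.loads(
    b'[[{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}, {"x": 3.3, "y": [1, 2, 3]}],'
    b'[],'
    b'[{"x": 4.4, "y": [1, 2, 3, 4]}, {"x": 5.5, "y": [1, 2, 3, 4, 5]}]]'
)

# Same nesting as the explicit loop: outer lists -> records -> "y" values after the first.
# Assumption: `number ** 2` replaces np.square so that only the standard library is needed.
output = [
    [[number ** 2 for number in record["y"][1:]] for record in sublist]
    for sublist in python_objects
]

print(output)
# [[[], [4], [4, 9]], [], [[4, 9, 16], [4, 9, 16, 25]]]
```

The result keeps the ragged nesting of the input: each record's "y" list loses its first element, and the empty middle list stays empty.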

For both loading data and performing operations on them, it's orders of magnitude faster than pure Python, for the same reasons that NumPy is. What we're adding here is the ability to do this kind of thing on non-rectangular arrays of numbers.