| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by kylebarron 855 days ago

Sorry, this is not true _at all_ for geospatial data.

A quick benchmark [0] shows that saving to GeoPackage, FlatGeobuf, and GeoParquet are roughly 10x faster than saving to CSV. Additionally, the CSV is much larger than any other format.

[0]: https://gist.github.com/kylebarron/f632bbf95dbb81c571e4e64cd...

1 comments

culebron21 855 days ago

And here's my quick benchmark, dataset from my full-time job:

  > import geopandas as gpd
  > import pandas as pd
  > from shapely.geometry import Point

  > d = pd.read_csv('data/tracks/2024_01_01.csv')
  > d.shape
  (3690166, 4)
  > list(d)
  ['user_id', 'timestamp', 'lat', 'lon']

  > %%timeit -n 1
  > d.to_csv('/tmp/test.csv')
  14.9 s ± 1.18 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

  > d2 = gpd.GeoDataFrame(d.drop(['lon', 'lat'], axis=1), geometry=gpd.GeoSeries([Point(*i) for i in d[['lon', 'lat']].values]), crs=4326)
  > d2.shape, list(d2)
  ((3690166, 3), ['user_id', 'timestamp', 'geometry'])

  > %%timeit -n 1
  > d2.to_file('/tmp/test.gpkg')
  4min 32s ± 7.5 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

  > %%timeit -n 1
  > d.to_csv('/tmp/test.csv.gz')
  37.4 s ± 291 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

  > ls -lah /tmp/test*
  -rw-rw-r-- 1 culebron culebron 228M мар 26 21:10 /tmp/test.csv
  -rw-rw-r-- 1 culebron culebron  63M мар 26 22:03 /tmp/test.csv.gz
  -rw-r--r-- 1 culebron culebron 423M мар 26 21:58 /tmp/test.gpkg

CSV saved in 15s, GPKG in 272s. 18x slowdown.

I guess your dataset is countries borders, isn't it? Something that 1) has few records and makes a small r-tree, and 2) contains linestrings/polygons that can be densified, similar to Google Polyline algorithm.

But a lot of geospatial data is just sets of points. For instance: housing per entire country (couple of million points). Address database (IIRC 20+M points). Or GPS logs of multiple users, received from logging database, ordered by time, not assembled in tracks -- several million per day.

For such datasets, use CSV, don't abuse indexed formats. (Unless you store it for a long time and actually use the index for spatial search, multiple times.)

link

kylebarron 855 days ago

Your issue is that you're using the default (old) binding to GDAL, based on Fiona [0].

You need to use pyogrio [1], its vectorized counterpart, instead. Make sure you use `engine="pyogrio"` when calling `to_file` [2]. Fiona does a loop in Python, while pyogrio is exclusively compiled. So pyogrio is usually about 10-15x faster than fiona. Soon, in pyogrio version 0.8, it will be another ~2-4x faster than pyogrio is now [3].

[0]: https://github.com/Toblerity/Fiona

[1]: https://github.com/geopandas/pyogrio

[2]: https://geopandas.org/en/stable/docs/reference/api/geopandas...

[3]: https://github.com/geopandas/pyogrio/pull/346

link

culebron21 855 days ago

CSV is still faster than geo-formats with pyogrio. From what I saw, it writes most of the file quickly, then spends a lot of time, I think, building the index.

        > %%timeit -n 1
        > d.to_csv('/tmp/test.csv')
        10.8 s ± 1.05 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

        > %%timeit -n 1
        > d2.to_file('/tmp/test.gpkg', engine='pyogrio')
        1min 15s ± 5.96 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

        > %%timeit -n 1
        > d.to_csv('/tmp/test.csv.gz')
        35.3 s ± 1.37 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

        > %%timeit -n 1
        > d2.to_file('/tmp/test.fgb', driver='FlatGeobuf', engine='pyogrio')
        19.9 s ± 512 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

        > ls -lah /tmp/test*
        -rw-rw-r-- 1 culebron culebron 228M мар 27 11:02 /tmp/test.csv
        -rw-rw-r-- 1 culebron culebron  63M мар 27 11:27 /tmp/test.csv.gz
        -rw-rw-r-- 1 culebron culebron 545M мар 27 11:52 /tmp/test.fgb
        -rw-r--r-- 1 culebron culebron 423M мар 27 11:14 /tmp/test.gpkg

link

culebron21 855 days ago

Still CSV is 2x smaller than GPKG with this kind of data. And CSV.gz is 7x smaller.

link

kylebarron 855 days ago

That's why I'm working on the GeoParquet spec [0]! It gives you both compression-by-default and super fast reads and writes! So it's usually as small as gzipped CSV, if not smaller, while being faster to read and write than GeoPackage.

Try using `GeoDataFrame.to_parquet` and `GeoPandas.read_parquet`

[0]: https://github.com/opengeospatial/geoparquet

link

culebron21 854 days ago

...but this has spared me today some irritation at work. Thanks!

link