|
And here's my quick benchmark, on a dataset from my full-time job:
> import geopandas as gpd
> import pandas as pd
> from shapely.geometry import Point
> d = pd.read_csv('data/tracks/2024_01_01.csv')
> d.shape
(3690166, 4)
> list(d)
['user_id', 'timestamp', 'lat', 'lon']
> %%timeit -n 1
> d.to_csv('/tmp/test.csv')
14.9 s ± 1.18 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
> d2 = gpd.GeoDataFrame(d.drop(['lon', 'lat'], axis=1), geometry=gpd.GeoSeries([Point(*i) for i in d[['lon', 'lat']].values]), crs=4326)
> d2.shape, list(d2)
((3690166, 3), ['user_id', 'timestamp', 'geometry'])
> %%timeit -n 1
> d2.to_file('/tmp/test.gpkg')
4min 32s ± 7.5 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
> %%timeit -n 1
> d.to_csv('/tmp/test.csv.gz')
37.4 s ± 291 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> ls -lah /tmp/test*
-rw-rw-r-- 1 culebron culebron 228M Mar 26 21:10 /tmp/test.csv
-rw-rw-r-- 1 culebron culebron 63M Mar 26 22:03 /tmp/test.csv.gz
-rw-r--r-- 1 culebron culebron 423M Mar 26 21:58 /tmp/test.gpkg
CSV saved in 15s, GPKG in 272s: an 18x slowdown.
I guess your dataset is country borders, isn't it? Something that 1) has few records and makes a small r-tree, and 2) contains linestrings/polygons that can be densified, similar to the Google Polyline algorithm. But a lot of geospatial data is just sets of points. For instance: housing for an entire country (a couple of million points). An address database (IIRC 20+M points). Or GPS logs of multiple users, pulled from a logging database, ordered by time, not assembled into tracks: several million per day. For such datasets, use CSV; don't abuse indexed formats. (Unless you store the data for a long time and actually use the index for spatial search, multiple times.) |
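To make the "just use CSV" point concrete, here is a minimal sketch, assuming a one-off bounding-box query is all that is needed. The sample rows and the `bbox_filter` helper are hypothetical; only the column names come from the benchmark above. It keeps `lon`/`lat` as plain columns, round-trips them through CSV, and filters with a linear scan instead of a spatial index.

```python
import io

import pandas as pd


def bbox_filter(points, min_lon, min_lat, max_lon, max_lat):
    """Return rows whose lon/lat fall inside the box (hypothetical helper).

    This is a plain linear scan with vectorized comparisons; no spatial
    index is built or used.
    """
    inside = points["lon"].between(min_lon, max_lon) & points["lat"].between(
        min_lat, max_lat
    )
    return points[inside]


# Tiny made-up stand-in for the GPS log; same columns as the benchmark table.
points = pd.DataFrame(
    {
        "user_id": [1, 1, 2],
        "timestamp": [
            "2024-01-01T00:00:00",
            "2024-01-01T00:00:05",
            "2024-01-01T00:00:07",
        ],
        "lat": [55.75, 55.76, 59.94],
        "lon": [37.61, 37.62, 30.31],
    }
)

# Round-trip through CSV (an in-memory buffer here, a file path in practice).
buf = io.StringIO()
points.to_csv(buf, index=False)
buf.seek(0)
loaded = pd.read_csv(buf)

# One-off spatial query without any index.
nearby = bbox_filter(loaded, 37.0, 55.0, 38.0, 56.0)
```

If the same data is queried spatially many times, this scan is repeated on every query, which is the case where an indexed format starts to pay off.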
You need to use pyogrio [1], the vectorized counterpart of Fiona [0], instead. Make sure you pass `engine="pyogrio"` when calling `to_file` [2]. Fiona runs its loop in Python, while pyogrio is exclusively compiled, so pyogrio is usually about 10-15x faster than Fiona. Soon, in pyogrio version 0.8, it will be another ~2-4x faster than it is now [3].
[0]: https://github.com/Toblerity/Fiona
[1]: https://github.com/geopandas/pyogrio
[2]: https://geopandas.org/en/stable/docs/reference/api/geopandas...
[3]: https://github.com/geopandas/pyogrio/pull/346