Hacker News new | ask | show | jobs
by wcedmisten 709 days ago
I recently discovered DuckDB's Read_OSM() function [1], which lets you query OSM PBF files directly.

For example, it's simple to count the cafes in North America in under 30s:

  SELECT COUNT(*) FROM st_readOSM('/home/wcedmisten/Downloads/north-america-latest.osm.pbf') WHERE tags['amenity'] = ['cafe'];
  ┌──────────────┐
  │ count_star() │
  │    int64     │
  ├──────────────┤
  │        57150 │
  └──────────────┘
  Run Time (s): real 24.643 user 379.067204 sys 3.696217

Unfortunately, I discovered there are still some bugs [2] that need to be ironed out, but it seems very promising for doing high performance queries with minimal effort.

[1]: https://duckdb.org/docs/extensions/spatial.html#st_readosm--...

[2]: https://github.com/duckdb/duckdb_spatial/issues/349

3 comments

Fascinating use of DuckDB!

Can I ask where you get official OSM PBF data from? (I found these two links, but not sure what data these contain)

https://planet.openstreetmap.org/pbf/

http://download.geofabrik.de/

Those are the most popular sources, and I've used both!

The first one is the official OpenStreetMap data, which contains the "planet file" - i.e. all the data for the entire world. But because OSM has so much stuff in it, the planet file is a whopping 76 GB, which can take a long time to process for most tasks. I also recommend using the torrent file for faster download speeds.

As a result of the planet's size, the German company Geofabrik provides unofficial "extracts" of the data, which are filtered down to a specific region. E.g. all the data in a particular continent, country, or U.S. state. If you click on the "Sub Region" link it will show countries, and if you click on those it will show states.

I was able to find all 10 Whole Foods in the City of Chicago in 22.6s with DuckDB. It's amazing! (there are tons more Whole Foods in the metro area, but it found the exact 10 in the city)

        SELECT 
            tags['addr:city'][1] city,
            tags['addr:state'][1] state,
            tags['brand'][1] brand,
            *, 
        FROM st_readosm('us-latest.osm.pbf')
        WHERE 1=1
        and city = 'Chicago'
        and state = 'IL'
        and brand = 'Whole Foods Market'
I'm sure there are ways to make this faster (partitioning, indexing, COPY TO native format, etc.) but querying a 9.8GB compressed raw format file with data (in key-value fields stored as strings) for the entire United States at this speed is pretty impressive to me.
That's great. I was impressed with DuckDB's speed with the OSM data. I'm doing it with all the data, but I intended to have: - a single file that can be queried - a single file that can be easily compressed

When I index the entire United States, admittedly with a subset of tags, I can search in milliseconds. I'll take it.

1-2 seconds in Overpass which also uses OSM:

https://overpass-turbo.eu/s/1NTy

So it's GeoFabrik is the same data, but regionalized. This sounds like what I need. I already use DuckDB, so this is great.

I appreciate your taking the time to share this tidbit! It's a game changer in what I do (geospatial).

That's cool, but not what I would call high performance. If you do these often you would want an index, and should only take single digit ms.
The reason I call it high performance is that it avoids the hours/days of processing (for the planet file)[1] that would be required for pulling the data out of PBF and indexing it. And you'd also need RAM at least the size of the planet to even get that level of speed.

You could certainly amortize this cost for repeated queries, but for one-off queries I haven't seen anything faster.

[1]: https://wiki.openstreetmap.org/wiki/Osm2pgsql/benchmarks

You wouldn't need hundreds of gigs of ram to answer this query in 5ms. You'd need a few mb or so to answer this query, after initial indexing is done of course.
How long do you think it would take to index it?
Depends on the machine :) hours maybe?
So for a one-off query DuckDB is tons faster, and easier.
I can help answer this a bit.

Processing the data with code and SQLite takes about 4 hours for the United States and Canada. If I can figure out how to parallelize some of the work, it could be faster.

The benefit of using SQLite is its low memory footprint. I can have this 40GB file (the uncompressed version) and use a 256MB container instance. It shows how valuable SQLite is in terms of resources. It takes up disk space, but I followed up with the zstd compression (from the post).

Was this more work and should've I have used Postgres and osm2pql, yes. Was it fun, well yeah, because I have another feature, which I have not written about yet, which I highly value.

Not near a computer to try this out but I'd be surprised if you couldn't get a huge speed up by selecting the whole file into a real table first and querying against that. DuckDB should be able to better vectorize operations then
I did try DuckDB at first. Not for tags, but the relationships between the nodes, ways, and relations. It was slow. Really fast to important, however!