Hacker News new | ask | show | jobs
by wenc 709 days ago
Fascinating use of DuckDB!

Can I ask where you get official OSM PBF data from? (I found these two links, but not sure what data these contain)

https://planet.openstreetmap.org/pbf/

http://download.geofabrik.de/

1 comments

Those are the most popular sources, and I've used both!

The first one is the official OpenStreetMap data, which contains the "planet file" - i.e. all the data for the entire world. But because OSM has so much stuff in it, the planet file is a whopping 76 GB, which can take a long time to process for most tasks. I also recommend using the torrent file for faster download speeds.

As a result of the planet's size, the German company Geofabrik provides unofficial "extracts" of the data, which are filtered down to a specific region. E.g. all the data in a particular continent, country, or U.S. state. If you click on the "Sub Region" link it will show countries, and if you click on those it will show states.

I was able to find all 10 Whole Foods in the City of Chicago in 22.6s with DuckDB. It's amazing! (there are tons more Whole Foods in the metro area, but it found the exact 10 in the city)

        SELECT 
            tags['addr:city'][1] city,
            tags['addr:state'][1] state,
            tags['brand'][1] brand,
            *, 
        FROM st_readosm('us-latest.osm.pbf')
        WHERE 1=1
        and city = 'Chicago'
        and state = 'IL'
        and brand = 'Whole Foods Market'
I'm sure there are ways to make this faster (partitioning, indexing, COPY TO native format, etc.) but querying a 9.8GB compressed raw format file with data (in key-value fields stored as strings) for the entire United States at this speed is pretty impressive to me.
That's great. I was impressed with DuckDB's speed with the OSM data. I'm doing it with all the data, but I intended to have: - a single file that can be queried - a single file that can be easily compressed

When I index the entire United States, admittedly with a subset of tags, I can search in milliseconds. I'll take it.

1-2 seconds in Overpass which also uses OSM:

https://overpass-turbo.eu/s/1NTy

So it's GeoFabrik is the same data, but regionalized. This sounds like what I need. I already use DuckDB, so this is great.

I appreciate your taking the time to share this tidbit! It's a game changer in what I do (geospatial).