Hacker News new | ask | show | jobs
by westurner 540 days ago
Why is weather data stored in netcdf instead of tensors or sparse tensors?

Also, SQLite supports virtual tables that can be backed by Content Range requests; https://www.sqlite.org/vtab.html

sqlite-wasm-http, sql.js-httpvfs; HTTP VFS: https://www.npmjs.com/package/sqlite-wasm-http

sqlite-parquet-vtable: https://github.com/cldellow/sqlite-parquet-vtable

Could there be a sqlite-netcdf-vtable or a sqlite-gribs-vtable, or is the dimensionality too much for SQLite?

From https://news.ycombinator.com/item?id=31824578 :

> It looks like e.g. sqlite-parquet-vtable implements shadow tables to memoize row group filters. How does JOIN performance vary amongst sqlite virtual table implementations?

https://news.ycombinator.com/item?id=42264274

SpatialLite does geo vector search with SQLite.

datasette can JOIN across multiple SQLite databases.

Perhaps datasette and datasette-lite could support xarray and thus NetCDF-style multidimensional arrays in WASM in the browser with HTTP Content Range requests to fetch and cache just the data requested

"The NetCDF header": https://climateestimate.net/content/netcdfs-and-basic-coding... :

> The header can also be used to verify the order of dimensions that a variable is saved in (which you will have to know to use, unless you’re using a tool like xarray that lets you refer to dimensions by name) - for a 3-dimensional variable, `lon,lat,time` is common, but some files will have the `time` variable first.

"Loading a subset of a NetCDF file": https://climateestimate.net/content/netcdfs-and-basic-coding...

From https://news.ycombinator.com/item?id=42260094 :

> xeus-sqlite-kernel > "Loading SQLite databases from a remote URL" https://github.com/jupyterlite/xeus-sqlite-kernel/issues/6#i...

  %FETCH <url> <filename>
2 comments

> Why is weather data stored in netcdf instead of tensors or sparse tensors?

NetCDF is a "tensor", at least in the sense of being a self-describing multi-dimensional array format. The bigger problem is that it's not a Cloud-Optimized format, which is why Zarr has become popular.

> Also, SQLite supports virtual tables that can be backed by Content Range requests

The multi-dimensional equivalent of this is "virtual Zarr". I made this library to create virtual Zarr stores pointing at archival data (e.g. netCDF and GRIB)

https://github.com/zarr-developers/VirtualiZarr

> xarray and thus NetCDF-style multidimensional arrays in WASM in the browser with HTTP Content Range requests to fetch and cache just the data requested

Pretty sure you can do this today already using Xarray and fsspec.

In theory it could be done. It is sort of analogous to what GribStream is doing already.

The grib2 files are the storage. They are sorted by time in the path and so that is used like a primary index. And then grib2 is just a binary format to decode to extract what you want.

I originally was going to write this as a plugin for Clickhouse but in the end I made it a Golang API cause then I'm less constrained to other things. Like, for example, I'd like to create and endpoint to live encode the gribfiles into MP4 so the data can be served as Video. And then with any video player you would be able to playback, jump to times, etc.

I might still write a clickhouse integration though because it would be amazing to join and combine with other datasets on the fly.

> It is sort of analogous to what GribStream is doing already.

The difference is presumably that you are doing some large rechunking operation on your server to hide from the user the fact that the data is actually in multiple files?

Cool project btw, would love to hear a little more about how it works underneath :)

Yeah, exactly.

I basically scrape all the grib index files to know all the offsets into all variables for all time. I store that in clickhouse.

When the API gets a request for a time range, set of coordinates and a set of weather parameters, first I pre-compute the mapping of (lat,lon) into the 1 dimensional index in the gridded data. That is a constant across the whole dataset. Then I query the clickhouse table to find out all the files+offset that need to be processed and all of them are queued into a multi-processing pool. And then processing each parameter implies parsing a grib file. I wrote a grib2 parser from scratch in golang so as to extract the data in a streaming fashion. As in... I don't extract the whole grid only to lookup the coordinates in it. I already pre-computed the index, so I can just decode every value in the grid in order and when I hit and index that I'm looking for, I copy it to a fixed size buffer with the extracted data. When You have all the pre-computed indexes then you don't even need to finish downloading the file, I just drop the connection immediately.

It is pretty cool. It is running in very humble hardware so I'm hoping I'll get some traction so I can throw more money into it. It should scale pretty linearly.

I've tested doing multi-year requests and the golang program never goes over 80Mb of memory usage. The CPUs get pegged so that is the limiting factor.

Grib2 complex packing (what the NBM dataset uses) implies lots of bit-packing. So there is a ton more to optimize using SIMD instructions. I've been toying with it a bit but I don't want to mission creep into that yet (fascinating though!).

I'm tempted to port this https://github.com/fast-pack/simdcomp to native go ASM.

That's pretty cool! Quite specific to this file format/workload, but this is an important enough problem that people might well be interested in a tailored solution like this :)
Thank you!