by masfuerte 5 days ago
> To make processing this massive dataset practical, we built a Julia pipeline to extract the bits directly into a DuckDB database.

The raw data is a bit more than 1GB per annum.

The data of interest is 176 bits every 12.5 minutes for 19 years. That is, about 17MB of data. Possibly multiplied by the number of satellites, roughly thirty.

It's not big data.
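The estimate above can be checked with a few lines of Python. This is only a back-of-envelope sketch: the record size, cadence and time span are the figures quoted in the comment, the satellite count of 30 is the comment's "roughly thirty", and the 365.25-day year and decimal megabytes are my own assumptions.

```python
# Back-of-envelope size estimate using the figures quoted in the comment.
BITS_PER_RECORD = 176        # bits of interest per record
MINUTES_PER_RECORD = 12.5    # one record every 12.5 minutes
YEARS = 19
SATELLITES = 30              # "roughly thirty"; a round-number assumption

MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumes an average year of 365.25 days

records = YEARS * MINUTES_PER_YEAR / MINUTES_PER_RECORD
bytes_per_satellite = records * BITS_PER_RECORD / 8
bytes_all_satellites = bytes_per_satellite * SATELLITES

print(f"{records:,.0f} records per satellite")
print(f"{bytes_per_satellite / 1e6:.1f} MB per satellite")
print(f"{bytes_all_satellites / 1e6:.0f} MB for {SATELLITES} satellites")
```

That comes to roughly 17.6 MB per satellite (about 16.8 MiB, so "about 17MB" holds either way) and a little over half a gigabyte if every one of thirty satellites contributes a full record stream.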


The dataset was 136GB (about 7GB per annum), and the Python implementation took 45 hours for each run. The Julia code that processed the whole dataset and built the database took 5 hours, which made iterative development much more pleasant. Of course, later stages in the pipeline had much less data to process and so were much faster. With metadata and indices, the extracted data came to about 3GB. That's bigger than your estimate because there are multiple observations of the same satellite.
Though I take your point that it’s not big data in the conventional sense (i.e. requiring distributed computing to process). The phrasing in the original article was better: “To make iterative analysis practical, we wrote a Julia pipeline: NetCDF source files are converted to Apache Arrow, then thread-parallel bit extraction is performed into a DuckDB database.”
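For scale, the run times and sizes quoted in this reply work out as follows. This is a rough sketch only: it assumes the 19-year span from the parent comment applies to the 136GB dataset, uses decimal units, and treats each run as a single pass over the whole dataset. The helper name `throughput_mb_s` is made up for illustration.

```python
# Figures quoted in the reply; the 19-year span comes from the parent comment.
DATASET_GB = 136
YEARS = 19
PYTHON_HOURS = 45
JULIA_HOURS = 5


def throughput_mb_s(gigabytes, hours):
    """Average throughput in MB/s, using decimal units (1 GB = 1000 MB)."""
    return gigabytes * 1000 / (hours * 3600)


gb_per_year = DATASET_GB / YEARS
speedup = PYTHON_HOURS / JULIA_HOURS
python_rate = throughput_mb_s(DATASET_GB, PYTHON_HOURS)
julia_rate = throughput_mb_s(DATASET_GB, JULIA_HOURS)

print(f"{gb_per_year:.1f} GB per year")
print(f"{speedup:.0f}x faster per run")
print(f"Python: {python_rate:.2f} MB/s, Julia: {julia_rate:.2f} MB/s")
```

So the "about 7GB per annum" figure checks out, the Julia run is 9x faster, and both rates are well below what a single disk can deliver, which fits the point that this is a single-machine job rather than a distributed one.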