Thanks, I'm with you on the big data buzzwords and trying to avoid overengineering things (one of my favorite HN posts ever is https://news.ycombinator.com/item?id=8908462).

Right now the data scanning is just a fork of https://github.com/BurntSushi/xsv/ running in a container with plenty of RAM, and we've handled files in the ~20GB range with no problem. I think we could actually scale up to ~100GB files with xsv, which seems to cover 99%+ of the data providers we're running into. Providers might be processing massive amounts of data, but the eventual deliverable they share with their customers is rarely too big for one machine.
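To make the "one machine is enough" point concrete, here is a rough Python sketch of the kind of single-pass scan I mean. It is not xsv's implementation or our fork; `scan_columns` and the stats it tracks are made up for illustration. The idea is just that only running per-column aggregates are kept, so memory use does not grow with the number of rows.

```python
import csv
import io
import math


def scan_columns(lines):
    """Single pass over CSV rows; keeps only running per-column aggregates."""
    reader = csv.reader(lines)
    header = next(reader)
    stats = {
        name: {"count": 0, "numeric": 0, "min": math.inf, "max": -math.inf}
        for name in header
    }
    for row in reader:
        for name, cell in zip(header, row):
            if cell == "":
                continue  # treat empty cells as missing
            col = stats[name]
            col["count"] += 1
            try:
                value = float(cell)
            except ValueError:
                continue  # non-numeric cell: counted, but no min/max update
            col["numeric"] += 1
            col["min"] = min(col["min"], value)
            col["max"] = max(col["max"], value)
    return stats


# Tiny in-memory stand-in for a large file; any iterable of lines works,
# including an open file handle that is read lazily.
sample = io.StringIO("city,temp\nOslo,3.5\nLima,19\nCairo,\n")
result = scan_columns(sample)
```

Passing an open file instead of the `StringIO` keeps the same behavior, since the `csv` reader pulls rows lazily.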

That said, we will probably move away from our super simple stack towards running a Spark cluster in the medium term. Not for "big data" (I actually expect Spark to have higher latency, and possibly to be slower on a moderately sized dataset, than the Rust solution) but because we want to be able to run multiple parallel scans over the datasets for upcoming features. Some of that will involve a DAG of dependencies (e.g., do type detection first to figure out which fields are categorical, then generate visualizations where the plots are grouped by the values of each categorical field). There are also a bunch of nice libraries in the Spark world for more comprehensive stats.
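A minimal sketch of that dependency idea, in plain Python rather than Spark. Everything here is an assumption for illustration: the step names, the `MAX_CATEGORIES` threshold, and the use of per-group means as a stand-in for the grouped plots. It uses the standard library's `graphlib` (Python 3.9+) to order the steps so type detection always runs before anything that needs its output.

```python
import csv
import io
from collections import defaultdict
from graphlib import TopologicalSorter

MAX_CATEGORIES = 3  # hypothetical cutoff for calling a field categorical


def load_rows(text):
    return list(csv.DictReader(io.StringIO(text)))


def is_number(cell):
    try:
        float(cell)
        return True
    except ValueError:
        return False


def detect_types(rows):
    """Label each column as numeric, categorical, or text."""
    types = {}
    for name in rows[0]:
        values = [r[name] for r in rows]
        if all(is_number(v) for v in values):
            types[name] = "numeric"
        elif len(set(values)) <= MAX_CATEGORIES:
            types[name] = "categorical"
        else:
            types[name] = "text"
    return types


def grouped_summaries(rows, types):
    """Stand-in for grouped plots: mean of each numeric column per category."""
    out = {}
    categorical = [n for n, t in types.items() if t == "categorical"]
    numeric = [n for n, t in types.items() if t == "numeric"]
    for cat in categorical:
        for num in numeric:
            groups = defaultdict(list)
            for r in rows:
                groups[r[cat]].append(float(r[num]))
            out[(cat, num)] = {k: sum(v) / len(v) for k, v in groups.items()}
    return out


def run_pipeline(text):
    results = {"rows": load_rows(text)}
    steps = {
        "detect_types": lambda: detect_types(results["rows"]),
        "grouped_summaries": lambda: grouped_summaries(
            results["rows"], results["detect_types"]
        ),
    }
    # Each key maps to the set of steps it depends on.
    graph = {"detect_types": set(), "grouped_summaries": {"detect_types"}}
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        results[name] = steps[name]()
    return order, results


data = "plan,region,spend\nfree,eu,0\npro,us,30\npro,eu,50\nfree,us,10\n"
order, results = run_pipeline(data)
```

With only two steps the sorter is overkill, but the same `graph` dict extends naturally once more scans depend on the detected types.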