| I ran into a practical limitation while working on ML feature engineering and multi-omics data. At some point, the problem stops being “how many rows” and becomes “how many columns”.
Thousands, then tens of thousands, sometimes more. What I observed in practice: - Standard SQL databases usually cap out around ~1,000–1,600 columns.
- Columnar formats like Parquet can handle width, but typically require Spark or Python pipelines.
- OLAP engines are fast, but tend to assume relatively narrow schemas.
- Feature stores often work around this by exploding data into joins or multiple tables. At extreme width, metadata handling, query planning, and even SQL parsing become bottlenecks. I experimented with a different approach:
- no joins
- no transactions
- columns distributed instead of rows
- SELECT as the primary operation With this design, it’s possible to run native SQL selects on tables with hundreds of thousands to millions of columns, with predictable (sub-second) latency when accessing a subset of columns. On a small cluster (2 servers, AMD EPYC, 128 GB RAM each), rough numbers look like:
- creating a 1M-column table: ~6 minutes
- inserting a single column with 1M values: ~2 seconds
- selecting ~60 columns over ~5,000 rows: ~1 second I’m curious how others here approach ultra-wide datasets.
Have you seen architectures that work cleanly at this width without resorting to heavy ETL or complex joins? |
If you can drop the “distributed” part, then plug DuckDB (https://duckdb.org/) and query Parquet (out of the box) or Vortex (https://duckdb.org/docs/stable/core_extensions/vortex.html) with it.