| Back at my old job in ~2016, we built a cheap homegrown data warehouse via Postgres, SQLite and Lambda. Basically, it worked like this: - All of our data lived in compressed SQLite DBs on S3. - Upon receiving a query, Postgres would use a custom foreign data wrapper we built. - This FDW would forward the query to a web service. - This web service would start one lambda per SQLite file. Each lambda would fetch the file, query it, and return the result to the web service. - This web service would re-issue lambdas as needed and return the results to the FDW. - Postgres (hosted on a memory-optimized EC2 instance) would aggregate. It was straight magic. Separated compute + storage with basically zero cost and better performance than Redshift and Vertica. All of our data was time-series data, so it was extraordinarily easy to partition. Also, it was also considerably cheaper than Athena. On Athena, our queries would cost us ~$5/TB (which hasn't changed today!), so it was easily >$100 for most queries and we were running thousands of queries per hour. I still think, to this day, that the inevitable open-source solution for DWs might look like this. Insert your data as SQLite or DuckDB into a bucket, pop in a Postgres extension, create a FDW, and `terraform apply` the lambdas + api gateway. It'll be harder for non-timeseries data but you can probably make something that stores other partitions. |
- instead of S3, we now use R2.
- instead of Postgres+Sqlite3, we use DuckDB+CSV/Parquet.
- instead of Lambda, we use AWS AppRunner (considering moving it to Fly.io or Workers).
It worked gloriously for variety of analytical workloads, even if slower had we used Clickhouse/Timescale/Redshift/Elasticsearch.