Hacker News new | ask | show | jobs
by ingenieroariel 798 days ago
I went through a similar phase with a process that started with global OSM and Whosonfirst to process a pipeline. Google costs kept going up (7k a month with airflow + bigquery) and I was able to replace it with a one time $7k hardware purchase. We were able to do it since the process was using H3 indices early on and the resulting intermediate datasets all fit on ram.

System is a Mac Studio with 128GB + Asahi Linux + mmapped parquet files and DuckDB, it also runs airflow for us and with Nix can be used to accelerate developer builds and run the airflow tasks for the data team.

GCP is nice when it is free/cheap but they keep tabs on what you are doing and may surprise you at any point in time with ever higher bills without higher usage.

5 comments

DuckDB is the real magic. On an nvme disk with decent amount of RAM, it goes brrrrrr.

I would love it if somehow Postgres got duckdb powered columnstore tables.

I know hydra.so is doing columnstores.

DuckDB being able to query parquet files directly is a big win IMO.

I wish we could bulk insert parquet files into stock PG.

BigQuery is nice but it's definitely a major foot-gun in terms of cost. It's surprisingly easy to rack up high costs with say a misconfigured dashboard or a developer just testing stuff.
Definitively agree here. Once the data is in BigQuery, people will start doing ad-hoc queries and building Grafana dashboards on top of it.

And sooner or later (usually sooner) somebody will build a fancy Grafana dashboard and set it to refresh every 5 second and you will not notice it until it's too late.

Frankly I think this is just a sign that it's a power tool for power users.
Sadly my colleagues aren't always "power users"
Nobody starts as a power user
That is a very cool setup!

My org would never allow that as we're in a highly regulated and security conscious space.

Totally agree about the BQ costs. The free tier is great and I think pretty generous, but if you're not very careful with enforcing table creation only with partitioning and clustering as much as possible, and don't enforce some training for devs on how to deal with columnar DB's if they're not familiar, the bills can get pretty crazy quickly.

You made me curious. Since you are using Linux, why Mac and not PC? Wouldn't PC be cheaper? Or was there any other factors?
My stab: A Mac Studio will have 400 GB/s or 800 GB/s of memory bandwidth. Not that you can't get there on x86 e.g. a 12 channel Epyc Genoa setup can do 460 GB/s or 920 GB/s total when doubled up but now you're talking about buying 2 latest gen Epycs and 24 high speed dims to get the raw bandwidth back all while ignoring the access is a bit different.

Curious to see if there were other reasons.

> and may surprise you at any point in time with ever higher bills without higher usage.

What? really? Do they change your pricing plan? How can they charge more for the same usage?

When you queried their 'Open Data' datasets and linked with your own it was absurdly cheap for some time. Granted we used our hacking skills to make sure the really big queries ran in the free tier and only smaller datasets got in the private tables.

I kept getting emails about small changes and the bills got bigger all over the place including BigQuery and how they dealt with queries on public datasets. Bill got higher.

There is a non zero chance I conflated things. But from my point of view: I created a system and let it running for years - afterwards bills got higher out of the blue and I moved out.