| HN Mirror

by ingenieroariel 798 days ago

I went through a similar phase with a pipeline that started from global OSM and Whosonfirst data. Google costs kept going up ($7k a month with Airflow + BigQuery) and I was able to replace it with a one-time $7k hardware purchase. We were able to do it because the process used H3 indices early on and the resulting intermediate datasets all fit in RAM.

The system is a Mac Studio with 128GB of RAM running Asahi Linux, with mmapped Parquet files and DuckDB. It also runs Airflow for us, and with Nix it can be used to accelerate developer builds and run the Airflow tasks for the data team.

GCP is nice when it is free/cheap but they keep tabs on what you are doing and may surprise you at any point in time with ever higher bills without higher usage.
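
To put the numbers above side by side, here is a minimal break-even sketch. It assumes the cloud bill stays flat at the quoted $7k a month; the `monthly_onprem_cost` parameter (power, spare parts, admin time) is a hypothetical knob that the post does not quantify.

```python
def breakeven_months(hardware_cost: float,
                     monthly_cloud_cost: float,
                     monthly_onprem_cost: float = 0.0) -> float:
    """Months until a one-time hardware purchase has paid for itself.

    monthly_onprem_cost is an assumed running cost for the owned machine;
    it defaults to zero because the post gives no figure for it.
    """
    monthly_saving = monthly_cloud_cost - monthly_onprem_cost
    if monthly_saving <= 0:
        raise ValueError("no monthly saving, so the purchase never breaks even")
    return hardware_cost / monthly_saving


if __name__ == "__main__":
    # Figures from the post: $7k hardware vs. $7k/month cloud bill.
    print(breakeven_months(7_000, 7_000))
    # Same purchase with a hypothetical $500/month of running costs.
    print(breakeven_months(7_000, 7_000, 500))
```

With the quoted figures the hardware pays for itself in one month; any realistic running cost pushes that out only slightly.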

5 comments

nojvek 798 days ago

DuckDB is the real magic. On an NVMe disk with a decent amount of RAM, it goes brrrrrr.

I would love it if somehow Postgres got duckdb powered columnstore tables.

I know hydra.so is doing columnstores.

DuckDB being able to query parquet files directly is a big win IMO.

I wish we could bulk insert parquet files into stock PG.
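
For anyone who has not tried it, querying Parquet directly looks roughly like the sketch below. The file glob `data/*.parquet` and the column name `h3_cell` are made up for illustration, and `run_query` needs the third-party `duckdb` package installed; this is not the setup described in the parent comment, just the general shape.

```python
def build_parquet_query(parquet_glob: str, group_col: str) -> str:
    """Build a SQL string that aggregates rows straight from Parquet files.

    DuckDB's read_parquet() accepts a glob, so there is no separate
    load/import step before querying.
    """
    safe_glob = parquet_glob.replace("'", "''")   # escape single quotes in the path
    safe_col = group_col.replace('"', '""')       # escape double quotes in the identifier
    return (
        f'SELECT "{safe_col}", count(*) AS n '
        f"FROM read_parquet('{safe_glob}') "
        f'GROUP BY "{safe_col}" ORDER BY n DESC'
    )


def run_query(sql: str):
    """Run the query on an in-memory DuckDB connection and return all rows."""
    import duckdb  # third-party: pip install duckdb

    con = duckdb.connect()  # no database file; data stays in the Parquet files
    try:
        return con.execute(sql).fetchall()
    finally:
        con.close()


if __name__ == "__main__":
    # Hypothetical path and column; only the SQL is printed here.
    print(build_parquet_query("data/*.parquet", "h3_cell"))
```

Passing the printed SQL to `run_query` executes it against whatever files match the glob.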


jfim 798 days ago

BigQuery is nice but it's definitely a major foot-gun in terms of cost. It's surprisingly easy to rack up high costs with say a misconfigured dashboard or a developer just testing stuff.


mrgaro 798 days ago

Definitely agree here. Once the data is in BigQuery, people will start doing ad-hoc queries and building Grafana dashboards on top of it.

And sooner or later (usually sooner) somebody will build a fancy Grafana dashboard and set it to refresh every 5 seconds, and you will not notice it until it's too late.
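
A back-of-the-envelope sketch of why a 5-second refresh hurts under scan-based pricing. The per-query scan size and the price per TiB are placeholder assumptions, not quoted BigQuery rates, and the estimate ignores any result caching.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def refreshes_per_day(interval_s: float, panels: int = 1) -> int:
    """Queries issued per day by a dashboard with `panels` queries per refresh."""
    return int(SECONDS_PER_DAY / interval_s) * panels


def monthly_scan_cost(interval_s: float,
                      gib_per_query: float,
                      usd_per_tib: float,
                      panels: int = 1,
                      days: int = 30) -> float:
    """Estimated monthly cost when every query is billed by bytes scanned.

    gib_per_query and usd_per_tib are assumptions supplied by the caller.
    """
    queries = refreshes_per_day(interval_s, panels) * days
    tib_scanned = queries * gib_per_query / 1024
    return tib_scanned * usd_per_tib


if __name__ == "__main__":
    print(refreshes_per_day(5))
    # Hypothetical: one panel, 1 GiB scanned per query, $5 per TiB.
    print(monthly_scan_cost(5, gib_per_query=1.0, usd_per_tib=5.0))
```

One panel at a 5-second interval is already 17,280 queries a day, so even small per-query scans add up.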


dekhn 798 days ago

Frankly I think this is just a sign that it's a power tool for power users.


lillecarl 798 days ago

Sadly my colleagues aren't always "power users"


brailsafe 798 days ago

Nobody starts as a power user


ZeroCool2u 798 days ago

That is a very cool setup!

My org would never allow that as we're in a highly regulated and security-conscious space.

Totally agree about the BQ costs. The free tier is great and I think pretty generous, but if you're not very careful about only allowing tables to be created with partitioning and clustering wherever possible, and don't require some training for devs on how to deal with columnar DBs if they're not familiar with them, the bills can get pretty crazy quickly.


fikama 797 days ago

You made me curious. Since you are using Linux, why a Mac and not a PC? Wouldn't a PC be cheaper? Or were there any other factors?


zamadatix 796 days ago

My stab: A Mac Studio will have 400 GB/s or 800 GB/s of memory bandwidth. Not that you can't get there on x86, e.g. a 12-channel Epyc Genoa setup can do 460 GB/s, or 920 GB/s total when doubled up, but now you're talking about buying 2 latest-gen Epycs and 24 high-speed DIMMs to get the raw bandwidth back, all while ignoring that the access is a bit different.

Curious to see if there were other reasons.
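
The Epyc figures can be reproduced with simple arithmetic. This sketch assumes DDR5-4800 with 64-bit channels (the comment does not state the memory speed) and gives theoretical peak numbers, not measured throughput.

```python
def peak_bandwidth_gbs(channels: int,
                       mega_transfers_per_s: int,
                       bus_width_bits: int = 64,
                       sockets: int = 1) -> float:
    """Theoretical peak memory bandwidth in GB/s (decimal gigabytes)."""
    bytes_per_transfer = bus_width_bits / 8
    mb_per_s = channels * mega_transfers_per_s * bytes_per_transfer * sockets
    return mb_per_s / 1000


if __name__ == "__main__":
    # Assumed configuration: 12 channels of DDR5-4800 per socket.
    print(peak_bandwidth_gbs(12, 4800))             # one socket
    print(peak_bandwidth_gbs(12, 4800, sockets=2))  # two sockets
```

Under those assumptions one socket comes out at about 460.8 GB/s and two at about 921.6 GB/s, in line with the 460 and 920 quoted above.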


hawk_ 798 days ago

> and may surprise you at any point in time with ever higher bills without higher usage.

What? Really? Do they change your pricing plan? How can they charge more for the same usage?


ingenieroariel 798 days ago

When you queried their 'Open Data' datasets and linked with your own it was absurdly cheap for some time. Granted we used our hacking skills to make sure the really big queries ran in the free tier and only smaller datasets got in the private tables.

I kept getting emails about small changes and the bills got bigger all over the place including BigQuery and how they dealt with queries on public datasets. Bill got higher.

There is a non-zero chance I conflated things. But from my point of view: I created a system and left it running for years, then bills got higher out of the blue and I moved out.
