Hacker News new | ask | show | jobs
by oncethere 1872 days ago
It's interesting to see how PG can be configured, but why not just use a real warehouse like Snowflake?

Also, do you have any numbers on how PG performs once it's configured?

6 comments

I'm glad I'm not in the situation of needing to make the judgement call, but Postgres' ecosystem might be part of the answer. For example, Snowflake has basic geospatial support, but PG has insane and useful stuff in PostGIS like ST_ClusterDBSCAN: https://postgis.net/docs/ST_ClusterDBSCAN.html

Foreign data wrappers are another thing that might be compelling — dunno if Snowflake has an equivalent.

I don't have any numbers but PG has served me fine for basic pseudo-warehousing. Relative to real solutions, it's pretty bad at truly columnar workloads: scans across a small number of columns in wide tables. The "Use Fewer Columns" advice FTA is solid. This hasn't been a deal breaker though. Analytical query time has been fine if not great up to low tens of GB table size, beyond that it gets rough.

Yeah, I think using Snowflake or BigQuery or something is ultimately the better move. But sometimes folks use what they know (what they're comfortable managing, tuning, deploying, whatever).

In my own testing PG performed very similarly to a 'real' warehouse. It's hard to measure because I didn't have the same datasets across several warehouses. Maybe in the future I'll try running something against a few to see.

I really wanted to migrate an analytics project I was working on from Elasticsearch to Postgres: however, when we sat down and ran production-scale proofs of concepts for the change, ClickHouse handily outclassed all the Postgres-based solutions we tried. (A Real DBA might have been able to solve this for us: I did some tuning, but I’m not an expert). ClickHouse, however, worked near-optimally out of the box.
Snowflake and big query are cloud solutions. Some companies have a need for self hosted databases etc.

And often having a homogenous database stack is a plus. If your production systems are all MySQL, then trying to get away with using MySQL for analytics too is a smart move etc.

I’ve seen so many tiddly data warehouses. Most companies don’t need web scale, and they overbuild and over complicate when they could be running on a simpler homogenous stack etc.

> why not just use a real warehouse like Snowflake?

Cost is the most common reason I’ve seen. An RDS instance is about 1/2 the cost of Redshift per CPU and then Snowflake is slightly more expensive than Redshift (often worth the extra $$).

Also, if you’re dealing with less than 10GB of data the difference in performance will be barely noticeable, so at modest scale the most cost effective solution is RDS.

> why not just use a real warehouse like Snowflake?

Because you don't want to learn and maintain a new kind of software in a small organisation with limiter resources, for example. Each new language, framework, database and other kind of tool adds a lot of cognitive load for everyone involved.

If your project has less than a few million rows, chances are PostgreSQL will run any analytical query you like in just a few seconds. So you can keep things simpler by avoiding adding any extra tools.
> why not just use a real warehouse like Snowflake

Customers who demand / are legally obliged to ensure their data does not leave their territory, is one big reason.

Snowflake give you some options, but if you get a big contract with a customer in another region your entire analytics platform is unavailable to that customer.