by truth_seeker 2429 days ago
Say, for example, I am using PostgreSQL 12 with the CitusDB extension:

Data cleaning -> PL/pgSQL and various built-in functions for transforming the data (or a new UDF, if one is required at all)

Processing -> PostgreSQL parallel query processing on the local node, and the CitusDB extension for distributed computing and sharding

Analytics -> Many options here: materialized views, triggers, streaming computation with the PipelineDB extension, or logical replication for stream computation

ML -> PG supports a variety of statistical functions. It also supports the PL/R and PL/Python extensions for interfacing with ML libraries.

Also, there are various kinds of Foreign Data Wrappers supported by PG - https://wiki.postgresql.org/wiki/Foreign_data_wrappers


Yeah that's not going to work for what people call analytics workloads today.

PG is great but it's not suitable to be a feature store and sure as hell not suitable to fan out ML workloads. In a modern ML stack, PG might play the role of the slow but reliable master store that the rest of the ML pipeline feeds off.

> hell not suitable to fan out ML workloads

Depends on the scale? Not everyone processes petabytes of data.

> PG might play the role of the slow

Do you have any benchmarks on hand to support this? I believe highly optimized C code in PG can be significantly faster than Scala inside Spark.

> I believe highly optimized C code in PG can be significantly faster than Scala inside Spark.

There's no question about this. If you can express your task in terms of PG on a single instance, then you probably should.

When you get to more complex tasks, like running input through GloVe and pushing ngrams to a temporal store, PG offers very little, which is fine: it's not at all what PG is designed for. Inter-node IO eclipses single-node perf, which is why Spark is used despite being terribly inefficient (although Spark is so inefficient that for intermediate-sized workloads you'd actually be better off vertically scaling a single node and using something else). PG won't help at all with these tasks.

Also, that smorgasbord of extensions GP listed isn't offered by any cloud vendor as a managed service afaik, meaning you must roll and manage your own. Depending on your needs, that might be a show stopper.

> like running input through GloVe and pushing ngrams to a temporal store

Why exactly do you think PG will not do this well?

Tell me how you'd do it and I'll tell you why it won't work :)

GloVe vectors are stored in a table: token -> vector. A function tokenizes the text and stores the tokens in another table: text_id, token.

Then you join the first and second tables.
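
A rough sketch of the two-table approach described above, for illustration only. It uses Python's built-in sqlite3 as a stand-in for PostgreSQL so it runs anywhere. The table names (glove, text_tokens), the pos column, the whitespace tokenizer, and the toy 3-dimensional vectors are all assumptions, not part of the original suggestion.

```python
import json
import sqlite3

# Stand-in for PostgreSQL: an in-memory SQLite database (assumption, for a runnable sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE glove (token TEXT PRIMARY KEY, vector TEXT)")
# pos is an added, hypothetical column so token order can be restored after the join.
conn.execute("CREATE TABLE text_tokens (text_id INTEGER, pos INTEGER, token TEXT)")

# Toy vectors made up for illustration; real GloVe vectors have 50 or more dimensions.
toy_vectors = {
    "postgres": [0.1, 0.2, 0.3],
    "is": [0.0, 0.1, 0.0],
    "fast": [0.5, 0.4, 0.1],
}
conn.executemany(
    "INSERT INTO glove VALUES (?, ?)",
    [(tok, json.dumps(vec)) for tok, vec in toy_vectors.items()],
)


def store_tokens(text_id, text):
    """Naive whitespace tokenizer; writes one row per token."""
    rows = [(text_id, pos, tok) for pos, tok in enumerate(text.lower().split())]
    conn.executemany("INSERT INTO text_tokens VALUES (?, ?, ?)", rows)


def embeddings_for(text_id):
    """Join tokens to their vectors; the inner join drops tokens with no vector."""
    cur = conn.execute(
        "SELECT t.token, g.vector FROM text_tokens t "
        "JOIN glove g ON g.token = t.token "
        "WHERE t.text_id = ? ORDER BY t.pos",
        (text_id,),
    )
    return [(tok, json.loads(vec)) for tok, vec in cur]


store_tokens(1, "Postgres is fast")
store_tokens(2, "Spark is slow")  # "spark" and "slow" have no vector in the toy table
```

Calling embeddings_for(1) returns the three (token, vector) pairs in order, while embeddings_for(2) returns only the pair for "is", since the inner join silently discards out-of-vocabulary tokens. A LEFT JOIN would keep them with a NULL vector instead.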

Also, I think the typical scenario is to resolve embeddings in your model code or data input pipeline.
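
For comparison, resolving embeddings in the input pipeline might look roughly like the sketch below. The EMBEDDINGS dictionary, the zero-vector fallback, and the whitespace tokenizer are hypothetical; a real pipeline would load the vectors from a GloVe file and use a proper tokenizer.

```python
# Hypothetical in-memory lookup table; in practice this would be loaded from a GloVe file.
EMBEDDINGS = {
    "postgres": [0.1, 0.2, 0.3],
    "fast": [0.5, 0.4, 0.1],
}
# Assumed fallback vector for out-of-vocabulary tokens.
UNKNOWN = [0.0, 0.0, 0.0]


def embed(text):
    """Tokenize on whitespace and map each token to its vector (or the fallback)."""
    return [EMBEDDINGS.get(tok, UNKNOWN) for tok in text.lower().split()]
```

Here the lookup is a plain dictionary access per token, so no database round trip is involved.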