| HN Mirror



	by missosoup 2428 days ago
	Yeah that's not going to work for what people call analytics workloads today. PG is great but it's not suitable to be a feature store and sure as hell not suitable to fan out ML workloads. In a modern ML stack, PG might play the role of the slow but reliable master store that the rest of the ML pipeline feeds off.

riku_iki 2428 days ago

> hell not suitable to fan out ML workloads

Doesn't that depend on the scale? Not everyone processes petabytes of data.

> PG might play the role of the slow

Do you have any benchmarks at hand to support this? I believe highly optimized C code in PG can be significantly faster than Scala inside Spark.

missosoup 2428 days ago

> I believe highly optimized C code in PG can be significantly faster than Scala inside Spark.

There's no question about this. If you can express your task in terms of PG on a single instance, then you probably should.

When you get to more complex tasks, like running input through GloVe and pushing ngrams to a temporal store, PG offers very little, which is fine; it's not at all what PG is designed for. At that point inter-node IO eclipses single-node perf, which is why Spark is used despite being terribly inefficient (although Spark is so inefficient that for intermediate-sized workloads you'd actually be better off vertically scaling a single node and using something else). PG won't help at all with these tasks.

Also, that smorgasbord of extensions GP listed isn't offered by any cloud vendor as a managed service afaik, meaning you must roll and manage your own. Depending on your needs, that might be a show stopper.

riku_iki 2428 days ago

> like running input through GloVe and pushing ngrams to a temporal store

Why exactly do you think PG will not do this well?

missosoup 2427 days ago

Tell me how you'd do it and I'll tell you why it won't work :)

riku_iki 2427 days ago

The GloVe vectors are stored in one table: token -> vector. A function tokenizes the text and stores the result in another table: text_id, token.

Then you join the first and second tables.

Also, I think the typical scenario is to resolve embeddings in your model code or data input pipeline.
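
The two-table join described in this comment can be sketched roughly as follows. This is only an illustration, not the commenter's actual setup: an in-memory SQLite database stands in for PG so the snippet runs anywhere, the table and column names (`glove`, `text_tokens`, `pos`) are hypothetical, the `pos` column is an addition to keep token order, and the vectors are made-up two-dimensional values stored as text.

```python
import sqlite3

# In-memory SQLite stands in for PG purely so the sketch is runnable;
# all table and column names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE glove (token TEXT PRIMARY KEY, vector TEXT);
    CREATE TABLE text_tokens (text_id INTEGER, pos INTEGER, token TEXT);
""")

# Toy "embeddings" with made-up values, stored as comma-separated text.
conn.executemany(
    "INSERT INTO glove VALUES (?, ?)",
    [("postgres", "0.1,0.2"), ("is", "0.3,0.4"), ("fast", "0.5,0.6")],
)

def tokenize_and_store(text_id, text):
    """Naive whitespace tokenizer; writes one row per token."""
    rows = [(text_id, pos, tok) for pos, tok in enumerate(text.lower().split())]
    conn.executemany("INSERT INTO text_tokens VALUES (?, ?, ?)", rows)

def embeddings_for(text_id):
    """Join the token table against the embedding table.

    Tokens without an embedding drop out because this is an inner join.
    """
    cur = conn.execute(
        """
        SELECT t.token, g.vector
        FROM text_tokens AS t
        JOIN glove AS g ON g.token = t.token
        WHERE t.text_id = ?
        ORDER BY t.pos
        """,
        (text_id,),
    )
    return [(tok, [float(x) for x in vec.split(",")]) for tok, vec in cur]

tokenize_and_store(1, "Postgres is fast enough")
result = embeddings_for(1)
```

Here `result` holds one (token, vector) pair for each token that has an embedding, in text order. The sketch says nothing about throughput at scale, which is the point the thread is actually arguing about.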

missosoup 2427 days ago

> Also, I think the typical scenario is to resolve embeddings in your model code or data input pipeline.

Correct. PG has no place in this workload other than being the final store for the model output. And even then, you'd be using a column store like Redshift or ClickHouse. PG is not even suitable for the ngram counters, because its ingest rates are way too slow to keep up with a fanned-out model spitting out millions of ngrams per second in addition to everything else going on in the pipeline.

You -could- probably do it all in PG. But that'd be a silly esoteric challenge exercise and not something anyone would try on a project. I am sure you recognise that.
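
For readers unfamiliar with the "ngram counters" mentioned in the comment above, here is a minimal sketch of what such a counter computes. It is not the commenter's pipeline: the function names are hypothetical, the tokenizer is a plain whitespace split, and counting per batch in memory is an assumption made only to keep the example small.

```python
from collections import Counter

def ngrams(tokens, n):
    """Yield each run of n consecutive tokens as a tuple."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

def count_ngrams(texts, n=2):
    """Count n-grams across a batch of texts, in memory."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(text.lower().split(), n))
    return counts

batch = ["pg is fast", "pg is reliable"]
bigram_counts = count_ngrams(batch)

# Each distinct n-gram in the batch becomes one (ngram, count) row
# that whatever store sits downstream would have to ingest.
rows = sorted(bigram_counts.items())
```

Every row in `rows` is a write the downstream store has to absorb, which is where the ingest-rate concern in the comment comes from.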
