by missosoup 2428 days ago
> I believe highly optimized C code in PG can be significantly faster than Scala inside Spark.

There's no question about this. If you can express your task in terms of PG on a single instance, then you probably should.

When you get to more complex tasks, like running input through GloVe and pushing ngrams to a temporal store, PG offers very little, which is fine; it's not at all what PG is designed for. Inter-node IO eclipses single-node perf, which is why Spark is used despite being terribly inefficient (although Spark is so inefficient that for mid-sized workloads you'd actually be better off vertically scaling a single node and using something else). PG won't help at all with these tasks.

Also, that smorgasbord of extensions GP listed isn't offered by any cloud vendor as a managed service afaik, meaning you must roll and manage your own. Depending on your needs, that might be a show stopper.

> like running input through GloVe and pushing ngrams to a temporal store

Why exactly do you think PG will not do this well?

Tell me how you'd do it and I'll tell you why it won't work :)

GloVe vectors are stored in a table: token -> vector. A function tokenizes the text and stores the result in another table: text_id, token.

Then you join first and second table.
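
For concreteness, here is a minimal sketch of that two-table approach. It uses Python's built-in sqlite3 as a stand-in for PG so that it runs anywhere. The table names (`glove`, `text_tokens`), the extra `pos` column, the toy vectors, and the whitespace tokenizer are all assumptions made for illustration, not anything specified in the thread.

```python
import sqlite3

# In-memory SQLite stands in for PG; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE glove (token TEXT PRIMARY KEY, vector TEXT);
    CREATE TABLE text_tokens (text_id INTEGER, pos INTEGER, token TEXT);
""")

# Toy 2-dimensional "embeddings" stored as text; real GloVe vectors are much wider.
conn.executemany(
    "INSERT INTO glove VALUES (?, ?)",
    [("spark", "0.1,0.9"), ("is", "0.5,0.5"), ("slow", "0.8,0.2")],
)

def tokenize_and_store(text_id, text):
    """Naive whitespace tokenizer that writes one row per token."""
    rows = [(text_id, pos, tok) for pos, tok in enumerate(text.lower().split())]
    conn.executemany("INSERT INTO text_tokens VALUES (?, ?, ?)", rows)

tokenize_and_store(1, "Spark is slow")

# The join resolves each token to its vector; tokens missing from glove drop out.
resolved = conn.execute("""
    SELECT t.text_id, t.token, g.vector
    FROM text_tokens AS t
    JOIN glove AS g ON g.token = t.token
    ORDER BY t.text_id, t.pos
""").fetchall()
```

The `pos` column is only there to keep tokens in their original order after the join. Whether this holds up at pipeline scale is exactly what the rest of the thread argues about.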

Also, I think the typical scenario is to resolve embeddings in your model code or data input pipeline.

> Also, I think the typical scenario is to resolve embeddings in your model code or data input pipeline.

Correct. PG has no place in this workload other than being the final store for the model output. And even then, you'd be using a column store like Redshift or ClickHouse. PG isn't even suitable for the ngram counters because its ingest rates are way too slow to keep up with a fanned out model spitting out millions of ngrams per second in addition to everything else going on in the pipeline.

You -could- probably do it all in PG. But that'd be a silly esoteric challenge exercise and not something anyone would try on a project. I am sure you recognise that.

I would say a "fanned out model spitting out millions of ngrams per second" is a much more unusual exercise than using PG for an ETL workload.

A typical Twitter post will have about 50 2/3/4-grams. Let's ignore skipgrams. The Twitter decahose will throw about 600 posts at you per second. That's 30k barebones ngrams per second to keep up with the decahose.
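
A rough sketch of where those figures come from, assuming contiguous word-level n-grams and a hypothetical 19-token post (the post length and the `ngrams` helper are illustrative assumptions; only the 50 and 600 figures come from the comment above):

```python
def ngrams(tokens, sizes=(2, 3, 4)):
    """Return all contiguous n-grams of the given sizes as tuples."""
    return [
        tuple(tokens[i:i + n])
        for n in sizes
        for i in range(len(tokens) - n + 1)
    ]

# A hypothetical 19-token post yields 18 + 17 + 16 = 51 n-grams, close to the ~50 quoted.
tokens = [f"w{i}" for i in range(19)]
per_post = len(ngrams(tokens))

POSTS_PER_SECOND = 600   # decahose rate quoted above
NGRAMS_PER_POST = 50     # rounded per-post figure quoted above
ngrams_per_second = POSTS_PER_SECOND * NGRAMS_PER_POST  # 30,000
```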

But you have a year's worth of historical data that you want to work with. If you're able to process 1m ngrams per second, it'll take around 11 days to get through that. You probably want to get closer to 10m/s if you're tweaking your model and want to iterate reasonably quickly. Of course there are ways to optimise all that and batch it and whatnot, but basically any big data task that needs to work on historical data and iterate on its models quickly ends up with Kafka clusters piping millions of messages per second to keep those iteration times productive.
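
A back-of-envelope check of the replay time, assuming the 30k ngrams/s rate holds steadily for a full 365-day year (the constant rate and the helper name are assumptions for illustration):

```python
SECONDS_PER_DAY = 86_400
NGRAMS_PER_SECOND = 30_000  # live decahose rate from the estimate above

# One year of history at a constant rate: 946,080,000,000 ngrams.
backlog = NGRAMS_PER_SECOND * SECONDS_PER_DAY * 365

def days_to_process(total_ngrams, rate_per_second):
    """Wall-clock days needed to replay total_ngrams at a fixed throughput."""
    return total_ngrams / rate_per_second / SECONDS_PER_DAY

days_at_1m = days_to_process(backlog, 1_000_000)    # about 10.95 days
days_at_10m = days_to_process(backlog, 10_000_000)  # about 1.1 days
```

Under these inputs, one pass over the year takes roughly 11 days at 1m/s and a little over a day at 10m/s, which is the gap that matters when you're iterating on a model.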

Ultimately this post is about Spark, and the comment that started this was someone listing PG 'replacements' for traditional ML pipeline components. If you need Spark, you're at scales where PG has no place.