> Also, I think the typical scenario is to resolve embeddings in your model code or data input pipeline.
Correct. PG has no place in this workload other than being the final store for the model output. And even then, you'd be using a column store like Redshift or ClickHouse. PG isn't even suitable for the ngram counters, because its ingest rates are way too slow to keep up with a fanned-out model spitting out millions of ngrams per second in addition to everything else going on in the pipeline.
You -could- probably do it all in PG. But that'd be a silly esoteric challenge exercise and not something anyone would try on a project. I am sure you recognise that.
A typical Twitter post will have about 50 2/3/4-grams. Let's ignore skipgrams. The Twitter decahose will throw about 600 posts at you per second. That's 30k barebones ngrams per second just to keep up with the decahose.
But you have a year's worth of historical data that you want to work with. At 30k ngrams per second that's roughly 950 billion ngrams, so if you're able to process 1m ngrams per second, it'll take about 11 days to get through it. You probably want to get closer to 10m/s (about a day) if you're tweaking your model and want to iterate reasonably quickly. Of course there are ways to optimise all that and batch it and whatnot, but basically any big data task that needs to work on historical data and iterate on its models quickly ends up with Kafka clusters piping millions of messages per second to keep those iteration times productive.
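For anyone who wants to plug in their own numbers, here's a rough sketch of that arithmetic. It only uses the figures quoted above; the 365-day year, the constant decahose rate, and the helper name `days_to_process` are my own assumptions for illustration.

```python
# Back-of-the-envelope throughput arithmetic using the figures quoted above.
SECONDS_PER_DAY = 24 * 60 * 60

ngrams_per_post = 50      # "about 50 2/3/4-grams" per post
posts_per_second = 600    # decahose rate quoted above
live_rate = ngrams_per_post * posts_per_second  # ngrams/s needed to keep up

# Assumption: "a year" means 365 days of decahose at a constant rate.
backlog = live_rate * 365 * SECONDS_PER_DAY


def days_to_process(total_ngrams: int, ngrams_per_second: float) -> float:
    """Wall-clock days needed to replay total_ngrams at a given throughput."""
    return total_ngrams / ngrams_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"live rate: {live_rate:,} ngrams/s")
    print(f"one-year backlog: {backlog:.3g} ngrams")
    for rate in (1e6, 1e7):
        print(f"{rate:,.0f} ngrams/s -> {days_to_process(backlog, rate):.1f} days")
```

Swap in your own per-post ngram count or target throughput to see how the iteration time moves.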
Ultimately this post is about Spark, and the comment that started this was someone listing PG 'replacements' for traditional ML pipeline components. If you need Spark, you're at scales where PG has no place.
That's why I mentioned scale in my first comment. For sub-TB data sizes, with a 16-core CPU and NVMe RAID (you can get such a machine for less than $1k nowadays), PG will be just fine.
Also, as I mentioned, in a typical ML pipeline you can generate ngrams in your model's input function (the Dataset API in TF); you don't need to store them anywhere.
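To make that concrete, here's a minimal plain-Python sketch of generating ngrams lazily in an input function, so nothing is ever stored. The names `ngrams` and `input_fn` and the whitespace tokenisation are hypothetical choices of mine, not anything from TF itself.

```python
from typing import Iterable, Iterator, List, Tuple


def ngrams(tokens: List[str], sizes: Tuple[int, ...] = (2, 3, 4)) -> Iterator[str]:
    """Yield every contiguous n-gram of the requested sizes, space-joined."""
    for n in sizes:
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def input_fn(posts: Iterable[str]) -> Iterator[List[str]]:
    """Hypothetical input function: one list of n-grams per post, built lazily.

    Nothing is written to a database; the n-grams exist only while the
    consumer (e.g. a training loop) iterates over this generator.
    """
    for post in posts:
        # Assumption: naive lowercase + whitespace tokenisation.
        yield list(ngrams(post.lower().split()))


if __name__ == "__main__":
    for features in input_fn(["PG will be just fine", "you need Spark"]):
        print(features)
```

In TF, a generator like this could presumably be handed to `tf.data.Dataset.from_generator`, or the same thing done with tensor ops inside a `map`; I haven't benchmarked either, so treat it as an illustration of the idea rather than a tuned pipeline.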