by ekzhu 1504 days ago
Great idea! I see this is implemented using the Python language interface supported by PostgreSQL, importing sklearn models. I do wonder how scalable this is, considering the serialization/deserialization overhead between the Postgres core and Python. Do you see any significant performance difference between this and training the sklearn models directly on something like DataFrames?

This is an interesting benchmark that I'll try to code up. It does seem a bit like an apples-to-oranges comparison, though, since an in-memory DataFrame had to come from somewhere, either a CSV or a database like Postgres, in which case my money is on Postgres outcompeting a standalone process parsing CSV.

In the end, though, it'll be important to have benchmarks for all the key steps in the process, both in terms of memory and compute. On a hunch, I think the memory inefficiency involved in high level pandas operations is more likely to be a driving force to move operations into lower layers, than CPU runtime.

True, the DataFrame is loaded from disk, but it is possible that batch loading is faster (especially with structured data) than row-by-row translation of Postgres types into Python types. It would be interesting to see the benchmark results.
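For what it's worth, here is a rough sketch of how such a comparison might be set up. It is not the project's benchmark: it uses only the Python standard library, with an in-memory SQLite table standing in for Postgres and plain lists standing in for a DataFrame, and the names (`load_from_csv`, `load_from_db`, `run_benchmark`, table `t`) are made up for illustration. It measures both wall time and peak Python allocations for each loading path.

```python
import csv
import io
import sqlite3
import time
import tracemalloc


def make_rows(n):
    # Synthetic (id, value) rows; purely illustrative data.
    return [(i, i * 0.5) for i in range(n)]


def load_from_csv(text):
    # Batch path: parse the whole CSV, then convert each column in one pass.
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    ids, values = zip(*reader)
    return [int(x) for x in ids], [float(x) for x in values]


def load_from_db(conn):
    # Row-by-row path: the driver hands back one converted row at a time.
    # ASSUMPTION: SQLite is only a stand-in here; Postgres will behave differently.
    ids, values = [], []
    for row_id, value in conn.execute("SELECT id, value FROM t ORDER BY id"):
        ids.append(row_id)
        values.append(value)
    return ids, values


def measure(fn, *args):
    # Returns (result, elapsed seconds, peak traced bytes) for one call.
    # Note: tracemalloc adds overhead, so the timings are inflated.
    tracemalloc.start()
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak


def run_benchmark(n=10_000):
    rows = make_rows(n)

    # Prepare the CSV source in memory.
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["id", "value"])
    writer.writerows(rows)

    # Prepare the database source in memory.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER, value REAL)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

    csv_result = measure(load_from_csv, buf.getvalue())
    db_result = measure(load_from_db, conn)
    conn.close()
    return csv_result, db_result


if __name__ == "__main__":
    for name, (_, elapsed, peak) in zip(("csv", "db"), run_benchmark()):
        print(f"{name}: {elapsed:.4f}s, peak {peak / 1024:.0f} KiB")
```

A real comparison would swap in pandas and an actual Postgres connection, and would time the type translation inside the database's Python interface separately, since that is the step in question.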

> I think the memory inefficiency involved in high level pandas operations is more likely to be a driving force to move operations into lower layers, than CPU runtime.

Indeed. It's not only memory but also the inefficiency of Python itself. It would be great if feature engineering pipelines could be pushed down to lower layers as well. For now, though, the usability of Python is still unparalleled.