by dmitrykoval 1872 days ago
Following similar observations, I was wondering whether one can actually execute SQL queries inside a Python process with access to native Python functions and NumPy as UDFs. Thanks to Apache Arrow, one can essentially combine the DataFrame API with SQL within data analysis workflows, without needing to copy the data or write operators in a mix of C++ and Python, all within the confines of the same Python process.

So I implemented Vinum, which lets you execute queries that invoke NumPy or Python functions available to the interpreter as UDFs. For example: "SELECT value, np.log(value) FROM t WHERE ..".

https://github.com/dmitrykoval/vinum

Finally, DuckDB is making great progress integrating pandas DataFrames into its API, with UDF support coming soon. I would certainly recommend giving it a shot for OLAP workflows.

1 comments

Also I think SQLite lets you call Python functions from the SQL program.
That's correct, but SQLite would have to serialize/deserialize the data sent to the Python function (from C to Python and back), while Arrow lets you get a "view" of the same data without making a copy. That is probably not an issue in OLTP workloads, but may become more visible in OLAP.
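
To make the SQLite side concrete, here is a minimal sketch using only Python's standard-library sqlite3 module. The table name t and column value follow the query above; the py_log wrapper and the call counter are hypothetical, added only to show that the UDF is invoked once per row, with each value converted from SQLite's C representation into a Python object and back.

```python
import math
import sqlite3

calls = 0  # hypothetical counter: how many times SQLite calls into Python


def py_log(x):
    """Scalar UDF: SQLite invokes this once per row with a Python float."""
    global calls
    calls += 1
    return math.log(x)


conn = sqlite3.connect(":memory:")
# Register the Python function as a one-argument SQL function.
conn.create_function("py_log", 1, py_log)

conn.execute("CREATE TABLE t (value REAL)")
conn.executemany("INSERT INTO t VALUES (?)", [(1.0,), (10.0,), (100.0,)])

# Every selected row crosses the C/Python boundary individually.
rows = conn.execute(
    "SELECT value, py_log(value) FROM t WHERE value > 0"
).fetchall()

print(rows)
print("Python calls:", calls)
```

With three rows this costs nothing noticeable, but the per-row round trip is the overhead that grows with table size in OLAP-style scans, whereas a vectorized UDF over an Arrow column would be called once for the whole batch.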