A PostgreSQL Extension that allows you to access any other datasource

Y	Hacker News new \| ask \| show \| jobs

	A PostgreSQL Extension that allows you to access any other datasource (multicorn.org)
	140 points by treo 5338 days ago

4 comments

rdunklau 5338 days ago

Multicorn developer here, I'll be glad to answer any question.

link

wulczer 5338 days ago

Crap, I'm working on something exactly like this :D Was about to upload it to Github just today, but still wanted to fix a few things.

Yours looks quite more finished, one thing that would be very useful is handling Python to Postgres type conversion, like PL/Python does. After that, the next step is caching I/O functions for the duration of the scan, which PL/Python also does.

Nice to see that the idea made sense for more than one person, hope Multicorn will rock!

link

rdunklau 5338 days ago

Sorry to beat you at it ;)

We currently have some very rudimentary python to postgres type conversion, but this area still need a lot of improvements.

You should release your code, I'm sure you have a wide range of ideas worth merging into Multicorn !

link

wulczer 5338 days ago

Actually, the ideal thing to happen would be for PL/Python to somehow become a shared library that provides the typecasting routines, but I'm afraid that's a very big task and difficult to get accepted upstream. It's sad that PL/Python already includes all the code you need to transform Python lists to Postgres arrays, bools to PG bools and so on and every project like this one has to duplicate all that logic.

BTW: in my approach I opted for passing the entire Qual node as a string (using nodeToString) and my plan was to write a Python parser for that format that takes a string and returns a Python object. Not sure how hard that will turn out.

link

rdunklau 5338 days ago

I totally agree with you regarding PL/Python. I may try to ask upstream what they think about it.

link

danmaz74 5338 days ago

This looks interesting, thanks for sharing.

But what kind of performance is it possible to get with complex queries on these external sources? Does the PG server need to load all the data in memory from the source to do filters, joins and sorts?

link

rdunklau 5338 days ago

Thank you for your interest!

The postgresql plan is parsed, and passed as a list of "quals", objects representing simple filters. As an implementer, you don't HAVE to enforce those, since postgresql will recheck them for you anyway, but they can be quite handy.

For example, you can look at the imap foreign data wrapper (https://github.com/Kozea/Multicorn/blob/master/python/multic...) to see how the conditions from postgresql are converted to an IMAP filter, allowing for server side filtering.

The required columns are also provided, so if you don't need the email payload, the foreign data wrapper will not fetch it.

For joins, it will depend on the execution plan. There is still plenty of work on parsing the postgresql execution plan into something more useful, but the current set of "optimizations" is sufficient for our main use cases.

link

danmaz74 5338 days ago

Thank you for the explanation. I don't have an immediate need for this, but it's an interesting approach. Good luck!

link

leopard 5337 days ago

I try to install this extension on Ubuntu 11.10. First of all error in instalation -

gcc -g -O2 -fPIC -fPIC -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wformat-security -fno-strict-aliasing -fwrapv -g -lpythonpython2: -fpic -L/usr/lib -Wl,-Bsymbolic-functions -Wl,--as-needed -Wl,--as-needed -Wl,--as-needed -lpythonpython2: -shared -o src/multicorn.so src/multicorn.o /usr/bin/ld: cannot find -lpythonpython2: /usr/bin/ld: cannot find -lpythonpython2: collect2: ld returned 1 exit status make: * [src/multicorn.so] Error 1 rm src/multicorn.o ERROR: command returned 2

This is because Ubuntu have - python and python2.7 bin files. I fixed this by creating symlink on python.

After instalation another problem:

$ psql psql (9.1.1) Type "help" for help.

leo=# CREATE EXTENSION multicorn; ERROR: could not load library "/usr/lib/postgresql/9.1/lib/multicorn.so": /usr/lib/postgresql/9.1/lib/multicorn.so: undefined symbol: _Py_NoneStruct

All tested on specialy created for this system: $ python -V Python 2.7.2+ $ psql -V psql (PostgreSQL) 9.1.1 contains support for command-line editing

On this my tests end.

link

rdunklau 5337 days ago

Thank you for this report. I'll try to make it work on Ubuntu, expect the documentation to be updated in the next few days.

PS: if you're the 'leopard' who requested a redmine account, it should be activated now, feel free to report it there.

link

leopard 5337 days ago

Ok, I already added this to tracker. I am very interesting in this extension, its amazing.

link

muyuu 5338 days ago

I have well over 10K messages in gmail.

Will this work?

I was thinking in downloading the whole thing via POP3, creating a Unix mailbox and indexing that.

link

rdunklau 5338 days ago

It depends on what your use case is. If you have well targeted queries (eg, with filters on "From" or "Subject" headers, and only query the message headers (not the payload) you can have a pretty good performance.

What is your use case, exactly ?

link

muyuu 5337 days ago

Most of my filtering would be based on the contents of the subject header. I have a site that sends me copies of particular transactions, and these few thousand messages are the ones I'd like to query. Ideally I'd like to have them downloaded for statistical processing.

link

Jach 5337 days ago

I was at OSCON Data this year that held a talk about the new foreign data connectors piece, it's great that Postgres has them now. Want twitter data? Make an interface to translate twitter data to rows, it's that easy! LucidDB has had foreign data connectors for a while but you have to use Java at least for part of it. I like the generator-style of yielding list-rows shown here: https://github.com/Kozea/Multicorn/blob/master/python/multic...

link