Hacker News new | ask | show | jobs
by gscho 2044 days ago
This is really cool. I had not heard of Foreign Data Wrappers for Postgres before! Are these used in production commonly or more of a toy thing?
8 comments

We use them in production at Splitgraph [0] to power our DDN (like a CDN, but for data). We make a PostgreSQL-compatible endpoint available to the public to query any of the tens of thousands of open datasets by referencing them as virtual tables: they're not hosted by us but we proxy to them using Postgres FDWs. When a query comes in, we intercept it and redirect it to a FDW instance that handles query translation and planning from the PG dialect to that of the backend data source.

We wrote an FDW for Socrata-powered [1] government open data portals to query the public datasets that we index in the Splitgraph catalog as a proof-of-concept. However, there are plenty of other FDWs that we're working on integrating to let people add their own backend data sources (RDS, Snowflake etc).

FDW plugin quality varies (some of them can't push down all predicates or JOINs) but it's definitely an interesting way to think about accessing data. We also added a lot of scaffolding around foreign data wrappers in our open-source tool [2] that makes it easy to add a FDW-managed data source to a PostgreSQL instance.

[0] https://www.splitgraph.com/blog/data-delivery-network-launch

[1] https://www.tylertech.com/products/socrata

[2] https://www.splitgraph.com/blog/foreign-data-wrappers

Then you might be even more surprised to find out that Foreign Data Wrappers are Postgres' implementation of the ANSI SQL standard extension known as "SQL/MED" [1], where MED stands for "management of external data".

I spent a number of years working on a data federation/virtualization engine - and SQL/MED is very much related to that.

It's actually a relatively unknown topic by many software/data engineers I have worked with, but things like GraphQL federation (example, Apollo GraphQL) or some of the more popular tools such as Presto, Dremio, Denodo, etc.) are where more advanced versions of this are today.

SQL/MED and what Postgres can do is quite cool, but just know that any time you have system boundaries you cross (e.g. between heterogenous systems), things like joins, data types, and many other things becoming a bit more difficult - or you just have to think about them more. But very cool tech.

I've used SQL/MED in Postgres, FDW, linked servers in SQL Server, database links in Oracle, and more advanced virtualization/federation engines also.

If you haven't been exposed to this area before, highly recommended as another tool to know about for the toolkit.

[1] https://en.wikipedia.org/wiki/SQL/MED

There are generally two classes for FDWs: Postgres<->Postgres and Postgres->Everything else.

The first one is generally suitable for production and is very useful for sharded Postgres when you want to communicate across shards without having to go back out through the application.

The second one's mileage really varies. Some implementations might or might not be prod ready or mig target only specific version combinations. Can be very useful for data engineering or analytics use cases for quick ETL into a staging database. Or for data migrations between database vendors.

This could be of interest to you:

Ville Tuulos - How to Build a SQL-based Data Warehouse for 100+ Billion Rows in Python

PyData SV 2014 - In this talk, we show how and why AdRoll built a custom, high-performance data warehouse in Python which can handle hundreds of billions of data points with sub-minute latency on a small cluster of servers. This feat is made possible by a non-trivial combination of compressed data structures, meta-programming, and just-in-time compilation using Numba, a compiler for numerical Python. To enable smooth interoperability with existing tools, the system provides a standard SQL-interface using Multicorn and Foreign Data Wrappers in PostgreSQL.

https://www.youtube.com/watch?v=xnfnv6WT1Ng

I definitely think they are used in production though I haven't tried it myself.

But I can find ancedots of people using it production on the web[1].

1: https://carto.com/blog/postgres-fdw/

Currently using this one https://github.com/pramsey/pgsql-ogr-fdw for an image migration project. Using it to pull image metadata thru ODBC. While it's primary design is for GIS data, it works well for any ODBC database.
Here's a great list of Postgres FDWs: https://wiki.postgresql.org/wiki/Foreign_data_wrappers

I had no idea PG has native FDWs for Twitter and S3. That's pretty awesome.

It's been used quite a lot, especially when dealing with legacy databases or having to support multiple databases with a single codebase.

There is one concern though. Both Google Cloud SQL and AWS RDS support only postgres_fdw, so one should manage their own storage, cluster and backups.

I've written a DB2 FDW that is used in production. Not as good as direct access of course, but very practical still.