Hacker News new | ask | show | jobs
by timwis 1704 days ago
This sounds great! I’ve wanted to create an open data portal for a while that lets you spin up a (ephemeral, read-only) Postgres database of a dataset and run queries on it, maybe with a notebook. Sounds like this might be perfect!
2 comments

[Disclaimer: I work for TileDB, Inc]

We have developed just this[1] except using a MariaDB storage engine (MyTile) we've written. You can serverless run queries against TileDB arrays without spinning up a MariaDB instance. You can run any type of query MariaDB supports (joins, aggregates, CTE, etc). I've linked the basic documentation and an example notebook below[2]. You can run SQL queries from python/R or even JS or curl. We support a number of data return formats, i.e. arrow and JSON to facilitate use cases.

I'll also mention that we have a number of public example dataset[3] in TileDB cloud, such as the NYC taxi data used in this notebook[4], which you can explore!

[1] https://docs.tiledb.com/cloud/api-reference/serverless-sql

[2] https://cloud.tiledb.com/notebooks/details/TileDB-Inc/Quicks...

[3] https://cloud.tiledb.com/explore/arrays

[4] https://cloud.tiledb.com/notebooks/details/TileDB-Inc/tutori...

You might like what we’re building at Splitgraph: https://www.splitgraph.com/connect
Oh wow, very relevant indeed! I guess I thought ephemeral DBs would be better so that a user’s expensive query wouldn’t bog down the db for other users. And rather than just limiting them, enabling them to do whatever queries they could with pg running locally
The query layer is composed of ephemeral Postgres instances that get created on-the-fly. These instances either proxy to live data via FDW, or they lazily download the columnar fragments necessary to resolve a query. This is meant to perform as a horizontally scalable caching layer for shared queries. For more optimized/predictable use-cases you can “check out” a version of a data image ahead-of time. In this case you use the `sgr` command line tool to load the objects of an image into a Postgres database that you can then scale vertically as normal.

Our site is a bit out of date — we’ve got a new marketing site coming shortly, and we’ve gathered an all-star engineering team to take this to the next level. Stay tuned :)

(Or better yet if you want a unified data stack for your team, please get in touch… we can load data from 100+ sources into versioned Splitgraph images, or we can index and proxy to existing data sources with federated querying).

This is something we are currently building at Klarrio on top of YugabyteDB.

We have opted for YugabyteDB table spaces to solve the isolation problem. We allocate a number of dedicated tablet servers to a tenant so that their db objects are close to each other. YugabyteDB executes queries only on those tablet servers where the data relevant to the query is. A resource heavy tenant doesn’t have an impact on other tenants. It’s not really “serverless” but does seem to solve the problem at a cost.

Tenants receive complete databases with grant permissions so they can do as they please.

Because YugabyteDB uses Postgres 11 under the hood, we have an additional sanity layer in form of a Postgres process utility hook which prevents the tenant from escaping the sandbox.