| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by eknkc 706 days ago

Is there a good general purpose solution where I can store a large read only database in s3 or something and do lookups directly on it?

Duckdb can open parquet files over http and query them but I found it to trigger a lot of small requests reading bunch of places from the files. I mean a lot.

I mostly need key / value lookups and could potentially store each key in a seperate object in s3 but for a couple hundred million objects.. It would be a lot more managable to have a single file and maybe a cacheable index.

5 comments

jiggawatts 706 days ago

> trigger a lot of small requests reading bunch of places from the files. I mean a lot.

That’s… the whole point. That’s how Parquet files are supposed to be used. They’re an improvement over CSV or JSON because clients can read small subsets of them efficiently!

For comparison, I’ve tried a few other client products that don’t use Parquet files properly and just read the whole file every time, no matter how trivial the query is.

eknkc 706 days ago

This makes sense but the problem I had with duckdb + parquet is it looks like there is no metadata caching so each and every query triggers a lot of requests.

Duckdb can query a remote duckdb database too, in that case it looks like there is caching. Which might be better.

I wonder if anyone actually worked on a specific file format for this use case (relatively high latency random access) to minimize reads to as little blocks as possible.

jiggawatts 706 days ago

Sounds like a bug or missing feature in DuckDB more than an issue with the format

tionis 706 days ago

You could use a sqlite database and use range queries using something like this: https://github.com/psanford/sqlite3vfshttp https://github.com/phiresky/sql.js-httpvfs

Simon Willison wrote about it: https://simonwillison.net/2022/Aug/10/sqlite-http/

eknkc 706 days ago

Yep this thing is the reason I thought about doing it in the first place. Tried duckdb which has built in support for range requests over http.

Whole idea makes sense but I feel like the file format should be specifically tuned for this use case. Otherwise you end up with a lot of range requests because it was designed for disk access. I wondered if anything was actually designed for that.

hobofan 706 days ago

Parquet and other columnar storage formats are essentially already tuned for that.

A lot of requests in themselves shouldn't be that horrible with Cloudfront nowadays, as you both have low latency and with HTTP2 a low-overhead RPC channel.

There are some potential remedies, but each come with significant architetural impact:

- Bigger range queries; For smallish tables, instead of trying to do point-based access for individual rows, instead retrieve bigger chunks at once and scan through them locally -> Less requests, but likely also more wasted bandwidth

- Compute the specific view live with a remote DuckDB -> Has the downside of having to introduce a DuckDB instance that you have to manage between the browser and S3

- Precompute the data you are interested into new parquest files -> Only works if you can anticipate the query patterns enough

I read in the sibling comment that your main issue seems to be re-reading of metadata. DuckDB is AFAIK able to cache the metadata, but won't across instances. I've seen someone have the same issue, and the problem was that they only created short-lived DuckDB in-memory instances (every time the wanted to run a query), so every time the fresh DB had to retrieve the metadata again.

eknkc 706 days ago

Thanks for the insights. Precomputing is not really suitable for this and the thing is, I'm mostly using it as a lookup table on key / value queries. I know Duckdb is mostly suitable for aggregation but the http range query support was too attractive to pass on.

I did some tests, querying "where col = 'x'". If the database was a remote duckdb native db, it would issue a bunch of http range requests and the second exact call would not trigger any new requests. Also, querying for col = foo and then col = foob would yield less and less requests as I assume it has the necesary data on hand.

Doing it on parquet, with a single long running duckdb cli instance, I get the same requests over and over again. The difference though, I'd need to "attach" the duckdb database under a schema name but would query the parquet file using "select from 'http://.../x.parquet'" syntax. Maybe this causes it to be ephemeral for each query. Will see if the attach syntax also works for parquet.

hobofan 705 days ago

I think both should work, but you have to set the object cache pragma IIRC: https://duckdb.org/docs/configuration/pragmas.html#object-ca...

arcanemachiner 706 days ago

That whole thing still blows my mind.

cdchn 706 days ago

>Is there a good general purpose solution where I can store a large read only database in s3 or something and do lookups directly on it?

I think this is pretty much what AWS Athena is.

tiew9Vii 705 days ago

Cloud backed SQLLite looks like it might be good for this. Doesn’t support S3 though

https://sqlite.org/cloudsqlite/doc/trunk/www/index.wiki

imiric 706 days ago

ClickHouse can also read from S3. I'm not sure how it compares to DuckDB re efficiency, but it worked fine for my simple use case.

masterj 706 days ago

Neither of these support indexes afaik. They are designed to do fast scans / computation.

hodgesrm 706 days ago

It depends on what you mean by "support." ClickHouse as I recall can read min/max indexes from Parquet row groups. One of my colleagues is working on a PR to add support for bloom filter indexes. So that will be covered as well.

Right now one of the main performance problems is that Clickhouse does not cache index metadata yet, so you still have to scan files rather than keeping the metadata in memory. ClickHouse does this for native MergeTree tables. There are a couple of steps to get there but I have no doubt that metadata caching will be properly handled soon.

Disclaimer: I work for Altinity, an enterprise provider for ClickHouse software.

orthecreedence 705 days ago

Depends what you mean by "indexes." DuckDB can read path parameters (ex s3://my-bucket/category=beverages/month=2022-01-01/*/*.parquet) where `category` and `month` can be filtered at the query level, skipping any non-matching files. I think that qualifies as an index. Obviously, you'd have to create these up-front, or risk moving lots of data between paths.

canadiantim 705 days ago

LanceDB