| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by wizwit999 953 days ago
	you know there's this great abstraction over files we came up with in the data world called 'tables'..

2 comments

simonw 953 days ago

If you want something that looks like a table, while still benefiting from not having to download 8GB of Parquet in order to run the query, you can get one using a CTE:

    with midjourney_messages as (
        select
            *
        from read_parquet(
            list_transform(
                generate_series(0, 2),
                n -> 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/' ||
                    format('{:06d}', n) || '.parquet'
            )
        )
    )
    select sum(size) as size from midjourney_messages;

Or you can create a view:

    create view midjourney_messages as
    select * from read_parquet(
        list_transform(
            generate_series(0, 2),
            n -> 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/' ||
                format('{:06d}', n) || '.parquet'
        )
    );
    select sum(size) as size from midjourney_messages;

link

pests 952 days ago

Are there any pros/cons to the raw query, CTE, and view approches?

link

chrisjc 952 days ago

Yes.

Using a view in this example, you can't dynamically change which files are being selected (not even with joins or where clauses). What if new files are generated and suddenly there are more or less files? Therefore you probably wouldn't want to encapsulate your SQL into a view. Most of the time you would probably bind the list of files in your SQL as needed:

   SELECT SUM(size) AS size FROM read_parquet(:files);

But in this case, a table macro/function might also be an option:

   CREATE MACRO GET_TOTAL_SIZE(num_of_files) AS TABLE (
        SELECT
            SUM(size) AS size
        FROM read_parquet(
            list_transform(generate_series(0, num_of_files), n -> 'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/' || format('{:06d}', n) || '.parquet'
            )
        )
    );

    SELECT * FROM GET_MJ_TOTAL_SIZE(55);

Not necessarily related to this article, but CTEs are useful for breaking down a complex query into more understandable chunks. Moreover, you can do interesting things within the CTE's temp-tables like recursion or freezing/materializing a temp-table's results so that it only gets evaluated one time, instead of every time it gets referenced.

http://duckdb.org/docs/sql/query_syntax/with#recursive-ctes http://duckdb.org/docs/sql/query_syntax/with#materialized-ct...

link

simonw 952 days ago

From my experience with other databases my assumption for DuckDB is:

- Using a raw query, a CTE or a view will have no impact at all on query performance - they'll be optimized down to exactly the same operations (or to a query plan that's similar enough that the differences are negligible)

- CTEs are mainly useful for breaking down more complicated queries - so not great for this example, but really useful the moment you start doing anything more complicated.

- Views are effectively persistent CTEs - they're great if you want to permanently "bookmark" more complex pieces of queries to use later.

I wrote a bit more about CTEs here: https://datasette.io/tutorials/data-analysis#ctes

link

chrisjc 952 days ago

and guess what those tables use to store data a lot of the time? Parquet.

I assume that when you say "tables" you mean "external tables" since you're in the data world.

If you didn't then I guess when you say "tables", you mean "tables" with everything including the kitchen sink? Database, compute, etc...? Does the database always have to be running for the table to be accessible? Responsibility for the database hardware, resources or services?

Of course there are fantastic new databases like Snowflake and BigQuery that separate compute and storage... but do they, really? Separating storage and compute is just incredible for scaling, suspend/resume, etc. But can you query a Snowflake/BigQuery table without also having to use their compute? Is there a way that I can just get a "table" and not be forced into using a specific compute-engine and all the other bells and whistles?

So when you say "table", where and how do I get one? And to maintain the theme of the article, a columnar/OLAP/analytics "table" in particular?

As you probably know, there are several (external table) options, Apache Iceberg probably being the most talked about one at the moment. External "tables" are just collections of metadata about your files, or conventions about how to lay your files down. When you query these tables with SQL using athena, redshift, snowflake, duckdb, etc... each and everyone of those query-engines is reading parquet files.

(Snowflake, BigQuery and others are working on features to both manage and read Iceberg tables, so i kinda lied earlier)

link