Hacker News new | ask | show | jobs
by qolop 1459 days ago
Why should I use this instead of SQL?
4 comments

I work on a TUI logfile viewer that uses SQLite as a backend for doing analysis on the log messages (https://lnav.org). However, writing SQL interactively is painful since you can't really provide good auto-complete or preview, which is something I try to provide for most other operations.

The PRQL pipeline syntax would make for a much better experience for lnav since you're able to progressively refine a query without having to jump around. (You've probably noticed that many log services, like Sumologic, already provide a pipeline-style syntax instead of something SQL-like.) The nice thing is that you can simply keep typing to get the results you want and get a preview at each stage. For example, entering "from" and then pressing <TAB> would make it clear to the program that table-names should be suggested. The program could then show the first few lines of the table. Typing "from syslog_log | filter " and then pressing <TAB> would make it clear that columns from the syslog_log table should be suggested (along with some other expression stuff). And, then, the preview of the filtered output could be shown.

In the current implementation, pressing <TAB> just suggests every possible thing in the universe, whether it's appropriate or not. This leaves the poor with not much help after they've typed "SELECT". I find myself having to lookup docs/source to figure out column names or whatever and I wrote the darn thing. Ultimately, I think the analysis functionality just doesn't get used because interactively writing SQL is so user-hostile. So, I'm looking forward to seeing this succeed so that I can integrate it and still be able to use SQLite in the backend.

That's a really good question! (and one we should probably answer explicitly in the [FAQ](https://prql-lang.org/faq/) rather than just implicitly)

The README states that "PRQL is a modern language for transforming data — a simple, powerful, pipelined SQL replacement. Like SQL, it's readable, explicit and declarative. Unlike SQL, it forms a logical pipeline of transformations, and supports abstractions such as variables and functions. It can be used with any database that uses SQL, since it transpiles to SQL."

What that means to me is that PRQL more naturally maps onto how I think about and work with data.

Say I have some dataset, `employees`, and I want to answer some questions about it like, for US employees, what is the maximum and minimum salary and how many employees are there:

    from employees
    filter country == "USA"                       # Each line transforms the previous result.
    aggregate [                                   # `aggregate` reduces column to a value.
      max salary,
      min salary,
      count,                                      # Closing     commas are allowed :)
    ]

Moreover, after each line you have a valid pipeline which you can transform further by adding more steps/lines to your pipeline. This matches more closely how people construct data pipelines in R using dplyr/tidyverse and in Python using Pandas.

If you find that it doesn't map well onto how you think about data pipelines then please let us know as we're constantly looking for more real world examples to help us iterate on the language!

One benefit of SQL is that the Database Engine will do the hard work of optimizing the query plan.

Do you think the SQL complied by PRQL could be as effective and optimized by database engine as the direct-written SQL?

As you said, let the Database Engine do the hard work of optimizing the query plan for you.

I currently have no reason to believe that the PRQL generated SQL would be any worse than hand written SQL. That said, I don't think we've currently looked at any ways of passing hints to the query planner. We're always open to suggestions!

In the worst case, you have full access to the generated SQL, and for absolutely crucial queries you can hand modify that SQL. At least PRQL might have saved you the trouble of writing a cumbersome window function or something like that (see for example the example of picking the top row by some GROUP BY expression).

This reminds me of KUSTO I'm not sure how it compares to SQL in general. But it was really fun to work with for querying Azure application insigts
Because of things like WHERE/HAVING in SQL.
To avoid working with SQL strings.
SQL/jinja like dbt could also avoid working with SQL strings. what would be the better advantage?
There is already an integration for dbt: https://github.com/prql/dbt-prql

For example

    {% prql %}
    from source = {{ source('salesforce', 'in_process') }}
    derive expected_sales = probability * value
    join {{ ref('team', 'team_sales') }} [name]
    group name (
      aggregate (sum expected_sales)
    )
    {% endprql %}
would appear to dbt as

    SELECT
      name,
      SUM(source.probability * source.value) AS expected_sales
    FROM
      {{ source('salesforce', 'in_process') }} AS source
      JOIN {{ ref('team', 'team_sales') }} USING(name)
    GROUP BY
      name
dbt is definitely a use case we are very aware of and I am personally very keen on (since I use that in my $dayjob). With some of the ideas in https://github.com/prql/prql/issues/381 , I think PRQL could really shine in this area!

With your contribution we can get there faster!