by darklinear 54 days ago
Analytics/data engineer here. The approach described here falls apart on most datasets I've seen, because the source data has wrinkles whose explanation almost always lives outside the data itself. That holds even for a typical simple question a founder might have, like "what's the revenue for product X last month". Perhaps some orders don't have a Stripe record associated, because we receive the money through a separate invoicing process. Perhaps there's a high revenue breakage rate between when a purchase is originally placed and when the payment goes through, so a naive query for point-in-time revenue will almost certainly over-count revenue. The SQL the agent generates might not even yield directionally correct answers.
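
To make the over-count concrete, here is a minimal sketch using Python's built-in `sqlite3`. The tables (`orders`, `stripe_payments`, `invoices`), their columns, and the sample rows are all hypothetical, made up for illustration; a real schema will differ.

```python
import sqlite3

# Hypothetical schema and data, invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, product TEXT, amount REAL, placed_at TEXT);
CREATE TABLE stripe_payments (order_id INTEGER, status TEXT);
CREATE TABLE invoices (order_id INTEGER, paid INTEGER);
INSERT INTO orders VALUES
  (1, 'X', 100.0, '2024-05-03'),  -- paid through Stripe
  (2, 'X', 100.0, '2024-05-10'),  -- Stripe payment failed (breakage)
  (3, 'X', 250.0, '2024-05-20'),  -- no Stripe record, paid by invoice
  (4, 'X', 100.0, '2024-05-25');  -- placed, never paid
INSERT INTO stripe_payments VALUES (1, 'succeeded'), (2, 'failed');
INSERT INTO invoices VALUES (3, 1);
""")

month_filter = """
  o.product = 'X'
  AND o.placed_at >= '2024-05-01' AND o.placed_at < '2024-06-01'
"""

# The naive query: every order placed in the month counts as revenue.
naive_revenue = conn.execute(
    f"SELECT SUM(o.amount) FROM orders o WHERE {month_filter}"
).fetchone()[0]

# A query that knows the out-of-band context: money only counts once it
# settled, and it can settle through Stripe or through an invoice.
settled_revenue = conn.execute(f"""
    SELECT SUM(o.amount) FROM orders o
    WHERE {month_filter}
      AND (
        EXISTS (SELECT 1 FROM stripe_payments p
                WHERE p.order_id = o.id AND p.status = 'succeeded')
        OR EXISTS (SELECT 1 FROM invoices i
                   WHERE i.order_id = o.id AND i.paid = 1)
      )
""").fetchone()[0]

print(naive_revenue, settled_revenue)  # 550.0 350.0
```

Nothing in the `orders` table itself hints that the second query is the right one; that knowledge has to come from whoever understands the billing process.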

And that's when the agent even manages to construct a reasonable naive query. I've seen even Opus 4.6 ignore an `is_demo` column in the schema it was given when asked to construct a query for the number of active users.
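
A rough sketch of what that miss looks like, again with `sqlite3`. Only the `is_demo` column name comes from the case above; the `users` table, the `last_seen_at` column, the 30-day definition of "active", and the rows are assumptions for illustration.

```python
import sqlite3

# Hypothetical users table; only the is_demo column name is from the anecdote.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, last_seen_at TEXT, is_demo INTEGER);
INSERT INTO users VALUES
  (1, '2024-05-28', 0),
  (2, '2024-05-30', 0),
  (3, '2024-05-29', 1),  -- demo account, recently "active"
  (4, '2024-05-31', 1),  -- demo account, recently "active"
  (5, '2024-01-02', 0);  -- real user, long inactive
""")

# Assumed definition of "active": seen in the 30 days before 2024-06-01.
active_window = "last_seen_at >= date('2024-06-01', '-30 days')"

# What a query that ignores is_demo returns.
naive_active = conn.execute(
    f"SELECT COUNT(*) FROM users WHERE {active_window}"
).fetchone()[0]

# The same query with demo accounts excluded.
real_active = conn.execute(
    f"SELECT COUNT(*) FROM users WHERE {active_window} AND is_demo = 0"
).fetchone()[0]

print(naive_active, real_active)  # 4 2
```

Both queries run without error and return a plausible number, which is why this kind of mistake is easy to miss in review.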

Where I've seen text-to-SQL work well enough is when you're pointing it at data that's already been well-modeled for analytics, such that the naive query an LLM will construct is correct by default. The data is either structured as a wide table such that no joins are necessary, or all the joins are 1:1 fact <-> dimension joins. All metrics are additive and so can be aggregated without asterisks. Columns follow a consistent naming convention, using the business domain terms a user would use in their prompt to the agent.
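
As an illustration of "correct by default", the sketch below contrasts a 1:many join that silently fans out with a pre-modeled wide table. The table names (`orders`, `shipments`, `fct_orders`), their columns, and the data are hypothetical.

```python
import sqlite3

# Hypothetical source tables plus a hypothetical modeled wide table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);
CREATE TABLE shipments (id INTEGER PRIMARY KEY, order_id INTEGER);
INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
INSERT INTO shipments VALUES (10, 1), (11, 1), (12, 2);  -- order 1 shipped twice

-- Modeled table: exactly one row per order, shipment info pre-aggregated.
CREATE TABLE fct_orders AS
SELECT o.id AS order_id,
       o.amount AS revenue,
       COUNT(s.id) AS shipment_count
FROM orders o LEFT JOIN shipments s ON s.order_id = o.id
GROUP BY o.id, o.amount;
""")

# Naive query on source tables: the 1:many join repeats order 1,
# so its amount is summed twice.
fanned_out = conn.execute("""
    SELECT SUM(o.amount)
    FROM orders o JOIN shipments s ON s.order_id = o.id
""").fetchone()[0]

# Naive query on the wide table: no join needed, the metric is additive.
modeled = conn.execute("SELECT SUM(revenue) FROM fct_orders").fetchone()[0]

print(fanned_out, modeled)  # 250.0 150.0
```

The modeling work moved the join logic out of the query and into the table definition, which is the step the LLM can't be relied on to do for you.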

But that's a much thinner niche than what rawquery is proposing. You can't get around the analytics engineering effort involved in constructing a quality analytics dataset; the LLM will be at best a fuzzy frontend to your data warehouse, coexisting with your BI tool.

Note: I do see value in rawquery's CLI-first approach to accessing data. In the right hands, agents are very helpful at rapidly exploring datasets and validating assumptions on source data, but all the cloud data warehouse products I've interacted with are somewhat fiddly to access locally.

1 comment

This is one of the best comments around.