Hacker News new | ask | show | jobs
by posting_mess 906 days ago
Love how the demo falls pray to what I dont have a term for, "the SQLers assumption"?

It asks ChatGPT to write SQL to get sales data, ChatGPT (or most SQLers) trust that every year-month combo has atleast one entry - which means the graphs its presenting could be wrong. Because if there was no entries for a year-month it it will skip that year-month and make it look like you never had a 0 month.

I've made this mistake before in prod, and without some janky lookup table of every date in existence... you need more code :( Fairly few people actually notice the potentially missing month, but still its a bug n a bad one.

Looks cool regardless though, good luck!

5 comments

Thanks!

You probably refer to one of the demos on our landing page?

I like how you describe the problem. You're absolutely right that SQL seems easy but it's these edge cases that make it hard to get right. Joining metrics with a date spine is definitely a good practice to avoid missing date periods.

I think we could/should teach Dot to do that in the future. It should at least be a feature you can turn on as the data team.

> You probably refer to one of the demos on our landing page?

Indeed, not sure how I ended up there but did on mobile, commented here.

> You're absolutely right that SQL seems easy but it's these edge cases that make it hard to get right

SQL/data analysis is endlessly pesky! I assume it would be easier to spot on tighter increments like "minutely" or "hourly"

> It should at least be a feature you can turn on as the data team.

Some might want the missing points, others wont - sounds like a good option (but id default to "enabled", each to their own though)

I call this "spineless" because you're missing the "vertebrae" of a year-month count.

I find that many places need a spine.

My personal hope is that time_bucket_gapfill(), interpolate(), and others will prove so popular in Timescale that DBMSes adopt them and they become part of the SQL standard. When I have to use a "naked" system without them to do analysis and reporting (read: making dashboards), I wind up creating similar UDFs.
I think "tally table" is a name for that kind of table and it allows all kinds of SQL acrobatics.
>> janky lookup table of every date in existence

Having a date dimension provides an elegant solution in many cases.

If I was analysing TB's of data via SQL, yeah i'd probably agree its better not incur the transfer overhead to perform this check - if it was small org, id say its not great.

Also once you start saying "i want secondly/minutely breakdowns", the dimension (neat term) gets pretty...large (probably less than the TB of data though)

It can. A function that generates date objects between two date objects is also pretty performant for specific uses.
Joining against a generated series is also trivial.