Hacker News new | ask | show | jobs
by richardv 4721 days ago
If you know that you need to be grouping by dates... then it's usually best to store the date fragments in a separate column so you can take advantage of indexes.

The project is pretty cool, but I don't think it's worth adding the dependency lock-in for the functions.

Simply, if you know ahead of time that you want GROUP BY date, then you should create a new column for each interval. `week`, `day`, `hour`.

That way you will have fast queries...

5 comments

You don't need that if you use postgresql:

    create index foo1 on t1 (date(timestamp))
    create index foo2 on t2 (extract(week from timestamp))
Postgres can use almost any row expression as an index.
These are the cool practical reasons to switch from MySQL to Postgres that I so rarely see (usually the reasons are more ad hominem and handwavy).
Really? There are loads of practical arguments: The incredibly rich type system (hello array/schemaless columns), indexes (partial indexes, several different types to choose from), language integration (write index functions in javascript/python/ruby/c), foreign data wrappers (connect pg to mysql/mongodb/etc) to name but a few.
I don't doubt they exist, it's just in the comments on HN, usually the only thing that's touted is that "it's not MySQL" in that it doesn't have the quirks MySQL does (in regards to silently munging data types, etc). Sometimes performance/scalability is also discussed.
A common alternative to this for data warehouses and reporting platforms is to construct a date table in advance, with every date enumerated from, say, 1990 through 2050. Every row represents one day, with day of week, weeks, months, and quarters as columns all set in advance. This also allows for named weeks and months, for example 2013Q3, 2013W30, and 2013M07. I also tend to put in the starting and end dates for the week in each row. It's also useful for left joining against if you need a report to show a 0 for days where you have no data.
For the interest of persons here assembled, this is known as "dimensional modelling". Done properly, it makes querying tremendously simpler and more powerful for end users.
This is the approach recommended by Ralph Kimball for data warehouses. I haven't tried it for OLTP systems. I'd be more inclined to go with ad hoc functional indexes (as recommended by another commenter) based on what queries the system actually runs. But if you have a reporting interface that permits arbitrary group-by functionality, like in a data warehouse, I think it's the way to go.

EDIT: Technically chaz is right and a data warehouse puts all these date columns in a separate table you can join against.

If you just need simple subdivisions of time, like

  group by date(col)
or

  group by date(col), hour(col), floor(minute(col)/5)
an index on col is enough, or at least the last time I checked for my specific queries it was. If you need something like group by hour-of-day you'll need to create an indexed column of course. (this is with mariadb but I think basic mysql handles it too)
Update: Upon testing, it's more subtle than that.

Mysql will still use the column for a covering index and run the query in a single-pass, but it will still generate a temporary table for the results (and do a really slow sort unless you do 'order by null'). It doesn't look like mysql has any way of doing date grouping without temporaries even in the trivial cases. I guess eliminating the sort was enough for my queries.

Using date()-like functions, date_format(), and left() all have identical query plans and roughly comparable performance.

I would suggest denormalizing only after you find a performance bottleneck. As they say, "premature optimization is the root of all evil".