Hacker News new | ask | show | jobs
by datahead 3474 days ago
This technique applies to many RDBMS, not just Oracle (as others have noted). Teradata, PostgreSQL, MS SQL all have 'analytical' functions like this. Analytical functions (over, partition by, etc) are extremely powerful and can help simplify architecture/design for the data science/analytical communities.

One of the persistent issues on my team is the reliance upon a dataframe representation w/ R or python to do this type of aggregation and windowing. Most people will eschew learning the 'advanced' SQL and instead bring data locally to do imperative style munging on it.

This creates a few issues, mainly adding complexity to the analytical stack: - Instead of querying the data and doing ETL/feature engineering in the db- you are moving data around (usually to less powerful machines, such as a laptop) for simple exploration.

- This wastes time and usually results in more dependencies (dplyr for example- no hate Hadley), sometimes even limiting you to single threaded operations. Teradata, for example, is massively parallel and will perform these operations in short order. I've seen Data Scientists wait 6hr for R to do the same thing a SQL query against a prod system returns in 3min.

- Code is not portable. A query can be executed and results retrieved through ODBC, JDBC or native connections. Without these, data engineers are often asked to install R (including libs) on some intermediate machine just to do munging/ETL/feature engineering. If SQL driven, moving from quantitative exploration to operational is quite easy (maybe just a query tune).

All that to say, I'm glad this post is highlighting some of the advanced SQL that I hope more people rely upon. All of these ideas are better articulated in MAD Skills [0]

[0] http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf

2 comments

One minor nit - normally a prod db is subject to availability restrictions that limit your ability to run 3 minute-long queries on them that suck up a bunch of resources. Nightly mirror to a staging/analysis db of sorts is a bit better practice.

Window functions are extraordinarily powerful, as are Common Table Expressions (CTE's). I encourage anyone who uses SQL with any regularity to learn them immediately once they are comfortable with the more straightforward queries and clauses SQL offers. Once you've mastered windowing and CTE's, you'll wonder how you functioned without them.

The admins at my work have always told us (i.e. people wanting to run data analysis) to avoid executing complex queries on production servers.

The workflow I use is to run simple select queries on prod databases. To bring the subset of data I'm interested in down locally Once I have the "extracts" I'm interested in I'll transform the data locally using SAS.

SAS has pretty powerful time interval routines which is what I'd typically use to compute something like the example in the article.

My work has just started looking at "big data" we have a new hadoop databases which is supposed to be used for that so I assume if I ever need to start looking into data stored there it will make more sense to run the SQL "on the database". So the article has some useful info there.

One thing to note here might be that if network is the bottleneck you'd likely be asked the opposite. In my experience if people pay attention to indexes/partitioning they can win dba's hearts, but dbas often don't explain these things because most people don't listen. So to them its easier to say "don't run complex queries".
Part of the problem with people paying attention to indexes/partitioning is that there's no very easy way to display that to the end user. A person knowledgeable in SQL can query the right tables to determine this but at that point you're pretty much thinking like a DBA in the first place. If major vendors had a nice visualization of such a thing built into the interaction tools like SSMS/SQL Developer/PGAdmin3 etc. then it might be easier (these might exist, I'm just not aware of them).

So yeah, the end result is DBA's give the generic "don't run SELECT * and use a WHERE clause when you're querying large tables"

Hopefully that is a dump as well. At least with Hadoop you have a little more control over resources with the right configuration. When we run Hive on our main HDFS cluster without limiting mappers and reducers it will happily bog down the entire cluster to give you what you need - but limiting the mappers, while slowing your throughput - eliminates the risk of resource hog for noncritical queries. Even then, copy of the data with dedicated resources is still preferred because without a lot of mappers and reducers your Hive queries are going to slow to a crawl.
Solar_Fields and Bigger_Cheese, my apologies for not specifying the db. In this case I'm talking about a production level data warehouse specifically designed to perform complex queries and not a prod app db. Yes, I agree- best practice is to separate the two. Glad you found the article interesting.
I'm right there with you about R and imperative transformation and munging. Get your data scientists using sqldf...it is by far the best solution out there for this problem for R users.
I agree with using sqldf, it is a wonderful tool in R. I am sure you are aware already, but for those who are not, sqldf does not support window functions with it being based on sqlite. Just wanted to point this out as the original article emphasizes them.
Actually the sqldf R package supports 4 backend databases (sqlite, h2, postgresql, mysql), not just sqlite, and postgresql does support window functions.