Hacker News new | ask | show | jobs
by nieksand 4859 days ago
I think part of the issue why so many people have gone with Hive is that good, production-ready column stores are expensive. Redshift is posed to change that. If you're shopping in this space, Infobright is also worth checking out.

And even for moderate data sizes (10+ GB per table), row store DBs tend to become painful. This is especially true when you need to support ad-hoc reporting queries, since the usual technique of matching your schema, indexes, and queries won't be effective any more. With true ad-hoc reporting, your only hope becomes lots of shallow indices rather than ones tuned to a particular query.

2 comments

Dimensional modeling (I'm a fan of Kimball's approach) mitigates these problems quite well while still offering very flexible ad-hoc reporting. Works great on a row-based RDBMS, even better on columnar.

http://en.wikipedia.org/wiki/Dimensional_modeling http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimens...

Redshift is indeed a solid product but all these comparisons against Hive are surprising, as that's not the right tool in the first place. Infobright, greenplum, aster, vertica, etc are the products which Redshift seeks to disrupt.

I realised a few years ago that pretty much every database course taught only teaches OLTP. OLAP never really gets a lookin.

At my university, standard normalisation was taught in the "databases" course. OLAP was mentioned as part of the "advanced databases" course.

The database course at that time blew about half its time on building PHP applications to talk to the database. I hate to second guess my professors, but I can't help but feel that a more productive use of the time would have been to teach normalised OLTP in the first half, and dimensionally modelled OLAP in the second half. Better yet, to divide them into two courses and spend some time talking about database history ("here's why network and hierarchical databases sucked") and maybe some introduction to how query planners work.

Do you know of any resource that talks about the same topic as your comment, but in more detail?
Philip Greenspan's SQL tutorial is a nice starting point: http://philip.greenspun.com/sql/

It's a bit old, but still pretty good.

I'm not sure I follow you.
To be more stereotypical: I am intrigued and would like to sign up to your newsletter.
Well ... really, just read a good pair of textbooks on each side of the spectrum. Date's Databases and Kimball's The Data Warehouse Toolkit are good.

Edit: actually, maybe not Date. It's up to you. It's good, but it's controversial because he's not a fan of SQL and so he uses his own language.

The one I used in uni was Ramakrishnan & Gehrke's Database Management. It was OK but there's a certain amount of at-the-time trendy bullshit that to me detracts from a focus on relational databases for their own sake.

Edit 2: and Joe Celko's SQL for Smarties contains good oil on the relational paradigm.

We use Infobright at SnowPlow (https://github.com/snowplow/snowplow), and are currently working on our Redshift integration.

One thing to be aware about with both is the lack of any support for wide tables - Infobright inherits MySQL's limit of 65,535 bytes per row (and UTF8 means 3 bytes per char); with Redshift you can stored wider rows but you can't query them (http://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE...). Obviously not a deal breaker, but it locks you firmly in the densely populated rows, relational mindset.

This aside, we're super-excited about Redshift!

Hi, I have started your project previously and while I haven't had a chance to test it out, I must say the idea of using cloud-front as a collector is a superb brilliant idea to scale an analytics platform and would be both very scalable, reliable and economical.

Btw, do you have more experience to share? e.g. with infobright, how many events can be processed per second? what would be the "ETL latency"? can infobright handle 10TB of data easily, any caveat besides the row limit? Thanks.

Hi tszming - happy to share more over email. alex@snowplowanalytics.com