by mattj 4381 days ago
I've gone through a similar transition (hive to redshift) in a very large scale data environment. Raw Hadoop / cascading is still very useful for more complicated workflows, but redshift is so vastly superior to hive it's not even funny. I thought I would miss adding my own UDFs, but this hasn't been an issue at all. I'm under the impression presto is a similar improvement, but I haven't spent any time with it.

One huge advantage of redshift over hive: you can connect with plain old Postgres libraries, so you can build redshift results into your admin interfaces, one-off scripts, and anywhere else you're fine trading a few seconds of latency for extra data.
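
To illustrate the point about plain Postgres libraries, here is a minimal Python sketch (not the commenter's actual code). It assumes a DB-API 2.0 driver such as psycopg2; the helper name `fetch_rows`, and the endpoint, database, and table names in the comment, are hypothetical.

```python
def fetch_rows(connect, dsn, sql, params=None):
    """Run one query and return every row as a list of tuples.

    `connect` is any DB-API 2.0 connect function (for example
    psycopg2.connect), so the same helper works against Redshift
    or a regular Postgres server.
    """
    conn = connect(dsn)
    try:
        cur = conn.cursor()
        # Only pass parameters when there are some, so the driver does
        # not try to interpolate placeholders into a plain query.
        if params is None:
            cur.execute(sql)
        else:
            cur.execute(sql, params)
        return cur.fetchall()
    finally:
        # Always release the connection, even if the query fails.
        conn.close()


# Hypothetical usage against a Redshift cluster (5439 is Redshift's
# default port; the endpoint, database, and table are placeholders):
#
#   import psycopg2
#   rows = fetch_rows(
#       psycopg2.connect,
#       "host=<cluster-endpoint> port=5439 dbname=analytics user=readonly",
#       "SELECT count(*) FROM events WHERE day = %s",
#       ("2014-04-01",),
#   )
```

Passing the connect function in as an argument keeps the helper independent of any one driver, which also makes it easy to exercise in an admin tool's tests without a live cluster.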

3 comments

Just as a quick note: You can use Postgres libraries because Redshift is a slightly modified version of Postgres 8.1 under the covers. In fact, almost all massively-parallel-processing (MPP) databases are Postgres under the covers (including Microsoft's PDW). It really speaks to how impressive Postgres is at scaling. Even old releases, like 8.1!
I work with the folks who built PDW. It's all SQL Server now. That said, I'm often amazed at how many commercial db products are based on Postgres and other open source dbs. Postgres has a nice page on their site showing all the products derived from it - for example, Netezza or Pivotal's Greenplum. Relational dbs and SQL (especially SQL) are far from dead.

Link: https://wiki.postgresql.org/wiki/PostgreSQL_derived_database...

Anyone who likes Postgres and is looking for a good analytics DB should check out the cstore_fdw Postgres extension. [1] It allows Postgres to create and query files in the Optimized Row Columnar (ORC) format [2] from Hive.

It was created and recently open-sourced by Citus Data (YC S11), who've made it a key component of their MPP Postgres offering.

I don't work there. I'm just a fan.

[1] https://github.com/citusdata/cstore_fdw
[2] http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds...

There may have been Postgres inside when Microsoft purchased the company whose technology ended up in PDW, but the last two versions shipped by Microsoft have definitely been based on SQL Server.
Yup! My experience with redshift has actually made me curious to try out Postgres (I've always used MySQL before this). The stricter SQL dialect was a little odd at first, but I think I've become more comfortable with it over a few months.
Do it. I switched to Postgres for pretty much everything last year and I absolutely adore it. It just works: no stupid edge cases, excellent documentation, and nice tooling (pgadmin3 is better than commercial products I've seen costing hundreds if not thousands).

I'm still barely scratching the surface of what it's capable of (mostly because I'm in ORM land most of the time).

Postgres is an amazing database, and has some great features. Sadly, you don't get the full power of it in Redshift, but man, are some of the datatypes and functions just so useful, especially in a warehousing environment!
I'm not surprised, given that my experiences with Hive are that it's extremely quirky and hardly ever the fastest way to do anything. Given the fact that people seem to be falling over themselves to reinvent better solutions to the same sorts of problems in the Hadoop space (see: Impala, Shark), I don't think I'm alone on that.
A teammate of mine wrote a post about our redshift setup a few months ago with some more details: http://engineering.pinterest.com/post/75186894499/powering-i...