With Amazon Redshift SSD, querying a TB of data took less than 10 seconds (flydata.com)
73 points by fujibee 4521 days ago
5 comments

These numbers are not that surprising for an OLAP cluster. Even though Redshift is really architected to run on spinning disks, SSDs will almost always improve the performance.

On the other hand, the load performance is quite poor. On the 12x dw2.large hardware, a good clustered analytical database engine should be able to easily load 1.2TB in less than 15 minutes while the database tables are online and being queried. That it took well over an hour, and with a very simple data model at that, would argue against it being good for "real-time" even with SSDs. (This is not a surprising result though; Redshift is just a clustered PostgreSQL variant, which does not have the best internals for real-time.)
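To put those load numbers in perspective, here is a rough back-of-the-envelope sketch in Python. The 1.2TB, 12 nodes, and 15 minutes come from the comment above. The 60-minute figure is an assumption standing in for "well over an hour" (so it flatters the observed load), TB is taken as 10^12 bytes, and the function name is made up for illustration.

```python
def load_throughput_mb_s(data_tb, minutes, nodes):
    """Return (aggregate, per-node) load throughput in MB/s.

    Assumes decimal units: 1 TB = 1e12 bytes, 1 MB = 1e6 bytes.
    """
    total_bytes = data_tb * 1e12
    seconds = minutes * 60
    aggregate = total_bytes / seconds / 1e6
    return aggregate, aggregate / nodes


# Target claimed in the comment: 1.2 TB in 15 minutes on 12 nodes.
target_total, target_per_node = load_throughput_mb_s(1.2, 15, 12)

# Observed: "well over an hour"; 60 minutes is an assumed lower bound.
observed_total, observed_per_node = load_throughput_mb_s(1.2, 60, 12)

print(f"target:   {target_total:.0f} MB/s total, {target_per_node:.0f} MB/s per node")
print(f"observed: {observed_total:.0f} MB/s total, {observed_per_node:.0f} MB/s per node (at best)")
```

Under those assumptions the target works out to roughly 111 MB/s per node, and the observed load to at most about 28 MB/s per node.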

It's not a Postgres variant at all. Postgres is emulated as an interface to the columnar ParAccel database underneath. ParAccel does neat things (compiles your SQL into a program that it runs to answer the question, for instance) and really rips if you can order your data on good keys up front (and then use those keys in your query, of course).

Source: I helped build a very high-speed network data analysis tool on top of ParAccel (before Amazon licensed it and rolled it into Redshift).

That is not an emulation.

ParAccel, like a large percentage of parallel analytical databases, is forked off the excellent PostgreSQL code base because those internals were designed to be easy to extend and modify. Netezza, Vertica, EMC/Greenplum, Teradata/Aster, et al. are all PostgreSQL derivatives as well, with varying degrees of divergence. I've designed and built custom parallel derivatives of PostgreSQL for companies too; it is surprisingly straightforward.

There are only a handful of original, high-quality database kernels out there because it is enormously difficult to design one from scratch. Most good databases copy an existing design, or even more conveniently, fork the mature, easily modifiable, BSD-licensed, Stonebraker-designed PostgreSQL kernel. Every basic kernel design has distinctive characteristics that tend to stick with everything derived from them, which leaves an identifiable "fingerprint" on a new database if you know what to look for. You inherit both the strengths and weaknesses of the underlying kernel design.

(Source: I've designed analytical database engines for a long time.)

But calling it a Postgres variant suggests they have way more in common than they really do. The trade-offs that vertical databases make are kind of alien to someone who is used to Postgres.

Really cool work though!

Interesting. What are PostgreSQL's "fingerprints"? What makes it possible to tell a PostgreSQL derivative from one based on another engine?
An SSD saved my life when I had to query a 300 GB MySQL table that couldn't fit in my RAM. Since the data was organized by the primary key (which was random in the SELECT queries), both reads and writes had to come from random places and the whole process became IOPS-bound (an ordinary HDD can only reach around 75-150 different disk areas per second). So while a normal HDD can achieve good sequential read speed, it SUCKS when it comes to reading data spread randomly.

I was amazed at how much improvement I saw just by getting an SSD, and at how cheap it was compared to all the other solutions.
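A rough sketch of why an IOPS-bound workload hurts so much on spinning disks. The 75-150 IOPS range for an HDD is from the comment above. The one million random lookups and the 20,000 IOPS figure for an SSD are assumptions picked only for illustration, not measurements from this thread, and the function name is hypothetical.

```python
def seconds_for_random_reads(n_reads, iops):
    """Time to serve n_reads random reads when the disk is the bottleneck.

    Simplified model: one I/O per read, no caching, no queue-depth effects.
    """
    return n_reads / iops


N_READS = 1_000_000   # assumed number of random primary-key lookups
HDD_IOPS = (75, 150)  # range quoted in the comment
SSD_IOPS = 20_000     # assumed ballpark, not from the thread

hdd_worst = seconds_for_random_reads(N_READS, HDD_IOPS[0])
hdd_best = seconds_for_random_reads(N_READS, HDD_IOPS[1])
ssd = seconds_for_random_reads(N_READS, SSD_IOPS)

print(f"HDD: {hdd_best / 3600:.1f} to {hdd_worst / 3600:.1f} hours")
print(f"SSD: {ssd:.0f} seconds")
```

With these assumed numbers the HDD needs hours for a job the SSD finishes in under a minute, even though the HDD's sequential throughput looks respectable on paper.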

It's not cheap. Base price is $0.25 per hour:

http://www.wolframalpha.com/input/?i=%240.25+per+hour+for+a+...

$183 a month.
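The arithmetic behind that monthly figure, as a quick Python sketch. It assumes the cluster runs around the clock and that a month is one twelfth of a 365-day year; the function name is made up for illustration.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365
MONTHS_PER_YEAR = 12


def monthly_cost(hourly_rate):
    """Average monthly cost of a resource billed per hour and never stopped."""
    return hourly_rate * HOURS_PER_DAY * DAYS_PER_YEAR / MONTHS_PER_YEAR


cost = monthly_cost(0.25)  # base price quoted in the comment
print(f"${cost:.2f} per month")
```

That comes to $182.50, which the comment rounds up to $183.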

For the target audience of this, $183/month is a pittance. From their product site:

"Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence tools."

That to me screams "enterprise" and "big data" and all sorts of other silly buzzwords. Your average startup is probably not going to need this, but their target audience may view that $183/month base price tag favorably.

That's still a bargain compared to running your own Vertica or Greenplum cluster.
That's just the base price if you do nothing. The costs increase when you actually store and query data.
No, that's not the case. You pay for the cluster by the hour.
On the other hand, you can process a lot more data in an hour, so it's fair to charge more.
That is a good point. There's definitely more value provided with SSD.
Is it possible to generate the dataset that you used? I would like to run the benchmark myself, and downloading a 1 TB file from Amazon is unfortunately not an option.
whoa! sign me up! I wanna develop something with this speed
Are you by chance, a complete moron? Wait a minute...