|
These types of articles baffle me: you're comparing a high-performance analytical database to a batch-oriented SQL engine. The whole point behind these query engines on Hadoop (Hive, Presto, Impala, etc.) is to separate the database from the query engine. With these engines you can project schemas over raw data in its original form, without having to load it into a table. With Redshift, or other similar analytical databases, you're forced to define a schema and then load the data in row by row. Bulk inserts are very slow in comparison to Hadoop technologies. Regardless, Hive in general should never be used for interactive analytics; that's not what it's intended for. Where Hive shines is when you can dump 250TB of raw text data into a folder and then run a SQL query to extract useful information out of it. The extracted data could then be loaded into an RDBMS like Redshift for real-time reporting. With all that being said, if you want to run SQL queries on data in Hadoop at the speeds of Redshift, you should have used Impala with Parquet, which can be even faster than Redshift in many cases and is inspired by the same technology Google uses (Dremel and F1). The benefits of keeping your data in Hadoop are enormous, because not every problem can be solved using SQL. The same data you're querying with Impala could also be used to do machine learning using Spark or Mahout. Maybe you want to start indexing one of your tables into Solr to provide search capabilities on a subset of your columns to your users, or maybe you want to use Giraph or Spark's GraphX to do parallel graph computation. The data never moves: there's still only ONE copy of that data in Hadoop, and you can bring any kind of workload to it. |
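
To make the "project a schema over raw data" idea concrete, here is a rough sketch in plain Python. It is not Hive or Impala code; the sample log lines, the `SCHEMA` list, and the `project` and `count_by` helpers are all made up for illustration. The point is only that the raw text stays untouched and the schema is applied at read time.

```python
import csv
import io

# Hypothetical raw data, standing in for untouched text files sitting in a folder.
RAW_LINES = (
    "2014-06-01\talice\t/home\t200\n"
    "2014-06-01\tbob\t/search\t404\n"
    "2014-06-02\talice\t/search\t200\n"
)

# The "schema" is only metadata applied while reading; the raw text is never rewritten.
SCHEMA = [("day", str), ("user", str), ("path", str), ("status", int)]


def project(raw_text, schema):
    """Yield one dict per raw line, with values cast according to the schema."""
    reader = csv.reader(io.StringIO(raw_text), delimiter="\t")
    for fields in reader:
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


def count_by(rows, key, predicate=lambda row: True):
    """Count rows per distinct value of `key`, keeping only rows that pass `predicate`."""
    counts = {}
    for row in rows:
        if predicate(row):
            counts[row[key]] = counts.get(row[key], 0) + 1
    return counts


# Roughly: SELECT path, COUNT(*) FROM logs WHERE status = 200 GROUP BY path
ok_hits = count_by(project(RAW_LINES, SCHEMA), "path", lambda r: r["status"] == 200)
print(ok_hits)
```

A different consumer could read the same `RAW_LINES` with a different schema, or none at all, without any reload step, which is the "one copy, many workloads" argument in miniature.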
Generally we're dealing with datasets that are around 1-3TB and pretty well organized. It's just amazing how forgiving Redshift is when it comes to unusually written SQL, and how useful it is to us as a business. Extracting serious insights was once a job that only a few people could do; now it's something that anyone with a SQL background can manage.