

by rarestblog 4914 days ago

* I noted in the post that "SELECT 1" took 100ms round-trip, so I subtracted that from all data points where it was reasonable. (There is no point subtracting it from 8 seconds.)
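A rough sketch of that adjustment, in case it helps anyone reproduce the numbers. The function name, the `min_ratio` cutoff, and the sample timings are made up for illustration; the only figure taken from the post is the 100ms baseline.

```python
def subtract_baseline(timings_ms, baseline_ms=100.0, min_ratio=10.0):
    """Remove fixed round-trip latency from query timings.

    timings_ms  -- dict of query label -> measured time in milliseconds
    baseline_ms -- round-trip cost of a trivial query such as "SELECT 1"
    min_ratio   -- hypothetical cutoff: timings at least this many times
                   larger than the baseline are left untouched, because
                   the correction would be noise at that scale
    """
    adjusted = {}
    for label, measured in timings_ms.items():
        if measured >= baseline_ms * min_ratio:
            # Baseline is negligible here (e.g. 100ms out of 8 seconds).
            adjusted[label] = measured
        else:
            # Never report a negative time if a run beat the baseline.
            adjusted[label] = max(measured - baseline_ms, 0.0)
    return adjusted


if __name__ == "__main__":
    # Illustrative timings only, not the measurements from the post.
    sample = {"small query": 350.0, "big query": 8000.0}
    print(subtract_baseline(sample))
```

The cutoff is a judgment call; any rule that leaves multi-second timings alone would match what I did.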

* Not sure about the index support. Didn't try.

My idea was quite simple. I have some data at work (databases up to 30GB), and sometimes we hope to find something better. The main question was: will RedShift help? Will it be radically faster? Will it be radically easier?

The answer for me: no, it won't help in my case. We need that 30GB of data in real time, and it looks like RedShift is better suited to cases where you have 1TB+ of data. But yes, it is radically easier.

2 comments

lcampbell 4914 days ago

Thanks for the reply. I figured that for a dataset of that size, the main bottleneck might be the lack of indexing -- maybe RedShift stores row data in a higher-latency medium while keeping indexes in-memory. Curious, I checked the documentation[1] and found this:

> Amazon Redshift doesn’t require indexes or materialized views and so uses less space than traditional relational database systems.

Reading through the rest of their FAQ, it sounds like they echo your conclusion -- RedShift shines the most for use-cases where the dataset is large enough that, to use PostgreSQL, you'd have to shard out multiple instances.

--

[1] http://aws.amazon.com/redshift/faqs/#0030

nieksand 4914 days ago

I don't see your Python code defining a distribution key or sort key for Redshift, which is an important design consideration. (For my own use case of log analysis, I sort on datetime and use an "even" distribution.) It also doesn't look like you ran "vacuum" or "analyze" after doing the loads to Redshift, so the query optimizer has no statistics to drive its decisions.
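To make that concrete, here is a minimal sketch of the statements I mean, built as plain strings in Python so they can be fed to whatever driver you use. The table name `logs` and the columns `event_time` and `message` are hypothetical, not taken from your schema; `DISTSTYLE EVEN`, `SORTKEY`, `VACUUM` and `ANALYZE` are the Redshift keywords.

```python
def redshift_setup_statements(table="logs", sort_column="event_time"):
    """Build the SQL to run before and after a bulk load.

    The table and column names are placeholders for illustration.
    Returns a dict with two lists of SQL strings:
      "before_load" -- table definition with distribution style and sort key
      "after_load"  -- maintenance commands to run once the data is loaded
    """
    create_table = (
        f"CREATE TABLE {table} (\n"
        f"    {sort_column} TIMESTAMP NOT NULL,\n"
        f"    message VARCHAR(1024)\n"
        f")\n"
        f"DISTSTYLE EVEN\n"            # spread rows evenly across slices
        f"SORTKEY ({sort_column});"    # keep rows ordered by time on disk
    )
    return {
        "before_load": [create_table],
        "after_load": [
            f"VACUUM {table};",   # re-sort rows and reclaim space
            f"ANALYZE {table};",  # refresh statistics for the planner
        ],
    }


if __name__ == "__main__":
    statements = redshift_setup_statements()
    for phase in ("before_load", "after_load"):
        for sql in statements[phase]:
            print(sql)
```

One thing to watch for: VACUUM can't run inside a transaction block, so with a driver that opens transactions implicitly you would need autocommit for that step.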

And as others have pointed out, your 30 GB data set is pretty tiny. You could look at some of the in-memory DB options out there if you need to speed things up.