HN Mirror

by panarky 4859 days ago

The article mentions this briefly, but it should be emphasized: parallel loading from S3 is MUCH faster.

This weekend I loaded 2 billion rows from S3 both ways:

- From a single gzipped object: 4 hours 42 minutes

- From 2000 gzipped slices of 1M rows each: 17 minutes
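
To put those two timings side by side, here is a quick back-of-the-envelope calculation using only the figures quoted above (the variable names are just for illustration). It works out to a speedup of roughly 16.6x.

```python
# Figures quoted in the comment above.
ROWS = 2_000_000_000
single_object_s = (4 * 60 + 42) * 60  # 4 h 42 min, one gzipped object
sliced_s = 17 * 60                    # 17 min, 2000 gzipped slices

speedup = single_object_s / sliced_s
single_rate = ROWS / single_object_s  # rows per second, single object
sliced_rate = ROWS / sliced_s         # rows per second, 2000 slices

print(f"speedup: {speedup:.1f}x")
print(f"single object: {single_rate:,.0f} rows/s")
print(f"2000 slices:   {sliced_rate:,.0f} rows/s")
```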

(Loading from gzipped files is considerably faster than loading uncompressed ones, and it also saves on S3 charges.)
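
For anyone wanting to try the sliced approach, this is a minimal sketch of how the slices could be produced, assuming the source data is line-oriented text. The function name, the `slice-NNNNN.gz` naming scheme, and the demo rows are all hypothetical, not what was actually used for the numbers above. Uploading the slices to S3 and running the load are not shown.

```python
import gzip
import itertools
import tempfile
from pathlib import Path


def split_into_gzipped_slices(lines, out_dir, rows_per_slice=1_000_000, prefix="slice"):
    """Stream an iterable of newline-terminated rows into numbered gzip files.

    Each file holds at most rows_per_slice rows. Returns the paths written.
    """
    if rows_per_slice < 1:
        raise ValueError("rows_per_slice must be at least 1")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    rows = iter(lines)
    paths = []
    for index in itertools.count():
        # Peek one row so no empty trailing file is created.
        first = next(rows, None)
        if first is None:
            break
        path = out_dir / f"{prefix}-{index:05d}.gz"
        with gzip.open(path, "wt", encoding="utf-8", newline="") as out:
            out.write(first)
            out.writelines(itertools.islice(rows, rows_per_slice - 1))
        paths.append(path)
    return paths


# Tiny demonstration: 25 rows at 10 rows per slice gives three files.
demo_rows = [f"{i},value-{i}\n" for i in range(25)]
demo_dir = tempfile.mkdtemp()
demo_paths = split_into_gzipped_slices(demo_rows, demo_dir, rows_per_slice=10)
print([p.name for p in demo_paths])
```

Rows are streamed rather than buffered, so memory use stays flat even at 1M rows per slice.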

The article notes that choice of distribution key is critical. I'd add that choice of sort key is equally important. In my testing, a better sort key improved compression from 1.5:1 to 4:1, and also made common queries 5x faster.

Unfortunately, you only get one dist key and one sort key per table, so less common queries could get slower.
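
As a rough illustration of where those two choices live, here is a small sketch that builds a table definition with one dist key and one sort key. It assumes Redshift-style `DISTKEY(...)` / `SORTKEY(...)` table attributes (the thread does not name the warehouse outright), and the table and column names are made up.

```python
def create_table_sql(table, columns, dist_key, sort_key):
    """Build a CREATE TABLE statement with one dist key and one sort key.

    columns is a list of (name, sql_type) pairs. Redshift-style syntax is assumed.
    """
    names = [name for name, _ in columns]
    for key in (dist_key, sort_key):
        if key not in names:
            raise ValueError(f"{key!r} is not a column of {table}")
    column_lines = ",\n".join(f"    {name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE TABLE {table} (\n{column_lines}\n)\n"
        f"DISTKEY({dist_key})\n"
        f"SORTKEY({sort_key});"
    )


# Hypothetical table: distribute by user, sort by time.
ddl = create_table_sql(
    "events",
    [("user_id", "BIGINT"), ("event_time", "TIMESTAMP"), ("url", "VARCHAR(2048)")],
    dist_key="user_id",
    sort_key="event_time",
)
print(ddl)
```

Queries that filter on `event_time` would benefit from this layout, while ones that filter only on `url` would not, which is the trade-off described above.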

1 comment

fujibee 4859 days ago

Also, the more instances you launch in a cluster, the faster the load. Our survey: http://www.slideshare.net/Hapyrus/scalability-of-amazon-reds... We also tried loading many more, smaller files (5MB each), but that took longer in total. We're still trying to find the right file size and number of files.
