by panarky
4859 days ago
The article mentions this briefly, but it should be emphasized: parallel loading from S3 is MUCH faster. This weekend I loaded 2 billion rows from S3 both ways:

- From a single gzipped object: 4 hours 42 minutes
- From 2000 gzipped slices of 1M rows each: 17 minutes

(Loading from gzipped files is considerably faster, in addition to saving S3 charges.)

The article notes that choice of distribution key is critical. I'd add that choice of sort key is equally important. In my testing, a better sort key improved compression from 1.5:1 to 4:1, and also made common queries 5x faster. Unfortunately, you only get one dist key and one sort key per table, so less common queries could get slower.
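
For anyone who wants to try the sliced load described above, here is a rough sketch of how the slices could be produced locally before uploading. It is not the script I used. The function name `write_gzipped_slices`, the `part_NNNNN.gz` naming scheme, and the default of 1M rows per slice are all assumptions made for illustration, and it assumes one row per line of text.

```python
import gzip
import itertools
from pathlib import Path
from typing import Iterable, List


def write_gzipped_slices(
    lines: Iterable[str],
    out_dir: str,
    rows_per_slice: int = 1_000_000,  # assumed slice size, matching the 1M-row slices above
    prefix: str = "part",             # hypothetical naming scheme
) -> List[Path]:
    """Split an iterable of newline-terminated rows into gzipped slice files."""
    if rows_per_slice < 1:
        raise ValueError("rows_per_slice must be at least 1")

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    it = iter(lines)
    paths: List[Path] = []
    for index in itertools.count():
        # Peek at one row so no empty trailing slice is created.
        first = next(it, None)
        if first is None:
            break
        path = out / f"{prefix}_{index:05d}.gz"
        # newline="" keeps the rows' own line endings untouched.
        with gzip.open(path, "wt", encoding="utf-8", newline="") as fh:
            # Stream the rows so a whole slice is never held in memory.
            for row in itertools.chain([first], itertools.islice(it, rows_per_slice - 1)):
                fh.write(row)
        paths.append(path)
    return paths
```

Passing an open file object as `lines` streams the input, so a very large source file does not need to fit in memory. Giving every slice the same prefix means the whole set can sit under one S3 key prefix, which is the usual way to point a parallel load at many objects at once.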