by aanfhn
3287 days ago
Loading a single, large file into Redshift? I get the impression the author has only a passing knowledge of Redshift. I'm not too familiar with BigQuery, but for Redshift, the recommended approach is to split the data into multiple files, with the number of files a multiple of the number of slices in the cluster, so every slice takes part in the load. No wonder it took 9+ hours to load that file.
And the author also doesn't mention distributing the data on a particular column. I wonder if he used random distribution. And his problem with the field delimiter really shows a lack of experience; of course you can't use a multi-character field delimiter. I've never seen anyone use comma+space as a delimiter. That file could probably be about 75% of the original size if he re-created it with just the comma as the field delimiter. Sorry to bash on the author - I don't mean to sound harsh, but a lot of people are trying to do benchmarks with minimal context and whatnot.
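For anyone who wants to try this, here is a rough sketch of the preprocessing I mean, not the author's actual pipeline. It assumes the file has no header row and no quoted fields containing ", ", and the function name, the part_NNNN.csv naming and the slice count are all made up for illustration; use the real slice count for your cluster.

```python
import os


def split_for_copy(src_path, out_dir, num_slices, files_per_slice=1):
    """Split src_path into num_slices * files_per_slice part files.

    Hypothetical helper: rows are dealt out round-robin so the parts end up
    roughly the same size, and the ", " delimiter is rewritten to ",".
    Assumes no header row and no quoted fields that contain ", ".
    """
    num_parts = num_slices * files_per_slice
    os.makedirs(out_dir, exist_ok=True)
    part_paths = [
        os.path.join(out_dir, f"part_{i:04d}.csv") for i in range(num_parts)
    ]
    # newline="" keeps the original line endings untouched.
    outputs = [open(path, "w", newline="") for path in part_paths]
    try:
        with open(src_path, newline="") as src:
            for line_no, line in enumerate(src):
                # Naive delimiter rewrite; only safe under the assumptions above.
                outputs[line_no % num_parts].write(line.replace(", ", ","))
    finally:
        for out in outputs:
            out.close()
    return part_paths
```

Upload the parts under one common S3 key prefix and point a single COPY at that prefix so the files load in parallel. Compressing each part before upload should help as well.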
|
It is indeed an unconventional format, but that's how the source CSV was formatted on https://sdm.lbl.gov/fastbit/data/samples.html