

by openasocket 3287 days ago

The Redshift loading result seems suspect to me. I know firsthand that Redshift can load over a trillion records per hour (with a big enough cluster). Even with a basic setup this should be at least an order of magnitude faster. I'm not sure exactly what the problem is, but maybe try breaking the file up into smaller batches and loading those.
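For what it's worth, here is a minimal sketch of the "smaller batches" idea in Python. Redshift's COPY can load multiple files in parallel across slices, so splitting one big file into several parts is a common first thing to try. The function name `split_lines`, the `part-NNNN.csv` naming, and the round-robin strategy are my own choices for illustration, and it assumes a header-less CSV with no newlines embedded inside quoted fields.

```python
import os


def split_lines(src_path, out_dir, num_parts):
    """Split src_path into num_parts files, dealing lines out round-robin.

    Assumes one record per line and no header row (both are assumptions
    about the input, not something the original benchmark states).
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = [os.path.join(out_dir, f"part-{i:04d}.csv") for i in range(num_parts)]
    outs = [open(p, "w", newline="") for p in paths]
    try:
        with open(src_path, newline="") as src:
            for i, line in enumerate(src):
                # Line i goes to part (i mod num_parts), so parts stay balanced.
                outs[i % num_parts].write(line)
    finally:
        for f in outs:
            f.close()
    return paths
```

You would then upload the parts under one S3 prefix and point a single COPY at that prefix. Picking a part count that is a multiple of the number of slices in the cluster is the usual guidance, but I'd treat that as a starting point to measure against rather than a guarantee.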

It would also be helpful to see what schema you used for Redshift, specifically the column encodings and the distribution and sort key(s).
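To make it concrete what I mean by encoding, distribution key, and sort key, here is a rough sketch that builds a Redshift CREATE TABLE statement. The table and column names (`events`, `event_time`, `user_id`, `event_type`) and the chosen encodings are purely hypothetical, since I don't know the schema used in the benchmark, and `create_table_sql` is just a helper I made up for the example.

```python
def create_table_sql(table, columns, dist_key, sort_keys):
    """Build a Redshift CREATE TABLE statement.

    columns is a list of (name, sql_type, encoding) tuples. Identifiers are
    inserted as-is (no quoting or validation), so this is only a sketch.
    """
    col_lines = ",\n".join(
        f"    {name} {sql_type} ENCODE {encoding}"
        for name, sql_type, encoding in columns
    )
    return (
        f"CREATE TABLE {table} (\n{col_lines}\n)\n"
        "DISTSTYLE KEY\n"
        f"DISTKEY ({dist_key})\n"
        f"SORTKEY ({', '.join(sort_keys)});"
    )


# Hypothetical schema, for illustration only.
ddl = create_table_sql(
    "events",
    [
        ("event_time", "TIMESTAMP", "az64"),
        ("user_id", "BIGINT", "az64"),
        ("event_type", "VARCHAR(64)", "zstd"),
    ],
    dist_key="user_id",
    sort_keys=["event_time"],
)
print(ddl)
```

Whether those particular keys and encodings are a good fit depends entirely on the data and the queries, which is why seeing the actual schema would help.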

To give Athena more of a fighting chance, other people have mentioned Parquet or ORC, but also remember to partition the data. Generally you're supposed to point Athena at a directory whose data is partitioned into subfolders based on field values. For example, if you're dealing with time-series data, you can lay the files out as "year=<xxxx>/month=<yy>/day=<zz>/<uuid>.csv". I'm guessing you'd partition this data by eventTime, but it kind of depends. Of course, you then need some other component to write the data to S3 in that layout, and you should probably count that as part of the loading time.
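As a sketch of that layout, here is how the S3 object key for a record could be derived from its eventTime. I'm assuming eventTime is an ISO 8601 string without a trailing "Z" (I don't know the real format in this dataset), and `partition_key` and the `events` prefix are names I picked for the example.

```python
import uuid
from datetime import datetime


def partition_key(event_time, prefix="events", ext="csv"):
    """Return a key like prefix/year=YYYY/month=MM/day=DD/<uuid>.ext.

    event_time is assumed to be an ISO 8601 string such as
    "2017-05-03T12:34:56" (an assumption about the data).
    """
    ts = datetime.fromisoformat(event_time)
    return (
        f"{prefix}/year={ts.year:04d}/month={ts.month:02d}/day={ts.day:02d}/"
        f"{uuid.uuid4()}.{ext}"
    )


# The random UUID keeps files in the same partition from colliding.
print(partition_key("2017-05-03T12:34:56"))
```

Queries that filter on year, month, or day can then skip whole subfolders instead of scanning everything.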

DISCLAIMER: I work for AWS, though not on the Redshift or Athena teams. I do use Redshift for work.