by fifilura
767 days ago
PySpark is probably the way to go. I just wanted to mention that AWS Athena eats 15 GB Parquet files for breakfast. It is trivial to map the file into Athena. You can't connect it to anything other than file output, but it can help you, for example, write the data out in smaller chunks, or choose another output format such as CSV (although arbitrary email content in a CSV feels like you are setting yourself up for parsing errors). The benefit is that there is virtually no setup cost, and the processing cost for a 15 GB file will be just a few cents.
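
To make the "smaller chunks" idea concrete, here is a rough sketch of the kind of CTAS statement I mean, built as a string in Python. It assumes the file is already mapped as a table; the table names, bucket column, and S3 path are made up for illustration. Bucketing is one way to get Athena to split the output into a fixed number of files.

```python
def build_ctas(source_table, target_table, output_location,
               bucket_column, bucket_count, file_format="PARQUET"):
    """Build an Athena CTAS statement that rewrites a table into bucketed files.

    All arguments are hypothetical placeholders supplied by the caller;
    nothing here talks to AWS, it only assembles the SQL text.
    """
    return (
        f"CREATE TABLE {target_table}\n"
        f"WITH (\n"
        f"  format = '{file_format}',\n"
        f"  external_location = '{output_location}',\n"
        f"  bucketed_by = ARRAY['{bucket_column}'],\n"
        f"  bucket_count = {bucket_count}\n"
        f") AS\n"
        f"SELECT * FROM {source_table}"
    )


# Example with made-up names: split one big table into 32 Parquet files.
sql = build_ctas(
    source_table="emails_raw",
    target_table="emails_chunked",
    output_location="s3://example-bucket/emails_chunked/",
    bucket_column="message_id",
    bucket_count=32,
)
print(sql)
```

You would paste the resulting statement into the Athena console, or submit it through whatever client you already use. How evenly the files come out depends on how well the bucket column spreads the rows.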