| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by fifilura 814 days ago

Pyspark is probably the way to go.

I just wanted to mention that AWS Athena eats 15G parquet files for breakfast.

It is trivial to map the file into Athena.

But you can't connect it to anything else than file output. But it can help you to for example write it to smaller chunks. Or choose another output format such as csv (although arbitrary email content in a csv feels like you are set up for parsing errors).

The benefit is that there is virtually no setup cost. And processing cost for a 15G file will be just a few cents.

1 comments

calderwoodra 814 days ago

Athena is probably my best bet tbh, especially if I can do a few clicks and just get smaller files. Processing smaller files is a no brainer / pretty easy and could be outsourced to lambda.

fifilura 814 days ago

Yeah the big benefit is that it requires very little setup.

You create a new partitioned table/location from the originally mapped file using a CTAS like so:

  CREATE TABLE new_table_name
  WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    external_location = 's3://your-bucket/path/to/output/'
  ) AS
  SELECT *
  FROM original_table_name
  PARTITIONED BY partition_column_name

You can probably create a hash and partition by the last character if you want 16 evenly sized partitions. Unless you already have a dimension to partition by.