| HN Mirror

	by rsynnott 3330 days ago
	For a dataset this size, though, you'd realistically be using ORC/Parquet rather than CSV with Athena, which would cut query times and cost dramatically. Note that the table shows Athena scanning all the data on every query; that would not be the case if ORC or Parquet were used.
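
A rough illustration of why a columnar format reduces the amount of data scanned: a reader can ask for only the columns a query touches. This is a minimal sketch, assuming the pyarrow package is installed. The function name, table contents, and column names are hypothetical, and it shows local column projection only, not Athena's actual scan accounting.

```python
import os
import tempfile


def read_single_column(num_rows=1000):
    """Write a small three-column table to Parquet, then read back one column."""
    # Assumption: pyarrow is installed. It is imported inside the function so
    # that a missing package only raises ImportError when the sketch is run.
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Hypothetical data: two narrow columns and one wide text column.
    table = pa.table({
        "trip_id": list(range(num_rows)),
        "fare": [float(i % 50) for i in range(num_rows)],
        "comment": ["x" * 100 for _ in range(num_rows)],
    })

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "trips.parquet")
        pq.write_table(table, path)

        # Column projection: only the "fare" column is read and decoded;
        # the wide "comment" column is skipped entirely.
        fares = pq.read_table(path, columns=["fare"])
        return fares.column_names, fares.num_rows
```

With CSV there is no equivalent: every row has to be read in full to pull out a single field, which matches the full scans in the table.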

3 comments

mej10 3330 days ago

Do you know of any good ways of outputting log data directly into Parquet format without hadoop/spark?

mason55 3330 days ago

I'm curious about this as well. I was evaluating Athena a few months ago and was surprised that there was no good way to get my data into Parquet or ORC format without spinning up an EMR cluster and loading into a Hive table of the desired format.

My guess is that in the past there was no real use for these formats unless you already had a Hadoop cluster running. If Amazon wants these "Hadoop as a service" concepts to take off it seems like it would be wise for them to make it easier to get data onto S3 in a better format than CSV.

arnon 3329 days ago

Mostly what I'm hearing here is:

"You're using it wrong"

"You should start up a Hadoop cluster, create a table, load the CSV into that table, then export it as Parquet, and then load that into S3, so that Athena can scan it"

Wouldn't you honestly be better off just creating a table and loading the CSV directly into a columnar database, like on Amazon RDS (even if it means you bring the server instance down after you're done)?

vamin 3325 days ago

You might look into Secor: https://github.com/pinterest/secor

gallamine 3330 days ago

I was able to use Python + Arrow to convert single files into Parquet.
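
Something along these lines, as a minimal sketch rather than the exact script: it assumes the pyarrow package is installed and that each CSV file fits in memory. The function name and file paths are hypothetical.

```python
def csv_to_parquet(csv_path, parquet_path, compression="snappy"):
    """Convert one CSV file to one Parquet file; return the row count."""
    # Assumption: pyarrow is installed. It is imported inside the function so
    # that a missing package only raises ImportError when the sketch is run.
    import pyarrow.csv as pv
    import pyarrow.parquet as pq

    # read_csv infers column names from the header row and types from the data.
    table = pv.read_csv(csv_path)

    # write_table stores the data column by column, compressed per column.
    pq.write_table(table, parquet_path, compression=compression)
    return table.num_rows


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    rows = csv_to_parquet("access_log.csv", "access_log.parquet")
    print(f"wrote {rows} rows")
```

The resulting file can then be uploaded to S3 with whatever tool you already use. No Hadoop or Spark is involved at any point.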

mej10 3330 days ago

I will check Arrow out, thanks!

rsynnott 3330 days ago

You can use org.apache.parquet.hadoop.ParquetWriter and its subclasses (AvroParquetWriter, for example). It needs the Hadoop libraries on the classpath, but it doesn't require a Hadoop cluster. There may also be other ways.