by heuermh
1346 days ago
We presented Parquet formats for bioinformatics at the Bioinformatics Open Source Conference (BOSC) around 2012 or 2013 and got laughed out of the place. While using Apache Spark for bioinformatics [0] never really took off, I still think Parquet formats for bioinformatics [1] are a good idea, especially with DuckDB, Apache Arrow, etc. supporting Parquet out of the box.

0 - https://github.com/bigdatagenomics/adam
1 - https://github.com/bigdatagenomics/bdg-formats
Those upstream tasks tend to be row-oriented. You often iterate over all rows, do something with them, and output new rows in another format. Alternatively, you read the entire input into in-memory data structures, do something, and later serialize the data structures. Using column-oriented formats for such tasks does not feel natural.
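
To make the contrast concrete, here is a rough sketch of the same per-record computation written both ways. The input, the field names (`name`, `start`, `end`) and both function names are made up for illustration and do not come from any real bioinformatics tool; it only uses the Python standard library.

```python
import csv
import io

# Hypothetical three-record input; field names are assumptions for illustration.
ROWS_TSV = "name\tstart\tend\nread1\t100\t150\nread2\t200\t260\nread3\t300\t330\n"


def row_oriented(text):
    """Iterate over records one at a time and emit one output row per input row."""
    out = []
    for rec in csv.DictReader(io.StringIO(text), delimiter="\t"):
        length = int(rec["end"]) - int(rec["start"])
        out.append((rec["name"], length))
    return out


def column_oriented(text):
    """Pivot the same records into per-field columns, compute on the columns,
    then zip the columns back into rows for output."""
    records = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    columns = {key: [rec[key] for rec in records] for key in records[0]}
    lengths = [int(e) - int(s) for s, e in zip(columns["start"], columns["end"])]
    return list(zip(columns["name"], lengths))


if __name__ == "__main__":
    print(row_oriented(ROWS_TSV))
    print(column_oriented(ROWS_TSV))
```

Both functions return the same rows. The row version never needs more than one record at a time, while the column version has to pivot into columns and back, which is the extra step that feels unnatural when every row is consumed and re-emitted anyway.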