by zaptheimpaler
3482 days ago
I had a very similar experience with Parquet and cross-system pains. Pretty much the whole big data space is a giant clusterfuck of poorly documented and ever so slightly incompatible technologies, with hidden config flags you need to find to get things working the way you want, classpath issues, tiny incompatibilities between data storage formats and SQL dialects, and so on.

Hoping someone on this thread can answer a related question: how do you store data in Parquet when the schema is not known ahead of time? Currently we create an RDD and use Spark to save it as Parquet (which I believe has an encoder/decoder for Rows), but this is a problem because we can't stream each record as it arrives, and we use a lot of memory buffering records before writing to disk.
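To make the memory concern concrete, here is a rough sketch of one possible alternative: buffer only a bounded batch of records at a time and append each batch to the file as a row group. It swaps Spark for the `pyarrow` library, which is an assumption and not what the setup above uses. The names `batched`, `write_stream`, and `batch_size` are made up for illustration. The schema is inferred from the first batch, so this only helps if early records are representative: a field that is always null in the first batch may get the wrong type, and fields that first appear later are, as far as I know, silently dropped.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List

Record = Dict[str, Any]


def batched(records: Iterable[Record], size: int) -> Iterator[List[Record]]:
    """Yield lists of at most `size` records, holding one batch in memory."""
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def write_stream(records: Iterable[Record], path: str,
                 batch_size: int = 10_000) -> int:
    """Append records to a Parquet file batch by batch; return rows written."""
    # Imported here so the batching helper works without pyarrow installed.
    import pyarrow as pa
    import pyarrow.parquet as pq

    writer = None
    schema = None
    total = 0
    try:
        for batch in batched(records, batch_size):
            if writer is None:
                # Assumption: the first batch is representative of the schema.
                schema = pa.Table.from_pylist(batch).schema
                writer = pq.ParquetWriter(path, schema)
            # Later batches are coerced to the schema fixed by the first one.
            table = pa.Table.from_pylist(batch, schema=schema)
            writer.write_table(table)
            total += table.num_rows
    finally:
        if writer is not None:
            writer.close()
    return total
```

Something like `write_stream(({"id": i, "name": str(i)} for i in range(1_000_000)), "out.parquet")` would then keep at most `batch_size` records buffered at once. The trade-off is that very small batches produce many small row groups, which tends to hurt read performance.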