Don’t know about Arrow, but Parquet is a columnar format. Such formats can’t write record-by-record, they need a large number of records to shred into columns in order to realize their columnar benefits. In contrast, appending to RecordIO is little more than writing a binary string. The downside of RecordIO is that you can’t just read some fields in a message and not others. You have to deserialize the whole message. RecordIO is cheap to write and well suited for cases where reading the entire message is not that big a deal. Columnar formats are more suited for the cases where it’s ok to pay the relatively substantial up front encoding cost for vastly greater performance in analytical workloads. Advanced ones contain additional metadata (such as range and hash constraints, the former can be both per file and per block) which the analytical runtime will be able to take advantage of in order to avoid doing the work that doesn’t need to be done.