|
|
|
|
|
by jacquesnadeau
3151 days ago
|
|
It's more than an implementation detail because we're also targeting interoperability between multiple separate technologies. One of the key things that the article didn't fully cover is that Arrow serves two purposes: high performance processing and interoperability. A key part of the vision is: two systems can share a representation of data to avoid serialization/deserialization overhead (and potentially copying in a shared memory environment). This is only possible if the in-memory format is also highly efficient for processing. This allows the processing systems (say Pandas and Dremio) to share a representation, both process against it, and then move the data between each other with zero overhead. If you shared the data representation on the wire and then each application had to transform it to a better structure for processing, you're still paying for a form of ser/deser. By using Arrow for both processing and interoperation you benefit from near-no-cost movement between systems and also a highly efficient representation to process data (including some tools to get you started in the form of the Arrow libraries). |
|