| While I like the construction of the column store and the corresponding API, the author's claims don't really make sense: > "... columnarization, a technique from the database community for laying out structured records in a format that is more convenient for serialization than the records themselves." Column stores don't offer any serialization benefit over row stores per se. The main benefits are the following; as an example I will use a record (A,B,C,D,E) in which every field is a u32 (4 bytes): * If you only use some of the fields, you have to load less data from memory/disk into the CPU cache, and your working set is more likely to fit into cache. For example, when filtering x records for those where A=22 and B=45, you only have to load x × (sizeof(A)+sizeof(B)) = x × 8 bytes instead of x × record_size = x × 20 bytes. This can make a very significant difference.
* When using compression to reduce the size of the data, columns can often be compressed better because each one only contains data of the same type and nature, which is therefore likely to share similarities. With such a small record consisting only of integers it probably won't make a difference, but if, e.g., some fields are country abbreviations, others textual descriptions, and others ids, one can easily imagine gains. Coming back to the point about serialization: using the technique described in the blog post, there won't[1] be a performance difference between column storage and row storage (e.g. using a struct). The method described in the blog post just lets the data array of the original vector be wrapped by a Vec<u8> without even moving the memory, so the method is independent of the data type stored in the vectors. Of course it will only work for data types that do not contain references; otherwise we could get illegal memory accesses after deserialization (the Rust type system should guarantee this, because only Copy types are allowed). The only thing this benchmark is testing is how fast a vector can be initialized. [1] There can be a space improvement from keeping the data in a column layout rather than a row layout when using normal structs. A struct's total size is normally rounded up to a multiple of the alignment of its largest field, so a struct containing an i64 and an i8 would contain 7 bytes of padding. In a column layout this overhead would be avoided. Still, there would not be an improvement in this serialization scheme, as it does not actually copy any data. |
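To put numbers on the two size arguments in the comment above, here is a small Rust sketch. The `Record` and `Padded` types and the `bytes_scanned` helper are hypothetical names made up for illustration; none of them come from the blog post. The padding figure assumes a typical 64-bit target where `i64` is 8-byte aligned.

```rust
use std::mem::size_of;

// Hypothetical record (A, B, C, D, E) with every field a u32.
#[allow(dead_code)]
struct Record {
    a: u32,
    b: u32,
    c: u32,
    d: u32,
    e: u32,
}

// Hypothetical struct from the footnote: one i64 and one i8.
#[allow(dead_code)]
struct Padded {
    big: i64,
    small: i8,
}

// Bytes read when scanning `rows` records that each contribute
// `bytes_per_row` bytes to the scan.
fn bytes_scanned(rows: usize, bytes_per_row: usize) -> usize {
    rows * bytes_per_row
}

fn main() {
    let rows = 1_000;

    // Row layout: filtering on A and B still reads whole 20-byte records.
    let row_bytes = bytes_scanned(rows, size_of::<Record>());
    // Column layout: only the A and B columns are read, 8 bytes per record.
    let col_bytes = bytes_scanned(rows, size_of::<u32>() + size_of::<u32>());
    println!("row layout: {} bytes, column layout: {} bytes", row_bytes, col_bytes);

    // The struct is rounded up to its alignment (typically 16 bytes on a
    // 64-bit target); separate columns need only 8 + 1 bytes per record.
    let payload = size_of::<i64>() + size_of::<i8>();
    println!("struct: {} bytes, columns: {} bytes per record", size_of::<Padded>(), payload);
}
```

Both effects concern how many bytes are stored or touched; neither says anything yet about how the bytes get written out, which is what the reply below addresses.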
Unless I misunderstand your post, I think you may have missed important parts!
More is going on than "just wrapping the data of the original vector as a Vec<u8>". That is what happens when we are handed a Vec<uint>, but for other inner types (pairs, vectors, etc.) there is more to do, and the post presents the actual code showing what that work is.
1. A Vec<(u8, u64)> definitely ends up as a (Vec<u8>, Vec<u64>), thereby avoiding padding you'd have if you just wrote out the elements as structs. You absolutely end up saving space, and part of doing this is definitely copying data. (responding to "as it does not actually copy any data").
2. The types themselves can be vectors, corresponding to a struct with owned pointers (whose elements can also own pointers, etc). This is not something that just casting the source array will deal with. It's important here that Rust hands you ownership, as otherwise it would be totally inappropriate for us to claim the underlying memory (which the code does). That part, recycling owned memory, is one of the big performance wins (about 2.5x faster for me than invoking the allocator each time I need to mint a new array).
3. "Copy" doesn't appear in the first post. The types don't have to be Copy, and indeed Vec<T> is not Copy, even if T is. Not sure where you got that from. But, no such requirement; yay!
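For readers who want to see the shape of points 1 and 2, here is a minimal sketch. It is not the code from the post: `PairColumns`, `NestedColumns`, and the `spare` pool are hypothetical names, and the little-endian encoding is an assumption chosen only to keep the example self-contained. It shows a `Vec<(u8, u64)>` being split into two columns (copying the data and dropping the padding), and a `Vec<u64>` record being flattened into a lengths column plus a values column, with the emptied allocation kept for reuse.

```rust
// Hypothetical column store for (u8, u64) records: one vector per field.
#[derive(Default)]
struct PairColumns {
    firsts: Vec<u8>,
    seconds: Vec<u64>,
}

impl PairColumns {
    // Takes ownership of the records and copies each field into its column.
    fn push_all(&mut self, records: Vec<(u8, u64)>) {
        for (first, second) in records {
            self.firsts.push(first);
            self.seconds.push(second);
        }
    }

    // Writes the columns back to back into `out`, with no padding.
    fn encode(&self, out: &mut Vec<u8>) {
        out.extend_from_slice(&self.firsts);
        for value in &self.seconds {
            out.extend_from_slice(&value.to_le_bytes());
        }
    }

    // Empties the columns but keeps their allocations for the next batch.
    fn clear(&mut self) {
        self.firsts.clear();
        self.seconds.clear();
    }
}

// Hypothetical column store for Vec<u64> records.
#[derive(Default)]
struct NestedColumns {
    lengths: Vec<usize>,
    values: Vec<u64>,
    spare: Vec<Vec<u64>>, // emptied inner vectors, kept so they can be reused
}

impl NestedColumns {
    // Owning `record` is what lets us keep its allocation afterwards.
    fn push(&mut self, mut record: Vec<u64>) {
        self.lengths.push(record.len());
        self.values.append(&mut record); // leaves `record` empty
        self.spare.push(record);
    }

    // Rebuilds the most recently pushed record, reusing a spare allocation
    // when one is available instead of asking the allocator for a new one.
    fn pop(&mut self) -> Option<Vec<u64>> {
        let len = self.lengths.pop()?;
        let mut record = self.spare.pop().unwrap_or_default();
        let start = self.values.len() - len;
        record.extend(self.values.drain(start..));
        Some(record)
    }
}

fn main() {
    let mut pairs = PairColumns::default();
    pairs.push_all(vec![(1, 10), (2, 20), (3, 30)]);
    let mut bytes = Vec::new();
    pairs.encode(&mut bytes);
    println!("encoded {} bytes", bytes.len()); // 3 * (1 + 8) = 27
    pairs.clear();

    let mut nested = NestedColumns::default();
    nested.push(vec![1, 2, 3]);
    nested.push(vec![4]);
    println!("{:?}", nested.pop()); // Some([4])
}
```

Writing the three pairs out as structs would typically take 48 bytes on a 64-bit target, against 27 here. Neither type needs `Copy` for any of this to work; what matters is that `push_all` and `push` take their arguments by value.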
Read part 2 for more about Copy, and how when your type is Copy you get a free implementation that does what you suggest (keeping each struct intact), except you can mix and match with Vecs and Options and stuff like that.
Hope this clears up some of the not-sense-making. Feel free to holler with other questions.
Cheers, Frank