Hacker News new | ask | show | jobs
by frankmcsherry 4199 days ago
Hi, author here.

Unless I misunderstand your post, I think you may have missed important parts!

More is going on than "just wrapping the data of the original vector as a Vec<u8>". That is what happens when we are handed a Vec<uint>, but for other inner types (pairs, vectors, etc) there is more to do (and, actual code presented showing what that work is).

1. A Vec<(u8, u64)> definitely ends up as a (Vec<u8>, Vec<u64>), thereby avoiding padding you'd have if you just wrote out the elements as structs. You absolutely end up saving space, and part of doing this is definitely copying data. (responding to "as it does not actually copy any data").

2. The types themselves can be vectors, corresponding to a struct with owned pointers (whose elements can also own pointers, etc). This is not something that just casting the source array will deal with. It's important here that Rust hands you ownership, as otherwise it would be totally inappropriate for us to claim the underlying memory (which the code does). That part, recycling owned memory, is one of the big performance wins (about 2.5x faster for me than invoking the allocator each time I need to mint a new array).

3. "Copy" doesn't appear in the first post. The types don't have to be Copy, and indeed Vec<T> is not Copy, even if T is. Not sure where you got that from. But, no such requirement; yay!

Read part 2 for more about Copy, and how when your type is Copy you get a free implementation that does what you suggest (keeping each struct intact), except you can mix and match with Vecs and Options and stuff like that.

Hope this clears up some of the not-sense-making. Feel free to holler with other questions.

Cheers, Frank

1 comments

Addressing your points:

1. In my comment I acknowledge that there are space savings with types that include padding. I don't understand why you imply in your answer that I didn't understand this point. Regarding my comment about not copying data: it was based on the benchmark[1] as linked in the blog post. The parts you measure do not copy any of the user data! It only converts your internal vectors of user types to a list of vectors of u8 type. So what you are doing is essentially moving the data. But when using move semantics the size of the user data does not matter any more, so there won't be a difference between your column layout and using the more complex types directly inside the vectors if measured the same way as in this benchmark. In my opinion your benchmark is flawed and does not support the argument you make in your blog post.

2. Ok yes that was bad wording on my side. If you provide adaptors for complex types with pointers like Vec than you can of course also serialize those.

3. I guess this is just about the same argument as point 2) and thus redundant.

I made the effort for you to write a serialization framework which does not do "columnarization" but which simulates row layout and also added an adaptor for vec and ran it with the same benchmark, and also reran the original columnar benchmark on my machine. Both benchmarks were compiled exactly the same way. You can find my code here[2]. Here are the results:

==============================

columnarization <uint> 12.1 GB/s, 742k values/s

columnarization <(uint, (uint, uint))> 5.3 GB/s, 107k values/s

columnarization <vec<uint>> 2.1 GB/s, 54k values/s

columnarization <Option<uint>> 1.58 GB/s, 163k values/s

==============================

row (simple_serialize) <uint> 12.1 GB/s, 730k values/s

row (simple_serialize) <(uint, (uint, uint))> 8.83 GB/s, 164 values/s

row (simple_serialize) <vec<uint>> 2.13 GB/s, 56 values/s

row (simple_serialize) <Option<uint>> 8.82 GB/s, 263k values/s

==============================

You see that columnization does not have a performance benefit in your benchmark and it even is significantly slower for Option<uint> type and pairs.

In your blog post you never mentioned that columnization will only has the potential to bring performance benefits to de-/serialization when using types with large padding overheads. I think this discussion would probably helped the blog post. It would be much better if it would either omit the currently wrong performance argument and just focus on the nicely typed API or if it would use a proper benchmark which would support your argument, which from my understanding would be only possible in a very limited set of use case scenarios.

Here is a list of other valid arguments you could have made instead:

* Format saves space at the cost of performance.

* Better than repr(packed) as it will also work on platforms that don't support unaligned access

[1] https://github.com/frankmcsherry/columnar/blob/master/exampl... [2] https://github.com/mkaufmann/simple_serialize

With all due respect, your code seems to be a copy/paste of my code, where you've removed the Pair and Option implementation (allowing the alternate default I discussed in my second blog post) and replaced my string `ColumnarVec` with your string `SimpleSerialize`. So, thank you for measuring my code for me? :D

The reason the new numbers look better for Option with the Copy implementation is that you are now writing 16 bytes for each Option. They are a 50-50 mix of Some(0u) and None, which I wrote out in 5 bytes on average (always a 1 byte "present/not", and then 4 bytes of data on average). The Copy implementation is just writing 3x as much data and padding the throughput numbers, rather than reporting something more like goodput.

I sense that this isn't the right place to shake this out, so I'll stop posting. If there is a better way to follow up, let me know.

[edit: I've got you through github, thanks! I'll buy beers when I swing by TU next]

The changes make all the difference. Your argument was that columnization and thus storing each native type in a data record in seperate columns brings a performance benefit. By removing the specialisation of Pair and Option which save the data in seperate columns I switched back to a classical storage which basically stores the data of one record at one place like a row store. So your code simulates a column store and my a row store.

Using your original benchmark I than show that the row layout brings a large performance win in the benchmark. My numbers show this performance win not just in throughput (bytes per sec.) but als in "goodput" (values per sec.). Check my previous comment. I just noticed though that I sometimes forgot the k in the reported numbers, where it is missing you have ti multiply the number times 1000, I can't edit it anymore.

I guess we can agree to disagree and should continue the discussion in another form ;)

PS: I just updated my github information, I am now at the TU Munich