| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by MITSardine 456 days ago

One case I benchmarked was Bernstein/Bézier and Lagrange element evaluation. This is: given a degree d triangle or tetrahedron, given some barycentric coordinates, get the physical coordinate and the Jacobian matrix of the mapping.

Degree 2, Lagrange:

- runtime: 3.6M/s - Hana: 16.2M/s - Hardcoded: 37.7M/s

Degree 3, Lagrange: 2.6M/s, 6.4M/s, 13.4M/s (same order).

"Runtime" here means everything is done using runtime loops, "Hana" using Boost Hana to make loops compile-time and use some constexpr ordering arrays, "hardcoded" is a very Fortran-looking function with all hardcoded indices and operations all unrolled.

As you see, using Boost Hana does bring about some improvement, but there is still a factor 2x between that and hardcoded. This is all compiled with Release optimization flags. Technically, the Hana implementation is doing the same operations in the same order as the hardcoded version, all indices known at compile time, which is why I say there must be some runtime overhead to using hana::while.

In the case of Bernstein elements, the best solution is to use de Casteljau's recursive algorithm using templates (10x to 15x speedup to runtime recursive depending on degree). But not everything recasts itself nicely as a recursive algorithm, or I didn't find the way for Lagrange anyways. I did enable flto as, from my understanding (looking at call stacks), hana::while creates lambda functions, so perhaps a simple function optimization becomes a cross-unit affair if it calls hana::while. (speculating)

Similar results to compute Bernstein coefficients of the Jacobian matrix determinant of a Q2 tetrahedron, factor 5x from "runtime" to "hana" (only difference is for loops become hana::whiles), factor 3x from "hana" to "hardcoded" (the loops are unrolled). So a factor 15x between naive C++ and code generated files. In the case of this function in particular, we have 4 nested loops, it's branching hell where continues are hit very often.

1 comments

Arech 456 days ago

Hhhmmm, interesting, thanks for reply!

That would be fairly interesting to look at the actual code you've used, and have a look at the codegen. By a chance, is it viable for you to open-source it? I'd guess it should bear lots of interest for Hana author/s.

What compiler/version did you use? For example, MSVC isn't (at least wasn't) good at always evaluating `constexpr` in compile-time...

> hana::while creates lambda functions, so perhaps a simple function optimization becomes a cross-unit affair if it calls hana::while. (speculating)

Hmm, I'd say it (LTO) shouldn't influence, as these lambdas are already fully visible to a compiler.

link

MITSardine 456 days ago

I never thought to contact them, but I might do that, thanks for the suggestion. This is something I tested almost two years ago, I have these benchmarks written down but I've since deleted the code I've used, save for the optimal implementations (though it wouldn't take too long to rewrite it).

I tested with clang on my Mac laptop and gcc on a Linux workstation. Version, not sure. If I test this again to contact the Hana people, I'll try and give all this information. I did test the constexpr ordering arrays by making sure I can pass, say, arr[0] as a template parameter. This is only possible if the value is known at compile time. Though it's also possible the compiler could be lazy in other contexts, as in not actually evaluating at compile time if it figures out the result is not necessary to be known at compile time.

Oh yeah, you're right, I was confusing translation unit and function scope.

link