I was just taking a look and couldn't help but notice the switch statement in your operator[], which likely causes a lot of unnecessary bad speculation at runtime:
https://github.com/RandyGaul/cute_headers/blob/755849fc2819d...
Many assume the C++ compiler will magically optimize the switch away, but in some cases, like the CLHEP example above, it doesn't, and you end up with bad performance.
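To make the switch point concrete, here is a rough sketch of the two indexing styles side by side. The type and function names (Vec3Switch, Vec3Array, sum) are made up for illustration and are not taken from cute_headers or CLHEP. Whether a given compiler lowers the switch to branch-free code depends on the compiler and flags, so it is worth checking the generated assembly rather than trusting either version blindly.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical vector with named members and switch-based indexing.
// With a runtime index, the switch may compile to compare-and-branch code.
struct Vec3Switch {
    float x, y, z;

    float& operator[](std::size_t i) {
        switch (i) {
            case 0:  return x;
            case 1:  return y;
            default: return z;
        }
    }
};

// Hypothetical alternative: store the components in an array.
// Indexing is then a single address computation with no branch.
struct Vec3Array {
    float v[3];

    float& operator[](std::size_t i) {
        assert(i < 3);  // debug-only bounds check
        return v[i];
    }

    // Named accessors keep call sites readable.
    float& x() { return v[0]; }
    float& y() { return v[1]; }
    float& z() { return v[2]; }
};

// Sums the three components through operator[] with a runtime index,
// which is the access pattern where the two versions can differ.
template <typename V>
float sum(V& a) {
    float s = 0.0f;
    for (std::size_t i = 0; i < 3; ++i) {
        s += a[i];
    }
    return s;
}
```

Both types return the same values through operator[]; the difference is only in what the indexing can compile down to when the index is not a compile-time constant.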
See an optimized quaternion multiplication implementation in SSE by me here:
https://stackoverflow.com/questions/18542894/how-to-multiply...