The last time I optimized code the hard way, I was using VTune to channel the right operation into the right pipeline.