The problem is less the spurious DRAM accesses, awful as they would be. The compiler problem is really a mix of 1) understanding enough about fixed-bound, unit-stride loops over non-overlapping memory (or transforming accesses into that form) and 2) data layouts that prevent that. E.g., while there are well-understood data layouts at each point of the compilation pipeline, it's hard in general for compilers to profitably shift from array-of-structs to struct-of-arrays layouts.
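To make the layout point concrete, here is a minimal sketch of the same update written against both layouts. The names (ParticleAoS, ParticlesSoA, scale_x_aos, scale_x_soa, to_soa) are made up for illustration, and whether a given compiler vectorizes either loop depends on its version and flags.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of structs: the x, y, z of one particle sit next to each other,
// so a loop that touches only x walks memory with a stride of 3 floats.
struct ParticleAoS { float x, y, z; };

// Struct of arrays: all x values are contiguous (unit stride).
struct ParticlesSoA {
    std::vector<float> x, y, z;
};

// Strided access: vectorizing this generally needs gathers/scatters or shuffles.
void scale_x_aos(std::vector<ParticleAoS>& ps, float k) {
    for (std::size_t i = 0; i < ps.size(); ++i) ps[i].x *= k;
}

// Unit-stride access: the easy case for a vectorizer.
void scale_x_soa(ParticlesSoA& ps, float k) {
    float* x = ps.x.data();
    const std::size_t n = ps.x.size();
    for (std::size_t i = 0; i < n; ++i) x[i] *= k;
}

// The layout shift itself, done by hand. A compiler would have to prove
// this safe and profitable for every use of the data before doing it.
ParticlesSoA to_soa(const std::vector<ParticleAoS>& ps) {
    ParticlesSoA out;
    out.x.reserve(ps.size());
    out.y.reserve(ps.size());
    out.z.reserve(ps.size());
    for (const ParticleAoS& p : ps) {
        out.x.push_back(p.x);
        out.y.push_back(p.y);
        out.z.push_back(p.z);
    }
    return out;
}
```

The conversion is trivial to write by hand, which is rather the point: the programmer knows every use of the data, while the compiler usually cannot see them all.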

You are correct that, generally speaking, most STL-heavy code would be hard to vectorize and unlikely to gain much advantage. (Plus there are the valarray misadventures.) You will sometimes see clang and gcc vectorize loops over std::vector if the code is simple enough and they can assume strict aliasing. Intel's compiler has historically been less aggressive about assuming strict aliasing.
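As a rough illustration of "simple enough", here is the kind of std::vector loop I have in mind. The function name axpy is just the conventional BLAS-style name, and the comment on aliasing is my understanding of why type-based alias analysis matters here, not a guarantee about any particular compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y[i] += alpha * x[i] over two std::vector<float>.
void axpy(float alpha, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    // Hoist size and data pointers into locals so the loop body sees a
    // fixed bound and plain pointers. Under strict aliasing the compiler
    // can also assume a store to a float does not modify the vectors'
    // internal pointer members, so it can do this hoisting itself.
    const std::size_t n = x.size();
    const float* xp = x.data();
    float* yp = y.data();
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] += alpha * xp[i];
    }
}
```

The compiler still cannot prove that xp and yp do not overlap, so a vectorized version typically carries a runtime overlap check with a scalar fallback.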

Various proposals are working their way through the standards committee to add explicit support for SIMD programming. E.g., if something like http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n418... were to be standardized, we could write matrix multiply explicitly as:

  using SomeVec = Vector<T>;
  for (size_t i = 0; i < n; ++i) {
    // Compute SomeVec::size() adjacent elements of row i of C at a time.
    for (size_t j = 0; j < n; j += SomeVec::size()) {
      SomeVec c_ij = A[i][0] * SomeVec(&B[0][j], Aligned);
      for (size_t k = 1; k < n; ++k) {
        c_ij += A[i][k] * SomeVec(&B[k][j], Aligned);
      }
      c_ij.store(&C[i][j], Aligned);
    }
  }
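
For anyone who wants to run that loop nest today, here is a minimal sketch with a toy stand-in type. ToyVec is hypothetical and is not the proposal's API: it is just a fixed-width array with the three operations the loop needs, it drops the Aligned tag, and it assumes row-major flat storage with n a multiple of the vector width.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a SIMD vector type (hypothetical, not the proposal's API).
template <typename T, std::size_t W>
struct ToyVec {
    std::array<T, W> lane{};

    static constexpr std::size_t size() { return W; }

    ToyVec() = default;

    // Load W contiguous elements starting at p.
    explicit ToyVec(const T* p) {
        for (std::size_t l = 0; l < W; ++l) lane[l] = p[l];
    }

    // Lane-wise accumulate.
    ToyVec& operator+=(const ToyVec& o) {
        for (std::size_t l = 0; l < W; ++l) lane[l] += o.lane[l];
        return *this;
    }

    // Broadcast a scalar across all lanes and multiply.
    friend ToyVec operator*(T s, const ToyVec& v) {
        ToyVec r;
        for (std::size_t l = 0; l < W; ++l) r.lane[l] = s * v.lane[l];
        return r;
    }

    // Store W contiguous elements starting at p.
    void store(T* p) const {
        for (std::size_t l = 0; l < W; ++l) p[l] = lane[l];
    }
};

// C = A * B for row-major n x n matrices; n must be a positive multiple of W.
template <typename T, std::size_t W = 4>
void matmul(const std::vector<T>& A, const std::vector<T>& B,
            std::vector<T>& C, std::size_t n) {
    using V = ToyVec<T, W>;
    assert(n > 0 && n % W == 0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; j += V::size()) {
            V c_ij = A[i * n] * V(&B[j]);  // k = 0 term
            for (std::size_t k = 1; k < n; ++k) {
                c_ij += A[i * n + k] * V(&B[k * n + j]);
            }
            c_ij.store(&C[i * n + j]);
        }
    }
}
```

The inner loops over lanes are fixed-bound and unit-stride, which is exactly the shape the first paragraph says compilers handle well. A real SIMD type would map them straight onto vector instructions.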

In my own work on vector languages and compilers, I've had an easier time of it, since those languages were designed to make SIMD code generation simpler.