You are right; perhaps my original post was unclear. The goal is to verify the transformation from LLVM IR to low-level assembly. I embellished slightly by equating the semantics of low-level assembly with CPU executions.
I'm pretty sure they're talking about IR-to-IR transformations (e.g., high-level IR optimizations) rather than transformations from IR to low-level assembly (code generator backends). The former is grad school stuff; the latter is hard.