This is really cool - I understand how the reinforcement loop works for improving performance, but how does it verify that the optimizations applied don't change the semantics/correctness of the code?
This. For now we rely on differential testing against a gold-standard implementation (e.g. the unoptimized build): run both on the same inputs and compare the results. For the action space we expose, any semantics-breaking change induced by our tool is a compiler bug.
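
To make the idea concrete, here is a minimal sketch of differential testing in Python. It is an illustration only, not the tool's code: `reference_sum_of_squares` stands in for the unoptimized gold standard, `optimized_sum_of_squares` stands in for an optimized variant, and `differential_test` and `gen_int_list` are hypothetical helper names.

```python
import random


def reference_sum_of_squares(xs):
    # Hypothetical gold standard: the plain, unoptimized version.
    total = 0
    for x in xs:
        total += x * x
    return total


def optimized_sum_of_squares(xs):
    # Hypothetical optimized variant that must behave identically.
    return sum(x * x for x in xs)


def gen_int_list(rng):
    # Random list of 0 to 20 small integers.
    return [rng.randint(-100, 100) for _ in range(rng.randint(0, 20))]


def differential_test(reference, candidate, gen_input, trials=1000, seed=0):
    """Run both implementations on the same random inputs.

    Returns the first input on which they disagree, or None if no
    disagreement was found within `trials` attempts.
    """
    rng = random.Random(seed)  # fixed seed so failures are reproducible
    for _ in range(trials):
        inp = gen_input(rng)
        if reference(inp) != candidate(inp):
            return inp
    return None


if __name__ == "__main__":
    mismatch = differential_test(
        reference_sum_of_squares, optimized_sum_of_squares, gen_int_list
    )
    print("no mismatch found" if mismatch is None else f"mismatch on {mismatch!r}")
```

Note that a clean run only means no counterexample turned up on the inputs tried; it is evidence of equivalence, not a proof.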