Maybe try replacing the ALU with one written directly in Verilog, I suspect this will run a lot faster than building it up from 74181+74182 components.
The current state is _very_ fast in simulation to the point where it is uninteresting (there are other things to figure out) to write something as a behavioral model of the '181/'182.
~100 microcode instructions takes about 0.1 seconds to run.
I was thinking more of a behavioral model of the whole ALU, just so that the FPGA tools can map it onto a collection of the smaller ALUs built into each slice.
What clock speed does your latest design synthesize at?
There was already a design of CADR for FPGAs [1] that does synthesize (and boot), I don't know why amszmidt needed to start again from scratch or if his design is a modification of the earlier one.
A similar comment applies to lm-3. Maybe it is built on a fork of the previous repo, it is hard to tell.
And of course .. https://tumbleweed.nu/lm-3 .