So awesome! RIP my weekend plans. I have done a similar thing for fun using Python and modifying the class dictionary (half working), more in the scope of multi-objective optimization.
One of the largest programs generated had consisted of about 300 instructions.
This particular program included if/then conditionals, counting down in a loop, concatenation of a numeric value with text, and displaying output.
The system also attempts to optimize the number of programming instructions executed, in which case complexity can actually be considered based upon the resulting behavior, rather than LOC.
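To make the "optimize the number of instructions executed" part concrete, here is a minimal sketch of how such a fitness score might look. This is my own guess, not the article's actual scoring: the `fitness` function, its parameters, and the idea of ranking by a (correctness, steps) pair are all assumptions for illustration.

```python
def fitness(output: str, target: str, steps: int) -> tuple[int, int]:
    """Hypothetical fitness for a generated program (illustrative only).

    output: what the candidate program printed
    target: what we wanted it to print
    steps:  how many instructions it executed while running
    """
    # Count positions where the printed character matches the target.
    matches = sum(a == b for a, b in zip(output, target))
    # Charge for missing or extra characters.
    length_gap = abs(len(output) - len(target))
    # Tuples compare left to right, so correctness always wins and the
    # executed-instruction count only breaks ties between equal outputs.
    return (matches - length_gap, -steps)


# Two candidates that both print the right text; the second does less work.
slow = fitness("hi", "hi", steps=400)
fast = fitness("hi", "hi", steps=40)
best = max(slow, fast)
```

Ranking by a pair instead of a weighted sum means a wrong-but-short program can never beat a correct-but-slow one, which seems to match the idea of judging complexity by behavior rather than LOC.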
That's... decent. It's a lot larger than I ever got with a GA.
I don't think we're ever going to get a GA to produce an air-traffic control system, or a database, or an OS. But 300 instructions (that work) is further than I was aware had been possible.
(When I said "decent" in the first paragraph, that sounds like I'm damning it with faint praise. And I kind of am. But I am doing so in light of the Linux kernel, which is in the tens of millions of lines. I am not doing so to minimize this as an achievement within the world of GA-generated code. I'm just saying that we're a ways from being able to handle most real-world problems with GA-generated code.)