Yes indeed. Does that mean similar katas across multiple languages have to be fully duplicated? I'd have expected an external checking system (à la TAP); doing everything internally with languages like Ruby or Python seems like a pretty bad idea.
I think it's a bad idea for most languages. Even with C, C++, Go, Rust, or Swift, it's pretty trivial to exploit anything running within the same process. Any user-submitted code needs to be isolated from any code you need to be able to trust.
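To make the separation concrete, here is a rough sketch of a checker that runs the submission in a child interpreter and only looks at its stdout. The name `run_isolated` and the 5-second timeout are made up for illustration. Note that a separate process on its own is not a sandbox; a real site would presumably add OS-level isolation (containers, seccomp, resource limits) on top of this.

```python
import subprocess
import sys

def run_isolated(source, stdin_text="", timeout=5):
    """Run untrusted source in a separate Python process.

    Hypothetical helper: the submission never shares memory with the
    checker, so it cannot monkey-patch the test framework or read the
    expected answers out of the checker's objects.
    """
    result = subprocess.run(
        [sys.executable, "-I", "-c", source],  # -I: isolated mode, ignores env vars and user site-packages
        input=stdin_text,
        capture_output=True,
        text=True,
        timeout=timeout,  # raises subprocess.TimeoutExpired on runaway code
    )
    return result.returncode, result.stdout

# The checker compares output on its own side of the process boundary.
returncode, output = run_isolated("print(int(input()) * 2)", "21\n")
print(returncode, output.strip())
```

The trusted comparison happens entirely in the parent process; the child only ever sees its own source and stdin.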
A string match for the stdout can still be cheated.
For instance, if you ask me to write a program that computes the first 100 digits of pi, I can just have it print a string literal.
Anything that has no inputs, or that has a small input space, or that is known to be tested with only a few known input cases, can be cheated by cooking the output.
A "cheat-resistant" way to verify that something is working is to choose problems that have a large input space, and randomly probe the space.
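A quick sketch of what I mean, using sorting as a stand-in problem. Everything here (`reference_sort`, `cooked_submission`, `probe`, the choice of 200 trials) is invented for the example; the idea is just that the checker keeps a trusted implementation and compares against it on randomly generated inputs the submitter cannot know in advance.

```python
import random

def reference_sort(xs):
    # Trusted implementation, kept on the checker's side.
    return sorted(xs)

def honest_submission(xs):
    # A genuine (if slow) solution: insertion sort.
    out = []
    for x in xs:
        i = len(out)
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    return out

def cooked_submission(xs):
    # Cheats by memorizing answers to the publicly known test cases
    # and returning the input unchanged for anything else.
    canned = {(3, 1, 2): [1, 2, 3], (): []}
    return canned.get(tuple(xs), list(xs))

def probe(candidate, trials=200, seed=None):
    # Randomly sample the input space and compare with the reference.
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-1000, 1000) for _ in range(rng.randint(0, 20))]
        if candidate(list(xs)) != reference_sort(xs):
            return False
    return True

print(cooked_submission([3, 1, 2]) == [1, 2, 3])  # passes the fixed, known case
print(probe(honest_submission))
print(probe(cooked_submission))
```

The cooked version passes the fixed case it memorized, but it only survives the random probe if every sampled list happens to be sorted already, which is vanishingly unlikely over 200 trials.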
Famous examples of this kind of cheating have occurred in compiler benchmarks. A compiler can recognize that the program being fed to it is a known benchmark, and produce an optimization of the benchmark as a whole. That is, "if the abstract syntax tree of 279 nodes is exactly this particular one, spit out this canned piece of code which 'translates' it."
The entire point of the site is that programming challenges using STDIN/STDOUT are not how you write code. So yes, you can cheat, but who cares? At least you get to write real code. Not to mention the number of things that can be tested for, since you are able to use real testing frameworks.
STDIN/STDOUT-based challenges force developers to solve esoteric problems that they would never see in real life, and to write code that they would never use for anything other than a challenge site.