Clock speed is just another metric, not the most important one. You also have chips that do several operations per clock cycle, low-power chips, and chips that do parallel processing. All these factors affect performance as well.
I would argue that neither of those things is a metric for a transistor, since both are affected by how it's used. Properties of the transistor itself are what matter: size, drain and source capacitance, leakage current when off, drain-source voltage drop when on, and probably a dozen things I don't really know about.
The properties you mention are mainly "low level" properties, which in the end affect the "high level" properties: speed and power consumption. For example, capacitance limits clock speed and also affects power consumption, and leakage current reduces power efficiency. Capacitance and leakage current themselves depend on size. I think the physical properties you mention (and maybe others) can be used to build a model, which can then be used to estimate clock speed and power consumption at different operating points.
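To make the "model" idea concrete, here is a minimal sketch using the standard first-order CMOS approximation: switching power scales as activity × capacitance × voltage² × frequency, and leakage power is roughly supply voltage × leakage current. The function names and every number below are made-up placeholders, not figures for any real process, and a real model would have many more terms.

```python
# First-order CMOS power model (textbook approximation, not a real process model).
# All parameter values below are made-up placeholders for illustration.

def dynamic_power(activity, capacitance_f, voltage_v, freq_hz):
    """Switching power in watts: activity * C * V^2 * f."""
    return activity * capacitance_f * voltage_v ** 2 * freq_hz


def leakage_power(voltage_v, leakage_current_a):
    """Static power in watts burned by leakage while transistors are nominally off."""
    return voltage_v * leakage_current_a


def total_power(activity, capacitance_f, voltage_v, freq_hz, leakage_current_a):
    """Sum of switching and leakage power at one operating point."""
    return (dynamic_power(activity, capacitance_f, voltage_v, freq_hz)
            + leakage_power(voltage_v, leakage_current_a))


if __name__ == "__main__":
    # Hypothetical operating points: (supply voltage in V, clock in Hz).
    for volts, hertz in [(1.0, 3.0e9), (0.8, 3.0e9), (0.8, 2.0e9)]:
        watts = total_power(0.1, 1.0e-9, volts, hertz, 0.5)
        print(f"{volts:.1f} V @ {hertz / 1e9:.1f} GHz -> {watts:.3f} W")
```

The voltage-squared term is why lowering the supply voltage saves so much switching power, and the separate leakage term is the part that does not go away when you lower the clock.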
I would think that they are talking about clock frequency. They do say that you can either have the 37% performance increase, or use less than half the power when at the same performance.
It's probably easier to use that much less power for the same performance than to push the same power in and get higher performance. At smaller scales, leakage and heat dissipation become bigger issues, so you might not be able to push the same amount of power through as with the larger-scale chip.
"Performance" doesn't really mean anything by itself without more details in this context, its just a vauge marketing phrase. Go look at the switching-speed/voltage tradeoff charts that pilom linked to at Anandtech to get some meaningful but still very impressive information.
No, that limit has been hit by the physical structure of silicon itself. The performance gain comes from more transistors per unit area, so you can fit more "cores" onto a single chip. Or, for example, say it takes me 2 million transistors to make a 5-clock-cycle multiplier, but I could get a 1-clock-cycle multiplier if I used 5 million transistors. I didn't use to have the physical real estate to do it; now I would, so I'd do the same operation in 1/5 the time.
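The multiplier arithmetic above works out like this, as a rough sketch. It assumes both designs run at the same clock and that multiplies are issued back to back with no pipelining; the 2 GHz clock and the operation count are placeholders I picked, and `execution_time_s` is just an illustrative helper.

```python
def execution_time_s(num_ops, cycles_per_op, clock_hz):
    """Time to run num_ops operations back to back, with no pipelining or overlap."""
    return num_ops * cycles_per_op / clock_hz


CLOCK_HZ = 2.0e9      # placeholder clock, identical for both designs
NUM_OPS = 1_000_000   # placeholder number of multiplies

# Cycle counts come from the comment above: 2 million transistors gives a
# 5-cycle multiplier, 5 million transistors gives a 1-cycle multiplier.
small_design = execution_time_s(NUM_OPS, 5, CLOCK_HZ)
big_design = execution_time_s(NUM_OPS, 1, CLOCK_HZ)

print(f"5-cycle multiplier: {small_design * 1e3:.2f} ms")
print(f"1-cycle multiplier: {big_design * 1e3:.2f} ms")
print(f"time ratio (big / small): {big_design / small_design:.2f}")
```

The clock never changes here; the speedup comes entirely from spending more transistors per operation.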