I'm not sure I see the problem. What does it matter that program A is shorter than program B because language A has a richer standard library? Program A still required less code.
Because that's not what's being measured here. You're also mixing in performance, and it's impossible to tell at a glance whether a score is attributable to one, the other, or both.