There are many measures of benchmark sizes. One important measure is size of codes that account for 99% of execution time. If your codebase is a million lines but your hotspot is a thousand lines, benchmark result is sensitive to optimization quirks and in some sense benchmark is small.
More on this idea here: http://blog.pyston.org/2014/12/05/python-benchmark-sizes/