Go ahead and move the goalposts now... This took about 2 minutes of research to support the conclusions I know to be true. You can waste time as long as you choose in academia attempting to prove any point, while normal people make real contributions using LLMs.
### An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation
We evaluate TESTPILOT using OpenAI’s gpt3.5-turbo LLM on 25 npm packages with a total of 1,684 API functions. The generated
tests achieve a median statement coverage of 70.2% and branch coverage of 52.8%. In contrast, the state-of-the feedback-directed
JavaScript test generation technique, Nessie, achieves only 51.3% statement coverage and 25.6% branch coverage.
- *Link:* [An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation (arXiv)](https://arxiv.org/abs/2302.06527)
---
### Field Experiment – CodeFuse (12-week deployment)
- Productivity (measured by the number of lines of code produced) increased by 55% for the group using the LLM. Approximately one third of this increase was directly attributable to code generated by the LLM.
- *Link:* [CodeFuse: Generative AI for Code Productivity in the Workplace (BIS Working Paper 1208)](https://www.bis.org/publ/work1208.htm)
The point is that the information is readily available, and rather than actually adding to the discussion they chose to crow “source?”. It’s very lame.
LOC does have a correlation with productivity, as much as devs hate to acknowledge it. I don’t care that you can provide counterexamples to this, or even if the AI on average takes more LOC to accomplish the same task - it still results in more productivity overall because it arrives at the result faster.
Nothing about this is moving goalposts - you and/or the person(s) conducting this study are the ones being misleading!
If you want to measure time to complete a complex task, then measure that. LOC is an intermediate measure. How much more productive is "55% more lines of code"?
I can write a bunch of garbage code really fast with a lot of bugs that doesn't work, or I can write a better program that works properly, slower. Under your framework, the former must be classified as 'better' - but why?
I read the study you reference and there is literally nothing in the study that talks about whether or not tasks were accomplished successfully.
It says:
* Junior devs benefited more than senior devs, then presents a disingenuous argument as to why that's the senior devs' fault (more experienced employees are worse than less experienced employees, who knew?!)
* 11% of the 55% increase in LOC was attributed directly to LLM output
* Makes absolutely no attempt to measure whether or not the extra code was beneficial
Yes, like I said, it’s not hard to provide counterexamples to why more LOC is better, but it’s also missing the forest for the trees to pretend it doesn’t matter at all.
### An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation We evaluate TESTPILOT using OpenAI’s gpt3.5-turbo LLM on 25 npm packages with a total of 1,684 API functions. The generated tests achieve a median statement coverage of 70.2% and branch coverage of 52.8%. In contrast, the state-of-the feedback-directed JavaScript test generation technique, Nessie, achieves only 51.3% statement coverage and 25.6% branch coverage. - *Link:* [An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation (arXiv)](https://arxiv.org/abs/2302.06527)
---
### Field Experiment – CodeFuse (12-week deployment) - Productivity (measured by the number of lines of code produced) increased by 55% for the group using the LLM. Approximately one third of this increase was directly attributable to code generated by the LLM. - *Link:* [CodeFuse: Generative AI for Code Productivity in the Workplace (BIS Working Paper 1208)](https://www.bis.org/publ/work1208.htm)