by gengstrand
641 days ago
This piece reminds me of an experiment I ran earlier this year (https://www.infoq.com/articles/llm-productivity-experiment/) across several LLMs, although mine used a one-shot prompt for generating unit tests. Though there were significant differences in the results, the conclusions seem similar to me. When an LLM is prompted, it generates a response by predicting the most probable continuation or completion of the input. It considers the context provided by the input and generates a response that is coherent, relevant, and contextually appropriate but not necessarily correct. I like the crowdsourcing metaphor. Back when crowdsourcing was the next big thing in application development, there was always a curatorial process that filtered out low-quality content and then distilled the "wisdom of the crowds" into more actionable results. For AI, that would be called supervised learning, which definitely increases the costs. I think that unbiased and authentic experimentation and measurement of hallucinations in generative AI is important, and I hope that this effort continues. I encourage the folks here to participate in order to monitor the real value that LLMs provide, and also as an ongoing reminder that human review and supervision will always be a necessity.