by a_wild_dandan
848 days ago
Ah, apologies for the misunderstanding. What tests would you suggest to evaluate "muddiness"? What comes to mind: run the usual gamut of tests, but with the excess context window saturated with irrelevant(?) data. Measure test answer accuracy/verbosity as a function of context saturation percentage. If there's little correlation between the two (e.g. 9% saturation is just as accurate/succinct as 99% saturation), then "muddiness" isn't an issue.
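A rough sketch of what that test might look like, under some loud assumptions: `ask_model` is a hypothetical callable standing in for whatever model API you use, whitespace-separated words are used as a crude proxy for tokens, and only accuracy (not verbosity) is scored. None of these names come from a real library.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical filler vocabulary; any irrelevant text would do.
FILLER_WORDS = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur"]


def pad_prompt(question: str, window_tokens: int, saturation: float,
               rng: random.Random) -> str:
    """Prefix the question with filler sized to `saturation` of the window.

    Assumption: one whitespace-separated word ~ one token.
    """
    n_filler = int(window_tokens * saturation)
    filler = " ".join(rng.choice(FILLER_WORDS) for _ in range(n_filler))
    return f"{filler}\n\n{question}" if filler else question


def accuracy_by_saturation(ask_model: Callable[[str], str],
                           qa_pairs: List[Tuple[str, str]],
                           saturations: List[float],
                           window_tokens: int = 1000,
                           seed: int = 0) -> List[Tuple[float, float]]:
    """Return (saturation, exact-match accuracy) for each saturation level."""
    rng = random.Random(seed)
    results = []
    for s in saturations:
        correct = sum(
            ask_model(pad_prompt(q, window_tokens, s, rng)).strip() == a
            for q, a in qa_pairs
        )
        results.append((s, correct / len(qa_pairs)))
    return results


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5
```

A correlation near zero across saturation levels would suggest "muddiness" isn't an issue for that test set; a strongly negative one would suggest it is. Exact-match scoring is the simplest choice here and would need replacing for free-form answers.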
A handful of examples show whether it can do it. For example, GPT-4 turbo is downright awful at something like that.