by operator-name
768 days ago
If I've understood this correctly, the test measures how well the safety finetune holds up. These commercial models have been finetuned to be "safe", and a safe model should not blindly quote what it is told. With shorter context windows this works as intended, but with longer context windows the "safety" brought about by the finetune no longer applies.