by kostaj
27 days ago
Indeed. I prompted each model once, plus one retry on errors. Measuring inter-model disagreement is a very good point! I will add it in the next version. Section "4.2 Agreement w/ peer majority" shows how often each model agrees with the majority. Yes, I am planning to have the same corpus of 1,000 claims labelled by humans and to publish a second study measuring the models' performance against the human labels, on a corpus that the models have not seen during training.
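For anyone who wants to try both numbers on their own runs, here is a minimal sketch, not the study's actual code. It assumes "peer majority" means the majority vote of the other models (the model being scored is excluded) and that claims where the peers tie are skipped. The model names, the labels, and both function names are made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical verdicts: one list per model, aligned by claim index.
labels = {
    "model_a": ["true", "false", "true", "true"],
    "model_b": ["true", "false", "false", "true"],
    "model_c": ["true", "true", "false", "true"],
    "model_d": ["false", "false", "false", "true"],
}


def peer_majority_agreement(labels):
    """Share of claims where each model matches the majority of the OTHER models.

    Assumption: claims on which the peers tie are skipped for that model.
    """
    n_claims = len(next(iter(labels.values())))
    result = {}
    for model, own in labels.items():
        peers = [votes for name, votes in labels.items() if name != model]
        hits = 0
        counted = 0
        for i in range(n_claims):
            ranked = Counter(p[i] for p in peers).most_common()
            # Skip the claim if the top two peer verdicts are tied.
            if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
                continue
            counted += 1
            hits += own[i] == ranked[0][0]
        result[model] = hits / counted if counted else float("nan")
    return result


def pairwise_disagreement(labels):
    """Fraction of claims on which each pair of models gives different verdicts."""
    rates = {}
    for a, b in combinations(labels, 2):
        differing = sum(x != y for x, y in zip(labels[a], labels[b]))
        rates[(a, b)] = differing / len(labels[a])
    return rates


if __name__ == "__main__":
    print(peer_majority_agreement(labels))
    print(pairwise_disagreement(labels))
```

The pairwise table is what would show inter-model disagreement directly; the peer-majority score only tells you how far each model sits from the consensus.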