|
An interesting chunk of the results, though: is the inter-rater agreement ACTUALLY random? (Not sure if this is formatted correctly; feel free to comment if you'd do these stats differently.)

from scipy.stats import binomtest
# stats: {experience_level: (agreement rate in %, number of raters in group)}
stats = {
    "lvl1": (20.5, 156),
    "lvl2": (22.2, 66),
    "lvl3": (23.3, 50),
    "lvl4": (21.2, 39),
    "lvl5": (20.8, 12),
    "lvl6": (28.3, 5),
}
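for lvl, (rate, num) in stats.items():
    # ordered pairs of raters within the group (each unordered pair is counted twice)
    num_tests = num * (num - 1)
    # rate is a percentage, hence the 0.01; int() truncates the implied count
    # null hypothesis: agreement rate of 0.2, one-sided test for agreement above it
    res = binomtest(int(rate * num_tests * 0.01), num_tests, 0.2, alternative="greater")
    # Bonferroni correction for the 6 groups tested
    print(f"{lvl}: p-value = {res.pvalue:.4f} {'*' if res.pvalue < 0.05 / 6 else ''}")
print('* indicates p-value is < 0.05 after bonferroni correction')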
"""
lvl1: p-value = 0.0276
lvl2: p-value = 0.0002 *
lvl3: p-value = 0.0000 *
lvl4: p-value = 0.1337
lvl5: p-value = 0.4826
lvl6: p-value = 0.3704
* indicates p-value is < 0.05 after bonferroni correction
"""
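One way to look at the same numbers differently is to report a confidence interval for each group's agreement rate instead of only a p-value. This is a minimal sketch, not a definitive analysis: it reuses the figures and the ordered-pair counting from the snippet above, and the helper name `agreement_ci` is made up for illustration. It also inherits the assumption that every pair of raters is an independent trial, which probably makes the intervals narrower than they should be, since pairs share raters.

```python
from scipy.stats import binomtest

# Same figures as above: {experience_level: (agreement rate in %, raters in group)}
stats = {
    "lvl1": (20.5, 156),
    "lvl2": (22.2, 66),
    "lvl3": (23.3, 50),
    "lvl4": (21.2, 39),
    "lvl5": (20.8, 12),
    "lvl6": (28.3, 5),
}


def agreement_ci(rate, num, confidence_level=0.95):
    """Return (successes, trials, low, high) for one group.

    Hypothetical helper. Assumption carried over from the snippet above:
    each ordered pair of raters counts as one independent trial.
    """
    num_tests = num * (num - 1)
    successes = int(rate * num_tests * 0.01)  # same truncation as above
    res = binomtest(successes, num_tests, 0.2)  # two-sided by default
    ci = res.proportion_ci(confidence_level=confidence_level, method="exact")
    return successes, num_tests, ci.low, ci.high


intervals = {lvl: agreement_ci(rate, num) for lvl, (rate, num) in stats.items()}

for lvl, (k, n, low, high) in intervals.items():
    print(f"{lvl}: {k}/{n} = {k / n:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

An interval that sits entirely above 0.2 tells roughly the same story as a small p-value, but it also shows how wide the uncertainty is for the small groups, where five raters give only 20 ordered pairs.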
So maybe there's some internal consistency in how these people are trained, and maybe it's not completely dependent on skill level: only lvl2 and lvl3 come out significantly above the 20% rate after correction. This assumes I read the relevant part of https://www.clearerthinking.org/post/can-astrologers-use-ast... correctly. |