by jonwiseman 680 days ago
So, yeah, but here's an interesting chunk of the results: is the inter-rater agreement ACTUALLY at chance level?

(Not sure if this is formatted correctly, feel free to comment if you'd do these stats differently)

    from scipy.stats import binomtest
    # stats: {experience_level: (pairwise agreement rate in %, num_in_group)}
    stats = {
        "lvl1": (20.5, 156),
        "lvl2": (22.2, 66),
        "lvl3": (23.3, 50),
        "lvl4": (21.2, 39),
        "lvl5": (20.8, 12),
        "lvl6": (28.3, 5),
    }
    
    for lvl, (rate, num) in stats.items():
        # number of ordered rater pairs within the group
        num_tests = num * (num - 1)
    
        # one-sided test: is the agreement count above the 0.2 chance rate?
        res = binomtest(int(rate * num_tests * 0.01), num_tests, 0.2, alternative="greater")
    
        # Bonferroni correction: divide alpha by the number of groups tested
        print(f"{lvl}: p-value = {res.pvalue:.4f} {'*' if res.pvalue < 0.05 / len(stats) else ''}")
    
    print('* indicates p-value is < 0.05 after bonferroni correction')

    """
    lvl1: p-value = 0.0276 
    lvl2: p-value = 0.0002 *
    lvl3: p-value = 0.0000 *
    lvl4: p-value = 0.1337 
    lvl5: p-value = 0.4826 
    lvl6: p-value = 0.3704 
    * indicates p-value is < 0.05 after bonferroni correction
    """
So maybe there's some internal consistency in how these people are trained, and maybe it isn't completely dependent on skill level. This assumes I read the relevant part of https://www.clearerthinking.org/post/can-astrologers-use-ast... correctly.
1 comment

Errata: this is probably not right. The pairwise trials aren't independent (each rater shows up in many pairs), so the binomial test's independence assumption doesn't hold.
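
One way around the independence problem might be to simulate the null directly instead of leaning on the binomial: have every rater guess uniformly at random, compute the mean pairwise agreement for the whole group, and see how often that beats the observed rate. A rough sketch, not a definitive analysis. The function names are made up for illustration, and `num_questions=12` and `num_options=5` are placeholder assumptions (5 options matches the 0.2 chance rate used above; swap in the study's real values).

```python
import random
from collections import Counter


def mean_pairwise_agreement(answers):
    """answers[r][q] is rater r's choice on question q.

    Returns the share of (unordered rater pair, question) combinations
    in which both raters picked the same option.
    """
    num_raters = len(answers)
    num_questions = len(answers[0])
    agreeing = 0
    for q in range(num_questions):
        counts = Counter(row[q] for row in answers)
        # c raters on the same option form c * (c - 1) / 2 agreeing pairs
        agreeing += sum(c * (c - 1) // 2 for c in counts.values())
    total = num_questions * num_raters * (num_raters - 1) // 2
    return agreeing / total


def simulated_p_value(observed_rate, num_raters, num_questions=12,
                      num_options=5, n_sims=2000, seed=0):
    """Monte Carlo p-value under the null that every rater guesses uniformly.

    num_questions and num_options are assumed defaults, not study values.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        answers = [[rng.randrange(num_options) for _ in range(num_questions)]
                   for _ in range(num_raters)]
        if mean_pairwise_agreement(answers) >= observed_rate:
            hits += 1
    # add-one correction so the estimate is never exactly zero
    return (hits + 1) / (n_sims + 1)
```

For example, `simulated_p_value(0.233, 50)` would be the lvl3 check under these assumptions. Note that the rate is passed as a fraction here, not a percentage.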