Cohen's kappa


Cohen's kappa coefficient (κ) is a statistic that measures inter-rater reliability for qualitative (categorical) items. It is generally considered a more robust measure than a simple percent-agreement calculation, because κ takes into account the possibility of agreement occurring by chance. Cohen's kappa measures the agreement between two raters who each classify \(N\) items into \(C\) mutually exclusive categories. The formula for κ is:

\[ \kappa = \frac{p_o - p_e}{1 - p_e} \]

where \(p_o\) is the relative observed agreement among raters (the proportion of items on which both raters agree), and \(p_e\) is the hypothetical probability of chance agreement, calculated from the observed data by estimating the probability of each rater assigning each category at random. If the raters are in complete agreement, then κ = 1. If there is no agreement among the raters other than what would be expected by chance (as given by \(p_e\)), then κ = 0.
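The definition translates directly into code. The following Python sketch computes κ from the formula above; the function name cohens_kappa and the example label lists are illustrative, not taken from any particular library.

    from collections import Counter

    def cohens_kappa(rater_a, rater_b):
        """Cohen's kappa for two equal-length lists of category labels."""
        n = len(rater_a)

        # Observed agreement: proportion of items labelled identically by both raters.
        p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

        # Chance agreement: for each category, multiply the two raters' marginal
        # proportions, then sum over all categories.
        counts_a, counts_b = Counter(rater_a), Counter(rater_b)
        p_e = sum((counts_a[c] / n) * (counts_b[c] / n)
                  for c in set(rater_a) | set(rater_b))

        return (p_o - p_e) / (1 - p_e)

    # Two raters classifying ten items as "yes" or "no".
    a = ["yes", "yes", "no", "yes", "no", "no", "yes", "no", "yes", "yes"]
    b = ["yes", "no",  "no", "yes", "no", "yes", "yes", "no", "yes", "yes"]
    print(round(cohens_kappa(a, b), 3))  # 0.583

For these ten example items the raters agree on 8 (so \(p_o = 0.80\)), the chance agreement from the marginals is \(p_e = 0.52\), and the resulting value is κ ≈ 0.583.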

Background

The kappa statistic was introduced by Jacob Cohen in 1960 as a measure of agreement for nominal scales. It is used in various fields such as healthcare, where it helps to assess the reliability of diagnostic tests, and in machine learning, where it serves as a chance-corrected measure of classification performance.

Calculation

To calculate Cohen's kappa, the number of categories into which assignments can be made must be fixed, and the assignments of each item into these categories by the two raters must be known. The formula involves the calculation of several probabilities:

  • \(p_o\), the observed agreement, is the proportion of items to which both raters assign the same category.
  • \(p_e\), the expected agreement by chance, is calculated from the marginal totals: for each category, the proportion of items each rater assigned to it is multiplied together, and these products are summed over all categories (see the worked example below).
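As a hypothetical worked example, suppose two raters classify 50 items as positive or negative: both say positive on 20 items, both say negative on 15, rater A alone says positive on 5, and rater B alone says positive on 10. The marginal totals are then 25 positive and 25 negative for rater A, and 30 positive and 20 negative for rater B, which gives

\[ p_o = \frac{20 + 15}{50} = 0.70, \qquad p_e = \frac{25}{50}\cdot\frac{30}{50} + \frac{25}{50}\cdot\frac{20}{50} = 0.30 + 0.20 = 0.50 \]

\[ \kappa = \frac{0.70 - 0.50}{1 - 0.50} = 0.40 \]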

Interpretation

The value of κ can be interpreted as follows:

  • A κ of 1 indicates perfect agreement.
  • A κ less than 1 but greater than 0 indicates partial agreement.
  • A κ of 0 indicates no agreement better than chance.
  • A κ less than 0 indicates less agreement than expected by chance.

Landis and Koch (1977) provided a commonly used interpretation of the kappa statistic, suggesting that values ≤ 0 indicate no agreement, 0.01–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, and 0.81–1.00 almost perfect agreement.
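These bands are guidelines rather than strict rules, but they are straightforward to encode. The helper below is a hypothetical convenience function (not a standard API) that labels a kappa value with the corresponding Landis and Koch band:

    def landis_koch_label(kappa):
        """Map a kappa value to the Landis and Koch (1977) descriptive band."""
        if kappa <= 0:
            return "no agreement"
        if kappa <= 0.20:
            return "slight"
        if kappa <= 0.40:
            return "fair"
        if kappa <= 0.60:
            return "moderate"
        if kappa <= 0.80:
            return "substantial"
        return "almost perfect"

    print(landis_koch_label(0.583))  # "moderate"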

Limitations

While Cohen's kappa is a widely used and informative statistic for measuring inter-rater reliability, it has limitations. It assumes that the two raters have equal status and that the categories are mutually exclusive. Furthermore, kappa may be affected by several factors, including the number of categories, the distribution of observations across those categories, and the prevalence of the condition being rated.
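The prevalence effect in particular can be illustrated with a short sketch (the helper kappa_from_counts is hypothetical): two rater pairs with the same 90% observed agreement obtain very different kappa values once one category dominates the marginals.

    def kappa_from_counts(both_pos, a_pos_b_neg, a_neg_b_pos, both_neg):
        """Cohen's kappa from the four cells of a 2x2 agreement table."""
        n = both_pos + a_pos_b_neg + a_neg_b_pos + both_neg
        p_o = (both_pos + both_neg) / n
        a_pos = (both_pos + a_pos_b_neg) / n   # rater A's "positive" marginal
        b_pos = (both_pos + a_neg_b_pos) / n   # rater B's "positive" marginal
        p_e = a_pos * b_pos + (1 - a_pos) * (1 - b_pos)
        return (p_o - p_e) / (1 - p_e)

    # Balanced prevalence: 90% observed agreement gives kappa = 0.80
    print(round(kappa_from_counts(45, 5, 5, 45), 2))
    # Skewed prevalence: the same 90% observed agreement gives kappa of about 0.44
    print(round(kappa_from_counts(85, 5, 5, 5), 2))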

Applications

Cohen's kappa is used in a variety of settings to assess the reliability of categorical assignments. In healthcare, it is used to evaluate the consistency of diagnostic tests between different raters. In psychology, it helps in assessing the reliability of categorical diagnoses. In content analysis and machine learning, kappa provides a measure of the agreement between human annotators or between an algorithm and a human annotator.
