A statistical methodology for analyzing co-occurrence data from a large sample



Determining important associations among items in a large database is challenging due to multiple simultaneous hypotheses and the ability to select weak associations that are statistically but not clinically significant. The simple application of the χ² test among all possible pairs of items results in mostly inappropriate associations surpassing the traditional (α = .05, χ² = 3.84) threshold. One can choose a stricter threshold to find stronger associations, but the choice may be arbitrary. We combined the volume test of Diaconis and Efron with a p-value plot to select a more rigorous and less arbitrary threshold. The volume test adjusts the p-value of the χ² statistic. A plot of adjusted p-values (1 − p versus Np), where Np is the number of test statistics with a p-value greater than p, should be linear if there are no true associations. The point where the plot deviates from a line can be used as a threshold. We used linear regression to select the threshold in a reproducible fashion. In one experiment, we found that the method selected a threshold similar to that previously obtained by manually reviewing associations. © 2006 Elsevier Inc. All rights reserved.
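
The p-value plot and the regression step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The volume-test adjustment of Diaconis and Efron is not reproduced here, so the p-values are taken as given. The helper names (`chi2_p_value_df1`, `p_value_plot`, `fit_line`, `select_threshold`), the choice to fit the line on the region where 1 − p ≤ 0.5, the `min_excess` tolerance, and the synthetic data are all assumptions made for illustration.

```python
import math


def chi2_p_value_df1(stat):
    """Unadjusted upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def p_value_plot(p_values):
    """Return plot points (1 - p, Np), where Np counts the p-values greater than p."""
    ordered = sorted(p_values, reverse=True)  # largest p first
    # With no ties, the i-th largest p-value has exactly i p-values above it.
    return [(1.0 - p, i) for i, p in enumerate(ordered)]


def fit_line(points):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def select_threshold(p_values, fit_max_x=0.5, min_excess=5.0):
    """Largest p-value in the tail where Np stays above the fitted line.

    The line is fitted where 1 - p <= fit_max_x (assumed to hold mostly null
    tests). Both fit_max_x and min_excess are illustrative assumptions, not
    values taken from the paper. Returns None if the plot never deviates.
    """
    ordered = sorted(p_values, reverse=True)
    points = p_value_plot(p_values)
    slope, intercept = fit_line([pt for pt in points if pt[0] <= fit_max_x])
    threshold = None
    # Walk from the smallest p-value upward while the count exceeds the line.
    for p, (x, y) in zip(reversed(ordered), reversed(points)):
        if y - (slope * x + intercept) <= min_excess:
            break
        threshold = p
    return threshold


# Synthetic, idealized data (an assumption): evenly spaced "null" p-values
# stand in for a uniform distribution, plus 50 very small "signal" p-values.
n_null = 1000
null_p = [(i + 0.5) / n_null for i in range(n_null)]
signal_p = [(j + 1) * 1e-6 for j in range(50)]
threshold = select_threshold(null_p + signal_p)
selected = [p for p in null_p + signal_p if p <= threshold]
```

On this synthetic input the null p-values fall on a straight line and only the small p-values lift the plot above it, so the selected threshold lies below every null p-value. The `min_excess` tolerance makes the rule conservative: the few signal p-values closest to the line are not selected.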




Cao, H., Hripcsak, G., & Markatou, M. (2007). A statistical methodology for analyzing co-occurrence data from a large sample. Journal of Biomedical Informatics, 40(3), 343–352. https://doi.org/10.1016/j.jbi.2006.11.003
