Technical Report TR680:
Rui Wang - Indiana University Bloomington; Yong Li - Indiana University Bloomington; XiaoFeng Wang - Indiana University Bloomington; Haixu Tang - Indiana University Bloomington; Xiaoyong Zhou - Indiana University Bloomington
Learning Your Identity and Disease from Research Papers: Information Leaks in Genome Wide Association Study
(Aug 2009), 12 Pages
Genome-wide association studies (GWAS) aim to discover associations between genetic variations, particularly single-nucleotide polymorphisms (SNPs), and common diseases, and are widely recognized as one of the most important and active areas of biomedical research. Equally well known are the privacy implications of such studies, which were brought into the limelight by the recent attack proposed by Homer et al. Homer's attack demonstrates that a participant in a GWAS can be identified by analyzing the allele frequencies of a large number of SNPs. Unfortunately, our research found this threat to be significantly understated. In this paper, we demonstrate that individuals can be identified from even a relatively small set of statistics, such as those routinely published in GWAS papers. We present two attacks. The first extends Homer's attack with a much more powerful test statistic, based on the correlations among different SNPs as described by the coefficient of determination ($r^2$). This attack can determine the presence of an individual in a GWAS from the statistics related to a couple of hundred SNPs. The second attack can lead to complete disclosure of hundreds of the participants' SNPs by analyzing information derived from the published statistics. We also found that these attacks can succeed even when the precision of the statistics is low and part of the data is missing, which limits the effectiveness of such simple defenses. We evaluated our attacks on real human genomes from the International HapMap project and concluded that these threats are completely realistic.
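For readers unfamiliar with the $r^2$ statistic mentioned in the abstract, the following is a minimal sketch of how the pairwise correlation between two biallelic SNPs is commonly computed from phased haplotypes. It uses the standard population-genetics definition, $r^2 = D^2 / (p_A(1-p_A)\,p_B(1-p_B))$ with $D = p_{AB} - p_A p_B$. It is not the report's test statistic, which the abstract does not specify. The function name `ld_r_squared` and the 0/1 allele encoding are assumptions made for illustration.

```python
from typing import Sequence


def ld_r_squared(hap_a: Sequence[int], hap_b: Sequence[int]) -> float:
    """Return r^2 between two biallelic SNPs.

    Each input lists the allele (0 or 1) carried at one SNP by the same
    set of phased haplotypes, in the same order. The 0/1 encoding is an
    assumption of this sketch.
    """
    if len(hap_a) != len(hap_b) or len(hap_a) == 0:
        raise ValueError("haplotype lists must be non-empty and equally long")

    n = len(hap_a)
    p_a = sum(hap_a) / n                                   # frequency of allele 1 at SNP A
    p_b = sum(hap_b) / n                                   # frequency of allele 1 at SNP B
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n    # frequency of the 1-1 haplotype

    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when either SNP is monomorphic")

    d = p_ab - p_a * p_b                                   # linkage disequilibrium coefficient D
    return d * d / denom


# Two SNPs that always co-occur versus two that vary independently.
perfectly_linked = ld_r_squared([0, 0, 1, 1], [0, 0, 1, 1])
independent = ld_r_squared([0, 0, 1, 1], [0, 1, 0, 1])
```

In this toy data the first pair of SNPs is perfectly correlated and the second pair is uncorrelated. Published pairwise $r^2$ values of this kind are the sort of statistic the first attack is described as exploiting.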