Proper conditional analysis in the presence of missing data: Application to large scale meta-analysis of tobacco use phenotypes

Abstract

Meta-analysis of genetic association studies increases sample size and the power for mapping complex traits. Existing methods are mostly developed for datasets without missing values, i.e. the summary association statistics are measured for all variants in contributing studies. In practice, genotype imputation is not always effective. This may be the case when targeted genotyping/sequencing assays are used or when the un-typed genetic variant is rare. Therefore, contributed summary statistics often contain missing values. Existing methods for imputing missing summary association statistics and using imputed values in meta-analysis, approximate conditional analysis, or simple strategies such as complete case analysis all have theoretical limitations. Applying these approaches can bias genetic effect estimates and lead to seriously inflated type-I or type-II errors in conditional analysis, which is a critical tool for identifying independently associated variants. To address this challenge and complement imputation methods, we developed a method to combine summary statistics across participating studies and consistently estimate joint effects, even when the contributed summary statistics contain large amounts of missing values. Based on this estimator, we proposed a score statistic called PCBS (partial correlation based score statistic) for conditional analysis of single-variant and gene-level associations. Through extensive analysis of simulated and real data, we showed that the new method produces well-calibrated type-I errors and is substantially more powerful than existing approaches. We applied the proposed approach to one of the largest meta-analyses to date for the cigarettes-per-day phenotype. Using the new method, we identified multiple novel independently associated variants at known loci for tobacco use, which were otherwise missed by alternative methods. 
Together, the phenotypic variance explained by these variants was 1.1%, improving that of previously reported associations by 71%. These findings illustrate the extent of locus allelic heterogeneity and can help pinpoint causal variants.

Figures

  • Table 1. Power and type I errors of meta-analysis of single variant tests in the presence of missing data for continuous outcomes. Datasets were simulated according to the genetic and phenotype model described in METHODS. Meta-analysis was performed to combine 20 cohorts with 1500 individuals each. For each replicate, summary association statistics were generated, and a certain fraction of the generated summary statistics were masked as missing. Scenarios with different combinations of known variant effects, candidate variant effects and fractions of missingness were considered. Six analysis strategies were considered: 1) PCBS; 2) SYN+; 3) ImpG+meta; 4) COJO; 5) DISCARD and 6) REPLACE0. Type I error and power were evaluated using 10^5 replicates under the significance threshold of α = 0.005.
  • Table 2. Power and type I errors of meta-analysis of gene-level tests in the presence of missing data. Datasets were simulated according to the genetic and phenotype model described in METHODS. Within the gene region, 20% of the variant sites are deemed causal. Meta-analysis was performed to combine 10 cohorts with 2000 individuals each. For each replicate, summary association statistics were generated, and a certain fraction (10%, 30% or 50%) of the generated summary statistics were masked as missing. Scenarios with different combinations of known variant effect, candidate variant effects and fractions of missingness were considered. To evaluate the power loss due to missing data, we also analyzed the full dataset as a gold standard. Type I errors and power were evaluated for three rare variant tests (simple burden, SKAT and VT) using 1 million replicates under the significance threshold of α = 0.005.
  • Table 3. Independently associated variants identified using sequential forward selection with the PCBS method. Sequential conditional analyses for the 9 loci were conducted, where we iteratively performed conditional analysis, conditioning on the top variants from earlier rounds. Top association signals at each iteration are shown. The sequential conditional analysis stops when the top association signal is no longer significant under the genome-wide significance threshold α = 5 × 10^−8.
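
The sequential forward selection described for Table 3 can be illustrated with a minimal sketch. It is not the PCBS statistic: it assumes complete data (no missing summary statistics) and uses the standard conditional score test, where U is a vector of meta-analysis score statistics and V is their covariance matrix. The function names `conditional_pvalues` and `forward_selection` are hypothetical and chosen only for this illustration.

```python
import math
import numpy as np


def conditional_pvalues(U, V, known):
    """P-values for each variant, conditioning on the variants in `known`.

    Assumption: complete-data conditional score test (not PCBS).
    U: score statistics, V: their covariance, known: list of indices.
    """
    U = np.asarray(U, dtype=float)
    V = np.asarray(V, dtype=float)
    pvals = np.ones(len(U))
    for j in range(len(U)):
        if j in known:
            continue  # already selected; keep p = 1 so it is never re-picked
        if known:
            V_kk = V[np.ix_(known, known)]
            V_jk = V[j, known]
            w = np.linalg.solve(V_kk, V_jk)  # V_kk^{-1} V_kj
            u = U[j] - w @ U[known]          # conditional score
            v = V[j, j] - w @ V_jk           # conditional variance
        else:
            u, v = U[j], V[j, j]
        if v <= 1e-12:
            continue  # variant is collinear with the conditioning set
        # Upper tail of a 1-df chi-square: P(X > x) = erfc(sqrt(x / 2))
        pvals[j] = math.erfc(math.sqrt(u * u / v / 2.0))
    return pvals


def forward_selection(U, V, alpha=5e-8):
    """Add the top conditional signal each round until none passes alpha."""
    selected = []
    while len(selected) < len(U):
        p = conditional_pvalues(U, V, selected)
        best = int(np.argmin(p))
        if p[best] >= alpha:
            break
        selected.append(best)
    return selected


# Two variants in strong LD: the second signal vanishes after conditioning.
ld_hits = forward_selection([8.0, 7.2], [[1.0, 0.9], [0.9, 1.0]])

# Three independent variants: two pass the genome-wide threshold.
indep_hits = forward_selection([8.0, 0.5, 7.0], np.eye(3))
```

In this sketch the stopping rule mirrors the caption: selection ends as soon as the smallest conditional p-value is no longer below α. Handling missing summary statistics, which is the point of PCBS, would require replacing `conditional_pvalues` with the estimator described in the article.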

Cite

Jiang, Y., Chen, S., McGuire, D., Chen, F., Liu, M., Iacono, W. G., … Liu, D. J. (2018). Proper conditional analysis in the presence of missing data: Application to large scale meta-analysis of tobacco use phenotypes. PLoS Genetics, 14(7). https://doi.org/10.1371/journal.pgen.1007452
