Problem overview

We will replicate a study originally presented in:


Summary

  • Samples from the Arabidopsis thaliana model organism. Original source of data: Atwell et al., 2010. Link to genotye and phenotype data.

  • The binary phenotype avrB was analyzed, with a total of 87 samples (55 cases, 32 controls).

  • Total number of single-nucleotide polymorphisms (SNPs): 214,032 (all homozygous). For this exercise, we subsampled the SNPs in chromosome 1 to 650 SNPs

  • The categorical covariate was used to correct for population structure. Created with k-means clustering on the three principal components of the empirical kinship matrix. The value of k was optimized to minimize genomic inflation.

Problem statement

In Example 2 we focused on genomic intervals. Here analyze interactions between variants.

Key question: Are there specific combinations of SNPs whose interaction is associated to the phenotype?

Brute force analysis will require the enumeration of \(2^M\) sets of motifs, with \(M = 214,032\) (here \(M = 650\), see above)


Source: Llinares-López et al., 2018 (in review). Figure 1.


Analysis

Import the CASMAP package

library(CASMAP)

Set the paths to the input files (genotype, phenotype and covariate).

genotype_file  <- "X.dat"
phenotype_file <- "Y.dat"
covariate_file <- "C.dat"

Create an object to perform region-based GWAS (interval search).

obj_hoi <- CASMAP(mode="higherOrderEpistasis")

Set hyperparameters of the analysis:

  • alpha: Target Family-Wise Error Rate (FWER).
  • max_comb_size: Maximum number of interacting variants. For example, if set to 4, then only sets with with up to 4 SNPs (inclusive) will be considered. To consider sets of arbitrary length, use value 0 (default).
obj_hoi$setTargetFWER(alpha=0.05)
obj_hoi$setMaxCombinationSize(max_comb_size=0)

Read input files.

The A. thaliana genotypes we analyze here are homozygous (original data are binary). See Example 2 for details on how to encode the variants if this is not the case.

obj_hoi$readFiles(genotype_file=genotype_file, phenotype_file=phenotype_file, covariate_file=covariate_file)

Run the algorithm. Retrieve statistically associated interactions between genomic variants. This execution will take a bit longer than the other two examples.

obj_hoi$execute()

The analysis is finalized. Get the summary results.

summary_results <- obj_hoi$getSummary()
print(summary_results)
## $n.iset.processed
## [1] 25697705
## 
## $n.iset.closed.processed
## [1] 1688572
## 
## $n.iset.testable
## [1] 312045
## 
## $testability.threshold
## [1] 1.44544e-07
## 
## $target.fwer
## [1] 0.05
## 
## $corrected.significance.threshold
## [1] 1.602333e-07

Get the statistically significant sets of SNPs

sig_sets <- obj_hoi$getSignificantInteractions()
str(sig_sets)
## 'data.frame':    1 obs. of  4 variables:
##  $ itemsets  :List of 1
##   ..$ : num  356 537 358 627
##  $ score     : num 31.2
##  $ odds_ratio: num Inf
##  $ pvalue    : num 2.3e-08

Display the hits

head(sig_sets[, c("pvalue", "itemsets")])
##         pvalue           itemsets
## 0 2.299653e-08 356, 537, 358, 627

Save all results to output files.

obj_hoi$writeSummary("output/summary.txt")
obj_hoi$writeProfile("output/profiling.txt")
obj_hoi$writeSignificantInteractions("output/significant_interactions.txt")