Problem overview

We will replicate a study originally presented in Llinares-López et al., 2017.


Summary

  • Samples from the model organism Arabidopsis thaliana. Original source of data: Atwell et al., 2010 (genotype and phenotype data available online).

  • The binary phenotype avrRpm1 (hypersensitive response to the bacterial avirulence gene AvrRpm1) was analyzed, with a total of 84 samples (56 cases, 28 controls).

  • Total number of single-nucleotide polymorphisms (SNPs): 214,022 (all homozygous)

  • A categorical covariate was used to correct for population structure. It was created by k-means clustering on the three leading principal components of the empirical kinship matrix, with the value of k chosen to minimize genomic inflation.
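The covariate construction described above can be sketched on simulated data. This is a hedged illustration, not the study's actual pipeline: the toy genotype matrix, the kinship estimator, and the choice k = 4 are all assumptions made here for demonstration.

```r
# Sketch (simulated data): build a categorical population-structure covariate
# by k-means clustering on the top principal components of a kinship matrix.
set.seed(1)
n_samples <- 84; n_snps <- 500          # toy dimensions, not the real dataset
X_sim <- matrix(rbinom(n_samples * n_snps, 1, 0.3), nrow = n_samples)

Xc <- scale(X_sim, center = TRUE, scale = FALSE)
K  <- tcrossprod(Xc) / n_snps           # empirical kinship matrix (samples x samples)

pcs <- eigen(K, symmetric = TRUE)$vectors[, 1:3]  # three leading PCs
k <- 4                                  # in the study, k was tuned to minimize genomic inflation
covariate <- kmeans(pcs, centers = k, nstart = 20)$cluster
table(covariate)                        # cluster sizes; labels 1..k form the covariate
```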

Problem statement

Key question: Are there any intervals of consecutive SNPs that show an association with the phenotype?

A brute-force analysis would require the enumeration of all \(N(N+1)/2 \approx 2.3 \times 10^{10}\) intervals, with \(N = 214{,}022\).
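For \(N\) markers there are exactly \(N(N+1)/2\) intervals of consecutive SNPs. A quick base-R check with the dataset's 214,022 SNPs (no assumptions beyond numbers quoted in this document):

```r
# Total number of intervals of consecutive SNPs for N markers
N <- 214022
total_intervals <- N * (N + 1) / 2
total_intervals                       # about 2.3e10
# Fraction actually processed by the algorithm (2,969,724 intervals,
# as reported in the summary file): ~0.013% of the search space
2969724 / total_intervals * 100
```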


Source: Llinares-López et al., 2017. Figure 1.


Analysis

Import the CASMAP package.

library(CASMAP)

Set the paths to the input files (genotype, phenotype and covariate). Note: for simplicity, the data files are assumed to be located in the same directory as the scripts.

genotype_file  <- "X.dat"
phenotype_file <- "Y.dat"
covariate_file <- "C.dat"

Create an object to perform region-based GWAS (interval search).

obj_rbg <- CASMAP(mode="regionGWAS")

Set hyperparameters of the analysis:

  • alpha: Target Family-Wise Error Rate (FWER).
  • max_comb_size: Maximum number of markers per region. For example, if set to 5, then only intervals with up to 5 consecutive SNPs (inclusive) will be considered. To consider intervals of arbitrary length, use the value 0 (the default).

obj_rbg$setTargetFWER(alpha=0.05)
obj_rbg$setMaxCombinationSize(max_comb_size=0)

Print the contents of the object.

print(obj_rbg)
## CASMAP object with: 
##  * Mode = regionGWAS 
##  * Target FWER = 0.05 
##  * Maximum combination size = 0 
##  * No input files read

Read input files.

The following is not necessary for the A. thaliana data, since all genotypes are homozygous. In general, for input genotype files using an additive encoding:

  • 0 = homozygous major
  • 1 = heterozygous
  • 2 = homozygous minor

use the extra input argument ‘encoding’ to select between two encodings:

  • dominant (0 = homozygous major, 1 = heterozygous and homozygous minor)
  • recessive (0 = homozygous major and heterozygous, 1 = homozygous minor).
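These two encodings amount to the following recoding of an additive 0/1/2 matrix. The snippet below is a minimal base-R sketch with a made-up matrix G; the package applies the selected encoding internally, so this is for illustration only.

```r
# Toy additive genotype matrix: rows = samples, columns = SNPs
G <- matrix(c(0, 1, 2,
              2, 1, 0), nrow = 2, byrow = TRUE)

G_dominant  <- (G >= 1) * 1L   # heterozygous or homozygous minor -> 1
G_recessive <- (G == 2) * 1L   # only homozygous minor -> 1
G_dominant
G_recessive
```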

Note: The covariate file is optional.

obj_rbg$readFiles(genotype_file=genotype_file, phenotype_file=phenotype_file, covariate_file=covariate_file)

Print the state of the object again.

print(obj_rbg)
## CASMAP object with: 
##  * Mode = regionGWAS 
##  * Target FWER = 0.05 
##  * Maximum combination size = 0 
##  * Input files read 
##  * Covariate = TRUE

Run the significant pattern mining algorithm to retrieve the statistically associated genomic regions.

obj_rbg$execute()

In Example 1, we retrieved the summary information with getSummary(). We can also save it to a file. Write the high-level summary and the profiling information related to the execution of the algorithm:

obj_rbg$writeSummary("output/summary.txt")
obj_rbg$writeProfile("output/profiling.txt")

Write the raw list of (possibly redundant) significantly associated genomic regions.

obj_rbg$writeSignificantRegions("output/significant_regions_raw.txt")

Write the post-processed list of disjoint clusters of significantly associated genomic regions.

obj_rbg$writeSignificantClusterRepresentatives("output/significant_regions_clustered.txt")

Read the output files and determine their size.

results_raw <- read.table("output/significant_regions_raw.txt", sep="\t", skip=1, header=FALSE)
results_clust <- read.table("output/significant_regions_clustered.txt", sep="\t", skip=1, header=FALSE,
                            col.names=c("p.value", "score", "OR", "index.set", "num.regions",
                                        "index.start", "index.end"))

Number of statistically significant intervals, before and after clustering:

dim(results_raw)[1]
## [1] 25
dim(results_clust)[1]
## [1] 9

Show the top 10 most statistically significant intervals (since only 9 clusters were found, all 9 are shown).

sort_idx <- order(results_clust$p.value)
head(results_clust[sort_idx, c("p.value", "score", "index.start", "index.end")], n=10)
##       p.value   score index.start index.end
## 4 7.92800e-12 46.7835       84223     84225
## 5 2.99241e-10 39.6796       84228     84229
## 3 5.09903e-10 38.6390       84218     84221
## 9 1.42051e-09 36.6405       84404     84408
## 8 2.18699e-09 35.7995       84365     84368
## 6 6.69676e-09 33.6209       84233     84234
## 7 6.69676e-09 33.6209       84236     84237
## 2 1.10408e-08 32.6488       84212     84213
## 1 1.83411e-08 31.6627       84201     84210

Read the contents of the profiling file.

contents <- readLines("output/profiling.txt")
print(contents)
## [1] "CODE PROFILING"                                                  
## [2] "Total Execution time: 3.46021 (s)."                              
## [3] "\tInitialisation time: 0.007941 (s)."                            
## [4] "\tTime to compute corrected significance threshold: 1.38135 (s)."
## [5] "\tTime to find significant intervals: 2.07076 (s)."              
## [6] "\tPost-processing and cleanup time: 0.000151 (s)."               
## [7] "File I/O time: 0.092898 (s)."                                    
## [8] "Peak memory usage: 115360 (KB)"

Read the contents of the summary file.

contents <- readLines("output/summary.txt")
print(contents)
##  [1] "DATASET CHARACTERISTICS:"                                            
##  [2] "\tN = 84, n = 28, L = 214022"                                        
##  [3] "RESULTS:"                                                            
##  [4] "Number of intervals processed: 2969724 (0.0129667% of total)."       
##  [5] "Maximum interval length to be processed: unlimited"                  
##  [6] "Associated testability threshold: 2.398833e-08"                      
##  [7] "Number of testable intervals: 1979036"                               
##  [8] "Corrected significance threshold at level 5.000000e-02: 2.526483e-08"
##  [9] "Number of significantly associated intervals found: 25"              
## [10] "Maximum testable interval length: 156"


Conclusions

Correction of population structure

Without correction, the algorithm reports an inflation factor \(\lambda = 1.53\); with correction, the inflation is reduced to \(\lambda = 1.13\). See the original publication for other datasets with a more pronounced reduction, e.g. a COPD study where the original \(\lambda = 16.7\) was reduced to \(\lambda = 1.05\).


Source: Llinares-López et al., 2017. Supplementary Figure S35.


Pruning the search space

The algorithm does not need to explore all possible intervals: if an interval is deemed untestable, then every super-interval containing it is also untestable, so the entire branch of the search space can be pruned.
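To see why this pruning is sound, consider Tarone's minimal attainable p-value for Fisher's exact test. The sketch below is a toy illustration of the underlying principle, not the package's internal code; the formula for \(\psi(x)\) assumes supports \(x \ge n\), where the margins \(N = 84\), \(n = 28\) are taken from the dataset. On that range \(\psi\) increases monotonically with the support \(x\), and under the OR-encoding of intervals the support can only grow as an interval is extended.

```r
# Toy illustration (not CASMAP internals): minimal attainable Fisher p-value
# psi(x) for a pattern of support x, with N = 84 samples and n = 28 in the
# smaller phenotype class.
N <- 84; n <- 28
psi <- function(x) choose(N - n, x - n) / choose(N, x)  # valid for x >= n

x <- n:N
all(diff(psi(x)) > 0)   # psi is monotone increasing on this range
```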

Source: Llinares-López et al., 2017. Supplementary Figure S2.


Post-processing of results

A clustering algorithm merges overlapping significant intervals to avoid reporting redundant hits.
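The effect of this post-processing can be sketched with a toy merge of overlapping intervals. This is a simplified stand-in with made-up numbers, not the package's actual clustering procedure:

```r
# Toy intervals (start, end, p-value) -- made-up numbers
ints <- data.frame(start = c(1, 3, 10), end = c(5, 7, 12),
                   p = c(1e-8, 1e-12, 1e-9))

ints <- ints[order(ints$start), ]
# Start a new cluster whenever an interval begins after every previous end
cluster <- cumsum(c(TRUE, ints$start[-1] > cummax(ints$end)[-nrow(ints)]))
# Keep the most significant interval of each cluster as its representative
reps <- do.call(rbind, lapply(split(ints, cluster),
                              function(d) d[which.min(d$p), ]))
reps
```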

Source: Llinares-López et al., 2017. Supplementary Figure S33.