We will replicate a study originally presented in Atwell et al., 2010.
Samples from the model organism Arabidopsis thaliana. Link to genotype and phenotype data.
The binary phenotype avrRpm1 (hypersensitive response to the bacterial avirulence gene AvrRpm1) was analyzed, with a total of 84 samples (56 cases, 28 controls).
Total number of single-nucleotide polymorphisms (SNPs): 214,022 (all homozygous)
A categorical covariate was used to correct for population structure. It was created by applying k-means clustering to the top three principal components of the empirical kinship matrix, with the number of clusters k chosen to minimize genomic inflation.
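The construction of such a covariate can be sketched as follows. This is only an illustration with a randomly generated stand-in for the kinship matrix, not the actual pipeline, and k is fixed here rather than optimized against genomic inflation:

```r
set.seed(1)
n <- 84
# Stand-in for the empirical kinship matrix (n x n, symmetric PSD).
G <- matrix(rnorm(n * n), n, n)
K <- tcrossprod(G) / n
# Top three principal components of the kinship matrix.
pcs <- eigen(K, symmetric = TRUE)$vectors[, 1:3]
# k-means clustering on the PCs yields a categorical covariate.
k <- 4  # in practice, chosen to minimize genomic inflation
covariate <- kmeans(pcs, centers = k, nstart = 20)$cluster
table(covariate)
```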
Key question: Are there any intervals of consecutive SNPs that show an association with the phenotype?
A brute-force analysis would require enumerating on the order of \(N^2\) intervals, with \(N = 214,022\)
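Concretely, the number of contiguous intervals over \(N\) SNPs is \(N(N+1)/2\):

```r
N <- 214022
N * (N + 1) / 2  # number of contiguous intervals: about 2.3e10
```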
Import the CASMAP package
library(CASMAP)
Set the paths to the input files (genotype, phenotype and covariate). Note: for simplicity, the data files are assumed to be located in the same directory as the scripts.
genotype_file <- "X.dat"
phenotype_file <- "Y.dat"
covariate_file <- "C.dat"
Create an object to perform region-based GWAS (interval search).
obj_rbg <- CASMAP(mode="regionGWAS")
Set the hyperparameters of the analysis: the target family-wise error rate (FWER) and the maximum interval length, where a value of 0 means that interval length is unlimited.
obj_rbg$setTargetFWER(alpha=0.05)
obj_rbg$setMaxCombinationSize(max_comb_size=0)
Print the contents of the object.
print(obj_rbg)
## CASMAP object with:
## * Mode = regionGWAS
## * Target FWER = 0.05
## * Maximum combination size = 0
## * No input files read
Read input files.
The following is not necessary for A. thaliana, whose genotypes are already binary. In general, for input genotype files using an additive encoding, use the extra input argument `encoding` to select between two binarizations of the additive genotypes.
Note: the covariate file is optional.
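A hypothetical sketch of such a call (the additively encoded file name `X_additive.dat` and the encoding choice are illustrative, not part of this analysis):

```r
# Hypothetical example: for an additively encoded {0, 1, 2} genotype
# file, pass the encoding explicitly when reading the input files.
obj_add <- CASMAP(mode = "regionGWAS")
obj_add$setTargetFWER(alpha = 0.05)
obj_add$readFiles(genotype_file = "X_additive.dat",
                  phenotype_file = "Y.dat",
                  encoding = "dominant")
```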
obj_rbg$readFiles(genotype_file=genotype_file, phenotype_file=phenotype_file, covariate_file=covariate_file)
Print the state of the object again
print(obj_rbg)
## CASMAP object with:
## * Mode = regionGWAS
## * Target FWER = 0.05
## * Maximum combination size = 0
## * Input files read
## * Covariate = TRUE
Run the significant pattern mining algorithm to retrieve statistically associated genomic regions.
obj_rbg$execute()
In Example 1, we retrieved the summary information with getSummary(). We can also save it to a file: write the high-level summary and the profiling information related to the execution of the algorithm.
obj_rbg$writeSummary("output/summary.txt")
obj_rbg$writeProfile("output/profiling.txt")
Write the raw list of (possibly redundant) significantly associated genomic regions
obj_rbg$writeSignificantRegions("output/significant_regions_raw.txt")
Write post-processed list of disjoint clusters of significantly associated genomic regions
obj_rbg$writeSignificantClusterRepresentatives("output/significant_regions_clustered.txt")
Read the output files and determine their size
results_raw = read.table("output/significant_regions_raw.txt", sep="\t", skip=1, header=FALSE)
results_clust = read.table("output/significant_regions_clustered.txt", sep="\t", skip=1, header=FALSE,
col.names=c("p.value", "score", "OR", "index.set", "num.regions",
"index.start", "index.end"))
Number of statistically significant intervals (before clustering)
dim(results_raw)[1]
## [1] 25
Number of statistically significant intervals (after clustering)
dim(results_clust)[1]
## [1] 9
Show the top 10 most statistically significant intervals
sort_idx = order(results_clust$p.value)
head(results_clust[sort_idx, c("p.value", "score", "index.start", "index.end")], n=10)
## p.value score index.start index.end
## 4 7.92800e-12 46.7835 84223 84225
## 5 2.99241e-10 39.6796 84228 84229
## 3 5.09903e-10 38.6390 84218 84221
## 9 1.42051e-09 36.6405 84404 84408
## 8 2.18699e-09 35.7995 84365 84368
## 6 6.69676e-09 33.6209 84233 84234
## 7 6.69676e-09 33.6209 84236 84237
## 2 1.10408e-08 32.6488 84212 84213
## 1 1.83411e-08 31.6627 84201 84210
Read the contents of the profiling file
contents = readLines("output/profiling.txt")
print(contents)
## [1] "CODE PROFILING"
## [2] "Total Execution time: 3.46021 (s)."
## [3] "\tInitialisation time: 0.007941 (s)."
## [4] "\tTime to compute corrected significance threshold: 1.38135 (s)."
## [5] "\tTime to find significant intervals: 2.07076 (s)."
## [6] "\tPost-processing and cleanup time: 0.000151 (s)."
## [7] "File I/O time: 0.092898 (s)."
## [8] "Peak memory usage: 115360 (KB)"
Read the contents of the summary file
contents = readLines("output/summary.txt")
print(contents)
## [1] "DATASET CHARACTERISTICS:"
## [2] "\tN = 84, n = 28, L = 214022"
## [3] "RESULTS:"
## [4] "Number of intervals processed: 2969724 (0.0129667% of total)."
## [5] "Maximum interval length to be processed: unlimited"
## [6] "Associated testability threshold: 2.398833e-08"
## [7] "Number of testable intervals: 1979036"
## [8] "Corrected significance threshold at level 5.000000e-02: 2.526483e-08"
## [9] "Number of significantly associated intervals found: 25"
## [10] "Maximum testable interval length: 156"
Without correction, the algorithm reports an inflation factor \(\lambda = 1.53\). With correction, the inflation reduces to \(\lambda = 1.13\). See the original publication for other datasets with a more pronounced reduction, e.g. a COPD study with an original \(\lambda = 16.7\) reduced to \(\lambda = 1.05\).
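Genomic inflation is conventionally estimated from the per-SNP p-values as the median of the implied chi-squared statistics divided by the median of the null chi-squared distribution. A sketch of this standard estimator, illustrated on uniform (i.e. null) p-values rather than the study's actual p-values:

```r
set.seed(42)
pvals <- runif(1e5)  # stand-in for per-SNP p-values
# Convert p-values to 1-df chi-squared statistics.
chisq <- qchisq(pvals, df = 1, lower.tail = FALSE)
# Inflation factor: ratio of observed to expected median statistic.
lambda <- median(chisq) / qchisq(0.5, df = 1)
round(lambda, 2)  # close to 1 under the null
```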
The algorithm does not explore all possible intervals: if an interval is deemed untestable, then every super-interval of it is also untestable and can be pruned.
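A toy sketch of this pruning principle (not the actual CASMAP implementation): the pattern of an interval is the logical OR of its SNPs, so its support can only grow as the interval is extended, and the scan over end positions can stop at the first untestable interval. Untestability is mimicked here by a hypothetical support bound `max_support`:

```r
set.seed(7)
X <- matrix(rbinom(20 * 8, 1, 0.3), nrow = 20)  # 20 samples x 8 binary SNPs
max_support <- 12  # hypothetical bound standing in for the testability criterion
testable <- list()
for (start in 1:ncol(X)) {
  pattern <- integer(nrow(X))
  for (end in start:ncol(X)) {
    pattern <- pmax(pattern, X[, end])     # OR in the SNP at position 'end'
    if (sum(pattern) > max_support) break  # prune all longer intervals
    testable[[length(testable) + 1]] <- c(start, end)
  }
}
length(testable)  # number of intervals actually enumerated
```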
A clustering algorithm groups overlapping significant intervals to avoid reporting redundant hits.