We will replicate the study presented in:
Human breast cancer data obtained from MCF-7 breast cancer expression profiles (GSE6462)
The expression data contain samples induced by two different ErbB receptor ligands, EGF and HRG. Our analysis will focus on EGF-induced samples
Total number of genes in the samples: 12,773
Binary labeling of samples: 1 “up-regulated”; 0 otherwise
Considered transcription factors (TFs) predicted to bind in the promoter of the genes (from MSigDB): 397 total
Data preparation from the author’s website
Key question: Are there specific combinations of motifs that are enriched in one group of genes and not the other?
Brute force analysis will require the enumeration of \(2^M\) sets of motifs, with \(M = 397\)
library(CASMAP)
Set the paths to the input files (data and labels). Note: for simplicity, the data files are assumed to be located in the same directory as the scripts.
data_file <- "breast_cancer_matrix.dat"
label_file <- "breast_cancer_label.dat"
Create an object to search for higher order interactions (all possible combinations).
obj_hoi <- CASMAP(mode="higherOrderEpistasis")
Set hyperparameters of the analysis:
obj_hoi$setTargetFWER(alpha=0.05)
obj_hoi$setMaxCombinationSize(max_comb_size=0)
Read input files. Note: Nomenclature of named-parameters is assumed for genomic data.
obj_hoi$readFiles(genotype_file=data_file, phenotype_file=label_file)
All is ready for the analysis! Execute the search. Note: Depending on the data, this step may take some time.
obj_hoi$execute()
The analysis is finalized. Get the summary results.
summary_results <- obj_hoi$getSummary()
print(summary_results)
## $n.iset.processed
## [1] 33837957
##
## $n.iset.closed.processed
## [1] 12060658
##
## $n.iset.testable
## [1] 12009298
##
## $testability.threshold
## [1] 3.981072e-09
##
## $target.fwer
## [1] 0.05
##
## $corrected.significance.threshold
## [1] 4.163441e-09
Get the statistically significant sets of motifs.
sig_sets <- obj_hoi$getSignificantInteractions()
names(sig_sets)
## [1] "itemsets" "score" "odds_ratio" "pvalue"
Display the most significant sets.
sort_idx <- order(sig_sets$pvalue)
sig_sets_ordered <- sig_sets[sort_idx, ]
head(sig_sets_ordered[, c("pvalue", "itemsets")], n=5)
## pvalue itemsets
## 790 1.025043e-19 10, 97, 230, 88, 74, 146, 202, 297
## 3983 1.495522e-19 317, 182, 106, 146
## 260 1.910906e-17 19, 209, 106, 249, 297
## 788 1.910906e-17 10, 97, 230, 88, 74, 146, 155, 202, 297
## 789 1.910906e-17 10, 97, 230, 88, 74, 146, 6, 202, 297, 266
Display the borderline significant sets.
sort_idx <- order(sig_sets$pvalue)
sig_sets_ordered <- sig_sets[sort_idx, ]
tail(sig_sets_ordered[, c("pvalue", "itemsets")], n=5)
## pvalue itemsets
## 1797 3.399109e-09 313, 146, 297
## 16 3.482785e-09 7, 106, 164, 266
## 46 3.482785e-09 368, 106, 266, 297
## 1830 3.482785e-09 313, 146, 6
## 1802 4.008258e-09 313, 146, 202
The combinatorial analysis with CASMAP does not explore the entire search space. Untestable hypothesis are pruned achieving more statistical power.