Search
Gene Ontology is the world’s largest knowledgebase of the functions of genes. In this tutorial we will demonstrate its utilization for set-level analysis of gene expression data:
The set level gene expression analysis may bring additional knowledge about the disease under study as the straightforward detection of differentially expressed genes may not be not sensitive enough to detect the subtle differences between the expression patterns in both the groups. Diseases typically involve entire groups of mutually linked genes. Only the additive change in their expression may lead to the difference in phenotypic expression.
R
load("GSE18123.Rdata") colnames(phenoData(gse18123)) phenoData(gse18123)$group head(exprs(gse18123)) heatmap(exprs(gse18123)[1:4,1:20])
Technically, the annotation data contain the list of terms, each term has a vector of related genes. Let us inspect gene sets derived from the GO Biological Process ontology:
library(fgsea) c5.go.bp.lists <- gmtPathways("c5.go.bp.v7.5.1.entrez.gmt") head(c5.go.bp.lists,1)
The first element in the list of GO terms, ENTREZ IDs of related genes follow
$GOBP_MITOCHONDRIAL_GENOME_MAINTENANCE [1] "10000" "10891" "11232" "142" "1763" "1890" "201163" "201973" [9] "2021" "219736" "291" "3980" "4205" "4358" "4976" "55186" [17] "7156" "7157" "80119" "83667" "84275" "92667" "9361"
Now the goal is to quantify differential expression. We deal with microarray data in this task, we will use limma package. Essentially, linear regression is applied. Below, there is a table and boxplot for 4 most differentially expressed microarray probes (y-axis shows expression).
library(limma) experiments_of_interest <- c("ASD", "CONTROL") levels(phenoData(gse18123)$group)<-c("CONTROL","AUTISM","AUTISM","AUTISM") levels(phenoData(gse18123)$group)<-c("CONTROL","ASD") columns_of_interest <- which(phenoData(gse18123)$group %in% experiments_of_interest) grouping <- droplevels(phenoData(gse18123)$group[columns_of_interest]) ge_of_interest <- exprs(gse18123)[,columns_of_interest] design <- model.matrix(~ grouping) fit <- limma::lmFit(log2(ge_of_interest), design) result <- limma::eBayes(fit) topTable(result,number=4)
Limma output, with log foldchange, average expression, t-stat value and corresponding p-values for each of the probes
logFC AveExpr t P.Value adj.P.Val B 203761_at 0.4857979 10.522307 5.652879 1.527789e-07 0.006379269 6.979307 224724_at 0.6256810 9.495828 5.471321 3.373936e-07 0.006379269 6.268014 212209_at 0.5745102 8.934370 5.389020 4.812168e-07 0.006379269 5.949362 202032_s_at 0.3387834 8.866316 5.353562 5.603056e-07 0.006379269 5.812834
Each microarray chip has an annotation library that matches the probes and genes. With the aid of library we can learn more about the probes and genes.
library(hgu133plus2.db) dbq<-select(hgu133plus2.db, keys=rownames(gse18123), keytype="PROBEID", columns = c("SYMBOL","ENTREZID"), multiVals="list") annot<-dbq[match(unique(dbq$PROBEID), dbq$PROBEID),] annot[annot$PROBEID %in% rownames(topTable(result,number=4)),]
A brief annotation for 4 top probes
PROBEID SYMBOL ENTREZID 203761_at SLA 6503 212209_at MED13L 23389 202032_s_at MAN2A2 4122 224724_at SULF2 55959
When knowing Entrez ID and gene symbol, we can access NCBI website for SLA summary.
Over-representation analysis (ORA) determines whether genes from a pre-defined set (in here, those beloging to a specific GO term) are present more than would be expected (over-represented) in a subset of your data (in here, a set of differentially expressed genes). In fact, it is a repetitive application of hypergeometric test.
In the R code below, the differentially expressed genes (DEGs) are found. Then, ORA is performed with the aid of clusterProfiler package. In particular, enrichGO function gradually goes through all the GO BP (biological process) terms and sorts them in terms of their over-representation in the list of DEGs. emapplot shows 30 most significant terms, mutually overlapping gene sets tend to cluster together, making it easier for interpretation.
library(clusterProfiler) DEGs_ENTREZID<-annot$ENTREZID[annot$PROBEID %in% rownames(topTable(result,number=Inf,p.value=0.05))] DEGs_ENTREZID<-DEGs_ENTREZID[!is.na(DEGs_ENTREZID)] ans.go <- enrichGO(gene = DEGs_ENTREZID, ont = "BP", OrgDb ="org.Hs.eg.db", universe = annot$ENTREZID, readable=TRUE, pvalueCutoff = 1) tab.go <- as.data.frame(ans.go) tab.go<- subset(tab.go, Count>5) tab.go[1:5, 1:6] emapplot(ans.go)
The top 5 over-represented GO BP terms
ID Description GeneRatio BgRatio pvalue p.adjust GO:0036230 granulocyte activation 59/664 498/16042 2.631647e-13 1.169581e-09 GO:0042119 neutrophil activation 58/664 492/16042 5.246282e-13 1.169581e-09 GO:0002283 neutrophil activation in immune response 57/664 482/16042 7.385270e-13 1.169581e-09 GO:0002446 neutrophil mediated immunity 57/664 493/16042 1.889805e-12 1.805826e-09 GO:0043312 neutrophil degranulation 56/664 479/16042 1.900469e-12 1.805826e-09
Let us have a look at granulocyte activation term presented in the table above. Its GeneRatio says that there are 59 genes that make the overlap between the set of differentially expressed genes (DEs, whose size is 664, only unique IDs count, they also have to be found anywhere in the whole collection of GO BP gene IDs) and the set of genes that are annotated as granulocyte activating (GAs, whose size is 498, the genes have to be measured by the microarray too). The total amount of unique gene IDs in GO BP and the microarray set of genes is 16,042. Consequently, the following contingency table orignates:
The outcome of hypergeometric Fisher's test performed over the table perfectly matches the raw p-value reported above for granulocyte activation by enrichGO:
fisher.test(matrix(c(59,439,605,14939),nrow=2))
The outcome of hypergeometric Fisher's test perfectly matches the raw p-value reported above
Fisher's Exact Test for Count Data data: x p-value = 2.632e-13 alternative hypothesis: true odds ratio is not equal to 1 95 percent confidence interval: 2.454533 4.420599 sample estimates: odds ratio 3.318156
Over-representation analysis may suffer from a couple of limitations. First, after correcting for multiple hypotheses testing, no individual gene may meet the threshold for statistical significance. The relevant biological differences could be modest relative to the noise inherent to the expression measurement. Second, the selection of a set of DEGs leads to information loss as the quantitative information about gene expression is binarized. Again, subtle differential expression patterns spread accross entire groups of mutually linked genes could be omitted. GSEA ([Subramanian et al.]) avoids these bottlenecks.
library(fgsea) probe.tstats<-topTable(result,number=Inf)$t names(probe.tstats)<-annot$ENTREZID[match(rownames(topTable(result,number=Inf)),annot$PROBEID)] probe.tstats<-probe.tstats[!is.na(names(probe.tstats))] gene.tstats<-deframe(aggregate(probe.tstats,list(names(probe.tstats)),mean)) c5.go.bp.lists <- gmtPathways("c5.go.bp.v7.5.1.entrez.gmt") fgseaRes <- fgsea(pathways=c5.go.bp.lists, stats=gene.tstats, minSize=10, maxSize=500,nperm=100000) View(fgseaRes[order(pval)])
GSEA gives a different ordering of gene-sets, but it confirms low p-values of the terms proposed in terms of ORA:
GSEA gives a different ordering of gene-sets, but confirms low p-values for the terms proposed in terms of ORA
pathway pval padj ES NES GOBP_GRANULOCYTE_ACTIVATION 8.423992e-05 0.001208761 0.5493069 1.911837 GOBP_NEUTROPHIL_ACTIVATION_INVOLVED_IN_IMMUNE_RESPONSE 1.469233e-04 0.001856250 0.6760583 1.994625 GOBP_NEUTROPHIL_DEGRANULATION 1.825791e-04 0.002235839 0.7303794 1.949595 GOBP_NEUTROPHIL_MEDIATED_IMMUNITY 3.321024e-03 0.021030446 0.5091520 1.695032
See the last chunk in the Rmd script attached below. Inspect it, run it and explain whether:
Eventually, propose your own way of gene-set-based feature extraction for ASD detection.
The assignment aims at another GO utilization, in particular, automated (protein) function prediction (AFP).
The supplement for this tutorial contains a Rmd file and the input datasets. The zip file can be downloaded here. The Rmd file contains more experiments, namely with different annotation datasets (KEGG pathways, transcription factors) and graphical outputs.