Tutorial 10 - Gene Ontology

Gene Ontology is the world’s largest knowledgebase of the functions of genes. In this tutorial we will demonstrate its utilization for set-level analysis of gene expression data:

First, we will quantify differential expression between a diseased and a control group for individual genes. You have already learnt this procedure in the previous tutorial. The differentially expressed genes may have an association with the disease phenotype.
Then, classes of related genes will be taken from gene ontology and represented in terms of gene sets.
Eventually, we will detect the classes of genes whose expression changes between the groups.

The set level gene expression analysis may bring additional knowledge about the disease under study as the straightforward detection of differentially expressed genes may not be not sensitive enough to detect the subtle differences between the expression patterns in both the groups. Diseases typically involve entire groups of mutually linked genes. Only the additive change in their expression may lead to the difference in phenotypic expression.

The main goals

Access and understand gene-ontology annotation data, use MSigDB or Bioconductor packages.
Learn and compare two ways of set-level gene expression analysis:
- over-representation analysis (ORA),
- gene-set enrichment analysis (GSEA).
Visualize and interpret the reached results.

The input data

[Kong et al.] presents the results of a large blood transcriptome study that aims to evaluate the utility of gene expression profiling in the diagnosis of Autism Spectrum Disorders (ASD). The full gene expression dataset is available through Gene Expression Omnibus. In this tutorial, we will work with a RData expression file that contains 54,613 features (detection probes) and 99 samples (individuals). 66 individuals suffer with an ASD, 33 individuals represent the control healthy group. This file corresponds to the train dataset referenced in the paper.

R

load("GSE18123.Rdata")
colnames(phenoData(gse18123))
phenoData(gse18123)$group
head(exprs(gse18123))
heatmap(exprs(gse18123)[1:4,1:20])

GO annotation data are available through MSigDB as well as many R packages. In fact, they link the ontology terms and the individual genes (see the right pane in the figure below, the figure is from [Glass and Girvan]).

Technically, the annotation data contain the list of terms, each term has a vector of related genes. Let us inspect gene sets derived from the GO Biological Process ontology:

R

library(fgsea)
c5.go.bp.lists <- gmtPathways("c5.go.bp.v7.5.1.entrez.gmt")
head(c5.go.bp.lists,1)

The first element in the list of GO terms, ENTREZ IDs of related genes follow

$GOBP_MITOCHONDRIAL_GENOME_MAINTENANCE
 [1] "10000"  "10891"  "11232"  "142"    "1763"   "1890"   "201163" "201973"
 [9] "2021"   "219736" "291"    "3980"   "4205"   "4358"   "4976"   "55186" 
[17] "7156"   "7157"   "80119"  "83667"  "84275"  "92667"  "9361"

Differential gene expression

Now the goal is to quantify differential expression. We deal with microarray data in this task, we will use limma library. Essentially, linear regression is applied. Below, there is a table and boxplot for 4 most differentially expressed microarray probes (y-axis shows expression).

R

library(limma)

experiments_of_interest <- c("ASD", "CONTROL")
levels(phenoData(gse18123)$group)<-c("CONTROL","AUTISM","AUTISM","AUTISM")
levels(phenoData(gse18123)$group)<-c("CONTROL","ASD")

columns_of_interest <- which(phenoData(gse18123)$group %in% experiments_of_interest)
grouping <- droplevels(phenoData(gse18123)$group[columns_of_interest])
ge_of_interest <- exprs(gse18123)[,columns_of_interest]

design <- model.matrix(~ grouping)

fit <- limma::lmFit(log2(ge_of_interest), design)
result <- limma::eBayes(fit)
topTable(result,number=4)

Limma output, with log foldchange, average expression, t-stat value and corresponding p-values for each of the probes

                 logFC   AveExpr        t      P.Value   adj.P.Val        B
203761_at    0.4857979 10.522307 5.652879 1.527789e-07 0.006379269 6.979307
224724_at    0.6256810  9.495828 5.471321 3.373936e-07 0.006379269 6.268014
212209_at    0.5745102  8.934370 5.389020 4.812168e-07 0.006379269 5.949362
202032_s_at  0.3387834  8.866316 5.353562 5.603056e-07 0.006379269 5.812834

Learn more about the detected probes

Each microarray chip has an annotation library that matches the probes and genes. With the aid of library we can learn more about the probes and genes.

R

library(hgu133plus2.db)

dbq<-select(hgu133plus2.db, keys=rownames(gse18123), keytype="PROBEID", columns = c("SYMBOL","ENTREZID"), multiVals="list")
annot<-dbq[match(unique(dbq$PROBEID), dbq$PROBEID),]
annot[annot$PROBEID %in% rownames(topTable(result,number=4)),]

A brief annotation for 4 top probes

    PROBEID SYMBOL ENTREZID
  203761_at    SLA     6503
  212209_at MED13L    23389
202032_s_at MAN2A2     4122
  224724_at  SULF2    55959

When knowing Entrez ID and gene symbol, we can access NCBI website for SLA summary.

The information about DEGs itself may lead to interesting findings. However, the appearance of probes and genes in this list could be the consequence of noisy measurements, small sample sizes and large number of tested probes. Genes are often multi-functional too. Set-level analysis could be more robust and bring a different view of expression data.

Over-representation analysis

Over-representation analysis determines whether genes from a pre-defined set (in here, those beloging to a specific GO term) are present more than would be expected (over-represented) in a subset of your data (in here, a set of differentially expressed genes). In fact, it is a repetitive application of hypergeometric test.

In the R code below, the differentially expressed genes (DEGs) are found. Then, enrichGO gradually goes through all the GO BP (biological process) terms and sorts them in terms of their over-representation in the list of DEGs. emapplot shows 30 most significant terms, mutually overlapping gene sets tend to cluster together, making it easier for interpretation.

R

library(clusterProfiler)

DEGs_ENTREZID<-annot$ENTREZID[annot$PROBEID %in% rownames(topTable(result,number=Inf,p.value=0.05))]
DEGs_ENTREZID<-DEGs_ENTREZID[!is.na(DEGs_ENTREZID)]

ans.go <- enrichGO(gene = DEGs_ENTREZID, ont = "BP",
                   OrgDb ="org.Hs.eg.db",
                   universe = annot$ENTREZID,
                   readable=TRUE,
                   pvalueCutoff = 1)

tab.go <- as.data.frame(ans.go)
tab.go<- subset(tab.go, Count>5)

tab.go[1:5, 1:6]
emapplot(ans.go)

The top 5 over-represented GO BP terms

        ID                              Description GeneRatio   BgRatio       pvalue      p.adjust
GO:0036230                   granulocyte activation    59/664 498/16042 2.631647e-13  1.169581e-09
GO:0042119                    neutrophil activation    58/664 492/16042 5.246282e-13  1.169581e-09
GO:0002283 neutrophil activation in immune response    57/664 482/16042 7.385270e-13  1.169581e-09
GO:0002446             neutrophil mediated immunity    57/664 493/16042 1.889805e-12  1.805826e-09
GO:0043312                 neutrophil degranulation    56/664 479/16042 1.900469e-12  1.805826e-09

The top 5 GO BP terms as well as one of the major GO BP clusters in the emapplot picture above link ASD and immune system. Neutrophils are the most abundant type of granulocytes and make up 40% to 70% of all white blood cells in humans. They form an essential part of the innate immune system. Immune system problems in ASD have been increasing in recent times and are the subject of ongoing interest, see e.g., Masi et al.. The ontology view of the given cluster could be seen here.

Over-representation analysis as a hypergeometric test

Let us have a look at granulocyte activation term presented in the table above. Its GeneRatio says that there are 59 genes that make the overlap between the set of differentially expressed genes (DEs, whose size is 664, only unique IDs count, they also have to be found anywhere in the whole collection of GO BP gene IDs) and the set of genes that are annotated as granulocyte activating (GAs, whose size is 498, the genes have to be measured by the microarray too). The total amount of unique gene IDs in GO BP and the microarray set of genes is 16,042. Consequently, the following contingency table orignates:

	GAs	others	total
DEs	59	605	664
others	439	14,939	15,378
total	498	15,544	16,042

The outcome of hypergeometric Fisher's test performed over the table perfectly matches the raw p-value reported above for granulocyte activation by enrichGO:

R

fisher.test(matrix(c(59,439,605,14939),nrow=2))

The outcome of hypergeometric Fisher's test perfectly matches the raw p-value reported above

Fisher's Exact Test for Count Data

data:  x
p-value = 2.632e-13
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 2.454533 4.420599
sample estimates:
odds ratio 
  3.318156

Gene-set enrichment analysis (GSEA)

Over-representation analysis may suffer from a couple of limitations. First, after correcting for multiple hypotheses testing, no individual gene may meet the threshold for statistical significance. The relevant biological differences could be modest relative to the noise inherent to the expression measurement. Second, the selection of a set of DEGs leads to information loss as the quantitative information about gene expression is binarized. Again, subtle differential expression patterns spread accross entire groups of mutually linked genes could be omitted. GSEA ([Subramanian et al.]) avoids these bottlenecks.

R

library(fgsea)

probe.tstats<-topTable(result,number=Inf)$t
names(probe.tstats)<-annot$ENTREZID[match(rownames(topTable(result,number=Inf)),annot$PROBEID)]
probe.tstats<-probe.tstats[!is.na(names(probe.tstats))]
gene.tstats<-deframe(aggregate(probe.tstats,list(names(probe.tstats)),mean))

c5.go.bp.lists <- gmtPathways("c5.go.bp.v7.5.1.entrez.gmt")
fgseaRes <- fgsea(pathways=c5.go.bp.lists, stats=gene.tstats, minSize=10, maxSize=500,nperm=100000)
View(fgseaRes[order(pval)])

GSEA gives a different ordering of gene-sets, but it confirms low p-values of the terms proposed in terms of ORA:

GSEA gives a different ordering of gene-sets, but confirms low p-values for the terms proposed in terms of ORA

                                                pathway         pval        padj        ES      NES
                            GOBP_GRANULOCYTE_ACTIVATION 8.423992e-05 0.001208761 0.5493069 1.911837
 GOBP_NEUTROPHIL_ACTIVATION_INVOLVED_IN_IMMUNE_RESPONSE 1.469233e-04 0.001856250 0.6760583 1.994625
                          GOBP_NEUTROPHIL_DEGRANULATION 1.825791e-04 0.002235839 0.7303794 1.949595
                      GOBP_NEUTROPHIL_MEDIATED_IMMUNITY 3.321024e-03 0.021030446 0.5091520 1.695032

Your task for this tutorial

See the last chunk in the Rmd script attached below. Inspect it, run it and explain whether:

the blood expression profiling for ASD detection may be feasible,
the outlined gene set aggregation helps in ASD detection,
is the outlined evaluation of gene set aggregation unbiased,
the proposed GO BP terms match the terms proposed earlier in terms of ORA and GSEA and whether they should match them in theory.

Eventually, propose your own way of gene-set-based feature extraction for ASD detection.

The assignment

The assignment aims at another GO utilization, in particular, automated (protein) function prediction (AFP).

Bibliography

[Subramanian et al.] Subramanian, Aravind, et al. “Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles.” Proceedings of the National Academy of Sciences 102.43 (2005): 15545-15550.
[Wu et al.] Wu, Tianzhi, et al. “clusterProfiler 4.0: A universal enrichment tool for interpreting omics data.” The Innovation 2.3 (2021): 100141.
[Kong et al.] Kong, Sek Won, et al. “Characteristics and predictive value of blood transcriptome signature in males with autism spectrum disorders.” PloS one 7.12 (2012): e49475.
[Quesnel-Vallières et al.] Quesnel-Vallieres, Mathieu, et al. “Autism spectrum disorder: insights into convergent mechanisms from transcriptomics.” Nature Reviews Genetics 20.1 (2019): 51-63.

Materials

The supplement for this tutorial contains a Rmd file and the input datasets. The zip file can be downloaded here. The Rmd file contains more experiments, namely with different annotation datasets (KEGG pathways, transcription factors) and graphical outputs.

Table of Contents