====== Tutorial 10 - Gene Ontology ====== [[http://geneontology.org/|Gene Ontology]] is the world’s largest knowledgebase of the functions of genes. In this tutorial we will demonstrate its utilization for **set-level analysis of gene expression data**: * First, we will quantify differential expression between a diseased and a control group for individual genes. You have already learnt this procedure in the previous tutorial. The differentially expressed genes may have an association with the disease phenotype. * Then, classes of related genes will be taken from gene ontology and represented in terms of gene sets. * Eventually, we will detect the classes of genes whose expression changes between the groups. The set level gene expression analysis may bring additional knowledge about the disease under study as the straightforward detection of differentially expressed genes may not be not sensitive enough to detect the subtle differences between the expression patterns in both the groups. Diseases typically involve entire groups of mutually linked genes. Only the additive change in their expression may lead to the difference in phenotypic expression. ===== The main goals ===== * Access and understand gene-ontology annotation data. * Learn and compare two ways of set-level gene expression analysis: * **over-representation analysis (ORA)**, * **gene-set enrichment analysis (GSEA)**. * Visualize and interpret the reached results. ===== The input data ===== * [[https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0049475|[Kong et al.]]] presents the results of a large blood transcriptome study that aims to evaluate the utility of gene expression profiling in the diagnosis of Autism Spectrum Disorders (ASD). The full gene expression dataset is available through [[https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE18123|Gene Expression Omnibus]]. In this tutorial, we will work with a {{:courses:bin:tutorials:gse18123.zip|RData expression file}} that contains 54,613 features (detection probes) and 99 samples (individuals). 66 individuals suffer with an ASD, 33 individuals represent the control healthy group. This file corresponds to the train dataset referenced in the paper. load("GSE18123.Rdata") colnames(phenoData(gse18123)) phenoData(gse18123)$group head(exprs(gse18123)) heatmap(exprs(gse18123)[1:4,1:20]) {{ :courses:bin:tutorials:ge_heatmap.jpeg |GE heatmap}} * GO annotation data are available through [[https://www.gsea-msigdb.org/gsea/msigdb/index.jsp|MSigDB]] as well as many R/Bioconductor packages (see [[https://bioconductor.org/packages/release/data/annotation/html/GO.db.html|GO.db]] or [[https://bioconductor.org/packages/release/bioc/html/topGO.html|TopGO]]). In fact, they link the ontology terms and the individual genes (see the right pane in the figure below, the figure is from [[https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004565|[Glass and Girvan]]]). {{ :courses:bin:tutorials:go_annot.jpeg |GO, term hierarchy and annotations}} Technically, the annotation data contain the list of terms, each term has a vector of related genes. Let us inspect [[https://www.gsea-msigdb.org/gsea/msigdb/download_file.jsp?filePath=/msigdb/release/7.5.1/c5.go.bp.v7.5.1.entrez.gmt|gene sets derived from the GO Biological Process ontology]]: library(fgsea) c5.go.bp.lists <- gmtPathways("c5.go.bp.v7.5.1.entrez.gmt") head(c5.go.bp.lists,1) $GOBP_MITOCHONDRIAL_GENOME_MAINTENANCE [1] "10000" "10891" "11232" "142" "1763" "1890" "201163" "201973" [9] "2021" "219736" "291" "3980" "4205" "4358" "4976" "55186" [17] "7156" "7157" "80119" "83667" "84275" "92667" "9361" ===== Differential gene expression ===== Now the goal is to quantify differential expression. We deal with **microarray** data in this task, we will use [[https://bioconductor.org/packages/release/bioc/html/limma.html|limma]] package. Essentially, linear regression is applied. Below, there is a table and boxplot for 4 most differentially expressed microarray probes (y-axis shows expression). library(limma) experiments_of_interest <- c("ASD", "CONTROL") levels(phenoData(gse18123)$group)<-c("CONTROL","AUTISM","AUTISM","AUTISM") levels(phenoData(gse18123)$group)<-c("CONTROL","ASD") columns_of_interest <- which(phenoData(gse18123)$group %in% experiments_of_interest) grouping <- droplevels(phenoData(gse18123)$group[columns_of_interest]) ge_of_interest <- exprs(gse18123)[,columns_of_interest] design <- model.matrix(~ grouping) fit <- limma::lmFit(log2(ge_of_interest), design) result <- limma::eBayes(fit) topTable(result,number=4) logFC AveExpr t P.Value adj.P.Val B 203761_at 0.4857979 10.522307 5.652879 1.527789e-07 0.006379269 6.979307 224724_at 0.6256810 9.495828 5.471321 3.373936e-07 0.006379269 6.268014 212209_at 0.5745102 8.934370 5.389020 4.812168e-07 0.006379269 5.949362 202032_s_at 0.3387834 8.866316 5.353562 5.603056e-07 0.006379269 5.812834 {{ :courses:bin:tutorials:de_probes.jpeg |Boxplots for 4 most differentially expressed microarray probes}} ===== Learn more about the detected probes ===== Each microarray chip has an annotation library that matches the probes and genes. With the aid of library we can learn more about the probes and genes. library(hgu133plus2.db) dbq<-select(hgu133plus2.db, keys=rownames(gse18123), keytype="PROBEID", columns = c("SYMBOL","ENTREZID"), multiVals="list") annot<-dbq[match(unique(dbq$PROBEID), dbq$PROBEID),] annot[annot$PROBEID %in% rownames(topTable(result,number=4)),] PROBEID SYMBOL ENTREZID 203761_at SLA 6503 212209_at MED13L 23389 202032_s_at MAN2A2 4122 224724_at SULF2 55959 When knowing Entrez ID and gene symbol, we can access [[https://www.ncbi.nlm.nih.gov/gene/6503|NCBI website for SLA summary]]. /* most straightforwardly, we may look at gene summary: Predicted to enable signaling receptor binding activity. Predicted to be involved in cell differentiation; innate immune response; and transmembrane receptor protein tyrosine kinase signaling pathway. Predicted to be located in cytosol. Predicted to be active in dendritic spine and focal adhesion. Predicted to be extrinsic component of cytoplasmic side of plasma membrane. wrt the later analysis, the relationship to immune system could be relevant ...*/ The information about DEGs itself may lead to interesting findings. However, the appearance of probes and genes in this list could be the consequence of noisy measurements, small sample sizes and large number of tested probes. Genes are often multi-functional too. Set-level analysis could be more robust and bring a different view of expression data. ===== Over-representation analysis ===== Over-representation analysis (ORA) determines whether genes from a pre-defined set (in here, those beloging to a specific GO term) are present more than would be expected (over-represented) in a subset of your data (in here, a set of differentially expressed genes). In fact, it is a repetitive application of [[https://en.wikipedia.org/wiki/Hypergeometric_distribution|hypergeometric test]]. In the R code below, the differentially expressed genes (DEGs) are found. Then, ORA is performed with the aid of [[https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html|clusterProfiler]] package. In particular, //enrichGO// function gradually goes through all the GO BP (biological process) terms and sorts them in terms of their over-representation in the list of DEGs. //emapplot// shows 30 most significant terms, mutually overlapping gene sets tend to cluster together, making it easier for interpretation. library(clusterProfiler) DEGs_ENTREZID<-annot$ENTREZID[annot$PROBEID %in% rownames(topTable(result,number=Inf,p.value=0.05))] DEGs_ENTREZID<-DEGs_ENTREZID[!is.na(DEGs_ENTREZID)] ans.go <- enrichGO(gene = DEGs_ENTREZID, ont = "BP", OrgDb ="org.Hs.eg.db", universe = annot$ENTREZID, readable=TRUE, pvalueCutoff = 1) tab.go <- as.data.frame(ans.go) tab.go<- subset(tab.go, Count>5) tab.go[1:5, 1:6] emapplot(ans.go) ID Description GeneRatio BgRatio pvalue p.adjust GO:0036230 granulocyte activation 59/664 498/16042 2.631647e-13 1.169581e-09 GO:0042119 neutrophil activation 58/664 492/16042 5.246282e-13 1.169581e-09 GO:0002283 neutrophil activation in immune response 57/664 482/16042 7.385270e-13 1.169581e-09 GO:0002446 neutrophil mediated immunity 57/664 493/16042 1.889805e-12 1.805826e-09 GO:0043312 neutrophil degranulation 56/664 479/16042 1.900469e-12 1.805826e-09 /* GeneRatio and BgRatio give the 4-fold table for the given term, for example, there is about 3% of granulocyte activation genes in general (498/16042=0.031), but in our set, there is nearly 9% granulocyte activation genes (59/664=0.089)*/ {{ :courses:bin:tutorials:go_bp_links.jpeg | Over-represented GO BP terms, overlap graph}} The top 5 GO BP terms as well as one of the major GO BP clusters in the //emapplot// picture above link ASD and immune system. Neutrophils are the most abundant type of granulocytes and make up 40% to 70% of all white blood cells in humans. They form an essential part of the innate immune system. Immune system problems in ASD have been increasing in recent times and are the subject of ongoing interest, see e.g., [[https://link.springer.com/article/10.1007/s12264-017-0103-8|Masi et al.]]. The ontology view of the given cluster could be seen [[https://www.ebi.ac.uk/QuickGO/term/GO:0043312|here]]. === Over-representation analysis as a hypergeometric test === Let us have a look at //granulocyte activation// term presented in the table above. Its GeneRatio says that there are 59 genes that make the overlap between the set of differentially expressed genes (DEs, whose size is 664, only unique IDs count, they also have to be found anywhere in the whole collection of GO BP gene IDs) and the set of genes that are annotated as //granulocyte activating// (GAs, whose size is 498, the genes have to be measured by the microarray too). The total amount of unique gene IDs in GO BP and the microarray set of genes is 16,042. Consequently, the following contingency table orignates: | ^ GAs^ others^ total^ ^ DEs | 59| 605| 664| ^ others | 439| 14,939| 15,378| ^ total | 498| 15,544| 16,042| The outcome of hypergeometric Fisher's test performed over the table perfectly matches the raw p-value reported above for //granulocyte activation// by //enrichGO//: fisher.test(matrix(c(59,439,605,14939),nrow=2)) Fisher's Exact Test for Count Data data: x p-value = 2.632e-13 alternative hypothesis: true odds ratio is not equal to 1 95 percent confidence interval: 2.454533 4.420599 sample estimates: odds ratio 3.318156 ===== Gene-set enrichment analysis (GSEA) ===== Over-representation analysis may suffer from a couple of limitations. First, after correcting for multiple hypotheses testing, no individual gene may meet the threshold for statistical significance. The relevant biological differences could be modest relative to the noise inherent to the expression measurement. Second, the selection of a set of DEGs leads to information loss as the quantitative information about gene expression is binarized. Again, subtle differential expression patterns spread accross entire groups of mutually linked genes could be omitted. GSEA ([[https://www.pnas.org/doi/10.1073/pnas.0506580102|[Subramanian et al.]]]) avoids these bottlenecks. library(fgsea) probe.tstats<-topTable(result,number=Inf)$t names(probe.tstats)<-annot$ENTREZID[match(rownames(topTable(result,number=Inf)),annot$PROBEID)] probe.tstats<-probe.tstats[!is.na(names(probe.tstats))] gene.tstats<-deframe(aggregate(probe.tstats,list(names(probe.tstats)),mean)) c5.go.bp.lists <- gmtPathways("c5.go.bp.v7.5.1.entrez.gmt") fgseaRes <- fgsea(pathways=c5.go.bp.lists, stats=gene.tstats, minSize=10, maxSize=500,nperm=100000) View(fgseaRes[order(pval)]) GSEA gives a different ordering of gene-sets, but it confirms low p-values of the terms proposed in terms of ORA: pathway pval padj ES NES GOBP_GRANULOCYTE_ACTIVATION 8.423992e-05 0.001208761 0.5493069 1.911837 GOBP_NEUTROPHIL_ACTIVATION_INVOLVED_IN_IMMUNE_RESPONSE 1.469233e-04 0.001856250 0.6760583 1.994625 GOBP_NEUTROPHIL_DEGRANULATION 1.825791e-04 0.002235839 0.7303794 1.949595 GOBP_NEUTROPHIL_MEDIATED_IMMUNITY 3.321024e-03 0.021030446 0.5091520 1.695032 ==== Your task for this tutorial ==== See the last chunk in the Rmd script attached below. Inspect it, run it and explain whether: - the blood expression profiling for ASD detection may be feasible, /* yes, it is feasible, AUC could be up to 0.8 for approximately 50 selected probes (see the cvfit graph), the same observation was done in the paper of Kong et al. (AUC 0.73 for male samples, we deal only with male samples here too) */ - the outlined gene set aggregation helps in ASD detection, /* yes, it further improves the AUC shown fro probes, AUC could be up to 0.9 for approximately 50 selected gene sets (see the cvfit graph) */ - is the outlined evaluation of gene set aggregation unbiased, /* unfortunately, the gene set experiment is slightly optimistically biased, features are constructed with the aid of information taken from samples being classified, the class is ignored which avoids major overfitting, but still, nested cross-validation would be needed to avoid this optimistic bias */ - the proposed GO BP terms match the terms proposed earlier in terms of ORA and GSEA and whether they should match them in theory. /* they should match the DE gene sets discovered earlier, differential expression implies good classification performance */ Eventually, propose your own way of gene-set-based feature extraction for ASD detection. /* possible changes: different way of aggregation (averaging, setSig, pick the best GS gene or simple concatenation (no aggregation, DE gene sets picked as a whole)), utilization of different types of gene sets (KEGG, GO MF, ...), proper testing with the aid of nested CV */ ====== The assignment ====== The assignment aims at another GO utilization, in particular, [[courses:bin:assignments:hw5|automated (protein) function prediction (AFP)]]. ===== Bibliography ===== * [[https://www.pnas.org/doi/10.1073/pnas.0506580102|[Subramanian et al.]]] Subramanian, Aravind, et al. "Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles." Proceedings of the National Academy of Sciences 102.43 (2005): 15545-15550. * [[https://www.sciencedirect.com/science/article/pii/S2666675821000667|[Wu et al.]]] Wu, Tianzhi, et al. "clusterProfiler 4.0: A universal enrichment tool for interpreting omics data." The Innovation 2.3 (2021): 100141. * [[https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0049475|[Kong et al.]]] Kong, Sek Won, et al. "Characteristics and predictive value of blood transcriptome signature in males with autism spectrum disorders." PloS one 7.12 (2012): e49475. * [[https://www.nature.com/articles/s41576-018-0066-2|[Quesnel-Vallières et al.]]] Quesnel-Vallieres, Mathieu, et al. "Autism spectrum disorder: insights into convergent mechanisms from transcriptomics." Nature Reviews Genetics 20.1 (2019): 51-63. ===== Materials ==== The supplement for this tutorial contains a Rmd file and the input datasets. The zip file can be downloaded {{ :courses:bin:tutorials:go_asd_tutorial.zip |here}}. The Rmd file contains more experiments, namely with different annotation datasets (KEGG pathways, transcription factors) and graphical outputs.