Gene Ontology is the world’s largest knowledgebase of the functions of genes. In this notebook we will demonstrate its utilization for set-level analysis of gene expression data. The main goals will be:
Kong et al. presents the results of a large blood transcriptome study that aims to evaluate the utility of gene expression profiling in the diagnosis of Autism Spectrum Disorders (ASD). The full gene expression dataset is available through Gene Expression Omnibus. In this notebook, we will work with a RData expression file that contains 54,613 features (detection probes) and 99 samples (individuals). 66 individuals suffer with an ASD, 33 individuals represent the control healthy group. This file corresponds to the train dataset referenced in the paper.
There are four classes in the expression set: CONTROL, AUTISM, PDD-NOS, ASPERGER’S DISORDER. We will merge the three non-control groups into a single category referred to as Autism spectrum disorder (ASD).
## [1] "title"
## [2] "geo_accession"
## [3] "status"
## [4] "submission_date"
## [5] "last_update_date"
## [6] "type"
## [7] "channel_count"
## [8] "source_name_ch1"
## [9] "organism_ch1"
## [10] "characteristics_ch1"
## [11] "characteristics_ch1.1"
## [12] "characteristics_ch1.2"
## [13] "characteristics_ch1.3"
## [14] "characteristics_ch1.4"
## [15] "characteristics_ch1.5"
## [16] "characteristics_ch1.6"
## [17] "characteristics_ch1.7"
## [18] "characteristics_ch1.8"
## [19] "characteristics_ch1.9"
## [20] "characteristics_ch1.10"
## [21] "characteristics_ch1.11"
## [22] "characteristics_ch1.12"
## [23] "characteristics_ch1.13"
## [24] "characteristics_ch1.14"
## [25] "characteristics_ch1.15"
## [26] "characteristics_ch1.16"
## [27] "characteristics_ch1.17"
## [28] "characteristics_ch1.18"
## [29] "characteristics_ch1.19"
## [30] "characteristics_ch1.20"
## [31] "molecule_ch1"
## [32] "extract_protocol_ch1"
## [33] "label_ch1"
## [34] "label_protocol_ch1"
## [35] "taxid_ch1"
## [36] "hyb_protocol"
## [37] "scan_protocol"
## [38] "description"
## [39] "data_processing"
## [40] "platform_id"
## [41] "contact_name"
## [42] "contact_email"
## [43] "contact_phone"
## [44] "contact_laboratory"
## [45] "contact_department"
## [46] "contact_institute"
## [47] "contact_address"
## [48] "contact_city"
## [49] "contact_state"
## [50] "contact_zip/postal_code"
## [51] "contact_country"
## [52] "supplementary_file"
## [53] "data_row_count"
## [54] "age at blood drawing (months):ch1"
## [55] "allergies:ch1"
## [56] "autoimmune disorder:ch1"
## [57] "birth defects:ch1"
## [58] "cerebral palsy:ch1"
## [59] "chronic diseases:ch1"
## [60] "ct:ch1"
## [61] "developmental/speech disorder:ch1"
## [62] "diagnosis:ch1"
## [63] "ethnicity:ch1"
## [64] "gastrointestinal disorder:ch1"
## [65] "gender:ch1"
## [66] "genetic testing:ch1"
## [67] "hours (minutes since last caloric intake):ch1"
## [68] "medications/vitamin names:ch1"
## [69] "mri:ch1"
## [70] "neurological disorder:ch1"
## [71] "other diseases:ch1"
## [72] "psychiatric disorder:ch1"
## [73] "race:ch1"
## [74] "seizure:ch1"
## [75] "group"
## [1] PDD-NOS PDD-NOS AUTISM
## [4] AUTISM AUTISM PDD-NOS
## [7] PDD-NOS AUTISM ASPERGER'S DISORDER
## [10] AUTISM ASPERGER'S DISORDER PDD-NOS
## [13] ASPERGER'S DISORDER PDD-NOS PDD-NOS
## [16] AUTISM AUTISM AUTISM
## [19] PDD-NOS AUTISM AUTISM
## [22] AUTISM PDD-NOS ASPERGER'S DISORDER
## [25] AUTISM AUTISM AUTISM
## [28] PDD-NOS AUTISM PDD-NOS
## [31] AUTISM ASPERGER'S DISORDER PDD-NOS
## [34] AUTISM AUTISM PDD-NOS
## [37] PDD-NOS PDD-NOS AUTISM
## [40] PDD-NOS PDD-NOS PDD-NOS
## [43] AUTISM PDD-NOS AUTISM
## [46] AUTISM AUTISM PDD-NOS
## [49] AUTISM AUTISM AUTISM
## [52] ASPERGER'S DISORDER AUTISM AUTISM
## [55] AUTISM AUTISM ASPERGER'S DISORDER
## [58] PDD-NOS ASPERGER'S DISORDER ASPERGER'S DISORDER
## [61] AUTISM PDD-NOS PDD-NOS
## [64] PDD-NOS PDD-NOS PDD-NOS
## [67] CONTROL CONTROL CONTROL
## [70] CONTROL CONTROL CONTROL
## [73] CONTROL CONTROL CONTROL
## [76] CONTROL CONTROL CONTROL
## [79] CONTROL CONTROL CONTROL
## [82] CONTROL CONTROL CONTROL
## [85] CONTROL CONTROL CONTROL
## [88] CONTROL CONTROL CONTROL
## [91] CONTROL CONTROL CONTROL
## [94] CONTROL CONTROL CONTROL
## [97] CONTROL CONTROL CONTROL
## Levels: CONTROL AUTISM PDD-NOS ASPERGER'S DISORDER
## GSM650510 GSM650512 GSM650513 GSM650514 GSM650515 GSM650516 GSM650517
## 1007_s_at 26.65640 107.6181 48.97026 72.65312 80.83264 40.13933 47.43693
## 1053_at 39.00726 135.7397 51.37555 102.65829 78.93955 81.75657 79.74300
## 117_at 296.01724 459.1438 378.86384 618.79730 266.87741 320.33815 685.72845
## 121_at 209.03575 232.0835 190.15020 185.90196 216.47953 172.25500 188.12218
## 1255_g_at 63.82730 50.2000 60.23913 22.03061 41.29200 74.49414 16.84841
## 1294_at 165.14969 360.1872 190.22052 260.85834 305.24385 286.53678 249.08788
## GSM650518 GSM650519 GSM650520 GSM650521 GSM650522 GSM650523 GSM650525
## 1007_s_at 57.90003 65.03053 43.14078 44.66914 112.48516 42.34660 67.71297
## 1053_at 74.16394 131.21986 62.78481 59.43699 110.43174 108.47401 76.09486
## 117_at 285.89160 996.76063 231.43584 418.68458 421.70080 449.31038 435.37965
## 121_at 155.99192 172.30327 141.44389 280.48566 213.13301 301.18909 234.26958
## 1255_g_at 48.88556 19.70762 17.45545 25.77657 39.78822 40.91114 30.19161
## 1294_at 412.38298 227.39853 393.23011 159.28096 238.38950 474.06981 358.50993
## GSM650526 GSM650527 GSM650528 GSM650529 GSM650530 GSM650531 GSM650532
## 1007_s_at 97.61767 86.20184 66.83669 112.60186 84.04476 88.51976 54.02114
## 1053_at 115.22141 105.21627 66.37294 40.22882 128.49734 90.94384 69.77820
## 117_at 482.63132 351.16903 442.39206 319.49443 526.06732 291.08733 391.89801
## 121_at 202.57670 191.83694 170.35430 178.74447 237.33150 231.47094 144.76965
## 1255_g_at 63.95302 35.09998 30.34217 41.24234 27.77132 62.36360 47.78701
## 1294_at 341.14757 250.76949 315.76797 297.34555 323.81813 310.73256 229.03069
## GSM650533 GSM650534 GSM650536 GSM650537 GSM650538 GSM650540 GSM650541
## 1007_s_at 49.04914 95.25455 30.55739 32.51340 76.10798 61.11153 70.53271
## 1053_at 53.30094 91.86629 38.08755 58.46904 95.66880 97.61994 57.76480
## 117_at 286.14483 188.09012 177.53415 238.23088 299.47317 311.33134 312.82048
## 121_at 251.38737 233.33160 190.83007 311.91319 185.35691 290.27232 253.54693
## 1255_g_at 45.04606 56.82903 31.59687 31.47276 36.42960 36.56611 39.40304
## 1294_at 262.26849 242.44354 192.50216 246.96233 240.79891 213.81507 292.37782
## GSM650542 GSM650543 GSM650544 GSM650545 GSM650546 GSM650547 GSM650548
## 1007_s_at 127.37630 55.17926 58.02352 46.28381 72.81389 92.34122 109.71356
## 1053_at 59.81668 104.90197 96.80858 87.26766 60.02479 94.34461 55.37355
## 117_at 300.70545 495.05182 509.04185 312.03245 489.62629 415.17741 442.54907
## 121_at 229.58772 227.93792 204.34015 297.33280 143.82368 262.96083 149.96356
## 1255_g_at 22.76424 52.86628 49.39887 34.18810 20.69595 53.38379 6.29048
## 1294_at 366.65715 194.83877 342.43099 181.69983 266.71635 276.19455 307.73590
## GSM650549 GSM650550 GSM650551 GSM650552 GSM650553 GSM650554 GSM650555
## 1007_s_at 92.14704 90.90995 105.08639 146.93503 78.28801 81.68491 82.02377
## 1053_at 95.75218 50.77397 110.89406 70.73290 63.95427 35.69693 40.44235
## 117_at 151.71754 347.63886 360.82290 273.84063 748.67212 410.17838 195.40343
## 121_at 280.49265 145.42998 218.02353 202.10174 163.21464 196.52260 269.76409
## 1255_g_at 43.15510 31.85688 24.02297 33.45226 54.44614 47.17705 55.44899
## 1294_at 228.34206 394.19645 224.67764 320.61316 283.30227 309.48430 264.58254
## GSM650556 GSM650557 GSM650558 GSM650560 GSM650561 GSM650562 GSM650563
## 1007_s_at 156.92131 77.46551 85.99289 94.88252 152.78473 105.98660 78.05807
## 1053_at 90.74336 46.96045 84.63094 40.58687 39.30229 33.82422 63.27953
## 117_at 161.36395 248.27982 485.06022 348.66766 208.82518 305.12205 341.88091
## 121_at 176.95763 279.04549 213.28606 207.84496 159.02063 192.77817 210.62787
## 1255_g_at 38.67305 39.54768 40.60540 8.44513 44.99761 53.31200 56.77954
## 1294_at 335.83368 270.75202 235.12873 375.20933 424.90983 370.19472 324.76240
## GSM650564 GSM650565 GSM650566 GSM650567 GSM650568 GSM650569 GSM650570
## 1007_s_at 166.40268 62.91733 58.61961 128.24129 88.45555 118.22818 69.03256
## 1053_at 38.78946 44.56538 33.02484 46.62806 29.36873 107.71534 16.78094
## 117_at 179.36373 319.07661 395.07780 299.10731 213.50369 716.61013 170.56964
## 121_at 128.38691 240.95580 287.15458 273.93943 175.58030 162.62681 279.35940
## 1255_g_at 23.47485 46.03113 36.06378 39.63614 23.56664 45.42461 50.26017
## 1294_at 284.99828 254.14767 159.74961 358.10752 270.03849 343.80083 187.74559
## GSM650571 GSM650572 GSM650573 GSM650574 GSM650575 GSM650577 GSM650578
## 1007_s_at 69.74568 116.98213 99.34677 76.87986 68.75603 103.28642 206.2436
## 1053_at 48.22536 92.15871 79.71097 91.87400 71.37006 90.53375 103.2923
## 117_at 320.91119 524.78802 520.51320 368.97094 742.61166 238.83626 330.6690
## 121_at 272.64131 222.97582 227.02233 182.92163 249.96396 263.33372 231.3220
## 1255_g_at 32.43698 38.61383 64.57411 49.49402 56.12527 66.30911 41.4669
## 1294_at 318.05537 264.53520 335.33309 244.52139 316.91943 329.81489 424.0644
## GSM650579 GSM650580 GSM650581 GSM650600 GSM650603 GSM650604 GSM650606
## 1007_s_at 62.61237 150.77520 165.40445 104.52255 171.31276 61.90472 78.72994
## 1053_at 29.77958 76.14664 71.08097 91.39947 75.20531 86.12535 50.16479
## 117_at 406.40342 231.10036 335.55447 290.05855 150.92785 820.24275 104.84900
## 121_at 222.25569 309.33096 271.97834 279.47867 234.82769 200.23679 207.01358
## 1255_g_at 48.21423 53.88799 46.95981 33.35752 38.25803 40.47013 8.65039
## 1294_at 359.82553 357.85310 382.47175 322.70023 349.84522 427.46915 206.51453
## GSM650607 GSM650614 GSM650615 GSM650617 GSM650618 GSM650619 GSM650620
## 1007_s_at 69.87071 127.12054 107.10228 60.89538 85.91566 64.04747 118.08883
## 1053_at 58.85608 81.37677 110.68482 115.79915 94.21142 42.34658 37.78299
## 117_at 181.05986 192.91026 580.58332 393.03068 100.02749 302.18980 278.09137
## 121_at 269.15970 254.46785 273.83700 287.53296 245.05935 140.15537 198.76899
## 1255_g_at 30.90586 77.25758 59.57494 15.08969 56.52948 0.08787 39.30338
## 1294_at 201.79583 217.34291 279.05300 425.46090 263.23253 248.57059 294.13025
## GSM650621 GSM650622 GSM650632 GSM650633 GSM650636 GSM650637 GSM650638
## 1007_s_at 72.11873 118.84221 11.97880 59.92202 86.49749 61.53625 87.46682
## 1053_at 22.22120 85.58917 25.50613 29.76794 55.63645 87.32867 102.22848
## 117_at 342.31566 280.24622 158.48301 198.31375 161.98425 530.47440 555.54914
## 121_at 208.35171 219.57346 108.62315 174.11616 258.86629 282.97858 204.59205
## 1255_g_at 38.20731 33.84125 29.00139 36.82613 55.28146 38.20192 39.98789
## 1294_at 391.36274 301.28988 125.84589 218.69022 277.73730 316.85612 368.69160
## GSM650639 GSM650640 GSM650641 GSM650642 GSM650643 GSM650644 GSM650645
## 1007_s_at 68.47716 76.07428 100.09395 134.68330 133.9225 81.02742 67.71790
## 1053_at 43.30464 31.13427 41.39585 30.79279 114.9599 157.17724 86.83151
## 117_at 187.92812 126.49112 161.00544 169.90919 502.2139 913.09583 285.02298
## 121_at 280.13799 189.34671 315.23597 269.91806 225.4656 212.27610 309.88049
## 1255_g_at 49.59823 30.35822 50.72183 31.46361 58.0416 42.52692 43.31678
## 1294_at 200.10303 177.30314 303.33246 284.05885 289.2569 309.59830 239.33292
## GSM650646 GSM650647 GSM650648 GSM650649 GSM650651 GSM650652 GSM650653
## 1007_s_at 49.50204 64.66049 110.58686 18.62053 66.15737 76.31541 59.93721
## 1053_at 106.72321 37.18311 45.04578 77.12382 92.94504 137.79575 26.70354
## 117_at 539.54601 75.35680 350.91103 488.36033 252.50309 329.28063 145.30082
## 121_at 206.78336 281.03506 189.41136 294.80112 274.37934 232.94968 204.35111
## 1255_g_at 21.37225 61.94335 37.83045 19.08806 43.59119 49.46663 36.18051
## 1294_at 321.91182 189.73190 476.11247 173.49481 213.45439 191.23652 305.75434
## GSM650654
## 1007_s_at 45.99937
## 1053_at 101.07505
## 117_at 276.80480
## 121_at 265.61533
## 1255_g_at 16.70463
## 1294_at 205.76415
The first step is to quantify differential expression for the individual probes. We deal with microarray data in this task, we will use limma library. Essentially, linear regression is applied.
design <- model.matrix(~ grouping)
# deals with log2 scaled data
fit <- limma::lmFit(log2(counts_of_interest), design)
result <- limma::eBayes(fit)
topTable(result)
## Removing intercept from test coefficients
# plot the top results, their ordering not kept
top12<-rownames(topTable(result,number=12))
## Removing intercept from test coefficients
df<-data.frame(probe=rep(top12,length(columns_of_interest)),class=rep(grouping,each=12),data=as.vector(exprs(gse18123)[rownames(exprs(gse18123)) %in% top12,columns_of_interest]))
ggplot(df, aes(x=probe, y=data, fill=class)) +
geom_boxplot() +
facet_wrap(~probe, scale="free")
Over-representation analysis determines whether genes from a pre-defined set (in here, those beloging to a specific GO term) are present more than would be expected (over-represented) in a subset of your data (in here, a set of differentially expressed genes). In fact, it is a repetitive application of hypergeometric test.
# identify differentially expressed genes first, adjusted p.value must be smaller than 0.05, 412 genes selected (around 0.7%)
DEGs<-topTable(result,number=Inf,p.value=0.05)
## Removing intercept from test coefficients
# get their annotation
dbq<-AnnotationDbi::select(hgu133plus2.db, keys=rownames(gse18123), keytype="PROBEID",columns = c("SYMBOL","ENTREZID"),multiVals="list")
## 'select()' returned 1:many mapping between keys and columns
annot<-dbq[match(unique(dbq$PROBEID), dbq$PROBEID),]
DEGs_ENTREZID<-annot$ENTREZID[annot$PROBEID %in% rownames(DEGs)]
DEGs_ENTREZID<-DEGs_ENTREZID[!is.na(DEGs_ENTREZID)]
# use clusterProfiler with built-in gene lists
ans.go <- enrichGO(gene = DEGs_ENTREZID, ont = "BP",
OrgDb ="org.Hs.eg.db",
universe = annot$ENTREZID,
readable=TRUE,
pvalueCutoff = 1, qvalueCutoff=1)
tab.go <- as.data.frame(ans.go)
tab.go<- subset(tab.go, Count>5)
tab.go[1:5, 1:6]
# show multiple scores at once, adjusted p-value as well as count, sort by adjusted p-value
# count is the number of DE genes that belong to a given gene-set
barplot(ans.go, showCategory=10)
# visualizes gene sets as a network, mutually overlapping gene sets tend to cluster together, making it easier for interpretation.
#emapplot(ans.go)
emapplot(pairwise_termsim(ans.go))
ans.kegg <- enrichKEGG(gene = DEGs_ENTREZID,
organism = 'hsa',
universe = annot$ENTREZID,
pvalueCutoff = 0.05, qvalueCutoff=1)
## Reading KEGG annotation online:
## Reading KEGG annotation online:
tab.kegg <- as.data.frame(ans.kegg)
tab.kegg<- subset(tab.kegg, Count>5)
tab.kegg[1:5, 1:6]
# show multiple scores at once, GeneRatio, adjusted p-value and count
# GeneRatio=count/setSize, where 'count' is the number of DE genes that belong to a given gene-set, while 'setSize' is the total number of genes in the gene-set
# pushes up larger gene sets, such as Endocytosis (contains 249 genes)
dotplot(ans.kegg, showCategory=12) + ggtitle("KEGG")
# visualize particular pathway
logFC <- DEGs$logFC
names(logFC) <- DEGs_ENTREZID
pathview(gene.data = logFC,
pathway.id = "hsa04140",
species = "hsa",
limit = list(gene=2, cpd=1))
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /home/jiri/ownCloud/oCloud/Vyuka/BIN/cviceni/Enrich_autism_tutorial
## Info: Writing image file hsa04140.pathview.png
# deal with external gene lists
c3.tf <- read.gmt("c3.tft.v7.5.1.entrez.gmt")
# by default, abnalyze only the sets with minimum size 10 and maximum size 500
ans.tf <- enricher(DEGs_ENTREZID, TERM2GENE=c3.tf,universe = annot$ENTREZID)
#ans.tf <- enricher(DEGs_ENTREZID, TERM2GENE=c3.tf,universe = annot$ENTREZID, pvalueCutoff = 1, qvalueCutoff = 1) # full result to compare with fgsea
tab.tf <- as.data.frame(ans.tf)
tab.tf.large<- subset(tab.tf, Count>5)
tab.tf.large[1:5,1:5]
# visualizes gene sets as a network, mutually overlapping gene sets tend to cluster together, making it easier for interpretation.
#emapplot(ans.tf)
emapplot(pairwise_termsim(ans.tf))
# run Fisher test for the NFMUE1_Q6 list
# inspect the list first
head(c3.tf)
NFMUE1_Q6<-c3.tf$gene[c3.tf$term=="NFMUE1_Q6"] # older call: NFMUE1_Q6<-c3.tf$gene[c3.tf$ont=="NFMUE1_Q6"]
annot[annot$ENTREZID %in% NFMUE1_Q6,] # 723 probes
# construct the contingency table and run the hypergeometric test
f_NFMUE1_Q6<-factor(unique(annot$ENTREZID) %in% NFMUE1_Q6)
f_DEGs_ENTREZID<-factor(unique(annot$ENTREZID) %in% DEGs_ENTREZID)
table(f_DEGs_ENTREZID,f_NFMUE1_Q6)
## f_NFMUE1_Q6
## f_DEGs_ENTREZID FALSE TRUE
## FALSE 20460 217
## TRUE 697 32
fisher.test(f_DEGs_ENTREZID,f_NFMUE1_Q6)
##
## Fisher's Exact Test for Count Data
##
## data: f_DEGs_ENTREZID and f_NFMUE1_Q6
## p-value = 1.156e-10
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 2.866135 6.349123
## sample estimates:
## odds ratio
## 4.328167
Over-representation analysis may suffer from a couple of limitations. First, after correcting for multiple hypotheses testing, no individual gene may meet the threshold for statistical significance. The relevant biological differences could be modest relative to the noise inherent to the expression measurement. Second, the selection of a set of DEGs leads to information loss as the quantitative information about gene expression is binarized. Again, subtle differential expression patterns spread accross entire groups of mutually linked genes could be omitted. GSEA ([Subramanian et al.](https://www.pnas.org/doi/10.1073/pnas.0506580102 ” Subramanian, Aravind, et al. “Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles.” Proceedings of the National Academy of Sciences 102.43 (2005): 15545-15550.”)) avoids these bottlenecks.
probe.tstats<-topTable(result,number=Inf)$t
## Removing intercept from test coefficients
names(probe.tstats)<-annot$ENTREZID[match(rownames(topTable(result,number=Inf)),annot$PROBEID)]
## Removing intercept from test coefficients
probe.tstats<-probe.tstats[!is.na(names(probe.tstats))]
gene.tstats<-deframe(aggregate(probe.tstats,list(names(probe.tstats)),function(x) x[which(abs(x)==max(abs(x)))]))
gene.tstats<-deframe(aggregate(probe.tstats,list(names(probe.tstats)),mean))
c3.tf.lists <- gmtPathways("c3.tft.v7.5.1.entrez.gmt")
fgseaRes <- fgsea(pathways=c3.tf.lists, stats=gene.tstats, minSize=10, maxSize=500,nperm=100000)
## Warning in fgsea(pathways = c3.tf.lists, stats = gene.tstats, minSize = 10, :
## You are trying to run fgseaSimple. It is recommended to use fgseaMultilevel. To
## run fgseaMultilevel, you need to remove the nperm argument in the fgsea function
## call.
View(fgseaRes[order(pval)])
# low p-values found again, however low ability to distinguish between gene sets
fgseaRes[fgseaRes$pathway %in% tab.tf[1:5,]$ID,]
# compare p-val ranking
cor.test(fgseaRes$pval,tab.tf$pvalue[match(fgseaRes$pathway,tab.tf$ID)],method="spearman")
## Warning in cor.test.default(fgseaRes$pval,
## tab.tf$pvalue[match(fgseaRes$pathway, : Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: fgseaRes$pval and tab.tf$pvalue[match(fgseaRes$pathway, tab.tf$ID)]
## S = 38877, p-value = 6.413e-06
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.4889777
convgo<-function(go_name){
terms<-strsplit(go_name," ")[[1]]
return(toupper(paste(c("GOBP",terms),collapse="_")))
}
c5.go.bp.lists <- gmtPathways("c5.go.bp.v7.5.1.entrez.gmt")
fgseaRes <- fgsea(pathways=c5.go.bp.lists, stats=gene.tstats, minSize=10, maxSize=500,nperm=100000)
## Warning in fgsea(pathways = c5.go.bp.lists, stats = gene.tstats, minSize = 10, :
## You are trying to run fgseaSimple. It is recommended to use fgseaMultilevel. To
## run fgseaMultilevel, you need to remove the nperm argument in the fgsea function
## call.
View(fgseaRes[order(pval)])
# low p-values found again, however low ability to distinguish between gene sets
fgseaRes[fgseaRes$pathway %in% sapply(tab.go[1:5,]$Description,convgo)]
# compare p-val ranking
cor.test(fgseaRes$pval,tab.go$pvalue[match(fgseaRes$pathway,sapply(tab.go$Description,convgo))],method="spearman")
## Warning in cor.test.default(fgseaRes$pval,
## tab.go$pvalue[match(fgseaRes$pathway, : Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: fgseaRes$pval and tab.go$pvalue[match(fgseaRes$pathway, sapply(tab.go$Description, convgo))]
## S = 57826994, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.4873761
See the chunk in the Rmd script attached below. Inspect it, run it and explain whether:
Eventually, propose your own way of gene-set-based feature extraction for ASD detection.
# classify ASD and control samples
# start with probes
# learn an elastic network, obtain its cross-validated performance
cvfit <- cv.glmnet(t(counts_of_interest),grouping,family="binomial",type.measure="auc",nfolds=5)
plot(cvfit,main="GLMNET: ASD classification with probes")
# get selected probes, i.e., the probes used for classification
tmp_coeffs <- coef(cvfit, s = "lambda.min")
lasso_features<-data.frame(name = tmp_coeffs@Dimnames[[1]][tmp_coeffs@i+1], coefficient = tmp_coeffs@x)[-1,]
annot[annot$PROBEID %in% lasso_features$name,]
# use SVD to identify the first principal component in each GS
svdGS<-function(TrainingDataSet){
decomp<-svd(TrainingDataSet)
u<-decomp$u
inv_u<-tryCatch(leftinverse(u),error=function(err){
inv_u<-t(u)*NA
})
dv<-diag(decomp$d)%*%t(decomp$v)
return(dv[1,])
}
# precompute aggregate expressions for GO terms
c5.go.bp.lengths<-sapply(c5.go.bp.lists,length)
gs_counts<-t(sapply(c5.go.bp.lists[c5.go.bp.lengths>10 & c5.go.bp.lengths<100],function(x) svdGS(counts_of_interest[annot$ENTREZID %in% x,])))
# learn an elastic network to classify with them
cvfit <- cv.glmnet(t(gs_counts),grouping,family="binomial",type.measure="auc",nfolds=5)
plot(cvfit,main="GLMNET: ASD classification with GO terms")
tmp_coeffs <- coef(cvfit, s = "lambda.1se")
lasso_features<-data.frame(name = tmp_coeffs@Dimnames[[1]][tmp_coeffs@i+1], coefficient = tmp_coeffs@x)[-1,]
lasso_features
tab.go[sapply(tab.go$Description,convgo) %in% as.character(lasso_features$name),] # no big overlap with ORA
fgseaRes[fgseaRes$pathway %in% as.character(lasso_features$name)] # neither with GSEA