Gene Ontology is the world’s largest knowledgebase of the functions of genes. In this notebook we will demonstrate its utilization for set-level analysis of gene expression data. The main goals will be:

The input data

Kong et al. presents the results of a large blood transcriptome study that aims to evaluate the utility of gene expression profiling in the diagnosis of Autism Spectrum Disorders (ASD). The full gene expression dataset is available through Gene Expression Omnibus. In this notebook, we will work with a RData expression file that contains 54,613 features (detection probes) and 99 samples (individuals). 66 individuals suffer with an ASD, 33 individuals represent the control healthy group. This file corresponds to the train dataset referenced in the paper.

There are four classes in the expression set: CONTROL, AUTISM, PDD-NOS, ASPERGER’S DISORDER. We will merge the three non-control groups into a single category referred to as Autism spectrum disorder (ASD).

##  [1] "title"                                        
##  [2] "geo_accession"                                
##  [3] "status"                                       
##  [4] "submission_date"                              
##  [5] "last_update_date"                             
##  [6] "type"                                         
##  [7] "channel_count"                                
##  [8] "source_name_ch1"                              
##  [9] "organism_ch1"                                 
## [10] "characteristics_ch1"                          
## [11] "characteristics_ch1.1"                        
## [12] "characteristics_ch1.2"                        
## [13] "characteristics_ch1.3"                        
## [14] "characteristics_ch1.4"                        
## [15] "characteristics_ch1.5"                        
## [16] "characteristics_ch1.6"                        
## [17] "characteristics_ch1.7"                        
## [18] "characteristics_ch1.8"                        
## [19] "characteristics_ch1.9"                        
## [20] "characteristics_ch1.10"                       
## [21] "characteristics_ch1.11"                       
## [22] "characteristics_ch1.12"                       
## [23] "characteristics_ch1.13"                       
## [24] "characteristics_ch1.14"                       
## [25] "characteristics_ch1.15"                       
## [26] "characteristics_ch1.16"                       
## [27] "characteristics_ch1.17"                       
## [28] "characteristics_ch1.18"                       
## [29] "characteristics_ch1.19"                       
## [30] "characteristics_ch1.20"                       
## [31] "molecule_ch1"                                 
## [32] "extract_protocol_ch1"                         
## [33] "label_ch1"                                    
## [34] "label_protocol_ch1"                           
## [35] "taxid_ch1"                                    
## [36] "hyb_protocol"                                 
## [37] "scan_protocol"                                
## [38] "description"                                  
## [39] "data_processing"                              
## [40] "platform_id"                                  
## [41] "contact_name"                                 
## [42] "contact_email"                                
## [43] "contact_phone"                                
## [44] "contact_laboratory"                           
## [45] "contact_department"                           
## [46] "contact_institute"                            
## [47] "contact_address"                              
## [48] "contact_city"                                 
## [49] "contact_state"                                
## [50] "contact_zip/postal_code"                      
## [51] "contact_country"                              
## [52] "supplementary_file"                           
## [53] "data_row_count"                               
## [54] "age at blood drawing (months):ch1"            
## [55] "allergies:ch1"                                
## [56] "autoimmune disorder:ch1"                      
## [57] "birth defects:ch1"                            
## [58] "cerebral palsy:ch1"                           
## [59] "chronic diseases:ch1"                         
## [60] "ct:ch1"                                       
## [61] "developmental/speech disorder:ch1"            
## [62] "diagnosis:ch1"                                
## [63] "ethnicity:ch1"                                
## [64] "gastrointestinal disorder:ch1"                
## [65] "gender:ch1"                                   
## [66] "genetic testing:ch1"                          
## [67] "hours (minutes since last caloric intake):ch1"
## [68] "medications/vitamin names:ch1"                
## [69] "mri:ch1"                                      
## [70] "neurological disorder:ch1"                    
## [71] "other diseases:ch1"                           
## [72] "psychiatric disorder:ch1"                     
## [73] "race:ch1"                                     
## [74] "seizure:ch1"                                  
## [75] "group"
##  [1] PDD-NOS             PDD-NOS             AUTISM             
##  [4] AUTISM              AUTISM              PDD-NOS            
##  [7] PDD-NOS             AUTISM              ASPERGER'S DISORDER
## [10] AUTISM              ASPERGER'S DISORDER PDD-NOS            
## [13] ASPERGER'S DISORDER PDD-NOS             PDD-NOS            
## [16] AUTISM              AUTISM              AUTISM             
## [19] PDD-NOS             AUTISM              AUTISM             
## [22] AUTISM              PDD-NOS             ASPERGER'S DISORDER
## [25] AUTISM              AUTISM              AUTISM             
## [28] PDD-NOS             AUTISM              PDD-NOS            
## [31] AUTISM              ASPERGER'S DISORDER PDD-NOS            
## [34] AUTISM              AUTISM              PDD-NOS            
## [37] PDD-NOS             PDD-NOS             AUTISM             
## [40] PDD-NOS             PDD-NOS             PDD-NOS            
## [43] AUTISM              PDD-NOS             AUTISM             
## [46] AUTISM              AUTISM              PDD-NOS            
## [49] AUTISM              AUTISM              AUTISM             
## [52] ASPERGER'S DISORDER AUTISM              AUTISM             
## [55] AUTISM              AUTISM              ASPERGER'S DISORDER
## [58] PDD-NOS             ASPERGER'S DISORDER ASPERGER'S DISORDER
## [61] AUTISM              PDD-NOS             PDD-NOS            
## [64] PDD-NOS             PDD-NOS             PDD-NOS            
## [67] CONTROL             CONTROL             CONTROL            
## [70] CONTROL             CONTROL             CONTROL            
## [73] CONTROL             CONTROL             CONTROL            
## [76] CONTROL             CONTROL             CONTROL            
## [79] CONTROL             CONTROL             CONTROL            
## [82] CONTROL             CONTROL             CONTROL            
## [85] CONTROL             CONTROL             CONTROL            
## [88] CONTROL             CONTROL             CONTROL            
## [91] CONTROL             CONTROL             CONTROL            
## [94] CONTROL             CONTROL             CONTROL            
## [97] CONTROL             CONTROL             CONTROL            
## Levels: CONTROL AUTISM PDD-NOS ASPERGER'S DISORDER
##           GSM650510 GSM650512 GSM650513 GSM650514 GSM650515 GSM650516 GSM650517
## 1007_s_at  26.65640  107.6181  48.97026  72.65312  80.83264  40.13933  47.43693
## 1053_at    39.00726  135.7397  51.37555 102.65829  78.93955  81.75657  79.74300
## 117_at    296.01724  459.1438 378.86384 618.79730 266.87741 320.33815 685.72845
## 121_at    209.03575  232.0835 190.15020 185.90196 216.47953 172.25500 188.12218
## 1255_g_at  63.82730   50.2000  60.23913  22.03061  41.29200  74.49414  16.84841
## 1294_at   165.14969  360.1872 190.22052 260.85834 305.24385 286.53678 249.08788
##           GSM650518 GSM650519 GSM650520 GSM650521 GSM650522 GSM650523 GSM650525
## 1007_s_at  57.90003  65.03053  43.14078  44.66914 112.48516  42.34660  67.71297
## 1053_at    74.16394 131.21986  62.78481  59.43699 110.43174 108.47401  76.09486
## 117_at    285.89160 996.76063 231.43584 418.68458 421.70080 449.31038 435.37965
## 121_at    155.99192 172.30327 141.44389 280.48566 213.13301 301.18909 234.26958
## 1255_g_at  48.88556  19.70762  17.45545  25.77657  39.78822  40.91114  30.19161
## 1294_at   412.38298 227.39853 393.23011 159.28096 238.38950 474.06981 358.50993
##           GSM650526 GSM650527 GSM650528 GSM650529 GSM650530 GSM650531 GSM650532
## 1007_s_at  97.61767  86.20184  66.83669 112.60186  84.04476  88.51976  54.02114
## 1053_at   115.22141 105.21627  66.37294  40.22882 128.49734  90.94384  69.77820
## 117_at    482.63132 351.16903 442.39206 319.49443 526.06732 291.08733 391.89801
## 121_at    202.57670 191.83694 170.35430 178.74447 237.33150 231.47094 144.76965
## 1255_g_at  63.95302  35.09998  30.34217  41.24234  27.77132  62.36360  47.78701
## 1294_at   341.14757 250.76949 315.76797 297.34555 323.81813 310.73256 229.03069
##           GSM650533 GSM650534 GSM650536 GSM650537 GSM650538 GSM650540 GSM650541
## 1007_s_at  49.04914  95.25455  30.55739  32.51340  76.10798  61.11153  70.53271
## 1053_at    53.30094  91.86629  38.08755  58.46904  95.66880  97.61994  57.76480
## 117_at    286.14483 188.09012 177.53415 238.23088 299.47317 311.33134 312.82048
## 121_at    251.38737 233.33160 190.83007 311.91319 185.35691 290.27232 253.54693
## 1255_g_at  45.04606  56.82903  31.59687  31.47276  36.42960  36.56611  39.40304
## 1294_at   262.26849 242.44354 192.50216 246.96233 240.79891 213.81507 292.37782
##           GSM650542 GSM650543 GSM650544 GSM650545 GSM650546 GSM650547 GSM650548
## 1007_s_at 127.37630  55.17926  58.02352  46.28381  72.81389  92.34122 109.71356
## 1053_at    59.81668 104.90197  96.80858  87.26766  60.02479  94.34461  55.37355
## 117_at    300.70545 495.05182 509.04185 312.03245 489.62629 415.17741 442.54907
## 121_at    229.58772 227.93792 204.34015 297.33280 143.82368 262.96083 149.96356
## 1255_g_at  22.76424  52.86628  49.39887  34.18810  20.69595  53.38379   6.29048
## 1294_at   366.65715 194.83877 342.43099 181.69983 266.71635 276.19455 307.73590
##           GSM650549 GSM650550 GSM650551 GSM650552 GSM650553 GSM650554 GSM650555
## 1007_s_at  92.14704  90.90995 105.08639 146.93503  78.28801  81.68491  82.02377
## 1053_at    95.75218  50.77397 110.89406  70.73290  63.95427  35.69693  40.44235
## 117_at    151.71754 347.63886 360.82290 273.84063 748.67212 410.17838 195.40343
## 121_at    280.49265 145.42998 218.02353 202.10174 163.21464 196.52260 269.76409
## 1255_g_at  43.15510  31.85688  24.02297  33.45226  54.44614  47.17705  55.44899
## 1294_at   228.34206 394.19645 224.67764 320.61316 283.30227 309.48430 264.58254
##           GSM650556 GSM650557 GSM650558 GSM650560 GSM650561 GSM650562 GSM650563
## 1007_s_at 156.92131  77.46551  85.99289  94.88252 152.78473 105.98660  78.05807
## 1053_at    90.74336  46.96045  84.63094  40.58687  39.30229  33.82422  63.27953
## 117_at    161.36395 248.27982 485.06022 348.66766 208.82518 305.12205 341.88091
## 121_at    176.95763 279.04549 213.28606 207.84496 159.02063 192.77817 210.62787
## 1255_g_at  38.67305  39.54768  40.60540   8.44513  44.99761  53.31200  56.77954
## 1294_at   335.83368 270.75202 235.12873 375.20933 424.90983 370.19472 324.76240
##           GSM650564 GSM650565 GSM650566 GSM650567 GSM650568 GSM650569 GSM650570
## 1007_s_at 166.40268  62.91733  58.61961 128.24129  88.45555 118.22818  69.03256
## 1053_at    38.78946  44.56538  33.02484  46.62806  29.36873 107.71534  16.78094
## 117_at    179.36373 319.07661 395.07780 299.10731 213.50369 716.61013 170.56964
## 121_at    128.38691 240.95580 287.15458 273.93943 175.58030 162.62681 279.35940
## 1255_g_at  23.47485  46.03113  36.06378  39.63614  23.56664  45.42461  50.26017
## 1294_at   284.99828 254.14767 159.74961 358.10752 270.03849 343.80083 187.74559
##           GSM650571 GSM650572 GSM650573 GSM650574 GSM650575 GSM650577 GSM650578
## 1007_s_at  69.74568 116.98213  99.34677  76.87986  68.75603 103.28642  206.2436
## 1053_at    48.22536  92.15871  79.71097  91.87400  71.37006  90.53375  103.2923
## 117_at    320.91119 524.78802 520.51320 368.97094 742.61166 238.83626  330.6690
## 121_at    272.64131 222.97582 227.02233 182.92163 249.96396 263.33372  231.3220
## 1255_g_at  32.43698  38.61383  64.57411  49.49402  56.12527  66.30911   41.4669
## 1294_at   318.05537 264.53520 335.33309 244.52139 316.91943 329.81489  424.0644
##           GSM650579 GSM650580 GSM650581 GSM650600 GSM650603 GSM650604 GSM650606
## 1007_s_at  62.61237 150.77520 165.40445 104.52255 171.31276  61.90472  78.72994
## 1053_at    29.77958  76.14664  71.08097  91.39947  75.20531  86.12535  50.16479
## 117_at    406.40342 231.10036 335.55447 290.05855 150.92785 820.24275 104.84900
## 121_at    222.25569 309.33096 271.97834 279.47867 234.82769 200.23679 207.01358
## 1255_g_at  48.21423  53.88799  46.95981  33.35752  38.25803  40.47013   8.65039
## 1294_at   359.82553 357.85310 382.47175 322.70023 349.84522 427.46915 206.51453
##           GSM650607 GSM650614 GSM650615 GSM650617 GSM650618 GSM650619 GSM650620
## 1007_s_at  69.87071 127.12054 107.10228  60.89538  85.91566  64.04747 118.08883
## 1053_at    58.85608  81.37677 110.68482 115.79915  94.21142  42.34658  37.78299
## 117_at    181.05986 192.91026 580.58332 393.03068 100.02749 302.18980 278.09137
## 121_at    269.15970 254.46785 273.83700 287.53296 245.05935 140.15537 198.76899
## 1255_g_at  30.90586  77.25758  59.57494  15.08969  56.52948   0.08787  39.30338
## 1294_at   201.79583 217.34291 279.05300 425.46090 263.23253 248.57059 294.13025
##           GSM650621 GSM650622 GSM650632 GSM650633 GSM650636 GSM650637 GSM650638
## 1007_s_at  72.11873 118.84221  11.97880  59.92202  86.49749  61.53625  87.46682
## 1053_at    22.22120  85.58917  25.50613  29.76794  55.63645  87.32867 102.22848
## 117_at    342.31566 280.24622 158.48301 198.31375 161.98425 530.47440 555.54914
## 121_at    208.35171 219.57346 108.62315 174.11616 258.86629 282.97858 204.59205
## 1255_g_at  38.20731  33.84125  29.00139  36.82613  55.28146  38.20192  39.98789
## 1294_at   391.36274 301.28988 125.84589 218.69022 277.73730 316.85612 368.69160
##           GSM650639 GSM650640 GSM650641 GSM650642 GSM650643 GSM650644 GSM650645
## 1007_s_at  68.47716  76.07428 100.09395 134.68330  133.9225  81.02742  67.71790
## 1053_at    43.30464  31.13427  41.39585  30.79279  114.9599 157.17724  86.83151
## 117_at    187.92812 126.49112 161.00544 169.90919  502.2139 913.09583 285.02298
## 121_at    280.13799 189.34671 315.23597 269.91806  225.4656 212.27610 309.88049
## 1255_g_at  49.59823  30.35822  50.72183  31.46361   58.0416  42.52692  43.31678
## 1294_at   200.10303 177.30314 303.33246 284.05885  289.2569 309.59830 239.33292
##           GSM650646 GSM650647 GSM650648 GSM650649 GSM650651 GSM650652 GSM650653
## 1007_s_at  49.50204  64.66049 110.58686  18.62053  66.15737  76.31541  59.93721
## 1053_at   106.72321  37.18311  45.04578  77.12382  92.94504 137.79575  26.70354
## 117_at    539.54601  75.35680 350.91103 488.36033 252.50309 329.28063 145.30082
## 121_at    206.78336 281.03506 189.41136 294.80112 274.37934 232.94968 204.35111
## 1255_g_at  21.37225  61.94335  37.83045  19.08806  43.59119  49.46663  36.18051
## 1294_at   321.91182 189.73190 476.11247 173.49481 213.45439 191.23652 305.75434
##           GSM650654
## 1007_s_at  45.99937
## 1053_at   101.07505
## 117_at    276.80480
## 121_at    265.61533
## 1255_g_at  16.70463
## 1294_at   205.76415

Differential gene expression

The first step is to quantify differential expression for the individual probes. We deal with microarray data in this task, we will use limma library. Essentially, linear regression is applied.

design <- model.matrix(~ grouping)

# deals with log2 scaled data 
fit <- limma::lmFit(log2(counts_of_interest), design)
result <- limma::eBayes(fit)
topTable(result)
## Removing intercept from test coefficients
# plot the top results, their ordering not kept
top12<-rownames(topTable(result,number=12))
## Removing intercept from test coefficients
df<-data.frame(probe=rep(top12,length(columns_of_interest)),class=rep(grouping,each=12),data=as.vector(exprs(gse18123)[rownames(exprs(gse18123)) %in% top12,columns_of_interest]))
ggplot(df, aes(x=probe, y=data, fill=class)) + 
  geom_boxplot() +
  facet_wrap(~probe, scale="free")

Over-representation analysis

Over-representation analysis determines whether genes from a pre-defined set (in here, those beloging to a specific GO term) are present more than would be expected (over-represented) in a subset of your data (in here, a set of differentially expressed genes). In fact, it is a repetitive application of hypergeometric test.

# identify differentially expressed genes first, adjusted p.value must be smaller than 0.05, 412 genes selected (around 0.7%)
DEGs<-topTable(result,number=Inf,p.value=0.05)
## Removing intercept from test coefficients
# get their annotation
dbq<-AnnotationDbi::select(hgu133plus2.db, keys=rownames(gse18123), keytype="PROBEID",columns = c("SYMBOL","ENTREZID"),multiVals="list")
## 'select()' returned 1:many mapping between keys and columns
annot<-dbq[match(unique(dbq$PROBEID), dbq$PROBEID),]
DEGs_ENTREZID<-annot$ENTREZID[annot$PROBEID %in% rownames(DEGs)]
DEGs_ENTREZID<-DEGs_ENTREZID[!is.na(DEGs_ENTREZID)]

# use clusterProfiler with built-in gene lists
ans.go <- enrichGO(gene = DEGs_ENTREZID, ont = "BP",
                   OrgDb ="org.Hs.eg.db",
                   universe = annot$ENTREZID,
                   readable=TRUE,
                   pvalueCutoff = 1, qvalueCutoff=1)

tab.go <- as.data.frame(ans.go)
tab.go<- subset(tab.go, Count>5)
tab.go[1:5, 1:6]
# show multiple scores at once, adjusted p-value as well as count, sort by adjusted p-value
# count is the number of DE genes that belong to a given gene-set
barplot(ans.go, showCategory=10)

# visualizes gene sets as a network, mutually overlapping gene sets tend to cluster together, making it easier for interpretation.
#emapplot(ans.go)
emapplot(pairwise_termsim(ans.go))

ans.kegg <- enrichKEGG(gene = DEGs_ENTREZID,
                       organism = 'hsa',
                       universe = annot$ENTREZID,
                       pvalueCutoff = 0.05, qvalueCutoff=1)
## Reading KEGG annotation online:
## Reading KEGG annotation online:
tab.kegg <- as.data.frame(ans.kegg)
tab.kegg<- subset(tab.kegg, Count>5)
tab.kegg[1:5, 1:6]
# show multiple scores at once, GeneRatio, adjusted p-value and count
# GeneRatio=count/setSize, where 'count' is the number of DE genes that belong to a given gene-set, while 'setSize' is the total number of genes in the gene-set
# pushes up larger gene sets, such as Endocytosis (contains 249 genes)
dotplot(ans.kegg, showCategory=12) + ggtitle("KEGG")

# visualize particular pathway
logFC <- DEGs$logFC
names(logFC) <- DEGs_ENTREZID
pathview(gene.data = logFC, 
         pathway.id = "hsa04140", 
         species = "hsa", 
         limit = list(gene=2, cpd=1))
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /home/jiri/ownCloud/oCloud/Vyuka/BIN/cviceni/Enrich_autism_tutorial
## Info: Writing image file hsa04140.pathview.png
# deal with external gene lists
c3.tf <- read.gmt("c3.tft.v7.5.1.entrez.gmt")
# by default, abnalyze only the sets with minimum size 10 and maximum size 500
ans.tf <- enricher(DEGs_ENTREZID, TERM2GENE=c3.tf,universe = annot$ENTREZID)
#ans.tf <- enricher(DEGs_ENTREZID, TERM2GENE=c3.tf,universe = annot$ENTREZID, pvalueCutoff = 1, qvalueCutoff = 1) # full result to compare with fgsea
tab.tf <- as.data.frame(ans.tf)
tab.tf.large<- subset(tab.tf, Count>5)
tab.tf.large[1:5,1:5]
# visualizes gene sets as a network, mutually overlapping gene sets tend to cluster together, making it easier for interpretation.
#emapplot(ans.tf)
emapplot(pairwise_termsim(ans.tf))

# run Fisher test for the NFMUE1_Q6 list
# inspect the list first
head(c3.tf)
NFMUE1_Q6<-c3.tf$gene[c3.tf$term=="NFMUE1_Q6"] # older call: NFMUE1_Q6<-c3.tf$gene[c3.tf$ont=="NFMUE1_Q6"]
annot[annot$ENTREZID %in% NFMUE1_Q6,] # 723 probes
# construct the contingency table and run the hypergeometric test
f_NFMUE1_Q6<-factor(unique(annot$ENTREZID) %in% NFMUE1_Q6)
f_DEGs_ENTREZID<-factor(unique(annot$ENTREZID) %in% DEGs_ENTREZID)
table(f_DEGs_ENTREZID,f_NFMUE1_Q6)
##                f_NFMUE1_Q6
## f_DEGs_ENTREZID FALSE  TRUE
##           FALSE 20460   217
##           TRUE    697    32
fisher.test(f_DEGs_ENTREZID,f_NFMUE1_Q6)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  f_DEGs_ENTREZID and f_NFMUE1_Q6
## p-value = 1.156e-10
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  2.866135 6.349123
## sample estimates:
## odds ratio 
##   4.328167

Gene-set enrichment analysis

Over-representation analysis may suffer from a couple of limitations. First, after correcting for multiple hypotheses testing, no individual gene may meet the threshold for statistical significance. The relevant biological differences could be modest relative to the noise inherent to the expression measurement. Second, the selection of a set of DEGs leads to information loss as the quantitative information about gene expression is binarized. Again, subtle differential expression patterns spread accross entire groups of mutually linked genes could be omitted. GSEA ([Subramanian et al.](https://www.pnas.org/doi/10.1073/pnas.0506580102 ” Subramanian, Aravind, et al. “Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles.” Proceedings of the National Academy of Sciences 102.43 (2005): 15545-15550.”)) avoids these bottlenecks.

probe.tstats<-topTable(result,number=Inf)$t
## Removing intercept from test coefficients
names(probe.tstats)<-annot$ENTREZID[match(rownames(topTable(result,number=Inf)),annot$PROBEID)]
## Removing intercept from test coefficients
probe.tstats<-probe.tstats[!is.na(names(probe.tstats))]
gene.tstats<-deframe(aggregate(probe.tstats,list(names(probe.tstats)),function(x) x[which(abs(x)==max(abs(x)))]))
gene.tstats<-deframe(aggregate(probe.tstats,list(names(probe.tstats)),mean))
c3.tf.lists <- gmtPathways("c3.tft.v7.5.1.entrez.gmt")
fgseaRes <- fgsea(pathways=c3.tf.lists, stats=gene.tstats, minSize=10, maxSize=500,nperm=100000)
## Warning in fgsea(pathways = c3.tf.lists, stats = gene.tstats, minSize = 10, :
## You are trying to run fgseaSimple. It is recommended to use fgseaMultilevel. To
## run fgseaMultilevel, you need to remove the nperm argument in the fgsea function
## call.
View(fgseaRes[order(pval)])

# low p-values found again, however low ability to distinguish between gene sets
fgseaRes[fgseaRes$pathway %in% tab.tf[1:5,]$ID,]
# compare p-val ranking
cor.test(fgseaRes$pval,tab.tf$pvalue[match(fgseaRes$pathway,tab.tf$ID)],method="spearman")
## Warning in cor.test.default(fgseaRes$pval,
## tab.tf$pvalue[match(fgseaRes$pathway, : Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  fgseaRes$pval and tab.tf$pvalue[match(fgseaRes$pathway, tab.tf$ID)]
## S = 38877, p-value = 6.413e-06
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.4889777
convgo<-function(go_name){
  terms<-strsplit(go_name," ")[[1]]
  return(toupper(paste(c("GOBP",terms),collapse="_")))
}

c5.go.bp.lists <- gmtPathways("c5.go.bp.v7.5.1.entrez.gmt")
fgseaRes <- fgsea(pathways=c5.go.bp.lists, stats=gene.tstats, minSize=10, maxSize=500,nperm=100000)
## Warning in fgsea(pathways = c5.go.bp.lists, stats = gene.tstats, minSize = 10, :
## You are trying to run fgseaSimple. It is recommended to use fgseaMultilevel. To
## run fgseaMultilevel, you need to remove the nperm argument in the fgsea function
## call.
View(fgseaRes[order(pval)])
# low p-values found again, however low ability to distinguish between gene sets
fgseaRes[fgseaRes$pathway %in% sapply(tab.go[1:5,]$Description,convgo)]
# compare p-val ranking
cor.test(fgseaRes$pval,tab.go$pvalue[match(fgseaRes$pathway,sapply(tab.go$Description,convgo))],method="spearman")
## Warning in cor.test.default(fgseaRes$pval,
## tab.go$pvalue[match(fgseaRes$pathway, : Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  fgseaRes$pval and tab.go$pvalue[match(fgseaRes$pathway, sapply(tab.go$Description, convgo))]
## S = 57826994, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.4873761

Your task for this tutorial

See the chunk in the Rmd script attached below. Inspect it, run it and explain whether:

  • the peripheral blood expression profiling for ASD detection/classification may be feasible,
  • the outlined gene set aggregation helps in ASD detection,
  • is the outlined evaluation of gene set aggregation unbiased,
  • the proposed GO BP terms match the terms proposed earlier in terms of ORA and GSEA and whether they should match them in theory.

Eventually, propose your own way of gene-set-based feature extraction for ASD detection.

# classify ASD and control samples
# start with probes

# learn an elastic network, obtain its cross-validated performance
cvfit <- cv.glmnet(t(counts_of_interest),grouping,family="binomial",type.measure="auc",nfolds=5)
plot(cvfit,main="GLMNET: ASD classification with probes")

# get selected probes, i.e., the probes used for classification
tmp_coeffs <- coef(cvfit, s = "lambda.min")
lasso_features<-data.frame(name = tmp_coeffs@Dimnames[[1]][tmp_coeffs@i+1], coefficient = tmp_coeffs@x)[-1,]
annot[annot$PROBEID %in% lasso_features$name,]
# use SVD to identify the first principal component in each GS
svdGS<-function(TrainingDataSet){
  decomp<-svd(TrainingDataSet) 
  u<-decomp$u  
  inv_u<-tryCatch(leftinverse(u),error=function(err){        
    inv_u<-t(u)*NA   
  })
  dv<-diag(decomp$d)%*%t(decomp$v)
  return(dv[1,])
}

# precompute aggregate expressions for GO terms
c5.go.bp.lengths<-sapply(c5.go.bp.lists,length)
gs_counts<-t(sapply(c5.go.bp.lists[c5.go.bp.lengths>10 & c5.go.bp.lengths<100],function(x) svdGS(counts_of_interest[annot$ENTREZID %in% x,])))
# learn an elastic network to classify with them
cvfit <- cv.glmnet(t(gs_counts),grouping,family="binomial",type.measure="auc",nfolds=5)
plot(cvfit,main="GLMNET: ASD classification with GO terms")

tmp_coeffs <- coef(cvfit, s = "lambda.1se")
lasso_features<-data.frame(name = tmp_coeffs@Dimnames[[1]][tmp_coeffs@i+1], coefficient = tmp_coeffs@x)[-1,]
lasso_features
tab.go[sapply(tab.go$Description,convgo) %in% as.character(lasso_features$name),] # no big overlap with ORA
fgseaRes[fgseaRes$pathway %in% as.character(lasso_features$name)] # neither with GSEA