Tutorial 10 - Gene expression data analysis
In the previous tutorial, we looked at how gene expression data are collected and assembled.
In this tutorial we will reproduce a breakthrough experiment in the analysis of such data [1] (in a simplified setting, of course). Along the way, we will also learn about PCA, a popular dimensionality reduction method.
The challenge
We have a small number of observations (~ 10^1) and a large number of features (~ 10^3).
The expected problems include spurious hypotheses and overfitting.
Interpretability: are the differentially expressed genes the causal ones?
Possible solution
Decrease the number of hypotheses
Analyze the data in terms of entities more abstract than individual genes, e.g. principal components
PCA - motivation
PCA (principal component analysis) is a dimensionality reduction method exploiting the correlations among features in the data.
Here, our parameters and variables are the following:
M – number of genes
N – number of samples
K – number of eigengenes, i.e. the number of underlying concepts
X – A (N x M) matrix; the GE data in the space of genes
V – A (M x K) matrix; the transformation basis, eigengenes
Z – A (N x K) matrix; transformed GE data in the space of eigengenes
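To make the dimensions concrete, the projection can be sketched as follows. This is a toy example with random data and reduced sizes; the attached pca.m used in the assignment may compute V differently.

```matlab
% Toy sketch of the PCA transformation (random data, small dimensions).
N = 10; M = 100; K = 3;                 % samples, genes, eigengenes
X = randn(N, M);                        % GE data in the space of genes
Xc = bsxfun(@minus, X, mean(X, 1));     % center each gene (column)
[V, ~] = eigs(cov(Xc), K);              % top-K eigenvectors: the (M x K) basis V
Z = Xc * V;                             % data in the space of eigengenes: (N x K)
```

Note that Z = X * V has exactly the dimensions listed above: (N x M) times (M x K) gives (N x K).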
The assignment
We assume you have an installation of Matlab. If you don't, a free license is available to university students.
First, download and extract the file: ge_assignment.zip
Data
Taken from [1].
Expression profiles of 7,129 genes for 72 patients
25 samples: acute myeloid leucaemia (AML)
47 samples: acute lymphoblastic leucaemia (ALL)
The task
Construct a decision model to differentiate these two types of cancer. Just complete the code in the attached script ge_cv.m (or ge_cv_matlab2015.m).
Part 1
Learn a decision tree on the given data. Use the Matlab function fitctree (or the class ClassificationTree and its method fit for older Matlab versions).
Show the tree (method view) and report its training accuracy.
How would you interpret this model? Which gene is crucial for the decision?
Is this gene really the one causing the cancer? Look it up in the article [1] (Golub et al., 1999).
Estimate the real accuracy of the tree, e.g. using cross-validation (alternatively, you can split the data into training and test sets).
Compare it with the training accuracy.
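Assuming the provided script has loaded the expression matrix X and the class labels y (these variable names are assumptions; the attached script may use different ones), Part 1 might be sketched as:

```matlab
tree = fitctree(X, y);                 % learn a decision tree on the data
view(tree, 'Mode', 'graph');           % display the tree graphically
trainAcc = 1 - resubLoss(tree)         % training (resubstitution) accuracy

cvTree = crossval(tree, 'KFold', 10);  % 10-fold cross-validation
cvAcc = 1 - kfoldLoss(cvTree)          % estimate of the real accuracy
```

Expect cvAcc to be noticeably lower than trainAcc: with ~10^3 features and ~10^1 samples per fold, the tree easily overfits.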
Part 2
Learn a basis-matrix V of the data. Use the attached function pca.m.
For a range of component numbers K:
project the original data X onto the top K components of V. The result is the data Z with reduced dimensionality.
Create a tree from these reduced data. Show it and report its training accuracy.
Compare all the trees learned from the reduced data and pick the "best" one according to its accuracy and structure, following Occam's razor.
Estimate the real accuracy of the chosen "best" tree, again by e.g. cross-validation.
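The loop over K might be sketched as below, assuming X and y as in Part 1 and assuming the attached pca.m returns the basis matrix V with components as columns (check its header; the actual interface may differ):

```matlab
V = pca(X);                       % basis matrix of eigengenes (attached pca.m)
for K = 1:10                      % try a range of component numbers
    Z = X * V(:, 1:K);            % project onto the top-K components: (N x K)
    tree = fitctree(Z, y);        % learn a tree in the space of eigengenes
    fprintf('K = %2d  training accuracy = %.3f\n', ...
            K, 1 - resubLoss(tree));
end
```

Because the tree now splits on eigengenes rather than individual genes, its features must be mapped back to genes afterwards (see the next step).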
Extract the genes active in the discriminative components. The discriminative components are those columns of the basis-matrix V that correspond to the features your tree splits on. To extract the active genes from a component, use the function mineGenes.
OPTIONAL: The gene sets related to each of the discriminative components should hopefully correspond to some abstract biological processes. Use GOrilla to enrich these gene sets with Gene Ontology terms.
Bibliography
[1] Golub, T. R., et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439), 531–537.
Materials