In the previous tutorial, we looked at how gene expression data are collected and assembled. In this tutorial we will reproduce, in a simplified setting, a breakthrough experiment [1] on the analysis of such data. Along the way, we will also learn about PCA, a popular dimensionality reduction method.

- We have a small number of observations (~10^1) and a large number of features (~10^3).
- The expected problems include false hypotheses and overfitting.
- Interpretability: are the expressed genes the causal ones?

- Decrease the number of hypotheses
- Analyze in terms of more *abstract entities* than genes, e.g. principal components

PCA (principal component analysis) is a dimensionality reduction method exploiting the correlations among features in the data. Here, our parameters and variables are the following:

- M – number of genes
- N – number of samples
- K – number of eigengenes, i.e. the number of underlying concepts
- X – an (N x M) matrix; the GE data in the space of genes
- V – an (M x K) matrix; the transformation basis, i.e. the eigengenes
- Z – an (N x K) matrix; the transformed GE data in the space of eigengenes
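
The dimensions above fix how the three matrices relate. As a sketch, assuming that the columns of V are orthonormal and that X has been centered per gene (the attached code may handle centering differently), the projection and the approximate reconstruction $\hat{X}$ (a symbol introduced here for illustration) can be written as:

```latex
Z = X V, \qquad
X \in \mathbb{R}^{N \times M},\quad
V \in \mathbb{R}^{M \times K},\quad
Z \in \mathbb{R}^{N \times K},
\qquad
\hat{X} = Z V^{\top} \approx X .
```

Each row of Z describes one sample by K eigengene coordinates instead of M gene expression values.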

We assume you have Matlab installed. If you don't, a free license is available to university students. First, download and extract the file: ge_assignment.zip

The data are taken from [1].

- GE profiles of 72 patients, each covering 7,129 genes
- 25 samples: acute myeloid leukaemia (AML)
- 47 samples: acute lymphoblastic leukaemia (ALL)

Construct a decision model to differentiate these types of cancer. Just complete the code in the attached script `ge_cv.m` (or `ge_cv_matlab2015.m`).

- Learn a decision tree on the given data. Use the Matlab function `fitctree` (or the class `ClassificationTree` and its method `fit` for older Matlab versions).
- Show the tree (method `view`) and report its training accuracy.
- How would you interpret this model? Which gene is crucial for the decision?
- Is this gene really the one causing the cancer? Look it up in the article by Golub et al. [1].
- Estimate the real accuracy of the tree, e.g. by cross-validation (alternatively, you can split the data into a training and a test set).
- Compare it with the training accuracy.
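
The steps above can be sketched as follows. This is a minimal illustration, not the solution of `ge_cv.m`: it runs on synthetic data rather than the data from [1], and the variable names (`X`, `y`, `trainAcc`, `cvAcc`) as well as the choice of 10 folds are assumptions made here. It needs the Statistics and Machine Learning Toolbox.

```matlab
% Synthetic stand-in for the assignment data (NOT the data from [1]):
% N samples, M "genes", binary labels driven mostly by one gene.
rng(1);                                      % reproducible random numbers
N = 72;
M = 200;
X = randn(N, M);
y = double(X(:, 5) + 0.5 * randn(N, 1) > 0); % gene 5 carries the signal

% Learn a decision tree on all samples.
tree = fitctree(X, y);

% Training (resubstitution) accuracy: evaluated on the training data itself.
trainAcc = 1 - resubLoss(tree);

% Cross-validated accuracy: 10 folds is an assumption, not a requirement.
cvTree = crossval(tree, 'KFold', 10);
cvAcc  = 1 - kfoldLoss(cvTree);

fprintf('training accuracy:        %.3f\n', trainAcc);
fprintf('cross-validated accuracy: %.3f\n', cvAcc);

% view(tree, 'Mode', 'graph');   % uncomment to display the tree
```

With few samples and many features, the training accuracy tends to be optimistic, which is why the cross-validated estimate is the one to report.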

- Learn a basis matrix V of the data. Use the attached function `pca.m`.
- For a range of component numbers K:
  - Project the original data X onto the top K components of V. The result is the data Z with reduced dimensionality: Z = X V, where V is restricted to its top K columns.
  - Create a tree from these reduced data. Show it and report its training accuracy.

- Compare all the trees resulting from the reduced data and pick the “best” one according to its accuracy and structure. Follow Occam's razor.
- Estimate the real accuracy of the chosen “best” tree, again e.g. by cross-validation.
- Extract the genes active in the discriminative components. The discriminative components are those vectors of the basis matrix V that correspond to the features your tree consists of. To extract the active genes from a component, use the function `mineGenes`.
- OPTIONAL: The resulting gene sets, one per discriminative component, should hopefully correspond to some abstract biological processes. Use GOrilla to find the Gene Ontology terms enriched in these gene sets.
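
The loop over K can be sketched as follows. This is only an illustration under several assumptions: the data are synthetic, the basis V is computed with Matlab's `svd` as a stand-in for the attached `pca.m` (whose interface is not shown here), and sorting the absolute loadings is a hypothetical substitute for `mineGenes`, not a description of what that function does. The values in `Ks` are arbitrary.

```matlab
% Synthetic stand-in for the assignment data (NOT the data from [1]).
rng(1);
N = 72;
M = 200;
X = randn(N, M);
y = double(X(:, 5) + X(:, 17) > 0);

% Center each gene (column) before computing the basis.
Xc = bsxfun(@minus, X, mean(X, 1));

% Stand-in for pca.m: the right singular vectors of the centered data
% are the principal directions, ordered by decreasing variance.
[~, ~, V] = svd(Xc, 'econ');          % V is M x min(N, M)

Ks = [1 2 5 10 20];                   % arbitrary range of component numbers
trainAcc = zeros(size(Ks));
for i = 1:numel(Ks)
    K = Ks(i);
    Z = Xc * V(:, 1:K);               % N x K data in the space of eigengenes
    tree = fitctree(Z, y);            % tree learned on the reduced data
    trainAcc(i) = 1 - resubLoss(tree);
    fprintf('K = %2d: training accuracy %.3f\n', K, trainAcc(i));
end

% Hypothetical substitute for mineGenes: the 10 genes with the largest
% absolute loadings in component k.
k = 1;
[~, order] = sort(abs(V(:, k)), 'descend');
topGenes = order(1:10);
```

The predictors a tree splits on are columns of Z, so each one points to a column of V, and the large loadings of that column point back to individual genes.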

- [1] Golub et al.: Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 1999

- [2] Lee et al.: Learning the parts of objects by non-negative matrix factorization. Science, 1999

courses/bin/tutorials/tutorial10.txt · Last modified: 2021/04/26 11:48 by barvijac