
Tutorial 10 - Gene expression data analysis

In the previous tutorial, we looked at how gene expression data are collected and assembled. In this tutorial, we will reproduce a breakthrough experiment [1] on the analysis of such data (in a simplified scenario, of course). Along the way, we will also learn about PCA, a popular dimensionality reduction method.

The challenge

Possible solution

PCA - motivation

PCA illustrated on eigenfaces, as in [2].

PCA (principal component analysis) is a dimensionality reduction method that exploits the correlations among features in the data. Here, our parameters and variables are the following:

The assignment

We assume you have Matlab installed. If you don't, a free license is available to university students. First, download and extract the file: ge_assignment.zip

Data

Taken from [1].

The task

Construct a decision model that differentiates between these types of cancer. To do so, complete the code in the attached script ge_cv.m (or ge_cv_matlab2015.m).

Part 1

  1. Learn a decision tree on the given data. Use the Matlab class ClassificationTree and its method fit.
  2. Show the tree (method view) and report its training accuracy.
  3. How would you interpret this model? Which gene is crucial for the decision?
  4. Is this gene really the one causing the cancer? Look it up in the article [1] (Golub et al., 1999).
  5. Estimate the real accuracy of the tree, i.e. its accuracy on unseen data. Use, e.g., cross-validation (alternatively, you can split the data into a training and a test set).
  6. Compare it with the training accuracy.
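The steps of Part 1 can be sketched as follows. This is only a minimal illustration under stated assumptions, not the content of ge_cv.m: the data are random stand-ins (the variable names X and y and the informative feature 7 are hypothetical), and the sketch uses fitctree, the function that replaces ClassificationTree.fit in newer Matlab releases. It requires the Statistics and Machine Learning Toolbox.

```matlab
% Hypothetical stand-in data: 40 samples x 50 "genes", two classes.
% The real assignment data come from ge_assignment.zip instead.
rng(1);                                   % reproducible random numbers
X = randn(40, 50);                        % rows = samples, columns = features
y = [zeros(20, 1); ones(20, 1)];          % class labels
X(y == 1, 7) = X(y == 1, 7) + 3;          % make feature 7 informative

% Learn the tree (older releases: ClassificationTree.fit(X, y)).
tree = fitctree(X, y);
view(tree, 'Mode', 'text');               % 'graph' opens a figure instead

% Training (resubstitution) accuracy.
trainAcc = 1 - resubLoss(tree);

% 5-fold cross-validation as an estimate of accuracy on unseen data.
cvTree = crossval(tree, 'KFold', 5);
cvAcc  = 1 - kfoldLoss(cvTree);

fprintf('training accuracy: %.3f, cross-validated: %.3f\n', trainAcc, cvAcc);
```

The training accuracy is typically the more optimistic of the two numbers, because the tree is evaluated on the same samples it was fitted to.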

Part 2

  1. Learn a basis matrix V of the data. Use the attached function pca.m.
  2. For a range of component numbers K:
    1. Project the original data X onto the top K components of V. The result is data Z with reduced dimensionality:
    2. Create a tree from these reduced data. Show it and report its training accuracy.
  3. Compare all the trees resulting from the reduced data and pick the “best” one according to its accuracy and structure. Follow Occam's razor.
  4. Estimate the real accuracy of the chosen “best” tree, again using, e.g., cross-validation.
  5. Extract the genes active in the discriminative components. The discriminative components are those vectors of the basis matrix V that correspond to the features your tree uses. To extract the active genes from a component, use the function mineGenes.
  6. OPTIONAL The gene sets related to each of the discriminative components should hopefully correspond to some abstract biological processes. Use Gorilla to test these gene sets for enrichment in Gene Ontology terms.
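The projection and the per-K trees of Part 2 can be sketched as follows. This is a hedged illustration, not the attached code: the signatures of pca.m and mineGenes are not given here, so the sketch computes the basis with the built-in svd and ranks genes by absolute loading as a hypothetical stand-in for mineGenes. It also assumes that rows of X are samples and columns are genes, and it uses random stand-in data.

```matlab
% Hypothetical stand-in data: rows = samples, columns = genes (assumed layout).
rng(2);
n = 40; d = 50;
X = randn(n, d);
y = [zeros(n/2, 1); ones(n/2, 1)];
X(y == 1, 1:5) = X(y == 1, 1:5) + 2;      % a group of informative features

% PCA via SVD of the centered data (stand-in for the attached pca.m).
Xc = bsxfun(@minus, X, mean(X, 1));       % subtract the mean of each column
[~, ~, V] = svd(Xc, 'econ');              % columns of V: components, by decreasing variance

for K = [1 2 5 10]
    Z = Xc * V(:, 1:K);                   % n-by-K data with reduced dimensionality
    treeK = fitctree(Z, y);               % tree on the reduced data
    fprintf('K = %2d: training accuracy %.3f\n', K, 1 - resubLoss(treeK));
end

% Components the last tree actually splits on (importance > 0).
imp  = predictorImportance(treeK);
used = find(imp > 0);

% Stand-in for mineGenes: genes with the largest absolute loading
% in the first discriminative component.
if ~isempty(used)
    [~, idx] = sort(abs(V(:, used(1))), 'descend');
    fprintf('component %d, top genes by |loading|: %s\n', used(1), mat2str(idx(1:5)'));
end
```

Each column of Z is a linear combination of all genes, so a split on a component of Z is not a split on a single gene. This is why the discriminative components have to be mapped back to gene sets before they can be interpreted.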

Bibliography

[1] Golub et al., Science, 1999.

Materials

Old slides