====== Tutorial 10 - Gene expression data analysis ======

{{ :courses:bin:tutorials:graphic-5.large.jpg?400|}}

In the previous tutorial, we looked at the process of collecting and assembling gene expression data. In this tutorial we will reproduce a breakthrough experiment [1] (in a simplified scenario, of course) concerning the analysis of such data. Along the way, we will also learn about PCA, a popular dimensionality reduction method.

===== The challenge =====

  * We have a small number of observations (~10^1) and a large number of features (~10^3).
  * The problems to expect from this include false hypotheses and overfitting.
  * Interpretability: are the expressed genes the causal ones?

===== Possible solution =====

  * Decrease the number of hypotheses.
  * Analyze the data in terms of more //abstract entities// than genes, e.g. principal components.

===== PCA - motivation =====

{{ :courses:bin:tutorials:pca-eigenfaces.png?400|PCA illustrated on eigenfaces as in [2]}}

PCA (principal component analysis) is a dimensionality reduction method that exploits the correlations among features in the data. Our parameters and variables are the following:

  * M -- number of genes
  * N -- number of samples
  * K -- number of eigengenes, i.e. the number of underlying concepts
  * X -- an (N x M) matrix; the GE data in the space of genes
  * V -- an (M x K) matrix; the transformation basis, the eigengenes
  * Z -- an (N x K) matrix; the transformed GE data in the space of eigengenes

====== The assignment ======

We assume you have a working installation of Matlab. If you don't, a free license is available to university students. First, download and extract the file: {{ :courses:bin:tutorials:ge_assignment.zip |}}

==== Data ====

Taken from [1]:

  * 7,129 GE profiles of 72 patients
  * 25 samples: acute myeloid leukaemia (AML)
  * 47 samples: acute lymphoblastic leukaemia (ALL)

==== The task ====

Construct a decision model to differentiate between these two types of cancer.
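Before turning to the Matlab script, the PCA notation from the motivation section can be illustrated with a small, self-contained NumPy sketch. The data here are synthetic stand-ins (the real matrix X comes from the assignment archive), and the dimensions N, M, K and matrices X, V, Z follow the notation above:

```python
import numpy as np

# Synthetic stand-in for the expression matrix: N samples x M genes.
rng = np.random.default_rng(0)
N, M, K = 72, 200, 5
X = rng.normal(size=(N, M))

# PCA works on mean-centred features.
Xc = X - X.mean(axis=0)

# SVD yields the principal axes: the columns of V are the "eigengenes".
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
V = Vt[:K].T          # (M x K) basis of the top-K eigengenes
Z = Xc @ V            # (N x K) data expressed in eigengene space

print(Z.shape)        # prints (72, 5)

# Projecting back reconstructs X up to the discarded components.
X_hat = Z @ V.T
```

Note that Z has one row per sample but only K columns, so any model learned on Z faces far fewer candidate hypotheses than one learned on the raw gene space.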
Just complete the code in the attached script ''ge_cv.m'' (or ''ge_cv_matlab2015.m'').

=== Part 1 ===

  - Learn a decision tree on the given data. Use the Matlab class ClassificationTree and its method fit.
  - Show the tree (method view) and report its training accuracy.
  - How would you interpret this model? Which gene is crucial for the decision?
  - Is this gene really the one causing the cancer? Look it up in the article [1] (Golub et al., 1999).
  - Estimate the real accuracy of the tree, e.g. by cross-validation (alternatively, you can split the data).
  - Compare it with the training accuracy.

=== Part 2 ===

  - Learn a basis matrix V of the data. Use the attached function ''pca.m''.
  - For a range of component numbers K:
    - Project the original data X onto the top K components of V. The result is the data matrix Z with reduced dimensionality.
    - Learn a tree on these reduced data. Show it and report its training accuracy.
  - Compare all the trees learned on the reduced data and pick the "best" one according to its accuracy and structure. Follow Occam's razor.
  - Estimate the real accuracy of the "best" tree, again e.g. by cross-validation.
  - Extract the genes active in the discriminative components. The discriminative components are those vectors of the basis matrix V which correspond to the features your tree consists of. To extract the active genes from a component, use the function ''mineGenes''.
  - OPTIONAL: The gene sets related to the discriminative components should hopefully correspond to some abstract biological processes. Use [[http://cbl-gorilla.cs.technion.ac.il/|GOrilla]] to enrich these gene sets with Gene Ontology terms.

===== Bibliography =====

  * [1] Golub et al.: Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 1999.
  * [2] Lee et al.: Learning the parts of objects by non-negative matrix factorization. Nature, 1999.

===== Materials =====

{{ :courses:bin:tutorials:ge_seminar.pdf |Old slides}}
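For orientation, the Part 1 / Part 2 workflow can be sketched in Python, with scikit-learn's DecisionTreeClassifier and PCA standing in for Matlab's ClassificationTree and the attached ''pca.m''. The data and labels below are synthetic stand-ins for the AML/ALL dataset, and the loading-based gene selection at the end is only a rough analogue of ''mineGenes'' — this is an illustrative outline, not the assignment solution:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Synthetic stand-ins: 72 samples x 500 genes, binary AML/ALL labels.
X = rng.normal(size=(72, 500))
y = rng.integers(0, 2, size=72)

# Part 1: a tree in the raw gene space. A fully grown tree fits the
# training set perfectly, so training accuracy is wildly optimistic;
# cross-validation estimates the real accuracy.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
train_acc = tree.score(X, y)
cv_acc = cross_val_score(DecisionTreeClassifier(random_state=0),
                         X, y, cv=5).mean()

# Part 2: project onto the top-K eigengenes, then learn a tree on Z.
for K in (2, 5, 10):
    Z = PCA(n_components=K).fit_transform(X)
    tree_k = DecisionTreeClassifier(random_state=0).fit(Z, y)
    cv_k = cross_val_score(DecisionTreeClassifier(random_state=0),
                           Z, y, cv=5).mean()
    print(K, tree_k.score(Z, y), round(cv_k, 2))

# Rough analogue of mineGenes: the genes with the largest absolute
# loadings in a discriminative component.
pca = PCA(n_components=5).fit(X)
top_genes = np.argsort(np.abs(pca.components_[0]))[::-1][:10]
```

On random data the cross-validated accuracy hovers around chance level, which is exactly the gap between training and real accuracy that the assignment asks you to observe.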