====== Tutorial 10 - Gene expression data analysis ======

{{ :courses:bin:tutorials:graphic-5.large.jpg?400|}}

In the previous tutorial, we looked at the process of collecting and assembling gene expression data. In this tutorial we will reproduce a breakthrough experiment [1] (in a simplified scenario, of course) concerning the analysis of such data. Along the way, we will also learn about PCA, a popular dimensionality reduction method.

===== The challenge =====

  * We have a small number of observations (~10^1) and a large number of features (~10^3).
  * The problems to expect from this include false hypotheses and overfitting.
  * Interpretability: are the expressed genes the causal ones?

===== Possible solution =====

  * Decrease the number of hypotheses.
  * Analyze the data in terms of more //abstract entities// than genes, e.g. principal components.

===== PCA - motivation =====

{{ :courses:bin:tutorials:pca-eigenfaces.png?400|PCA illustrated on eigenfaces as in [2]}}

PCA (principal component analysis) is a dimensionality reduction method that exploits the correlations among features in the data. Our parameters and variables are the following:

  * M -- number of genes
  * N -- number of samples
  * K -- number of eigengenes, i.e. the number of underlying concepts
  * X -- an (N x M) matrix; the GE data in the space of genes
  * V -- an (M x K) matrix; the transformation basis, the eigengenes
  * Z -- an (N x K) matrix; the transformed GE data in the space of eigengenes

====== The assignment ======

We assume you have a working installation of Matlab. If you don't, a free license is available to university students. First, download and extract the file: {{ :courses:bin:tutorials:ge_assignment.zip |}}

==== Data ====

Taken from [1]:

  * 7,129 GE profiles of 72 patients
  * 25 samples: acute myeloid leukaemia (AML)
  * 47 samples: acute lymphoblastic leukaemia (ALL)

==== The task ====

Construct a decision model to differentiate between these two types of cancer.
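Before turning to the Matlab script, the PCA notation from the motivation section can be illustrated with a small, self-contained NumPy sketch. The data here are synthetic stand-ins (the real matrix X comes from the assignment archive), and the dimensions N, M, K and matrices X, V, Z follow the notation above:

```python
import numpy as np

# Synthetic stand-in for the expression matrix: N samples x M genes.
rng = np.random.default_rng(0)
N, M, K = 72, 200, 5
X = rng.normal(size=(N, M))

# PCA works on mean-centred features.
Xc = X - X.mean(axis=0)

# SVD yields the principal axes: the columns of V are the "eigengenes".
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
V = Vt[:K].T          # (M x K) basis of the top-K eigengenes
Z = Xc @ V            # (N x K) data expressed in eigengene space

print(Z.shape)        # prints (72, 5)

# Projecting back reconstructs X up to the discarded components.
X_hat = Z @ V.T
```

Note that Z has one row per sample but only K columns, so any model learned on Z faces far fewer candidate hypotheses than one learned on the raw gene space.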
Just complete the code in the attached script ''ge_cv.m'' (or ''ge_cv_matlab2015.m'').

=== Part 1 ===

  - Learn a decision tree on the given data. Use the Matlab class ClassificationTree and its method fit.
  - Show the tree (method view) and report its training accuracy.
  - How would you interpret this model? Which gene is crucial for the decision?
  - Is this gene really the one causing the cancer? Look it up in the article [1] (Golub et al., 1999).
  - Estimate the real accuracy of the tree, e.g. by cross-validation (alternatively, you can split the data).
  - Compare it with the training accuracy.

=== Part 2 ===

  - Learn a basis matrix V of the data. Use the attached function ''pca.m''.
  - For a range of component numbers K:
    - Project the original data X onto the top K components of V. The result is the data matrix Z with reduced dimensionality.
    - Learn a tree on these reduced data. Show it and report its training accuracy.
  - Compare all the trees learned on the reduced data and pick the "best" one according to its accuracy and structure. Follow Occam's razor.
  - Estimate the real accuracy of the "best" tree, again e.g. by cross-validation.
  - Extract the genes active in the discriminative components. The discriminative components are those vectors of the basis matrix V which correspond to the features your tree consists of. To extract the active genes from a component, use the function ''mineGenes''.
  - OPTIONAL: The gene sets related to the discriminative components should hopefully correspond to some abstract biological processes. Use [[http://cbl-gorilla.cs.technion.ac.il/|GOrilla]] to enrich these gene sets with Gene Ontology terms.

===== Bibliography =====

  * [1] Golub et al.: Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 1999.
  * [2] Lee et al.: Learning the parts of objects by non-negative matrix factorization. Nature, 1999.

===== Materials =====

{{ :courses:bin:tutorials:ge_seminar.pdf |Old slides}}
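For orientation, the Part 1 / Part 2 workflow can be sketched in Python, with scikit-learn's DecisionTreeClassifier and PCA standing in for Matlab's ClassificationTree and the attached ''pca.m''. The data and labels below are synthetic stand-ins for the AML/ALL dataset, and the loading-based gene selection at the end is only a rough analogue of ''mineGenes'' — this is an illustrative outline, not the assignment solution:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Synthetic stand-ins: 72 samples x 500 genes, binary AML/ALL labels.
X = rng.normal(size=(72, 500))
y = rng.integers(0, 2, size=72)

# Part 1: a tree in the raw gene space. A fully grown tree fits the
# training set perfectly, so training accuracy is wildly optimistic;
# cross-validation estimates the real accuracy.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
train_acc = tree.score(X, y)
cv_acc = cross_val_score(DecisionTreeClassifier(random_state=0),
                         X, y, cv=5).mean()

# Part 2: project onto the top-K eigengenes, then learn a tree on Z.
for K in (2, 5, 10):
    Z = PCA(n_components=K).fit_transform(X)
    tree_k = DecisionTreeClassifier(random_state=0).fit(Z, y)
    cv_k = cross_val_score(DecisionTreeClassifier(random_state=0),
                           Z, y, cv=5).mean()
    print(K, tree_k.score(Z, y), round(cv_k, 2))

# Rough analogue of mineGenes: the genes with the largest absolute
# loadings in a discriminative component.
pca = PCA(n_components=5).fit(X)
top_genes = np.argsort(np.abs(pca.components_[0]))[::-1][:10]
```

On random data the cross-validated accuracy hovers around chance level, which is exactly the gap between training and real accuracy that the assignment asks you to observe.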