Analysis of differential gene expression is one of the most popular bioinformatics tasks. Each particular gene can be either turned on or off. This process regulates the amount of produced proteins in each cell, allowing it to react to a lack of minerals or change its type from stem cell to neuron. It is, however, problematic to measure protein levels directly. Therefore scientists developed a method to use DNA sequencing to read RNA transcripts. The cell is killed, and all RNA is put into a sequencing machine. Machine reads (again, only short fragments!) this RNA. To quantify the abundance of each gene, we must first map reads to each particular gene.
The mapping is not as easy as it seems. Firstly, we need to have a reference genome with its annotation with the boundaries of all genes. Secondly, not all organisms are the same; there may be some differences between different representatives of a single species. Therefore, reads won't match exactly. Not to mention any sequencing errors and alternative splicing, which means that reads won't match to a single interval of a genome but two or more consecutive intervals. Repeats and gene-similarity produce another source of uncertainty in the mapping as we cannot be sure which gene the transcript originated from. Mapping algorithms (as STAR) need first to create an index, which is a suffix tree on the reference genome to be able to search in it. Multiple gene mapping reads can be solved, for example, by a procedure similar to EM-algorithm. Secondly, the mapping produces aligned reads in a BAM file.
Having the mapping, we can use a program called htseq-count to calculate a vector with counts for each gene. The counts need to be normalized as longer genes mean more reads with the same number of transcripts. Those counts are then direct input to the DeSEQ2 library, which allows different statistical analyses. If you have multiple samples, you may compare which genes are “more active” in cancer cells compared to healthy cells and use this knowledge further. In the next tutorial, we will try ourselves to run the mapping algorithm, estimate the counts, and use them in a simple analysis of zebrafish embryos. The problem is simplified so that we can run it on a modest work-station in a period dedicated to one tutorial. Therefore, only a single gene shows statistically significant differences in the single chromosome we use.
As we will work with bulk data, download some files in advance. This following script should do all the required for you.
bash
mkdir bintut9 cd bintut9 # annotation files and genome wget ftp://ftp.ensembl.org/pub/release-96/gtf/danio_rerio/Danio_rerio.GRCz11.96.gtf.gz wget ftp://ftp.ensembl.org/pub/release-96/fasta/danio_rerio/dna/Danio_rerio.GRCz11.dna.chromosome.1.fa.gz gunzip Danio_rerio.GRCz11.dna.chromosome.1.fa.gz gunzip Danio_rerio.GRCz11.96.gtf.gz wget ftp://ftp.ebi.ac.uk/pub/training/Train_online/RNA-seq_exercise/6h_1.fastq wget ftp://ftp.ebi.ac.uk/pub/training/Train_online/RNA-seq_exercise/6h_2.fastq wget ftp://ftp.ebi.ac.uk/pub/training/Train_online/RNA-seq_exercise/2cells_1.fastq wget ftp://ftp.ebi.ac.uk/pub/training/Train_online/RNA-seq_exercise/2cells_2.fastq wget https://github.com/alexdobin/STAR/archive/2.7.0f.tar.gz tar -xzf 2.7.0f.tar.gz # Unfortunatelly, the original source of RNA-seq data was discontinued #wget https://www.ebi.ac.uk/~emily/Online%20courses/NGS/RNA-seq.zip # Instead, we will use this snapshot: wget http://ida.fel.cvut.cz/~rysavy/bin/RNA-seq.zip unzip RNA-seq.zip mkdir tophat mv RNA-seq/annotation/Danio_rerio.Zv9.66.gtf tophat/ mv RNA-seq/tophat/ZV9_6h/accepted_hits.sorted.bam tophat/6h.bam mv RNA-seq/tophat/ZV9_2cells/accepted_hits.sorted.bam tophat/2cells.bam
Next, we need to install the DESeq2 library into R. Run the following code in your R installation.
R
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("DESeq2") library(DESeq2)
In the tutorial, we will study expression levels of embryo cells of zebrafish, Danio rerio in Latin. Sequencing was done with Illumina RNA-seq platform which produces reads of RNA transcripts in a cell.
You have downloaded the following files:
Danio_rerio.GRCz11.dna.chromosome.1.fa.gz
with the DNA sequence of the zebrafish. Due to the time constraints in class, we will use only chromosome 1 in our experiments.
Danio_rerio.GRCz11.96.gtf.gz
with annotation of the genome. The file describes where each gene is located.
2cells_1.fastq
and 2cells_2.fastq
contain paired-end RNA-seq reads of a two-cell zebrafish embryo.
6h_1.fastq
and 6h_2.fastq
contain paired-end RNA-seq reads of zebrafish embryo $6$ hours post fertilization.
2.7.0f.tar.gz
contains the STAR aligner, see https://doi.org/10.1093/bioinformatics/bts635.
RNA-seq.zip
contains aligned reads by tophat aligner. For time constraints, we are not able to align reads in the class.
The reference genome and annotation may be downloaded from the ENSEMBL database.
The input data were obtained from the experiment described here and here. The sample alignment was downloaded from this tutorial.
The first step in the pipeline is the alignment of the reads to the genome. We have downloaded the annotated references from the ENSEMBL database. For the alignment, we will use the STAR aligner to calculate the alignment. The original paper was published here:
This aligner is faster than the tophat aligner that was used commonly before; STAR, however, consumes a lot of memory. Aligning human reads requires approximately $32\,\mathrm{GB}$. Hence, we will use only the first chromosome in the class.
The alignment has two steps. First, we need to create the index to calculate the alignment. Use the following parameters:
genome
, and use this directory for the genome index.
.fa
.gtf
file you downloaded.
zless
to view one of the read files and count the nucleotides.
Use documentation of the aligner to set the parameters: https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf
The second step is the alignment itself. The parameters are as follows.
_1.fastq
contains forward reads, file _2.fastq
contains backward reads.
outSAMtype
to sort the entries by coordinate in the final BAM file.
starout/6h
or starout/2cells
.
The produced BAM files are compressed versions of SAM file format. Feel free to inspect the contents of the output. You would need samtools
to do that.
For abundance quantification, we will use the HTSeq
count python tool. This tool is aimed as replacement of the cufflinks alternative. Documentation of the tool is on https://htseq.readthedocs.io/en/release_0.11.1/.
Start by installing the tool.
bash
pip3 install HTSeq
Finish the following commands to calculate the counts.
bash
python3 -m HTSeq.scripts.count -m ???? -f ???? ???? ???? > 6h.counts python3 -m HTSeq.scripts.count -m ???? -f ???? ???? ???? > 2cells.counts
You may use
bash
python3 -m HTSeq.scripts.count --help
union
mode.
Now, inspect contents of the .counts
file. Search for some of the gene names (ENSEMBL ids).
For analysis, we will use the R library DESeq2. Load the data with.
R
library(DESeq2) sampleFiles<-grep('counts',list.files("."),value=TRUE) sampleCondition <- c("2cells", "6h") sampleTable<-data.frame(sampleName=sampleFiles, fileName=sampleFiles, condition=sampleCondition) ddsHTSeq<-DESeqDataSetFromHTSeqCount(sampleTable=sampleTable, directory=".", design=~condition) ddsHTSeq
The analysis and visualization of the expression levels may be done with the following code.
R
dds <- DESeq(ddsHTSeq) saveRDS(dds, file="dds.rds") res <- results(dds) res plotMA(res, ylim=c(-10,10))
The textual input shows the difference in expression for all genes sorted by gene name. The plot shows the same result. Note the red triangle on the top. As the dataset is small, there is only a single gene that shows statistically significant differential expression. In the library vignette, you may find much more representative examples based on a study of multiple samples.
If we had multiple samples, PCA would show the clustering by test condition. Here, only a single dimension is enough. Similarly, we can visualize expressions by the condition for a gene of interest.
This tutorial was inspired by
Used tools include