Search
As we will work with bulk data, download some files in advance. This following script should do all the required for you.
bash
mkdir bintut9 cd bintut9 # annotation files and genome wget ftp://ftp.ensembl.org/pub/release-96/gtf/danio_rerio/Danio_rerio.GRCz11.96.gtf.gz wget ftp://ftp.ensembl.org/pub/release-96/fasta/danio_rerio/dna/Danio_rerio.GRCz11.dna.chromosome.1.fa.gz gunzip Danio_rerio.GRCz11.dna.chromosome.1.fa.gz gunzip Danio_rerio.GRCz11.96.gtf.gz wget ftp://ftp.ebi.ac.uk/pub/training/Train_online/RNA-seq_exercise/6h_1.fastq wget ftp://ftp.ebi.ac.uk/pub/training/Train_online/RNA-seq_exercise/6h_2.fastq wget ftp://ftp.ebi.ac.uk/pub/training/Train_online/RNA-seq_exercise/2cells_1.fastq wget ftp://ftp.ebi.ac.uk/pub/training/Train_online/RNA-seq_exercise/2cells_2.fastq wget https://github.com/alexdobin/STAR/archive/2.7.0f.tar.gz tar -xzf 2.7.0f.tar.gz wget https://www.ebi.ac.uk/~emily/Online%20courses/NGS/RNA-seq.zip unzip RNA-seq.zip mkdir tophat mv RNA-seq/annotation/Danio_rerio.Zv9.66.gtf tophat/ mv RNA-seq/tophat/ZV9_6h/accepted_hits.sorted.bam tophat/6h.bam mv RNA-seq/tophat/ZV9_2cells/accepted_hits.sorted.bam tophat/2cells.bam
Next, we need to install the DESeq2 library into R. Run the following code in your R installation.
R
source('http://bioconductor.org/biocLite.R') biocLite('DESeq2') library(DESeq2)
In the tutorial, we will study expression levels of embryo cells of zebrafish, Danio rerio in Latin. Sequencing was done with Illumina RNA-seq platform which produces reads of RNA transcripts in a cell.
You have downloaded the following files:
Danio_rerio.GRCz11.dna.chromosome.1.fa.gz
Danio_rerio.GRCz11.96.gtf.gz
2cells_1.fastq
2cells_2.fastq
6h_1.fastq
6h_2.fastq
2.7.0f.tar.gz
RNA-seq.zip
The reference genome and annotation may be downloaded from the ENSEMBL database.
The input data were obtained from the experiment described here and here. The sample alignment was downloaded from this tutorial.
The first step in the pipeline is the alignment of the reads to the genome. We have downloaded the annotated references from the ENSEMBL database. For the alignment, we will use the STAR aligner to calculate the alignment. The original paper was published here:
This aligner is faster than the tophat aligner that was used commonly before; STAR, however, consumes a lot of memory. Aligning human reads requires approximately $32\,\mathrm{GB}$. Hence, we will use only the first chromosome in the class.
The alignment has two steps. First, we need to create the index to calculate the alignment. Use the following parameters:
genome
.fa
.gtf
zless
Use documentation of the aligner to set the parameters: https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf
The second step is the alignment itself. The parameters are as follows.
_1.fastq
_2.fastq
outSAMtype
starout/6h
starout/2cells
The produced BAM files are compressed versions of SAM file format. Feel free to inspect the contents of the output. You would need samtools to do that.
samtools
For abundance quantification, we will use the HTSeq count python tool. This tool is aimed as replacement of the cufflinks alternative. Documentation of the tool is on https://htseq.readthedocs.io/en/release_0.11.1/.
HTSeq
Start by installing the tool.
pip3 install HTSeq
Finish the following commands to calculate the counts.
python3 -m HTSeq.scripts.count -m ???? -f ???? ???? ???? > 6h.counts python3 -m HTSeq.scripts.count -m ???? -f ???? ???? ???? > 2cells.counts
You may use
python3 -m HTSeq.scripts.count --help
union
Now, inspect contents of the .counts file. Search for some of the gene names (ENSEMBL ids).
.counts
For analysis, we will use the R library DESeq2. Load the data with.
library(DESeq2) sampleFiles<-grep('counts',list.files("."),value=TRUE) sampleCondition <- c("2cells", "6h") sampleTable<-data.frame(sampleName=sampleFiles, fileName=sampleFiles, condition=sampleCondition) ddsHTSeq<-DESeqDataSetFromHTSeqCount(sampleTable=sampleTable, directory=".", design=~condition) ddsHTSeq
The analysis and visualization of the expression levels may be done with the following code.
dds <- DESeq(ddsHTSeq) saveRDS(dds, file="dds.rds") res <- results(dds) res plotMA(res, ylim=c(-10,10))
The textual input shows the difference in expression for all genes sorted by gene name. The plot shows the same result. Note the red triangle on the top. As the dataset is small, there is only a single gene that shows statistically significant differential expression. In the library vignette, you may find much more representative examples based on a study of multiple samples.
If we had multiple samples, PCA would show the clustering by test condition. Here, only a single dimension is enough. Similarly, we can visualize expressions by the condition for a gene of interest.
This tutorial was inspired by
Used tools include