====== Tutorial 9 - RNA-Seq - alignment, abundance quantification and differential expression analysis ======

===== Before the tutorial =====

As we will work with bulk data, download some files in advance. This following script should do all the required for you.
<code|bash>
mkdir bintut9
cd bintut9

# annotation files and genome
wget ftp://ftp.ensembl.org/pub/release-96/gtf/danio_rerio/Danio_rerio.GRCz11.96.gtf.gz
wget ftp://ftp.ensembl.org/pub/release-96/fasta/danio_rerio/dna/Danio_rerio.GRCz11.dna.chromosome.1.fa.gz

gunzip Danio_rerio.GRCz11.dna.chromosome.1.fa.gz
gunzip Danio_rerio.GRCz11.96.gtf.gz

wget ftp://ftp.ebi.ac.uk/pub/training/Train_online/RNA-seq_exercise/6h_1.fastq
wget ftp://ftp.ebi.ac.uk/pub/training/Train_online/RNA-seq_exercise/6h_2.fastq

wget ftp://ftp.ebi.ac.uk/pub/training/Train_online/RNA-seq_exercise/2cells_1.fastq
wget ftp://ftp.ebi.ac.uk/pub/training/Train_online/RNA-seq_exercise/2cells_2.fastq

wget https://github.com/alexdobin/STAR/archive/2.7.0f.tar.gz
tar -xzf 2.7.0f.tar.gz

wget https://www.ebi.ac.uk/~emily/Online%20courses/NGS/RNA-seq.zip
unzip RNA-seq.zip
mkdir tophat
mv RNA-seq/annotation/Danio_rerio.Zv9.66.gtf tophat/
mv RNA-seq/tophat/ZV9_6h/accepted_hits.sorted.bam tophat/6h.bam
mv RNA-seq/tophat/ZV9_2cells/accepted_hits.sorted.bam tophat/2cells.bam
</code>

Next, we need to install the DESeq2 library into R. Run the following code in your R installation.
<code|R>
source('http://bioconductor.org/biocLite.R')
biocLite('DESeq2')
library(DESeq2)
</code>

===== Introduction =====

In the tutorial, we will study expression levels of embryo cells of zebrafish, //Danio rerio// in Latin. Sequencing was done with Illumina RNA-seq platform which produces reads of RNA transcripts in a cell. 

You have downloaded the following files:
  - ''Danio_rerio.GRCz11.dna.chromosome.1.fa.gz'' with the DNA sequence of the zebrafish. Due to the time constraints in class, we will use only chromosome 1 in our experiments.
  - ''Danio_rerio.GRCz11.96.gtf.gz'' with annotation of the genome. The file describes where each gene is located.
  - Files ''2cells_1.fastq'' and ''2cells_2.fastq'' contain paired-end RNA-seq reads of a two-cell zebrafish embryo.
  - Files ''6h_1.fastq'' and ''6h_2.fastq'' contain paired-end RNA-seq reads of zebrafish embryo $6$ hours post fertilization.
  - ''2.7.0f.tar.gz'' contains the STAR aligner, see [[https://doi.org/10.1093/bioinformatics/bts635]].
  - ''RNA-seq.zip'' contains aligned reads by tophat aligner. For time constraints, we are not able to align reads in the class.

The reference genome and annotation may be downloaded from the [[https://www.ensembl.org/Danio_rerio/Info/Index|ENSEMBL database]].

{{ :courses:bin:tutorials:ensembl.png?nolink&400 |}}

The input data were obtained from the experiment described [[ftp://ftp.ebi.ac.uk/pub/training/Train_online/RNA-seq_exercise/zebrafish-rna-seq.pdf|here]] and [[https://www.ebi.ac.uk/training/online/course/ebi-next-generation-sequencing-practical-course/rna-sequencing/rna-seq-analysis-transcriptome|here]]. The sample alignment was downloaded from [[https://courses.cs.ut.ee/MTAT.03.239/2013_fall/uploads/Main/11_Practical_2013.pdf|this tutorial]].

===== Alignment =====

The first step in the pipeline is the alignment of the reads to the genome. We have downloaded the annotated references from the ENSEMBL database. For the alignment, we will use the STAR aligner to calculate the alignment. The original paper was published here:

  - Alexander Dobin, Carrie A. Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, Thomas R. Gingeras, STAR: ultrafast universal RNA-seq aligner, Bioinformatics, Volume 29, Issue 1, January 2013, Pages 15–21, https://doi.org/10.1093/bioinformatics/bts635

This aligner is faster than the tophat aligner that was used commonly before; STAR, however, consumes a lot of memory. Aligning human reads requires approximately $32\,\mathrm{GB}$. Hence, we will use only the first chromosome in the class.

The alignment has two steps. First, we need to create the index to calculate the alignment. Use the following parameters:
  - The number of threads should be equal to the number of cores of your computer.
  - Create a directory ''genome'', and use this directory for the genome index.
  - The genome FASTA file is the file with extension ''.fa''
  - Use the ''.gtf'' file you downloaded.
  - Sjdb overhang should be equal to read length - 1. Use ''zless'' to view one of the read files and count the nucleotides.
  - Don't forget to set the program mode.
Use documentation of the aligner to set the parameters: [[https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf]]

The second step is the alignment itself. The parameters are as follows.
  - The number of threads should be equal to the number of cores.
  - Use the genome directory you created in the last step.
  - The input should contain two FASTQ files for a single condition. File ''_1.fastq'' contains forward reads, file ''_2.fastq'' contains backward reads.
  - Add ''outSAMtype'' to sort the entries by coordinate in the final BAM file.
  - Set prefix of the output as ''starout/6h'' or ''starout/2cells''.

<note tip>Unfortunately, the alignment takes around half an hour on a reasonably fast computer. Therefore, in the next step, we will use precomputed aligned files.</note>

The produced BAM files are compressed versions of SAM file format. Feel free to inspect the contents of the output. You would need ''samtools'' to do that.

===== Abundance quantification =====

For abundance quantification, we will use the ''HTSeq'' count python tool. This tool is aimed as replacement of the cufflinks alternative. Documentation of the tool is on [[https://htseq.readthedocs.io/en/release_0.11.1/]].

Start by installing the tool.
<code|bash>
pip3 install HTSeq
</code>

Finish the following commands to calculate the counts.
<code|bash>
python3 -m HTSeq.scripts.count -m ???? -f ???? ???? ???? > 6h.counts
python3 -m HTSeq.scripts.count -m ???? -f ???? ???? ???? > 2cells.counts
</code>

You may use
<code|bash>
python3 -m HTSeq.scripts.count --help
</code>
to get help. Use ''union'' mode.

Now, inspect contents of the ''.counts'' file. Search for some of the gene names (ENSEMBL ids).

{{ :courses:bin:tutorials:counts.png?nolink&400 |}}

===== Differential expression analysis =====

For analysis, we will use the R library DESeq2. Load the data with.
<code|R>
library(DESeq2)
sampleFiles<-grep('counts',list.files("."),value=TRUE)
sampleCondition <- c("2cells", "6h")
sampleTable<-data.frame(sampleName=sampleFiles, fileName=sampleFiles, condition=sampleCondition)
ddsHTSeq<-DESeqDataSetFromHTSeqCount(sampleTable=sampleTable, directory=".", design=~condition)
ddsHTSeq
</code>
The last command visualized the result.
{{ :courses:bin:tutorials:ddshtseq.png?nolink&400 |}}

The analysis and visualization of the expression levels may be done with the following code.
<code|R>
dds <- DESeq(ddsHTSeq)
saveRDS(dds, file="dds.rds")
res <- results(dds)
res

plotMA(res, ylim=c(-10,10))
</code>

The textual input shows the difference in expression for all genes sorted by gene name.
{{ :courses:bin:tutorials:log2fold.png?nolink&400 |}}
The plot shows the same result. Note the red triangle on the top. As the dataset is small, there is only a single gene that shows statistically significant differential expression.
{{ :courses:bin:tutorials:maplot.png?nolink&400 |}}
In the library vignette, you may find much more representative examples based on a study of multiple samples.

If we had multiple samples, PCA would show the clustering by test condition. Here, only a single dimension is enough.
{{ :courses:bin:tutorials:pca.png?nolink&400 |}}
Similarly, we can visualize expressions by the condition for a gene of interest.
{{ :courses:bin:tutorials:expression_by_gene.png?nolink&400 |}}

===== References =====

This tutorial was inspired by
  - https://www.ebi.ac.uk/training/online/sites/ebi.ac.uk.training.online/files/user/523/documents/zebrafish-rna-seq.pdf
  - https://courses.cs.ut.ee/MTAT.03.239/2013_fall/uploads/Main/11_Practical_2013.pdf

Used tools include
  - STAR manual https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf
  - HTSeq count library https://htseq.readthedocs.io/en/release_0.11.1/
  - DESeq2 library https://www.bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html