Table of Contents

Tutorial 12 - Read alignment

In this tutorial, we will practice read alignment methods. First, we will review the FM-index used for short-read mapping. Then, we will introduce the BWA algorithm for short-read alignment that operates with this index. The output of BWA will be analyzed using samtools and subsequently used for variant calling.

FM-index, a simple example

  1. Consider the following DNA sequence: $\mathtt{AGAAGA}$.
  2. Build the FM-index for this sequence.
  3. Use it to search for pattern $\mathtt{AGA}$ (available).
  4. Use it to search for pattern $\mathtt{GGA}$ (unavailable).

Read alignment

BWA is a fast and efficient algorithm for aligning short DNA sequencing reads to a reference genome, using the FM-index for memory-efficient mapping. Samtools is a suite of tools for manipulating and analyzing alignment files in SAM/BAM format, including sorting, indexing, and extracting reads. Bcftools is used to call and process genetic variants from alignment data, enabling variant filtering, annotation, and summarization.

Necessary inputs

Let us install the necessary tools first:

bash

sudo apt-get update
sudo apt-get install bwa
sudo apt-get install samtools
sudo apt-get install bcftools

We will work with the HIV virus genome, in particular with its standard reference strain HIV-1 HXB2 used across HIV research worldwide. At the same time, we will gradually use several different sets of reads. The corresponding FASTA and FASTQ files are available here.

Task 1

Learn more about the genome as well as the first read set:

  1. What is the length of the reference genome?
  2. What is the distribution of the bases?

bash

samtools faidx HIV_ref.fasta
cat HIV_ref.fasta.fai

grep -v ">" HIV_ref.fasta | tr -d '\n' | fold -w1 | sort | uniq -c

Task 2

Align the first read set to the genome and understand the mapping.

bash

bwa index -p HIV_ref HIV_ref.fasta
bwa mem HIV_ref reads.fasta > reads.sam
less reads.sam

Samtools provide a set of useful tools for further understanding of the mapping.

bash

samtools view -bS reads.sam | samtools sort -o reads.bam
samtools index reads.bam
samtools flagstat reads.bam
samtools depth reads.bam
samtools view reads.bam HIV_complete_genome:1-1000

Answer the following questions:

  1. What happens if I try to map my first read with BLAST to a general nucleotide database? What does it say about the BLAST applicability to the task of read mapping?
  1. Look up the HXB2 gene coordinates. The first read set contains 3 reads with one insertion or deletion each. In which HIV genes do these variants fall — and what might be their biological significance?

Variant calling

Now we will align 300 reads from an unknown HIV sample genome to the HXB2 reference genome. The key biological question is: does this HIV sample differ from the reference, and if so, where? This is the task of variant calling — identifying positions where the sequenced sample systematically differs from the reference genome. A difference at a single position is called a SNP (Single Nucleotide Polymorphism), while insertions and deletions are called INDELs.

We use bcftools to perform variant calling. The code below generates a variant calling file:

bash

bwa mem HIV_ref reads_sample_Q20.fastq > reads_sample_Q20.sam
samtools view -bS reads_sample_Q20.sam | samtools sort -o reads_sample_Q20.bam
samtools index reads_sample_Q20.bam
samtools flagstat reads_sample_Q20.bam
samtools mpileup -g -f HIV_ref.fasta reads_sample_Q20.bam | bcftools call -mv -o variants.vcf 
less variants.vcf

The file can be analyzed with samtools:

bash

samtools depth -r HIV_complete_genome:501-501 reads_sample_Q20.bam
samtools tview -p HIV_complete_genome:490-510 reads_sample_Q20.bam HIV_ref.fasta
samtools tview -p HIV_complete_genome:6990 reads_sample_Q20.bam HIV_ref.fasta

Answer the following questions:

  1. How many variants were found, and what is the ratio of SNPs to INDELs?
  2. What is the average read depth across all variant positions? Is it uniform?
  3. Apply the standard filters (DP≥10, QUAL≥30). How many variants survive? What would you need to change in the experiment to make all variants pass?
  4. Are the called variants true mutations or sequencing artefacts?
  5. The INDEL at position 7000 has QUAL=221 while most SNPs have QUAL below 30. Why might an INDEL receive a higher confidence score than a SNP at similar read depth?

Independent work

Run a similar variant calling analysis with a set of pair-end reads (reads_Q20_R1.fastq and reads_Q20_R2.fastq). The bwa call changes to:

bash

bwa mem HIV_ref reads_Q20_R1.fastq reads_Q20_R2.fastq > reads_pe.sam

Answer the following questions:

  1. What are the changes in called variants with respect to the previous call?
  2. How pair-end reads generally influence trustworthiness of the alignment?