Table of Contents

Tutorial 12 - Read alignment

In this tutorial, we will practice read alignment methods. First, we will review the FM-index used for short-read mapping. Then, we will introduce the BWA algorithm for short-read alignment that operates with this index. The output of BWA will be analyzed using samtools and subsequently used for variant calling.

FM-index, a simple example

  1. Consider the following DNA sequence: $\mathtt{AGAAGA}$.
  2. Build its suffix array SA and Burrows-Wheeler transform BWT.
  3. Use SA and BWT to search for pattern $\mathtt{AGA}$ (available).
  4. Use SA and BWT to search for pattern $\mathtt{GGA}$ (unavailable).
  5. Build the FM-index for this sequence.
  6. Use it to search for both the patterns as well.

Read alignment

Sequencing does not produce a complete genome in one piece — instead it generates millions of short DNA fragments called reads. Read alignment is the process of determining where in a reference genome each read came from.

In this tutorial we will work with the Human Immunodeficiency Virus reference genome and a couple of small sets of reads — manageable enough to understand the process in detail. The biological question is real however: does our sequenced HIV sample differ from the reference strain, and if so, where?

Necessary inputs

BWA is a fast and efficient algorithm for aligning short DNA sequencing reads to a reference genome, using the FM-index for memory-efficient mapping. Samtools is a suite of tools for manipulating and analyzing alignment files in SAM/BAM format, including sorting, indexing, and extracting reads. Bcftools is used to call and process genetic variants from alignment data, enabling variant filtering, annotation, and summarization. Let us install the necessary tools first:

bash

sudo apt-get update
sudo apt-get install bwa
sudo apt-get install samtools
sudo apt-get install bcftools

We will work with the HIV genome, in particular with its standard reference strain HIV-1 HXB2 (9,181 bp) used across HIV research worldwide. At the same time, we will gradually use several different sets of reads. The corresponding FASTA and FASTQ files are available here.

The reads were sampled from a mutated HXB2 genome carrying several artificially introduced variants — SNPs and INDELs at known positions. This setup mimics a real clinical scenario where a patient's viral genome differs from the reference strain, and allows us to evaluate whether our variant calling pipeline successfully recovers the introduced mutations.

Task 1

Learn more about the reference genome as well as the first read set:

  1. What is the length of the reference genome?
  2. What is the distribution of the bases?
  3. How many reads do we have?

bash

# use samtools to analyze fasta and learn the genome length
samtools faidx HIV_ref.fasta
cat HIV_ref.fasta.fai

# find out the distribution of bases with shell utilities
grep -v ">" HIV_ref.fasta | tr -d '\n' | fold -w1 | sort | uniq -c

# count the number of reads with shell utilities
grep -c ">" reads.fasta
less reads.fasta

Task 2

Align the first read set to the genome and understand the mapping.

bash

# Index the reference genome (required once before any alignment)
bwa index -p HIV_ref HIV_ref.fasta
# Align reads to the indexed reference genome; output in human-readable SAM format
bwa mem HIV_ref reads.fasta > reads.sam
less reads.sam

The structure of SAM file can be seen here. Samtools provide a set of useful tools for further understanding of the mapping.

bash

# Convert SAM to BAM and sort by genomic position
samtools view -bS reads.sam | samtools sort -o reads.bam
# Index the BAM file for fast random access
samtools index reads.bam
# Print alignment statistics (total reads, mapped, paired, etc.)
samtools flagstat reads.bam
# Print per-position read depth across the genome
samtools depth reads.bam
# Extract reads mapping to the first 1000 bases of the genome
samtools view reads.bam HIV_complete_genome:1-1000

Answer the following questions:

  1. How well did the reads mapped to the genome? Interpret their CIGAR strings.
  2. What happens if I try to map my first read with BLAST to a general nucleotide database? What does it say about the BLAST applicability to the task of read mapping?
  3. Look up the HXB2 gene coordinates. The first read set contains 3 reads with one insertion or deletion each. In which HIV genes do these variants fall — and what might be their biological significance?

Variant calling

Now we will move from a vanilla read set to a more realistic one. Let us align 300 reads from an unknown HIV sample to the HXB2 reference genome. The key biological question is: does this HIV sample differ from the reference, and if so, where? This is the task of variant calling — identifying positions where the sequenced sample systematically differs from the reference genome. A difference at a single position is called a SNP (Single Nucleotide Polymorphism), while insertions and deletions are called INDELs.

We use bcftools to perform variant calling. The code below generates a variant calling file:

bash

# Align 300 reads from the unknown HIV sample to the reference genome
bwa mem HIV_ref reads_sample_Q20.fastq > reads_sample_Q20.sam
# Convert SAM to BAM and sort by genomic position
samtools view -bS reads_sample_Q20.sam | samtools sort -o reads_sample_Q20.bam
# Index the BAM file for fast random access
samtools index reads_sample_Q20.bam
# Check alignment quality — all 300 reads should map successfully
samtools flagstat reads_sample_Q20.bam
# Call variants: summarize base composition at each position, then identify differences from reference
bcftools mpileup -f HIV_ref.fasta reads_sample_Q20.bam | bcftools call -mv --ploidy 1 -o variants.vcf 
less variants.vcf

Learn the VCF structure here. The file can be analyzed with samtools:

bash

# Check read depth at the first called variant (SNP at position 501)
samtools depth -r HIV_complete_genome:501 reads_sample_Q20.bam
# Inspect the alignment around position 501 — verify the SNP visually
samtools tview -p HIV_complete_genome:490 reads_sample_Q20.bam HIV_ref.fasta
# Inspect the alignment around position 7000 — verify the high-confidence INDEL
samtools tview -p HIV_complete_genome:6990 reads_sample_Q20.bam HIV_ref.fasta

The overall variant calling workflow is as follows:

 The variant calling workflow

Answer the following questions:

  1. How many variants were found, and what is the ratio of SNPs to INDELs?
  2. What is the average read depth across all variant positions? Is it uniform?
  3. Apply the standard filters (DP≥10, QUAL≥30). How many variants survive? What would you need to change in the experiment to make all variants pass?
  4. Are the called variants true mutations or sequencing artefacts?
  5. The INDEL at position 7000 has QUAL around 200 while most SNPs have QUAL below 30. Why might an INDEL receive a higher confidence score than a SNP at similar read depth?

Independent work

So far all reads were single-end — each sequenced fragment was read from one end only. In paired-end sequencing, both ends of each DNA fragment are sequenced independently, producing two files: reads_Q20_R1.fastq (forward reads) and reads_Q20_R2.fastq (reverse reads). Each pair of reads originates from the same fragment, so they map to opposite strands at a predictable distance from each other.

Paired-end sequencing has two key advantages for variant calling. First, it doubles the number of reads covering each position, increasing read depth. Second, since variants are confirmed from both strands independently, strand bias artefacts are greatly reduced.

Run a similar variant calling analysis with a set of paired-end reads. The bwa call changes to:

bash

bwa mem HIV_ref reads_Q20_R1.fastq reads_Q20_R2.fastq > reads_pe.sam

Answer the following questions:

  1. How many reads mapped? How does samtools flagstat output differ from the single-end case — in particular, what does “properly paired” mean?
  2. What are the changes in called variants with respect to the previous call?
  3. How do paired-end reads generally influence trustworthiness of the alignment?

/