Search
In this tutorial, we will practice read alignment methods. First, we will review the FM-index used for short-read mapping. Then, we will introduce the BWA algorithm for short-read alignment that operates with this index. The output of BWA will be analyzed using samtools and subsequently used for variant calling.
Sequencing does not produce a complete genome in one piece — instead it generates millions of short DNA fragments called reads. Read alignment is the process of determining where in a reference genome each read came from.
In this tutorial we will work with the Human Immunodeficiency Virus reference genome and a couple of small sets of reads — manageable enough to understand the process in detail. The biological question is real however: does our sequenced HIV sample differ from the reference strain, and if so, where?
BWA is a fast and efficient algorithm for aligning short DNA sequencing reads to a reference genome, using the FM-index for memory-efficient mapping. Samtools is a suite of tools for manipulating and analyzing alignment files in SAM/BAM format, including sorting, indexing, and extracting reads. Bcftools is used to call and process genetic variants from alignment data, enabling variant filtering, annotation, and summarization. Let us install the necessary tools first:
bash
sudo apt-get update sudo apt-get install bwa sudo apt-get install samtools sudo apt-get install bcftools
We will work with the HIV genome, in particular with its standard reference strain HIV-1 HXB2 (9,181 bp) used across HIV research worldwide. At the same time, we will gradually use several different sets of reads. The corresponding FASTA and FASTQ files are available here.
Learn more about the reference genome as well as the first read set:
# use samtools to analyze fasta and learn the genome length samtools faidx HIV_ref.fasta cat HIV_ref.fasta.fai # find out the distribution of bases with shell utilities grep -v ">" HIV_ref.fasta | tr -d '\n' | fold -w1 | sort | uniq -c # count the number of reads with shell utilities grep -c ">" reads.fasta less reads.fasta
Align the first read set to the genome and understand the mapping.
# Index the reference genome (required once before any alignment) bwa index -p HIV_ref HIV_ref.fasta # Align reads to the indexed reference genome; output in human-readable SAM format bwa mem HIV_ref reads.fasta > reads.sam less reads.sam
The structure of SAM file can be seen here. Samtools provide a set of useful tools for further understanding of the mapping.
# Convert SAM to BAM and sort by genomic position samtools view -bS reads.sam | samtools sort -o reads.bam # Index the BAM file for fast random access samtools index reads.bam # Print alignment statistics (total reads, mapped, paired, etc.) samtools flagstat reads.bam # Print per-position read depth across the genome samtools depth reads.bam # Extract reads mapping to the first 1000 bases of the genome samtools view reads.bam HIV_complete_genome:1-1000
Answer the following questions:
Now we will move from a vanilla read set to a more realistic one. Let us align 300 reads from an unknown HIV sample to the HXB2 reference genome. The key biological question is: does this HIV sample differ from the reference, and if so, where? This is the task of variant calling — identifying positions where the sequenced sample systematically differs from the reference genome. A difference at a single position is called a SNP (Single Nucleotide Polymorphism), while insertions and deletions are called INDELs.
We use bcftools to perform variant calling. The code below generates a variant calling file:
# Align 300 reads from the unknown HIV sample to the reference genome bwa mem HIV_ref reads_sample_Q20.fastq > reads_sample_Q20.sam # Convert SAM to BAM and sort by genomic position samtools view -bS reads_sample_Q20.sam | samtools sort -o reads_sample_Q20.bam # Index the BAM file for fast random access samtools index reads_sample_Q20.bam # Check alignment quality — all 300 reads should map successfully samtools flagstat reads_sample_Q20.bam # Call variants: summarize base composition at each position, then identify differences from reference bcftools mpileup -f HIV_ref.fasta reads_sample_Q20.bam | bcftools call -mv --ploidy 1 -o variants.vcf less variants.vcf
Learn the VCF structure here. The file can be analyzed with samtools:
# Check read depth at the first called variant (SNP at position 501) samtools depth -r HIV_complete_genome:501 reads_sample_Q20.bam # Inspect the alignment around position 501 — verify the SNP visually samtools tview -p HIV_complete_genome:490 reads_sample_Q20.bam HIV_ref.fasta # Inspect the alignment around position 7000 — verify the high-confidence INDEL samtools tview -p HIV_complete_genome:6990 reads_sample_Q20.bam HIV_ref.fasta
The overall variant calling workflow is as follows:
So far all reads were single-end — each sequenced fragment was read from one end only. In paired-end sequencing, both ends of each DNA fragment are sequenced independently, producing two files: reads_Q20_R1.fastq (forward reads) and reads_Q20_R2.fastq (reverse reads). Each pair of reads originates from the same fragment, so they map to opposite strands at a predictable distance from each other.
reads_Q20_R1.fastq
reads_Q20_R2.fastq
Paired-end sequencing has two key advantages for variant calling. First, it doubles the number of reads covering each position, increasing read depth. Second, since variants are confirmed from both strands independently, strand bias artefacts are greatly reduced.
Run a similar variant calling analysis with a set of paired-end reads. The bwa call changes to:
bwa mem HIV_ref reads_Q20_R1.fastq reads_Q20_R2.fastq > reads_pe.sam
/