Search
In this tutorial, we will practice read alignment methods. First, we will review the FM-index used for short-read mapping. Then, we will introduce the BWA algorithm for short-read alignment that operates with this index. The output of BWA will be analyzed using samtools and subsequently used for variant calling.
BWA is a fast and efficient algorithm for aligning short DNA sequencing reads to a reference genome, using the FM-index for memory-efficient mapping. Samtools is a suite of tools for manipulating and analyzing alignment files in SAM/BAM format, including sorting, indexing, and extracting reads. Bcftools is used to call and process genetic variants from alignment data, enabling variant filtering, annotation, and summarization.
Let us install the necessary tools first:
bash
sudo apt-get update sudo apt-get install bwa sudo apt-get install samtools sudo apt-get install bcftools
We will work with the HIV virus genome, in particular with its standard reference strain HIV-1 HXB2 used across HIV research worldwide. At the same time, we will gradually use several different sets of reads. The corresponding FASTA and FASTQ files are available here.
Learn more about the genome as well as the first read set:
samtools faidx HIV_ref.fasta cat HIV_ref.fasta.fai grep -v ">" HIV_ref.fasta | tr -d '\n' | fold -w1 | sort | uniq -c
Align the first read set to the genome and understand the mapping.
bwa index -p HIV_ref HIV_ref.fasta bwa mem HIV_ref reads.fasta > reads.sam less reads.sam
Samtools provide a set of useful tools for further understanding of the mapping.
samtools view -bS reads.sam | samtools sort -o reads.bam samtools index reads.bam samtools flagstat reads.bam samtools depth reads.bam samtools view reads.bam HIV_complete_genome:1-1000
Answer the following questions:
Now we will align 300 reads from an unknown HIV sample genome to the HXB2 reference genome. The key biological question is: does this HIV sample differ from the reference, and if so, where? This is the task of variant calling — identifying positions where the sequenced sample systematically differs from the reference genome. A difference at a single position is called a SNP (Single Nucleotide Polymorphism), while insertions and deletions are called INDELs.
We use bcftools to perform variant calling. The code below generates a variant calling file:
bwa mem HIV_ref reads_sample_Q20.fastq > reads_sample_Q20.sam samtools view -bS reads_sample_Q20.sam | samtools sort -o reads_sample_Q20.bam samtools index reads_sample_Q20.bam samtools flagstat reads_sample_Q20.bam samtools mpileup -g -f HIV_ref.fasta reads_sample_Q20.bam | bcftools call -mv -o variants.vcf less variants.vcf
The file can be analyzed with samtools:
samtools depth -r HIV_complete_genome:501-501 reads_sample_Q20.bam samtools tview -p HIV_complete_genome:490-510 reads_sample_Q20.bam HIV_ref.fasta samtools tview -p HIV_complete_genome:6990 reads_sample_Q20.bam HIV_ref.fasta
Run a similar variant calling analysis with a set of pair-end reads (reads_Q20_R1.fastq and reads_Q20_R2.fastq). The bwa call changes to:
reads_Q20_R1.fastq
reads_Q20_R2.fastq
bwa mem HIV_ref reads_Q20_R1.fastq reads_Q20_R2.fastq > reads_pe.sam