====== Tutorial 12 - Read alignment ======
In this tutorial, we will practice read alignment methods. First, we will review the FM-index used for short-read mapping. Then, we will introduce the BWA algorithm for short-read alignment that operates with this index. The output of BWA will be analyzed using samtools and subsequently used for variant calling.
===== FM-index, a simple example =====
/** the solution is available in: fm_index_solved_example.pdf **/
- Consider the following DNA sequence: $\mathtt{AGAAGA}$.
- Build the FM-index for this sequence.
- Use it to search for pattern $\mathtt{AGA}$ (available).
- Use it to search for pattern $\mathtt{GGA}$ (unavailable).
===== Read alignment =====
BWA is a fast and efficient algorithm for aligning short DNA sequencing reads to a reference genome, using the FM-index for memory-efficient mapping. Samtools is a suite of tools for manipulating and analyzing alignment files in SAM/BAM format, including sorting, indexing, and extracting reads. Bcftools is used to call and process genetic variants from alignment data, enabling variant filtering, annotation, and summarization.
==== Necessary inputs ====
Let us install the necessary tools first:
sudo apt-get update
sudo apt-get install bwa
sudo apt-get install samtools
sudo apt-get install bcftools
/* instalacni alternativy: conda install -c bioconda bwa samtools bcftools */
We will work with the HIV virus genome, in particular with its standard reference strain HIV-1 HXB2 used across HIV research worldwide. At the same time, we will gradually use several different sets of reads. The corresponding FASTA and FASTQ files are available {{:courses:bin:tutorials:hiv_alignment.zip|here}}.
/* Genom a prvni sada 3 readu stazena z https://github.com/Mangul-Lab-USC/BWA_Tutorial. Mutovany genom a zbyle read sety byly genrovany umele za pomoci CLAUDE. */
==== Task 1 ====
/* FASTA and FASTQ files are text files, they can easily be viewed. */
Learn more about the genome as well as the first read set:
- What is the length of the reference genome?
- What is the distribution of the bases?
samtools faidx HIV_ref.fasta
cat HIV_ref.fasta.fai
grep -v ">" HIV_ref.fasta | tr -d '\n' | fold -w1 | sort | uniq -c
/* samtools rather work with the alignment SAM/BAM files, but could be used to analyze FASTA too, the rported numbers are: the genome length 9,181, at which byte the sequence starts (the header length is 20 bytes), the length of each row (60 nucleotides per line), the same length including the new line symbol */
/* We can rather use standard Linux shell utilities, such as grep, sed or awk. The grep command shows that HIV-1 HXB2 has an unusually low GC content, with cytosine comprising much less than 25% of the nucleotide composition. This AT-rich composition is characteristic of HIV-1 and has implications for sequencing quality, as AT-rich regions can be harder to sequence accurately and may show lower read coverage in alignment-based analyses. */
==== Task 2 ====
Align the first read set to the genome and understand the mapping.
bwa index -p HIV_ref HIV_ref.fasta
bwa mem HIV_ref reads.fasta > reads.sam
less reads.sam
Samtools provide a set of useful tools for further understanding of the mapping.
samtools view -bS reads.sam | samtools sort -o reads.bam
samtools index reads.bam
samtools flagstat reads.bam
samtools depth reads.bam
samtools view reads.bam HIV_complete_genome:1-1000
Answer the following questions:
- What happens if I try to map my first read with BLAST to a general nucleotide database? What does it say about the BLAST applicability to the task of read mapping?
/* BLAST na tento úkol stačí teoreticky. Prakticky nestačí ze čtyř důvodů:
1. Rychlost — BLAST prohledává celou databázi NCBI (miliardy sekvencí). BWA mapuje na jednu referenci. Typický RNA-seq experiment má 50 milionů readů — BLAST by trval týdny, BWA hodiny.
2. Nepřesná pozice — BLAST vrátí alignment na nejpodobnější sekvenci v databázi, ale neřekne vám přesnou pozici na vaší referenci. BWA vrátí chromosome + přesnou pozici + CIGAR string.
3. Špatná reference — BLAST našel SHIV místo HIV, protože databáze obsahuje stovky příbuzných kmenů. Pro variant calling potřebujete mapování na konkrétní referenci, ne na nejpodobnější sekvenci v databázi.
4. Nepodporuje výstupní formát — BLAST nevytvoří SAM/BAM soubor. Celý downstream pipeline (samtools, bcftools, IGV) vyžaduje BAM formát který BLAST neumí generovat.*/
- Look up the [[https://www.ncbi.nlm.nih.gov/nuccore/AF033819|HXB2 gene coordinates]]. The first read set contains 3 reads with one insertion or deletion each. In which HIV genes do these variants fall — and what might be their biological significance?
/* Biologická interpretace
Read1 (pos 168 — LTR) — inzerce v promotorové oblasti. LTR řídí transkripci celého viru — mutace zde může ovlivnit replikační aktivitu viru, zvýšit nebo snížit produkci virových RNA.
Read2 (pos 1705 — gag) — delece v oblasti kódující kapsidový protein (CA). Gag mutace typicky ovlivňují assembly virových partikulí a mohou způsobit rezistenci na inhibitory kapsidu (např. Lenacapavir).
Read3 (pos 5950 — vpu/env) — inzerce na hranici Vpu a Env. Vpu je zodpovědný za degradaci CD4 receptoru a uvolňování virionů z buňky — mutace zde může ovlivnit infektivitu. Zároveň začátek Env oblasti kódující gp160 prekurzor (gp120 + gp41). */
===== Variant calling =====
Now we will align 300 reads from an unknown HIV sample genome to the HXB2 reference genome. The key biological question is: does this HIV sample differ from the reference, and if so, where? This is the task of **variant calling** — identifying positions where the sequenced sample systematically differs from the reference genome. A difference at a single position is called a SNP (Single Nucleotide Polymorphism), while insertions and deletions are called INDELs.
We use bcftools to perform variant calling. The code below generates a variant calling file:
bwa mem HIV_ref reads_sample_Q20.fastq > reads_sample_Q20.sam
samtools view -bS reads_sample_Q20.sam | samtools sort -o reads_sample_Q20.bam
samtools index reads_sample_Q20.bam
samtools flagstat reads_sample_Q20.bam
samtools mpileup -g -f HIV_ref.fasta reads_sample_Q20.bam | bcftools call -mv -o variants.vcf
less variants.vcf
The file can be analyzed with samtools:
samtools depth -r HIV_complete_genome:501-501 reads_sample_Q20.bam
samtools tview -p HIV_complete_genome:490-510 reads_sample_Q20.bam HIV_ref.fasta
samtools tview -p HIV_complete_genome:6990 reads_sample_Q20.bam HIV_ref.fasta
Answer the following questions:
- How many variants were found, and what is the ratio of SNPs to INDELs?
- What is the average read depth across all variant positions? Is it uniform?
- Apply the standard filters (DP≥10, QUAL≥30). How many variants survive? What would you need to change in the experiment to make all variants pass?
- **Are the called variants true mutations or sequencing artefacts?**
- The INDEL at position 7000 has QUAL=221 while most SNPs have QUAL below 30. Why might an INDEL receive a higher confidence score than a SNP at similar read depth?
/* QUAL is not simply a function of read depth — it is a likelihood ratio that compares two hypotheses:
H0: the sample is identical to the reference at this position
H1: there is a real variant at this position
A high QUAL means H0 is extremely unlikely given the observed reads. What matters is not just how many reads support the variant, but how consistent they are and how unlikely the observation would be under H0.
Why the INDEL scores higher
The INDEL at position 7000 is an insertion of TT — A → ATT. This is a very specific signal:
An insertion of exactly TT is extremely unlikely to arise from a sequencing error — base substitution errors are common, but spontaneous insertions of two specific bases at the same position across multiple independent reads are rare, 9 reads support the INDEL (the highest DP of all our variants) — more evidence than most SNPs, bcftools uses the indel likelihood model which accounts for the fact that the same insertion appearing in multiple reads is strong evidence for a real variant.
By contrast, a SNP like G → A at DP=3 with QUAL=17 is ambiguous — a single damaged base or deamination event (the most common artefact:
cytosine (C) spontaneously loses an amino group and converts to uracil (U), which is read as thymine (T) by the polymerase) could explain it just as well as a real mutation. */
/* Read depth (DP) — how many reads cover the variant position? Our variants have DP between 3 and 9, which is very low. The gold standard is ≥ 10x, ideally 30x. With DP=3, a single sequencing error is enough to mimic a real variant.
QUAL score — a Phred-scaled probability that the variant is real. The standard filter is QUAL ≥ 30. Several of our SNPs (positions 1201, 3801, 5101, and 6701) fall below this threshold and would be discarded in a real analysis.
Strand bias — all our reads map to the forward strand only (5F/0R). A true mutation should be supported by reads from both strands. In our case this is caused by the low number of reads rather than a genuine artefact, but in a real dataset it would be a red flag.
Mapping quality (MQ) — how confidently does the read map to this position? MQ=60 is the maximum and indicates a unique, unambiguous mapping — a good sign. A low MQ would suggest the read matches multiple locations in the genome, making the variant call unreliable. */
===== Independent work =====
Run a similar variant calling analysis with a set of pair-end reads (''reads_Q20_R1.fastq'' and ''reads_Q20_R2.fastq''). The bwa call changes to:
bwa mem HIV_ref reads_Q20_R1.fastq reads_Q20_R2.fastq > reads_pe.sam
Answer the following questions:
- What are the changes in called variants with respect to the previous call?
- How pair-end reads generally influence trustworthiness of the alignment?