====== Tutorial 12 - Read alignment ======

In this tutorial, we will practice read alignment methods. First, we will review the FM-index used for short-read mapping. Then, we will introduce the BWA algorithm for short-read alignment that operates with this index. The output of BWA will be analyzed using samtools and subsequently used for variant calling.

===== FM-index, a simple example =====
/** the solution is available in: fm_index_solved_example.pdf  **/

  - Consider the following DNA sequence: $\mathtt{AGAAGA}$.
  - Build its suffix array //SA// and Burrows-Wheeler transform //BWT//.
  - Use //SA// and //BWT// to search for pattern $\mathtt{AGA}$ (available).
  - Use //SA// and //BWT// to search for pattern $\mathtt{GGA}$ (unavailable). 
  - Build the //FM-index// for this sequence.
  - Use it to search for both the patterns as well.


===== Read alignment =====

Sequencing does not produce a complete genome in one piece — instead it generates millions of short DNA fragments called reads. Read alignment is the process of determining where in a reference genome each read came from. 

In this tutorial we will work with the Human Immunodeficiency Virus reference genome and a couple of small sets of reads — manageable enough to understand the process in detail. The biological question is real however: **does our sequenced HIV sample differ from the reference strain, and if so, where?**

==== Necessary inputs ====

**BWA** is a fast and efficient algorithm for aligning short DNA sequencing reads to a reference genome, using the FM-index for memory-efficient mapping. **Samtools** is a suite of tools for manipulating and analyzing alignment files in SAM/BAM format, including sorting, indexing, and extracting reads. **Bcftools** is used to call and process genetic variants from alignment data, enabling variant filtering, annotation, and summarization. Let us install the necessary tools first:

<code|bash>
sudo apt-get update
sudo apt-get install bwa
sudo apt-get install samtools
sudo apt-get install bcftools
</code>

/* instalacni alternativy: conda install -c bioconda bwa samtools bcftools */

We will work with the HIV genome, in particular with its standard reference strain HIV-1 HXB2 (9,181 bp) used across HIV research worldwide. At the same time, we will gradually use several different sets of reads. The corresponding FASTA and FASTQ files are available {{:courses:bin:tutorials:hiv_alignment.zip|here}}.

<note>

The reads were sampled from a mutated HXB2 genome carrying several artificially introduced variants — SNPs and INDELs at known positions. This setup mimics a real clinical scenario where a patient's viral genome differs from the reference strain, and allows us to evaluate whether our variant calling pipeline successfully recovers the introduced mutations.

</note>


/* Genom a prvni sada 3 readu stazena z https://github.com/Mangul-Lab-USC/BWA_Tutorial. Mutovany genom a zbyle read sety byly generovany umele za pomoci CLAUDE. */

==== Task 1 ====

/* FASTA and FASTQ files are text files, they can easily be viewed. */

Learn more about the reference genome as well as the first read set:

  - What is the length of the reference genome? 
  - What is the distribution of the bases?
  - How many reads do we have?  

<code|bash>
# use samtools to analyze fasta and learn the genome length
samtools faidx HIV_ref.fasta
cat HIV_ref.fasta.fai

# find out the distribution of bases with shell utilities
grep -v ">" HIV_ref.fasta | tr -d '\n' | fold -w1 | sort | uniq -c

# count the number of reads with shell utilities
grep -c ">" reads.fasta
less reads.fasta
</code>

/* samtools rather work with the alignment SAM/BAM files, but could be used to analyze FASTA too, the reported numbers are: the genome length 9,181, at which byte the sequence starts (the header length is 20 bytes), the length of each row (60 nucleotides per line), the same length including the new line symbol */ 

/* We can rather use standard Linux shell utilities, such as grep, sed or awk. The grep command shows that HIV-1 HXB2 has an unusually low GC content, with cytosine comprising much less than 25% of the nucleotide composition. This AT-rich composition is characteristic of HIV-1 and has implications for sequencing quality, as AT-rich regions can be harder to sequence accurately and may show lower read coverage in alignment-based analyses. */

/* There are 3 reads with 72,72 and 96 bases. */

==== Task 2 ====

Align the first read set to the genome and understand the mapping. 

<code|bash>
# Index the reference genome (required once before any alignment)
bwa index -p HIV_ref HIV_ref.fasta
# Align reads to the indexed reference genome; output in human-readable SAM format
bwa mem HIV_ref reads.fasta > reads.sam
less reads.sam
</code>

/* Přepínač -p jen zkracuje jména indexových souborů, např. namísto HIV_ref.fasta.bwt je to jen HIV_ref.fasta.bwt */

The structure of SAM file can be seen [[https://samtools.github.io/hts-specs/SAMv1.pdf|here]]. Samtools provide a set of useful tools for further understanding of the mapping.

<code|bash>
# Convert SAM to BAM and sort by genomic position
samtools view -bS reads.sam | samtools sort -o reads.bam
# Index the BAM file for fast random access
samtools index reads.bam
# Print alignment statistics (total reads, mapped, paired, etc.)
samtools flagstat reads.bam
# Print per-position read depth across the genome
samtools depth reads.bam
# Extract reads mapping to the first 1000 bases of the genome
samtools view reads.bam HIV_complete_genome:1-1000
</code>

Answer the following questions:
  - How well did the reads mapped to the genome? Interpret their CIGAR strings.
  - What happens if I try to map my first read with BLAST to a general nucleotide database? What does it say about the BLAST applicability to the task of read mapping?
  - Look up the [[https://www.ncbi.nlm.nih.gov/nuccore/AF033819|HXB2 gene coordinates]]. The first read set contains 3 reads with one insertion or deletion each. In which HIV genes do these variants fall — and what might be their biological significance?

/* BLAST na tento úkol stačí teoreticky. Prakticky nestačí ze čtyř důvodů:
1. Rychlost — BLAST prohledává celou databázi NCBI (miliardy sekvencí). BWA mapuje na jednu referenci. Typický RNA-seq experiment má 50 milionů readů — BLAST by trval týdny, BWA hodiny.
2. Nepřesná pozice — BLAST vrátí alignment na nejpodobnější sekvenci v databázi, ale neřekne vám přesnou pozici na vaší referenci. BWA vrátí chromosome + přesnou pozici + CIGAR string.
3. Špatná reference — BLAST našel SHIV místo HIV, protože databáze obsahuje stovky příbuzných kmenů. Pro variant calling potřebujete mapování na konkrétní referenci, ne na nejpodobnější sekvenci v databázi.
4. Nepodporuje výstupní formát — BLAST nevytvoří SAM/BAM soubor. Celý downstream pipeline (samtools, bcftools, IGV) vyžaduje BAM formát který BLAST neumí generovat.*/

/* Biologická interpretace
Read1 (pos 168 — LTR) — inzerce v promotorové oblasti. LTR řídí transkripci celého viru — mutace zde může ovlivnit replikační aktivitu viru, zvýšit nebo snížit produkci virových RNA.
Read2 (pos 1705 — gag) — delece v oblasti kódující kapsidový protein (CA). Gag mutace typicky ovlivňují assembly virových partikulí a mohou způsobit rezistenci na inhibitory kapsidu (např. Lenacapavir).
Read3 (pos 5950 — vpu/env) — inzerce na hranici Vpu a Env. Vpu je zodpovědný za degradaci CD4 receptoru a uvolňování virionů z buňky — mutace zde může ovlivnit infektivitu. Zároveň začátek Env oblasti kódující gp160 prekurzor (gp120 + gp41). */

/* Čeho si je ještě dobré všimnout: virus musí zahustit hodně informace do krátkého genomu, geny se proto překrývají; dále platí, že jeden gen se přepíše na "velký" polyprotein a ten se později proteázou rozdělí na více menších proteinů se samostatnou funkcí -- proto například gen GAG má části matrix, capsid, nucleocapsid a p2 a p6. Proto obecně ení moc míst, kde by mutace či indel nevadily. Nejspíše asi v LTR (long terminal repeats) oblasti. */

/* Je dobré se podívat i na záložku Graphics -- ta ukazuje strukturu HIV genomu vizuálně. */
 
===== Variant calling =====

Now we will move from a vanilla read set to a more realistic one. Let us align 300 reads from an unknown HIV sample to the HXB2 reference genome. The key biological question is: does this HIV sample differ from the reference, and if so, where? This is the task of **variant calling** — identifying positions where the sequenced sample systematically differs from the reference genome. A difference at a single position is called a SNP (Single Nucleotide Polymorphism), while insertions and deletions are called INDELs. 

We use bcftools to perform variant calling. The code below generates a variant calling file:

<code|bash>
# Align 300 reads from the unknown HIV sample to the reference genome
bwa mem HIV_ref reads_sample_Q20.fastq > reads_sample_Q20.sam
# Convert SAM to BAM and sort by genomic position
samtools view -bS reads_sample_Q20.sam | samtools sort -o reads_sample_Q20.bam
# Index the BAM file for fast random access
samtools index reads_sample_Q20.bam
# Check alignment quality — all 300 reads should map successfully
samtools flagstat reads_sample_Q20.bam
# Call variants: summarize base composition at each position, then identify differences from reference
bcftools mpileup -f HIV_ref.fasta reads_sample_Q20.bam | bcftools call -mv --ploidy 1 -o variants.vcf 
less variants.vcf
</code>

/* bcftools předpokládá diploidní organismus (2 kopie každého chromozomu, jako u člověka). HIV je ale haploidní — má jen jednu kopii genomu. Proto --ploidy 1.*/

Learn the VCF structure [[https://samtools.github.io/hts-specs/VCFv4.2.pdf|here]]. The file can be analyzed with samtools:

<code|bash>
# Check read depth at the first called variant (SNP at position 501)
samtools depth -r HIV_complete_genome:501 reads_sample_Q20.bam
# Inspect the alignment around position 501 — verify the SNP visually
samtools tview -p HIV_complete_genome:490 reads_sample_Q20.bam HIV_ref.fasta
# Inspect the alignment around position 7000 — verify the high-confidence INDEL
samtools tview -p HIV_complete_genome:6990 reads_sample_Q20.bam HIV_ref.fasta
</code>

The overall variant calling workflow is as follows:

{{ :courses:bin:tutorials:read_alignment_wf.png?750 | The variant calling workflow}} 

Answer the following questions: 

  - How many variants were found, and what is the ratio of SNPs to INDELs?
  - What is the average read depth across all variant positions? Is it uniform?
  - Apply the standard filters (DP≥10, QUAL≥30). How many variants survive? What would you need to change in the experiment to make all variants pass?
  - **Are the called variants true mutations or sequencing artefacts?**
  - The INDEL at position 7000 has QUAL around 200 while most SNPs have QUAL below 30. Why might an INDEL receive a higher confidence score than a SNP at similar read depth?

/* QUAL is not simply a function of read depth — it is a likelihood ratio that compares two hypotheses:
    H0: the sample is identical to the reference at this position
    H1: there is a real variant at this position
   A high QUAL means H0 is extremely unlikely given the observed reads. What matters is not just how many reads support the variant, but how consistent they are and how unlikely the observation would be under H0.
   Why the INDEL scores higher
    The INDEL at position 7000 is an insertion of TT — A → ATT. This is a very specific signal:
    An insertion of exactly TT is extremely unlikely to arise from a sequencing error — base substitution errors are common, but spontaneous insertions of two specific bases at the same position across multiple independent reads are rare, 9 reads support the INDEL (the highest DP of all our variants) — more evidence than most SNPs, bcftools uses the indel likelihood model which accounts for the fact that the same insertion appearing in multiple reads is strong evidence for a real variant.
    By contrast, a SNP like G → A at DP=3 with QUAL=17 is ambiguous — a single damaged base or deamination event (the most common artefact:
cytosine (C) spontaneously loses an amino group and converts to uracil (U), which is read as thymine (T) by the polymerase) could explain it just as well as a real mutation. */

/* Read depth (DP) — how many reads cover the variant position? Our variants have DP between 3 and 9, which is very low. The gold standard is ≥ 10x, ideally 30x. With DP=3, a single sequencing error is enough to mimic a real variant.
   QUAL score — a Phred-scaled probability that the variant is real. The standard filter is QUAL ≥ 30. Several of our SNPs (positions 1201, 3801, 5101, and 6701) fall below this threshold and would be discarded in a real analysis.
   Strand bias — all our reads map to the forward strand only (5F/0R). A true mutation should be supported by reads from both strands. In our case this is caused by the low number of reads rather than a genuine artefact, but in a real dataset it would be a red flag.
   Mapping quality (MQ) — how confidently does the read map to this position? MQ=60 is the maximum and indicates a unique, unambiguous mapping — a good sign. A low MQ would suggest the read matches multiple locations in the genome, making the variant call unreliable. */
   

===== Independent work =====

So far all reads were single-end — each sequenced fragment was read from one end only. In **paired-end sequencing**, both ends of each DNA fragment are sequenced independently, producing two files: ''reads_Q20_R1.fastq'' (forward reads) and ''reads_Q20_R2.fastq'' (reverse reads). Each pair of reads originates from the same fragment, so they map to opposite strands at a predictable distance from each other.

Paired-end sequencing has two key advantages for variant calling. First, it doubles the number of reads covering each position, increasing read depth. Second, since variants are confirmed from both strands independently, strand bias artefacts are greatly reduced. 

Run a similar variant calling analysis with a set of paired-end reads. The //bwa// call changes to:

<code|bash>
bwa mem HIV_ref reads_Q20_R1.fastq reads_Q20_R2.fastq > reads_pe.sam
</code>

/* Although HIV is a single-stranded RNA virus, sequencing is always performed on double-stranded DNA — either as proviral DNA integrated in the host cell, or as cDNA generated by reverse transcription. Paired-end sequencing is therefore fully applicable.*/

/* V simulaci byl použit fragment_size=400 bp s read_length=150 bp, nebyla tedy čtena střední část fragmentu v délce 100 bp. */

Answer the following questions: 

  - How many reads mapped? How does //samtools flagstat// output differ from the single-end case — in particular, what does "properly paired" mean?
  - What are the changes in called variants with respect to the previous call?
  - How do paired-end reads generally influence trustworthiness of the alignment?

/* Máme jen 522 properly paired párových readů. Důvodem je řetězení nutných podmínek ke splnění: 1) oba ready z páru musí být samy o sobě úspěšně mapovány (to bylo splněno), 2) musí mapovat na stejný chromozom/referenci (zde splněno automaticky), 3) musí mapovat se správnou orientací, 4) musí být ve správné vzdálenosti. Špatné mapování je zde důsledkem samotné HIV sekvence v oblasti kolem pozice 1100–1600, která leží v genu gag. Oblast obsahuje repetitivní elementy které komplikují mapování. 87% properly paired readů je ale obecně solidní číslo. */

/* Volání je triviálním přepisem dříve použitých příkazů:
   samtools view -bS reads_pe.sam | samtools sort -o reads_pe.bam
   samtools index reads_pe.bam
   samtools flagstat reads_pe.bam
   samtools mpileup -g -f HIV_ref.fasta reads_pe.bam | bcftools call -mv -o variants_pe.vcf
   less variants_pe.vcf 
   samtools tview -p HIV_complete_genome:490-510 reads_pe.bam HIV_ref.fasta 
*/
   
/* Důležitější je interpretace: zajímavé je především to, že paired-end našel jen 6 z 10 variant — zbývající 4 (včetně všech INDELů) nebyly volány, protože jejich pokrytí nestačilo. Strand bias u pos 6701 (0% fwd, DP=4) je příklad kdy nízké DP způsobí zdánlivý bias i v paired-end datech. */

/* How to explain that we have less variants called with twice more reads: 
   More reads does not guarantee finding more variants — what matters is whether the reads happen to cover the variant positions. With low average coverage (~5–10x), there is a significant probability that some positions are covered by zero reads purely by chance. This is called sampling variance and is the key reason why sufficient coverage (≥30x) is essential for reliable variant calling. */

/* Na závěr je fajn ukázat zmutovaný genom, ze kterého byla generována čtení. Je v něm opravdu 10 změn nalezených na základě 300 nepárových čtení. */