Alignment
This section describes the alignment steps in the Poppy pipeline. It takes the trimmed and merged FASTQ files produced by the pre‑alignment module and produces a single, sorted, duplicate‑marked BAM file per sample that is used by downstream modules (SNV/indel calling, CNV/SV detection, and QC).
All alignment rules are provided by the Hydra‑Genetics alignment module (v0.5.1).
Input Files
The inputs are the paired FASTQ files (read 1 and read 2) written by the pre‑alignment module:
| Input File | Source |
|---|---|
| prealignment/merged/{sample}_{type}_fastq1.fastq.gz | Pre‑alignment module |
| prealignment/merged/{sample}_{type}_fastq2.fastq.gz | Pre‑alignment module |
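To make the `{sample}` and `{type}` wildcards concrete, here is a minimal Python sketch that fills in the two path templates from the table. The sample name `sampleA` and type `T` are made-up values for illustration, and the helper `input_fastqs` is hypothetical, not part of the pipeline.

```python
# Path templates copied from the input table; wildcard values below are invented.
INPUT_TEMPLATES = (
    "prealignment/merged/{sample}_{type}_fastq1.fastq.gz",
    "prealignment/merged/{sample}_{type}_fastq2.fastq.gz",
)


def input_fastqs(sample, unit_type):
    """Fill the {sample} and {type} wildcards for one sample (hypothetical helper)."""
    return [t.format(sample=sample, type=unit_type) for t in INPUT_TEMPLATES]


fastqs = input_fastqs("sampleA", "T")
for path in fastqs:
    print(path)
```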
Workflow Steps
1. BWA‑MEM — Read Alignment
Each FASTQ pair (one per flowcell / lane / barcode) is aligned to the reference genome using BWA‑MEM.
| Item | Value |
|---|---|
| Container | hydragenetics/bwa_mem:0.7.17 |
| Input | prealignment/merged/{sample}_{type}_{read}.fastq.gz |
| Output | alignment/bwa_mem/{sample}_{type}_{flowcell}_{lane}_{barcode}.bam |
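As a rough illustration of this step, the sketch below assembles a `bwa mem` command line and pipes its SAM output through `samtools view` to get a BAM. It is not the actual Hydra‑Genetics rule. The thread count, read group string, index prefix, and the wildcard values (`sampleA`, `T`, `FC1`, `L001`, `ACGT`) are all assumptions chosen for the example.

```python
import shlex


def bwa_mem_pipeline(reference, fastq1, fastq2, out_bam, read_group, threads=8):
    """Return a shell pipeline string (hypothetical helper, nothing is executed).

    bwa mem writes SAM to stdout; samtools view -b converts it to BAM.
    """
    align = ["bwa", "mem", "-t", str(threads), "-R", read_group,
             reference, fastq1, fastq2]
    to_bam = ["samtools", "view", "-b", "-o", out_bam, "-"]
    return shlex.join(align) + " | " + shlex.join(to_bam)


cmd = bwa_mem_pipeline(
    reference="/path/to/reference",  # assumed BWA index prefix
    fastq1="prealignment/merged/sampleA_T_fastq1.fastq.gz",
    fastq2="prealignment/merged/sampleA_T_fastq2.fastq.gz",
    out_bam="alignment/bwa_mem/sampleA_T_FC1_L001_ACGT.bam",
    read_group=r"@RG\tID:sampleA_T.FC1.L001\tSM:sampleA_T\tPL:ILLUMINA",  # assumed
)
print(cmd)
```

Building the command as argument lists and quoting it with `shlex.join` keeps the read group string, which contains backslashes, safe to paste into a shell.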
2. Samtools Merge (per‑lane) — Merge Lane BAMs
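When a sample has been sequenced across multiple flowcells or lanes, the per‑lane BAM files are combined into a single BAM with samtools merge. The merged file is unsorted at this point; the output path below refers to the BAM after it has been sorted.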
| Item | Value |
|---|---|
| Output | alignment/bwa_mem/{sample}_{type}.bam (after sorting) |
3. Samtools Extract Reads — Split by Chromosome
The merged BAM is split into per‑chromosome BAM files. This allows duplicate marking to run in parallel for each chromosome, significantly reducing wall‑clock time.
| Item | Value |
|---|---|
| Output | alignment/samtools_extract_reads/{sample}_{type}_{chr}.bam |
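The split can be pictured as one `samtools view` call per chromosome, each restricted to a single region. The following sketch only builds the commands. It assumes the input is the sorted BAM from step 2 and that an index for it is available (region queries need one). The chromosome list and the helper name are hypothetical.

```python
def extract_commands(sample, unit_type, chromosomes):
    """Map each chromosome to a samtools view command (hypothetical helper)."""
    # Input path assumed to be the sorted, merged BAM from step 2.
    in_bam = f"alignment/bwa_mem/{sample}_{unit_type}.bam"
    commands = {}
    for chrom in chromosomes:
        out_bam = (
            f"alignment/samtools_extract_reads/{sample}_{unit_type}_{chrom}.bam"
        )
        # -b: write BAM; the trailing argument restricts output to one region.
        commands[chrom] = ["samtools", "view", "-b", "-o", out_bam, in_bam, chrom]
    return commands


# Example chromosome names; the real list depends on the reference genome.
commands = extract_commands("sampleA", "T", ["chr1", "chr2", "chrX"])
for chrom, argv in commands.items():
    print(chrom, " ".join(argv))
```

Because each command touches only its own output file, the per-chromosome jobs can be scheduled independently, which is what makes the next step parallelisable.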
4. Picard MarkDuplicates — Duplicate Marking
Duplicate reads are flagged independently per chromosome using Picard MarkDuplicates.
| Item | Value |
|---|---|
| Container | hydragenetics/picard:2.25.0 |
| Output | alignment/picard_mark_duplicates/{sample}_{type}_{chr}.bam |
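A per-chromosome MarkDuplicates call might look like the sketch below. It uses Picard's `INPUT=`, `OUTPUT=`, and `METRICS_FILE=` arguments. The `picard` launcher name and the metrics file path are assumptions (the table above does not list a metrics output), and the real rule may pass additional options.

```python
def mark_duplicates_command(sample, unit_type, chrom):
    """Build a Picard MarkDuplicates command for one chromosome (hypothetical helper)."""
    stem = f"{sample}_{unit_type}_{chrom}"
    in_bam = f"alignment/samtools_extract_reads/{stem}.bam"
    out_bam = f"alignment/picard_mark_duplicates/{stem}.bam"
    # Metrics path is invented for this example; MarkDuplicates requires one.
    metrics = f"alignment/picard_mark_duplicates/{stem}.metrics.txt"
    return [
        "picard", "MarkDuplicates",
        f"INPUT={in_bam}",
        f"OUTPUT={out_bam}",
        f"METRICS_FILE={metrics}",
    ]


argv = mark_duplicates_command("sampleA", "T", "chr1")
print(" ".join(argv))
```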
5. Samtools Merge — Combine Chromosomes
The per‑chromosome, duplicate‑marked BAM files are merged back into a single BAM.
| Item | Value |
|---|---|
| Output | alignment/samtools_merge_bam/{sample}_{type}.bam (after sorting) |
6. Samtools Index — BAM Indexing
The final merged BAM is indexed so that it can be efficiently queried by downstream tools.
| Item | Value |
|---|---|
| Output | alignment/samtools_merge_bam/{sample}_{type}.bam.bai |
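Steps 5 and 6 together form the gather half of the scatter/gather pattern. As a sketch, assuming the same made-up sample and chromosome names as elsewhere in this section, the two commands can be assembled like this (the helper name is hypothetical, and the sorting mentioned in the step 5 table is not shown):

```python
def merge_and_index_commands(sample, unit_type, chromosomes):
    """Return (merge_argv, index_argv) for the final BAM (hypothetical helper)."""
    merged = f"alignment/samtools_merge_bam/{sample}_{unit_type}.bam"
    parts = [
        f"alignment/picard_mark_duplicates/{sample}_{unit_type}_{chrom}.bam"
        for chrom in chromosomes
    ]
    # samtools merge <out.bam> <in1.bam> <in2.bam> ...
    merge = ["samtools", "merge", merged, *parts]
    # samtools index <in.bam> writes <in.bam>.bai next to the BAM by default.
    index = ["samtools", "index", merged]
    return merge, index


merge_cmd, index_cmd = merge_and_index_commands("sampleA", "T", ["chr1", "chr2"])
print(" ".join(merge_cmd))
print(" ".join(index_cmd))
```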
Key Output Files
| Output File | Description |
|---|---|
| alignment/samtools_merge_bam/{sample}_{type}.bam | Merged, sorted, duplicate‑marked BAM |
| alignment/samtools_merge_bam/{sample}_{type}.bam.bai | BAM index |
Downstream Consumers
The final BAM and its index are copied into the results/bam/ output directory as final pipeline outputs:
- bam/{sample}_{type}.bam: Final merged BAM
- bam/{sample}_{type}.bam.bai: BAM index
They are also used by multiple downstream modules:
- SNV / Indels — GATK Mutect2, VarDict
- CNV / SV — CNVkit, GATK CNV, Pindel
- QC — Mosdepth, Picard CollectHsMetrics, samtools stats, and others
Configuration
The relevant sections in config.yaml:
bwa_mem:
  amb: "/path/to/reference.amb"
  ann: "/path/to/reference.ann"
  bwt: "/path/to/reference.bwt"
  pac: "/path/to/reference.pac"
  sa: "/path/to/reference.sa"
  container: "docker://hydragenetics/bwa_mem:0.7.17"
picard_mark_duplicates:
  container: "docker://hydragenetics/picard:2.25.0"
See the full config.yaml for all available settings.
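A quick sanity check of these settings can catch a missing index file entry before the workflow starts. The sketch below assumes the YAML has already been loaded into a plain dictionary (for example with a YAML parser). Treating every key shown above as required is an assumption for this example, and `missing_alignment_settings` is a hypothetical helper.

```python
# Keys taken from the config excerpt above; treating all of them as
# mandatory is an assumption made for this example.
BWA_MEM_KEYS = ("amb", "ann", "bwt", "pac", "sa", "container")


def missing_alignment_settings(config):
    """Return dotted names of alignment settings absent from config."""
    missing = []
    bwa = config.get("bwa_mem", {})
    for key in BWA_MEM_KEYS:
        if key not in bwa:
            missing.append(f"bwa_mem.{key}")
    if "container" not in config.get("picard_mark_duplicates", {}):
        missing.append("picard_mark_duplicates.container")
    return missing


complete = {
    "bwa_mem": {
        "amb": "/path/to/reference.amb",
        "ann": "/path/to/reference.ann",
        "bwt": "/path/to/reference.bwt",
        "pac": "/path/to/reference.pac",
        "sa": "/path/to/reference.sa",
        "container": "docker://hydragenetics/bwa_mem:0.7.17",
    },
    "picard_mark_duplicates": {
        "container": "docker://hydragenetics/picard:2.25.0",
    },
}
# Same config with the suffix array entry left out.
incomplete = {
    "bwa_mem": {k: v for k, v in complete["bwa_mem"].items() if k != "sa"},
    "picard_mark_duplicates": complete["picard_mark_duplicates"],
}

print(missing_alignment_settings(complete))
print(missing_alignment_settings(incomplete))
```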