Pre‑alignment
This section describes the pre‑alignment steps in the Poppy pipeline. Starting from the raw FASTQ files listed in units.tsv, the module performs adapter trimming and quality filtering with fastp and then merges FASTQ files when a sample has been sequenced across multiple flowcells or lanes. The merged FASTQs are then passed to the alignment module.
All pre‑alignment rules are provided by the Hydra‑Genetics prealignment module (v1.2.0).
Input Files
The raw input FASTQ paths are defined in units.tsv. This file can be generated with the Hydra‑Genetics command hydra-genetics create-input-files (see the Poppy User Guide). Each row represents one sequencing unit, that is, a unique combination of sample, type, flowcell, lane, and barcode:
| Column | Description |
|---|---|
| sample | Sample identifier (must match an entry in samples.tsv) |
| type | Unit type: T (tumour) or N (normal) |
| flowcell | Flowcell identifier |
| lane | Lane identifier (e.g. L001) |
| barcode | Index barcode sequence(s) |
| fastq1 | Absolute path to the forward‑read FASTQ file (R1) |
| fastq2 | Absolute path to the reverse‑read FASTQ file (R2) |
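
As a rough illustration of the table above, the following sketch parses a units.tsv-style table and runs a few sanity checks. The checks themselves (required columns, T/N unit types, no duplicate units) are assumptions derived from the column descriptions, not validation performed by Hydra‑Genetics, and the example rows and paths are hypothetical.

```python
import csv
import io

# Column names taken from the table above.
REQUIRED_COLUMNS = ["sample", "type", "flowcell", "lane", "barcode", "fastq1", "fastq2"]


def read_units(handle):
    """Parse a tab-separated units table and run basic sanity checks."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"units.tsv is missing columns: {missing}")

    units, seen = [], set()
    for row in reader:
        if row["type"] not in {"T", "N"}:
            raise ValueError(f"unexpected unit type: {row['type']!r}")
        # A unit is identified by sample, type, flowcell, lane and barcode.
        key = tuple(row[c] for c in REQUIRED_COLUMNS[:5])
        if key in seen:
            raise ValueError(f"duplicate sequencing unit: {key}")
        seen.add(key)
        units.append(row)
    return units


# Hypothetical example content; a real file holds absolute paths to your FASTQs.
example = io.StringIO(
    "sample\ttype\tflowcell\tlane\tbarcode\tfastq1\tfastq2\n"
    "S1\tT\tHXXXX\tL001\tACGT+TGCA\t/data/S1_L001_R1.fastq.gz\t/data/S1_L001_R2.fastq.gz\n"
    "S1\tT\tHXXXX\tL002\tACGT+TGCA\t/data/S1_L002_R1.fastq.gz\t/data/S1_L002_R2.fastq.gz\n"
)
units = read_units(example)
print(len(units), "sequencing units for sample", units[0]["sample"])
```

Extra columns in the file are kept on each row and otherwise ignored, so the sketch should tolerate tables that carry more metadata than the columns listed here.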
Workflow Steps
1. Fastp — Adapter Trimming & Quality Filtering
Each FASTQ pair (per flowcell / lane / barcode) is processed by fastp for adapter removal and quality filtering. Adapter sequences are automatically detected from the units.tsv barcode column.
| Item | Value |
|---|---|
| Container | hydragenetics/fastp:0.20.1 |
| Input | Raw FASTQ files from units.tsv (fastq1, fastq2) |
| Output | prealignment/fastp_pe/{sample}_{type}_{flowcell}_{lane}_{barcode}_fastq1.fastq.gz<br>prealignment/fastp_pe/{sample}_{type}_{flowcell}_{lane}_{barcode}_fastq2.fastq.gz |
| QC report | prealignment/fastp_pe/{sample}_{type}_{flowcell}_{lane}_{barcode}_fastp.json |
Fastp produces both HTML and JSON QC reports. The JSON report is consumed by MultiQC in the QC module.
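
To make the naming scheme concrete, here is a minimal sketch that expands the output patterns listed for this step for a single sequencing unit. The helper name fastp_outputs and the example unit are hypothetical; in the pipeline itself these paths are resolved by the workflow rules, not by a function like this.

```python
def fastp_outputs(unit):
    """Return the fastp_pe output paths expected for one sequencing unit."""
    prefix = "prealignment/fastp_pe/{sample}_{type}_{flowcell}_{lane}_{barcode}".format(**unit)
    return {
        "fastq1": f"{prefix}_fastq1.fastq.gz",
        "fastq2": f"{prefix}_fastq2.fastq.gz",
        "json": f"{prefix}_fastp.json",
    }


# Hypothetical unit, using the identifying columns from units.tsv.
example_unit = {
    "sample": "S1",
    "type": "T",
    "flowcell": "HXXXX",
    "lane": "L001",
    "barcode": "ACGT+TGCA",
}
outputs = fastp_outputs(example_unit)
for name, path in outputs.items():
    print(name, path)
```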
2. Merged — Merge Multi‑Lane FASTQs
When a sample has been sequenced over multiple flowcells or lanes, the trimmed FASTQ files are concatenated (cat) into a single pair of merged FASTQ files per sample and type. If a sample was sequenced on a single lane only, this step simply passes the trimmed files through unchanged.
The trimmer_software setting in config.yaml controls which trimmer output is used as input for the merge step. In Poppy this is set to fastp_pe.
| Item | Value |
|---|---|
| Input | All fastp_pe output files for the same sample and type |
| Output | prealignment/merged/{sample}_{type}_fastq1.fastq.gz<br>prealignment/merged/{sample}_{type}_fastq2.fastq.gz |
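
The merge step can be sketched roughly as follows: group the trimmed files by sample and type, then concatenate them byte for byte, which is what cat does and which remains a valid gzip stream. The function names merge_inputs and concat_gzip are hypothetical, the trimmer argument only mimics the role of trimmer_software, and the demo files are tiny made-up records; this is not the module's actual rule code.

```python
import gzip
import os
import shutil
import tempfile
from collections import defaultdict


def merge_inputs(units, read, trimmer="fastp_pe"):
    """Group trimmed FASTQ paths by (sample, type) for one read (fastq1 or fastq2)."""
    groups = defaultdict(list)
    for u in units:
        path = (
            f"prealignment/{trimmer}/{u['sample']}_{u['type']}_{u['flowcell']}"
            f"_{u['lane']}_{u['barcode']}_{read}.fastq.gz"
        )
        groups[(u["sample"], u["type"])].append(path)
    return dict(groups)


def concat_gzip(inputs, output):
    """Byte-wise concatenation, equivalent to `cat a.gz b.gz > out.gz`."""
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


# Demo with two tiny hypothetical gzip files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    parts = []
    for i, record in enumerate(["@r1\nACGT\n+\nIIII\n", "@r2\nTTGG\n+\nIIII\n"]):
        part = os.path.join(tmp, f"part{i}.fastq.gz")
        with gzip.open(part, "wt") as fh:
            fh.write(record)
        parts.append(part)
    merged = os.path.join(tmp, "merged.fastq.gz")
    concat_gzip(parts, merged)
    # gzip readers handle concatenated members transparently.
    with gzip.open(merged, "rt") as fh:
        merged_text = fh.read()

print(merged_text.count("\n") // 4, "FASTQ records after merging")
```

With a single input file, concat_gzip simply produces a copy of it, which mirrors the pass-through behaviour described for single-lane samples.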
Key Output Files
| Output File | Description |
|---|---|
| prealignment/merged/{sample}_{type}_fastq1.fastq.gz | Merged, trimmed forward reads (R1) |
| prealignment/merged/{sample}_{type}_fastq2.fastq.gz | Merged, trimmed reverse reads (R2) |
| prealignment/fastp_pe/{sample}_{type}_*_fastp.json | Fastp QC report (used by MultiQC) |
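
For a quick look at a fastp JSON report outside MultiQC, a sketch like the one below computes the fraction of reads that survived filtering. It assumes the report exposes summary.before_filtering.total_reads and summary.after_filtering.total_reads, which is worth checking against a report from your own fastp version; the example report is a hypothetical, heavily truncated stand-in.

```python
import json


def read_retention(report):
    """Fraction of reads kept by fastp, from a parsed fastp JSON report."""
    summary = report["summary"]
    before = summary["before_filtering"]["total_reads"]
    after = summary["after_filtering"]["total_reads"]
    return after / before if before else 0.0


# Hypothetical, truncated report; a real one contains many more fields.
example_report = json.loads(
    '{"summary": {"before_filtering": {"total_reads": 1000},'
    ' "after_filtering": {"total_reads": 950}}}'
)
retention = read_retention(example_report)
print(f"{retention:.1%} of reads kept")
```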
Downstream Consumer
The merged FASTQ files are consumed by the alignment module as input to bwa_mem.
Configuration
The relevant sections in config.yaml:

    trimmer_software: "fastp_pe"

    fastp_pe:
      container: "docker://hydragenetics/fastp:0.20.1"

The trimmer_software setting tells the prealignment module to use fastp_pe output as input for the merge step. See the full config.yaml for all available settings.