logo Pre‑alignment

This section describes the pre‑alignment steps in the Poppy pipeline. Starting from the raw FASTQ files listed in units.tsv, the module performs adapter trimming and quality filtering with fastp and then merges FASTQ files when a sample has been sequenced across multiple flowcells or lanes. The merged FASTQs are then passed to the alignment module.

All pre‑alignment rules are provided by the Hydra‑Genetics prealignment module (v1.2.0).

Pre-alignment Workflow


Input Files

The raw input FASTQ paths are defined in units.tsv. This can be generated by using the hydra genetics command hydra-genetics create-input-files (see Poppy User Guide). Each row represents one sequencing unit (a unique combination of sample, type, flowcell, lane, and barcode):

Column Description
sample Sample identifier (must match an entry in samples.tsv)
type Unit type — T (tumour), or N (normal)
flowcell Flowcell identifier
lane Lane identifier (e.g. L001)
barcode Index barcode sequence(s)
fastq1 Absolute path to the forward‑read FASTQ file (R1)
fastq2 Absolute path to the reverse‑read FASTQ file (R2)

Workflow Steps

1. Fastp — Adapter Trimming & Quality Filtering

Each FASTQ pair (per flowcell / lane / barcode) is processed by fastp for adapter removal and quality filtering. Adapter sequences are automatically detected from the units.tsv barcode column.

Item Value
Container hydragenetics/fastp:0.20.1
Input Raw FASTQ files from units.tsv (fastq1, fastq2)
Output prealignment/fastp_pe/{sample}_{type}_{flowcell}_{lane}_{barcode}_fastq1.fastq.gz
prealignment/fastp_pe/{sample}_{type}_{flowcell}_{lane}_{barcode}_fastq2.fastq.gz
QC report prealignment/fastp_pe/{sample}_{type}_{flowcell}_{lane}_{barcode}_fastp.json

Fastp produces both HTML and JSON QC reports. The JSON report is consumed by MultiQC in the QC module.

2. Merged — Merge Multi‑Lane FASTQs

When a sample has been sequenced over multiple flowcells or lanes, the trimmed FASTQ files are concatenated (cat) into a single pair of merged FASTQ files per sample. If a sample was only sequenced on a single lane, this step simply passes through the trimmed file.

The trimmer_software setting in config.yaml controls which trimmer output is used as input for the merge step. In Poppy this is set to fastp_pe.

Item Value
Input All fastp_pe output files for the same sample and type
Output prealignment/merged/{sample}_{type}_fastq1.fastq.gz
prealignment/merged/{sample}_{type}_fastq2.fastq.gz

Key Output Files

Output File Description
prealignment/merged/{sample}_{type}_fastq1.fastq.gz Merged, trimmed forward reads (R1)
prealignment/merged/{sample}_{type}_fastq2.fastq.gz Merged, trimmed reverse reads (R2)
prealignment/fastp_pe/{sample}_{type}_*_fastp.json Fastp QC report (used by MultiQC)

Downstream Consumer

The merged FASTQ files are consumed by the alignment module as input to bwa_mem.


Configuration

The relevant sections in config.yaml:

trimmer_software: "fastp_pe"

fastp_pe:
  container: "docker://hydragenetics/fastp:0.20.1"

The trimmer_software setting tells the prealignment module to use fastp_pe output as input for the merge step. See the full config.yaml for all available settings.