Copy Number Variants (CNVs) and Structural Variants (SVs)

This section describes the CNV and SV detection steps in the Poppy pipeline and the generation of the final CNV HTML report. The workflow takes the merged, sorted, and duplicate-marked BAM files produced by the alignment module, together with variant calls from the SNV/indel module, and produces annotated and filtered candidate CNVs and SVs plus a stand-alone CNV HTML report.

The core rules are provided by the Hydra-Genetics cnv_sv and reports modules. Poppy adds several local adaptations (in workflow/rules/pindel_processing.smk and workflow/rules/svdb.smk) that adjust formatting, annotation, and merge behavior for the myeloid workflow.

CNV and SV Workflow


Input Files

Input File Source
alignment/samtools_merge_bam/{sample}_{type}.bam Alignment module
snv_indels/bcbio_variation_recall_ensemble/{sample}_{type}.ensembled.vep_annotated.filter.germline.vcf.gz SNV / Indels module
snv_indels/gatk_mutect2/{sample}_{type}.normalized.sorted.vep_annotated.filter.germline.bcftools_annotated.vcf.gz SNV / Indels module

Workflow Steps

1. CNVkit — Copy Number Segmentation

Calculates copy number segmentation from targeted sequencing read depths using CNVkit.

Item Value
Container hydragenetics/cnvkit:0.9.9
Input alignment/samtools_merge_bam/{sample}_{type}.bam
Input snv_indels/bcbio_variation_recall_ensemble/{sample}_{type}.ensembled.vep_annotated.filter.germline.vcf.gz
Output cnv_sv/cnvkit_batch/{sample}/{sample}_{type}.cns
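
As a rough illustration of what a segment file carries, the sketch below reads a tab-separated .cns file and converts each log2 ratio to a copy number. It assumes the standard CNVkit column names (chromosome, start, end, gene, log2) and a pure diploid sample; the example row and the helper names are made up for illustration and are not part of Poppy.

```python
import csv
import io


def log2_to_copy_number(log2_ratio: float, ploidy: int = 2) -> float:
    """Convert a log2 ratio to a copy number, assuming 100% tumor content."""
    return ploidy * 2 ** log2_ratio


def read_cns(handle):
    """Yield one dict per segment from a tab-separated .cns file."""
    for row in csv.DictReader(handle, delimiter="\t"):
        yield {
            "chromosome": row["chromosome"],
            "start": int(row["start"]),
            "end": int(row["end"]),
            "gene": row["gene"],
            "log2": float(row["log2"]),
        }


# Made-up example content, for illustration only.
example = (
    "chromosome\tstart\tend\tgene\tlog2\n"
    "chr9\t21000000\t22000000\tCDKN2A\t-1.0\n"
)
segments = list(read_cns(io.StringIO(example)))
copy_numbers = [log2_to_copy_number(s["log2"]) for s in segments]
```

A log2 ratio of -1.0 corresponds to one copy under these assumptions, which is why tumor content matters when interpreting the raw segments.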

2. GATK CNV — Copy Number Segmentation

Utilizes GATK tools (CollectReadCounts, DenoiseReadCounts, and ModelSegments) to produce copy number ratio segments.

Item Value
Container hydragenetics/gatk4:4.1.9.0
Input alignment/samtools_merge_bam/{sample}_{type}.bam
Output cnv_sv/gatk_model_segments/{sample}_{type}.cr.seg
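
The .cr.seg file can be inspected with a few lines of Python. This is a minimal sketch that assumes the usual ModelSegments layout: SAM-style header lines starting with "@", followed by a tab-separated table with the columns CONTIG, START, END, NUM_POINTS_COPY_RATIO, and MEAN_LOG2_COPY_RATIO. The function name and the example content are hypothetical.

```python
import io


def read_cr_seg(handle):
    """Yield copy ratio segments, skipping '@' header lines."""
    header = None
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("@"):
            continue  # SAM-style header or blank line
        fields = line.split("\t")
        if header is None:
            header = fields  # first non-header line holds the column names
            continue
        row = dict(zip(header, fields))
        yield {
            "contig": row["CONTIG"],
            "start": int(row["START"]),
            "end": int(row["END"]),
            "num_points": int(row["NUM_POINTS_COPY_RATIO"]),
            "mean_log2": float(row["MEAN_LOG2_COPY_RATIO"]),
        }


# Made-up example content, for illustration only.
example = (
    "@HD\tVN:1.6\n"
    "CONTIG\tSTART\tEND\tNUM_POINTS_COPY_RATIO\tMEAN_LOG2_COPY_RATIO\n"
    "chr1\t1000\t2000\t15\t-0.52\n"
)
gatk_segments = list(read_cr_seg(io.StringIO(example)))
```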

3. PureCN — Tumor Purity & Ploidy Estimation

Uses PureCN to estimate tumor purity and ploidy by integrating read depth with B-allele frequencies from the SNV/indel module VCF.

Item Value
Container hydragenetics/purecn:2.2.0
Input alignment/samtools_merge_bam/{sample}_{type}.bam
Input snv_indels/gatk_mutect2/{sample}_{type}.normalized.sorted.vep_annotated.filter.germline.bcftools_annotated.vcf.gz
Output Used internally for tumor content reporting and dynamic report rendering.
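
To see why a purity estimate matters for CNV interpretation, the sketch below applies a simple two-component mixture model: the observed copy ratio is treated as a purity-weighted average of the tumor copy number and a diploid normal. This is a textbook simplification, not PureCN's likelihood model, and the function name is hypothetical.

```python
import math


def purity_corrected_copy_number(log2_ratio: float, purity: float) -> float:
    """Estimate tumor copy number from an observed log2 ratio.

    Assumes observed_ratio = (purity * tumor_cn + (1 - purity) * 2) / 2,
    i.e. a diploid normal component and a diploid reference.
    """
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    observed_ratio = 2 ** log2_ratio
    return (2 * observed_ratio - 2 * (1 - purity)) / purity


# A single-copy loss in a 50% pure sample gives an observed ratio of 0.75.
corrected = purity_corrected_copy_number(math.log2(0.75), purity=0.5)
```

Without the correction, the same segment would be read as 1.5 copies, which illustrates how low tumor content dampens the signal.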

4. Pindel — Structural Variant Calling

Detects long insertions, deletions, and structural variants for a specific targeted set of myeloid genes using Pindel.

Item Value
Container hydragenetics/pindel:0.2.5b9
Input alignment/samtools_merge_bam/{sample}_{type}.bam
Output cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vcf.gz

5. Pindel Processing (Local Adaptation)

Pindel emits an older version of the VCF format, so its output needs formatting adjustments before downstream reporting. Local rules (workflow/rules/pindel_processing.smk) are executed to:

  • Fix AF: Adds allele frequency (AF) and depth (DP) to the INFO field.
  • VEP Annotation: Annotates the normalized Pindel VCF using VEP.
  • Artifact Annotation: Tags known artifacts based on a custom reference panel.
  • CSQ Correction: Adds missing CSQ annotation from VEP if necessary.

Item Value
Output cnv_sv/pindel_vcf/{sample}_{type}.no_tc.normalized.vep_annotated.artifact_annotated.vcf.gz
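
The "Fix AF" step can be pictured as follows. This is a minimal sketch, not Poppy's actual script: it assumes each record carries an AD entry in FORMAT with reference and alternate read counts (as "ref,alt"), handles only a single alternate allele, and uses a hypothetical function name.

```python
def add_af_dp(info: str, fmt: str, sample: str) -> str:
    """Return the INFO string with AF and DP appended, derived from FORMAT/AD."""
    values = dict(zip(fmt.split(":"), sample.split(":")))
    # Assumption: AD holds "ref_count,alt_count" for a biallelic record.
    ref_depth, alt_depth = (int(x) for x in values["AD"].split(",")[:2])
    depth = ref_depth + alt_depth
    allele_frequency = alt_depth / depth if depth else 0.0
    extra = f"AF={allele_frequency:.4f};DP={depth}"
    # An empty INFO column is written as "." in VCF.
    return extra if info in ("", ".") else f"{info};{extra}"


updated_info = add_af_dp("END=100;SVTYPE=DEL", "GT:AD", "0/1:90,10")
```

A complete implementation would also add the matching AF and DP definitions to the VCF header so that downstream tools accept the new INFO keys.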

6. SVDB Merge — Caller Combination

The CNV calls from CNVkit and GATK CNV are aggregated into a single set. Poppy uses a local rule (svdb_merge_wo_priority) to merge the VCF files from the two callers without prioritizing one over the other.

Item Value
Container hydragenetics/svdb:2.6.0
Output cnv_sv/svdb_merge/{sample}_{type}.{tc_method}.merged.vcf
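
Merging without priority can be thought of as clustering overlapping calls while keeping every caller's evidence. The sketch below is a simplified illustration and not SVDB's algorithm: it groups calls of the same type on the same chromosome when their intersection over union reaches a threshold. The 0.6 threshold, the class, and the function names are assumptions made for this example.

```python
from typing import NamedTuple


class Call(NamedTuple):
    caller: str
    chrom: str
    start: int
    end: int
    svtype: str


def overlap_fraction(a: Call, b: Call) -> float:
    """Intersection over union of two intervals; 0.0 if on different chromosomes."""
    if a.chrom != b.chrom:
        return 0.0
    intersection = min(a.end, b.end) - max(a.start, b.start)
    if intersection <= 0:
        return 0.0
    union = max(a.end, b.end) - min(a.start, b.start)
    return intersection / union


def merge_without_priority(calls, min_overlap=0.6):
    """Cluster calls by type and overlap; no caller takes precedence."""
    clusters = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.start, c.end)):
        for cluster in clusters:
            first = cluster[0]
            if first.svtype == call.svtype and overlap_fraction(first, call) >= min_overlap:
                cluster.append(call)
                break
        else:
            clusters.append([call])  # no matching cluster, start a new one
    return clusters


clusters = merge_without_priority([
    Call("cnvkit", "chr1", 100, 200, "DEL"),
    Call("gatk", "chr1", 110, 210, "DEL"),
    Call("gatk", "chr2", 100, 200, "DUP"),
])
```

Each resulting cluster lists all contributing callers, so a downstream report can show whether a CNV was seen by one caller or both.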

7. SVDB Query — Database Annotation

The merged VCF is annotated by querying it against structural variant databases with SVDB.

Item Value
Container hydragenetics/svdb:2.6.0
Output cnv_sv/svdb_query/{sample}_{type}.{tc_method}.svdb_query.vcf.gz

8. Annotation and Hard Filtering

The SVDB query output is further annotated with relevant CNV gene sets and then hard-filtered on predefined criteria to isolate high-confidence calls for clinical reporting. The criteria are defined in config_hard_filter_cnv.yaml:

Filter Criterion Description
`copy_number_normal` `INFO:SVTYPE = COPY_NORMAL` Hard filter CNVs with COPY_NORMAL
`artifacts` `(INFO:Normal_AF > 0.15)` Hard filter variants found in more than 15% of the normal samples
`amp_gene` `(!exist[[A-Za-z0-9_,-]+, INFO:Genes])` Only keep variants with gene annotations

Item Value
Output cnv_sv/svdb_query/{sample}_{type}.{tc_method}.svdb_query.annotate_cnv.cnv_genes.filter.cnv_hard_filter.vcf.gz
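
The three criteria can be read as predicates over a record's INFO fields. The sketch below mirrors them in plain Python for illustration; it does not reproduce the Hydra-Genetics filtering engine, it assumes INFO has already been parsed into a dict, and it assumes the gene pattern must match the whole Genes value. The function name is hypothetical.

```python
import re

# Same character class as the amp_gene criterion in the table above.
GENE_PATTERN = re.compile(r"[A-Za-z0-9_,-]+")


def hard_filter_reasons(info: dict) -> list:
    """Return the names of the filters that would remove this record."""
    reasons = []
    if info.get("SVTYPE") == "COPY_NORMAL":
        reasons.append("copy_number_normal")
    normal_af = info.get("Normal_AF")
    if normal_af is not None and float(normal_af) > 0.15:
        reasons.append("artifacts")
    genes = info.get("Genes")
    if genes is None or not GENE_PATTERN.fullmatch(str(genes)):
        reasons.append("amp_gene")  # no gene annotation present
    return reasons


kept = hard_filter_reasons({"SVTYPE": "DEL", "Normal_AF": "0.02", "Genes": "TP53"})
removed = hard_filter_reasons({"SVTYPE": "COPY_NORMAL", "Normal_AF": 0.3})
```

A record is retained only when the returned list is empty.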

9. HTML Reporting Module

The filtered CNV/SV results, along with purity and ploidy estimates, are combined by the reporting module to generate a stand-alone CNV HTML report. Results are formatted dynamically depending on the selected tumor content estimation method (e.g., pathology vs. purecn).

Item Value
Output reports/cnv_html_report/{sample}_{type}.pathology.cnv_report.html
Output reports/cnv_html_report/{sample}_{type}.purecn.cnv_report.html
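
The two report paths follow one pattern parameterized by the tumor content method. The sketch below derives them from a dict shaped like the svdb_merge section of the configuration; that the reports are driven by this particular list is an assumption, and the sample name "S1", the type "T", and the function name are hypothetical.

```python
# Shaped like the svdb_merge section of config.yaml (values copied from it).
svdb_merge = {
    "tc_method": [
        {"name": "purecn", "cnv_caller": ["cnvkit", "gatk"]},
        {"name": "pathology", "cnv_caller": ["cnvkit", "gatk"]},
    ]
}


def cnv_report_paths(sample: str, unit_type: str, config: dict) -> list:
    """Build one CNV report path per configured tumor content method."""
    return [
        f"reports/cnv_html_report/{sample}_{unit_type}.{method['name']}.cnv_report.html"
        for method in config["tc_method"]
    ]


paths = cnv_report_paths("S1", "T", svdb_merge)
```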

Key Output Files

Output File Description
vcf/{sample}_{type}.pindel.vep_annotated.filter.pindel.vcf.gz Final annotated and filtered Pindel VCF output
cnv/{sample}/{sample}_{type}.pathology.svdb_query.vcf.gz Merged CNV VCF annotated by SVDB query (pathology TC)
cnv/{sample}/{sample}_{type}.pathology.cnv_report.html CNV HTML dynamic report (pathology TC)
cnv/{sample}/{sample}_{type}.purecn.cnv_report.html CNV HTML dynamic report (PureCN TC)

Downstream Consumers

The outputs generated by the CNV and SV callers and the HTML report module are endpoints of the pipeline. They are delivered to clinical geneticists and researchers through the final output directory structure (the vcf/ and cnv/ folders).


Configuration

The relevant sections in config.yaml governing CNV calling, SVDB merging, Pindel processing, and HTML reports include:

cnvkit_batch:
  container: "docker://hydragenetics/cnvkit:0.9.9"
  normal_reference: "{{REFERENCE_DIRECTORY}}/reference_files/cnvkit.PoN.cnn"
  method: hybrid

gatk_collect_read_counts:
  container: "docker://hydragenetics/gatk4:4.1.9.0"

gatk_denoise_read_counts:
  container: "docker://hydragenetics/gatk4:4.1.9.0"
  normal_reference: "{{REFERENCE_DIRECTORY}}/reference_files/gatk.PoN.hdf5"

purecn:
  container: docker://hydragenetics/purecn:2.2.0
  genome: hg19
  interval_padding: 100
  segmentation_method: internal
  fun_segmentation: PSCBS
  normaldb: "{{REFERENCE_DIRECTORY}}/reference_files/purecn_normal_db.rds"
  intervals: "{{REFERENCE_DIRECTORY}}/reference_files/purecn_targets_intervals.txt"
  mapping_bias_file: "{{REFERENCE_DIRECTORY}}/reference_files/purecn_mapping_bias.rds"
  extra: "--model betabin --post-optimize"

pindel_call:
  container: "docker://hydragenetics/pindel:0.2.5b9"
  extra: "-x 2 -B 60"
  include_bed: "/path/to/twist_shortlist_pindel.bed"

svdb_merge:
  container: docker://hydragenetics/svdb:2.6.0
  tc_method:
    - name: purecn
      cnv_caller:
        - cnvkit
        - gatk
    - name: pathology
      cnv_caller:
        - cnvkit
        - gatk

cnv_html_report:
  cytobands: false
  show_table: true

See the full config.yaml for the complete configuration, including references to the filter definitions.