Variant Calling Workflow

Nextflow Variant Calling

Variant Calling Workflow
Variant-Calling BASH script
Variant Calling Nextflow Pipeline
- Handle completion event
- Metrics and reports

This is part 9 of 14 of a Introduction to NextFlow.

First, let’s create a new directory called workflow and navigate to it:

mkdir workflow
cd workflow

Variant Calling Workflow

Our variant calling workflow has the following steps:

Index the reference genome for use by bwa.
Align reads to the reference genome using bwa mem.
Convert the aligned SAM file to BAM format using samtools.
Convert the format of the alignment to sorted BAM, with some intermediate steps.
Calculate the read coverage of positions in the genome.
Detect the single nucleotide variants (SNVs).
Filter and report the SNVs in VCF (variant calling format).

Read more about the variant-calling pipeline here -> Data Carpentry: Wrangling Genomics Lesson

Variant-Calling BASH script

set -e
mkdir /workspace/nextflow_tutorial/bash-results
cd /workspace/nextflow_tutorial/bash-results

genome=/workspace/nextflow_tutorial/data/ref_genome/ecoli_rel606.fasta

bwa index $genome

mkdir -p sam bam bcf vcf

for fq1 in /workspace/nextflow_tutorial/data/trimmed_fastq/*_1.trim.fastq.gz;
do
    echo "working with file $fq1"; base=$(basename $fq1 _1.trim.fastq.gz);  echo "base name is $base" \

    fq1=/workspace/nextflow_tutorial/data/trimmed_fastq/${base}_1.trim.fastq.gz
    fq2=/workspace/nextflow_tutorial/data/trimmed_fastq/${base}_2.trim.fastq.gz
    sam=/workspace/nextflow_tutorial/results/sam/${base}.aligned.sam
    bam=/workspace/nextflow_tutorial/results/bam/${base}.aligned.bam
    sorted_bam=/workspace/nextflow_tutorial/results/bam/${base}.aligned.sorted.bam
    raw_bcf=/workspace/nextflow_tutorial/results/bcf/${base}_raw.bcf
    variants=/workspace/nextflow_tutorial/results/vcf/${base}_variants.vcf
    final_variants=/workspace/nextflow_tutorial/results/vcf/${base}_final_variants.vcf

    bwa mem $genome $fq1 $fq2 > $sam
    samtools view -S -b $sam > $bam
    samtools sort -o $sorted_bam $bam
    samtools index $sorted_bam
    bcftools mpileup -O b -o $raw_bcf -f $genome $sorted_bam
    bcftools call --ploidy 1 -m -v -o $variants $raw_bcf
    vcfutils.pl varFilter $variants > $final_variants

done

Variant Calling Nextflow Pipeline

Now, let’s create a new Nextflow script named variant-calling.nf:

code variant-calling.nf

You can now start implementing your Nextflow pipeline by following the provided workflow steps. Be sure to define the input channels for each process and utilize the Nextflow syntax when writing your pipeline. You can refer to the Nextflow documentation for syntax, process definition, and examples.

In the variant-calling.nf script, you will be converting the BASH commands from the provided script into Nextflow processes. You can use the input and output directives to define input and output files for each process. You can also use Nextflow operators like view, split, and join to manipulate channels. Additionally, you can use thescriptdirective to embed the BASH commands directly into the Nextflow processes. Remember to make use of theparams` feature to make your pipeline more flexible and configurable.

/*
========================================================================================
   Variant-Calling Nextflow Workflow
========================================================================================
   Github   :
   Contact  :
----------------------------------------------------------------------------------------
*/

nextflow.enable.dsl=2

// Pipeline Input parameters

params.outdir = 'results'
params.genome = "/workspace/nextflow_tutorial/data/ref_genome/ecoli_rel606.fasta"
params.reads = "/workspace/nextflow_tutorial/data/trimmed_fastq/*_{1,2}.trim.fastq.gz"

println """\
         V A R I A N T-C A L L I N G - N F   P I P E L I N E
         ===================================
         genome       : ${params.genome}
         reads        : ${params.reads}
         outdir       : ${params.outdir}
         """
         .stripIndent()

/*
========================================================================================
   Create Channels
========================================================================================
*/

ref_ch = Channel.fromPath( params.genome, checkIfExists: true )
reads_ch = Channel.fromFilePairs( params.reads, checkIfExists: true )

/*
========================================================================================
   MAIN Workflow
========================================================================================
*/

workflow {

    FASTQC( reads_ch )
    BWA_INDEX( ref_ch )
    BWA_ALIGN( BWA_INDEX.out.bwa_index.combine(reads_ch) ) // https://www.nextflow.io/docs/latest/process.html#understand-how-multiple-input-channels-work
    SAMTOOLS_SORT( BWA_ALIGN.out.aligned_bam )
    // Enter the rest of the processes for variant calling based on the bash script below

}

/*
========================================================================================
   Processes
========================================================================================
*/

/*
 * Align reads to reference genome & create BAM file.
 */
process FASTQC {
    tag{"FASTQC ${reads}"}
    label 'process_low'

    publishDir("${params.outdir}/fastqc_trim", mode: 'copy')

    input:
    tuple val( sample_id ), path( reads )

    output:
    path( "*_fastqc*" ), emit: fastqc_out

    script:
    """
    fastqc ${reads}
    """
}

/*
 * Index the reference genome for use by bwa and samtools.
 */
process BWA_INDEX {
  tag{"BWA_INDEX ${genome}"}
  label 'process_low'

  publishDir("${params.outdir}/bwa_index", mode: 'copy')

  input:
  path genome

  output:
  tuple path( genome ), path( "*" ), emit: bwa_index

  script:
  """
  bwa index ${genome}
  """
}

/*
 * Align reads to reference genome & create BAM file.
 */
process BWA_ALIGN {
    tag{"BWA_ALIGN ${sample_id}"}
    label 'process_medium'

    publishDir("${params.outdir}/bwa_align", mode: 'copy')

    input:
    tuple path( genome ), path( "*" ), val( sample_id ), path( reads )

    output:
    tuple val( sample_id ), path( "${sample_id}.aligned.bam" ), emit: aligned_bam

    script:
    """
    INDEX=`find -L ./ -name "*.amb" | sed 's/.amb//'`
    bwa mem \$INDEX ${reads} > ${sample_id}.aligned.sam
    samtools view -S -b ${sample_id}.aligned.sam > ${sample_id}.aligned.bam
    """
}

/*
 * Convert the format of the alignment to sorted BAM.
 */
process SAMTOOLS_SORT {
  tag{"SAMTOOLS_SORT ${sample_id}"}
  label 'process_low'

  publishDir("${params.outdir}/bam_align", mode: 'copy')

  input:
  tuple val( sample_id ), path( bam )

  output:
  tuple val( sample_id ), path( "${sample_id}.aligned.sorted.bam" ), emit: sorted_bam

  script:
  """
  samtools sort -o "${sample_id}.aligned.sorted.bam" ${bam}
  """
}

/*
 * Index the BAM file for visualization purpose
 */
process SAMTOOLS_INDEX {

}

/*
 * Calculate the read coverage of positions in the genome.
 */
process BCFTOOLS_MPILEUP {

}

/*
 * Detect the single nucleotide variants (SNVs).
 */
process BCFTOOLS_CALL {

}

/*
 * Filter and report the SNVs in VCF (variant calling format).
 */
process VCFUTILS {

}

/*
========================================================================================
   Workflow Event Handler
========================================================================================
*/

workflow.onComplete {

   println ( workflow.success ? """
       Pipeline execution summary
       ---------------------------
       Completed at: ${workflow.complete}
       Duration    : ${workflow.duration}
       Success     : ${workflow.success}
       workDir     : ${workflow.workDir}
       exit status : ${workflow.exitStatus}
       """ : """
       Failed: ${workflow.errorReport}
       exit status : ${workflow.exitStatus}
       """
   )
}

/*
========================================================================================
   THE END
========================================================================================
*/

For each process, define the required input and output files using the input and output directives. Make sure to declare the necessary executables, such as bwa, samtools, bcftools, and vcfutils.pl, within the script directive as needed.

Throughout the development of your pipeline, consult the Nextflow documentation for specific syntax, operators, and other features that may help in creating an efficient and well-structured pipeline. Once you have completed the variant-calling.nf script, you can execute it using the nextflow run command, providing any necessary parameters and configurations as needed.

In this section, we will discuss how to run a Nextflow workflow and handle completion events, as well as generate metrics and reports. To run the variant-calling.nf script, use the following command:

nextflow run variant-calling.nf

To understand how multiple input channels work in Nextflow, you can refer to the documentation.

Handle completion event

Handling the completion event is essential to know when your pipeline has finished executing. Nextflow provides the workflow.onComplete event handler for this purpose. You can add this code block to your variant-calling.nf script to print a confirmation message when the script completes:

workflow.onComplete {
   println ( workflow.success ? """
       Pipeline execution summary
       ---------------------------
       Completed at: ${workflow.complete}
       Duration    : ${workflow.duration}
       Success     : ${workflow.success}
       workDir     : ${workflow.workDir}
       exit status : ${workflow.exitStatus}
       """ : """
       Failed: ${workflow.errorReport}
       exit status : ${workflow.exitStatus}
       """
   )
}

This code utilizes the ternary operator, which is a concise way of writing an if/else branch:

expression_is_true ? "set_value_to_a" : "set_value_to_b"

Metrics and reports

Nextflow is capable of generating various reports and charts, providing runtime metrics and execution information. You can enable these options when running your script:

-with-report: Creates a workflow execution report.
-with-trace: Generates a tab-separated file containing runtime information for each executed task, such as submission time, start time, completion time, CPU, and memory usage.
-with-timeline: Produces a workflow timeline report, which helps identify time-consuming tasks and bottlenecks.
-with-dag: Renders a directed acyclic graph representation of the workflow execution. This feature requires the installation of Graphviz, an open-source graph visualization software.

To run the workflow with these options, use the following command:

nextflow run variant-calling.nf -resume -with-report -with-trace -with-timeline -with-dag dag.png

Your final Nextflow workflow DAG image should resemble this:

Quick Recap:

Nextflow can combine tasks (processes) and manage data flows using channels within a single pipeline/workflow.
A workflow can be parameterized using params. The values of these parameters can be captured in a log file using log.info.
Workflow steps are connected via their inputs and outputs using Channels.
Nextflow can execute an action when the pipeline completes the execution using the workflow.onComplete event handler to print a confirmation message.
Nextflow can produce multiple reports and charts, providing runtime metrics and execution information using the command-line options -with-report, -with-trace, -with-timeline, and -with-dag.

Variant Calling Workflow

Variant-Calling BASH script

Variant Calling Nextflow Pipeline

Handle completion event

Metrics and reports

Back to:Nextflow Operators Next:NextFlow Configuration