Introduction to Nextflow

Nextflow Introduction


This is part 1 of 14 of an Introduction to Nextflow.


Learning Objectives:


Workflows


However, managing programming logic and software becomes challenging as workflows grow larger and more complex.


Workflow Management Systems

Recently, Workflow Management Systems (WfMS), such as Snakemake, Galaxy, and Nextflow, have emerged to specifically manage computational data-analysis workflows.

These Workflow Management Systems offer multiple features that simplify the development, monitoring, execution, and sharing of pipelines:



Wratten, L., Wilm, A. & Göke, J. Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers. Nat Methods 18, 1161–1168 (2021). https://doi.org/10.1038/s41592-021-01254-9



Core Features of Nextflow:


Nextflow Basic Concepts

Processes, Channels, and Workflows

Nextflow workflows consist of three primary components: processes, channels, and workflows.

Workflow Execution

Your First Nextflow Script

Navigate to the Nextflow tutorial directory:

cd /workspace/nextflow_tutorial

Next, create a Nextflow script that counts the number of lines in a file. Create the file word_count.nf in the current directory, using code word_count.nf or your preferred text editor, and paste in the following code:

#!/usr/bin/env nextflow

nextflow.enable.dsl=2

/*
========================================================================================
    Workflow parameters are written as params.<parameter>
    and can be initialised using the `=` operator.
========================================================================================
*/

params.input = "data/untrimmed_fastq/SRR2584863_1.fastq.gz"

/*
========================================================================================
    Input data is received through channels
========================================================================================
*/

input_ch = Channel.fromPath(params.input)

/*
========================================================================================
   Main Workflow
========================================================================================
*/

workflow {
    // A process is invoked by its name, with its input passed in parentheses.

    NUM_LINES(input_ch)

    /* Process output is accessed using the `out` channel.
       The channel operator view() is used to print process output to the terminal. */

    NUM_LINES.out.view()

}

/*
========================================================================================
    A Nextflow process block. Process names are written, by convention, in uppercase.
    This convention is used to enhance workflow readability.
========================================================================================
*/

process NUM_LINES {

    input:
    path read

    output:
    stdout

    script:
    """
    # Print the file name, followed by a tab
    printf '${read}\t'

    # Unzip file and count number of lines
    gunzip -c ${read} | wc -l
    """
}

This Nextflow script includes:

  1. An optional interpreter directive (“Shebang”) line, specifying the location of the Nextflow interpreter.
  2. nextflow.enable.dsl=2 to enable DSL2 syntax.
  3. A multi-line Nextflow comment, written using C-style block comments, followed by a single-line comment.
  4. A pipeline parameter params.input with a default value: a string holding the relative path to a compressed FASTQ file.
  5. A Nextflow process block named NUM_LINES, which defines what the process does.
  6. An input definition block that assigns the input to the variable read and declares that it should be interpreted as a file path.
  7. An output definition block that captures the standard output stream (stdout) of the script block.
  8. A script block containing the bash commands printf '${read}\t' and gunzip -c ${read} | wc -l.
  9. A Nextflow channel input_ch used to read in data for the workflow.
  10. An unnamed workflow execution block, which is the default workflow to run.
  11. A call to the process NUM_LINES with the input channel input_ch.
  12. An operation on the process output, using the channel operator .view().

Run a Nextflow Script

To run the script, enter the following command in your terminal:

nextflow run word_count.nf

The output should resemble the text shown below:

N E X T F L O W  ~  version 21.04.3
Launching `word_count.nf` [marvelous_mestorf] - revision: c09ee14ad4
executor >  local (1)
[81/92e3e9] process > NUM_LINES (1) [100%] 1 of 1 ✔
SRR2584863_1.fastq.gz 6213036
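
The number reported is a line count, not a read count. To make that concrete, here is a minimal shell sketch of the same two commands run outside Nextflow. The file sample.fastq.gz and its two made-up records are hypothetical stand-ins for the tutorial data, and the division by four assumes well-formed FASTQ records of four lines each.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for the tutorial data: two made-up FASTQ records.
tmp_dir=$(mktemp -d)
read_file="$tmp_dir/sample.fastq.gz"
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' | gzip -c > "$read_file"

# The same two commands as the NUM_LINES script block,
# with the file path substituted by hand instead of by Nextflow.
printf '%s\t' "$(basename "$read_file")"
line_count=$(gunzip -c "$read_file" | wc -l)
echo "$line_count"

# Each FASTQ record occupies four lines, so reads = lines / 4.
read_count=$((line_count / 4))
echo "reads: $read_count"
```

Applied to the 6213036 lines reported for SRR2584863_1.fastq.gz, the same arithmetic gives 1553259 reads, assuming every record is a well-formed four-line entry.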

Pipeline Parameters

Re-run the Nextflow script by entering the following command in your terminal:

nextflow run word_count.nf --input 'data/untrimmed_fastq/*.fastq.gz'

The string specified on the command line will override the default value of the parameter in the script. The output will look like this:

N E X T F L O W  ~  version 21.04.3
Launching `word_count.nf` [desperate_agnesi] - revision: bf77afb9d7
executor >  local (6)
[60/584db8] process > NUM_LINES (5) [100%] 6 of 6 ✔
SRR2589044_1.fastq.gz 4428360

SRR2589044_2.fastq.gz 4428360

SRR2584863_1.fastq.gz 6213036

SRR2584863_2.fastq.gz 6213036

SRR2584866_2.fastq.gz 11073592

SRR2584866_1.fastq.gz 11073592

The pipeline executes the process six times, creating one task for each file that matches the pattern. Because the tasks run in parallel, the order in which their output is reported is not guaranteed, so you may see the lines in a different order when you run the script.
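
The single quotes around the pattern in the command matter. Without them, the shell would expand the glob itself before Nextflow starts, and the parameter would not receive the pattern as one string. The sketch below illustrates only the shell side of this; count_args and the three empty files are hypothetical helpers made up for the demonstration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: reports how many arguments it received.
count_args() { echo "$#"; }

# Hypothetical stand-in directory with three empty files matching the pattern.
demo_dir=$(mktemp -d)
touch "$demo_dir/a_1.fastq.gz" "$demo_dir/a_2.fastq.gz" "$demo_dir/b_1.fastq.gz"

# Unquoted: the shell expands the glob before the program starts,
# so the program sees one argument per matching file.
unquoted=$(count_args "$demo_dir"/*.fastq.gz)

# Quoted: the pattern is passed through as a single literal string,
# leaving the matching to the program that receives it.
quoted=$(count_args "$demo_dir/*.fastq.gz")

echo "unquoted: $unquoted argument(s), quoted: $quoted argument(s)"
```

With the quoted form, the pattern reaches params.input intact and Channel.fromPath does the matching, which is why one task is created per file.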

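Nextflow runs each task in its own subdirectory of the work directory, where the input file is staged as a symbolic link:

work/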
├── 13
│   └── 46936e3927b74ea6e5555ce7b7b56a
│       └── SRR2589044_2.fastq.gz -> nextflow_tutorial/data/untrimmed_fastq/SRR2589044_2.fastq.gz
├── 43
│   └── 3819a4bc046dd7dc528de8eae1c6b8
│       └── SRR2584863_1.fastq.gz -> nextflow_tutorial/data/untrimmed_fastq/SRR2584863_1.fastq.gz
├── 5a
│   └── d446a792db2781ccd0c7aaafdff329
│       └── SRR2584866_2.fastq.gz -> nextflow_tutorial/data/untrimmed_fastq/SRR2584866_2.fastq.gz
├── 76
│   └── b1e11f2e706b901e0c57768a2af59f
│       └── SRR2589044_1.fastq.gz -> nextflow_tutorial/data/untrimmed_fastq/SRR2589044_1.fastq.gz
├── 9c
│   └── 1b1ebc2ea11a395e4e9dcf805b2c7d
│       └── SRR2584863_2.fastq.gz -> nextflow_tutorial/data/untrimmed_fastq/SRR2584863_2.fastq.gz
└── ce
    └── b94ef609ee54f4b7ea79dc23eb32bb
        └── SRR2584866_1.fastq.gz -> nextflow_tutorial/data/untrimmed_fastq/SRR2584866_1.fastq.gz

12 directories, 6 files
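
The arrows in the listing are symbolic links: each task directory points back to the original input file rather than holding a copy of it. The following is a minimal shell sketch of that staging idea, not Nextflow's own code. All paths are hypothetical and live under a temporary directory; real task directory names are hashes chosen by Nextflow.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical layout: one input file and one task directory.
base_dir=$(mktemp -d)
mkdir -p "$base_dir/data" "$base_dir/work/ab/cdef12"
printf 'example\n' | gzip -c > "$base_dir/data/sample.fastq.gz"

# Stage the input by linking to it, rather than copying it.
ln -s "$base_dir/data/sample.fastq.gz" "$base_dir/work/ab/cdef12/sample.fastq.gz"

# The link resolves to the original file, and reading through it works as usual.
link_target=$(readlink "$base_dir/work/ab/cdef12/sample.fastq.gz")
staged_content=$(gunzip -c "$base_dir/work/ab/cdef12/sample.fastq.gz")
echo "link target: $link_target"
echo "content read through link: $staged_content"
```

The listing shows only the links because hidden files are omitted by default; Nextflow also writes hidden files such as .command.sh into each task directory.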

Enter the following command to list the pipeline runs executed from this directory. The output should resemble the text shown below:

nextflow log
TIMESTAMP               DURATION        RUN NAME                STATUS  REVISION ID     SESSION ID                              COMMAND
2021-11-16 07:17:23     5.9s            irreverent_leakey       OK      bf77afb9d7      17d06cd0-3bb9-4d32-9d75-48bfdf5401a9    nextflow run word_count.nf
2021-11-16 07:23:00     11.1s           desperate_agnesi        OK      bf77afb9d7      41f78242-27d7-462a-a88d-80b6ec9dc5db    nextflow run word_count.nf --input 'data/untrimmed_fastq/*.fastq.gz'

Quick Recap

  • A workflow is a sequence of tasks that process a set of data, and a workflow management system (WfMS) is a computational platform that provides an infrastructure for the set-up, execution and monitoring of workflows.
  • Nextflow scripts consist of channels for controlling inputs and outputs, and processes for defining workflow tasks.
  • You run a Nextflow script using the nextflow run command.
  • Nextflow stores working files in the work directory.
  • The nextflow log command can be used to see information about executed pipelines.
