What is a Workflow Management System (WfMS)?

On Workflow Management Systems (WfMS) and their main features; Introduction to Nextflow

Workflow Management Systems (WfMS) are specialized systems designed to compose and execute a series of computational or data manipulation steps, known collectively as a workflow.

WfMS like Nextflow, Galaxy and Snakemake have been developed especially for managing computational data analysis workflows in diverse fields such as physics, chemistry and bioinformatics. The major advantages of using WfMS are that they simplify the development, execution and monitoring of pipelines, and that they allow workflows to be shared in a way that ensures the same results can be obtained from the same data (i.e. they are reproducible), which is a key requirement for any scientific investigation. The key features of WfMS are:

  • Run time management: program execution on the operating system is managed for you. Tasks and data are split (‘parallelised’) to run at the same time, speeding up your analyses.
  • Software management: WfMS natively support containerisation technologies like Docker and Singularity/Apptainer, which package software tools together with their dependencies, so pipelines can be deployed reliably on different platforms.
  • Interoperability: pipelines can be run on different types of computing infrastructure, from your own machine to a local cluster or cloud-based services such as AWS or Azure.
  • Reproducibility: the use of software management and version control means pipelines will produce the same results when re-run on different platforms.
  • Resumption: continuous checkpointing allows pipelines to be resumed from the last successfully executed step. This means you don’t have to restart the whole workflow if one of the steps fails or your computer decides to restart itself.
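
To make the resumption idea more concrete, here is a minimal Python sketch of checkpointing. It illustrates the general idea only and is not how Nextflow or any other WfMS implements it. The function names (`run_step`, `pipeline`), the step names and the toy "reads" are all hypothetical. Each step's result is saved under a key derived from the step name and its inputs, so a re-run after a failure reuses the finished steps and only executes what is still missing.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def run_step(name, func, inputs, cache_dir, log):
    """Run one step, or reuse its saved result if the same inputs were seen before."""
    # The checkpoint key depends on the step name and its inputs,
    # so changing an input forces the step to run again.
    key = hashlib.sha256(json.dumps([name, inputs], sort_keys=True).encode()).hexdigest()
    checkpoint = Path(cache_dir) / f"{key}.json"
    if checkpoint.exists():
        log.append(f"cached:{name}")
        return json.loads(checkpoint.read_text())
    result = func(inputs)
    checkpoint.write_text(json.dumps(result))  # only written if the step succeeded
    log.append(f"ran:{name}")
    return result


def pipeline(reads, cache_dir, log, fail_at_count=False):
    """A toy two-step workflow: trim each read, then count identical reads."""
    trimmed = run_step("trim", lambda r: [s[:4] for s in r], reads, cache_dir, log)

    def count(seqs):
        if fail_at_count:
            raise RuntimeError("simulated crash")
        return {s: seqs.count(s) for s in sorted(set(seqs))}

    return run_step("count", count, trimmed, cache_dir, log)


cache_dir = tempfile.mkdtemp()
log = []
reads = ["ACGTAA", "ACGTCC", "TTGACA"]

# First attempt: the second step fails after the first one has been checkpointed.
try:
    pipeline(reads, cache_dir, log, fail_at_count=True)
except RuntimeError:
    pass

# Second attempt: "trim" is read back from its checkpoint, only "count" is executed.
result = pipeline(reads, cache_dir, log)
print(log)
print(result)
```

On the second attempt the log shows the trimming step being served from its checkpoint, which is the behaviour the resumption feature describes.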

There are several different WfMS used by bioinformaticians, such as Nextflow, Snakemake and Galaxy. This week we’re going to focus on Nextflow as an example of a widely used workflow management system that has the key WfMS features listed above.

Nextflow

Nextflow is a WfMS that combines a runtime environment (the layer of software that provides the functionality necessary for a program to run), software designed to run other software (containerisation with Docker/Singularity) and a domain-specific language (DSL) that is used to actually write the computational pipeline. One of the primary concepts behind Nextflow is that Linux is the lingua franca (common language) of data science. It therefore follows the Linux philosophy of “small pieces loosely joined”: many simple but powerful command-line and scripting tools can be chained together to enable complex data manipulations. Nextflow goes further and adds the ability to define complex interactions between different programs and parallel computing environments, such as those commonly found in high performance computing clusters. It works using a dataflow programming model in which processes (typically programs or commands performing a given task) are connected to other processes via their inputs and outputs. Each process is run as soon as it receives its input. This is illustrated in the diagram below:

[Diagram: flows without and with workflow managers]

Figure: a) a simple program and b) its dataflow equivalent

In the sample program shown on the left, the tasks would be executed sequentially (this is how a bash script would work). Assuming each step takes one unit of time, the program would take three units, as each step would only start when the prior step was completed. In a dataflow programming model like the one on the right, the workflow would only take two units of time. This is because the read quantitation and QC steps do not depend on each other and can therefore be executed in parallel. This is only a simple example. For a much more complicated workflow, like viralrecon which we’ll introduce later, with several processes and dozens or even hundreds of inputs, the time saved by parallelising as many of the processes as possible is likely to be on the order of several hours.
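
The counting of time units above can be reproduced with a small Python sketch. It rests on several assumptions: every task takes exactly one unit of time, there are enough processors to run all ready tasks at once, and the first step, here given the hypothetical name `prepare`, is the one that both the quantitation and QC steps depend on. The `schedule` function is only an illustration of the dataflow idea, not Nextflow’s scheduler.

```python
def schedule(dependencies):
    """Group tasks into time units: a task runs as soon as all of its inputs are done."""
    done = set()
    rounds = []
    while len(done) < len(dependencies):
        # A task is ready when it has not run yet and everything it needs has finished.
        ready = sorted(
            task
            for task, needs in dependencies.items()
            if task not in done and set(needs) <= done
        )
        if not ready:
            raise ValueError("circular dependency between tasks")
        rounds.append(ready)
        done.update(ready)
    return rounds


# Hypothetical task names; each task lists the tasks whose output it needs.
dependencies = {
    "prepare": [],
    "quantitation": ["prepare"],
    "qc": ["prepare"],
}

sequential_units = len(dependencies)      # bash-style: one task after another
dataflow_rounds = schedule(dependencies)  # dataflow: run everything that is ready
dataflow_units = len(dataflow_rounds)

print(sequential_units, dataflow_units)
print(dataflow_rounds)
```

Because `quantitation` and `qc` only wait for `prepare`, they land in the same time unit, which is where the saving from three units to two comes from.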

Discussion point

Which is the best workflow management system for you?

There are a number of workflow management systems out there. We’ve mentioned Nextflow, Snakemake and Galaxy above, but there’s also Cromwell, which runs workflows written in the Workflow Description Language (WDL). Each can help you create workflows to speed up your bioinformatics analyses, but each comes with its own learning curve.

Question: What are the kinds of things you’re looking for in a workflow management tool? Do leave your comments in the discussion area below.

© Wellcome Connecting Science
This article is from the free online

Bioinformatics for Biologists: Analysing and Interpreting Genomics Datasets

Created by
FutureLearn - Learning For Life
