Version: Summer 21

nf-core

Workflow managers

The Makefile has been getting a little scary. It's great for one-off commands in a project, but not so much for full-blown data pipelines. There are plenty of more modern alternatives.

What's Nextflow?

Nextflow is an incredibly powerful and flexible workflow language. It's mainly used for bioinformatics analysis.

rnatoy.nf
#!/usr/bin/env nextflow

/*
 * Defines some parameters in order to specify the reference genomes
 * and read pairs by using the command line options
 */
params.reads = "$baseDir/data/ggal/*_{1,2}.fq"
params.annot = "$baseDir/data/ggal/ggal_1_48850000_49020000.bed.gff"
params.genome = "$baseDir/data/ggal/ggal_1_48850000_49020000.Ggal71.500bpflank.fa"
params.outdir = 'results'

/*
 * Create the `read_pairs_ch` channel that emits tuples containing three elements:
 * the pair ID, the first read-pair file and the second read-pair file
 */
Channel
    .fromFilePairs( params.reads )
    .ifEmpty { error "Cannot find any reads matching: ${params.reads}" }
    .set { read_pairs_ch }

/*
 * Step 1. Builds the genome index required by the mapping process
 */
process buildIndex {
    tag "$genome.baseName"

    input:
    path genome from params.genome

    output:
    path 'genome.index*' into index_ch

    script:
    """
    bowtie2-build --threads ${task.cpus} ${genome} genome.index
    """
}

/*
 * Step 2. Maps each read-pair by using the Tophat2 mapper tool
 */
process mapping {
    tag "$pair_id"

    input:
    path genome from params.genome
    path annot from params.annot
    path index from index_ch
    tuple val(pair_id), path(reads) from read_pairs_ch

    output:
    tuple val(pair_id), path("accepted_hits.bam") into bam_ch

    script:
    """
    tophat2 -p ${task.cpus} --GTF $annot genome.index $reads
    mv tophat_out/accepted_hits.bam .
    """
}

/*
 * Step 3. Assembles the transcript by using the "cufflinks" tool
 */
process makeTranscript {
    tag "$pair_id"
    publishDir params.outdir, mode: 'copy'

    input:
    path annot from params.annot
    tuple val(pair_id), path(bam_file) from bam_ch

    output:
    tuple val(pair_id), path('transcript_*.gtf')

    script:
    """
    cufflinks --no-update-check -q -p $task.cpus -G $annot $bam_file
    mv transcripts.gtf transcript_${pair_id}.gtf
    """
}

DSL2 makes the same kind of pipeline a lot easier to follow:

/*
 * Default pipeline parameters. They can be overridden on the command line, e.g.
 * given `params.foo`, specify `--foo some_value` on the run command line.
 */

params.reads = "$baseDir/data/ggal/*_{1,2}.fq"
params.transcriptome = "$baseDir/data/ggal/ggal_1_48850000_49020000.Ggal71.500bpflank.fa"
params.outdir = "results"
params.multiqc = "$baseDir/multiqc"

log.info """\
         R N A S E Q - N F   P I P E L I N E
         ===================================
         transcriptome: ${params.transcriptome}
         reads        : ${params.reads}
         outdir       : ${params.outdir}
         """

// import modules
include { RNASEQ } from './modules/rnaseq'
include { MULTIQC } from './modules/multiqc'

/*
 * main script flow
 */
workflow {
    read_pairs_ch = channel.fromFilePairs( params.reads, checkIfExists: true )
    RNASEQ( params.transcriptome, read_pairs_ch )
    MULTIQC( RNASEQ.out, params.multiqc )
}

/*
 * completion handler
 */
workflow.onComplete {
    log.info ( workflow.success ? "\nDone! Open the following report in your browser --> $params.outdir/multiqc_report.html\n" : "Oops .. something went wrong" )
}

The thing that sets Nextflow apart is that it pushes the data through the pipeline, rather than pulling it through the way `make` resolves targets.
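To make the push model concrete, here's a tiny plain-shell sketch (not Nextflow; the sample names are made up): the downstream stage handles each item the moment the upstream stage emits it, much like a channel feeding a process.

```shell
# Push-style dataflow in miniature: the while-read "process" starts handling
# each sample as soon as the producer emits it, instead of being invoked
# on demand the way make pulls targets.
printf 'sample1\nsample2\n' |
while read -r sample; do
    echo "mapping ${sample}"
done
# prints "mapping sample1" then "mapping sample2"
```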

We're not going to cover how to write Nextflow scripts, but if you'd like to use them for your project you're welcome to learn.

nf-core Intro

A community effort to collect a curated set of analysis pipelines built using Nextflow.

Just as universities have core facilities (a genomics core, an imaging core, and so on), the Nextflow community has nf-core!

Enough talk, let's run it!

Testing a pipeline

nf-core installation docs

  1. Make a new directory in your scratch called rnaseq and open it up.
  2. Install Nextflow:
curl -fsSL get.nextflow.io | bash
  3. Load Singularity:
ml load singularity
  4. Run the pipeline with its built-in test profile:
nextflow run nf-core/rnaseq -profile test,utd_sysbio

Finding some data

There are lots of ways to shop for data.

SRA Explorer

  1. Search "covid rnaseq"
  2. Click the check box next to the runs you want
  3. Scroll up and click "Add datasets to collection"
  4. Open up your cart
  5. Open up the bash script for downloading the fastq files
  6. Move the bash script to sysbio
  7. Run the script to download the data
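For what it's worth, the bash script SRA Explorer hands you is typically just a list of curl commands against ENA's FTP mirror. This dry-run sketch prints (rather than runs) commands of that shape; the exact ENA directory convention assumed here (first six characters of the accession, then a zero-padded sub-directory from its last two digits) is an assumption worth double-checking against the script you actually download.

```shell
# Print (don't run) curl commands in the shape SRA Explorer generates.
# Assumed ENA layout: vol1/fastq/<first 6 chars>/0<last 2 digits>/<accession>/
for acc in SRR14607635 SRR14607636; do
    six=${acc:0:6}      # e.g. SRR146
    sub=0${acc: -2}     # e.g. 035
    echo "curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/${six}/${sub}/${acc}/${acc}.fastq.gz -o ${acc}.fastq.gz"
done
```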

nf-core/fetchngs

  1. Click on the SRA accession number
  2. Click on the run
  3. Copy the BioProject accession
  4. Open up SRA Run Selector and paste in the BioProject accession number
  5. Click on "Accession List" to download the list
SRR14607635
SRR14607636
SRR14607637
SRR14607638
SRR14607639
SRR14607640
SRR14607641
SRR14607642
  6. Move the file over to sysbio:
scp Downloads/SRR_Acc_List.txt sysbio:/scratch/applied-genomics/cov-t-rnaseq
  7. Use the nf-core/fetchngs pipeline:
nextflow run nf-core/fetchngs --input SRR_Acc_List.txt -profile utd_sysbio

SRA Run Selector

Running the nf-core pipeline

Refer to the usage section of the pipeline's docs

Using the nf-core launcher

  1. Open up the nf-core launch utility

  2. Select the rnaseq pipeline and click Launch

  3. Fill out the following command-line flags:

    • profile: utd_sysbio
    • input: samplesheet.csv
    • email: <netid>@utdallas.edu
    • genome: GRCh37
  4. Create a file named nf-params.json with the parameters it generates:

nf-params.json
{
    "input": "samplesheet.csv",
    "email": "<netid>@utdallas.edu",
    "genome": "GRCh37"
}
  5. We're going to need to create a samplesheet. Please refer to the usage section of the pipeline's docs

The data has been predownloaded for you to the group scratch directory /scratch/applied-genomics/ under cov-t-rnaseq

samplesheet.csv
sample,fastq_1,fastq_2,strandedness
patient1_plus_rep1,/scratch/applied-genomics/cov-t-rnaseq/SRR14607635_GSM5328143_PS_CD8_T_cells_patient1_Homo_sapiens_RNA-Seq.fastq.gz,,forward
patient2_plus_rep1,/scratch/applied-genomics/cov-t-rnaseq/SRR14607636_GSM5328144_PS_CD8_T_cells_patient2_Homo_sapiens_RNA-Seq.fastq.gz,,forward
patient3_plus_rep1,/scratch/applied-genomics/cov-t-rnaseq/SRR14607637_GSM5328145_PS_CD8_T_cells_patient3_Homo_sapiens_RNA-Seq.fastq.gz,,forward
patient4_plus_rep1,/scratch/applied-genomics/cov-t-rnaseq/SRR14607638_GSM5328146_PS_CD8_T_cells_patient4_Homo_sapiens_RNA-Seq.fastq.gz,,forward
patient1_minus_rep1,/scratch/applied-genomics/cov-t-rnaseq/SRR14607639_GSM5328147_PS-_CD8_T_cells_patient1_Homo_sapiens_RNA-Seq.fastq.gz,,forward
patient2_minus_rep1,/scratch/applied-genomics/cov-t-rnaseq/SRR14607640_GSM5328148_PS-_CD8_T_cells_patient2_Homo_sapiens_RNA-Seq.fastq.gz,,forward
patient3_minus_rep1,/scratch/applied-genomics/cov-t-rnaseq/SRR14607641_GSM5328149_PS-_CD8_T_cells_patient3_Homo_sapiens_RNA-Seq.fastq.gz,,forward
patient4_minus_rep1,/scratch/applied-genomics/cov-t-rnaseq/SRR14607642_GSM5328150_PS-_CD8_T_cells_patient4_Homo_sapiens_RNA-Seq.fastq.gz,,forward
tip

If you can't get the formatting right for whatever reason, there's a backup samplesheet at /scratch/applied-genomics/cov-t-rnaseq/samplesheet.csv; you just need to update the input path.
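Before launching, it's worth sanity-checking the samplesheet's shape. A small self-contained sketch (the one-row samplesheet here is a toy stand-in, not the real one) that verifies every data row has exactly four comma-separated fields:

```shell
# Build a toy samplesheet, then validate its shape with awk.
cat > toy_samplesheet.csv <<'EOF'
sample,fastq_1,fastq_2,strandedness
patient1_plus_rep1,/scratch/example/patient1.fastq.gz,,forward
EOF

# Flag any data row that doesn't have exactly 4 fields; exit non-zero if so.
awk -F',' 'NR > 1 && NF != 4 { print "line " NR " has " NF " fields"; bad = 1 }
           END { exit bad }' toy_samplesheet.csv && echo "samplesheet looks OK"
```

Note the empty third field: single-end runs leave fastq_2 blank but keep the comma, so the field count stays at four.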

  6. Start screen, a terminal session manager, so the pipeline keeps running if you disconnect:
login$ screen
info

Useful screen commands

# Start a new screen session:
screen

# Start a new named screen session:
screen -S session_name

# Reattach to an open screen:
screen -r session_name

# Detach from inside a screen:
Ctrl + A, D

# Kill the current screen session:
Ctrl + A, K
  7. Launch the pipeline:
nextflow run nf-core/rnaseq -r 3.2 -profile utd_sysbio -params-file nf-params.json

The pipeline should start up, and email you when it's finished!

Download the MultiQC report

  1. Open up the file explorer, navigate to results/multiqc/star_salmon/multiqc_report.html, right-click the HTML file, and select Download.

  2. Now that the MultiQC report is on your local computer, open it up in a web browser, preferably next to the pipeline's output docs.

    Some files of note:
    • results/salmon/*.tsv: various gene and transcript counts
    • results/star_salmon/*.bam: aligned BAM files
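Once the report checks out, those count tables are plain TSVs you can slice straight from the shell. A toy sketch (the gene IDs, column names, and counts below are made up, and the real Salmon output has more columns) pulling one sample's column:

```shell
# Toy stand-in for a results/salmon counts table; real output is wider.
printf 'gene_id\tpatient1_plus_rep1\tpatient1_minus_rep1\nACE2\t12\t87\nIFNG\t430\t5\n' > toy_counts.tsv

# Keep the gene IDs plus the first sample's counts.
cut -f1,2 toy_counts.tsv
```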