Step 1. Selecting the workflow
To get started, click the "New run" button, and then click on the nf-core pipeline of your choice.
Step 2. Sequencing data
Sequencing data files (fastq) must be stored on Katana servers in order to be analyzed in SEQ-flow.
Input directory
If the data are already stored on Katana, the path to the directory containing the fastq files should be specified in the dataset directory field.
Data can also be downloaded automatically from:
• a recent run at the Ramaciotti Centre, by specifying a URL.
• NCBI/SRA, by specifying a project ID number (e.g. a BioProject accession such as PRJNA123456).
If data are to be downloaded automatically, a path to a download directory should also be specified.
Sample sheet
All pipelines require a sample sheet, as a .csv file, with the following columns common to all pipelines:
sample, fastq_1, fastq_2, corresponding to: sample name, absolute path to read 1, absolute path to read 2.
In addition, each pipeline may require specific columns, e.g. strandedness for RNA-seq, or antibody and input for ChIP-seq.
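For illustration, a minimal RNA-seq sample sheet might look like the following (the sample names, file paths and strandedness value are hypothetical; check the pipeline documentation for the values it accepts):
sample,fastq_1,fastq_2,strandedness
CONTROL_REP1,/srv/scratch/zID/myproject/data/CONTROL_REP1_R1.fastq.gz,/srv/scratch/zID/myproject/data/CONTROL_REP1_R2.fastq.gz,auto
TREATED_REP1,/srv/scratch/zID/myproject/data/TREATED_REP1_R1.fastq.gz,/srv/scratch/zID/myproject/data/TREATED_REP1_R2.fastq.gz,auto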
There are several ways to build the sample sheet:
• Generate it yourself, and specify the .csv file in the pipeline parameters.
• Turn the “Generate Sample Sheet” button on, to use the automatic sample sheet generator.
You can view and edit the content of the automatically generated sample sheet. You can also download it and edit it locally, and then upload it.
The automatic sample sheet generator searches your input directory for .fastq files. It assigns read 1 and read 2 files based on common suffixes (e.g. R1.fastq.gz, R2.fastq.gz). The sample name is assigned as the text common to read 1 and read 2, up to a separator (“_”). If your files use uncommon separators or R1/R2 suffixes, you can enter those in the corresponding fields and click the Regenerate button.
Note that fastq files with the same read suffix and sample name are merged (e.g. sequencing across multiple lanes):
Sample_L001_R1.fastq.gz and Sample_L002_R1.fastq.gz will be merged.
Sample_L001_R2.fastq.gz and Sample_L002_R2.fastq.gz will be merged.
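As a rough sketch of this pairing logic (this is not the actual generator code; the data directory, separator and suffixes are assumptions), the following shell loop would print one sample,fastq_1,fastq_2 line per pair:
# Illustrative only: pair read files assuming names like <sample>_<lane>_R1.fastq.gz
for r1 in ./mydata/*_R1.fastq.gz; do
    r2="${r1%_R1.fastq.gz}_R2.fastq.gz"          # matching read 2 file
    sample="$(basename "$r1" | cut -d'_' -f1)"   # text before the first "_" separator
    echo "${sample},${r1},${r2}"
done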
Step 3. Pipeline parameters
Each pipeline has its own specific parameters; here we discuss those that are common to most pipelines.
Launch directory
This is the directory on Katana where the pipeline will output the results and save intermediate files (./work subdirectory) and log files. Make sure to read the section on storage space under "Getting Started" to understand how much storage space your launch directory is likely to need.
Sample sheet parameter
The "Input" field will be auto-filled if you used the automatic sample sheet generator. If would like to upload your own, provide the path to the csv file in the "Input" field.
Outdir
This is the name of the results sub-directory that will be created in the launch directory; it is a good idea to leave it at the default (./results).
Genome sequence and genome annotation files
There are two options for providing the genome sequence and genome annotation files:
• Providing an iGenomes reference genome in the Genome field (https://sapac.support.illumina.com/sequencing/sequencing_software/igenome.html).
For example, specifying GRCh38 will use the hg38 human genome build with NCBI annotation.
Note that the Ensembl annotation is only available in iGenomes with GRCh37.
• Providing a Fasta genome sequence file and a Gtf or Gff genome annotation file.
These can be paths on Katana where you have previously downloaded the sequence and annotation files.
You can also provide URLs from which the genome and annotation files can be downloaded.
E.g.:
Fasta: https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
Gtf: https://ftp.ensembl.org/pub/current_gtf/homo_sapiens/Homo_sapiens.GRCh38.110.gtf.gz
However, be aware that these can be large files that you should not download and store repeatedly.
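For example, a sketch of downloading the reference files once to a shared location on Katana and reusing the local paths in later runs (the destination directory is an assumption; adjust it to your own project layout):
mkdir -p /srv/scratch/$USER/references/GRCh38
cd /srv/scratch/$USER/references/GRCh38
wget https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
wget https://ftp.ensembl.org/pub/current_gtf/homo_sapiens/Homo_sapiens.GRCh38.110.gtf.gz
The downloaded files can then be entered as local paths in the Fasta and Gtf fields for subsequent runs.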
Step 4. Saving the results
After completing the run, you should (example commands below):
• back up the input fastq files (UNSW Data Archive)
• move the ./results directory to your scratch space if you used the seqflow directory
• delete the ./work directory
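A minimal sketch of the tidy-up steps, assuming the launch directory used in the CLI instructions below (the destination path is illustrative; archiving to the UNSW Data Archive is a separate process and is not shown):
cd /srv/scratch/seqflow/$USER/myproject               # your launch directory
mv ./results /srv/scratch/$USER/myproject_results     # keep the results on your own scratch space
rm -rf ./work                                         # remove intermediate files once the run is finished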
Frequent errors and troubleshooting
1. Out of space
e.g. Failed to write..., File does not exist
If a cryptic filesystem error occurs, first check how much space is available in the output folder: failing to write a file, or to read a file that should have been created, often indicates that the available space or quota has been exhausted.
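For example, from the launch directory you can check free space and the size of the intermediate files with standard commands (Katana may also provide its own quota tools):
df -h .                     # free space on the filesystem holding the launch directory
du -sh ./work ./results     # how much the intermediate and results directories are using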
2. Out of memory
e.g. java.lang.OutOfMemoryError: Java heap space
Individual pipeline tasks that run out of memory are retried automatically, but if the main Nextflow job itself runs out of memory it will fail. This is rare for bulk RNA-seq and usually only happens when Docker containers need to be converted to SIF (Singularity) images.
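If you are running the CLI, one way to give the main Nextflow job a larger Java heap is Nextflow's standard NXF_OPTS environment variable (the 4 GB value is an arbitrary example and must fit within the memory requested for your PBS job):
export NXF_OPTS='-Xms500m -Xmx4g'   # JVM heap limits for the Nextflow head job
nextflow run ...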
3. Missing parameter
Genome fasta file not specified with e.g. '--fasta genome.fa' or via a detectable config file.
This happens more often than it should, because "fasta" is not marked as a required field. In the web interface, the genome/fasta/gtf/gff fields are under "Reference genome options"; refer to the Choosing Parameters section.
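On the command line, the equivalent is to supply either an iGenomes key or explicit reference files, for example (the file paths are placeholders):
nextflow run nf-core/rnaseq --genome GRCh38 ...
# or, with explicit reference files:
nextflow run nf-core/rnaseq --fasta /path/to/genome.fa.gz --gtf /path/to/annotation.gtf.gz ...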
Choosing an Interface
All interfaces are based on Nextflow. Nextflow is a workflow language and executor for reproducible, containerized bioinformatics. The following serves as a quick, comparative reference for different ways you can run the same workflow, including ones you write yourself.
Running SEQ-flow CLI
The following instructions apply to community workflows or your own:
1. Connect via ssh
ssh <zid>@kdm.restech.unsw.edu.au
2. Create and enter a new project folder
mkdir -p /srv/scratch/seqflow/$USER/myproject && cd $_
git clone https://github.com/WalshKieran/katana-rnaseq-start.git .
3. (Optional) Download your data from Ramaciotti and create a sample sheet:
mkdir -p ./mydata1234 && wget -qO- https://mydata.ramaciotti...MYDATA1234.tar | tar xvz -C ./mydata1234
wget https://raw.githubusercontent.com/nf-core/rnaseq/master/bin/fastq_dir_to_samplesheet.py
python3 fastq_dir_to_samplesheet.py --recursive ./mydata1234 ./samplesheet.csv
4. Launch and monitor
qsub run.pbs
qstat -u $USER
tail .nextflow.log
5. (Optional) Stop your job
qsig <ID returned from qsub>
SEQ-flow CLI Optimisation
Below is an illustration of how to optimise resource requests for nf-core/rnaseq when you have no previous similar runs to draw on (e.g. runs with similar or greater read depth). This is not a substitute for reading the nf-optimizer documentation and its stated drawbacks carefully.
1. Limit the samples in your samplesheet, or by other means
head -n 5 samplesheet.csv > samplesheet_4.csv
2. Run Nextflow on the limited samples
export NXF_ENABLE_CACHE_INVALIDATION_ON_TASK_DIRECTIVE_CHANGE=false
nextflow run ... --input samplesheet_4.csv
3. Generate resources.config (limited to ~120GB, 12 hours)
nf-optimizer -m 500 120000 -t 300 43200 -o resources.config .
4. Run Nextflow on all samples
export NXF_ENABLE_CACHE_INVALIDATION_ON_TASK_DIRECTIVE_CHANGE=false
nextflow run ... --input samplesheet.csv -c resources.config -resume
nf-core RNA-seq
See below for a video demonstrating how to run the nf-core RNA-seq pipeline on SEQ-flow, and a chart explaining the results folder.