Step 1. Selecting the workflow
To get started, click the "New run" button, and then click on the nf-core pipeline of your choice.
Step 2. Sequencing data
Sequencing data files (fastq) must be stored on Katana servers in order to be analyzed in SEQ-flow.
Input directory
If the data are already stored on Katana, the path to the directory containing the fastq files should be specified in the dataset directory field.
Data can also be downloaded automatically from:
• a recent run at the Ramaciotti Centre, by specifying a URL.
• NCBI/SRA, by specifying a project ID number (e.g. a BioProject accession such as PRJNA123456).
If data are to be downloaded automatically, a path to a download directory should also be specified.
Sample sheet
All pipelines require a sample sheet, as a .csv file, with the following columns common to all pipelines:
sample, fastq_1, fastq_2, corresponding to: sample name, absolute path to read 1, absolute path to read 2.
In addition, each pipeline may require specific columns, e.g. strandedness for RNA-seq, or antibody and input for ChIP-seq.
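For illustration, a minimal RNA-seq sample sheet might look like the following (the sample names, file paths and strandedness value are hypothetical; check the pipeline documentation for the values it accepts):
sample,fastq_1,fastq_2,strandedness
CONTROL_REP1,/srv/scratch/zID/myproject/data/CONTROL_REP1_R1.fastq.gz,/srv/scratch/zID/myproject/data/CONTROL_REP1_R2.fastq.gz,auto
TREATED_REP1,/srv/scratch/zID/myproject/data/TREATED_REP1_R1.fastq.gz,/srv/scratch/zID/myproject/data/TREATED_REP1_R2.fastq.gz,auto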
There are several ways to build the sample sheet:
• Generate it yourself, and specify the .csv file in the pipeline parameters.
• Turn the “Generate Sample Sheet” button on, to use the automatic sample sheet generator.
You can view and edit the content of the automatically generated sample sheet. You can also download it and edit it locally, and then upload it.
The automatic sample sheet generator searches your input directory for .fastq files. It assigns read 1 and read 2 files based on common suffixes (e.g. R1.fastq.gz, R2.fastq.gz). The sample name is assigned as the text common to read 1 and read 2, up to a separator (“_”). If your files use uncommon separators or R1/R2 suffixes, you can enter those in the corresponding fields and click the Regenerate button.
Note that fastq files with the same read suffix and sample name are merged (e.g. sequencing across multiple lanes):
Sample_L001_R1.fastq.gz and Sample_L002_R1.fastq.gz will be merged.
Sample_L001_R2.fastq.gz and Sample_L002_R2.fastq.gz will be merged.
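As a rough sketch of this pairing logic (this is not the actual generator code; the data directory, separator and suffixes are assumptions), the following shell loop would print one sample,fastq_1,fastq_2 line per pair:
# Illustrative only: pair read files assuming names like <sample>_<lane>_R1.fastq.gz
for r1 in ./mydata/*_R1.fastq.gz; do
    r2="${r1%_R1.fastq.gz}_R2.fastq.gz"          # matching read 2 file
    sample="$(basename "$r1" | cut -d'_' -f1)"   # text before the first "_" separator
    echo "${sample},${r1},${r2}"
done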
Step 3. Pipeline parameters
Each pipeline has its own specific parameters; here we discuss those that are common to most pipelines.
Launch directory
This is the directory on Katana where the pipeline will output the results and save intermediate files (./work subdirectory) and log files. Make sure to read the section on storage space under "Getting Started" to understand how much storage space your launch directory is likely to need.
Sample sheet parameter
The "Input" field will be auto-filled if you used the automatic sample sheet generator. If would like to upload your own, provide the path to the csv file in the "Input" field.
Outdir
This is the name of the results sub-directory that will be created in the launch directory; it is a good idea to leave it at the default (./results).
Genome sequence and genome annotation files
There are two options for providing the genome sequence and genome annotation files:
• Providing an iGenomes reference genome in the Genome field (https://sapac.support.illumina.com/sequencing/sequencing_software/igenome.html).
For example, specifying GRCh38 will use the hg38 human genome build with NCBI annotation.
Note that the Ensembl annotation is only available in iGenomes with GRCh37.
• Providing a Fasta genome sequence file and a Gtf or Gff genome annotation file.
These can be paths on Katana where you have previously downloaded the sequence and annotation files.
You can also provide URLs from which the genome and annotation files can be downloaded.
E.g.:
Fasta: https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
Gtf: https://ftp.ensembl.org/pub/current_gtf/homo_sapiens/Homo_sapiens.GRCh38.110.gtf.gz
However, be aware that these can be large files that you should not download and store repeatedly.
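For example, a sketch of downloading the reference files once to a shared location on Katana and reusing the local paths in later runs (the destination directory is an assumption; adjust it to your own project layout):
mkdir -p /srv/scratch/$USER/references/GRCh38
cd /srv/scratch/$USER/references/GRCh38
wget https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
wget https://ftp.ensembl.org/pub/current_gtf/homo_sapiens/Homo_sapiens.GRCh38.110.gtf.gz
The downloaded files can then be entered as local paths in the Fasta and Gtf fields for subsequent runs.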
Step 4. Saving the results
After completing the run, you should (example commands below):
• back up the input fastq files (UNSW Data Archive)
• move the ./results directory to your scratch space if you used the seqflow directory
• delete the ./work directory
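A minimal sketch of the tidy-up steps, assuming the launch directory used in the CLI instructions below (the destination path is illustrative; archiving to the UNSW Data Archive is a separate process and is not shown):
cd /srv/scratch/seqflow/$USER/myproject               # your launch directory
mv ./results /srv/scratch/$USER/myproject_results     # keep the results on your own scratch space
rm -rf ./work                                         # remove intermediate files once the run is finished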
Frequent errors and troubleshooting
1. Out of space
e.g. Failed to write..., File does not exist
If a cryptic filesystem error occurs, first check how much space is available in the output folder: failing to write a file, or to read a file that should have been created, often indicates that the available space or quota has been exhausted.
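For example, from the launch directory you can check free space and the size of the intermediate files with standard commands (Katana may also provide its own quota tools):
df -h .                     # free space on the filesystem holding the launch directory
du -sh ./work ./results     # how much the intermediate and results directories are using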
2. Out of memory
e.g. java.lang.OutOfMemoryError: Java heap space
Individual pipeline tasks that run out of memory are retried automatically, but if the main Nextflow job itself runs out of memory it will fail. This is rare for bulk RNA-seq and usually only happens when Docker containers need to be converted to SIF (Singularity) images.
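If you are running the CLI, one way to give the main Nextflow job a larger Java heap is Nextflow's standard NXF_OPTS environment variable (the 4 GB value is an arbitrary example and must fit within the memory requested for your PBS job):
export NXF_OPTS='-Xms500m -Xmx4g'   # JVM heap limits for the Nextflow head job
nextflow run ...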
3. Missing parameter
Genome fasta file not specified with e.g. '--fasta genome.fa' or via a detectable config file.
This happens more often than it should, because "fasta" is not marked as a required field. In the web interface, the genome/fasta/gtf/gff fields are under "Reference genome options"; refer to the Choosing Parameters section.
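On the command line, the equivalent is to supply either an iGenomes key or explicit reference files, for example (the file paths are placeholders):
nextflow run nf-core/rnaseq --genome GRCh38 ...
# or, with explicit reference files:
nextflow run nf-core/rnaseq --fasta /path/to/genome.fa.gz --gtf /path/to/annotation.gtf.gz ...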
Choosing an Interface
All interfaces are based on Nextflow. Nextflow is a workflow language and executor for reproducible, containerized bioinformatics. The following serves as a quick, comparative reference for different ways you can run the same workflow, including ones you write yourself.
Running SEQ-flow CLI
The following instructions apply to community workflows or your own:
1. Connect via ssh
ssh <zid>@kdm.restech.unsw.edu.au
2. Create and enter a new project folder
mkdir -p /srv/scratch/seqflow/$USER/myproject && cd $_
git clone https://github.com/WalshKieran/katana-rnaseq-start.git .
3. (Optional) Download your data from Ramaciotti and create a sample sheet:
mkdir -p ./mydata1234 && wget -qO- https://mydata.ramaciotti...MYDATA1234.tar | tar xvz -C ./mydata1234
wget https://raw.githubusercontent.com/nf-core/rnaseq/master/bin/fastq_dir_to_samplesheet.py
python3 fastq_dir_to_samplesheet.py --recursive ./mydata1234 ./samplesheet.csv
4. Launch and monitor
qsub run.pbs
qstat -u $USER
tail .nextflow.log
5. (Optional) Stop your job
qsig <ID returned from qsub>
SEQ-flow CLI Optimisation
Below is an illustration of how to optimise resource requests for nf-core/rnaseq when you have no previous similar runs to draw on (e.g. runs with similar or greater read depth). This is not a substitute for reading the nf-optimizer documentation and its stated drawbacks carefully.
1. Limit the samples in your samplesheet, or by other means
head -n 5 samplesheet.csv > samplesheet_4.csv
2. Run Nextflow on the limited samples
export NXF_ENABLE_CACHE_INVALIDATION_ON_TASK_DIRECTIVE_CHANGE=false
nextflow run ... --input samplesheet_4.csv
3. Generate resources.config (limited to ~120GB, 12 hours)
nf-optimizer -m 500 120000 -t 300 43200 -o resources.config .
4. Run Nextflow on all samples
export NXF_ENABLE_CACHE_INVALIDATION_ON_TASK_DIRECTIVE_CHANGE=false
nextflow run ... --input samplesheet.csv -c resources.config -resume
nf-core RNA-seq
See below for a video demonstrating how to run the nf-core RNA-seq pipeline on SEQ-flow, and a chart explaining the results folder.