nCoV-2019 novel coronavirus bioinformatics protocol

Nanopore | bioinformatics

Document: ARTIC-nCoV-bioinformaticsSOP-v1.1.0
Creation Date: 2020-01-23
Authors: Nick Loman, Will Rowe, Andrew Rambaut
Licence: Creative Commons Attribution 4.0 International License
Overview: A complete bioinformatics protocol to take the output from the sequencing protocol to consensus genome sequences. Includes basecalling, de-multiplexing, mapping, polishing and consensus generation.

Preparation

Set up the computing environment as described in this document: ncov2019-it-setup. This should be done and tested prior to sequencing, particularly if the analysis will be performed in an environment where internet access is unavailable, slow or unreliable. Once this is done, the bioinformatics can be performed largely off-line. If you are already using the lab-on-an-SSD, you can skip this step.

Updating the environment

First time only:

git clone https://github.com/artic-network/artic-ncov2019.git
cd artic-ncov2019
conda env remove -n artic-ncov2019
conda env create -f environment.yml

Make a new directory for analysis

Give your analysis directory a meaningful name, e.g. analysis/run_name

mkdir analysis
cd analysis

mkdir run_name
cd run_name

Activate the ARTIC environment:

All steps in this tutorial should be performed in the artic-ncov2019 conda environment:

source activate artic-ncov2019

Basecalling with Guppy

If you did basecalling with MinKNOW, you can skip this step and go to Demultiplexing.

Run the Guppy basecaller on the new MinION run folder:

For fast mode basecalling:

guppy_basecaller -c dna_r9.4.1_450bps_fast.cfg -i /path/to/reads -s run_name -x auto -r

For high-accuracy mode basecalling:

guppy_basecaller -c dna_r9.4.1_450bps_hac.cfg -i /path/to/reads -s run_name -x auto -r

Substitute /path/to/reads with the path to the folder containing the FAST5 files from your run. Common locations are:

  • Mac: /Library/MinKNOW/data/run_name
  • Linux: /var/lib/MinKNOW/data/run_name
  • Windows: c:/data/reads

This will create a folder called run_name with the base-called reads in it.
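As an optional sanity check (not part of the protocol itself), you can count the basecalled reads directly from the FASTQ output, since each FASTQ record is four lines:

```shell
# Optional sanity check: count basecalled reads in the output folder.
# Each FASTQ record is 4 lines, so reads = total lines / 4.
cat run_name/*.fastq 2>/dev/null | awk 'END { print NR / 4 }'
```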

Demultiplexing

For the current version of the ARTIC protocol it is essential to demultiplex using strict parameters to ensure barcodes are present at each end of the fragment.

Starting with this version of the protocol, we recommend demultiplexing with Guppy:

Guppy is not included in the computing environment and can be downloaded from the Nanopore community website (https://community.nanoporetech.com).

guppy_barcoder --require_barcodes_both_ends -i run_name -s output_directory --arrangements_files "barcode_arrs_nb12.cfg barcode_arrs_nb24.cfg"
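To see how reads were distributed across barcodes, the following sketch (ours, not part of the ARTIC tools; it assumes Guppy's usual output_directory/barcodeNN layout) prints a read count per barcode directory:

```shell
# Optional check: reads per barcode directory (4 FASTQ lines per read).
for dir in output_directory/barcode*/; do
    n=$(cat "$dir"/*.fastq 2>/dev/null | awk 'END { print NR / 4 }')
    echo "$dir $n reads"
done
```

Barcodes with very few reads are unlikely to yield a complete consensus.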

Read filtering

Because the ARTIC protocol can generate chimeric reads, we perform length filtering.

This step is performed for each barcode in the run.

We first collect all the FASTQ files (typically stored in files each containing 4000 reads) into a single file.

To collect and filter the reads for barcode03, we would run:

artic guppyplex --min-length 400 --max-length 700 --directory output_directory/barcode03 --prefix run_name

By default this will also perform a quality check. If you are only using “pass” reads, you can speed up the process with:

artic guppyplex --skip-quality-check --min-length 400 --max-length 700 --directory output_directory/barcode03 --prefix run_name

We use a length filter here of between 400 and 700 to remove obviously chimeric reads.

You may need to change these numbers if you are using a primer scheme with different amplicon lengths. Use the length of the shortest amplicon as the minimum, and the length of the longest amplicon plus 200 as the maximum.

For example, if your amplicons are 300 base pairs, use --min-length 300 --max-length 500
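The rule above can be expressed as simple shell arithmetic; the amplicon sizes here are example values to be replaced with those of your own scheme:

```shell
# Derive the guppyplex length filter from your amplicon sizes:
# min = shortest amplicon, max = longest amplicon + 200.
AMPLICON_MIN=300   # shortest amplicon in your scheme (example value)
AMPLICON_MAX=300   # longest amplicon in your scheme (example value)
echo "--min-length $AMPLICON_MIN --max-length $((AMPLICON_MAX + 200))"
# prints: --min-length 300 --max-length 500
```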

You will now have a file called run_name_barcode03.fastq
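If you have many barcodes, the filtering can be scripted. The sketch below (our wrapper, not an official ARTIC helper) prints one guppyplex command per demultiplexed barcode directory; review the list, then pipe it to sh to execute:

```shell
# Print one filtering command per barcode directory; pipe to `sh` to run.
# Assumes the same 400-700 bp window as above; adjust for your scheme.
for dir in output_directory/barcode*/; do
    echo "artic guppyplex --min-length 400 --max-length 700 --directory $dir --prefix run_name"
done
```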

Run the MinION pipeline

For each barcode you wish to process, run the following command (e.g. 12 times for 12 barcodes), replacing the file name and sample name as appropriate:

E.g. for barcode03

artic minion --normalise 200 --threads 4 --scheme-directory ~/artic-ncov2019/primer_schemes --read-file run_name_barcode03.fastq --fast5-directory path_to_fast5 --sequencing-summary path_to_sequencing_summary.txt nCoV-2019/V3 samplename

Replace samplename as appropriate.
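The per-barcode runs can likewise be scripted. The sketch below (our wrapper, not an official ARTIC helper) prints one artic minion command per barcode from 01 to 12, using the barcode number in the sample name; review the list, then pipe it to sh to execute:

```shell
# Print one pipeline invocation per barcode; adjust the range to your run.
for i in $(seq -w 1 12); do
    sample="run_name_barcode$i"
    echo "artic minion --normalise 200 --threads 4" \
         "--scheme-directory ~/artic-ncov2019/primer_schemes" \
         "--read-file $sample.fastq" \
         "--fast5-directory path_to_fast5" \
         "--sequencing-summary path_to_sequencing_summary.txt" \
         "nCoV-2019/V3 $sample"
done
```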

Output files

  • samplename.rg.primertrimmed.bam - BAM file for visualisation after primer-binding site trimming
  • samplename.trimmed.bam - BAM file with the primers left on (used in variant calling)
  • samplename.merged.vcf - all detected variants in VCF format
  • samplename.pass.vcf - detected variants in VCF format passing quality filter
  • samplename.fail.vcf - detected variants in VCF format failing quality filter
  • samplename.primers.vcf - detected variants falling in primer-binding regions
  • samplename.variants.tab - detected variants in tabular format
  • samplename.consensus.fasta - consensus sequence

To put all the consensus sequences in one file called my_consensus_genomes.fasta, run:

cat *.consensus.fasta > my_consensus_genomes.fasta

To visualise genomes in Tablet

Open a new Terminal window:

conda activate tablet
tablet

Go to “Open Assembly”

Load the BAM (Binary Alignment/Map) file as the first file.

Load the reference file (in artic/artic-ncov2019/primer_schemes/nCoV-2019/V1/nCoV-2019.reference.fasta) as the second file.

Select Variants mode in Color Schemes for ease of viewing variants.

Experimental Medaka pipeline

An alternative to nanopolish for calling variants is medaka. Medaka is faster than nanopolish and seems to perform mostly equivalently in (currently limited) testing.

If you want to use Medaka, you can skip the nanopolish index step, and add the parameter --medaka to the command, as below:

artic minion --medaka --normalise 200 --threads 4 --scheme-directory ~/artic-ncov2019/primer_schemes --read-file run_name_barcode01.fastq nCoV-2019/V1 samplename

Replace samplename as appropriate.

E.g. for barcode02

artic minion --medaka --normalise 200 --threads 4 --scheme-directory ~/artic-ncov2019/primer_schemes --read-file run_name_barcode02.fastq nCoV-2019/V1 samplename