MPXV Nanopore sequencing bioinformatics protocol
Nanopore | bioinformatics
Document: | ARTIC-MPXV-bioinformatics-SOP-v1.0.0 |
Creation Date: | 2024-08-20 |
Author: | Sam Wilkinson |
Licence: | Creative Commons Attribution 4.0 International License |
Preparation
This SOP is purely for command-line usage. If you are uncomfortable using the command line for bioinformatics data analysis, do not worry! We have an SOP for performing the exact same steps using the EPI2ME front end, available here: epi2me MPXV bioinformatics.
If you are using a Windows PC, we recommend that you do any bioinformatics in a WSL2 environment. A tutorial for installing WSL2 is available here: Microsoft WSL Install Tutorial. WSL2 must be installed before moving to the next steps.
First, you will need Conda >= v23.10.0 (or a lower version with mamba installed); this will handle installation of the software that the pipeline needs to run. You can find the appropriate installer for miniforge (a conda distribution) by going to this link: miniforge downloads. If you are using WSL2, you should download the Linux version of the installer.
Once you have downloaded the installer you can install it by opening a terminal session, moving to the path where the file was downloaded, then running it, like this:
bash Miniforge3-Linux-x86_64.sh
You will then be asked to confirm the install location of miniforge. It defaults to the current user's home directory (~/miniforge3/), which is fine for the majority of use cases. Afterwards, you will be asked to read and agree to the end-user license agreement; once you have done so, enter yes to agree.
Miniforge will then install into the previously specified directory, after which you will be asked if you would like to automatically initialise Conda when starting a new terminal session. Most users will want to select yes here.
Once this completes you should have a working Conda install and can move on to the next step.
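Before continuing, it can help to confirm that the installed Conda meets the >= v23.10.0 requirement stated above. The following is a minimal sketch, not part of the official pipeline: the helper name version_ge is hypothetical, and it assumes GNU sort (for the -V version sort), which is available on Linux and WSL2.

```shell
# Return success if version $1 is greater than or equal to version $2.
# Relies on GNU sort's -V (version sort) option.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required="23.10.0"
if command -v conda >/dev/null 2>&1; then
    # `conda --version` prints e.g. "conda 24.3.0"; keep the second field.
    installed="$(conda --version | awk '{print $2}')"
    if version_ge "$installed" "$required"; then
        echo "conda $installed is new enough"
    else
        echo "conda $installed is older than $required; update conda or install mamba"
    fi
else
    echo "conda not found on PATH; open a new terminal session and try again"
fi
```

If conda is not found straight after installation, opening a new terminal session is usually enough for the initialisation step to take effect.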
Creating the environment
First time only: clone the fieldbioinformatics GitHub repository, check out the 1.4.0-dev branch, and create the conda environment from the repository's environment.yml. Once 1.4.0 has been properly released, this section will be updated to reflect this:
git clone https://github.com/artic-network/fieldbioinformatics.git
cd fieldbioinformatics
git checkout 1.4.0-dev
conda env create -f environment.yml
This creates a conda environment named artic containing all the software requirements for the fieldbioinformatics pipeline. You can activate this environment like so:
conda activate artic
Finally, to install the pipeline itself, we use pip install while inside the fieldbioinformatics directory:
pip install .
Make a new directory for analysis
Give your analysis directory a meaningful name, e.g. analysis/run_name:
mkdir analysis
cd analysis
mkdir run_name
cd run_name
Activate the ARTIC environment:
All steps in this tutorial should be performed in the artic conda environment (if you have already activated the environment in the previous steps, you do not need to do so again):
conda activate artic
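If you are unsure whether the environment is active, a quick check like the one below can tell you. This is a small sketch: the helper name in_artic_env is hypothetical, and it relies on the CONDA_DEFAULT_ENV variable that conda sets when an environment is activated.

```shell
# Succeeds only when the active conda environment is called "artic".
in_artic_env() {
    [ "${CONDA_DEFAULT_ENV:-}" = "artic" ]
}

if in_artic_env; then
    echo "artic environment is active"
else
    echo "artic environment is not active; run: conda activate artic"
fi
```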
Read filtering
Because the ARTIC protocol can generate chimeric reads, we perform length filtering.
This step is performed for each barcode in the run.
We first collect all the FASTQ files into a single file.
To collect and filter the reads for barcode03, we would run:
artic guppyplex --min-length 1500 --max-length 3000 --directory output_directory/barcode03 --prefix run_name
This will perform a quality check. If you are only using “pass” reads you can speed up the process with:
artic guppyplex --skip-quality-check --min-length 1500 --max-length 3000 --directory output_directory/barcode03 --prefix run_name
We use a length filter here of between 1500 and 3000 to remove obviously chimeric reads.
You may need to change these numbers if you are using primer schemes with different amplicon lengths. Try the minimum amplicon length as the minimum, and the maximum amplicon length plus 200 as the maximum. For example, if your amplicons are 300 base pairs, use --min-length 300 --max-length 500
The rapid barcoding reaction also randomly fragments reads, so your minimum read threshold should be adjusted accordingly. We have had good results with a limit of around a quarter of the amplicon length, e.g. for a 2000 bp scheme, set a minimum length threshold of 500 bp.
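The threshold arithmetic described here can be worked through in the shell. This is an illustrative sketch only: the function names length_flags and rapid_min_length are hypothetical helpers, not part of the artic command line.

```shell
# Print guppyplex length flags for a given scheme:
# minimum = shortest amplicon, maximum = longest amplicon + 200.
length_flags() {
    min_amplicon="$1"
    max_amplicon="$2"
    echo "--min-length $min_amplicon --max-length $((max_amplicon + 200))"
}

# Print a minimum read length for rapid barcoding libraries:
# roughly a quarter of the amplicon length.
rapid_min_length() {
    echo "$(( $1 / 4 ))"
}

length_flags 300 300       # prints: --min-length 300 --max-length 500
rapid_min_length 2000      # prints: 500
```

These are starting points rather than fixed rules, so it is worth checking the read length distribution of your own run before settling on values.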
You will now have a file called: run_name_barcode03.fastq
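Because the filtering step is repeated for every barcode, a loop can save typing. The sketch below assumes that the barcode folders sit directly under output_directory and are named barcodeNN; the helpers run and guppyplex_all and the DRY_RUN variable are hypothetical conveniences. By default it only prints the commands so that you can review them first.

```shell
# Print the command when DRY_RUN=1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Run artic guppyplex once per barcode directory under $1, using prefix $2.
guppyplex_all() {
    reads_dir="$1"
    prefix="$2"
    for dir in "$reads_dir"/barcode*/; do
        [ -d "$dir" ] || continue      # skip if the glob matched nothing
        dir="${dir%/}"                 # drop the trailing slash
        run artic guppyplex --min-length 1500 --max-length 3000 \
            --directory "$dir" --prefix "$prefix"
    done
}

guppyplex_all output_directory run_name
```

Once the printed commands look right, set DRY_RUN=0 in the same shell session and call guppyplex_all again to execute them.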
Run the MinION pipeline
Fieldbioinformatics uses the clair3 variant caller. Previously both medaka and clair3 were available, but problems with medaka forced our adoption of clair3 as the only workflow. Clair3 requires the selection of an appropriate model based upon the flowcell chemistry, sequencing speed, basecaller preset, and version. The pipeline will try to select an appropriate model based upon the basecall_model_version_id flag in the read file header (the sequencing instrument adds this by default). If this is not present, or the pipeline cannot decide on an appropriate model, you should provide one using the --model parameter.
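To see whether your reads carry the basecall_model_version_id tag, you can inspect the first header line of the filtered FASTQ file. This is a minimal sketch that assumes an uncompressed FASTQ file with key=value tags in the header; the helper name basecall_model is hypothetical.

```shell
# Print the basecall model recorded in the first header line of a FASTQ file,
# or print nothing if the tag is absent.
basecall_model() {
    head -n 1 "$1" | tr '\t' ' ' \
        | grep -o 'basecall_model_version_id=[^ ]*' | cut -d= -f2
}

reads="run_name_barcode03.fastq"
if [ -f "$reads" ]; then
    model="$(basecall_model "$reads")"
    if [ -n "$model" ]; then
        echo "basecall model: $model"
    else
        echo "no basecall_model_version_id tag found; provide one with --model"
    fi
fi
```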
If you install the pipeline via conda, only r9.4.1 models will be available by default. The pipeline can fetch the pre-trained r10.4.1 models from the ONT Rerio repository if you run the following command:
artic_get_models
By default, models are stored in the user's conda environment at $CONDA_PREFIX/bin/models. This may be changed to another location if desired by using the --model-dir argument in the artic_get_models and artic minion commands.
The following command will automatically pull primer schemes from the PrimalScheme primerschemes repository based on the --scheme-name, --scheme-version, and --scheme-length arguments. The scheme length argument is optional in most cases, since the vast majority of primer schemes are only available in a single amplicon length. If the scheme you specify is available in multiple lengths, you will be prompted to specify which length should be downloaded.
Run this command once for each barcode you wish to process (e.g. 12 times for 12 barcodes), replacing the file name and sample name as appropriate.
E.g. for barcode03:
artic minion --normalise 200 --threads 4 --scheme-directory ~/primer_schemes --scheme-name artic-inrb-mpox --scheme-length 2500 --scheme-version v1.0.0-cladeib --read-file run_name_barcode03.fastq samplename
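Running the command once per barcode can be scripted with a loop over the filtered FASTQ files. The sketch below reuses the arguments from the barcode03 example and makes two assumptions: the read files follow the run_name_barcodeNN.fastq pattern produced by the filtering step, and the file name without its extension is an acceptable sample name. The helpers run and minion_all and the DRY_RUN variable are hypothetical; by default the commands are only printed.

```shell
# Print the command when DRY_RUN=1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Run artic minion once per filtered read file with prefix $1
# in the current directory.
minion_all() {
    prefix="$1"
    for reads in "$prefix"_barcode*.fastq; do
        [ -f "$reads" ] || continue            # skip if the glob matched nothing
        sample="$(basename "$reads" .fastq)"   # e.g. run_name_barcode03
        run artic minion --normalise 200 --threads 4 \
            --scheme-directory "$HOME/primer_schemes" \
            --scheme-name artic-inrb-mpox --scheme-length 2500 \
            --scheme-version v1.0.0-cladeib \
            --read-file "$reads" "$sample"
    done
}

minion_all run_name
```

Set DRY_RUN=0 in the same shell session and call minion_all again to execute the commands once you are happy with what is printed.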
Custom primer schemes
If you wish to use a custom primer scheme that is not available in the PrimalScheme repository, you may instead provide the scheme bedfile and reference fasta directly using the --bed and --ref arguments, for example:
artic minion --normalise 200 --threads 4 --bed ~/primer_schemes/some-scheme/some_virus.scheme.bed --ref ~/primer_schemes/some-scheme/some_virus.reference.fasta --read-file run_name_barcode03.fastq samplename
Output files
samplename.rg.primertrimmed.bam - BAM file for visualisation after primer-binding site trimming
samplename.trimmed.bam - BAM file with the primers left on (used in variant calling)
samplename.normalised.vcf.gz - all detected variants in VCF format
samplename.pass.vcf - detected variants in VCF format passing quality filter
samplename.fail.vcf - detected variants in VCF format failing quality filter
samplename.primers.vcf - detected variants falling in primer-binding regions
samplename.consensus.fasta - consensus sequence
samplename.amplicon_depths.tsv - a TSV (tab delimited) file containing mean amplicon depths across the genome
To put all the consensus sequences in one file called my_consensus_genomes.fasta, run:
cat *.consensus.fasta > my_consensus_genomes.fasta
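As a quick sanity check, you can count the sequences in the combined file and compare the number with how many barcodes you processed. This is a small sketch that assumes each consensus file holds a single FASTA record; the helper name count_sequences is hypothetical.

```shell
# Count FASTA records (header lines starting with ">") in a file.
count_sequences() {
    grep -c '^>' "$1"
}

combined="my_consensus_genomes.fasta"
if [ -f "$combined" ]; then
    echo "sequences in $combined: $(count_sequences "$combined")"
fi
```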
Software credits
The ARTIC pipeline and fieldbioinformatics software include a number of software packages: