MPXV alignment and phylogenetics pipeline using EPI2ME

Squirrel | bioinformatics

Document: ARTIC-MPXV-phylogeneticsSOP-v1.0
Creation Date: 2024-08-21
Author: Áine O'Toole
Licence: Creative Commons Attribution 4.0 International License
Overview: A complete protocol to take the output consensus genome sequences from the sequencing protocol to a robust, interpretable phylogenetic tree with APOBEC3-editing reconstruction. Includes background dataset, alignment, maximum likelihood phylogenetics and ancestral reconstruction with IQ-TREE, figure generation and interpretation.

Rationale

MPXV is a large poxvirus with a complex dsDNA genome ~200 kb in length. Alignment of sequences, and therefore phylogenetics, is challenging with classic alignment approaches because of tracts of low-complexity and repetitive sequence. Squirrel provides an efficient map-to-reference alignment pipeline that masks problematic regions of the genome.

Command line

Squirrel can be used as a command-line tool. Full command-line documentation is available in the squirrel GitHub repository at https://github.com/aineniamh/squirrel.
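
As a rough command-line counterpart to the interface options described in the next section, here is a minimal Python sketch that assembles a squirrel argument list without running it. The flag names (--clade, --seq-qc, --run-phylo, --include-background, --outgroups, --additional-mask) are as we understand them from the squirrel README and should be checked against squirrel --help for your installed version. The helper build_squirrel_command and the file name sequences.fasta are hypothetical, and skipping the outgroups when the background is included simply mirrors the interface behaviour described in this document.

```python
from typing import List, Optional, Sequence


def build_squirrel_command(
    fasta: str,
    clade: str = "cladeii",
    seq_qc: bool = False,
    run_phylo: bool = False,
    include_background: bool = False,
    outgroups: Sequence[str] = (),
    additional_mask: Optional[str] = None,
) -> List[str]:
    """Assemble a squirrel argument list (assumed flag names; nothing is executed)."""
    cmd = ["squirrel", fasta, "--clade", clade]
    if seq_qc:
        cmd.append("--seq-qc")
    if run_phylo:
        cmd.append("--run-phylo")
        if include_background:
            # Mirrors the interface: user-specified outgroups are ignored here.
            cmd.append("--include-background")
        elif outgroups:
            cmd += ["--outgroups", ",".join(outgroups)]
    if additional_mask:
        cmd += ["--additional-mask", additional_mask]
    return cmd


# Hypothetical input file; the outgroup IDs are those recommended for Clade I.
command = build_squirrel_command(
    "sequences.fasta",
    clade="cladei",
    seq_qc=True,
    run_phylo=True,
    outgroups=["KJ642617", "KJ642615", "KJ642616"],
)
print(" ".join(command))
```

The list form can be passed to subprocess.run(command, check=True) once the flags have been confirmed for your installation.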

User interface

Squirrel can also be run through the EPI2ME user interface. First install the EPI2ME desktop application using the provided link. Then go to ‘available workflows’, choose ‘Import workflow’, and import the workflow from https://github.com/artic-network/squirrel-nf as shown below:

[Screenshot: importing the workflow from the GitHub link]

Once the workflow has downloaded successfully, click the X to exit the download window and select the workflow from the list of available workflows. Next select Run this workflow from the available options, and then Run on your computer:

[Screenshot: launching the run]

This will bring up a menu where you can provide the inputs for your analysis. The only required file is a single FASTA file containing all the sequences and outgroups for your analysis. You must also select the clade (i or ii) from the drop-down list:

[Screenshot: FASTA input and clade selection]

Running with just a FASTA file will generate an alignment of the input sequences. We recommend selecting the check box for Seq QC to check this alignment for problematic sites.

[Screenshot: Seq QC check box]

Scrolling down the menu, select the box to Run Phylo. At this point you have two options. EITHER you can select the check box to Include Background, in which case a default panel of clade-specific outgroup sequences will be used.

[Screenshot: Include Background check box]

OR you can specify one or more outgroup IDs. These outgroups must be present in the FASTA file you provided and will be pruned out of the final alignment. For Clade I we recommend outgroups KJ642617,KJ642615,KJ642616 and for Clade IIb we recommend KJ642617,KJ642615. If you also selected the Include Background option, your specified outgroups will be ignored.

[Screenshot: outgroups field]
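
Because specified outgroups must be present in the FASTA file, it can help to check this before launching. The following is a minimal Python sketch, assuming that an outgroup ID must exactly match the first whitespace-delimited token of a FASTA header (the workflow's own matching rules may differ). The helper names fasta_ids and missing_outgroups are hypothetical, and the example records are toy data.

```python
from typing import Iterable, List


def fasta_ids(lines: Iterable[str]) -> List[str]:
    """Return the first whitespace-delimited token of each FASTA header line."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            header = line[1:].strip()
            ids.append(header.split()[0] if header else "")
    return ids


def missing_outgroups(ids: Iterable[str], outgroups: Iterable[str]) -> List[str]:
    """Return the requested outgroup IDs that are absent from the FASTA IDs."""
    present = set(ids)
    return [og for og in outgroups if og not in present]


# Toy records standing in for a real file; the sequences are placeholders.
example = [">sample_1 demo", "ACGT", ">KJ642617", "ACGT"]
print(missing_outgroups(fasta_ids(example), ["KJ642617", "KJ642615"]))
```

To check a real file, pass an open file handle instead of the list, for example fasta_ids(open("sequences.fasta")), where sequences.fasta is a hypothetical file name. An empty result means every requested outgroup was found.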

Optionally, you can provide a different reference sequence, but this is usually unnecessary: a clade-specific reference is used by default. No Advanced Options or Nextflow Configuration settings need to be changed for a standard run.

Click Launch:

[Screenshot: Launch button]

This will start the workflow. A progress bar shows the run status, but you will not be able to see the stdout that would be printed when running on the command line.

[Screenshot: run progress]

Once the run is completed, a number of files will be available and you can double-click to view them:

[Screenshot: completed run and output files]

These include a suggested_mask.csv file, generated by the run, that lists potentially problematic sites. If you start a new run with the same inputs and additionally provide this mask file in the menu, the flagged sites will be masked, which should improve the alignment and phylogeny.
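
Before reusing the mask, you may want to see how many sites it flags. The following is a minimal Python sketch that reports the column names and row count of a CSV file. It deliberately makes no assumption about the column layout of suggested_mask.csv, which is not described here. The helper summarise_mask is hypothetical and the toy content is a placeholder.

```python
import csv
import io
from typing import List, TextIO, Tuple


def summarise_mask(handle: TextIO) -> Tuple[List[str], int]:
    """Return the header names and the number of data rows in a CSV file."""
    reader = csv.DictReader(handle)
    rows = list(reader)  # reading the rows also populates reader.fieldnames
    return list(reader.fieldnames or []), len(rows)


# Toy stand-in for a real file; these column names are placeholders only.
toy = io.StringIO("column_a,column_b\n1,x\n2,y\n")
columns, n_rows = summarise_mask(toy)
print(columns, n_rows)
```

For the real output, open the file with open("suggested_mask.csv", newline="") and pass the handle to summarise_mask.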

This document is part of the MPXV sequencing protocol package:
http://artic.network/mpxv/
Setting up the laptop computing environment using Conda:
http://artic.network/ebov/ebov-it-setup.html
Document developed by ARTIC network
Funded by the Wellcome Trust Collaborators Award 206298/Z/17/Z