Running ANIm analysis
pyani implements average nucleotide identity analysis using MUMmer3 (ANIm) as defined in Richter & Rosselló-Móra (2009) (doi:10.1073/pnas.0906412106). To run ANIm on a set of input genomes, use the
pyani anim subcommand.
In brief, the analysis proceeds as follows for a set of input prokaryotic genomes:
MUMmer3 is used to perform pairwise comparisons between each possible pair of input genomes, to identify homologous (alignable) regions.
For each comparison, the alignment output is parsed, and the following values are calculated:
- total number of aligned bases on each genome
- fraction of each genome that is aligned (the coverage)
- the proportion of all aligned regions that is identical in each genome (the ANI)
- the number of unaligned or non-identical bases (the similarity errors)
- the product of coverage and ANI
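The values listed above can be sketched in a few lines of Python. This is an illustrative sketch only, not pyani's own implementation: the `AlignedRegion` record and the `summarise` function are hypothetical names, and the identity formula (one minus similarity errors divided by aligned bases) is an assumption about how the listed quantities relate to each other.

```python
from dataclasses import dataclass


@dataclass
class AlignedRegion:
    """One aligned region from a pairwise comparison (hypothetical record)."""

    query_bases: int        # bases of the query genome in this region
    subject_bases: int      # bases of the subject genome in this region
    similarity_errors: int  # unaligned or non-identical bases in this region


def summarise(regions, query_length, subject_length):
    """Return per-genome summary values for one pairwise comparison."""
    query_aligned = sum(r.query_bases for r in regions)
    subject_aligned = sum(r.subject_bases for r in regions)
    errors = sum(r.similarity_errors for r in regions)

    # Coverage: fraction of each genome that is aligned.
    query_coverage = query_aligned / query_length
    subject_coverage = subject_aligned / subject_length

    # Assumption: identity is 1 - (similarity errors / aligned bases).
    query_identity = 1 - errors / query_aligned if query_aligned else 0.0
    subject_identity = 1 - errors / subject_aligned if subject_aligned else 0.0

    return {
        "query_aligned": query_aligned,
        "subject_aligned": subject_aligned,
        "similarity_errors": errors,
        "query_coverage": query_coverage,
        "subject_coverage": subject_coverage,
        "query_identity": query_identity,
        "subject_identity": subject_identity,
        # Product of coverage and identity, reported per genome.
        "query_product": query_coverage * query_identity,
        "subject_product": subject_coverage * subject_identity,
    }


# Made-up numbers, purely for illustration.
example = summarise(
    [AlignedRegion(1000, 990, 10), AlignedRegion(500, 510, 5)],
    query_length=3000,
    subject_length=2000,
)
```

Coverage and the coverage-identity product differ between the two genomes whenever their lengths differ, which is why the sketch reports them per genome.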
The output values are recorded in the
A MUMmer comparison is performed between each pair of genomes. Input genomes are sorted into alphabetical order by filename, and the query sequence is the genome that occurs earliest in the list; the subject sequence is the genome that occurs latest in the list.
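The ordering rule can be illustrated with a short Python sketch. The function name `comparison_pairs` and the example filenames are hypothetical, and Python's default string sort (case-sensitive, by code point) is assumed here as a stand-in for "alphabetical order"; pyani's own sorting may differ in detail.

```python
from itertools import combinations


def comparison_pairs(filenames):
    """Return (query, subject) pairs; the query always sorts before the subject."""
    ordered = sorted(filenames)
    # combinations() preserves input order, so the first element of each
    # pair is the filename that occurs earlier in the sorted list.
    return list(combinations(ordered, 2))


pairs = comparison_pairs(["genome_C.fna", "genome_A.fna", "genome_B.fna"])
for query, subject in pairs:
    print(f"query={query} subject={subject}")
```

With n input genomes this yields n(n-1)/2 comparisons, one per unordered pair.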
MUMmer comparisons are embarrassingly parallel, and can be distributed across cores on an Open Grid Scheduler-compatible cluster, using the
--scheduler SGE option.
pyani anim requires that a working copy of MUMmer3 is available. Please see Installation Guide for information about installing this package.
For more information about the
pyani anim subcommand, please see the pyani anim page, or issue the command
pyani anim -h to see the inline help.
Perform ANIm analysis
The basic form of the command is:
pyani anim -i <INPUT_DIRECTORY> -o <OUTPUT_DIRECTORY>
This instructs pyani to perform ANIm on the genome FASTA files in
<INPUT_DIRECTORY>, which is passed to the
-i argument, and write any output files to
<OUTPUT_DIRECTORY>, which is passed to the
-o argument. For example, the following command performs ANIm on genomes in the directory
genomes and writes output to a new directory, genomes_ANIm:
pyani anim -i genomes -o genomes_ANIm
pyani anim will show progress bars unless these are disabled with the option
This command will write the intermediate
MUMmer output to the directory
genomes_ANIm, in a subdirectory called
nucmer_output, where the results can be inspected if required.
$ ls genomes_ANIm/
Perform ANIm analysis with Open Grid Scheduler
The MUMmer comparison step of ANIm is embarrassingly parallel, and
nucmer jobs can be distributed across cores in a cluster using the Open Grid Scheduler. To enable this during the analysis, use the
--scheduler SGE option:
pyani anim --scheduler SGE -i genomes -o genomes_ANIm
Jobs are submitted as array jobs to keep the scheduler queue short.
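As a rough illustration of why batching keeps the queue short, the following Python sketch splits a list of comparison jobs into fixed-size groups, one group per array job. The function `batch_jobs` is hypothetical and is not pyani's scheduler code; the default of 10,000 mirrors the documented default array size.

```python
def batch_jobs(jobs, group_size=10000):
    """Split a list of jobs into consecutive groups of at most group_size."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return [jobs[i:i + group_size] for i in range(0, len(jobs), group_size)]


# 25,000 placeholder jobs become three array jobs instead of 25,000 queue entries.
placeholder_jobs = [f"comparison_{n}" for n in range(25000)]
batches = batch_jobs(placeholder_jobs)
print([len(batch) for batch in batches])
```

The final group holds whatever remains after the full-size groups are filled.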
If --scheduler SGE is not specified, all
MUMmer jobs are run locally with
Controlling parameters of Open Grid Scheduler
It is possible to control the following features of Open Grid Scheduler via the
pyani anim subcommand:
- The array job size (by default, comparison jobs are batched in arrays of 10,000)
- The prefix string for the job, as reported in the scheduler queue
- Arguments to the
qsub job submission command
These allow for useful control of job execution. For example, the command:
pyani anim --scheduler SGE --SGEgroupsize 5000 -i genomes -o genomes_ANIm
will batch MUMmer jobs in groups of 5000 for the scheduler. The command:
pyani anim --scheduler SGE --jobprefix My_Ace_Job -i genomes -o genomes_ANIm
will prepend the string
My_Ace_Job to your job in the scheduler queue. And the command:
pyani anim --scheduler SGE --SGEargs "-m e -M email@example.com" -i genomes -o genomes_ANIm
will email email@example.com when the jobs finish.
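When pyani anim is launched from a script, the options described on this page can be assembled programmatically. The sketch below uses a hypothetical helper, `build_anim_command`, which only builds the argument list; it uses just the flags shown above, and it makes no assumption about other options that pyani anim may accept.

```python
import shlex


def build_anim_command(indir, outdir, scheduler=None, group_size=None,
                       job_prefix=None, sge_args=None):
    """Build an argument list for pyani anim from the options shown above."""
    cmd = ["pyani", "anim"]
    if scheduler is not None:
        cmd += ["--scheduler", scheduler]
    if group_size is not None:
        cmd += ["--SGEgroupsize", str(group_size)]
    if job_prefix is not None:
        cmd += ["--jobprefix", job_prefix]
    if sge_args is not None:
        # Passed as a single argument so qsub options stay together.
        cmd += ["--SGEargs", sge_args]
    cmd += ["-i", indir, "-o", outdir]
    return cmd


command = build_anim_command("genomes", "genomes_ANIm",
                             scheduler="SGE", group_size=5000)
# shlex.join (Python 3.8+) renders the list as a copy-pasteable shell line.
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where pyani and MUMmer3 are installed.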