Running ANIm analysis

pyani implements average nucleotide identity analysis using MUMmer3 (ANIm) as defined in Richter & Rosselló-Móra (2009) (doi:10.1073/pnas.0906412106). To run ANIm on a set of input genomes, use the pyani anim subcommand.

In brief, the analysis proceeds as follows for a set of input prokaryotic genomes:

MUMmer3 is used to perform pairwise comparisons between each possible pair of input genomes, to identify homologous (alignable) regions.
For each comparison, the alignment output is parsed, and the following values are calculated:
- total number of aligned bases on each genome
- fraction of each genome that is aligned (the coverage)
- the proportion of all aligned regions that is identical in each genome (the ANI)
- the number of unaligned or non-identical bases (the similarity errors)
- the product of coverage and ANI

The output values are recorded in the pyani database.
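The per-comparison values listed above can be illustrated with a minimal sketch. This is not pyani's own implementation: the names `ComparisonSummary` and `summarise_comparison`, the argument names, and the example numbers are hypothetical. It also assumes that ANI is taken as the fraction of aligned bases that are not counted as similarity errors.

```python
from dataclasses import dataclass


@dataclass
class ComparisonSummary:
    """Summary values for one genome in one pairwise comparison (hypothetical type)."""

    aligned_bases: int
    coverage: float
    identity: float
    sim_errors: int
    product: float  # coverage multiplied by identity


def summarise_comparison(
    aligned_bases: int, sim_errors: int, genome_length: int
) -> ComparisonSummary:
    """Derive coverage, identity and their product from raw alignment counts.

    Assumption: identity = 1 - (similarity errors / aligned bases).
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    # Coverage: fraction of this genome that falls in aligned regions
    coverage = aligned_bases / genome_length
    # Identity: fraction of aligned bases that are not similarity errors
    identity = 1 - sim_errors / aligned_bases if aligned_bases else 0.0
    return ComparisonSummary(
        aligned_bases=aligned_bases,
        coverage=coverage,
        identity=identity,
        sim_errors=sim_errors,
        product=coverage * identity,
    )


# Illustrative numbers only: 4 Mbp aligned out of a 5 Mbp genome
summary = summarise_comparison(
    aligned_bases=4_000_000, sim_errors=80_000, genome_length=5_000_000
)
print(summary)
```

Because coverage is a fraction of each genome's own length, a function like this would be called once for the query and once for the subject of every comparison.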

Note

A single MUMmer comparison is performed between each pair of genomes. Input genomes are sorted into alphabetical order by filename. Within each pair, the query sequence is the genome that occurs earlier in the sorted list, and the subject sequence is the genome that occurs later.
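The ordering rule in this note can be sketched in a few lines. The function name `ordered_pairs` and the filenames are hypothetical, and Python's default string sort (case-sensitive, by code point) is an assumption that may not match pyani's ordering in every case.

```python
from itertools import combinations


def ordered_pairs(filenames):
    """Return one (query, subject) tuple per unordered pair of genomes.

    The query is always the filename that sorts earlier.
    """
    ordered = sorted(filenames)
    # combinations() preserves input order, so the first element of
    # each tuple always sorts before the second
    return list(combinations(ordered, 2))


pairs = ordered_pairs(["genome_C.fna", "genome_A.fna", "genome_B.fna"])
for query, subject in pairs:
    print(f"query={query} subject={subject}")
```

For n input genomes this gives n(n-1)/2 comparisons, so three genomes need three comparisons and ten genomes need 45.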

Tip

The MUMmer comparisons are embarrassingly parallel, and can be distributed across cores on an Open Grid Scheduler-compatible cluster, using the --scheduler SGE option.

Attention

pyani anim requires that a working copy of MUMmer3 is available. Please see the Installation Guide for information about installing this package.

For more information about the pyani anim subcommand, please see the pyani anim page, or issue the command pyani anim -h to see the inline help.

Perform ANIm analysis

The basic form of the command is:

pyani anim -i <INPUT_DIRECTORY> -o <OUTPUT_DIRECTORY>

This instructs pyani to perform ANIm on the genome FASTA files in <INPUT_DIRECTORY>, which is passed to the -i argument, and write any output files to <OUTPUT_DIRECTORY>, which is passed to the -o argument. For example, the following command performs ANIm on genomes in the directory genomes and writes output to a new directory genomes_ANIm:

pyani anim -i genomes -o genomes_ANIm

Note

While running, pyani anim will show progress bars unless these are disabled with the option --disable_tqdm.

This command will write the intermediate nucmer/MUMmer output to the directory genomes_ANIm, in a subdirectory called nucmer_output, where the results can be inspected if required.

$ ls genomes_ANIm/
nucmer_output
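If the analysis needs to be driven from a script rather than typed at the shell, the same command can be wrapped in Python. This is a minimal sketch, assuming the pyani executable is on the PATH. The helper names `build_anim_command` and `run_anim` are hypothetical and are not part of pyani.

```python
import shutil
import subprocess
from pathlib import Path


def build_anim_command(indir, outdir):
    """Build the argument list for the basic `pyani anim` command."""
    return ["pyani", "anim", "-i", str(indir), "-o", str(outdir)]


def run_anim(indir, outdir):
    """Run `pyani anim` and return the names found in nucmer_output."""
    if shutil.which("pyani") is None:
        raise RuntimeError("pyani executable not found on PATH")
    # check=True raises CalledProcessError if pyani exits with an error
    subprocess.run(build_anim_command(indir, outdir), check=True)
    nucmer_dir = Path(outdir) / "nucmer_output"
    return sorted(p.name for p in nucmer_dir.iterdir())


print(build_anim_command("genomes", "genomes_ANIm"))
```

Passing the arguments as a list avoids shell quoting problems when directory names contain spaces.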

Attention

To view the output ANIm results, you will need to use the pyani report or pyani plot subcommands. Please see pyani report and pyani plot for more details.

Perform ANIm analysis with Open Grid Scheduler

The MUMmer comparison step of ANIm is embarrassingly parallel, and nucmer jobs can be distributed across cores in a cluster using the Open Grid Scheduler. To enable this during the analysis, use the --scheduler SGE option:

pyani anim --scheduler SGE -i genomes -o genomes_ANIm

Note

Jobs are submitted as array jobs to keep the scheduler queue short.

Note

If --scheduler SGE is not specified, all MUMmer jobs are run locally with Python’s multiprocessing module.

Controlling parameters of Open Grid Scheduler

It is possible to control the following features of Open Grid Scheduler via the pyani anim subcommand:

The array job size (by default, comparison jobs are batched in arrays of 10,000)
The prefix string for the job, as reported in the scheduler queue
Arguments to the qsub job submission command

These allow for useful control of job execution. For example, the command:

pyani anim --scheduler SGE --SGEgroupsize 5000 -i genomes -o genomes_ANIm

will batch MUMmer jobs in groups of 5000 for the scheduler. The command:

pyani anim --scheduler SGE --jobprefix My_Ace_Job -i genomes -o genomes_ANIm

will prepend the string My_Ace_Job to your job in the scheduler queue. And the command:

pyani anim --scheduler SGE --SGEargs "-m e -M my.name@my.domain" -i genomes -o genomes_ANIm

will email my.name@my.domain when the jobs finish.
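To get a feel for how the array job size affects the scheduler queue, the number of array jobs can be estimated from the number of input genomes. This is a rough sketch, not pyani's own batching code: the function name `count_array_jobs` is hypothetical, and it assumes one nucmer job per pairwise comparison, with each array filled to the group size before the next is started.

```python
from math import ceil


def count_array_jobs(n_genomes: int, group_size: int = 10_000) -> int:
    """Estimate how many array jobs are needed for all pairwise comparisons."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    # One comparison per unordered pair of genomes
    n_comparisons = n_genomes * (n_genomes - 1) // 2
    return ceil(n_comparisons / group_size)


# 500 genomes give 124,750 pairwise comparisons
print(count_array_jobs(500))                   # default group size of 10,000
print(count_array_jobs(500, group_size=5000))  # as with --SGEgroupsize 5000
```

Under these assumptions, 500 genomes would need 13 array jobs at the default size and 25 with a group size of 5000.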

References

Richter & Rosselló-Móra (2009) Proc Natl Acad Sci USA 106: 19126-19131 doi:10.1073/pnas.0906412106.