pyani.anim module

Code to implement the ANIm average nucleotide identity method.

Calculates ANI by the ANIm method, as described in Richter et al (2009) Proc Natl Acad Sci USA 106: 19126-19131 doi:10.1073/pnas.0906412106.

All input FASTA format files are compared against each other, pairwise, using NUCmer (binary location must be provided). NUCmer output will be stored in a specified output directory.

The NUCmer .delta file output is parsed to obtain an alignment length and similarity error count for every unique region alignment. These are processed to give matrices of aligned sequence lengths, similarity error counts, average nucleotide identity (ANI) percentages, and minimum aligned percentage (of whole genome) for each pairwise comparison.

exception pyani.anim.PyaniANImException

Bases: pyani.PyaniException

ANIm-specific exception for pyani.

pyani.anim.construct_nucmer_cmdline(fname1: pathlib.Path, fname2: pathlib.Path, outdir: pathlib.Path = PosixPath('.'), nucmer_exe: pathlib.Path = PosixPath('nucmer'), filter_exe: pathlib.Path = PosixPath('delta-filter'), maxmatch: bool = False) → Tuple[str, str]

Return a tuple of corresponding NUCmer and delta-filter commands.

Parameters:
  • fname1 – path to query FASTA file
  • fname2 – path to subject FASTA file
  • outdir – path to output directory
  • nucmer_exe – location of the nucmer binary
  • filter_exe – location of the delta-filter binary
  • maxmatch – Boolean flag indicating whether to use NUCmer’s -maxmatch option. If not, the -mum option is used instead

The split into a tuple was made necessary by changes to SGE/OGE. The delta-filter command must now be run as a dependency of the NUCmer command, and be wrapped in a Python script to capture STDOUT.

NOTE: This command-line writes output data to a subdirectory of the passed outdir, called “nucmer_output”.
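
As an illustration of the tuple described above, the following is a minimal sketch and not pyani's own implementation. The function name sketch_nucmer_cmdline, the "A_vs_B" output prefix, and the use of delta-filter's -1 (one-to-one) option are assumptions made for the example; the --mum, --maxmatch and -p options are standard NUCmer options.

```python
from pathlib import Path
from typing import Tuple


def sketch_nucmer_cmdline(
    fname1: Path,
    fname2: Path,
    outdir: Path = Path("."),
    nucmer_exe: Path = Path("nucmer"),
    filter_exe: Path = Path("delta-filter"),
    maxmatch: bool = False,
) -> Tuple[str, str]:
    """Hypothetical stand-in: build NUCmer and delta-filter command strings."""
    # Output is written under a "nucmer_output" subdirectory of outdir.
    # The "<query>_vs_<subject>" naming is an assumption for this sketch.
    prefix = outdir / "nucmer_output" / f"{fname1.stem}_vs_{fname2.stem}"
    # Use --maxmatch when requested, otherwise fall back to --mum
    mode = "--maxmatch" if maxmatch else "--mum"
    nucmer_cmd = f"{nucmer_exe} {mode} -p {prefix.as_posix()} {fname1} {fname2}"
    # delta-filter writes to STDOUT, so the caller must capture its output;
    # the -1 option is an assumed choice of filter
    filter_cmd = f"{filter_exe} -1 {prefix.as_posix()}.delta"
    return nucmer_cmd, filter_cmd
```

The two strings are returned separately so that a scheduler can run the second command only after the first has finished.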

pyani.anim.generate_nucmer_commands(filenames: List[pathlib.Path], outdir: pathlib.Path = PosixPath('.'), nucmer_exe: pathlib.Path = PosixPath('nucmer'), filter_exe: pathlib.Path = PosixPath('delta-filter'), maxmatch: bool = False) → Tuple[List[T], List[T]]

Return a tuple of lists of NUCmer and delta-filter command lines for ANIm.

Parameters:
  • filenames – a list of paths to input FASTA files
  • outdir – path to output directory
  • nucmer_exe – location of the nucmer binary
  • filter_exe – location of the delta-filter binary
  • maxmatch – Boolean flag indicating to use NUCmer’s -maxmatch option

The first element returned is a list of NUCmer commands, and the second a corresponding list of delta_filter_wrapper.py commands. The NUCmer commands should each be run before the corresponding delta-filter command.

TODO: This return value needs to be reworked as a collection.

Loop over all FASTA files generating NUCmer command lines for each pairwise comparison.
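
The pairwise loop can be sketched with itertools.combinations. This is an illustrative example only: the helper name pairwise_inputs is hypothetical, and it assumes each unordered pair of files is compared once.

```python
from itertools import combinations
from pathlib import Path
from typing import List, Tuple


def pairwise_inputs(filenames: List[Path]) -> List[Tuple[Path, Path]]:
    """Hypothetical helper: return every unordered pair of input files."""
    # Sorting first makes the order of the pairs reproducible
    return list(combinations(sorted(filenames), 2))


pairs = pairwise_inputs([Path("c.fna"), Path("a.fna"), Path("b.fna")])
for query, subject in pairs:
    print(query, subject)
```

For n input files this yields n(n-1)/2 comparisons, one command pair per comparison.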

pyani.anim.generate_nucmer_jobs(filenames: List[pathlib.Path], outdir: pathlib.Path = PosixPath('.'), nucmer_exe: pathlib.Path = PosixPath('nucmer'), filter_exe: pathlib.Path = PosixPath('delta-filter'), maxmatch: bool = False, jobprefix: str = 'ANINUCmer')

Return list of Jobs describing NUCmer command-lines for ANIm.

Parameters:
  • filenames – Iterable, Paths to input FASTA files
  • outdir – Path, path to output directory
  • nucmer_exe – Path, location of the nucmer binary
  • filter_exe – Path, location of the delta-filter binary
  • maxmatch – Boolean flag indicating to use NUCmer’s -maxmatch option
  • jobprefix – str, prefix for the names of the generated jobs

Loop over all FASTA files, generating Jobs describing NUCmer command lines for each pairwise comparison.

pyani.anim.get_fasta_files(dirname: pathlib.Path = PosixPath('.')) → Iterable[T_co]

Return iterable of FASTA files in the passed directory.

Parameters: dirname – Path, path to input directory

pyani.anim.get_version(nucmer_exe: pathlib.Path = PosixPath('nucmer')) → str

Return NUCmer package version as a string.

Parameters: nucmer_exe – path to NUCmer executable

We expect NUCmer to return a string on STDERR of the form:

$ nucmer
NUCmer (NUCleotide MUMmer) version 3.1

We concatenate this with the OS name.
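
A minimal sketch of this step follows, assuming the banner has already been captured from STDERR. The function name version_from_banner, the underscore separator, and the fallback message are assumptions for illustration, not pyani's actual output format.

```python
import platform
import re


def version_from_banner(banner: str) -> str:
    """Hypothetical helper: extract the version and prefix the OS name."""
    # Look for the token that follows the word "version" in the banner
    match = re.search(r"version\s+(\S+)", banner)
    if match is None:
        # Placeholder message for a banner without version information
        return "no version info returned"
    # platform.system() gives the OS name, e.g. "Linux" or "Darwin"
    return f"{platform.system()}_{match.group(1)}"


print(version_from_banner("NUCmer (NUCleotide MUMmer) version 3.1"))
```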

The following circumstances are explicitly reported as strings:

  • no executable at passed path
  • non-executable file at passed path (this includes cases where the user doesn’t have execute permissions on the file)
  • no version info returned

pyani.anim.parse_delta(filename: pathlib.Path) → Tuple[int, int]

Return (alignment length, similarity errors) tuple from passed .delta.

Parameters: filename – Path, path to the input .delta file

Extracts the aligned length and number of similarity errors for each aligned uniquely-matched region, and returns the cumulative total for each as a tuple.

Similarity errors are defined in the .delta file spec (see below) as non-positive match scores. For NUCmer output, this is identical to the number of errors (non-identities and indels).

The delta file format has seven numbers in each line of interest (see http://mummer.sourceforge.net/manual/ for the specification):

  • start on query
  • end on query
  • start on target
  • end on target
  • error count (non-identical, plus indels)
  • similarity errors (non-positive match scores)
    [NOTE: with NUCmer this is equal to error count]
  • stop codons (always zero for nucmer)
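
To make the seven-number layout concrete, here is a minimal sketch that totals the aligned region lengths and similarity errors from such lines. It is not pyani's implementation: the function name sum_delta_headers is hypothetical, the region length is assumed to be inclusive of both endpoints, the indel (delta) values are ignored, and the sample lines are made up for illustration.

```python
from typing import Iterable, Tuple


def sum_delta_headers(lines: Iterable[str]) -> Tuple[int, int]:
    """Hypothetical helper: total region length and similarity errors."""
    aln_length, sim_errors = 0, 0
    for line in lines:
        fields = line.split()
        # Only alignment header lines carry exactly seven numbers
        if len(fields) != 7:
            continue
        start, end = int(fields[0]), int(fields[1])
        # Inclusive length of the aligned region on the first sequence
        aln_length += abs(end - start) + 1
        # Sixth column holds the similarity error count
        sim_errors += int(fields[5])
    return aln_length, sim_errors


# Made-up .delta content: file paths, program name, sequence header,
# then two alignments, each followed by its delta values
sample = [
    "/path/ref.fna /path/qry.fna",
    "NUCMER",
    ">ref1 qry1 5000 4800",
    "1 100 11 110 3 3 0",
    "5",
    "-20",
    "0",
    "201 300 211 310 1 1 0",
    "0",
]
print(sum_delta_headers(sample))
```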

To calculate alignment length, we take the length of the aligned region of the reference (no gaps), and process the delta information. This takes the form of one value per line, following the header sequence. Positive values indicate an insertion in the reference; negative values a deletion in the reference (i.e. an insertion in the query). The total length of the alignment is then:

reference_length + insertions - deletions

For example:

A = ABCDACBDCAC$
B = BCCDACDCAC$
Delta = (1, -3, 4, 0)
A = ABC.DACBDCAC$
B = .BCCDAC.DCAC$

A is the reference and has length 11. There are two insertions (positive delta), and one deletion (negative delta). Alignment length is then 11 + 1 = 12.
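
The worked example can be checked with a short sketch that applies the formula exactly as stated above (reference_length + insertions - deletions). The function name alignment_length is hypothetical, and this is an illustration of the stated formula rather than pyani's code.

```python
from typing import Sequence


def alignment_length(reference_length: int, deltas: Sequence[int]) -> int:
    """Hypothetical helper applying the formula stated in the text."""
    insertions = deletions = 0
    for value in deltas:
        if value == 0:
            # A zero terminates the list of delta values
            break
        if value > 0:
            insertions += 1  # positive value: insertion in the reference
        else:
            deletions += 1  # negative value: deletion in the reference
    return reference_length + insertions - deletions


print(alignment_length(11, (1, -3, 4, 0)))  # 12
```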

pyani.anim.process_deltadir(delta_dir: pathlib.Path, org_lengths: Dict[KT, VT], logger: Optional[logging.Logger] = None) → pyani.pyani_tools.ANIResults

Return an ANIResults object of ANIm results for the .delta files in the passed directory.

Parameters:
  • delta_dir – Path, path to the directory containing .delta files
  • org_lengths – dictionary of total sequence lengths, keyed by sequence
  • logger – optional logging.Logger instance

Returns the following pandas dataframes in an ANIResults object; query sequences are rows, subject sequences are columns:

  • alignment_lengths - symmetrical: total length of alignment
  • percentage_identity - symmetrical: percentage identity of alignment
  • alignment_coverage - non-symmetrical: coverage of query and subject
  • similarity_errors - symmetrical: count of similarity errors

May raise a ZeroDivisionError if one or more NUCmer runs failed, or a very distant sequence was included in the analysis.
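
The sketch below shows one way the per-comparison values could be derived, and why a zero alignment length leads to a division by zero. It rests on assumptions not stated in this page: identity is taken as 1 - similarity_errors / alignment_length, coverage as alignment_length divided by the total sequence length, and the function name identity_and_coverage is hypothetical.

```python
from typing import Optional, Tuple


def identity_and_coverage(
    aln_length: int,
    sim_errors: int,
    query_length: int,
    subject_length: int,
) -> Optional[Tuple[float, float, float]]:
    """Hypothetical helper: identity, query coverage, subject coverage."""
    if aln_length == 0:
        # A failed run or very distant pair gives no aligned bases;
        # dividing here would raise ZeroDivisionError, so return None
        return None
    identity = 1 - sim_errors / aln_length
    query_cov = aln_length / query_length
    subject_cov = aln_length / subject_length
    return identity, query_cov, subject_cov


print(identity_and_coverage(200, 4, 1000, 800))
```

Coverage differs between query and subject because the two genomes generally have different total lengths, which is consistent with the alignment_coverage table being non-symmetrical.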