pyani.scripts.average_nucleotide_identity module

Script that calculates ANI measures for a directory of genomes.

This script calculates Average Nucleotide Identity (ANI) according to one of a number of alternative methods described in, e.g.

Richter M, Rossello-Mora R (2009) Shifting the genomic gold standard for the prokaryotic species definition. Proc Natl Acad Sci USA 106: 19126-19131. doi:10.1073/pnas.0906412106. (ANI1020, ANIm, ANIb)

Goris J, Konstantinidis KT, Klappenbach JA, Coenye T, Vandamme P, et al. (2007) DNA-DNA hybridization values and their relationship to whole-genome sequence similarities. Int J Syst Evol Micr 57: 81-91. doi:10.1099/ijs.0.64483-0.

ANI is proposed to be the appropriate in silico substitute for DNA-DNA hybridisation (DDH), and so useful for delineating species boundaries. A typical percentage threshold for species boundary in the literature is 95% ANI (e.g. Richter et al. 2009).

All ANI methods follow the basic algorithm:

  • Align the genome of organism 1 against that of organism 2, and identify the matching regions
  • Calculate the percentage nucleotide identity of the matching regions, as an average for all matching regions
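
The two steps above can be sketched as follows. This is a minimal illustration rather than pyani's implementation: it assumes the aligner has already reported each matching region as a hypothetical (aligned_length, mismatches) pair, and it weights each region by its length when averaging.

```python
from typing import Iterable, Tuple


def average_identity(regions: Iterable[Tuple[int, int]]) -> float:
    """Return the length-weighted percentage identity over matching regions.

    Each region is a hypothetical (aligned_length, mismatches) pair.
    """
    total_length = 0
    total_mismatches = 0
    for aligned_length, mismatches in regions:
        total_length += aligned_length
        total_mismatches += mismatches
    if total_length == 0:
        raise ValueError("no matching regions to average")
    return 100.0 * (1 - total_mismatches / total_length)


# Two matching regions: 1000 nt with 20 mismatches, 500 nt with 10 mismatches
example_ani = average_identity([(1000, 20), (500, 10)])
```

An unweighted mean of per-region identities is an equally valid reading of "average"; the weighting here is an assumption made for the sketch.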

Methods differ on: (1) what alignment algorithm is used, and the choice of parameters (this affects the aligned region boundaries); (2) what the input is for alignment (typically either fragments of fixed size, or the most complete assembly available); (3) whether a reciprocal comparison is necessary or desirable.

  • ANIm: uses MUMmer (NUCmer) to align the input sequences.
  • ANIb: uses BLASTN to align 1000nt fragments of the input sequences.
  • TETRA: calculates tetranucleotide frequencies of each input sequence.

This script takes as main input a directory containing a set of correctly-formatted FASTA multiple sequence files. All sequences for a single organism should be contained in only one sequence file. The names of these files are used for identification, so it would be advisable to name them sensibly.
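
As a rough illustration of how file names become identifiers, the sketch below collects FASTA files from a directory and keys them by file stem. The function name and the set of accepted suffixes are assumptions made for this example, not the script's actual behaviour.

```python
from pathlib import Path
from typing import Dict

# Assumed suffixes; adjust to match the files in your input directory
FASTA_SUFFIXES = {".fna", ".fasta", ".fas", ".fa"}


def collect_genomes(indir: Path) -> Dict[str, Path]:
    """Map each FASTA file's stem (used as its identifier) to its path."""
    genomes: Dict[str, Path] = {}
    for path in sorted(indir.iterdir()):
        if path.is_file() and path.suffix.lower() in FASTA_SUFFIXES:
            if path.stem in genomes:
                # Two files with the same stem would be indistinguishable
                raise ValueError(f"duplicate genome name: {path.stem}")
            genomes[path.stem] = path
    return genomes
```

Because the stem is the only label carried into the output tables, a file named after the organism or accession is easier to interpret than a generic name.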

Output is written to a named directory. The output files differ depending on the chosen ANI method.

  • ANIm: MUMmer/NUCmer .delta files, describing the sequence alignment; tab-separated plain text tables describing total alignment lengths and total alignment percentage identity.
  • ANIb: FASTA sequences describing 1000nt fragments of each input sequence; BLAST nucleotide databases, one for each set of fragments; and BLASTN output files (tab-separated tabular format plain text), one for each pairwise comparison of input sequences. There are potentially a lot of intermediate files.
  • TETRA: tab-separated text file describing the Z-scores for each tetranucleotide in each input sequence.

In addition, all methods produce a table of percentage identity (ANIm and ANIb) or correlation (TETRA) for each pair of input sequences.

If graphical output is chosen, the output directory will also contain PDF files representing the similarity between sequences as a heatmap with row and column dendrograms.

DEPENDENCIES

  • Biopython (http://www.biopython.org)
  • BLAST+ executable in the $PATH, or available on the command line (ANIb) (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/)
  • MUMmer executables in the $PATH, or available on the command line (ANIm) (http://mummer.sourceforge.net/)

For graphical output:

  • R with shared libraries installed on the system (http://cran.r-project.org/)
  • Rpy2 (http://rpy.sourceforge.net/rpy2.html)

pyani.scripts.average_nucleotide_identity.calculate_anim(args: argparse.Namespace, infiles: List[pathlib.Path], org_lengths: Dict[KT, VT]) → pyani.pyani_tools.ANIResults

Return ANIm result dataframes for files in input directory.

Parameters:
  • args – Namespace, command-line arguments
  • infiles – list of paths to each input file
  • org_lengths – dict, input sequence lengths, keyed by sequence

Finds ANI by the ANIm method, as described in Richter et al (2009) Proc Natl Acad Sci USA 106: 19126-19131 doi:10.1073/pnas.0906412106.

All FASTA format files (selected by suffix) in the input directory are compared against each other, pairwise, using NUCmer (which must be in the path). NUCmer output is stored in the output directory.

The NUCmer .delta file output is parsed to obtain an alignment length and similarity error count for every unique region alignment between the two organisms, as represented by the sequences in the FASTA files.

These are processed to give matrices of aligned sequence lengths, average nucleotide identity (ANI) percentages, coverage (aligned percentage of whole genome), and similarity error count for each pairwise comparison.
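
The parsing and the derived measures can be sketched as below. This is an illustrative reading of the .delta format, not pyani's parser: it assumes each alignment header line holds seven integers (reference start and end, query start and end, errors, similarity errors, stop codons), measures alignment length on the reference coordinates, and does not correct for overlapping alignments. The function names and the example data are hypothetical.

```python
from typing import Tuple

# Hypothetical .delta content: two alignments between one pair of sequences
EXAMPLE_DELTA = """/path/to/genome_a.fna /path/to/genome_b.fna
NUCMER
>seq1 seq2 5000 4800
1 1000 1 1000 10 10 0
0
2001 2500 1801 2300 5 5 0
-20
0
"""


def parse_delta_text(text: str) -> Tuple[int, int]:
    """Return (total aligned length, total similarity errors)."""
    aln_length = 0
    sim_errors = 0
    for line in text.splitlines():
        fields = line.split()
        # Skip file paths, program name, sequence headers and indel offsets
        if line.startswith(">") or len(fields) != 7:
            continue
        ref_start, ref_end = int(fields[0]), int(fields[1])
        aln_length += abs(ref_end - ref_start) + 1  # inclusive coordinates
        sim_errors += int(fields[5])
    return aln_length, sim_errors


def identity_and_coverage(
    aln_length: int, sim_errors: int, genome_length: int
) -> Tuple[float, float]:
    """Return (percentage identity, percentage of the genome aligned)."""
    identity = 100.0 * (1 - sim_errors / aln_length)
    coverage = 100.0 * aln_length / genome_length
    return identity, coverage
```

Applying identity_and_coverage to every pair of input genomes would fill one cell in each of the identity and coverage matrices.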

pyani.scripts.average_nucleotide_identity.calculate_tetra(infiles: List[pathlib.Path]) → pandas.core.frame.DataFrame

Calculate TETRA for files in input directory.

Parameters:
  • infiles – list, paths to each input file

Calculates TETRA correlation scores, as described in:

Richter M, Rossello-Mora R (2009) Shifting the genomic gold standard for the prokaryotic species definition. Proc Natl Acad Sci USA 106: 19126-19131. doi:10.1073/pnas.0906412106.

and

Teeling et al. (2004) Application of tetranucleotide frequencies for the assignment of genomic fragments. Env. Microbiol. 6(9): 938-947. doi:10.1111/j.1462-2920.2004.00624.x
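
A simplified sketch of the idea follows. It counts overlapping tetranucleotides and correlates the raw count profiles of two sequences with a Pearson coefficient. This is a deliberate simplification: the published method correlates Z-scores derived from the counts, which are omitted here, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt
from typing import List, Sequence

# All 256 tetranucleotides in a fixed order, so profiles are comparable
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_counts(sequence: str) -> List[int]:
    """Count overlapping tetranucleotides; words with other symbols are ignored."""
    seq = sequence.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    return [counts[t] for t in TETRAMERS]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length profiles (neither may be constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)
```

Fixing the order of TETRAMERS is what makes two profiles position-by-position comparable, whatever the sequences contain.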

pyani.scripts.average_nucleotide_identity.compress_delete_outdir(outdir: pathlib.Path, logger: logging.Logger) → None

Compress the contents of the passed directory to .tar.gz and delete.
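
One way to do this with the standard library is sketched below. The function name, the archive location (next to the directory) and the returned path are assumptions for illustration; logging is omitted.

```python
import shutil
import tarfile
from pathlib import Path


def compress_and_delete(outdir: Path) -> Path:
    """Archive outdir as <outdir>.tar.gz alongside it, then remove outdir."""
    tarfn = outdir.with_name(outdir.name + ".tar.gz")
    with tarfile.open(tarfn, "w:gz") as archive:
        # arcname keeps paths inside the archive relative to the directory name
        archive.add(outdir, arcname=outdir.name)
    shutil.rmtree(outdir)
    return tarfn
```

The directory is only removed after the archive has been closed, so a failure while writing the archive leaves the original files in place.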

pyani.scripts.average_nucleotide_identity.draw(args: argparse.Namespace, filestems: List[str], gformat: str) → None

Draw ANIb/ANIm/TETRA results.

Parameters:
  • args – Namespace, command-line arguments
  • filestems – filestems for output files
  • gformat – the format for output graphics

pyani.scripts.average_nucleotide_identity.get_method(args: argparse.Namespace) → Tuple

Return function and config for the chosen method.

Parameters:
  • args – Namespace of command-line arguments

The dictionary defines pairs of method function and configurations, keyed by method name.
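
The dispatch pattern described here might look like the sketch below. The worker functions, the configuration contents and the error handling are placeholders invented for illustration; only the idea of a dictionary keyed by method name comes from the description above.

```python
from typing import Callable, Dict, List, Optional, Tuple


# Placeholder workers standing in for the real analysis functions
def run_anim(infiles: List[str]) -> str:
    return f"ANIm on {len(infiles)} files"


def run_anib(infiles: List[str]) -> str:
    return f"ANIb on {len(infiles)} files"


def run_tetra(infiles: List[str]) -> str:
    return f"TETRA on {len(infiles)} files"


# Method name -> (function, configuration); the config values are hypothetical
METHODS: Dict[str, Tuple[Callable[[List[str]], str], Optional[dict]]] = {
    "ANIm": (run_anim, None),
    "ANIb": (run_anib, {"fragsize": 1000}),
    "TETRA": (run_tetra, None),
}


def choose_method(name: str) -> Tuple[Callable[[List[str]], str], Optional[dict]]:
    """Return the (function, config) pair registered for a method name."""
    try:
        return METHODS[name]
    except KeyError:
        raise ValueError(f"unknown ANI method: {name}") from None
```

Keeping the mapping in one dictionary means a new method can be added by registering a single entry rather than extending a chain of conditionals.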

pyani.scripts.average_nucleotide_identity.last_exception() → str

Return last exception as a string, or use in logging.

pyani.scripts.average_nucleotide_identity.make_outdirs(args: argparse.Namespace)

Make the output directory, if required.

Parameters:
  • args – Namespace of command-line options

If the output directory already exists and args.force is not set True, stop with an error.

If args.force is set…

  • If args.noclobber is not set True, delete the output directory tree.
  • If args.noclobber is set True, use the existing output directory, and keep any existing output.
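
The decision logic above can be sketched as follows. This is an illustrative stand-in, not the script's code: the function name and keyword arguments are hypothetical, and it raises an exception where the script stops with an error.

```python
import shutil
from pathlib import Path


def make_outdir(outdir: Path, force: bool = False, noclobber: bool = False) -> None:
    """Create outdir, applying the force/noclobber rules to an existing one."""
    if outdir.exists():
        if not force:
            # Existing output and no permission to touch it: stop
            raise FileExistsError(f"{outdir} exists; set force to reuse or replace it")
        if not noclobber:
            # force without noclobber: start from an empty directory
            shutil.rmtree(outdir)
    # With force and noclobber, the existing directory and its files are kept
    outdir.mkdir(parents=True, exist_ok=True)
```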
pyani.scripts.average_nucleotide_identity.make_sequence_fragments(args: argparse.Namespace, logger: logging.Logger, infiles: List[pathlib.Path], blastdir: pathlib.Path) → Tuple[List[T], Dict[KT, VT]]

Return tuple of fragment files, and fragment sizes.

Parameters:
  • args – Namespace of command-line arguments
  • logger – logging object
  • infiles – iterable of sequence files to fragment
  • blastdir – path of directory to place BLASTN databases of fragments

Splits input FASTA sequence files into fragments (a requirement for ANIb methods), and writes BLAST databases of these fragments, and the fragment lengths of sequences, to local files.
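
The fragmentation step on its own can be sketched as below, leaving out file handling and BLAST database creation. The function names and the fragment naming scheme are hypothetical; the default size of 1000 follows the fragment length quoted in the module description.

```python
from typing import Dict, List, Tuple


def fragment_sequence(
    seq_id: str, sequence: str, fragsize: int = 1000
) -> List[Tuple[str, str]]:
    """Split a sequence into consecutive fragments of at most fragsize."""
    fragments = []
    for index, start in enumerate(range(0, len(sequence), fragsize), start=1):
        frag_id = f"{seq_id}-frag{index:05d}"  # hypothetical naming scheme
        fragments.append((frag_id, sequence[start:start + fragsize]))
    return fragments


def fragment_lengths(fragments: List[Tuple[str, str]]) -> Dict[str, int]:
    """Record the length of each fragment, keyed by fragment identifier."""
    return {frag_id: len(frag) for frag_id, frag in fragments}
```

The final fragment is usually shorter than fragsize, which is why the lengths are recorded: coverage thresholds are applied relative to each fragment's own length.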

pyani.scripts.average_nucleotide_identity.parse_cmdline(argv: Optional[List[T]] = None) → argparse.Namespace

Parse command-line arguments for script.

Parameters: argv – list of arguments from the command line

pyani.scripts.average_nucleotide_identity.process_arguments(args: Optional[argparse.Namespace]) → argparse.Namespace

Process command-line arguments.

Parameters: args – Namespace of command-line arguments

Returns the parsed arguments or, if only the script name is used, shows the version and exits.

pyani.scripts.average_nucleotide_identity.run_blast(args: argparse.Namespace, logger: logging.Logger, infiles: List[pathlib.Path], blastdir: pathlib.Path) → Tuple

Run BLAST commands for ANIb methods.

Parameters:
  • args – Namespace of command-line options
  • logger – logging object
  • infiles – iterable of sequence files to compare
  • blastdir – path of directory to fragment BLASTN databases

Runs BLAST database creation and comparisons, returning the cumulative return values of the BLAST tool subprocesses, and the fragment sizes for each input file.

pyani.scripts.average_nucleotide_identity.run_main(argsin: Optional[argparse.Namespace] = None) → int

Run main process for average_nucleotide_identity.py script.

Parameters:
  • argsin – Namespace, command-line arguments

pyani.scripts.average_nucleotide_identity.subsample_input(args: argparse.Namespace, logger: logging.Logger, infiles: List[pathlib.Path]) → List[pathlib.Path]

Return a random subsample of the passed input files.

Parameters:
  • args – Namespace, command-line arguments
  • logger – logging object
  • infiles – list of input files for analysis
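
A random subsample of input files can be drawn as in the sketch below. The function name, the k and seed parameters and the validation rule are assumptions for illustration; how the script derives the sample size from its arguments is not shown here.

```python
import random
from pathlib import Path
from typing import List, Optional


def subsample(infiles: List[Path], k: int, seed: Optional[int] = None) -> List[Path]:
    """Return k input files chosen at random without replacement."""
    if not 0 < k <= len(infiles):
        raise ValueError(f"sample size {k} is not in the range 1..{len(infiles)}")
    rng = random.Random(seed)  # a fixed seed makes the subsample reproducible
    return rng.sample(infiles, k)
```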
pyani.scripts.average_nucleotide_identity.test_class_label_paths(args: argparse.Namespace, logger: logging.Logger) → None

Raise error and exit if label and class files do not exist.

Parameters:
  • args – Namespace of command-line arguments
  • logger – logging object

Exits if class and label files are not found.

pyani.scripts.average_nucleotide_identity.test_scheduler(args: argparse.Namespace, logger: logging.Logger) → None

Test if the specified scheduler can be used.

Parameters:
  • args – Namespace of command-line arguments
  • logger – logging object

Exits if the scheduler is invalid.

pyani.scripts.average_nucleotide_identity.unified_anib(args: argparse.Namespace, infiles: List[pathlib.Path], org_lengths: Dict[str, int]) → pyani.pyani_tools.ANIResults

Calculate ANIb for files in input directory.

Parameters:
  • args – Namespace of command-line options
  • infiles – iterable of paths to each input file
  • org_lengths – dict of input sequence lengths keyed by sequence name

Calculates ANI by the ANIb method, as described in Goris et al. (2007) Int J Syst Evol Micr 57: 81-91. doi:10.1099/ijs.0.64483-0. There are some minor differences depending on whether BLAST+ or legacy BLAST (BLASTALL) methods are used.

All FASTA format files (selected by suffix) in the input directory are used to construct BLAST databases, placed in the output directory. Each file’s contents are also split into sequence fragments of length options.fragsize, and the resulting multiple FASTA file is written to the output directory. These fragments are BLASTNed, pairwise, against the databases.

The BLAST output is interrogated for all fragment matches that cover at least 70% of the query sequence, with at least 30% nucleotide identity over the full length of the query sequence. This is an odd choice and doesn’t correspond to the twilight zone limit as implied by Goris et al. We persist with their definition, however. Only these qualifying matches contribute to the total aligned length, and total aligned sequence identity used to calculate ANI.
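
The filter described above can be sketched as follows. The FragmentHit record and its field names are hypothetical, and the way gaps are counted in the aligned length is left out; the sketch only shows the two thresholds and the totals that qualifying matches contribute.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class FragmentHit:
    """Hypothetical summary of one BLASTN match for a query fragment."""
    query_length: int    # full length of the query fragment
    aligned_length: int  # aligned positions in the match
    identical: int       # identical positions in the match


def qualifying_hits(
    hits: Iterable[FragmentHit],
    min_coverage: float = 0.7,
    min_identity: float = 0.3,
) -> List[FragmentHit]:
    """Keep hits meeting both thresholds, each measured against the full query."""
    return [
        hit for hit in hits
        if hit.aligned_length / hit.query_length >= min_coverage
        and hit.identical / hit.query_length >= min_identity
    ]


def totals(hits: Iterable[FragmentHit]) -> Tuple[int, int]:
    """Return (total aligned length, total identical positions)."""
    kept = list(hits)
    return sum(h.aligned_length for h in kept), sum(h.identical for h in kept)
```

Both ratios use the query fragment length as the denominator, so a short but perfect match to part of a fragment is rejected on coverage even though its local identity is high.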

The results are processed to give matrices of aligned sequence length (aln_lengths.tab), similarity error counts (sim_errors.tab), ANIs (perc_ids.tab), and minimum aligned percentage (perc_aln.tab) of each genome, for each pairwise comparison. These are written to the output directory in plain text tab-separated format.

pyani.scripts.average_nucleotide_identity.write(args: argparse.Namespace, results: pandas.core.frame.DataFrame) → None

Write ANIb/ANIm/TETRA results to output directory.

Parameters:
  • args – Namespace, command-line arguments
  • results – Results object from analysis

Each dataframe is written to an Excel-format file (if args.write_excel is True), and plain text tab-separated file in the output directory. The order of result output must be reflected in the order of filestems.
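
The pairing of results with filestems can be sketched as below, using plain nested dictionaries and the csv module in place of dataframes. The function name, the table representation and the .tab suffix handling are assumptions for illustration, and Excel output is omitted.

```python
import csv
from pathlib import Path
from typing import Dict, List, Sequence

# A square table: row label -> column label -> value
Table = Dict[str, Dict[str, float]]


def write_tables(
    outdir: Path, filestems: Sequence[str], tables: Sequence[Table]
) -> List[Path]:
    """Write each table to <outdir>/<filestem>.tab, pairing them by position."""
    if len(filestems) != len(tables):
        raise ValueError("need exactly one filestem per table")
    written = []
    for stem, table in zip(filestems, tables):
        outfile = outdir / f"{stem}.tab"
        labels = sorted(table)
        with outfile.open("w", newline="") as handle:
            writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
            writer.writerow([""] + labels)  # header row of column labels
            for row in labels:
                writer.writerow([row] + [table[row][col] for col in labels])
        written.append(outfile)
    return written
```

Because tables and filestems are matched purely by position, reordering one sequence without the other would silently write results under the wrong name, which is the reason for the ordering requirement stated above.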