pyani.anib module

Code to implement the ANIb average nucleotide identity method.

Calculates ANI by the ANIb method, as described in Goris et al. (2007) Int J Syst Evol Micr 57: 81-91. doi:10.1099/ijs.0.64483-0.

From Goris et al.

‘’’The genomic sequence from one of the genomes in a pair (the query) was cut into consecutive 1020 nt fragments. The 1020 nt cut-off was used to correspond with the fragmentation of the genomic DNA to approximately 1 kb fragments during the DDH experiments. […] The 1020 nt fragments were then used to search against the whole genomic sequence of the other genome in the pair (the reference) by using the BLASTN algorithm; the best BLASTN match was saved for further analysis. The BLAST algorithm was run using the following settings: X=150 (where X is the drop-off value for gapped alignment), q=-1 (where q is the penalty for nucleotide mismatch) and F=F (where F is the filter for repeated sequences); the rest of the parameters were used at the default settings. These settings give better sensitivity than the default settings when more distantly related genomes are being compared, as the latter target sequences that are more similar to each other. […] The ANI between the query genome and the reference genome was calculated as the mean identity of all BLASTN matches that showed more than 30% overall sequence identity (recalculated to an identity along the entire sequence) over an alignable region of at least 70% of their length. This cut-off is above the ‘twilight zone’ of similarity searches in which an inference of homology is error prone because of low levels of Reverse searching, i.e. in which the reference genome is used as the query, was also performed to provide reciprocal values.’’’

All input FASTA format files are used to construct BLAST databases. Each file’s contents are also split into sequence fragments of length options.fragsize, and the multiple FASTA file that results written to the output directory. These are BLASTNed, pairwise, against the databases.

BLAST output is interrogated for all fragment matches that cover at least 70% of the query sequence, with at least 30% nucleotide identity over the full length of the query sequence. This is an odd choice and doesn’t correspond to the twilight zone limit as implied by Goris et al. We persist with their definition, however. Only these qualifying matches contribute to the total aligned length, and total aligned sequence identity used to calculate ANI.

exception pyani.anib.PyaniANIbException[source]

Bases: pyani.PyaniException

ANIb-specific exception for pyani.

pyani.anib.build_db_jobs(infiles: List[pathlib.Path], blastcmds: pyani.pyani_tools.BLASTcmds) → Dict[KT, VT][source]

Return dictionary of db-building commands, keyed by dbname.

Parameters:
  • infiles
  • blastcmds
pyani.anib.construct_blastall_cmdline(fname1: pathlib.Path, fname2: pathlib.Path, outdir: pathlib.Path, blastall_exe: pathlib.Path = PosixPath('blastall')) → str[source]

Return single blastall command.

Parameters:
  • fname1
  • fname2
  • outdir
  • blastall_exe – str, path to BLASTALL executable
pyani.anib.construct_blastn_cmdline(fname1: pathlib.Path, fname2: pathlib.Path, outdir: pathlib.Path, blastn_exe: pathlib.Path = PosixPath('blastn')) → str[source]

Return a single blastn command.

Parameters:
  • fname1
  • fname2
  • outdir
  • blastn_exe – str, path to blastn executable
pyani.anib.construct_formatdb_cmd(filename: pathlib.Path, outdir: pathlib.Path, blastdb_exe: pathlib.Path = PosixPath('formatdb')) → Tuple[str, pathlib.Path][source]

Return formatdb command and path to output file.

Parameters:
  • filename – Path, input filename
  • outdir – Path, path to output directory
  • blastdb_exe – Path, path to the formatdb executable
pyani.anib.construct_makeblastdb_cmd(filename: pathlib.Path, outdir: pathlib.Path, blastdb_exe: pathlib.Path = PosixPath('makeblastdb')) → Tuple[str, pathlib.Path][source]

Return makeblastdb command and path to output file.

Parameters:
  • filename – Path, input filename
  • outdir – Path, directory for output
  • blastdb_exe – Path, path to the makeblastdb executable
pyani.anib.fragment_fasta_files(infiles: List[pathlib.Path], outdirname: pathlib.Path, fragsize: int) → Tuple[List[T], Dict[KT, VT]][source]

Chop sequences of the passed files into fragments, return filenames.

Parameters:
  • infiles – collection of paths to each input sequence file
  • outdirname – Path, path to output directory
  • fragsize – Int, the size of sequence fragments

Takes every sequence from every file in infiles, and splits them into consecutive fragments of length fragsize, (with any trailing sequences being included, even if shorter than fragsize), writing the resulting set of sequences to a file with the same name in the specified output directory.

All fragments are named consecutively and uniquely (within a file) as fragNNNNN. Sequence description fields are retained.

Returns a tuple (filenames, fragment_lengths) where filenames is a list of paths to the fragment sequence files, and fragment_lengths is a dictionary of sequence fragment lengths, keyed by the sequence files, with values being a dictionary of fragment lengths, keyed by fragment IDs.

pyani.anib.generate_blastdb_commands(filenames: List[pathlib.Path], outdir: pathlib.Path, blastdb_exe: Optional[pathlib.Path] = None, mode: str = 'ANIb') → List[Tuple[str, pathlib.Path]][source]

Return list of makeblastdb command-lines for ANIb/ANIblastall.

Parameters:
  • filenames – a list of paths to input FASTA files
  • outdir – path to output directory
  • blastdb_exe – path to the makeblastdb executable
  • mode – str, ANIb analysis type (ANIb or ANIblastall)
pyani.anib.generate_blastn_commands(filenames: List[pathlib.Path], outdir: pathlib.Path, blast_exe: Optional[pathlib.Path] = None, mode: str = 'ANIb') → List[str][source]

Return a list of blastn command-lines for ANIm.

Parameters:
  • filenames – a list of paths to fragmented input FASTA files
  • outdir – path to output directory
  • blastn_exe – path to BLASTN executable
  • mode – str, analysis type (ANIb or ANIblastall)

Assumes that the fragment sequence input filenames have the form ACCESSION-fragments.ext, where the corresponding BLAST database filenames have the form ACCESSION.ext. This is the convention followed by the fragment_FASTA_files() function above.

pyani.anib.get_fraglength_dict(fastafiles: List[pathlib.Path]) → Dict[KT, VT][source]

Return dictionary of sequence fragment lengths, keyed by query name.

Parameters:fastafiles – list of paths to FASTA input whole sequence files

Loops over input files and, for each, produces a dictionary with fragment lengths, keyed by sequence ID. These are returned as a dictionary with the keys being query IDs derived from filenames.

pyani.anib.get_fragment_lengths(fastafile: pathlib.Path) → Dict[KT, VT][source]

Return dictionary of sequence fragment lengths, keyed by fragment ID.

Parameters:fastafile

Biopython’s SeqIO module is used to parse all sequences in the FASTA file.

NOTE: ambiguity symbols are not discounted.

pyani.anib.get_version(blast_exe: pathlib.Path = PosixPath('blastn')) → str[source]

Return BLAST+ blastn version as a string.

Parameters:blast_exe – path to blastn executable

We expect blastn to return a string as, for example

$ blastn -version
blastn: 2.9.0+
Package: blast 2.9.0, build Jun 10 2019 09:40:53

This is concatenated with the OS name.

The following circumstances are explicitly reported as strings

  • no executable at passed path
  • non-executable file at passed path (this includes cases where the user doesn’t have execute permissions on the file)
  • no version info returned
pyani.anib.make_blastcmd_builder(mode: str, outdir: pathlib.Path, format_exe: Optional[pathlib.Path] = None, blast_exe: Optional[pathlib.Path] = None, prefix: str = 'ANIBLAST') → pyani.pyani_tools.BLASTcmds[source]

Return BLASTcmds object for construction of BLAST commands.

Parameters:
  • mode – str, the kind of ANIb analysis (ANIb or ANIblastall)
  • outdir
  • format_exe
  • blast_exe
  • prefix
pyani.anib.make_job_graph(infiles: List[pathlib.Path], fragfiles: List[pathlib.Path], blastcmds: pyani.pyani_tools.BLASTcmds) → List[pyani.pyani_jobs.Job][source]

Return job dependency graph, based on the passed input sequence files.

Parameters:
  • infiles – list of paths to input FASTA files
  • fragfiles – list of paths to fragmented input FASTA files
  • blastcmds

By default, will run ANIb - it is possible to make a mess of passing the wrong executable for the mode you’re using.

All items in the returned graph list are BLAST executable jobs that must be run after the corresponding database creation. The Job objects corresponding to the database creation are contained as dependencies. How those jobs are scheduled depends on the scheduler (see run_multiprocessing.py, run_sge.py)

pyani.anib.parse_blast_tab(filename: pathlib.Path, fraglengths: Dict[KT, VT], mode: str = 'ANIb') → Tuple[int, int, int][source]

Return (alignment length, similarity errors, mean_pid) tuple.

Parameters:
  • filename – Path, path to .blast_tab file
  • fraglengths – Optional[Dict], dictionary of fragment lengths for each genome.
  • mode – str, analysis type (ANIb or ANIblastall)

Calculate the alignment length and total number of similarity errors (as we would with ANIm), as well as the Goris et al.-defined mean identity of all valid BLAST matches for the passed BLASTALL alignment .blast_tab file.

‘’’ANI between the query genome and the reference genome was calculated as the mean identity of all BLASTN matches that showed more than 30% overall sequence identity (recalculated to an identity along the entire sequence) over an alignable region of at least 70% of their length. ‘’’

pyani.anib.process_blast(blast_dir: pathlib.Path, org_lengths: Dict[KT, VT], fraglengths: Dict[KT, VT], mode: str = 'ANIb', logger: Optional[logging.Logger] = None) → pyani.pyani_tools.ANIResults[source]

Return tuple of ANIb results for .blast_tab files in the output dir.

Parameters:
  • blast_dir – Path, path to the directory containing .blast_tab files
  • org_lengths – Dict, the base count for each input sequence
  • fraglengths – dictionary of query sequence fragment lengths, only needed for BLASTALL output
  • mode – str, analysis type (ANIb or ANIblastall)
  • logger – a logger for messages

Returns the following pandas dataframes in an ANIResults object; query sequences are rows, subject sequences are columns:

  • alignment_lengths - non-symmetrical: total length of alignment
  • percentage_identity - non-symmetrical: ANIb (Goris) percentage identity
  • alignment_coverage - non-symmetrical: coverage of query
  • similarity_errors - non-symmetrical: count of similarity errors

May throw a ZeroDivisionError if one or more BLAST runs failed, or a very distant sequence was included in the analysis.