pyani.anib module¶

Code to implement the ANIb average nucleotide identity method.

Calculates ANI by the ANIb method, as described in Goris et al. (2007) Int J Syst Evol Micr 57: 81-91. doi:10.1099/ijs.0.64483-0.

From Goris et al.

‘’’The genomic sequence from one of the genomes in a pair (the query) was cut into consecutive 1020 nt fragments. The 1020 nt cut-off was used to correspond with the fragmentation of the genomic DNA to approximately 1 kb fragments during the DDH experiments. […] The 1020 nt fragments were then used to search against the whole genomic sequence of the other genome in the pair (the reference) by using the BLASTN algorithm; the best BLASTN match was saved for further analysis. The BLAST algorithm was run using the following settings: X=150 (where X is the drop-off value for gapped alignment), q=-1 (where q is the penalty for nucleotide mismatch) and F=F (where F is the filter for repeated sequences); the rest of the parameters were used at the default settings. These settings give better sensitivity than the default settings when more distantly related genomes are being compared, as the latter target sequences that are more similar to each other. […] The ANI between the query genome and the reference genome was calculated as the mean identity of all BLASTN matches that showed more than 30% overall sequence identity (recalculated to an identity along the entire sequence) over an alignable region of at least 70% of their length. This cut-off is above the ‘twilight zone’ of similarity searches in which an inference of homology is error prone because of low levels of Reverse searching, i.e. in which the reference genome is used as the query, was also performed to provide reciprocal values.’’’

All input FASTA format files are used to construct BLAST databases. Each file’s contents are also split into sequence fragments of length options.fragsize, and the multiple FASTA file that results written to the output directory. These are BLASTNed, pairwise, against the databases.

BLAST output is interrogated for all fragment matches that cover at least 70% of the query sequence, with at least 30% nucleotide identity over the full length of the query sequence. This is an odd choice and doesn’t correspond to the twilight zone limit as implied by Goris et al. We persist with their definition, however. Only these qualifying matches contribute to the total aligned length, and total aligned sequence identity used to calculate ANI.

exception pyani.anib.PyaniANIbException[source]¶

Bases: pyani.PyaniException

ANIb-specific exception for pyani.

pyani.anib.build_db_jobs(infiles: List[pathlib.Path], blastcmds: pyani.pyani_tools.BLASTcmds) → Dict[KT, VT][source]¶

Return dictionary of db-building commands, keyed by dbname.

Parameters:	infiles – blastcmds –

pyani.anib.construct_blastall_cmdline(fname1: pathlib.Path, fname2: pathlib.Path, outdir: pathlib.Path, blastall_exe: pathlib.Path = PosixPath('blastall')) → str[source]¶

Return single blastall command.

Parameters:	fname1 – fname2 – outdir – blastall_exe – str, path to BLASTALL executable

pyani.anib.construct_blastn_cmdline(fname1: pathlib.Path, fname2: pathlib.Path, outdir: pathlib.Path, blastn_exe: pathlib.Path = PosixPath('blastn')) → str[source]¶

Return a single blastn command.

Parameters:	fname1 – fname2 – outdir – blastn_exe – str, path to blastn executable

pyani.anib.construct_formatdb_cmd(filename: pathlib.Path, outdir: pathlib.Path, blastdb_exe: pathlib.Path = PosixPath('formatdb')) → Tuple[str, pathlib.Path][source]¶

Return formatdb command and path to output file.

Parameters:	filename – Path, input filename outdir – Path, path to output directory blastdb_exe – Path, path to the formatdb executable

pyani.anib.construct_makeblastdb_cmd(filename: pathlib.Path, outdir: pathlib.Path, blastdb_exe: pathlib.Path = PosixPath('makeblastdb')) → Tuple[str, pathlib.Path][source]¶

Return makeblastdb command and path to output file.

Parameters:	filename – Path, input filename outdir – Path, directory for output blastdb_exe – Path, path to the makeblastdb executable

pyani.anib.fragment_fasta_files(infiles: List[pathlib.Path], outdirname: pathlib.Path, fragsize: int) → Tuple[List[T], Dict[KT, VT]][source]¶

Chop sequences of the passed files into fragments, return filenames.

Parameters:	infiles – collection of paths to each input sequence file outdirname – Path, path to output directory fragsize – Int, the size of sequence fragments

Takes every sequence from every file in infiles, and splits them into consecutive fragments of length fragsize, (with any trailing sequences being included, even if shorter than fragsize), writing the resulting set of sequences to a file with the same name in the specified output directory.

All fragments are named consecutively and uniquely (within a file) as fragNNNNN. Sequence description fields are retained.

Returns a tuple (filenames, fragment_lengths) where filenames is a list of paths to the fragment sequence files, and fragment_lengths is a dictionary of sequence fragment lengths, keyed by the sequence files, with values being a dictionary of fragment lengths, keyed by fragment IDs.

pyani.anib.generate_blastdb_commands(filenames: List[pathlib.Path], outdir: pathlib.Path, blastdb_exe: Optional[pathlib.Path] = None, mode: str = 'ANIb') → List[Tuple[str, pathlib.Path]][source]¶

Return list of makeblastdb command-lines for ANIb/ANIblastall.

Parameters:	filenames – a list of paths to input FASTA files outdir – path to output directory blastdb_exe – path to the makeblastdb executable mode – str, ANIb analysis type (ANIb or ANIblastall)

pyani.anib.generate_blastn_commands(filenames: List[pathlib.Path], outdir: pathlib.Path, blast_exe: Optional[pathlib.Path] = None, mode: str = 'ANIb') → List[str][source]¶

Return a list of blastn command-lines for ANIm.

Parameters:	filenames – a list of paths to fragmented input FASTA files outdir – path to output directory blastn_exe – path to BLASTN executable mode – str, analysis type (ANIb or ANIblastall)

Assumes that the fragment sequence input filenames have the form ACCESSION-fragments.ext, where the corresponding BLAST database filenames have the form ACCESSION.ext. This is the convention followed by the fragment_FASTA_files() function above.

pyani.anib.get_fraglength_dict(fastafiles: List[pathlib.Path]) → Dict[KT, VT][source]¶

Return dictionary of sequence fragment lengths, keyed by query name.

Parameters:	fastafiles – list of paths to FASTA input whole sequence files

Loops over input files and, for each, produces a dictionary with fragment lengths, keyed by sequence ID. These are returned as a dictionary with the keys being query IDs derived from filenames.

pyani.anib.get_fragment_lengths(fastafile: pathlib.Path) → Dict[KT, VT][source]¶

Return dictionary of sequence fragment lengths, keyed by fragment ID.

Parameters:	fastafile –

Biopython’s SeqIO module is used to parse all sequences in the FASTA file.

NOTE: ambiguity symbols are not discounted.

pyani.anib.get_version(blast_exe: pathlib.Path = PosixPath('blastn')) → str[source]¶

Return BLAST+ blastn version as a string.

Parameters:	blast_exe – path to blastn executable

We expect blastn to return a string as, for example

$ blastn -version
blastn: 2.9.0+
Package: blast 2.9.0, build Jun 10 2019 09:40:53

This is concatenated with the OS name.

The following circumstances are explicitly reported as strings

no executable at passed path
non-executable file at passed path (this includes cases where the user doesn’t have execute permissions on the file)
no version info returned

pyani.anib.make_blastcmd_builder(mode: str, outdir: pathlib.Path, format_exe: Optional[pathlib.Path] = None, blast_exe: Optional[pathlib.Path] = None, prefix: str = 'ANIBLAST') → pyani.pyani_tools.BLASTcmds[source]¶

Return BLASTcmds object for construction of BLAST commands.

Parameters:	mode – str, the kind of ANIb analysis (ANIb or ANIblastall) outdir – format_exe – blast_exe – prefix –

pyani.anib.make_job_graph(infiles: List[pathlib.Path], fragfiles: List[pathlib.Path], blastcmds: pyani.pyani_tools.BLASTcmds) → List[pyani.pyani_jobs.Job][source]¶

Return job dependency graph, based on the passed input sequence files.

Parameters:	infiles – list of paths to input FASTA files fragfiles – list of paths to fragmented input FASTA files blastcmds –

By default, will run ANIb - it is possible to make a mess of passing the wrong executable for the mode you’re using.

All items in the returned graph list are BLAST executable jobs that must be run after the corresponding database creation. The Job objects corresponding to the database creation are contained as dependencies. How those jobs are scheduled depends on the scheduler (see run_multiprocessing.py, run_sge.py)

pyani.anib.parse_blast_tab(filename: pathlib.Path, fraglengths: Dict[KT, VT], mode: str = 'ANIb') → Tuple[int, int, int][source]¶

Return (alignment length, similarity errors, mean_pid) tuple.

Parameters:	filename – Path, path to .blast_tab file fraglengths – Optional[Dict], dictionary of fragment lengths for each genome. mode – str, analysis type (ANIb or ANIblastall)

Calculate the alignment length and total number of similarity errors (as we would with ANIm), as well as the Goris et al.-defined mean identity of all valid BLAST matches for the passed BLASTALL alignment .blast_tab file.

‘’’ANI between the query genome and the reference genome was calculated as the mean identity of all BLASTN matches that showed more than 30% overall sequence identity (recalculated to an identity along the entire sequence) over an alignable region of at least 70% of their length. ‘’’

pyani.anib.process_blast(blast_dir: pathlib.Path, org_lengths: Dict[KT, VT], fraglengths: Dict[KT, VT], mode: str = 'ANIb', logger: Optional[logging.Logger] = None) → pyani.pyani_tools.ANIResults[source]¶

Return tuple of ANIb results for .blast_tab files in the output dir.

Parameters:	blast_dir – Path, path to the directory containing .blast_tab files org_lengths – Dict, the base count for each input sequence fraglengths – dictionary of query sequence fragment lengths, only needed for BLASTALL output mode – str, analysis type (ANIb or ANIblastall) logger – a logger for messages

Returns the following pandas dataframes in an ANIResults object; query sequences are rows, subject sequences are columns:

alignment_lengths - non-symmetrical: total length of alignment
percentage_identity - non-symmetrical: ANIb (Goris) percentage identity
alignment_coverage - non-symmetrical: coverage of query
similarity_errors - non-symmetrical: count of similarity errors

May throw a ZeroDivisionError if one or more BLAST runs failed, or a very distant sequence was included in the analysis.