pyani.tetra module

Code to implement the TETRA average nucleotide identity method.

Provides functions for calculation of TETRA as described in:

Richter M, Rossello-Mora R (2009) Shifting the genomic gold standard for the prokaryotic species definition. Proc Natl Acad Sci USA 106: 19126-19131. doi:10.1073/pnas.0906412106.

and

Teeling et al. (2004) Application of tetranucleotide frequencies for the assignment of genomic fragments. Env. Microbiol. 6(9): 938-947. doi:10.1111/j.1462-2920.2004.00624.x

pyani.tetra.calculate_correlations(tetra_z: Dict[str, Dict[str, float]]) → pandas.core.frame.DataFrame

Return dataframe of Pearson correlation coefficients.

Parameters: tetra_z – dict, Z-scores, keyed by sequence ID

Calculates the Pearson correlation coefficient between the tetranucleotide Z-score profiles of each pair of sequences. The calculation is done longhand here, which is fast enough, but for robustness we might want to do something else… (TODO).

Note that we report a correlation by this method, rather than a percentage identity.
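The longhand calculation described above can be sketched as follows. This is a minimal illustration, not the module's own code: the function name `pearson_from_zscores` is hypothetical, and it returns a nested dict rather than a `pandas` DataFrame so that the example needs only the standard library. It also assumes that every sequence has Z-scores for the same set of tetranucleotides.

```python
import math
from typing import Dict


def pearson_from_zscores(
    tetra_z: Dict[str, Dict[str, float]]
) -> Dict[str, Dict[str, float]]:
    """Hypothetical sketch: pairwise Pearson correlation of Z-score profiles."""
    seq_ids = sorted(tetra_z)
    # Every sequence correlates perfectly with itself, so start at 1.0
    corr = {id1: {id2: 1.0 for id2 in seq_ids} for id1 in seq_ids}
    for i, id1 in enumerate(seq_ids):
        for id2 in seq_ids[i + 1:]:
            tets = sorted(tetra_z[id1])
            # Assumption: both profiles cover the same tetranucleotides
            if tets != sorted(tetra_z[id2]):
                raise ValueError(f"Tetranucleotide sets differ: {id1}, {id2}")
            z1 = [tetra_z[id1][t] for t in tets]
            z2 = [tetra_z[id2][t] for t in tets]
            mean1 = sum(z1) / len(z1)
            mean2 = sum(z2) / len(z2)
            # Sum of products of deviations, and sums of squared deviations
            cov = sum((a - mean1) * (b - mean2) for a, b in zip(z1, z2))
            ss1 = sum((a - mean1) ** 2 for a in z1)
            ss2 = sum((b - mean2) ** 2 for b in z2)
            r = cov / math.sqrt(ss1 * ss2)
            # The correlation matrix is symmetric
            corr[id1][id2] = r
            corr[id2][id1] = r
    return corr
```

The result is a correlation in the range -1 to 1, not a percentage identity, which matches the note above.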

pyani.tetra.calculate_tetra_zscore(filename: pathlib.Path) → Dict[str, float]

Return TETRA Z-score for the sequence in the passed file.

Parameters: filename – path to sequence file

Calculates mono-, di-, tri- and tetranucleotide frequencies for each sequence, on each strand, then follows Teeling et al. (2004) in calculating a Z-score for each observed tetranucleotide frequency. Each Z-score depends on the mono-, di- and trinucleotide frequencies for that input sequence.
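As a rough illustration of this kind of calculation, the sketch below counts k-mers on both strands of a single sequence string and derives a Z-score per tetranucleotide. It is not the module's implementation: the names `count_kmers` and `tetra_zscores_sketch` are hypothetical, the input is a plain string rather than a sequence file, and the expectation and variance formulas in the comments are assumed here to be the maximal-order Markov estimates associated with Teeling et al. (2004). The sketch uses only di- and trinucleotide counts in those formulas, and it skips tetranucleotides whose estimated variance is zero; both are simplifications chosen for this example.

```python
import math
from collections import Counter
from typing import Dict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def count_kmers(seq: str, k: int) -> Counter:
    """Count unambiguous k-mers on the forward and reverse-complement strands."""
    counts: Counter = Counter()
    for strand in (seq, seq.translate(COMPLEMENT)[::-1]):
        for i in range(len(strand) - k + 1):
            kmer = strand[i:i + k]
            # Skip any k-mer containing an ambiguity symbol such as N
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


def tetra_zscores_sketch(seq: str) -> Dict[str, float]:
    """Hypothetical sketch of per-tetranucleotide Z-scores for one sequence."""
    seq = seq.upper()
    di, tri, tetra = (count_kmers(seq, k) for k in (2, 3, 4))
    zscores: Dict[str, float] = {}
    for tet, observed in tetra.items():
        n123 = tri[tet[:3]]   # count of the leading trinucleotide
        n234 = tri[tet[1:]]   # count of the trailing trinucleotide
        n23 = di[tet[1:3]]    # count of the central dinucleotide
        # Assumed formulas:
        #   expected = N(n1n2n3) * N(n2n3n4) / N(n2n3)
        #   variance = expected * (N(n2n3) - N(n1n2n3))
        #                       * (N(n2n3) - N(n2n3n4)) / N(n2n3)^2
        expected = n123 * n234 / n23
        variance = expected * (n23 - n123) * (n23 - n234) / (n23 ** 2)
        # Simplification for this sketch: omit tetranucleotides with zero variance
        if variance > 0:
            zscores[tet] = (observed - expected) / math.sqrt(variance)
    return zscores
```

Because the reverse complement is counted alongside the forward strand, a palindromic sequence such as `ACGT` contributes each of its k-mers twice.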

pyani.tetra.calculate_tetra_zscores(infilenames: Iterable[T_co]) → Dict[str, Dict[str, float]]

Return dictionary of TETRA Z-scores for each input file.

Parameters: infilenames – iterable of paths to input sequence files

pyani.tetra.tetra_clean(instr: str) → bool

Return True if string contains only unambiguous IUPAC nucleotide symbols.

Parameters: instr – str, nucleotide sequence

We are assuming that a low frequency of IUPAC ambiguity symbols doesn’t affect our calculation.
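A check of this kind can be written as a set comparison. The sketch below is illustrative only: the name `tetra_clean_sketch` is hypothetical, and it assumes the input is already uppercase, so lowercase bases are treated as not clean.

```python
def tetra_clean_sketch(instr: str) -> bool:
    """Return True if instr contains only the symbols A, C, G and T.

    Assumption for this sketch: input is uppercase; any other character,
    including IUPAC ambiguity codes such as N or R, makes the string unclean.
    """
    return set(instr) <= set("ACGT")
```

For example, `tetra_clean_sketch("ACGT")` is true, while `tetra_clean_sketch("ACNT")` is false because `N` is an ambiguity symbol.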