pyani.pyani_files module

Code to handle files for average nucleotide identity calculations.

exception pyani.pyani_files.PyaniFilesException

Bases: pyani.PyaniException

Exception raised by pyani when a file operation fails.

pyani.pyani_files.collect_existing_output(dirpath: pathlib.Path, program: str, args: argparse.Namespace) → List[pathlib.Path]

Return a list of existing output files at dirpath.

Parameters:
  • dirpath – Path, path to existing output directory
  • program – str, name of program to use for comparisons
  • args – Namespace, command-line arguments for the run

pyani.pyani_files.get_fasta_and_hash_paths(dirname: pathlib.Path = PosixPath('.')) → List[Tuple[pathlib.Path, pathlib.Path]]

Return a list of (FASTA file, hash file) tuples in passed directory.

Parameters: dirname – Path, path to input directory

Raises an IOError if the hash file corresponding to a FASTA file does not exist.
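
The pairing behaviour described above can be sketched without importing pyani. This is a minimal illustration, not the library's implementation: the helper name pair_fasta_with_hashes, the FASTA suffix list, and the convention that each hash file shares the FASTA file's stem with an ".md5" suffix are all assumptions made for the example.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple

# Assumptions (not confirmed by this page): which suffixes count as FASTA,
# and that the hash file is "<stem>.md5" next to the FASTA file.
FASTA_SUFFIXES = (".fna", ".fa", ".fasta", ".fas")


def pair_fasta_with_hashes(dirname: Path = Path(".")) -> List[Tuple[Path, Path]]:
    """Hypothetical helper: return sorted (FASTA path, hash path) tuples."""
    pairs = []
    fastas = sorted(
        p for p in dirname.iterdir() if p.is_file() and p.suffix in FASTA_SUFFIXES
    )
    for fasta in fastas:
        hashfile = fasta.with_suffix(".md5")
        if not hashfile.is_file():
            # Mirrors the documented behaviour: a missing hash file is an error.
            raise IOError(f"No hash file found for {fasta}")
        pairs.append((fasta, hashfile))
    return pairs


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        indir = Path(tmp)
        (indir / "genome_A.fna").write_text(">contig1\nACGT\n")
        (indir / "genome_A.md5").write_text("0123abcd\tgenome_A.fna\n")
        print(pair_fasta_with_hashes(indir))
```

Raising as soon as one hash file is missing keeps the returned list trustworthy: every FASTA path in it is guaranteed to have a companion hash path.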

pyani.pyani_files.get_fasta_files(dirname: pathlib.Path = PosixPath('.')) → List[pathlib.Path]

Return a list of FASTA files in the passed directory.

Parameters: dirname – Path, path to input directory

pyani.pyani_files.get_fasta_paths(dirname: pathlib.Path = PosixPath('.'), extlist: Optional[List[T]] = None) → List[pathlib.Path]

Return a list of paths to files matching a list of FASTA extensions.

Parameters:
  • dirname – Path, path to directory containing input FASTA files
  • extlist – List, file suffixes for FASTA files

Returns a sorted list containing the full path to each matching file.

pyani.pyani_files.get_input_files(dirname: pathlib.Path, *ext) → List[pathlib.Path]

Return sorted files in passed directory, filtered by extension.

Parameters:
  • dirname – Path, path to input directory
  • *ext – optional iterable of arguments describing permitted file extensions
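
As an illustration of filtering a directory by a variadic list of extensions, the following stand-alone sketch uses only pathlib. The helper name list_files_by_extension is hypothetical, and the example assumes that extensions are passed with their leading dot (for example ".fna"); neither detail is confirmed by this page.

```python
import tempfile
from pathlib import Path
from typing import List


def list_files_by_extension(dirname: Path, *ext: str) -> List[Path]:
    """Hypothetical helper: sorted files in dirname whose suffix is in ext."""
    # Assumption: each extension includes the leading dot, matching Path.suffix.
    return sorted(p for p in dirname.iterdir() if p.is_file() and p.suffix in ext)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        indir = Path(tmp)
        for name in ("b.fna", "a.fasta", "notes.txt"):
            (indir / name).write_text("")
        # Only the two FASTA-like files are returned, in sorted order.
        print([p.name for p in list_files_by_extension(indir, ".fna", ".fasta")])
```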

pyani.pyani_files.get_sequence_lengths(fastafilenames: Iterable[pathlib.Path]) → Dict[str, int]

Return dictionary of sequence lengths, keyed by organism.

Parameters: fastafilenames – Iterable[Path], paths to input FASTA files

Biopython’s SeqIO module is used to parse all sequences in the FASTA file corresponding to each organism, and the total base count in each is obtained.

NOTE: ambiguity symbols are not discounted.
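
The counting rule above (every symbol contributes to the total, ambiguity codes included) can be sketched in plain Python. Note that this example deliberately does not use Biopython, so it is a stand-in for the SeqIO-based parsing rather than the library's code. The helper name total_sequence_lengths and the choice of the file stem as the "organism" key are assumptions.

```python
import tempfile
from pathlib import Path
from typing import Dict, Iterable


def total_sequence_lengths(fastafilenames: Iterable[Path]) -> Dict[str, int]:
    """Hypothetical helper: total symbol count per FASTA file."""
    lengths: Dict[str, int] = {}
    for fname in fastafilenames:
        total = 0
        with fname.open() as handle:
            for line in handle:
                if not line.startswith(">"):
                    # Every symbol counts, including ambiguity codes such as N.
                    total += len(line.strip())
        # Assumption: the organism key is the file name without its suffix.
        lengths[fname.stem] = total
    return lengths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        fasta = Path(tmp) / "genome_A.fna"
        fasta.write_text(">contig1\nACGTN\n>contig2\nACG\n")
        print(total_sequence_lengths([fasta]))  # {'genome_A': 8}
```

In the demo, the two records contribute 5 and 3 symbols, and the N in the first record is counted like any other base.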

pyani.pyani_files.load_classes_labels(path: pathlib.Path) → Dict[str, str]

Return a dictionary of genome classes or labels keyed by hash.

Parameters: path – Path, path to classes or labels file

The expected format of the classes and labels files is: <HASH>\t<FILESTEM>\t<CLASS>|<LABEL>, where <HASH> is the MD5 hash of the genome data (this is not checked); <FILESTEM> is the path to the genome file (this is intended as a record for humans to audit, and is not needed for the database interaction); and <CLASS>|<LABEL> is the class or label associated with that genome.
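
A minimal sketch of reading a file in this three-column, tab-separated layout is shown below. It is an illustration under stated assumptions rather than pyani's own code: the helper name parse_classes_or_labels is hypothetical, blank lines are skipped, and the middle column is read but ignored.

```python
import tempfile
from pathlib import Path
from typing import Dict


def parse_classes_or_labels(path: Path) -> Dict[str, str]:
    """Hypothetical helper: map genome hash -> class or label."""
    data: Dict[str, str] = {}
    with path.open() as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # assumption: blank lines are tolerated and skipped
            # Columns: hash, file stem (audit record only), class or label.
            genome_hash, _filestem, value = line.split("\t", 2)
            data[genome_hash] = value
    return data


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        labels = Path(tmp) / "labels.txt"
        labels.write_text("abc123\tgenomes/A.fna\tclade_1\n")
        print(parse_classes_or_labels(labels))  # {'abc123': 'clade_1'}
```

Because the hash is not checked against the genome data, a sketch like this will accept any string in the first column.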

pyani.pyani_files.read_fasta_description(filename: pathlib.Path) → str

Return the first description string from a FASTA file.

Parameters: filename – Path, path to FASTA file

pyani.pyani_files.read_hash_string(filename: pathlib.Path) → Tuple[str, str]

Return the hash and file strings from the passed hash file.

Parameters: filename – Path, path to file containing hash string
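
The layout of the hash file is not specified on this page. The sketch below assumes a single record of the form "<hash><whitespace><filename>", similar to md5sum output, and splits it into the two strings. The helper name read_hash_file is hypothetical, and file names containing spaces would not survive this simple split.

```python
import tempfile
from pathlib import Path
from typing import Tuple


def read_hash_file(filename: Path) -> Tuple[str, str]:
    """Hypothetical helper: return (hash, file name) from a hash file."""
    # Assumption: the file holds "<hash> <filename>" separated by whitespace.
    fields = filename.read_text().split()
    return fields[0], fields[1]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        hashfile = Path(tmp) / "genome_A.md5"
        hashfile.write_text("d41d8cd98f00b204e9800998ecf8427e  genome_A.fna\n")
        print(read_hash_file(hashfile))
```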