pyani.download module

Module providing functions useful for downloading genomes from NCBI.

class pyani.download.ASMIDs[source]

Bases: tuple

Matching Assembly ID information for a query taxID.

asm_ids

Alias for field number 2

query

Alias for field number 0

result_count

Alias for field number 1

class pyani.download.Classification[source]

Bases: tuple

Taxonomic classification for an isolate.

genus

Alias for field number 1

organism

Alias for field number 0

species

Alias for field number 2

strain

Alias for field number 3
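
The "Alias for field number N" entries give each field's position in the underlying tuple. As a minimal sketch, assuming only the field order listed above, the stand-in below (built with collections.namedtuple, not imported from pyani) shows that a field can be read by name or by index. The values are illustrative.

```python
from collections import namedtuple

# Stand-in with the field order documented above; this is not the
# class defined in pyani.download, only an illustration of its layout.
Classification = namedtuple("Classification", "organism genus species strain")

isolate = Classification(
    organism="Escherichia coli K-12",  # illustrative values only
    genus="Escherichia",
    species="coli",
    strain="K-12",
)

# Access by name and access by position return the same object
print(isolate.genus, isolate[1])
```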

class pyani.download.DLFileData[source]

Bases: tuple

Convenience struct for file download data.

filestem

Alias for field number 0

ftpstem

Alias for field number 1

suffix

Alias for field number 2

class pyani.download.DLStatus(url: str, hashurl: str, outfname: pathlib.Path, outfhash: pathlib.Path, skipped: bool, error: Optional[str] = None)[source]

Bases: object

Download status data.

exception pyani.download.FileExistsException(msg: str = 'Specified file exists')[source]

Bases: Exception

A specified file exists.

class pyani.download.Hashstatus[source]

Bases: tuple

Status report on file hash comparison.

filehash

Alias for field number 2

localhash

Alias for field number 1

passed

Alias for field number 0

exception pyani.download.NCBIDownloadException(msg: str = 'Error downloading file from NCBI')[source]

Bases: Exception

General exception for failed NCBI download.

exception pyani.download.PyaniIndexException[source]

Bases: Exception

General exception for indexing with pyani.

pyani.download.check_hash(fname: pathlib.Path, hashfile: pathlib.Path) → pyani.download.Hashstatus[source]

Check MD5 of passed file against downloaded NCBI hash file.

Parameters:
  • fname – Path, path to the local file whose MD5 hash is checked
  • hashfile – Path, path to NCBI hash file

pyani.download.compile_url(filestem: str, suffix: str, ftpstem: str) → Tuple[str, str][source]

Compile download URLs given a passed filestem.

Parameters:
  • filestem
  • suffix
  • ftpstem

The filestem corresponds to <AA>_<AN>, where <AA> is the AssemblyAccession and <AN> the AssemblyName data field in the eSummary record. The downloadable files for each assembly are found at ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GC[AF]/nnn/nnn/nnn/<AA>_<AN>/. The choice of GCA vs GCF, and the values of nnn, are derived from <AA>.

The files in this directory all have the stem <AA>_<AN>_<suffix>, where the suffixes are: assembly_report.txt, assembly_stats.txt, feature_table.txt.gz, genomic.fna.gz, genomic.gbff.gz, genomic.gff.gz, protein.faa.gz, protein.gpff.gz, rm_out.gz, rm.run, wgsmaster.gbff.gz.
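
The directory layout described above can be sketched as follows. This is an illustration, not the body of compile_url: the function name sketch_compile_url is hypothetical, ftpstem is assumed to carry no trailing slash, and the second URL assumes the checksum list is the md5checksums.txt file in the same directory. The filestem in the example call is illustrative.

```python
import re
from typing import Tuple


def sketch_compile_url(filestem: str, suffix: str, ftpstem: str) -> Tuple[str, str]:
    """Build (file URL, checksum URL) for a filestem of the form <AA>_<AN>."""
    # <AA> starts with GCA or GCF, followed by an underscore and nine digits
    match = re.match(r"(GC[AF])_(\d{3})(\d{3})(\d{3})", filestem)
    if match is None:
        raise ValueError(f"Unexpected filestem: {filestem}")
    gc_prefix, *triplets = match.groups()
    # GC[AF]/nnn/nnn/nnn/<AA>_<AN>/ below the FTP stem
    directory = "/".join([ftpstem, gc_prefix, *triplets, filestem])
    file_url = f"{directory}/{filestem}_{suffix}"
    # Assumption: checksums live alongside the data files
    hash_url = f"{directory}/md5checksums.txt"
    return file_url, hash_url


url, hashurl = sketch_compile_url(
    "GCF_000011605.1_ASM1160v1",  # illustrative filestem
    "genomic.fna.gz",
    "ftp://ftp.ncbi.nlm.nih.gov/genomes/all",
)
print(url)
print(hashurl)
```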

pyani.download.construct_output_paths(filestem: str, suffix: str, outdir: pathlib.Path) → Tuple[pathlib.Path, pathlib.Path][source]

Construct paths to output files for genome and hash.

Parameters:
  • filestem – str, output filename stem
  • suffix – str, output filename suffix
  • outdir – Path, path to output directory

pyani.download.create_hash(fname: pathlib.Path) → str[source]

Return MD5 hash of the passed file contents.

Parameters: fname – Path, path to file for hashing

We can ignore the Bandit B303 error as we’re not using the hash for cryptographic purposes.
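
A minimal sketch of hashing a file's contents with MD5 using the standard library. The name sketch_md5 and the block size are assumptions for illustration. Reading in blocks is a choice made here to keep memory use flat for large genome files, and may differ from what create_hash does.

```python
import hashlib
from pathlib import Path


def sketch_md5(fname: Path, bufsize: int = 65536) -> str:
    """Return the hex MD5 digest of a file, read in fixed-size blocks."""
    hasher = hashlib.md5()  # integrity check only, not a security measure
    with fname.open("rb") as ifh:
        # iter() with a sentinel stops when read() returns empty bytes
        for block in iter(lambda: ifh.read(bufsize), b""):
            hasher.update(block)
    return hasher.hexdigest()
```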

pyani.download.create_labels(classification: pyani.download.Classification, filestem: str, genomehash: str) → Tuple[str, str][source]

Return class and label text from UID classification.

Parameters:
  • classification – Classification named tuple (org, genus, species, strain)
  • filestem – str, filestem of input genome file
  • genomehash – str, MD5 hash of genome data

The ‘class’ data is the organism as provided in the passed Classification named tuple; the ‘label’ data is genus, species and strain information from the same tuple. The label is intended to be human-readable, the class data to be a genuine class identifier.

Returns a tuple of two strings: (label, class).

The two strings are tab-separated strings: <HASH>\t<FILE>\t<CLASS/LABEL>. The hash is used to help uniquely identify the genome in the database (label/class is unique by a combination of hash and run ID).
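
The tab-separated layout can be illustrated with a short sketch. It assumes the label is genus, species and strain joined by single spaces, which the description above does not specify. Classification here is a local stand-in, and sketch_labels is a hypothetical name.

```python
from collections import namedtuple
from typing import Tuple

# Local stand-in for the documented named tuple (same field order)
Classification = namedtuple("Classification", "organism genus species strain")


def sketch_labels(
    classification: Classification, filestem: str, genomehash: str
) -> Tuple[str, str]:
    """Return (label line, class line), each as hash, file, text joined by tabs."""
    # Assumption: a plain space-joined label; the real format may differ
    label = " ".join(
        [classification.genus, classification.species, classification.strain]
    )
    labeltxt = "\t".join([genomehash, filestem, label])
    classtxt = "\t".join([genomehash, filestem, classification.organism])
    return labeltxt, classtxt
```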

pyani.download.download_genome_and_hash(outdir: pathlib.Path, timeout: int, dlfiledata: pyani.download.DLFileData, dltype: str = 'RefSeq', disable_tqdm: bool = False) → namedlist.namedlist[source]

Download genome and accompanying MD5 hash from NCBI.

Parameters:
  • outdir – Path to output directory for downloads
  • timeout – int: timeout for download attempt
  • dlfiledata – namedtuple of info for file to download
  • dltype – reference database to use: RefSeq or GenBank
  • disable_tqdm – disable progress bar

This function first tries the RefSeq FTP URL (which is assumed to have been passed in) and, if that download fails, attempts to download the corresponding GenBank data.

We attempt to gracefully skip genomes with download errors.

pyani.download.download_url(url: str, outfname: pathlib.Path, timeout: int, disable_tqdm: bool = False) → None[source]

Download the contents of a remote URL to a local file.

Parameters:
  • url – URL of remote file for download
  • outfname – Path, path to write output
  • timeout
  • disable_tqdm – Boolean, disable the tqdm progress bar

This function downloads the contents of the passed URL to the passed filename, in buffered chunks.
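
Writing in buffered chunks can be sketched as below. To keep the example free of network access it copies from any binary stream. The name sketch_write_chunks and the default block size are assumptions, and the progress bar is left out.

```python
from pathlib import Path
from typing import BinaryIO


def sketch_write_chunks(stream: BinaryIO, outfname: Path, bufsize: int = 1048576) -> int:
    """Copy a binary stream to outfname block by block; return bytes written."""
    written = 0
    with outfname.open("wb") as ofh:
        while True:
            chunk = stream.read(bufsize)
            if not chunk:  # an empty read marks the end of the stream
                break
            ofh.write(chunk)
            written += len(chunk)
    return written
```

The response object returned by urllib.request.urlopen(url, timeout=timeout) supports read(size), so it could be passed as the stream argument.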

pyani.download.entrez_batch(func)[source]

Decorator to compile batches from the wrapped function into a single set of results.

The entrez_batch decorator should go outside the entrez_retry decorator.

pyani.download.entrez_batched_webhistory(*args, expected=None, batchsize=None, **kwargs)[source]

pyani.download.entrez_esearch(*args, retries=1, **kwargs)[source]

pyani.download.entrez_esummary(*args, retries=1, **kwargs)[source]

pyani.download.entrez_retry(func)[source]

Decorator to retry the wrapped function up to ‘retries’ times.
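
A retry decorator of this kind can be sketched as follows. This is an illustration of the pattern rather than the entrez_retry source: the name sketch_retry is hypothetical, and catching every Exception and raising RuntimeError after the last attempt are choices made for this example. The retries keyword mirrors the retries=1 default seen in the wrapped function signatures above.

```python
import functools


def sketch_retry(func):
    """Call func up to `retries` times; raise if every attempt fails."""

    @functools.wraps(func)
    def wrapper(*args, retries=1, **kwargs):
        last_error = None
        for _ in range(retries):
            try:
                return func(*args, **kwargs)
            except Exception as err:  # a real client would catch narrower errors
                last_error = err
        # All attempts failed: chain the final error for debugging
        raise RuntimeError(f"Failed after {retries} attempts") from last_error

    return wrapper
```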

pyani.download.extract_contigs(fname: pathlib.Path, ename: pathlib.Path) → subprocess.CompletedProcess[source]

Extract contents of fname to ename using gunzip.

Parameters:
  • fname – Path, path to input compressed file
  • ename – Path, path to output uncompressed file

Returns the status of the subprocess call.

pyani.download.extract_filestem(esummary) → str[source]

Extract filestem from Entrez eSummary data.

Parameters: esummary

The function expects esummary['DocumentSummarySet']['DocumentSummary'][0] as its argument.

Some illegal characters may occur in AssemblyName - for these, a more robust regex replace/escape may be required. Sadly, NCBI don’t just use standard percent escapes, but instead replace certain characters with underscores: white space, slash, comma, hash, brackets.
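
The underscore replacement described above can be sketched with a single regular expression. The name sketch_filestem and its two-argument form are hypothetical, and "brackets" is assumed here to mean round brackets only.

```python
import re


def sketch_filestem(accession: str, name: str) -> str:
    """Join accession and assembly name, replacing awkward characters."""
    # White space, slash, comma, hash and round brackets become underscores
    safe_name = re.sub(r"[\s/,#()]", "_", name)
    return f"{accession}_{safe_name}"
```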

pyani.download.extract_hash(hashfile: pathlib.Path, name: str) → str[source]

Return MD5 hash from file of name:MD5 hashes.

Parameters:
  • hashfile – Path, path to file containing name:MD5 pairs
  • name – str, name associated with hash
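
Looking up one entry in a checksum file can be sketched as below. The line layout is an assumption: each line is taken to hold an MD5 string followed by a file path, separated by white space. The name sketch_extract_hash and the ValueError raised for a missing entry are also choices made for this example.

```python
from pathlib import Path


def sketch_extract_hash(hashfile: Path, name: str) -> str:
    """Return the MD5 listed for `name`, or raise if it is absent."""
    with hashfile.open("r") as hfh:
        for line in hfh:
            fields = line.split()
            # Compare on the final path component so "./name" matches "name"
            if len(fields) == 2 and Path(fields[1]).name == name:
                return fields[0]
    raise ValueError(f"No hash found for {name}")
```
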
pyani.download.get_asm_uids(taxon_uid: str, retries: int) → pyani.download.ASMIDs[source]

Return set of NCBI UIDs associated with the passed taxon UID.

Parameters:
  • taxon_uid – str, NCBI taxID for taxon to download
  • retries – int, number of download retry attempts

This query at NCBI returns all assemblies for the taxon subtree rooted at the passed taxon_uid.

pyani.download.get_ncbi_classification(esummary) → pyani.download.Classification[source]

Return organism, genus, species, strain info from eSummary data.

Parameters: esummary

pyani.download.get_ncbi_esummary(asm_uid, retries, api_key=None) → Tuple[source]

Obtain full eSummary info for the passed assembly UID.

Parameters:
  • asm_uid
  • retries
  • api_key

pyani.download.last_exception() → str[source]

Return last exception as a string.

pyani.download.make_asm_dict(taxon_ids: List[str], retries: int) → Dict[KT, VT][source]

Return a dict of assembly UIDs, keyed by passed taxon IDs.

Parameters:
  • taxon_ids
  • retries

Takes the passed list of taxon IDs and calls get_asm_uids to generate a dictionary linking each taxon ID to a list of assembly IDs at NCBI.

pyani.download.retrieve_genome_and_hash(filestem: str, suffix: str, ftpstem: str, outdir: pathlib.Path, timeout: int, disable_tqdm: bool = False) → pyani.download.DLStatus[source]

Download genome contigs and MD5 hash data from NCBI.

Parameters:
  • filestem
  • suffix
  • ftpstem
  • outdir
  • timeout
  • disable_tqdm – Boolean, disable the tqdm progress bar

pyani.download.set_ncbi_email(email: str) → None[source]

Set contact email for NCBI.

Parameters: email – str, email address to give to Entrez at NCBI

pyani.download.split_taxa(taxa: str) → List[str][source]

Return list of taxon ids from the passed comma-separated list.

Parameters: taxa – str, comma-separated list of valid NCBI taxonomy IDs

The function checks the passed taxon argument against a regular expression that permits comma-separated numerical symbols only.
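
The validation step can be sketched as below. The exact pattern used by split_taxa is not shown on this page, so the regular expression, the ValueError and the name sketch_split_taxa are assumptions for illustration.

```python
import re
from typing import List


def sketch_split_taxa(taxa: str) -> List[str]:
    """Split a comma-separated string of numeric taxon IDs."""
    # One or more digit runs separated by single commas, and nothing else
    if re.fullmatch(r"\d+(,\d+)*", taxa) is None:
        raise ValueError(f"Invalid taxon ID list: {taxa!r}")
    return taxa.split(",")
```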