pyani.download module¶

Module providing functions useful for downloading genomes from NCBI.

class pyani.download.ASMIDs[source]¶

Bases: tuple

Matching Assembly ID information for a query taxID.

asm_ids¶: Alias for field number 2

query¶: Alias for field number 0

result_count¶: Alias for field number 1

class pyani.download.Classification[source]¶

Bases: tuple

Taxonomic classification for an isolate.

genus¶: Alias for field number 1

organism¶: Alias for field number 0

species¶: Alias for field number 2

strain¶: Alias for field number 3

class pyani.download.DLFileData[source]¶

Bases: tuple

Convenience struct for file download data.

filestem¶: Alias for field number 0

ftpstem¶: Alias for field number 1

suffix¶: Alias for field number 2

class pyani.download.DLStatus(url: str, hashurl: str, outfname: pathlib.Path, outfhash: pathlib.Path, skipped: bool, error: Optional[str] = None)[source]¶

Bases: object

Download status data.

exception pyani.download.FileExistsException(msg: str = 'Specified file exists')[source]¶

Bases: Exception

A specified file exists.

class pyani.download.Hashstatus[source]¶

Bases: tuple

Status report on file hash comparison.

filehash¶: Alias for field number 2

localhash¶: Alias for field number 1

passed¶: Alias for field number 0

exception pyani.download.NCBIDownloadException(msg: str = 'Error downloading file from NCBI')[source]¶

Bases: Exception

General exception for failed NCBI download.

exception pyani.download.PyaniIndexException[source]¶

Bases: Exception

General exception for indexing with pyani.

pyani.download.check_hash(fname: pathlib.Path, hashfile: pathlib.Path) → pyani.download.Hashstatus[source]¶

Check MD5 of passed file against downloaded NCBI hash file.

Parameters:
    fname – Path, path to the local file whose MD5 hash is checked
    hashfile – Path, path to NCBI hash file

pyani.download.compile_url(filestem: str, suffix: str, ftpstem: str) → Tuple[str, str][source]¶

Compile download URLs given a passed filestem.

Parameters:
    filestem – str, filestem of the form <AA>_<AN>
    suffix – str, suffix of the file to download
    ftpstem – str, stem of the FTP URL

The filestem has the form <AA>_<AN>, where <AA> is the AssemblyAccession and <AN> is the AssemblyName data field in the eSummary record. The downloadable files for each assembly are located at ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GC[AF]/nnn/nnn/nnn/<AA>_<AN>/. The choice of GCA vs GCF, and the values of nnn, are derived from <AA>.

The files in this directory all have the stem <AA>_<AN>_<suffix>, where the suffix is one of: assembly_report.txt, assembly_stats.txt, feature_table.txt.gz, genomic.fna.gz, genomic.gbff.gz, genomic.gff.gz, protein.faa.gz, protein.gpff.gz, rm_out.gz, rm.run, wgsmaster.gbff.gz.
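The path derivation described above can be sketched as follows. This is a minimal illustration, not pyani's implementation: the function name sketch_compile_url, the example accession used to exercise it, and the assumption that the hash file in each assembly directory is named md5checksums.txt are all introduced here for illustration.

```python
from typing import Tuple


def sketch_compile_url(filestem: str, suffix: str, ftpstem: str) -> Tuple[str, str]:
    """Hypothetical helper: build genome and hash URLs from <AA>_<AN>."""
    # <AA> looks like GCF_000005845.2: a GCA/GCF prefix, then the accession
    prefix, accession, _assembly_name = filestem.split("_", 2)
    digits = accession.split(".")[0]  # drop the version suffix, e.g. ".2"
    # Split the digits into the nnn/nnn/nnn directory levels
    subdirs = "/".join(digits[i : i + 3] for i in range(0, len(digits), 3))
    base = f"{ftpstem}/{prefix}/{subdirs}/{filestem}"
    # Assumption: the hash file is named md5checksums.txt
    return f"{base}/{filestem}_{suffix}", f"{base}/md5checksums.txt"


url, hashurl = sketch_compile_url(
    "GCF_000005845.2_ASM584v2",
    "genomic.fna.gz",
    "ftp://ftp.ncbi.nlm.nih.gov/genomes/all",
)
print(url)
print(hashurl)
```

Both URLs share the same assembly directory, so a sketch like this only needs to derive the directory once.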

pyani.download.construct_output_paths(filestem: str, suffix: str, outdir: pathlib.Path) → Tuple[pathlib.Path, pathlib.Path][source]¶

Construct paths to output files for genome and hash.

Parameters:
    filestem – str, output filename stem
    suffix – str, output filename suffix
    outdir – Path, path to output directory

pyani.download.create_hash(fname: pathlib.Path) → str[source]¶

Return MD5 hash of the passed file contents.

Parameters:	fname – Path, path to file for hashing

We can ignore the Bandit B303 error as we’re not using the hash for cryptographic purposes.
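A minimal sketch of hashing a file's contents with MD5, assuming the file is read in fixed-size blocks so that large genome files need not be held in memory. The name sketch_create_hash and the blocksize parameter are hypothetical; whether pyani reads the file in blocks is not stated here.

```python
import hashlib
from pathlib import Path


def sketch_create_hash(fname: Path, blocksize: int = 65536) -> str:
    """Hypothetical helper: MD5 hex digest of a file, read in blocks."""
    hasher = hashlib.md5()  # integrity check only, not a security measure
    with fname.open("rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes
        for block in iter(lambda: handle.read(blocksize), b""):
            hasher.update(block)
    return hasher.hexdigest()
```

The digest is the same regardless of the block size chosen, since MD5 is computed incrementally over the byte stream.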

pyani.download.create_labels(classification: pyani.download.Classification, filestem: str, genomehash: str) → Tuple[str, str][source]¶

Return class and label text from UID classification.

Parameters:
    classification – Classification named tuple (org, genus, species, strain)
    filestem – str, filestem of input genome file
    genomehash – str, MD5 hash of genome data

The ‘class’ data is the organism as provided in the passed Classification named tuple; the ‘label’ data is genus, species and strain information from the same tuple. The label is intended to be human-readable, the class data to be a genuine class identifier.

Returns a tuple of two strings: (label, class).

The two strings are tab-separated strings: <HASH>\t<FILE>\t<CLASS/LABEL>. The hash is used to help uniquely identify the genome in the database (label/class is unique by a combination of hash and run ID).
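The layout of the two returned strings can be illustrated with a short sketch. The function name sketch_create_labels is hypothetical, and the way genus, species and strain are joined into the label (single spaces here) is an assumption; only the tab-separated hash, file, and class/label layout comes from the description above.

```python
from collections import namedtuple
from typing import Tuple

# Field order follows the Classification aliases documented in this module
Classification = namedtuple("Classification", "organism genus species strain")


def sketch_create_labels(
    classification: Classification, filestem: str, genomehash: str
) -> Tuple[str, str]:
    """Hypothetical helper: return (label, class) as tab-separated strings."""
    # Assumption: the label joins genus, species and strain with spaces
    label = " ".join(
        [classification.genus, classification.species, classification.strain]
    )
    prefix = f"{genomehash}\t{filestem}"
    return f"{prefix}\t{label}", f"{prefix}\t{classification.organism}"
```

Because both strings start with the same hash and filestem, they can be matched back to the same genome when read from separate label and class files.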

pyani.download.download_genome_and_hash(outdir: pathlib.Path, timeout: int, dlfiledata: pyani.download.DLFileData, dltype: str = 'RefSeq', disable_tqdm: bool = False) → namedlist.namedlist[source]¶

Download genome and accompanying MD5 hash from NCBI.

Parameters:
    outdir – Path to output directory for downloads
    timeout – int: timeout for download attempt
    dlfiledata – namedtuple of info for file to download
    dltype – reference database to use: RefSeq or GenBank
    disable_tqdm – disable progress bar

This function tries the (assumed to be passed) RefSeq FTP URL first and, if that fails, then attempts to download the corresponding GenBank data.

We attempt to gracefully skip genomes with download errors.

pyani.download.download_url(url: str, outfname: pathlib.Path, timeout: int, disable_tqdm: bool = False) → None[source]¶

Download remote URL to a local directory.

Parameters:
    url – URL of remote file for download
    outfname – Path, path to write output
    timeout – int, timeout for download attempt
    disable_tqdm – bool, disable the tqdm progress bar

This function downloads the contents of the passed URL to the passed filename, in buffered chunks.
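A minimal sketch of a buffered, chunked download using only the standard library. The name sketch_download_url and the chunksize parameter are hypothetical, the progress bar is omitted, and the use of urllib here is an assumption for illustration rather than a statement of how pyani performs the transfer.

```python
import shutil
import urllib.request
from pathlib import Path


def sketch_download_url(
    url: str, outfname: Path, timeout: int, chunksize: int = 1_048_576
) -> None:
    """Hypothetical helper: stream a URL to a local file in chunks."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        with outfname.open("wb") as outfh:
            # copyfileobj reads and writes at most chunksize bytes at a time
            shutil.copyfileobj(response, outfh, chunksize)
```

Streaming in chunks keeps memory use bounded by the chunk size rather than by the size of the remote file.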

pyani.download.entrez_batch(func)[source]¶

Decorator to compile batches from the wrapped function into a single set of results.

The entrez_batch decorator should go outside the entrez_retry decorator.

pyani.download.entrez_batched_webhistory(*args, expected=None, batchsize=None, **kwargs)[source]¶

pyani.download.entrez_esearch(*args, retries=1, **kwargs)[source]¶

pyani.download.entrez_esummary(*args, retries=1, **kwargs)[source]¶

pyani.download.entrez_retry(func)[source]¶: Decorator to retry the wrapped function up to ‘retries’ times.
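The retry pattern can be sketched as a decorator that accepts a retries keyword, matching the retries=1 default shown in the entrez_esearch and entrez_esummary signatures. The name sketch_retry is hypothetical, and raising RuntimeError once all attempts fail is an assumption; the exception pyani raises in that case is not stated here.

```python
import functools


def sketch_retry(func):
    """Hypothetical decorator: call func up to `retries` times."""

    @functools.wraps(func)
    def wrapper(*args, retries=1, **kwargs):
        last_error = None
        for _attempt in range(retries):
            try:
                return func(*args, **kwargs)
            except Exception as err:  # deliberately broad in this sketch
                last_error = err
        # Assumption: signal exhaustion with RuntimeError, chaining the cause
        raise RuntimeError(
            f"{func.__name__} failed after {retries} attempts"
        ) from last_error

    return wrapper
```

The retries keyword is consumed by the wrapper and is not forwarded to the wrapped function.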

pyani.download.extract_contigs(fname: pathlib.Path, ename: pathlib.Path) → subprocess.CompletedProcess[source]¶

Extract contents of fname to ename using gunzip.

Parameters:
    fname – Path, path to input compressed file
    ename – Path, path to output uncompressed file

Returns the status of the subprocess call.

pyani.download.extract_filestem(esummary) → str[source]¶

Extract filestem from Entrez eSummary data.

Parameters:	esummary –

The function expects esummary['DocumentSummarySet']['DocumentSummary'][0].

Some illegal characters may occur in AssemblyName - for these, a more robust regex replace/escape may be required. Sadly, NCBI don’t just use standard percent escapes, but instead replace certain characters with underscores: white space, slash, comma, hash, brackets.
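The underscore substitution described above can be sketched with a single regular expression. The name sketch_filestem and its two arguments are hypothetical, and "brackets" is read here as round brackets only, which is an assumption.

```python
import re


def sketch_filestem(accession: str, name: str) -> str:
    """Hypothetical helper: join accession and escaped assembly name."""
    # Replace whitespace, slash, comma, hash and round brackets with "_"
    escaped = re.sub(r"[\s/,#()]", "_", name)
    return f"{accession}_{escaped}"
```

Each matched character is replaced individually, so adjacent special characters produce consecutive underscores.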

pyani.download.extract_hash(hashfile: pathlib.Path, name: str) → str[source]¶

Return MD5 hash from file of name:MD5 hashes.

Parameters:
    hashfile – Path, path to file containing name:MD5 pairs
    name – str, name associated with hash
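A minimal sketch of looking up one hash in a checksum file. It assumes each line holds an MD5 digest followed by a file path, separated by whitespace; that layout, the name sketch_extract_hash, and the choice to raise KeyError when no entry matches are all assumptions made for illustration.

```python
from pathlib import Path


def sketch_extract_hash(hashfile: Path, name: str) -> str:
    """Hypothetical helper: find the MD5 listed for `name` in a checksum file."""
    with hashfile.open() as handle:
        for line in handle:
            fields = line.split()
            # Assumption: each line is "<md5> <path>"; compare file names only
            if len(fields) == 2 and Path(fields[1]).name == name:
                return fields[0]
    # Assumption: a missing entry is reported with KeyError
    raise KeyError(f"no hash found for {name}")
```

Comparing only the final path component means a leading "./" in the checksum file does not prevent a match.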

pyani.download.get_asm_uids(taxon_uid: str, retries: int) → pyani.download.ASMIDs[source]¶

Return set of NCBI UIDs associated with the passed taxon UID.

Parameters:
    taxon_uid – str, NCBI taxID for taxon to download
    retries – int, number of download retry attempts

This query at NCBI returns all assemblies for the taxon subtree rooted at the passed taxon_uid.

pyani.download.get_ncbi_classification(esummary) → pyani.download.Classification[source]¶

Return organism, genus, species, strain info from eSummary data.

Parameters:	esummary –

pyani.download.get_ncbi_esummary(asm_uid, retries, api_key=None) → Tuple[source]¶

Obtain full eSummary info for the passed assembly UID.

Parameters:	asm_uid – retries – api_key –

pyani.download.last_exception() → str[source]¶: Return last exception as a string.

pyani.download.make_asm_dict(taxon_ids: List[str], retries: int) → Dict[KT, VT][source]¶

Return a dict of assembly UIDs, keyed by passed taxon IDs.

Parameters:
    taxon_ids – List[str], taxon IDs to look up
    retries – int, number of retry attempts

Takes the passed list of taxon IDs and calls get_asm_uids to generate a dictionary linking each taxon ID to a list of assembly IDs at NCBI.

pyani.download.retrieve_genome_and_hash(filestem: str, suffix: str, ftpstem: str, outdir: pathlib.Path, timeout: int, disable_tqdm: bool = False) → pyani.download.DLStatus[source]¶

Download genome contigs and MD5 hash data from NCBI.

Parameters:
    filestem – str
    suffix – str
    ftpstem – str
    outdir – Path, path to output directory
    timeout – int, timeout for download attempt
    disable_tqdm – bool, disable the tqdm progress bar

pyani.download.set_ncbi_email(email: str) → None[source]¶

Set contact email for NCBI.

Parameters:	email – str, email address to give to Entrez at NCBI

pyani.download.split_taxa(taxa: str) → List[str][source]¶

Return list of taxon ids from the passed comma-separated list.

Parameters:	taxa – str, comma-separated list of valid NCBI taxonomy IDs

The function checks the passed taxon argument against a regular expression that permits comma-separated numerical symbols only.
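The validation step can be sketched as follows. The name sketch_split_taxa, the exact pattern, and the use of ValueError for rejected input are assumptions; pyani's own regular expression and error handling may differ.

```python
import re
from typing import List


def sketch_split_taxa(taxa: str) -> List[str]:
    """Hypothetical helper: validate and split a comma-separated ID list."""
    # One or more digit runs separated by single commas, and nothing else
    if re.fullmatch(r"\d+(,\d+)*", taxa) is None:
        raise ValueError(f"invalid taxon ID list: {taxa!r}")
    return taxa.split(",")
```

Validating the whole string before splitting rejects input such as trailing commas or embedded spaces in one step.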