pyani.download module¶
Module providing functions useful for downloading genomes from NCBI.
-
class
pyani.download.
ASMIDs
[source]¶ Bases:
tuple
Matching Assembly ID information for a query taxID.
-
asm_ids
¶ Alias for field number 2
-
query
¶ Alias for field number 0
-
result_count
¶ Alias for field number 1
-
-
class
pyani.download.
Classification
[source]¶ Bases:
tuple
Taxonomic classification for an isolate.
-
genus
¶ Alias for field number 1
-
organism
¶ Alias for field number 0
-
species
¶ Alias for field number 2
-
strain
¶ Alias for field number 3
-
-
class
pyani.download.
DLFileData
[source]¶ Bases:
tuple
Convenience struct for file download data.
-
filestem
¶ Alias for field number 0
-
ftpstem
¶ Alias for field number 1
-
suffix
¶ Alias for field number 2
-
-
class
pyani.download.
DLStatus
(url: str, hashurl: str, outfname: pathlib.Path, outfhash: pathlib.Path, skipped: bool, error: Optional[str] = None)[source]¶ Bases:
object
Download status data.
-
exception
pyani.download.
FileExistsException
(msg: str = 'Specified file exists')[source]¶ Bases:
Exception
A specified file exists.
-
class
pyani.download.
Hashstatus
[source]¶ Bases:
tuple
Status report on file hash comparison.
-
filehash
¶ Alias for field number 2
-
localhash
¶ Alias for field number 1
-
passed
¶ Alias for field number 0
-
-
exception
pyani.download.
NCBIDownloadException
(msg: str = 'Error downloading file from NCBI')[source]¶ Bases:
Exception
General exception for failed NCBI download.
-
exception
pyani.download.
PyaniIndexException
[source]¶ Bases:
Exception
General exception for indexing with pyani.
-
pyani.download.
check_hash
(fname: pathlib.Path, hashfile: pathlib.Path) → pyani.download.Hashstatus[source]¶ Check MD5 of passed file against downloaded NCBI hash file.
Parameters: - fname – Path, path to the local file to be hashed
- hashfile – Path, path to NCBI hash file
-
pyani.download.
compile_url
(filestem: str, suffix: str, ftpstem: str) → Tuple[str, str][source]¶ Compile download URLs given a passed filestem.
Parameters: - filestem –
- suffix –
- ftpstem –
The filestem corresponds to <AA>_<AN>, where <AA> and <AN> are AssemblyAccession and AssemblyName: data fields in the eSummary record. These correspond to downloadable files for each assembly at ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GC[AF]/nnn/nnn/nnn/<AA>_<AN>/ where <AA> is AssemblyAccession, and <AN> is AssemblyName. The choice of GCA vs GCF, and the values of nnn, are derived from <AA>.
The files in this directory all have the stem <AA>_<AN>_<suffix>, where suffixes are: assembly_report.txt, assembly_stats.txt, feature_table.txt.gz, genomic.fna.gz, genomic.gbff.gz, genomic.gff.gz, protein.faa.gz, protein.gpff.gz, rm_out.gz, rm.run, wgsmaster.gbff.gz
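As a rough illustration of the directory layout described above, the following sketch derives a download URL and a hash URL from a filestem. It is not pyani's implementation: the name `sketch_compile_url` is hypothetical, and the hash file name `md5checksums.txt` is an assumption about the NCBI directory contents that this page does not state.

```python
from typing import Tuple


def sketch_compile_url(filestem: str, suffix: str, ftpstem: str) -> Tuple[str, str]:
    """Build (file URL, hash URL) for an assembly filestem such as <AA>_<AN>."""
    # <AA> looks like GCF_000011605.1, so split off the GCA/GCF prefix first.
    prefix, accession, _name = filestem.split("_", 2)
    # Drop the version (".1") and cut the digits into nnn/nnn/nnn subdirectories.
    digits = accession.split(".")[0]
    subdirs = "/".join(digits[i : i + 3] for i in range(0, len(digits), 3))
    base = f"{ftpstem}/{prefix}/{subdirs}/{filestem}"
    # ASSUMPTION: the per-assembly hash file is called md5checksums.txt.
    return f"{base}/{filestem}_{suffix}", f"{base}/md5checksums.txt"
```

Called with the filestem `GCF_000011605.1_ASM1160v1`, the suffix `genomic.fna.gz` and the FTP stem `ftp://ftp.ncbi.nlm.nih.gov/genomes/all`, the sketch places the file under `GCF/000/011/605/`.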
-
pyani.download.
construct_output_paths
(filestem: str, suffix: str, outdir: pathlib.Path) → Tuple[pathlib.Path, pathlib.Path][source]¶ Construct paths to output files for genome and hash.
Parameters: - filestem – str, output filename stem
- suffix – str, output filename suffix
- outdir – Path, path to output directory
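A minimal sketch of how the two output paths might be assembled with pathlib. The function name `sketch_output_paths` and the `_hashes.txt` file name are assumptions made for illustration; only the (genome path, hash path) return shape comes from the signature above.

```python
from pathlib import Path
from typing import Tuple


def sketch_output_paths(filestem: str, suffix: str, outdir: Path) -> Tuple[Path, Path]:
    """Return (genome file path, hash file path) inside outdir."""
    genome_path = outdir / f"{filestem}_{suffix}"
    # ASSUMPTION: the hash file name is illustrative, not taken from pyani.
    hash_path = outdir / f"{filestem}_hashes.txt"
    return genome_path, hash_path
```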
-
pyani.download.
create_hash
(fname: pathlib.Path) → str[source]¶ Return MD5 hash of the passed file contents.
Parameters: fname – Path, path to file for hashing
We can ignore the Bandit B303 error as we’re not using the hash for cryptographic purposes.
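For readers unfamiliar with file hashing, here is a minimal sketch of computing an MD5 digest of a file's contents with the standard library. The name `sketch_create_hash` and the `chunk_size` parameter are hypothetical; reading in chunks is a design choice of this sketch so that large genome files are not loaded into memory at once.

```python
import hashlib
from pathlib import Path


def sketch_create_hash(fname: Path, chunk_size: int = 65536) -> str:
    """Return the hex MD5 digest of the file at fname."""
    # MD5 is used here as an integrity check only, not as a security measure.
    digest = hashlib.md5()
    with fname.open("rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```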
-
pyani.download.
create_labels
(classification: pyani.download.Classification, filestem: str, genomehash: str) → Tuple[str, str][source]¶ Return class and label text from UID classification.
Parameters: - classification – Classification named tuple (org, genus, species, strain)
- filestem – str, filestem of input genome file
- genomehash – str, MD5 hash of genome data
The ‘class’ data is the organism as provided in the passed Classification named tuple; the ‘label’ data is genus, species and strain information from the same tuple. The label is intended to be human-readable, the class data to be a genuine class identifier.
Returns a tuple of two strings: (label, class).
The two strings are tab-separated strings: <HASH>\t<FILE>\t<CLASS/LABEL>. The hash is used to help uniquely identify the genome in the database (label/class is unique by a combination of hash and run ID).
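The tab-separated layout can be sketched as follows. This is an illustration only: `sketch_create_labels` is a hypothetical name, the Classification tuple is re-declared locally with the field order documented above, and joining genus, species and strain with single spaces is an assumption about the label formatting.

```python
from collections import namedtuple
from typing import Tuple

# Field order follows the Classification entry: organism, genus, species, strain.
Classification = namedtuple("Classification", "organism genus species strain")


def sketch_create_labels(
    classification: Classification, filestem: str, genomehash: str
) -> Tuple[str, str]:
    """Return (label, class) as tab-separated <HASH>, <FILE>, <TEXT> strings."""
    # ASSUMPTION: the human-readable label is a space-joined genus/species/strain.
    label_text = " ".join(
        [classification.genus, classification.species, classification.strain]
    )
    label = "\t".join([genomehash, filestem, label_text])
    class_ = "\t".join([genomehash, filestem, classification.organism])
    return label, class_
```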
-
pyani.download.
download_genome_and_hash
(outdir: pathlib.Path, timeout: int, dlfiledata: pyani.download.DLFileData, dltype: str = 'RefSeq', disable_tqdm: bool = False) → namedlist.namedlist[source]¶ Download genome and accompanying MD5 hash from NCBI.
Parameters: - outdir – Path to output directory for downloads
- timeout – int: timeout for download attempt
- dlfiledata – namedtuple of info for file to download
- dltype – reference database to use: RefSeq or GenBank
- disable_tqdm – disable progress bar
This function tries the (assumed to be passed) RefSeq FTP URL first and, if that fails, then attempts to download the corresponding GenBank data.
We attempt to gracefully skip genomes with download errors.
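The RefSeq-then-GenBank fallback could look roughly like the sketch below. Everything here is hypothetical: `sketch_download_with_fallback` and the `fetch` callable stand in for the real download step, and deriving the GenBank filestem by swapping the `GCF_` prefix for `GCA_` is an assumption, not something this page specifies.

```python
from typing import Callable, Optional


def sketch_download_with_fallback(
    filestem: str, fetch: Callable[[str], bool]
) -> Optional[str]:
    """Try the passed (RefSeq) filestem, then a GenBank one; return what worked."""
    if fetch(filestem):
        return filestem
    # ASSUMPTION: the GenBank counterpart differs only in its GCA_ prefix.
    if filestem.startswith("GCF_"):
        genbank_stem = "GCA_" + filestem[len("GCF_"):]
        if fetch(genbank_stem):
            return genbank_stem
    # Neither attempt worked: signal the caller to skip this genome.
    return None
```

Returning `None` instead of raising lets a caller continue with the remaining genomes, in the spirit of the graceful skipping described above.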
-
pyani.download.
download_url
(url: str, outfname: pathlib.Path, timeout: int, disable_tqdm: bool = False) → None[source]¶ Download the contents of a remote URL to a local file.
Parameters: - url – URL of remote file for download
- outfname – Path, path to write output
- timeout – int, timeout for the download attempt
- disable_tqdm – Boolean, disable the tqdm progress bar if True
This function downloads the contents of the passed URL to the passed filename, in buffered chunks.
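A buffered download of this kind can be sketched with the standard library alone. The name `sketch_download_url` and the `bsize` parameter are hypothetical, and the progress bar is left out to keep the example dependency-free.

```python
from pathlib import Path
from urllib.request import urlopen


def sketch_download_url(
    url: str, outfname: Path, timeout: int, bsize: int = 1048576
) -> None:
    """Copy the contents of url to outfname, bsize bytes at a time."""
    with urlopen(url, timeout=timeout) as response, outfname.open("wb") as out:
        while True:
            chunk = response.read(bsize)
            if not chunk:  # an empty read means the stream is exhausted
                break
            out.write(chunk)
```

Writing chunk by chunk keeps memory use bounded by `bsize` regardless of the size of the remote file.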
-
pyani.download.
entrez_batch
(func)[source]¶ Decorator to compile batches from the wrapped function into a single set of results.
The entrez_batch decorator should go outside the entrez_retry decorator.
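To illustrate the idea of compiling batches into a single result, here is a minimal batching decorator. It is a sketch under stated assumptions: `sketch_batch` and the keyword names `expected`, `batchsize`, `start` and `size` are all hypothetical, and the wrapped function is assumed to return a list for each batch.

```python
import functools


def sketch_batch(func):
    """Call func once per batch and concatenate the per-batch lists."""

    @functools.wraps(func)
    def wrapper(*args, expected: int, batchsize: int, **kwargs):
        results = []
        # Step through the expected number of records one batch at a time.
        for start in range(0, expected, batchsize):
            results.extend(func(*args, start=start, size=batchsize, **kwargs))
        return results

    return wrapper


@sketch_batch
def fetch_slice(records, start, size):
    """Stand-in for a remote query that returns one batch of records."""
    return records[start : start + size]
```

Because the batching wrapper calls the wrapped function once per batch, placing it outside a retry decorator means each batch is retried on its own rather than restarting the whole query.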
-
pyani.download.
entrez_retry
(func)[source]¶ Decorator to retry the wrapped function up to ‘retries’ times.
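A retry decorator of this general shape can be sketched as below. The name `sketch_retry` is hypothetical, and several details are assumptions: `retries` is treated as the total number of attempts, only `OSError` (the base class of urllib's `URLError` and `HTTPError`) is retried, and a `RuntimeError` is raised once attempts are exhausted.

```python
import functools


def sketch_retry(func):
    """Re-run func until it succeeds or `retries` attempts have been made."""

    @functools.wraps(func)
    def wrapper(*args, retries: int = 1, **kwargs):
        last_error = None
        for _ in range(retries):
            try:
                return func(*args, **kwargs)
            except OSError as err:  # network-style failures are retried
                last_error = err
        raise RuntimeError(f"failed after {retries} attempts") from last_error

    return wrapper
```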
-
pyani.download.
extract_contigs
(fname: pathlib.Path, ename: pathlib.Path) → subprocess.CompletedProcess[source]¶ Extract contents of fname to ename using gunzip.
Parameters: - fname – Path, path to input compressed file
- ename – Path, path to output uncompressed file
Returns the CompletedProcess describing the status of the subprocess call.
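A minimal sketch of decompressing through a gunzip subprocess follows. The name `sketch_extract_contigs` is hypothetical, and the use of `gunzip -c` with output redirected to the target file is an assumption about how the extraction might be done; it requires a `gunzip` executable on the PATH.

```python
import subprocess
from pathlib import Path


def sketch_extract_contigs(fname: Path, ename: Path) -> subprocess.CompletedProcess:
    """Decompress fname into ename and return the finished process."""
    with ename.open("wb") as out:
        # -c writes decompressed data to stdout and leaves the .gz file in place.
        return subprocess.run(["gunzip", "-c", str(fname)], stdout=out, check=False)
```

With `check=False` a failed decompression does not raise, so the caller is expected to inspect `returncode` on the returned object.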
-
pyani.download.
extract_filestem
(esummary) → str[source]¶ Extract filestem from Entrez eSummary data.
Parameters: esummary – Function expects esummary[‘DocumentSummarySet’][‘DocumentSummary’][0]
Some illegal characters may occur in AssemblyName - for these, a more robust regex replace/escape may be required. Sadly, NCBI don’t just use standard percent escapes, but instead replace certain characters with underscores: white space, slash, comma, hash, brackets.
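The underscore substitution described above can be sketched with a single regular expression. This is an illustration, not the library's code: `sketch_extract_filestem` is a hypothetical name, the function takes the already-unwrapped document summary, and reading "brackets" as round parentheses is an assumption.

```python
import re


def sketch_extract_filestem(docsum: dict) -> str:
    """Return <AssemblyAccession>_<AssemblyName> with awkward characters escaped."""
    # Characters named in the text: white space, slash, comma, hash, brackets.
    # ASSUMPTION: "brackets" means ( and ).
    escaped_name = re.sub(r"[\s/,#()]", "_", docsum["AssemblyName"])
    return f"{docsum['AssemblyAccession']}_{escaped_name}"
```

Each matched character becomes one underscore, so adjacent matches such as a space followed by an opening bracket yield two underscores in a row.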
-
pyani.download.
extract_hash
(hashfile: pathlib.Path, name: str) → str[source]¶ Return MD5 hash from file of name:MD5 hashes.
Parameters: - hashfile – Path, path to file containing name:MD5 pairs
- name – str, name associated with hash
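Looking up one hash in a checksum file might be sketched as follows. The name `sketch_extract_hash` is hypothetical, and the file layout is an assumption: each line is taken to hold an MD5 digest followed by a file path, separated by white space. Raising `KeyError` when nothing matches is likewise a choice of this sketch.

```python
from pathlib import Path


def sketch_extract_hash(hashfile: Path, name: str) -> str:
    """Return the MD5 digest recorded for the file called name."""
    with hashfile.open() as handle:
        for line in handle:
            fields = line.split()
            # ASSUMPTION: lines look like "<md5>  ./<filename>".
            if len(fields) == 2 and Path(fields[1]).name == name:
                return fields[0]
    raise KeyError(f"no hash recorded for {name}")
```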
-
pyani.download.
get_asm_uids
(taxon_uid: str, retries: int) → pyani.download.ASMIDs[source]¶ Return set of NCBI UIDs associated with the passed taxon UID.
Parameters: - taxon_uid – str, NCBI taxID for taxon to download
- retries – int, number of download retry attempts
This query at NCBI returns all assemblies for the taxon subtree rooted at the passed taxon_uid.
-
pyani.download.
get_ncbi_classification
(esummary) → pyani.download.Classification[source]¶ Return organism, genus, species, strain info from eSummary data.
Parameters: esummary –
-
pyani.download.
get_ncbi_esummary
(asm_uid, retries, api_key=None) → Tuple[source]¶ Obtain full eSummary info for the passed assembly UID.
Parameters: - asm_uid –
- retries –
- api_key –
-
pyani.download.
make_asm_dict
(taxon_ids: List[str], retries: int) → Dict[KT, VT][source]¶ Return a dict of assembly UIDs, keyed by passed taxon IDs.
Parameters: - taxon_ids –
- retries –
Takes the passed list of taxon IDs and calls get_asm_uids to generate a dictionary linking each taxon ID to a list of assembly IDs at NCBI.
-
pyani.download.
retrieve_genome_and_hash
(filestem: str, suffix: str, ftpstem: str, outdir: pathlib.Path, timeout: int, disable_tqdm: bool = False) → pyani.download.DLStatus[source]¶ Download genome contigs and MD5 hash data from NCBI.
Parameters: - filestem –
- suffix –
- ftpstem –
- outdir –
- timeout –
- disable_tqdm – Boolean, disable the tqdm progress bar if True
-
pyani.download.
set_ncbi_email
(email: str) → None[source]¶ Set contact email for NCBI.
Parameters: email – str, email address to give to Entrez at NCBI
-
pyani.download.
split_taxa
(taxa: str) → List[str][source]¶ Return list of taxon ids from the passed comma-separated list.
Parameters: taxa – str, comma-separated list of valid NCBI taxonomy IDs
The function checks the passed taxon argument against a regular expression that permits comma-separated numerical symbols only.
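The validation step can be sketched as below. The name `sketch_split_taxa`, the exact regular expression, and the choice to raise `ValueError` on bad input are assumptions made for illustration; the page only states that comma-separated numeric IDs are permitted.

```python
import re
from typing import List


def sketch_split_taxa(taxa: str) -> List[str]:
    """Split a comma-separated string of numeric taxon IDs into a list."""
    # One or more digits, optionally followed by more comma-prefixed digit runs.
    if re.fullmatch(r"[0-9]+(,[0-9]+)*", taxa) is None:
        raise ValueError(f"invalid taxon list: {taxa!r}")
    return taxa.split(",")
```

Under these assumptions, `sketch_split_taxa("203,1,9606")` returns `["203", "1", "9606"]`, while input containing letters or spaces raises `ValueError`.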