pyani.scripts.genbank_get_genomes_by_taxon module¶
Script to download from NCBI all genomes in a specified taxon subtree.
This script takes an NCBI taxonomy identifier (or string, though this is not always reliable for taxonomy tree subgraphs…) and downloads all genomes it can find from NCBI in the corresponding taxon subgraph that has the passed argument as root.
-
exception
pyani.scripts.genbank_get_genomes_by_taxon.
NCBIDownloadException
[source]¶ Bases:
Exception
General exception for failed NCBI download.
-
pyani.scripts.genbank_get_genomes_by_taxon.
entrez_batch_webhistory
(args, record, expected, batchsize, *fnargs, **fnkwargs)[source]¶ Recover Entrez data from a prior NCBI webhistory search.
Parameters: - args – Namespace, command-line arguments
- record – Entrez webhistory record
- expected – int, number of expected search returns
- batchsize – int, number of search returns to retrieve in each batch
- *fnargs –
tuple, arguments to Efetch
- **fnkwargs –
dict, keyword arguments to Efetch
Recovery is performed in batches of defined size, using Efetch. Returns all results as a list.
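The batching described above can be sketched as follows. This is a minimal illustration and not the module's implementation: `fetch_in_batches` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for an Efetch call against the webhistory record so that the example runs without network access.

```python
from typing import Callable, List


def fetch_in_batches(
    fetch: Callable[[int, int], list], expected: int, batchsize: int
) -> List:
    """Collect `expected` results by calling `fetch(start, count)` once per batch."""
    results = []
    for start in range(0, expected, batchsize):
        # The final batch may be smaller than batchsize.
        count = min(batchsize, expected - start)
        results.extend(fetch(start, count))
    return results


def fake_fetch(start: int, count: int) -> list:
    """Hypothetical stand-in for Efetch: returns `count` items from `start`."""
    return list(range(start, start + count))


batched = fetch_in_batches(fake_fetch, expected=25, batchsize=10)
print(len(batched))
```

In a real call, the start offset and batch size would typically be passed on to Efetch as its `retstart` and `retmax` parameters, together with the webhistory identifiers from `record`.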
-
pyani.scripts.genbank_get_genomes_by_taxon.
entrez_retry
(args, func, *fnargs, **fnkwargs)[source]¶ Retry the passed function a defined number of times.
Parameters: - args – Namespace, command-line arguments
- func – func, Entrez function to attempt
- *fnargs –
tuple, arguments to the Entrez function
- **fnkwargs –
dict, keyword arguments to the Entrez function
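A retry wrapper of this kind might look like the sketch below. It is an assumption-laden illustration: the helper name `retry` and its explicit `retries` argument are hypothetical (the documented function takes the command-line `args` namespace instead), and `flaky` only simulates an Entrez call that fails twice before succeeding.

```python
def retry(func, retries, *fnargs, **fnkwargs):
    """Call func(*fnargs, **fnkwargs) up to `retries` times, returning the first success."""
    last_error = None
    for _ in range(retries):
        try:
            return func(*fnargs, **fnkwargs)
        except Exception as err:  # broad on purpose: any failure triggers a retry
            last_error = err
    raise RuntimeError(f"failed after {retries} attempts") from last_error


calls = {"n": 0}


def flaky(value):
    """Simulated remote call: raises on the first two attempts."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("temporary failure")
    return value * 2


result = retry(flaky, 5, 21)
print(result, calls["n"])
```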
-
pyani.scripts.genbank_get_genomes_by_taxon.
extract_archive
(archivepath)[source]¶ Return path to extracted gzip file.
Parameters: archivepath – Path, path to gzipped file with “.tar.gz” suffix
-
pyani.scripts.genbank_get_genomes_by_taxon.
extract_filestem
(data)[source]¶ Extract filestem from Entrez eSummary data.
Parameters: data – Entrez eSummary
Function expects esummary[‘DocumentSummarySet’][‘DocumentSummary’][0]
Some illegal characters may occur in AssemblyName - for these, a more robust regex replace/escape may be required. Sadly, NCBI don’t just use standard percent escapes, but instead replace certain characters with underscores: white space, slash, comma, hash, brackets.
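The underscore substitution described above could be approximated as in the sketch below. The helper name `make_filestem`, the example values, and the reading of "brackets" as round brackets are assumptions made for illustration; the character set handled by the actual function may differ.

```python
import re

# Characters the text says NCBI replaces with underscores:
# white space, slash, comma, hash, and (assumed round) brackets.
ILLEGAL_CHARS = re.compile(r"[\s/,#()]")


def make_filestem(accession: str, name: str) -> str:
    """Join accession and name as <AA>_<AN>, replacing illegal characters in the name."""
    return f"{accession}_{ILLEGAL_CHARS.sub('_', name)}"


# Made-up accession and assembly name, for illustration only.
stem = make_filestem("GCF_000000000.1", "ASM 1/v2 (draft)")
print(stem)
```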
-
pyani.scripts.genbank_get_genomes_by_taxon.
get_asm_uids
(args, taxon_uid)[source]¶ Return a set of NCBI UIDs associated with the passed taxon.
Parameters: - args – Namespace, command-line arguments
- taxon_uid – str, NCBI taxon ID
This query at NCBI returns all assemblies for the taxon subtree rooted at the passed taxon_uid.
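One common way to express a subtree query in NCBI Entrez is the `txid<N>[Organism:exp]` search term, where `exp` expands the search to all descendants of the taxon. The sketch below only builds that term; whether the script uses exactly this term is an assumption, and `subtree_term` is a hypothetical helper.

```python
def subtree_term(taxon_uid: str) -> str:
    """Build an Entrez search term covering the subtree rooted at taxon_uid."""
    return f"txid{taxon_uid}[Organism:exp]"


term = subtree_term("561")
print(term)
```

The resulting string would then be passed as the search term of an ESearch request against the assembly database.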
-
pyani.scripts.genbank_get_genomes_by_taxon.
get_ncbi_asm
(args, asm_uid, fmt='fasta')[source]¶ Return the NCBI AssemblyAccession and AssemblyName for an assembly.
Parameters: - args – Namespace, command-line arguments
- asm_uid – NCBI assembly UID
- fmt – str, format to retrieve assembly information
Also returns organism data for class/label files, as well as the accession, so we can track whether downloads fail because only the most recent version is available.
AssemblyAccession and AssemblyName are data fields in the eSummary record, and correspond to downloadable files for each assembly at ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GC[AF]/nnn/nnn/nnn/<AA>_<AN> where <AA> is AssemblyAccession, and <AN> is AssemblyName, and the choice of GCA vs GCF, and the three values of nnn are taken from <AA>.
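As an illustration of how the directory location follows from the accession, the sketch below assumes that the three `nnn` values are the consecutive three-digit groups of the nine-digit numeric part of <AA>. The helper name `assembly_dir_url` is hypothetical, and the accession and name are used only as example input.

```python
def assembly_dir_url(
    accession: str,
    name: str,
    ftpstem: str = "ftp://ftp.ncbi.nlm.nih.gov/genomes/all",
) -> str:
    """Build the assembly directory URL from AssemblyAccession and AssemblyName."""
    # "GCF_000005845.2" -> prefix "GCF", digits "000005845"
    prefix, digits = accession.split(".")[0].split("_")
    # Split the nine digits into three groups of three.
    triplets = [digits[i : i + 3] for i in range(0, 9, 3)]
    return "/".join([ftpstem, prefix, *triplets, f"{accession}_{name}"])


url = assembly_dir_url("GCF_000005845.2", "ASM584v2")
print(url)
```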
-
pyani.scripts.genbank_get_genomes_by_taxon.
last_exception
()[source]¶ Return last exception as a string, or use in logging.
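A helper with this behaviour can be written with the standard library alone. The sketch below is one possible approach, not necessarily the one used here, and `last_exception_text` is a hypothetical name.

```python
import sys
import traceback


def last_exception_text() -> str:
    """Format the exception currently being handled as a single string."""
    return "".join(traceback.format_exception(*sys.exc_info()))


try:
    1 / 0
except ZeroDivisionError:
    msg = last_exception_text()

print(msg)
```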
-
pyani.scripts.genbank_get_genomes_by_taxon.
logreport_downloaded
(accn, skiplist, accndict, uidaccndict)[source]¶ Report to logger if alternative assemblies were downloaded.
Parameters: - accn –
- skiplist –
- accndict –
- uidaccndict –
-
pyani.scripts.genbank_get_genomes_by_taxon.
make_outdir
(args: argparse.Namespace) → None[source]¶ Make the output directory, if required.
Parameters: args – Namespace, command-line arguments
This is a little involved. If the output directory already exists, we take the safe option by default, and stop with an error. We can, however, choose to force the program to go on, in which case we can either clobber the existing directory, or not. The options turn out as the following, if the directory exists:
- DEFAULT: stop and report the collision
- FORCE: continue, and remove the existing output directory
- NOCLOBBER+FORCE: continue, but do not remove the existing output
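The three cases can be sketched as below. This is an illustration under stated assumptions: `prepare_outdir` and its `force` and `noclobber` keyword arguments are hypothetical stand-ins for the corresponding command-line options, and the real function reads them from the `args` namespace.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_outdir(outdir: Path, force: bool = False, noclobber: bool = False) -> None:
    """Create outdir, handling an existing directory as DEFAULT/FORCE/NOCLOBBER+FORCE."""
    if outdir.exists():
        if not force:
            # DEFAULT: stop and report the collision.
            raise FileExistsError(f"output directory {outdir} already exists")
        if not noclobber:
            # FORCE: remove the existing directory before recreating it.
            shutil.rmtree(outdir)
        # NOCLOBBER+FORCE: fall through and keep the existing contents.
    outdir.mkdir(parents=True, exist_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "genomes"
    prepare_outdir(target)  # directory does not exist yet: simply created
    (target / "old.fasta").write_text(">x\nACGT\n")

    prepare_outdir(target, force=True, noclobber=True)
    kept = (target / "old.fasta").exists()

    prepare_outdir(target, force=True)
    removed = not (target / "old.fasta").exists()

    try:
        prepare_outdir(target)
        stopped = False
    except FileExistsError:
        stopped = True

print(kept, removed, stopped)
```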
-
pyani.scripts.genbank_get_genomes_by_taxon.
parse_cmdline
(argv=None)[source]¶ Parse command-line arguments.
Parameters: argv – list of command-line arguments
-
pyani.scripts.genbank_get_genomes_by_taxon.
retrieve_asm_contigs
(args, filestem, ftpstem='ftp://ftp.ncbi.nlm.nih.gov/genomes/all', fmt='fasta')[source]¶ Download assembly sequence to a local directory.
Parameters: - args – Namespace, command-line arguments
- filestem – str, filestem for output file
- ftpstem – str, URI stem for NCBI FTP site
- fmt – str, format for output file
The filestem corresponds to <AA>_<AN>, where <AA> and <AN> are AssemblyAccession and AssemblyName: data fields in the eSummary record. These correspond to downloadable files for each assembly at ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GC[AF]/nnn/nnn/nnn/<AA>_<AN>/ where <AA> is AssemblyAccession, and <AN> is AssemblyName. The choice of GCA vs GCF, and the values of nnn, are derived from <AA>
The files in this directory all have the stem <AA>_<AN>_<suffix>, where suffixes are: assembly_report.txt, assembly_stats.txt, feature_table.txt.gz, genomic.fna.gz, genomic.gbff.gz, genomic.gff.gz, protein.faa.gz, protein.gpff.gz, rm_out.gz, rm.run, wgsmaster.gbff.gz
This function downloads the genomic.fna.gz file, and extracts it into the output directory specified when the script is called.
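The extraction step can be illustrated with the standard `gzip` module. The sketch below is not the module's own code: `gunzip` is a hypothetical helper, the file name uses a made-up stem, and a small gzipped FASTA file is written locally so that no download is needed.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gunzip(archive: Path) -> Path:
    """Decompress a .gz file next to the archive and return the new path."""
    target = archive.with_suffix("")  # drops only the trailing ".gz"
    with gzip.open(archive, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


with tempfile.TemporaryDirectory() as tmp:
    # Made-up <AA>_<AN> stem, for illustration only.
    archive = Path(tmp) / "GCF_000000000.1_EXAMPLE_genomic.fna.gz"
    with gzip.open(archive, "wt") as handle:
        handle.write(">contig1\nACGT\n")

    extracted = gunzip(archive)
    extracted_name = extracted.name
    extracted_text = extracted.read_text()

print(extracted_name)
```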
-
pyani.scripts.genbank_get_genomes_by_taxon.
run_main
(args=None)[source]¶ Run main process for genbank_get_genomes_by_taxon.py script.
Parameters: args – Namespace, command-line arguments
-
pyani.scripts.genbank_get_genomes_by_taxon.
set_ncbi_email
(args: argparse.Namespace) → None[source]¶ Set contact email for NCBI.
Parameters: args – Namespace, command-line arguments
-
pyani.scripts.genbank_get_genomes_by_taxon.
write_contigs
(args, asm_uid, contig_uids, batchsize=10000)[source]¶ Write assembly contigs to a single FASTA file.
Parameters: - args – Namespace, command-line arguments
- asm_uid – str, NCBI assembly UID
- contig_uids –
- batchsize – int
FASTA records are returned, as GenBank and even GenBankWithParts format records don’t reliably give correct sequence in all cases.
The script returns two strings for each assembly, a ‘class’ and a ‘label’ string - this is for use with, e.g. pyani.