pyani.scripts.genbank_get_genomes_by_taxon module

Script to download from NCBI all genomes in a specified taxon subtree.

This script takes an NCBI taxonomy identifier (or string, though this is not always reliable for taxonomy tree subgraphs…) and downloads all genomes it can find at NCBI in the taxon subgraph rooted at that identifier.

exception pyani.scripts.genbank_get_genomes_by_taxon.NCBIDownloadException[source]

Bases: Exception

General exception for failed NCBI download.

pyani.scripts.genbank_get_genomes_by_taxon.entrez_batch_webhistory(args, record, expected, batchsize, *fnargs, **fnkwargs)[source]

Recover Entrez data from a prior NCBI webhistory search.

Parameters:
  • args – Namespace, command-line arguments
  • record – Entrez webhistory record
  • expected – int, number of expected search returns
  • batchsize – int, number of search returns to retrieve in each batch
  • *fnargs

    tuple, arguments to Efetch

  • **fnkwargs

    dict, keyword arguments to Efetch

Recovery is performed in batches of defined size, using Efetch. Returns all results as a list.
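The batching described above can be sketched without any network access. This is a minimal illustration and not the module's implementation: fetch_in_batches and fake_fetch are hypothetical names, and the fetch callable stands in for an Efetch request against the stored webhistory (with Biopython this would typically be a wrapper around Entrez.efetch using retstart and retmax, which is an assumption about the real code).

```python
from typing import Callable, List


def fetch_in_batches(
    fetch: Callable[[int, int], List[str]],
    expected: int,
    batchsize: int,
) -> List[str]:
    """Collect `expected` results by calling `fetch(start, size)` once per batch."""
    if batchsize <= 0:
        raise ValueError("batchsize must be positive")
    results: List[str] = []
    for start in range(0, expected, batchsize):
        # The final batch may be shorter than batchsize
        size = min(batchsize, expected - start)
        results.extend(fetch(start, size))
    return results


def fake_fetch(start: int, size: int) -> List[str]:
    """Hypothetical stand-in for an Efetch call; returns placeholder records."""
    return [f"record_{i}" for i in range(start, start + size)]


# Seven expected records in batches of three: batches of 3, 3 and 1
batched = fetch_in_batches(fake_fetch, expected=7, batchsize=3)
```

Passing the fetch step in as a callable keeps the loop testable offline; the real function instead forwards *fnargs and **fnkwargs to Efetch.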

pyani.scripts.genbank_get_genomes_by_taxon.entrez_retry(args, func, *fnargs, **fnkwargs)[source]

Retry the passed function a defined number of times.

Parameters:
  • args – Namespace, command-line arguments
  • func – func, Entrez function to attempt
  • *fnargs

    tuple, arguments to the Entrez function

  • **fnkwargs

    dict, keyword arguments to the Entrez function
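A retry wrapper of this kind might look like the sketch below. It is an illustration under assumptions, not the module's code: the retries parameter is hypothetical (how the real function reads the retry limit from args is not documented here), and RuntimeError stands in for whatever exception the script actually raises on final failure.

```python
def retry(func, retries, *fnargs, **fnkwargs):
    """Call func until it succeeds or `retries` attempts have failed."""
    last_error = None
    for _ in range(retries):
        try:
            return func(*fnargs, **fnkwargs)
        except Exception as err:  # broad catch, for illustration only
            last_error = err
    raise RuntimeError(f"failed after {retries} attempts") from last_error


# Hypothetical flaky function: fails twice, then succeeds
calls = {"count": 0}


def flaky(value):
    calls["count"] += 1
    if calls["count"] < 3:
        raise IOError("transient failure")
    return value * 2


retry_result = retry(flaky, 5, 21)
```

Here flaky succeeds on its third call, so only three of the five permitted attempts are used.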

pyani.scripts.genbank_get_genomes_by_taxon.extract_archive(archivepath)[source]

Return path to extracted gzip file.

Parameters:
  • archivepath – Path, path to gzipped file with “.tar.gz” suffix

pyani.scripts.genbank_get_genomes_by_taxon.extract_filestem(data)[source]

Extract filestem from Entrez eSummary data.

Parameters:
  • data – Entrez eSummary

Function expects esummary[‘DocumentSummarySet’][‘DocumentSummary’][0]

Some illegal characters may occur in AssemblyName - for these, a more robust regex replace/escape may be required. Sadly, NCBI don’t just use standard percent escapes, but instead replace certain characters with underscores: white space, slash, comma, hash, brackets.
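The underscore substitution described above can be sketched with a single regular expression. This is an assumption-laden illustration rather than the function's actual code: make_filestem is a hypothetical helper, "brackets" is taken here to cover both round and square brackets, and the accession and assembly name are made-up values.

```python
import re


def make_filestem(accession: str, name: str) -> str:
    """Join accession and name, replacing problem characters with "_"."""
    # Whitespace, slash, comma, hash, and round/square brackets (assumed set)
    escaped = re.sub(r"[\s/,#()\[\]]", "_", name)
    return f"{accession}_{escaped}"


# Made-up accession and assembly name, for illustration only
stem = make_filestem("GCF_000000000.1", "ASM 1/v2 (draft)")
```

Each matched character becomes its own underscore, so adjacent matches such as a space followed by an opening bracket yield a double underscore.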

pyani.scripts.genbank_get_genomes_by_taxon.get_asm_uids(args, taxon_uid)[source]

Return a set of NCBI UIDs associated with the passed taxon.

Parameters:
  • args – Namespace, command-line arguments
  • taxon_uid – str, NCBI taxon ID

This query at NCBI returns all assemblies for the taxon subtree rooted at the passed taxon_uid.

pyani.scripts.genbank_get_genomes_by_taxon.get_ncbi_asm(args, asm_uid, fmt='fasta')[source]

Return the NCBI AssemblyAccession and AssemblyName for an assembly.

Parameters:
  • args – Namespace, command-line arguments
  • asm_uid – NCBI assembly UID
  • fmt – str, format to retrieve assembly information

Also returns organism data for class/label files, together with the accession, so we can track whether downloads fail because only the most recent version is available.

AssemblyAccession and AssemblyName are data fields in the eSummary record, and correspond to downloadable files for each assembly at ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GC[AF]/nnn/nnn/nnn/<AA>_<AN>, where <AA> is AssemblyAccession and <AN> is AssemblyName. The choice of GCA vs GCF, and the three values of nnn, are taken from <AA>.
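As an illustration of how such a path could be derived from <AA>, the sketch below splits the accession into its GCA/GCF prefix and three consecutive digit triplets. The helper name assembly_url, the nine-digit assumption, and the example accession and assembly name are all hypothetical; the module's own logic may differ.

```python
def assembly_url(
    accession: str,
    name: str,
    ftpstem: str = "ftp://ftp.ncbi.nlm.nih.gov/genomes/all",
) -> str:
    """Build the assembly directory URL from accession and assembly name."""
    prefix, rest = accession.split("_", 1)   # e.g. "GCF", "000123456.1"
    digits = rest.split(".")[0]              # drop the version suffix
    if prefix not in ("GCA", "GCF") or len(digits) != 9 or not digits.isdigit():
        raise ValueError(f"unexpected accession format: {accession}")
    # Assumed layout: the nine digits form three nnn path components
    triplets = [digits[i:i + 3] for i in range(0, 9, 3)]
    return "/".join([ftpstem, prefix, *triplets, f"{accession}_{name}"])


# Made-up accession and name, for illustration only
url = assembly_url("GCF_000123456.1", "ASM12345v1")
```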

pyani.scripts.genbank_get_genomes_by_taxon.last_exception()[source]

Return last exception as a string, or use in logging.

pyani.scripts.genbank_get_genomes_by_taxon.logreport_downloaded(accn, skiplist, accndict, uidaccndict)[source]

Report to logger if alternative assemblies were downloaded.

Parameters:
  • accn
  • skiplist
  • accndict
  • uidaccndict

pyani.scripts.genbank_get_genomes_by_taxon.make_outdir(args: argparse.Namespace) → None[source]

Make the output directory, if required.

Parameters:
  • args – Namespace, command-line arguments

This is a little involved. If the output directory already exists, we take the safe option by default, and stop with an error. We can, however, choose to force the program to go on, in which case we can either clobber the existing directory, or not. If the directory exists, the options are:

  • DEFAULT: stop and report the collision
  • FORCE: continue, and remove the existing output directory
  • NOCLOBBER+FORCE: continue, but do not remove the existing output
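The three cases above can be sketched as follows. This is a minimal illustration, assuming boolean force and noclobber flags; prepare_outdir and its signature are hypothetical, whereas the real function reads its settings from the args Namespace.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_outdir(outdir: Path, force: bool = False, noclobber: bool = False) -> None:
    """Create outdir, handling an existing directory per the three cases."""
    if outdir.exists():
        if not force:
            # DEFAULT: stop and report the collision
            raise FileExistsError(f"{outdir} exists; stopping")
        if not noclobber:
            # FORCE: remove the existing directory before recreating it
            shutil.rmtree(outdir)
        # NOCLOBBER+FORCE: fall through and keep existing contents
    outdir.mkdir(parents=True, exist_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "genomes"
    prepare_outdir(target)                              # created fresh
    (target / "old.txt").write_text("existing output")
    prepare_outdir(target, force=True, noclobber=True)
    kept = (target / "old.txt").exists()                # contents survive
    prepare_outdir(target, force=True)
    removed = not (target / "old.txt").exists()         # directory was replaced
    try:
        prepare_outdir(target)
        stopped = False
    except FileExistsError:
        stopped = True                                  # default refuses to continue
```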

pyani.scripts.genbank_get_genomes_by_taxon.parse_cmdline(argv=None)[source]

Parse command-line arguments.

Parameters:
  • argv – list of command-line arguments

pyani.scripts.genbank_get_genomes_by_taxon.retrieve_asm_contigs(args, filestem, ftpstem='ftp://ftp.ncbi.nlm.nih.gov/genomes/all', fmt='fasta')[source]

Download assembly sequence to a local directory.

Parameters:
  • args – Namespace, command-line arguments
  • filestem – str, filestem for output file
  • ftpstem – str, URI stem for NCBI FTP site
  • fmt – str, format for output file

The filestem corresponds to <AA>_<AN>, where <AA> and <AN> are AssemblyAccession and AssemblyName: data fields in the eSummary record. These correspond to downloadable files for each assembly at ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GC[AF]/nnn/nnn/nnn/<AA>_<AN>/. The choice of GCA vs GCF, and the values of nnn, are derived from <AA>.

The files in this directory all have the stem <AA>_<AN>_<suffix>, where suffixes are:

  • assembly_report.txt
  • assembly_stats.txt
  • feature_table.txt.gz
  • genomic.fna.gz
  • genomic.gbff.gz
  • genomic.gff.gz
  • protein.faa.gz
  • protein.gpff.gz
  • rm_out.gz
  • rm.run
  • wgsmaster.gbff.gz

This function downloads the genomic.fna.gz file, and extracts it into the output directory specified when the script is called.
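The extraction step can be sketched with the standard library alone. This is a hedged illustration, not the module's code: gunzip is a hypothetical helper, the download itself is omitted, and a small gzipped FASTA file is written locally to stand in for the file fetched from NCBI.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gunzip(archive: Path) -> Path:
    """Decompress `archive` alongside itself and return the extracted path."""
    extracted = archive.with_suffix("")  # drop the trailing ".gz"
    with gzip.open(archive, "rb") as src, open(extracted, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return extracted


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a downloaded <AA>_<AN>_genomic.fna.gz file
    archive = Path(tmp) / "example_genomic.fna.gz"
    with gzip.open(archive, "wt") as handle:
        handle.write(">contig_1\nACGT\n")
    fasta = gunzip(archive)
    fasta_name = fasta.name
    fasta_text = fasta.read_text()
```

Streaming with shutil.copyfileobj avoids loading a whole genome into memory, which matters for larger assemblies.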

pyani.scripts.genbank_get_genomes_by_taxon.run_main(args=None)[source]

Run main process for the genbank_get_genomes_by_taxon.py script.

Parameters:
  • args – Namespace, command-line arguments

pyani.scripts.genbank_get_genomes_by_taxon.set_ncbi_email(args: argparse.Namespace) → None[source]

Set contact email for NCBI.

Parameters:
  • args – Namespace, command-line arguments

pyani.scripts.genbank_get_genomes_by_taxon.write_contigs(args, asm_uid, contig_uids, batchsize=10000)[source]

Write assembly contigs to a single FASTA file.

Parameters:
  • args – Namespace, command-line arguments
  • asm_uid – str, NCBI assembly UID
  • contig_uids
  • batchsize – int

FASTA records are returned, as GenBank and even GenBankWithParts format records don’t reliably give correct sequence in all cases.

The script returns two strings for each assembly, a ‘class’ and a ‘label’ string - this is for use with, e.g. pyani.