Downloading Genomes from NCBI
This page describes some typical use cases for downloading genomes from NCBI using the pyani download subcommand. This command downloads all assembled genomes found beneath a given taxon identifier in the NCBI GenBank database. You will need to know the NCBI taxon ID for each taxon you wish to download.
Attention
To use their online resources programmatically, NCBI require that you provide your email address for contact purposes if jobs go wrong, and for their own usage statistics. This should be specified with the --email <EMAIL ADDRESS> argument of pyani download.
For more information about the pyani download subcommand, please see the pyani download page, or issue the command pyani download -h to see the inline help.
Download all genomes in a single taxon
The basic form of the command is:
pyani download --email my.email@my.domain -t <TAXON_ID> -o <OUTPUT_DIRECTORY>
This instructs pyani to use the download subcommand to obtain all available genome assemblies below the taxon ID <TAXON_ID>, passed with the -t argument, and to place the downloaded files, along with label and class information files created by pyani, in the subdirectory <OUTPUT_DIRECTORY>, passed with the -o argument.
For example, if we wished to download all available assemblies for the bacterium Pseudomonas flexibilis, we would identify the taxon ID to be 706570 and use this as the argument to -t, placing the output in a convenient subdirectory (e.g. genomes, the argument to -o), with the command:
$ pyani download --email my.email@my.domain -t 706570 -o genomes
GCF_900155995.1_IMG-taxon_2681812811_annotated_assembly_genomic.fna.gz: 2097152it [00:00, 3224293.90it/s]
GCF_900155995.1_IMG-taxon_2681812811_annotated_assembly_hashes.txt: 1048576it [00:00, 110097041.36it/s]
GCF_900101515.1_IMG-taxon_2596583557_annotated_assembly_genomic.fna.gz: 2097152it [00:00, 3125724.89it/s]
GCF_900101515.1_IMG-taxon_2596583557_annotated_assembly_hashes.txt: 1048576it [00:00, 55139621.76it/s]
GCA_001312105.1_ASM131210v1_genomic.fna.gz: 1048576it [00:00, 1519534.52it/s]
GCA_001312105.1_ASM131210v1_hashes.txt: 1048576it [00:00, 272426072.29it/s]
GCF_000806415.1_ASM80641v1_genomic.fna.gz: 2097152it [00:00, 3097667.41it/s]
GCF_000806415.1_ASM80641v1_hashes.txt: 1048576it [00:00, 129632638.05it/s]
GCF_000802425.1_ASM80242v1_genomic.fna.gz: 2097152it [00:00, 2973018.02it/s]
GCF_000802425.1_ASM80242v1_hashes.txt: 1048576it [00:00, 233590743.10it/s]
This displays each assembly as a download is attempted, and places all output in the named subdirectory:
$ tree genomes/
genomes/
├── GCA_001312105.1_ASM131210v1_genomic.fna
├── GCA_001312105.1_ASM131210v1_genomic.fna.gz
├── GCA_001312105.1_ASM131210v1_genomic.fna.md5
├── GCA_001312105.1_ASM131210v1_hashes.txt
├── GCF_000802425.1_ASM80242v1_genomic.fna
├── GCF_000802425.1_ASM80242v1_genomic.fna.gz
├── GCF_000802425.1_ASM80242v1_genomic.fna.md5
├── GCF_000802425.1_ASM80242v1_hashes.txt
├── GCF_000806415.1_ASM80641v1_genomic.fna
├── GCF_000806415.1_ASM80641v1_genomic.fna.gz
├── GCF_000806415.1_ASM80641v1_genomic.fna.md5
├── GCF_000806415.1_ASM80641v1_hashes.txt
├── GCF_900101515.1_IMG-taxon_2596583557_annotated_assembly_genomic.fna
├── GCF_900101515.1_IMG-taxon_2596583557_annotated_assembly_genomic.fna.gz
├── GCF_900101515.1_IMG-taxon_2596583557_annotated_assembly_genomic.fna.md5
├── GCF_900101515.1_IMG-taxon_2596583557_annotated_assembly_hashes.txt
├── GCF_900155995.1_IMG-taxon_2681812811_annotated_assembly_genomic.fna
├── GCF_900155995.1_IMG-taxon_2681812811_annotated_assembly_genomic.fna.gz
├── GCF_900155995.1_IMG-taxon_2681812811_annotated_assembly_genomic.fna.md5
├── GCF_900155995.1_IMG-taxon_2681812811_annotated_assembly_hashes.txt
├── classes.txt
└── labels.txt
Each genome is downloaded in compressed format (.fna.gz files) and expanded in place to give the FASTA (.fna) files. The MD5 hash of each FASTA file is also calculated (.md5). This hash is used by pyani to uniquely identify that assembly throughout the analysis process.
$ head genomes/GCA_001312105.1_ASM131210v1_genomic.fna.md5
e55cd3d913a198ac60afd8d509c02ab4 genomes/GCA_001312105.1_ASM131210v1_genomic.fna
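The hashing step can be reproduced outside pyani when you want to check a file by hand. The following is a minimal sketch, assuming the stored value is a plain MD5 digest of the bytes of the uncompressed FASTA file (this matches the .md5 output shown above, but is not taken from the pyani source). The function name md5_of_file is hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        # Read fixed-size chunks so large genomes need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If the assumption holds, calling md5_of_file on genomes/GCA_001312105.1_ASM131210v1_genomic.fna should return the first field of the corresponding .md5 file.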
pyani also creates two files:
classes.txt: each genome is assigned a class, which is used to annotate genomes in the graphical output. pyani attempts to infer genus and species as the default class.
labels.txt: each genome is assigned a text label, which is used to label genomes in the graphical output. pyani attempts to infer genus, species, and strain ID as the default label.
$ head genomes/classes.txt
1ac4941c569f32f044eba0a8540d4704 GCF_900155995.1_IMG-taxon_2681812811_annotated_assembly_genomic Pseudomonas flexibilis
8664341798070f1d70b2569a5b3a2320 GCF_900101515.1_IMG-taxon_2596583557_annotated_assembly_genomic Pseudomonas flexibilis
e55cd3d913a198ac60afd8d509c02ab4 GCA_001312105.1_ASM131210v1_genomic Pseudomonas flexibilis
9b9719eb78bf7cf6dd0146a3f9426f60 GCF_000806415.1_ASM80641v1_genomic Pseudomonas flexibilis
2bdfffd867d843f970e4dfd388d5332a GCF_000802425.1_ASM80242v1_genomic Pseudomonas flexibilis
$ head genomes/labels.txt
1ac4941c569f32f044eba0a8540d4704 GCF_900155995.1_IMG-taxon_2681812811_annotated_assembly_genomic P. flexibilis ATCC 29606
8664341798070f1d70b2569a5b3a2320 GCF_900101515.1_IMG-taxon_2596583557_annotated_assembly_genomic P. flexibilis CGMCC 1.1365
e55cd3d913a198ac60afd8d509c02ab4 GCA_001312105.1_ASM131210v1_genomic P. flexibilis JCM 14085
9b9719eb78bf7cf6dd0146a3f9426f60 GCF_000806415.1_ASM80641v1_genomic P. flexibilis JCM 14085
2bdfffd867d843f970e4dfd388d5332a GCF_000802425.1_ASM80242v1_genomic P. flexibilis ATCC 29606
These files are used to associate labels and classes with the genome files in the pyani database, specific to the analysis run.
Both classes.txt and labels.txt can be edited to suit the user’s classification and labelling scheme.
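Before editing these files it can help to load them and inspect the current assignments. The sketch below assumes the three-column layout visible in the listings above (hash, filename stem, then free text) with whitespace between columns; the exact delimiter in the real files is not confirmed here, and parse_annotation_lines is a hypothetical helper, not part of pyani.

```python
def parse_annotation_lines(lines):
    """Map each genome hash to a (filename stem, annotation text) pair."""
    records = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the first two runs of whitespace only, so that
        # annotations containing spaces (e.g. strain IDs) stay intact.
        md5_hash, stem, text = line.split(None, 2)
        records[md5_hash] = (stem, text)
    return records


example = [
    "e55cd3d913a198ac60afd8d509c02ab4 GCA_001312105.1_ASM131210v1_genomic P. flexibilis JCM 14085",
]
print(parse_annotation_lines(example)["e55cd3d913a198ac60afd8d509c02ab4"])
# → ('GCA_001312105.1_ASM131210v1_genomic', 'P. flexibilis JCM 14085')
```

An open file handle can be passed directly, e.g. parse_annotation_lines(open("genomes/labels.txt")). A line with fewer than three fields raises ValueError, which is a reasonable way to catch a malformed hand edit.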
Download all genomes from multiple taxa
To download genomes from more than one taxon, you can provide a comma-separated list of taxon IDs to the pyani download subcommand, e.g.:
pyani download --email my.email@my.domain -t <TAXON_ID1>,<TAXON_ID2>,... -o <OUTPUT_DIRECTORY>
The following command can be used to download assemblies from three different Pseudomonas taxa (P. flexibilis: 706570, P. mosselii: 78327, and P. fulva: 47880):
$ pyani download --email my.email@my.domain -t 706570,78327,47880 -o multi_taxa
GCF_900155995.1_IMG-taxon_2681812811_annotated_assembly_genomic.fna.gz: 2097152it [00:00, 3081776.59it/s]
GCF_900155995.1_IMG-taxon_2681812811_annotated_assembly_hashes.txt: 1048576it [00:00, 63489526.95it/s]
GCF_900101515.1_IMG-taxon_2596583557_annotated_assembly_genomic.fna.gz: 2097152it [00:00, 3194885.99it/s]
GCF_900101515.1_IMG-taxon_2596583557_annotated_assembly_hashes.txt: 1048576it [00:00, 77838775.82it/s]
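When the taxon IDs come from a script rather than being typed by hand, the comma-separated argument can be assembled programmatically. This is a small sketch that uses only the options shown on this page (--email, -t, -o); build_download_command is a hypothetical helper and does not itself run pyani.

```python
import shlex


def build_download_command(email, taxon_ids, outdir):
    """Build an argument list for `pyani download` from a list of taxon IDs."""
    if not taxon_ids:
        raise ValueError("at least one taxon ID is required")
    return [
        "pyani", "download",
        "--email", email,
        # Multiple taxa are passed as a single comma-separated value.
        "-t", ",".join(str(taxon) for taxon in taxon_ids),
        "-o", str(outdir),
    ]


command = build_download_command(
    "my.email@my.domain", [706570, 78327, 47880], "multi_taxa"
)
print(shlex.join(command))
# → pyani download --email my.email@my.domain -t 706570,78327,47880 -o multi_taxa
```

The returned list can be handed to subprocess.run(command, check=True) to start the download; building a list rather than a shell string avoids quoting problems in the output directory name.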
Dry-run test (identify, but do not download, files)
If you only want to see which genomes will be downloaded from NCBI with a given pyani download subcommand, but not download them, then you can use the --dry-run option. For example:
$ pyani download --email my.email@my.domain -t 706570,78327,47880 -o multi_taxa --dry-run
WARNING: Dry run only: will not overwrite or download
WARNING: (dry-run) skipping download of GCF_900155995.1
WARNING: (dry-run) skipping download of GCF_900101515.1
WARNING: (dry-run) skipping download of GCA_001312105.1
WARNING: (dry-run) skipping download of GCF_000806415.1
[...]
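If you capture the dry-run messages to a file or a string, the assembly accessions can be pulled out for review before committing to a large download. The sketch below assumes the message wording shown above ("skipping download of" followed by a GCA/GCF accession); the wording may differ between pyani versions, whether the messages go to standard output or standard error is not stated here, and accessions_from_dry_run is a hypothetical helper.

```python
import re

# Matches the accession at the end of a dry-run "skipping download" message.
SKIP_PATTERN = re.compile(r"skipping download of (GC[AF]_\d+\.\d+)")


def accessions_from_dry_run(log_text):
    """Return assembly accessions named in dry-run messages, in order."""
    return SKIP_PATTERN.findall(log_text)


log = """WARNING: Dry run only: will not overwrite or download
WARNING: (dry-run) skipping download of GCF_900155995.1
WARNING: (dry-run) skipping download of GCA_001312105.1
"""
print(accessions_from_dry_run(log))
# → ['GCF_900155995.1', 'GCA_001312105.1']
```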
Download genomes for compilation of a custom Kraken database
Kraken is a bioinformatics tool that assigns taxonomic identities to short DNA sequences, such as Illumina or Nanopore reads. Several wide-ranging Kraken databases are available for download or are shared within the community, but it can sometimes be useful to construct a custom local Kraken database specific to your organism or taxon of interest (e.g. for filtering out contaminating or suspect reads).
The pyani download command can prepare downloaded genome files for immediate use with the Kraken database-building tools, by specifying the --kraken option:
$ pyani download --email my.email@my.domain -t 706570,78327,47880 -o genomes_kraken --kraken
GCF_900155995.1_IMG-taxon_2681812811_annotated_assembly_genomic.fna.gz: 2097152it [00:00, 3085741.03it/s]
GCF_900155995.1_IMG-taxon_2681812811_annotated_assembly_hashes.txt: 1048576it [00:00, 140958511.30it/s]
WARNING: Modifying downloaded sequence for Kraken compatibility
GCF_900101515.1_IMG-taxon_2596583557_annotated_assembly_genomic.fna.gz: 2097152it [00:01, 1902023.68it/s]
GCF_900101515.1_IMG-taxon_2596583557_annotated_assembly_hashes.txt: 1048576it [00:00, 31428985.47it/s]
WARNING: Modifying downloaded sequence for Kraken compatibility
[...]
$ head multi_taxa/GCA_001312105.1_ASM131210v1_genomic.fna
>BBCY01000001.1 Pseudomonas tuomuerensis JCM 14085 DNA, contig: JCM14085.contig00001, whole genome shotgun sequence
ACCAGCATCTGGCGGATCAGGTCGCGGGCCTTCTCGGCCGATTGGCGGATGCGCCCGAGGTAGCGGCCGAGCGGCGCGTC
GCCGCGCTCGCCCGCCAGCTCCTCGGCCATCTGCGTGTAGCCGAGCATGCTGGTCAGCAGGTTGTTGAAGTCGTGGGCAA
TGCCGCCGGTCAGGTGGCCGATGGCTTCCATGCGCTGCGCCTGGCGCAGCTGCTGTTCCAGCGCCGCCCGCTCCACCTCG
$ head genomes_kraken/GCA_001312105.1_ASM131210v1_genomic.fna
>BBCY01000001.1|kraken:taxid|706570 BBCY01000001.1 Pseudomonas tuomuerensis JCM 14085 DNA, contig: JCM14085.contig00001, whole genome shotgun sequence
ACCAGCATCTGGCGGATCAGGTCGCGGGCCTTCTCGGCCGATTGGCGGATGCGCCCGAGG
TAGCGGCCGAGCGGCGCGTCGCCGCGCTCGCCCGCCAGCTCCTCGGCCATCTGCGTGTAG
CCGAGCATGCTGGTCAGCAGGTTGTTGAAGTCGTGGGCAATGCCGCCGGTCAGGTGGCCG
ATGGCTTCCATGCGCTGCGCCTGGCGCAGCTGCTGTTCCAGCGCCGCCCGCTCCACCTCG
Using this option affects downstream performance or use of pyani only in that the two different input files for the same genome will have distinct hashes:
$ head multi_taxa/GCA_001312105.1_ASM131210v1_genomic.fna.md5
e55cd3d913a198ac60afd8d509c02ab4 multi_taxa/GCA_001312105.1_ASM131210v1_genomic.fna
$ head genomes_kraken/GCA_001312105.1_ASM131210v1_genomic.fna.md5
053fd98d8c9ab30de46f56fd601ef529 genomes_kraken/GCA_001312105.1_ASM131210v1_genomic.fna
and so will not be considered to be the “same sequence” when repeating comparisons.
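The change visible in the two head listings above (a kraken:taxid tag added to each sequence identifier, and sequence lines rewrapped from 80 to 60 characters) can be illustrated with a short sketch. This is inferred from the example output only and is not pyani's own implementation; add_kraken_taxid is a hypothetical function, the taxon ID is supplied by the caller, and every header is assumed to start with a sequence identifier.

```python
def add_kraken_taxid(fasta_text, taxon_id, width=60):
    """Tag FASTA headers with kraken:taxid and rewrap sequence lines."""
    output = []
    sequence_parts = []

    def flush_sequence():
        # Emit the accumulated sequence, wrapped to the requested width.
        sequence = "".join(sequence_parts)
        output.extend(
            sequence[i:i + width] for i in range(0, len(sequence), width)
        )
        sequence_parts.clear()

    for line in fasta_text.splitlines():
        if line.startswith(">"):
            flush_sequence()
            header = line[1:].strip()
            # Assumption: the first whitespace-delimited token is the ID.
            seq_id = header.split(None, 1)[0]
            # Tagged ID first, then the original header as the description.
            output.append(f">{seq_id}|kraken:taxid|{taxon_id} {header}")
        elif line.strip():
            sequence_parts.append(line.strip())
    flush_sequence()
    return "\n".join(output) + "\n"
```

Because both the header and the line wrapping change, the bytes of the file change, which is why the MD5 hash of the Kraken-ready file differs from that of the unmodified download.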