Indexing Genomes
Indexing genomes is a necessary step in pyani to prepare input genomes for analysis.
Attention
If you use the pyani download subcommand (see Downloading Genomes from NCBI) to obtain genomes for analysis, then indexing is carried out automatically. However, if you collect a local set of genomes (e.g. from your own sequencing project), then you will need to index the genomes with the pyani index subcommand.
For more information about the pyani index subcommand, please see the pyani index page, or issue the command pyani index -h to see the inline help.
What does indexing do?
In the context of pyani, indexing refers to generating an index code that is unique to each input genome FASTA file in the input directory. The index code is the MD5 hash for the FASTA file.
This MD5 index code is used to identify each specific input genome sequence (and associated metadata), so that duplicate comparisons can be readily identified and previous results can be reused from the pyani database if they are available.
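As an illustration of why a content hash makes duplicates easy to spot, here is a minimal sketch using Python's standard hashlib module. It is not pyani's own implementation: the function names md5_of_file and group_by_hash are hypothetical, and the sketch assumes the hash is taken over the raw bytes of the file.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file's raw bytes, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large genomes need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_hash(paths):
    """Map each MD5 digest to the names of the files that share it."""
    groups = {}
    for path in paths:
        groups.setdefault(md5_of_file(path), []).append(Path(path).name)
    return groups
```

Any digest that maps to more than one file name marks byte-identical inputs, whatever the files are called.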
Indexing also generates two files (see Downloading Genomes from NCBI):
classes.txt: each genome is assigned a class, which is used to annotate genomes in the graphical output. pyani attempts to infer genus and species as the default class.
labels.txt: each genome is assigned a text label, which is used to label genomes in the graphical output. pyani attempts to infer genus, species, and strain ID as the default label.
These files are used to associate labels and classes with the genome files in the pyani database, specific to the analysis run. Both classes.txt and labels.txt can be edited to suit the user’s classification and labelling scheme.
Index a directory of FASTA files
The basic form of the command is:
pyani index -i <GENOME_DIRECTORY>
This instructs pyani to search <GENOME_DIRECTORY> for files with a standard FASTA suffix (.fna, .fasta, .fa, .fas, .fsa_nt). For each file found, it calculates the MD5 hash and writes it to an accompanying file with extension .md5. The hash is then associated with a genome label and a genome class, written to the two files labels.txt and classes.txt (see above).
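The file discovery step can be sketched in a few lines of Python. This is an illustrative approximation only: the name find_fasta_files is hypothetical, and the sketch assumes a non-recursive, case-sensitive match on the final suffix, which may differ from what pyani itself does.

```python
from pathlib import Path

# Suffixes listed in the text above.
FASTA_SUFFIXES = {".fna", ".fasta", ".fa", ".fas", ".fsa_nt"}


def find_fasta_files(genome_dir):
    """Return sorted paths in genome_dir whose final suffix is a FASTA suffix."""
    return sorted(
        path
        for path in Path(genome_dir).iterdir()
        # Only the last suffix counts, so "x.fna.md5" is skipped.
        if path.is_file() and path.suffix in FASTA_SUFFIXES
    )
```

Because only the final suffix is tested, the .md5 files and the classes.txt and labels.txt files written by a previous indexing run are not picked up as genomes.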
For example, if we have a directory called unindexed that contains some FASTA format genome sequence files:
$ tree unindexed
unindexed
├── GCA_001312105.1_ASM131210v1_genomic.fna
├── GCF_000834555.1_ASM83455v1_genomic.fna
└── GCF_005796105.1_ASM579610v1_genomic.fna
We could run the pyani index command:
$ pyani index -i unindexed/
$ tree unindexed
unindexed
├── GCA_001312105.1_ASM131210v1_genomic.fna
├── GCA_001312105.1_ASM131210v1_genomic.fna.md5
├── GCF_000834555.1_ASM83455v1_genomic.fna
├── GCF_000834555.1_ASM83455v1_genomic.fna.md5
├── GCF_005796105.1_ASM579610v1_genomic.fna
├── GCF_005796105.1_ASM579610v1_genomic.fna.md5
├── classes.txt
└── labels.txt
This creates an .md5 file for each genome, and corresponding classes.txt and labels.txt files:
$ head unindexed/GCA_001312105.1_ASM131210v1_genomic.fna
>BBCY01000001.1 Pseudomonas tuomuerensis JCM 14085 DNA, contig: JCM14085.contig00001, whole genome shotgun sequence
ACCAGCATCTGGCGGATCAGGTCGCGGGCCTTCTCGGCCGATTGGCGGATGCGCCCGAGGTAGCGGCCGAGCGGCGCGTC
GCCGCGCTCGCCCGCCAGCTCCTCGGCCATCTGCGTGTAGCCGAGCATGCTGGTCAGCAGGTTGTTGAAGTCGTGGGCAA
$ head unindexed/GCA_001312105.1_ASM131210v1_genomic.fna.md5
e55cd3d913a198ac60afd8d509c02ab4 unindexed/GCA_001312105.1_ASM131210v1_genomic.fna
$ head unindexed/classes.txt
527f35b3eb9dd371d8d5309b6043dd9f GCF_000834555.1_ASM83455v1_genomic Pseudomonas fulva strain MEJ086 contig_1, whole genome shotgun sequence
b00c5b1f636b8083b68b128e7ee28a40 GCF_005796105.1_ASM579610v1_genomic Pseudomonas mosselii strain SC006 Scaffold1, whole genome shotgun sequence
e55cd3d913a198ac60afd8d509c02ab4 GCA_001312105.1_ASM131210v1_genomic Pseudomonas tuomuerensis JCM 14085 DNA, contig: JCM14085.contig00001, whole genome shotgun sequence
$ head unindexed/labels.txt
527f35b3eb9dd371d8d5309b6043dd9f GCF_000834555.1_ASM83455v1_genomic Pseudomonas fulva strain MEJ086 contig_1, whole genome shotgun sequence
b00c5b1f636b8083b68b128e7ee28a40 GCF_005796105.1_ASM579610v1_genomic Pseudomonas mosselii strain SC006 Scaffold1, whole genome shotgun sequence
e55cd3d913a198ac60afd8d509c02ab4 GCA_001312105.1_ASM131210v1_genomic Pseudomonas tuomuerensis JCM 14085 DNA, contig: JCM14085.contig00001, whole genome shotgun sequence
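The .md5 file in the example above pairs a hash with the path of the file it describes. The following sketch checks whether a FASTA file still matches its recorded hash. It assumes the layout shown above (hash first, then whitespace, then the path) and that the hash covers the raw bytes of the file; the function names are hypothetical and not part of pyani.

```python
import hashlib
from pathlib import Path


def read_md5_file(md5_path):
    """Return the hash recorded as the first field of a .md5 file."""
    text = Path(md5_path).read_text().strip()
    return text.split(maxsplit=1)[0]


def matches_recorded_hash(fasta_path):
    """True if the file's current MD5 equals the hash in <fasta_path>.md5."""
    fasta_path = Path(fasta_path)
    # The sidecar file sits next to the FASTA file, with ".md5" appended.
    md5_path = fasta_path.with_name(fasta_path.name + ".md5")
    current = hashlib.md5(fasta_path.read_bytes()).hexdigest()
    return current == read_md5_file(md5_path)
```

A False result suggests the sequence file was modified after it was indexed, in which case re-indexing would be the cautious choice.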
Tip
The class and label information produced by pyani index is different to that generated with pyani download. Genus, species and strain identifiers can reliably be obtained from NCBI metadata when downloading genomes, but with user-provided sequences the information may not be encoded in the sequence description line in a standard manner.
As a result, when using pyani index it is often useful to edit the classes.txt and labels.txt directly, or generate these files in some other way.
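One way to generate these files yourself is a short script. The sketch below assumes each line holds three tab-separated fields (MD5 hash, file name without extension, annotation), which matches the column order in the example output above. The tab separator is an assumption, so check it against a file written by pyani index before relying on it. The hashes and file stems come from that example; the shortened labels and classes are illustrative choices, and the function names are hypothetical.

```python
from pathlib import Path


def write_annotations(out_path, rows):
    """Write (md5_hash, file_stem, annotation) tuples, one tab-separated line each."""
    lines = ["\t".join(row) for row in rows]
    Path(out_path).write_text("\n".join(lines) + "\n")


def split_rows(genomes):
    """Split (hash, stem, label, class) records into label rows and class rows."""
    labels = [(md5, stem, label) for md5, stem, label, _ in genomes]
    classes = [(md5, stem, cls) for md5, stem, _, cls in genomes]
    return labels, classes


# Hashes and stems are from the example above; labels and classes are
# hand-written illustrations, not pyani output.
GENOMES = [
    ("527f35b3eb9dd371d8d5309b6043dd9f", "GCF_000834555.1_ASM83455v1_genomic",
     "P. fulva MEJ086", "Pseudomonas fulva"),
    ("b00c5b1f636b8083b68b128e7ee28a40", "GCF_005796105.1_ASM579610v1_genomic",
     "P. mosselii SC006", "Pseudomonas mosselii"),
    ("e55cd3d913a198ac60afd8d509c02ab4", "GCA_001312105.1_ASM131210v1_genomic",
     "P. tuomuerensis JCM 14085", "Pseudomonas tuomuerensis"),
]
```

Calling split_rows(GENOMES) and passing each result to write_annotations, with output paths such as unindexed/labels.txt and unindexed/classes.txt, would overwrite the default files with the hand-curated annotations.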