Distances
MiGA estimates distances (or similarities) between datasets using different techniques. Only genome-to-genome comparisons have been implemented, including the genomes of isolates, metagenome-assembled genomes, or single-cell amplified genomes. No metagenome-to-metagenome or genome-to-metagenome distances are currently available in MiGA.
For any given pair of genomes, MiGA follows a hierarchical approach to identify the most appropriate metric of similarity:
1. First, the genomes are compared using hAAI. If this method is skipped, if it fails, or if the value is greater than 90%, MiGA continues to step 2. Otherwise, this value is used to estimate the AAI, both values are recorded, and the comparison ends.
2. Next, MiGA compares genomes using AAI. Whenever the AAI is 85% or higher, MiGA continues to step 3. Otherwise, the comparison ends.
3. Finally, MiGA estimates ANI.
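The three steps above can be sketched as a small decision function. This is only an illustration of the control flow, not MiGA's implementation: the callables `haai`, `aai`, `ani`, and `aai_from_haai` are hypothetical stand-ins for the actual comparisons, and each is assumed to return a percentage or `None` on failure. Whether an hAAI above 90% is still recorded is not stated in the text, so the sketch leaves it out.

```python
def compare_pair(a, b, haai, aai, ani, aai_from_haai, skip_haai=False):
    """Walk the hAAI -> AAI -> ANI hierarchy for one pair of genomes.

    All four callables are hypothetical placeholders supplied by the caller.
    Returns a dict with the metrics that were recorded for the pair.
    """
    result = {}

    # Step 1: hAAI, unless skipped. A failure is modeled as None.
    h = None if skip_haai else haai(a, b)
    if h is not None and h <= 90.0:
        # Distant pair: record hAAI and the AAI estimated from it, then stop.
        result["haai"] = h
        result["aai"] = aai_from_haai(h)
        return result

    # Step 2: full AAI.
    aai_value = aai(a, b)
    if aai_value is None:
        return result
    result["aai"] = aai_value

    # Step 3: ANI only for close relatives (AAI of 85% or higher).
    if aai_value >= 85.0:
        result["ani"] = ani(a, b)
    return result
```

With toy callables, a pair with an hAAI of 70% stops at step 1, while a pair with an hAAI of 95% and an AAI of 90% proceeds all the way to ANI.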
Heuristic Average Amino Acid Identity. The hAAI is the average amino acid identity between the highly conserved proteins of two genomes, as identified by . It is used to estimate AAI for distant pairs, but it loses resolution between close relatives. This metric is completely bypassed in projects of as well as projects with the haai_p=no option. This field also controls the software used: blast+ (default), blast, blat, or diamond.
Average Amino Acid Identity. The AAI is the average amino acid identity between all proteins of two genomes, as identified by . When running this analysis, the intermediate reciprocal best matches (RBMs) are also stored in projects of . This feature can be turned off to save storage space or forced to be on in any project type using aai_save_rbm=false or aai_save_rbm=true, respectively. The software used as a search engine can be controlled using the aai_p option: blast+ (default), blast, blat, or diamond. use blast+ by default, or diamond if the flag --fast is passed (whenever available).
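To make the idea of reciprocal best matches concrete, the following is a minimal sketch that derives RBMs from two toy hit lists and averages their identities. It is not MiGA's code: the hit tuple layout `(query, subject, identity, bitscore)`, the choice of bit score to rank hits, and the use of the forward-direction identity are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def best_hits(hits):
    """Map each query to its top-scoring hit as (subject, identity, bitscore)."""
    best = {}
    for query, subject, identity, bitscore in hits:
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, identity, bitscore)
    return best

def toy_aai(hits_ab, hits_ba, n_proteins_a, n_proteins_b):
    """Average the identity over reciprocal best matches between A and B."""
    best_ab = best_hits(hits_ab)
    best_ba = best_hits(hits_ba)
    # A pair is an RBM when each protein is the other's best hit.
    identities = [
        identity
        for query, (subject, identity, _) in best_ab.items()
        if subject in best_ba and best_ba[subject][0] == query
    ]
    if not identities:
        return None
    return {
        "aai": mean(identities),                   # average identity across RBMs
        "sd": pstdev(identities),                  # spread across RBMs
        "n": len(identities),                      # number of RBMs
        "omega": min(n_proteins_a, n_proteins_b),  # smaller proteome size
    }
```

The keys of the returned dictionary mirror the quantities that the text describes for each comparison: the identity, its standard deviation, the number of RBMs, and the smaller number of proteins.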
Average Nucleotide Identity. The ANI is the average nucleotide identity between fragments of two genomes. The software used as a search engine can be controlled using the ani_p option: blast+ (default), blast, blat, or fastani. use blast+ by default, or fastani if the flag --fast is passed (whenever available).
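Because ANI is computed over fragments, a comparison first needs the genome cut into pieces. The sketch below shows one simple way to do that with a sliding window. The `window` and `step` parameters are hypothetical and not taken from MiGA, and the tools listed above handle fragmentation internally.

```python
def fragments(sequence, window, step):
    """Yield fixed-length fragments of a sequence using a sliding window.

    Trailing pieces shorter than `window` are dropped, so every fragment
    has the same length.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    for start in range(0, len(sequence) - window + 1, step):
        yield sequence[start:start + window]
```

Each fragment would then be searched against the other genome, and the identities of the matching fragments averaged.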
The information on the different similarity metrics above is stored in SQLite3 database files.
In the general schema for hAAI and AAI, the aai table holds the metric values, including: the names of the genomes compared (seq1 and seq2), the AAI or hAAI as a percentage (aai), the standard deviation across proteins when available (sd), the total number of RBMs (n), and the smaller number of proteins from the two genomes (omega). The rbm table holds the RBMs for AAI (when they were stored) and is always empty for hAAI.
In the general schema for ANI, the ani table holds the metric values, including: the names of the genomes compared (seq1 and seq2), the ANI as a percentage (ani), the standard deviation across fragments when available (sd), the total number of RBM fragments (n), and the smaller number of fragments from the two genomes (omega). The tables rbm and regions are always empty.
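As an illustration of how such a database could be read, here is a minimal sketch using Python's standard sqlite3 module. The column names of the aai table come from the description in this section, but the column types, the in-memory demo database, and the sample row are assumptions made up for the example and are not MiGA's actual schema or data.

```python
import sqlite3

def make_demo_db():
    """Build an in-memory database with an assumed layout for the aai table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE aai ("
        "seq1 TEXT, seq2 TEXT, aai REAL, sd REAL, n INTEGER, omega INTEGER)"
    )
    # Made-up values, for demonstration only.
    conn.execute(
        "INSERT INTO aai VALUES (?, ?, ?, ?, ?, ?)",
        ("genome_A", "genome_B", 72.5, 12.1, 850, 1900),
    )
    conn.commit()
    return conn

def fetch_aai(conn, seq1, seq2):
    """Return (aai, sd, n, omega) for a pair of genomes, or None if absent."""
    return conn.execute(
        "SELECT aai, sd, n, omega FROM aai WHERE seq1 = ? AND seq2 = ?",
        (seq1, seq2),
    ).fetchone()
```

The ani table could be read the same way by swapping the table name and the aai column for ani.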