# Distances

MiGA estimates distances (or similarities) between datasets using different techniques. Only genome-to-genome comparisons have been implemented, including the genomes of isolates, metagenome-assembled genomes, or single-cell amplified genomes. No metagenome-to-metagenome or genome-to-metagenome distances are currently available in MiGA.

## Hierarchical approach to distances

For any given pair of genomes, MiGA follows a hierarchical approach to identify the most appropriate metric of similarity:

**1**. First, the genomes are compared using [hAAI](#haai). If this method is skipped, if it fails, or if the value is greater than 90%, MiGA continues to step 2. Otherwise, this value is used to estimate the AAI, both values are recorded, and the comparison ends.

**2**. Next, MiGA compares genomes using [AAI](#aai). Whenever the AAI is 85% or higher, MiGA continues to step 3. Otherwise the comparison ends.

**3**. Finally, MiGA estimates [ANI](#ani).
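The three steps above can be sketched as a short decision function. This is only an illustration of the control flow, not MiGA's actual code: the name `compare_pair`, the callables passed in, and the returned dictionary are hypothetical, and a failed AAI run is assumed to end the comparison. The conversion from hAAI to an estimated AAI is not reproduced here.

```python
from typing import Callable, Dict, Optional

# A metric callable returns a percentage, or None if the method failed.
Metric = Callable[[], Optional[float]]

HAAI_MAX = 90.0  # hAAI values above this send the pair on to step 2
AAI_MIN = 85.0   # AAI values at or above this send the pair on to step 3


def compare_pair(haai: Metric, aai: Metric, ani: Metric,
                 skip_haai: bool = False) -> Dict[str, float]:
    """Walk the three steps and return the metrics that were recorded."""
    recorded: Dict[str, float] = {}

    # Step 1: hAAI, unless skipped; None models a failed run.
    h = None if skip_haai else haai()
    if h is not None and h <= HAAI_MAX:
        recorded["haai"] = h
        # The AAI estimated from hAAI would also be recorded here;
        # that estimation is outside the scope of this sketch.
        return recorded

    # Step 2: full AAI.
    a = aai()
    if a is None:  # assumption: a failed AAI ends the comparison
        return recorded
    recorded["aai"] = a
    if a < AAI_MIN:
        return recorded

    # Step 3: ANI for close relatives.
    n = ani()
    if n is not None:
        recorded["ani"] = n
    return recorded


# A distant pair stops at step 1.
print(compare_pair(lambda: 62.5, lambda: 70.0, lambda: 80.0))
# {'haai': 62.5}
```

Passing the metrics as callables keeps the expensive searches lazy, so AAI and ANI are only computed when the previous step calls for them.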

## hAAI

**Heuristic Average Amino Acid Identity**. The hAAI is the average amino acid identity between the highly conserved proteins of two genomes, as identified by [essential genes](https://manual.microbial-genomes.org/part5/workflow#essential-genes). It is used to estimate AAI for distant pairs, but it loses resolution between close relatives. This metric is completely bypassed in projects of [type clade](https://manual.microbial-genomes.org/types#clade) as well as in projects with the [metadata field](https://manual.microbial-genomes.org/part5/metadata#projects) `haai_p=no`. The same field also controls the software used as a search engine: `blast+` (default), `blast`, `blat`, or `diamond`.

## AAI

**Average Amino Acid Identity**. The AAI is the average amino acid identity between all proteins of two genomes, as identified by [cds](https://manual.microbial-genomes.org/part5/workflow#cds). When running this analysis, the intermediate reciprocal best matches (RBMs) are also stored in projects of [type clade](https://manual.microbial-genomes.org/types#clade). This feature can be turned off to save storage space, or forced on in any project type, using the [metadata field](https://manual.microbial-genomes.org/part5/metadata#projects) `aai_save_rbm=false` or `aai_save_rbm=true`, respectively. The software used as a search engine can be controlled using the [metadata field](https://manual.microbial-genomes.org/part5/metadata#projects) `aai_p`: `blast+` (default), `blast`, `blat`, or `diamond`. [Workflows](https://manual.microbial-genomes.org/part6) use `blast+` by default, or `diamond` (whenever available) if the flag `--fast` is passed.

## ANI

**Average Nucleotide Identity**. The ANI is the average nucleotide identity between fragments of two genomes. The software used as a search engine can be controlled using the [metadata field](https://manual.microbial-genomes.org/part5/metadata#projects) `ani_p`: `blast+` (default), `blast`, `blat`, or `fastani`. [Workflows](https://manual.microbial-genomes.org/part6) use `blast+` by default, or `fastani` (whenever available) if the flag `--fast` is passed.

## SQLite3 schema

The information on the different similarity metrics above is stored in SQLite3 database files. The general schema for [hAAI](#haai) and [AAI](#aai) is:

```sql
CREATE TABLE aai(
  seq1 varchar(256), seq2 varchar(256),
  aai float, sd float, n int, omega int
);
CREATE TABLE rbm(
  seq1 varchar(256), seq2 varchar(256),
  id1 varchar(256), id2 varchar(256),
  id float, evalue float, bitscore float
);
```

The `aai` table holds the metric values: the names of the genomes compared (`seq1` and `seq2`), the AAI or hAAI as a percentage (`aai`), the standard deviation across proteins, when available (`sd`), the total number of RBMs (`n`), and the smaller of the two genomes' protein counts (`omega`). The `rbm` table holds the RBMs for [AAI](#aai) when they have been stored, and is always empty for [hAAI](#haai).
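As a minimal sketch of how these tables could be read, the following example builds an in-memory database with the schema above and looks up one pair using Python's standard `sqlite3` module. The genome names, protein identifiers, and all values are made up for illustration, and `read_aai` is a hypothetical helper, not part of MiGA. The lookup matches only the given `seq1`/`seq2` order.

```python
import sqlite3

AAI_SCHEMA = """
CREATE TABLE aai(
  seq1 varchar(256), seq2 varchar(256),
  aai float, sd float, n int, omega int
);
CREATE TABLE rbm(
  seq1 varchar(256), seq2 varchar(256),
  id1 varchar(256), id2 varchar(256),
  id float, evalue float, bitscore float
);
"""


def read_aai(conn, seq1, seq2):
    """Return the metric row for a pair plus its stored RBM count, or None."""
    row = conn.execute(
        "SELECT aai, sd, n, omega FROM aai WHERE seq1 = ? AND seq2 = ?",
        (seq1, seq2),
    ).fetchone()
    if row is None:
        return None
    stored = conn.execute(
        "SELECT COUNT(*) FROM rbm WHERE seq1 = ? AND seq2 = ?",
        (seq1, seq2),
    ).fetchone()[0]
    return {"aai": row[0], "sd": row[1], "n": row[2],
            "omega": row[3], "stored_rbms": stored}


# In-memory stand-in for a real database file; all values are invented.
conn = sqlite3.connect(":memory:")
conn.executescript(AAI_SCHEMA)
conn.execute("INSERT INTO aai VALUES (?, ?, ?, ?, ?, ?)",
             ("genome_A", "genome_B", 71.3, 14.2, 1250, 2100))
conn.execute("INSERT INTO rbm VALUES (?, ?, ?, ?, ?, ?, ?)",
             ("genome_A", "genome_B", "A_00001", "B_00042", 83.5, 1e-50, 210.0))

result = read_aai(conn, "genome_A", "genome_B")
print(result)
```

For a real project, `sqlite3.connect` would be pointed at the database file instead of `":memory:"`. A `stored_rbms` of zero is expected for hAAI, or for AAI when RBM storage is turned off.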

The general schema for [ANI](#ani) is:

```sql
CREATE TABLE ani(
  seq1 varchar(256), seq2 varchar(256),
  ani float, sd float, n int, omega int
);
CREATE TABLE rbm(
  seq1 varchar(256), seq2 varchar(256),
  id1 int, id2 int, id float,
  evalue float, bitscore float
);
CREATE TABLE regions(
  seq varchar(256), id int, source varchar(256),
  `start` int, `end` int
);
```

The `ani` table holds the metric values: the names of the genomes compared (`seq1` and `seq2`), the ANI as a percentage (`ani`), the standard deviation across fragments, when available (`sd`), the total number of RBM fragments (`n`), and the smaller of the two genomes' fragment counts (`omega`). The tables `rbm` and `regions` are always empty.
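The ANI tables can be queried in the same way. The sketch below, again using Python's standard `sqlite3` module with an in-memory database, lists the genomes whose ANI against a query genome reaches a chosen cutoff, and counts the rows in the auxiliary tables. It assumes a single database may hold several pairs. The helper names, genome names, values, and the 95% cutoff are all illustrative assumptions, not MiGA defaults.

```python
import sqlite3

ANI_SCHEMA = """
CREATE TABLE ani(
  seq1 varchar(256), seq2 varchar(256),
  ani float, sd float, n int, omega int
);
CREATE TABLE rbm(
  seq1 varchar(256), seq2 varchar(256),
  id1 int, id2 int, id float,
  evalue float, bitscore float
);
CREATE TABLE regions(
  seq varchar(256), id int, source varchar(256),
  `start` int, `end` int
);
"""


def close_relatives(conn, query, min_ani):
    """Genomes compared against `query` with ANI >= min_ani, closest first."""
    return conn.execute(
        "SELECT seq2, ani FROM ani WHERE seq1 = ? AND ani >= ? "
        "ORDER BY ani DESC",
        (query, min_ani),
    ).fetchall()


def auxiliary_row_counts(conn):
    """Row counts of the rbm and regions tables (expected to be zero)."""
    return {
        table: conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        for table in ("rbm", "regions")
    }


# In-memory stand-in for a real database file; all values are invented.
conn = sqlite3.connect(":memory:")
conn.executescript(ANI_SCHEMA)
conn.executemany(
    "INSERT INTO ani VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("genome_A", "genome_B", 96.1, 3.4, 2800, 3100),
        ("genome_A", "genome_C", 98.7, 2.1, 3000, 3050),
        ("genome_A", "genome_D", 88.2, 6.0, 1900, 2950),
    ],
)

relatives = close_relatives(conn, "genome_A", 95.0)
counts = auxiliary_row_counts(conn)
print(relatives)
print(counts)
```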
