# Distances

MiGA estimates distances (or similarities) between datasets using different techniques. Only genome-to-genome comparisons have been implemented, including the genomes of isolates, metagenome-assembled genomes, or single-cell amplified genomes. No metagenome-to-metagenome or genome-to-metagenome distances are currently available in MiGA.

## Hierarchical approach to distances

For any given pair of genomes, MiGA follows a hierarchical approach to identify the most appropriate metric of similarity:

**1**. First, the genomes are compared using [hAAI](/master/part2/distances.md#haai). If this method is skipped, if it fails, or if the value is greater than 90%, MiGA continues to step 2. Otherwise, this value is used to estimate the AAI, both values are recorded, and the comparison ends.

**2**. Next, MiGA compares genomes using [AAI](/master/part2/distances.md#aai). Whenever the AAI is 85% or higher, MiGA continues to step 3. Otherwise the comparison ends.

**3**. Finally, MiGA estimates [ANI](/master/part2/distances.md#ani).
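The three steps above can be sketched as a small decision function. This is only an illustration of the control flow, not MiGA's actual code: the four callables (`haai_fn`, `aai_from_haai_fn`, `aai_fn`, `ani_fn`) are hypothetical stand-ins for the real analyses, and the thresholds are the percentages quoted in the steps.

```python
from typing import Callable, Dict, Optional

# Thresholds (in percent) as quoted in the steps above.
HAAI_MAX = 90.0  # an hAAI above this value is not used to estimate AAI
AAI_MIN = 85.0   # an AAI at or above this value triggers ANI


def compare(
    haai_fn: Callable[[], Optional[float]],
    aai_from_haai_fn: Callable[[float], float],
    aai_fn: Callable[[], float],
    ani_fn: Callable[[], float],
) -> Dict[str, float]:
    """Return the metrics recorded for one pair of genomes.

    All callables are hypothetical placeholders. `haai_fn` is assumed to
    return None when hAAI is skipped or fails.
    """
    # Step 1: try hAAI and stop here if it is usable.
    haai = haai_fn()
    if haai is not None and haai <= HAAI_MAX:
        return {"haai": haai, "aai": aai_from_haai_fn(haai)}

    # Step 2: full AAI. Whether a discarded hAAI is kept is not stated
    # in the text; this sketch simply drops it.
    result = {"aai": aai_fn()}

    # Step 3: ANI only for sufficiently close pairs.
    if result["aai"] >= AAI_MIN:
        result["ani"] = ani_fn()
    return result


# Made-up values for three kinds of pairs.
distant = compare(lambda: 60.0, lambda h: h - 5.0, lambda: 0.0, lambda: 0.0)
no_haai = compare(lambda: None, lambda h: h, lambda: 70.0, lambda: 0.0)
close = compare(lambda: 95.0, lambda h: h, lambda: 92.0, lambda: 97.5)
```

Passing the analyses in as callables keeps the expensive steps lazy: AAI and ANI are only computed when the previous step calls for them.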

## hAAI

**Heuristic Average Amino Acid Identity**. The hAAI is the average amino acid identity between the highly conserved proteins of two genomes, as identified by [essential genes](/master/part5/workflow.md#essential-genes). It is used to estimate AAI for distant pairs, but it loses resolution between close relatives. This metric is completely bypassed in projects of [type clade](/master/part2/types.md#clade) as well as in projects with the [metadata field](/master/part5/metadata.md#projects) `haai_p=no`. The same field also controls the software used as a search engine: `blast+` (default), `blast`, `blat`, or `diamond`.

## AAI

**Average Amino Acid Identity**. The AAI is the average amino acid identity between all proteins of two genomes, as identified by [cds](/master/part5/workflow.md#cds). When running this analysis, the intermediate reciprocal best matches (RBMs) are also stored in projects of [type clade](/master/part2/types.md#clade). This feature can be turned off to save storage space, or forced on in any project type, using the [metadata field](/master/part5/metadata.md#projects) `aai_save_rbm=false` or `aai_save_rbm=true`, respectively. The software used as a search engine can be controlled using the [metadata field](/master/part5/metadata.md#projects) `aai_p`: `blast+` (default), `blast`, `blat`, or `diamond`. [Workflows](/master/part6.md) use `blast+` by default, or `diamond` (when available) if the flag `--fast` is passed.

## ANI

**Average Nucleotide Identity**. The ANI is the average nucleotide identity between fragments of two genomes. The software used as a search engine can be controlled using the [metadata field](/master/part5/metadata.md#projects) `ani_p`: `blast+` (default), `blast`, `blat`, or `fastani`. [Workflows](/master/part6.md) use `blast+` by default, or `fastani` (when available) if the flag `--fast` is passed.

## SQLite3 schema

The information on the different similarity metrics above is stored in SQLite3 database files. The general schema for [hAAI](/master/part2/distances.md#haai) and [AAI](/master/part2/distances.md#aai) is:

```sql
CREATE TABLE aai(
  seq1 varchar(256), seq2 varchar(256),
  aai float, sd float, n int, omega int
);
CREATE TABLE rbm(
  seq1 varchar(256), seq2 varchar(256),
  id1 varchar(256), id2 varchar(256),
  id float, evalue float, bitscore float
);
```

The `aai` table holds the metric values: the names of the two genomes compared (`seq1` and `seq2`), the AAI or hAAI as a percentage (`aai`), the standard deviation across proteins, when available (`sd`), the total number of RBMs (`n`), and the smaller of the two genomes' protein counts (`omega`). The `rbm` table holds the RBMs for [AAI](/master/part2/distances.md#aai) when they are stored, and is always empty for [hAAI](/master/part2/distances.md#haai).
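As a minimal sketch of reading this table, the following uses Python's standard `sqlite3` module against an in-memory database built from the `aai` schema above. The genome names and values are made up, and `lookup_aai` is a hypothetical helper, not part of MiGA. The query checks both orders of the pair because this page does not state in which order a pair is stored.

```python
import sqlite3


def lookup_aai(conn, a, b):
    """Return (aai, sd, n, omega) for genomes a and b, or None if absent."""
    return conn.execute(
        "SELECT aai, sd, n, omega FROM aai "
        "WHERE (seq1 = ? AND seq2 = ?) OR (seq1 = ? AND seq2 = ?) "
        "LIMIT 1",
        (a, b, b, a),
    ).fetchone()


# In-memory database with the aai table from the schema above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE aai("
    "seq1 varchar(256), seq2 varchar(256), "
    "aai float, sd float, n int, omega int)"
)
# Made-up row, for illustration only.
conn.execute(
    "INSERT INTO aai VALUES (?, ?, ?, ?, ?, ?)",
    ("genome_A", "genome_B", 72.5, 14.1, 1200, 2500),
)

hit = lookup_aai(conn, "genome_B", "genome_A")
miss = lookup_aai(conn, "genome_A", "genome_Z")
print(hit, miss)
```

To read a real result, replace `":memory:"` with the path to the corresponding database file and skip the `CREATE TABLE` and `INSERT` statements.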

The general schema for [ANI](/master/part2/distances.md#ani) is:

```sql
CREATE TABLE ani(
  seq1 varchar(256), seq2 varchar(256),
  ani float, sd float, n int, omega int
);
CREATE TABLE rbm(
  seq1 varchar(256), seq2 varchar(256),
  id1 int, id2 int, id float,
  evalue float, bitscore float
);
CREATE TABLE regions(
  seq varchar(256), id int, source varchar(256),
  `start` int, `end` int
);
```

The `ani` table holds the metric values: the names of the two genomes compared (`seq1` and `seq2`), the ANI as a percentage (`ani`), the standard deviation across fragments, when available (`sd`), the total number of RBM fragments (`n`), and the smaller of the two genomes' fragment counts (`omega`). The tables `rbm` and `regions` are always empty.
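The `ani` table can be queried in the same way. The sketch below lists the genomes related to a given genome at or above a chosen ANI, again on an in-memory database with made-up rows. `relatives_by_ani` and its `min_ani` parameter are hypothetical names introduced for illustration; the cutoff is left to the caller.

```python
import sqlite3


def relatives_by_ani(conn, genome, min_ani):
    """Return [(other_genome, ani), ...] with ani >= min_ani, best first."""
    return conn.execute(
        "SELECT CASE WHEN seq1 = ? THEN seq2 ELSE seq1 END AS other, ani "
        "FROM ani "
        "WHERE (seq1 = ? OR seq2 = ?) AND ani >= ? "
        "ORDER BY ani DESC",
        (genome, genome, genome, min_ani),
    ).fetchall()


# In-memory database with the ani table from the schema above.
ani_conn = sqlite3.connect(":memory:")
ani_conn.execute(
    "CREATE TABLE ani("
    "seq1 varchar(256), seq2 varchar(256), "
    "ani float, sd float, n int, omega int)"
)
# Made-up rows, for illustration only.
ani_conn.executemany(
    "INSERT INTO ani VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("genome_A", "genome_B", 96.2, 3.1, 900, 1100),
        ("genome_C", "genome_A", 98.7, 1.9, 1000, 1050),
        ("genome_A", "genome_D", 80.1, 6.4, 300, 1000),
    ],
)

relatives = relatives_by_ani(ani_conn, "genome_A", 95.0)
print(relatives)
```

The `CASE` expression returns whichever member of the pair is not the query genome, so the result does not depend on the order in which the pair was stored.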

