MiGA Manual

Distances


Last updated 4 years ago

MiGA estimates distances (or similarities) between datasets using different techniques. Only genome-to-genome comparisons have been implemented, including the genomes of isolates, metagenome-assembled genomes, or single-cell amplified genomes. No metagenome-to-metagenome or genome-to-metagenome distances are currently available in MiGA.

Hierarchical approach to distances

For any given pair of genomes, MiGA follows a hierarchical approach to identify the most appropriate metric of similarity:

1. First, the genomes are compared using hAAI. If this method is skipped, if it fails, or if the value is greater than 90%, MiGA continues to step 2. Otherwise, this value is used to estimate the AAI, both values are recorded, and the comparison ends.

2. Next, MiGA compares genomes using AAI. Whenever the AAI is 85% or higher, MiGA continues to step 3. Otherwise, the comparison ends.

3. Finally, MiGA estimates ANI.
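
The three steps above can be sketched as a small decision function. This is only an illustration of the control flow, not MiGA's own code: the function name `compare`, its callable arguments, and the result keys are hypothetical, and the conversion from hAAI to AAI is passed in as a placeholder because its formula is not given here. The text also does not say whether an hAAI value above 90% is still stored, so this sketch simply omits it.

```python
from typing import Callable, Dict, Optional

HAAI_MAX = 90.0  # hAAI values above this are not used to estimate AAI
AAI_MIN = 85.0   # AAI values at or above this trigger the ANI step


def compare(
    haai: Callable[[], Optional[float]],
    aai_from_haai: Callable[[float], float],
    aai: Callable[[], float],
    ani: Callable[[], float],
    use_haai: bool = True,
) -> Dict[str, float]:
    """Return the metrics recorded for one genome pair (hypothetical helper)."""
    # Step 1: hAAI, unless skipped; a None result stands for a failed run.
    h = haai() if use_haai else None
    if h is not None and h <= HAAI_MAX:
        # Distant pair: keep hAAI and the AAI estimated from it, then stop.
        return {"haai": h, "aai": aai_from_haai(h)}
    # Step 2: full AAI.
    a = aai()
    result = {"aai": a}
    if a < AAI_MIN:
        return result
    # Step 3: ANI, only for close relatives.
    result["ani"] = ani()
    return result


# Placeholder estimator: NOT the real hAAI-to-AAI conversion.
placeholder_estimate = lambda h: h

distant = compare(lambda: 62.0, placeholder_estimate, lambda: 0.0, lambda: 0.0)
close = compare(lambda: 95.0, placeholder_estimate, lambda: 92.0, lambda: 97.0)
```

Passing the metrics as callables keeps the more expensive AAI and ANI computations from running unless the earlier step requires them, which is the point of the hierarchy.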

hAAI

Heuristic Average Amino Acid Identity. The hAAI is the average amino acid identity between the highly conserved proteins of two genomes, as identified by essential genes. It is used to estimate AAI for distant pairs, but it loses resolution between close relatives. This metric is completely bypassed in projects of type clade, as well as in projects with the metadata field haai_p=no. This field also controls the software used: blast+ (default), blast, blat, or diamond.

AAI

Average Amino Acid Identity. The AAI is the average amino acid identity between all proteins of two genomes, as identified by cds. When running this analysis, the intermediate reciprocal best matches (RBMs) are also stored in projects of type clade. This feature can be turned off to save storage space, or forced on in any project type, using the metadata field aai_save_rbm=false or aai_save_rbm=true, respectively. The software used as a search engine can be controlled using the metadata field aai_p: blast+ (default), blast, blat, or diamond. Workflows use blast+ by default, or diamond if the flag --fast is passed (whenever available).
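
As a rough illustration of how a set of RBMs reduces to the values MiGA records for a pair, the sketch below summarizes a list of RBM identities. It assumes the AAI is the plain mean of the RBM identities and that the standard deviation is the sample standard deviation; both are assumptions of this example. The function name `summarize_rbms` and its arguments are hypothetical, while the keys mirror the columns of the aai table described under SQLite3 schema.

```python
from statistics import mean, stdev
from typing import Dict, Optional, Sequence


def summarize_rbms(
    identities: Sequence[float], proteins_1: int, proteins_2: int
) -> Dict[str, Optional[float]]:
    """Summarize RBM identities (in percent) for one genome pair."""
    n = len(identities)
    if n == 0:
        raise ValueError("no reciprocal best matches to summarize")
    return {
        "aai": mean(identities),                     # average identity
        "sd": stdev(identities) if n > 1 else None,  # needs 2+ values
        "n": n,                                      # number of RBMs
        "omega": min(proteins_1, proteins_2),        # smaller proteome
    }


summary = summarize_rbms([80.0, 90.0, 100.0], proteins_1=10, proteins_2=12)
```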

ANI

Average Nucleotide Identity. The ANI is the average nucleotide identity between fragments of two genomes. The software used as a search engine can be controlled using the metadata field ani_p: blast+ (default), blast, blat, or fastani. Workflows use blast+ by default, or fastani if the flag --fast is passed (whenever available).

SQLite3 schema

The information on the different similarity metrics above is stored in SQLite3 database files. The general schema for hAAI and AAI is:

CREATE TABLE aai(
  seq1 varchar(256), seq2 varchar(256),
  aai float, sd float, n int, omega int
);
CREATE TABLE rbm(
  seq1 varchar(256), seq2 varchar(256),
  id1 varchar(256), id2 varchar(256),
  id float, evalue float, bitscore float
);

The aai table holds the metric values, including: the names of the genomes compared (seq1 and seq2), the AAI or hAAI as a percentage (aai), the standard deviation across proteins when available (sd), the total number of RBMs (n), and the smaller number of proteins from the two genomes (omega). The rbm table holds the RBMs for AAI (whenever they are stored) and is always empty for hAAI.

The general schema for ANI is:

CREATE TABLE ani(
  seq1 varchar(256), seq2 varchar(256),
  ani float, sd float, n int, omega int
);
CREATE TABLE rbm(
  seq1 varchar(256), seq2 varchar(256),
  id1 int, id2 int, id float,
  evalue float, bitscore float
);
CREATE TABLE regions(
  seq varchar(256), id int, source varchar(256),
  `start` int, `end` int
);

The ani table holds the metric values, including: the names of the genomes compared (seq1 and seq2), the ANI as a percentage (ani), the standard deviation across fragments when available (sd), the total number of RBM fragments (n), and the smaller number of fragments from the two genomes (omega). The tables rbm and regions are always empty.
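
Because these are ordinary SQLite3 files, the aai table can be read with any SQLite client. The sketch below uses Python's standard sqlite3 module against an in-memory database built from the aai schema given in this section. The genome names and values are made up for illustration, and the helper `closest_relatives` is hypothetical. It checks both seq1 and seq2 because whether a pair is stored in one or both directions is not stated here.

```python
import sqlite3

# Schema of the aai table as documented in this section.
AAI_SCHEMA = """
CREATE TABLE aai(
  seq1 varchar(256), seq2 varchar(256),
  aai float, sd float, n int, omega int
);
"""


def closest_relatives(conn, query, limit=5):
    """Return (other genome, aai, n, omega) rows, highest AAI first."""
    sql = (
        "SELECT CASE WHEN seq1 = ? THEN seq2 ELSE seq1 END AS other, "
        "aai, n, omega FROM aai "
        "WHERE seq1 = ? OR seq2 = ? ORDER BY aai DESC LIMIT ?"
    )
    return conn.execute(sql, (query, query, query, limit)).fetchall()


# Illustrative data only; a real database would be opened by file path.
conn = sqlite3.connect(":memory:")
conn.executescript(AAI_SCHEMA)
conn.executemany(
    "INSERT INTO aai VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("genome_a", "genome_b", 91.2, 6.1, 1500, 1800),
        ("genome_c", "genome_a", 64.8, 12.3, 700, 1750),
        ("genome_b", "genome_c", 65.1, 12.0, 720, 1750),
    ],
)
hits = closest_relatives(conn, "genome_a")
```

The same query shape applies to the ani table by swapping the table and column names.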
