MiGA Manual
AboutCodebaseMiGA Online
Primary version
Primary version
  • Introduction
  • Part I: What is MiGA?
    • How can MiGA help me?
    • Who is using MiGA?
    • Who is behind MiGA?
    • Definitions
  • Part II: Getting started
    • Requirements
      • Using Homebrew
      • Using apt-get
      • Using Conda
      • Installing from source
      • MyTaxa Utils
    • Installation
    • MiGA types
    • Input data
    • Distances
    • Clustering
  • Part III: Interfaces
    • MiGA API
    • MiGA CLI
    • MiGA Web
  • Part IV: Deploying examples
    • RefSeq in MiGA
    • Build a clade collection
    • Launching daemons
    • Setting up MiGA in a cluster
  • Part V: Additional details
    • Advanced configuration
    • MiGA workflow
    • Metadata
    • External Software
  • Part VI: Workflows
    • Quality
    • Dereplicate
    • Classify
    • Preprocess
    • Index
    • Summaries
Powered by GitBook
On this page
  • General algorithm
  • Genomospecies proposals
  1. Part II: Getting started

Clustering

MiGA generates a clustering-based indexing of databases using ANI distances (for clade projects) or AAI distances (for all other projects). This indexing enables quick searching of databases with query genomes.

General algorithm

The AAI or ANI values are transformed to distances (1 - identity), and the all-vs-all distance matrix is used to generate a k-medoids partition (PAM: Partition Around Medoids). k is selected to simultaneously optimize for maximum Silhouette average width and minimum Silhouette negative area, between 2 and 100 (or the number of genomes minus 1, whichever is smaller). Once the partitions are defined, the same algorithm is applied recursively to each partition with 8 or more genomes. The resulting clustering-based indexing is used to speed-up query searches. In some cases, it can also be used as de novo typing scheme, in particular for ANI distances (clade projects).

Genomospecies proposals

In addition to the above clustering-based indexing, MiGA clusters genomes by Markov Clustering (MCL) using all ANI values above 95% as edges. The result is a collection of discrete genomospecies. The list of genomes per genomospecies is sorted by medoid-ranking, in which the first genome has the minimum average distance to all other genomes in the genomospecies.

PreviousDistancesNextPart III: Interfaces

Last updated 4 years ago