MiGA workflow
This is the general overview of the MiGA workflow:
For each step, the analyses performed may rely on external software and produce one or more result files (indexed in a hash). Most steps also use utilities from the Enveomics Collection in addition to the software detailed below. Some files are mandatory to continue with the analysis (marked with req), some can be gzipped during or after the analysis (marked with gz), and some are directories (marked with dir).
List of individual steps
Dataset Results
Raw Reads
MiGA never actually performs this step; it serves as the entry point for raw-read input.
Supported file keys:
For single reads only
single (req, gz): FastQ file containing the raw reads
For paired-end reads only
pair1 (req, gz): FastQ file containing the raw forward reads
pair2 (req, gz): FastQ file containing the raw reverse reads
Statistics:
For single reads only
reads: Total number of reads
length_average: Average read length (in bp)
length_standard_deviation: Standard deviation of read length (in bp)
g_c_content: G+C content of all reads (in %)
x_content: Undetermined bases content of all reads (in %)
a_t_skew: A-T sequence skew across all reads (in %)
g_c_skew: G-C sequence skew across all reads (in %)
For paired-end reads only
read_pairs: Total number of read pairs
forward_length_average: Average forward read length (in bp)
forward_length_standard_deviation: Standard deviation of forward read length (in bp)
forward_g_c_content: G+C content of forward reads (in %)
forward_x_content: Undetermined bases content of forward reads (in %)
forward_a_t_skew: A-T sequence skew across forward reads (in %)
forward_g_c_skew: G-C sequence skew across forward reads (in %)
reverse_length_average, reverse_length_standard_deviation, reverse_g_c_content: Same as above, for reverse reads
reverse_x_content: Undetermined bases content of reverse reads (in %)
reverse_a_t_skew: A-T sequence skew across reverse reads (in %)
reverse_g_c_skew: G-C sequence skew across reverse reads (in %)
MiGA symbol: raw_reads.
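The read statistics listed for this step can be illustrated with a minimal Python sketch. It is not MiGA's own code. The function name read_stats is hypothetical, and several definitions are assumptions: the population standard deviation is used, percentages are taken over all bases, and the skews follow the common (A-T)/(A+T) and (G-C)/(G+C) forms.

```python
import math


def read_stats(seqs):
    """Summary statistics for an iterable of read sequences (strings)."""
    seqs = [s.upper() for s in seqs]
    if not seqs:
        raise ValueError("at least one read is required")
    lengths = [len(s) for s in seqs]
    n = len(lengths)
    total = sum(lengths)
    mean = total / n
    # Assumption: population (not sample) standard deviation
    sd = math.sqrt(sum((x - mean) ** 2 for x in lengths) / n)
    a, c, g, t = (sum(s.count(b) for s in seqs) for b in "ACGT")

    def pct(num, den):
        # Percentage, guarding against an empty denominator
        return 100.0 * num / den if den else 0.0

    return {
        "reads": n,
        "length_average": mean,
        "length_standard_deviation": sd,
        "g_c_content": pct(g + c, total),
        # Anything that is not A, C, G, or T counts as undetermined
        "x_content": pct(total - (a + c + g + t), total),
        "a_t_skew": pct(a - t, a + t),
        "g_c_skew": pct(g - c, g + c),
    }
```

For paired-end reads, the same function could be applied separately to the forward and reverse reads to obtain the forward_ and reverse_ values.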
Trimmed Reads
This is part of Trimming & read quality in the above diagram. In this step, MiGA trims reads and clips potential adapters with an adaptive strategy using a combination of FaQCs, Seqtk, and fastp. The full pipeline is implemented as a stand-alone submodule called multitrim.
Supported file keys:
For single reads only
single (req, gz): FastQ file containing trimmed/clipped reads
For paired-end reads only
pair1 (req, gz): FastQ file containing trimmed/clipped forward reads
pair2 (req, gz): FastQ file containing trimmed/clipped reverse reads
single (req, gz): FastQ file containing trimmed/clipped reads with only one sister passing quality control
Deprecated (for backwards-compatibility)
trimming_summary: Raw text file containing a summary of the trimmed sequences
MiGA symbol: trimmed_reads.
Read Quality
This is a quality-control step included as part of Trimming & read quality in the diagram above. In this step, MiGA generates quality reports of the trimmed/clipped reads using Falco via multitrim.
Supported file keys:
For single and paired reads
pre_qc_1: HTML file with QC-report of the forward reads before trimming
post_qc_1 (req): HTML file with QC-report of the forward reads after trimming
adapter_detection: List of adapters identified in the first pass
For paired reads only
pre_qc_2: HTML file with QC-report of the reverse reads before trimming
post_qc_2: HTML file with QC-report of the reverse reads after trimming
Deprecated (for backwards-compatibility)
solexaqa (dir): Folder containing the SolexaQA++ quality-control summaries
fastqc (dir): Folder containing the FastQC quality-control analyses
MiGA symbol: read_quality.
Trimmed FastA
This is the final step included in Trimming & read quality in the diagram above, in which MiGA generates FastA files with the trimmed/clipped reads.
Supported file keys:
coupled (req for coupled reads, unless pair1 and pair2 exist): Interposed FastA file containing quality-checked paired reads. If this file doesn't exist, it is automatically generated from pair1 and pair2
single (req for single reads, gz for coupled reads): FastA file with quality-checked single-end reads
pair1 (gz): FastA file containing forward sisters of quality-checked paired-end reads
pair2 (gz): FastA file containing reverse sisters of quality-checked paired-end reads
Statistics:
reads: Total number of reads
length_average: Average read length (in bp)
length_standard_deviation: Standard deviation of read length (in bp)
g_c_content: G+C content of all reads (in %)
x_content: Undetermined bases content of all reads (in %)
a_t_skew: A-T sequence skew across all reads (in %)
g_c_skew: G-C sequence skew across all reads (in %)
MiGA symbol: trimmed_fasta.
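The coupled file is described as an interposed FastA generated from pair1 and pair2. A minimal Python sketch of that interleaving follows. It is not MiGA's implementation; the helper names read_fasta and interpose are hypothetical, and the sketch assumes both inputs list the reads in the same order.

```python
def read_fasta(lines):
    """Yield (header, sequence) tuples from FastA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def interpose(forward_lines, reverse_lines):
    """Return FastA lines alternating forward and reverse sister reads."""
    fwd = list(read_fasta(forward_lines))
    rev = list(read_fasta(reverse_lines))
    if len(fwd) != len(rev):
        raise ValueError("forward and reverse files differ in read count")
    out = []
    for (h1, s1), (h2, s2) in zip(fwd, rev):
        out += [f">{h1}", s1, f">{h2}", s2]
    return out
```

With real files, the two arguments could be open file handles, since the parser only iterates over lines.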
Assembly
In this step, MiGA assembles trimmed FastA reads using IDBA-UD.
Supported file keys:
largecontigs (req): FastA file containing large contigs or scaffolds (>1Kbp)
allcontigs: FastA file containing all contigs or scaffolds (including large)
assembly_data (dir): Folder containing some intermediate files generated during the assembly
Statistics:
contigs: Total number of (large) contigs
n50: N50 of (large) contigs (in bp)
total_length: Total length of large contigs (in bp)
longest_sequence: Length of the longest contig (in bp)
n_content: Undetermined bases content of large contigs (in %)
g_c_content: G+C content of large contigs (in %)
x_content: Undetermined bases content of large contigs (in %)
a_t_skew: A-T sequence skew across large contigs (in %)
g_c_skew: G-C sequence skew across large contigs (in %)
MiGA symbol: assembly.
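As an illustration of the contig statistics above, the following Python sketch computes the contig count, N50, total length, and longest sequence from a list of contig lengths. It is a sketch under stated assumptions rather than MiGA's code: the function name assembly_stats is hypothetical, "large" is taken to mean strictly longer than 1,000 bp, and N50 uses the usual definition (the length of the contig at which the cumulative sum of lengths, sorted from longest to shortest, first reaches half of the total).

```python
def assembly_stats(lengths, min_length=1000):
    """Summary statistics over large contigs, given all contig lengths in bp."""
    # Assumption: "large" means strictly longer than min_length
    large = sorted((x for x in lengths if x > min_length), reverse=True)
    total = sum(large)
    running, n50 = 0, 0
    for length in large:
        running += length
        if 2 * running >= total:
            n50 = length
            break
    return {
        "contigs": len(large),
        "n50": n50,
        "total_length": total,
        "longest_sequence": large[0] if large else 0,
    }
```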
CDS
This step corresponds to Gene prediction in the diagram above. MiGA predicts coding sequences (putative genes and proteins) using Prodigal, and automatically calculates the most likely codon table between 11 and 4.
Supported file keys:
proteins (req): FastA file containing translated protein sequences
genes: FastA file containing putative gene sequences
gff3 (gz): GFF v3 file containing the coordinates of coding sequences. This file is not required, but MyTaxa depends on it (or gff2 or tab, whichever is available)
gff2 (gz): GFF v2 file containing the coordinates of coding sequences. This file is not produced by MiGA, but it's supported for backwards compatibility with earlier versions using MetaGeneMark
tab (gz): Tab-delimited file containing the columns: gene ID, gene length, and contig ID. This file is not produced by MiGA, but it's supported to allow MyTaxa to run when more detailed information about the gene prediction is missing
Statistics:
predicted_proteins: Total number of predicted proteins
average_length: Average length of predicted proteins (in aa)
coding_density: Coding density of the genome (in %)
codon_table: Optimal codon table (4 or 11)
MiGA symbol: cds.
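The relationship between these statistics can be sketched in Python. This is not MiGA's code. The function name cds_stats is hypothetical, and the formula is an assumption: coding density is taken as three times the total protein length in amino acids, divided by the genome length in bp, which ignores stop codons.

```python
def cds_stats(protein_lengths_aa, genome_length_bp):
    """Gene-prediction statistics from protein lengths (aa) and genome size (bp)."""
    if not protein_lengths_aa or genome_length_bp <= 0:
        raise ValueError("need at least one protein and a positive genome length")
    n = len(protein_lengths_aa)
    total_aa = sum(protein_lengths_aa)
    return {
        "predicted_proteins": n,
        "average_length": total_aa / n,
        # Assumption: 3 bp per amino acid, stop codons not counted
        "coding_density": 100.0 * 3 * total_aa / genome_length_bp,
    }
```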
Essential Genes
In this step, MiGA uses HMM.essential.rb from the Enveomics Collection to identify a set of genes typically present in single copy in bacterial and archaeal genomes. The protein translations of those essential genes are extracted for other analyses in MiGA (e.g., hAAI in distances) or outside it (e.g., phylogeny or MLSA for diversity analyses). In addition, this step generates a report that can be used for quality control, including estimates of completeness and contamination (for genomes) and the median number of copies of single-copy genes (for metagenomes and viromes).
Supported file keys:
ess_genes (req): FastA file containing all extracted protein translations from essential genes (.faa) or archived collection (proteins.tar.gz)
collection (req): Folder containing individual FastA files with protein translations from essential genes
report (req): Raw text report including derived statistics, as well as essential genes missing or detected in multiple copies (for genomes) or copy counts (for metagenomes and viromes)
alignments: Generated for all genomes (non-multi types). It contains the best matching protein for each detected model aligned to the model
fastaai_index: A FastAAI index, now deprecated (for backwards-compatibility)
fastaai_index_2: A FastAAI index with the second format version (SQLite)
bac_report: If present, this is the original report, and it indicates that a corrected report has been generated to accommodate particular features of the dataset
Statistics:
For metagenomes and viromes
mean_copies: Average copy number across essential genes
median_copies: Median copy number across essential genes
For genomes
completeness: Estimated completeness of the genome, based on presence of essential genes (in %)
contamination: Estimated contamination of the genome, based on copy number of essential genes (in %)
quality: Completeness - 5 x Contamination
MiGA symbol: essential_genes.
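The statistics above can be illustrated with a short Python sketch. Only the quality formula (Completeness - 5 x Contamination) comes from the list above. The function name essential_genes_summary is hypothetical, and the completeness and contamination formulas are simplifying assumptions (percentage of essential genes detected, and percentage detected in more than one copy); the actual estimates produced by HMM.essential.rb may differ.

```python
from statistics import mean, median


def essential_genes_summary(copies):
    """Summarize a dict mapping each essential gene model to its copy number."""
    counts = list(copies.values())
    if not counts:
        raise ValueError("at least one essential gene model is required")
    total = len(counts)
    # Assumption: completeness is the percentage of models detected at least once
    completeness = 100.0 * sum(1 for c in counts if c > 0) / total
    # Assumption: contamination is the percentage of models detected more than once
    contamination = 100.0 * sum(1 for c in counts if c > 1) / total
    return {
        "mean_copies": mean(counts),
        "median_copies": median(counts),
        "completeness": completeness,
        "contamination": contamination,
        "quality": completeness - 5 * contamination,
    }
```

Under these assumptions, a genome with 9 of 10 essential genes detected and one of them duplicated would have 90% completeness, 10% contamination, and a quality of 40.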
MyTaxa
This step is only supported for metagenomes and viromes, and it requires the (optional) MyTaxa requirements to be installed.
In this step, the most likely taxonomic classification of each contig is identified using MyTaxa, and a report is generated using Krona.
Supported file keys:
mytaxa (req): Output generated by MyTaxa
blast (gz): BLAST against the reference genomes database
mytaxain (gz): Re-formatted BLAST used as input for MyTaxa
nomytaxa: If it exists, MiGA assumes no support for MyTaxa modules, and none of the above files are required
species: Profile of species composition (in permil) as raw tab-delimited text
genus: Profile of genus composition (in permil) as raw tab-delimited text
phylum: Profile of phylum composition (in permil) as raw tab-delimited text
innominate: List of innominate taxa (groups without a name but containing lower-rank classifications) as raw text
kronain: Raw-text list of taxa used as input for Krona
krona: HTML output produced by Krona
MiGA symbol: mytaxa.
MyTaxa Scan
This step is only supported for genomes (dataset types genome, popgenome, and scgenome), and it requires the (optional) MyTaxa requirements to be installed.
In this step, the genomes are scanned in windows of ten genes. For each window, the taxonomic distribution is determined using MyTaxa and compared against the distribution for the entire genome. This is a quality-control step for manual curation.
Supported file keys:
mytaxa (req): MyTaxa output
report (req): PDF file containing the graphic report
regions_archive (gz): Archived folder containing FastA files with the sequences of the genes in regions identified as abnormal
nomytaxa: If it exists, MiGA assumes no support for MyTaxa modules, and none of the above files are required
Deprecated file keys (for backwards-compatibility):
wintax: Taxonomic distribution of each window
blast (gz): BLAST against the reference genomes database
mytaxain (gz): Re-formatted BLAST used as input for MyTaxa
regions (dir): Folder containing FastA files with the sequences of the genes in regions identified as abnormal
gene_ids: List of genes per window
region_ids: List of regions identified as abnormal
MiGA symbol: mytaxa_scan.
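The window-based scan described for this step can be sketched in Python as follows. This is only an illustration, not the MiGA or MyTaxa implementation. All function names are hypothetical. The sketch assumes non-overlapping windows and that each gene already carries a single taxon label, and it uses the total variation distance as a stand-in for whatever comparison is actually performed.

```python
from collections import Counter


def windows(items, size=10):
    """Split a list into consecutive, non-overlapping windows (an assumption)."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def distribution(taxa):
    """Relative frequency of each taxon label in a list."""
    counts = Counter(taxa)
    n = sum(counts.values())
    return {taxon: count / n for taxon, count in counts.items()}


def divergence(p, q):
    """Total variation distance between two distributions (illustrative metric)."""
    return 0.5 * sum(abs(p.get(k, 0) - q.get(k, 0)) for k in set(p) | set(q))


def scan(gene_taxa, size=10):
    """Divergence of each window's taxonomic distribution from the whole genome's."""
    genome = distribution(gene_taxa)
    return [divergence(distribution(w), genome) for w in windows(gene_taxa, size)]
```

Windows with a high divergence would be the candidates for manual curation in this simplified view.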
Distances
This step is only supported for genomes (dataset types genome, popgenome, and scgenome). In this step, each dataset is compared against all other datasets in the project. If the dataset is a reference dataset, it is compared against all other reference datasets in the project. If it's a query dataset, it is compared iteratively against medoids. For more details on the strategy used in this step, see the manual section on distances.
Supported file keys:
For reference datasets
haai_db (req): SQLite3 database containing hAAI values
aai_db: SQLite3 database containing AAI values
ani_db: SQLite3 database containing ANI values
For query datasets
aai_medoids (req except for clades projects): Best hits among medoids at different hierarchical levels in the AAI indexing
ani_medoids (req for clades projects): Best hits among medoids at different hierarchical levels in the ANI indexing
haai_db (req): SQLite3 database containing hAAI values
aai_db: SQLite3 database containing AAI values
ani_db: SQLite3 database containing ANI values
ref_tree: Newick file with the Bio-NJ tree including queried medoids and the query dataset
ref_tree_pdf: PDF rendering of ref_tree
intax: Raw text result of the taxonomy test against the reference genome
MiGA symbol: distances.
Taxonomy
This step is only supported for genomes (dataset types genome, popgenome, and scgenome) that are reference datasets, in projects with a set reference project (:ref_project in metadata).
In this step, MiGA compares the genome against a reference project using the query search method, and imports the resulting taxonomy with p-value below 0.1 (or whichever value is set as :tax_pvalue in metadata).
Supported file keys:
intax: Raw text result of the taxonomy test against the reference genome
aai_medoids (req except for reference clades projects): Best hits among medoids at different hierarchical levels in the AAI indexing
ani_medoids (req for reference clades projects): Best hits among medoids at different hierarchical levels in the ANI indexing
haai_db (req): SQLite3 database containing hAAI values
aai_db: SQLite3 database containing AAI values
ani_db: SQLite3 database containing ANI values
ref_tree: Newick file with the Bio-NJ tree including queried medoids and the query dataset
ref_tree_pdf: PDF rendering of ref_tree
Statistics:
closest_relative: Name of the reference dataset with highest AAI
aai: AAI to the closest relative
domain_pvalue, phylum_pvalue, class_pvalue, order_pvalue, family_pvalue, genus_pvalue, species_pvalue, subspecies_pvalue: Empirical p-values for classification at each rank with respect to the closest relative, based on the observed AAI
MiGA symbol: taxonomy.
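The p-value cutoff used when importing taxonomy can be illustrated with a small Python sketch over the per-rank statistics listed above. It is not MiGA's code. The function name accepted_ranks is hypothetical, and skipping ranks without a p-value is an assumption.

```python
RANKS = [
    "domain", "phylum", "class", "order",
    "family", "genus", "species", "subspecies",
]


def accepted_ranks(stats, max_pvalue=0.1):
    """Ranks whose empirical p-value falls below the threshold."""
    accepted = []
    for rank in RANKS:
        pvalue = stats.get(f"{rank}_pvalue")
        # Assumption: ranks without a p-value are skipped
        if pvalue is not None and pvalue < max_pvalue:
            accepted.append(rank)
    return accepted
```

Passing a different max_pvalue mimics setting :tax_pvalue in the metadata.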
SSU
In this step, MiGA detects rRNA genes (16S and 23S) using Barrnap, extracts the sequences of the small subunit genes (16S) using Bedtools, and identifies tRNA elements using tRNAscan-SE. If configured, it will also classify all the 16S rRNA genes detected using the RDP Naïve Bayes Classifier.
Supported file keys:
longest_ssu_gene (req): FastA file containing the longest detected SSU gene
gff (gz): GFF v3 file containing the location of detected SSU genes
all_ssu_genes (gz): FastA file containing all the detected SSU genes
classification: Taxonomic classification with RDP taxonomy
trna_list (gz): Raw-text table with tRNA predictions, generated for all genome types (non-multi)
Deprecated file keys (for backwards-compatibility):
ssu_gff (gz): GFF3 file containing pre-filtered SSU rRNA predictions
Statistics:
ssu: Total number of detected SSU fragments
complete_ssu: Number of complete SSU loci
ssu_fragment: Maximum percentage covered for any detected SSU fragments
lsu: Total number of detected LSU fragments
complete_lsu: Number of complete LSU loci
lsu_fragment: Maximum percentage covered for any detected LSU fragments
max_length: Length of the longest detected SSU fragment
trna_count: Total number of tRNA elements detected (including pseudogenes)
trna_aa: Number of distinct amino acids for which tRNA elements were detected (excluding pseudogenes)
MiGA symbol: ssu.
Stats
In this step, MiGA traces back all the results of the dataset and estimates summary statistics. In addition, it removes from the distances database any stored values involving datasets no longer registered in the project.
Supported file keys:
trna_list: List of tRNA elements detected. This file is only produced for genome datasets with defined taxonomy within the Archaea, Bacteria, or Eukaryota domains
MiGA symbol: stats.
Project Results
Once all datasets have been pre-processed (i.e., once all the results above are available for all reference datasets), MiGA executes the following project-wide steps:
hAAI Distances
Consolidation of hAAI distances.
Supported file keys:
rds (req): Pairwise values in a data.frame for R
matrix (req): Pairwise values in a raw tab-delimited file
log (req): List of datasets included in the matrix
hist: Histogram of hAAI values as raw tab-delimited file
MiGA symbol: haai_distances.
AAI Distances
Consolidation of AAI distances.
Supported file keys:
rda (req): Pairwise values for R in three vectors
matrix (req): Pairwise values in a raw tab-delimited file
log (req): List of datasets included in the matrix
hist: Histogram of AAI values as raw tab-delimited file
rds (deprecated): Pairwise values in a data.frame for R
MiGA symbol: aai_distances.
ANI Distances
Consolidation of ANI distances.
Supported file keys:
rda (req): Pairwise values for R in three vectors
matrix (req): Pairwise values in a raw tab-delimited file
log (req): List of datasets included in the matrix
hist: Histogram of ANI values as raw tab-delimited file
rds (deprecated): Pairwise values in a data.frame for R
MiGA symbol: ani_distances.
Clade Finding
This step is only supported for project types genomes and clade.
In this step, MiGA attempts to identify clades at species level or above using a combination of ANI and AAI values. MiGA generates AAI clades in this step for genomes projects. Clades proposed at AAI > 90% and ANI > 95% are formed using the Markov Clustering algorithm implemented in MCL. Most distance manipulation and tree estimation and manipulation utilities use the R packages Ape and Vegan.
Supported file keys:
report (req for genomes): PDF file including a graphic report for the clustering
class_table (req for genomes): Tab-delimited file containing the classification of all datasets in AAI clusters
class_tree (req for genomes): Newick file containing the classification of all datasets in AAI clusters as a dendrogram
classif (req for genomes): Tab-delimited file containing the highest-level classification of each dataset, the medoid of the cluster, and the AAI against the corresponding medoid
medoids (req for genomes): List of medoids per cluster
aai_tree: Bio-NJ tree based on AAI distances in Newick format
aai_dist_rds: AAI-based distances in R data serialized format
proposal (req): Proposed species-level clades in the project, based on clades_ani95. One line per proposed clade, with tab-delimited dataset names. Only clades with 5 or more members are included
clades_aai90: Clades formed at AAI > 90%. One clade per line, with comma-delimited dataset names
clades_ani95: Clades formed at ANI > 95%. One clade per line, with comma-delimited dataset names
medoids_ani95: List of clades_ani95 datasets with the smallest ANI distance to all members of its own ANI95 clade. The list is in the same order
MiGA symbol: clade_finding.
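The relationship between clades_ani95 and proposal, as described in the file keys above, can be sketched in Python: comma-delimited clades are read one per line, and those with five or more members are written out as tab-delimited lines. The function name propose_clades is hypothetical, and this is an illustration of the file formats rather than MiGA's own code.

```python
def propose_clades(clades_ani95_lines, min_members=5):
    """Turn comma-delimited clade lines into tab-delimited proposal lines."""
    proposal = []
    for line in clades_ani95_lines:
        members = [name for name in line.strip().split(",") if name]
        # Only clades with enough members are proposed
        if len(members) >= min_members:
            proposal.append("\t".join(members))
    return proposal
```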
Subclades
This step is only supported for project type clade.
In this step, MiGA attempts to identify clades below species level using ANI values. MiGA generates ANI clades in this step. Most distance manipulation and tree estimation and manipulation utilities use the R packages Ape and Vegan.
Supported file keys:
report (req): PDF file including a graphic report for the clustering
class_table (req): Tab-delimited file containing the classification of all datasets in ANI clusters
class_tree (req): Newick file containing the classification of all datasets in ANI clusters as a dendrogram
classif (req): Tab-delimited file containing the highest-level classification of each dataset, the medoid of the cluster, and the ANI against the corresponding medoid
medoids (req): List of medoids per cluster
ani_tree: Bio-NJ tree based on ANI distances in Newick format
ani_dist_rds: ANI-based distances in R data serialized format
MiGA symbol: subclades.
OGS
This step is only supported for project type clade.
In this step, MiGA generates orthology groups (OGS) using reciprocal best matches (RBM) between all pairs of datasets in the project. Groups are generated using MCL with pairs weighted by bit score. Once computed, MiGA uses the matrix of OGS to estimate summary and rarefied statistics.
Supported file keys:
ogs (req): Matrix of orthology groups, as tab-delimited raw file
stats (req): Summary statistics in JSON format
abc (gz): When available, it includes all the individual RBM files in ABC format. This file is typically produced as an intermediate result and removed before finishing, but it can be kept by running miga option -P . --key clean_ogs --value false in the project folder using the CLI
core_pan: Summary statistics of rarefied core-genome/pangenome sizes in tab-delimited format
core_pan_plot: Plot of rarefied core-genome/pangenome sizes in PDF
MiGA symbol: ogs.
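The idea of reciprocal best matches between two datasets can be sketched in Python as follows. This is a simplified illustration, not the code MiGA runs. The function name and the input layout (lists of query, subject, and bit score tuples) are hypothetical, and using the forward bit score as the pair weight is an assumption.

```python
def reciprocal_best_matches(hits_ab, hits_ba):
    """Pairs (a, b, bit_score) where a's best hit is b and b's best hit is a."""

    def best(hits):
        # Keep the highest-scoring subject for each query
        top = {}
        for query, subject, score in hits:
            if query not in top or score > top[query][1]:
                top[query] = (subject, score)
        return top

    best_ab, best_ba = best(hits_ab), best(hits_ba)
    pairs = []
    for a, (b, score) in best_ab.items():
        if b in best_ba and best_ba[b][0] == a:
            # Assumption: the forward bit score is used as the pair weight
            pairs.append((a, b, score))
    return sorted(pairs)
```

Each resulting triple could be written as a tab-delimited line (two labels and a weight), which is the general shape of the ABC input that MCL accepts.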
Project Stats
In this step, MiGA traces back all the results of the project and estimates summary statistics.
Supported file keys:
taxonomy_index (req): Index of datasets per taxonomy in JSON format
metadata_index (req): Searchable index of datasets metadata as SQLite3 database
MiGA symbol: project_stats.