single (req, gz): FastQ file containing the raw reads
pair1 (req, gz): FastQ file containing the raw forward reads
pair2 (req, gz): FastQ file containing the raw reverse reads
reads: Total number of reads
length_average: Average read length (in bp)
length_standard_deviation: Standard deviation of read length (in bp)
g_c_content: G+C content of all reads (in %)
x_content: Undetermined bases content of all reads (in %)
a_t_skew: A-T sequence skew across all reads (in %)
g_c_skew: G-C sequence skew across all reads (in %)
read_pairs: Total number of read pairs
forward_length_average: Average forward read length (in bp)
forward_length_standard_deviation: Standard deviation of forward read length (in bp)
forward_g_c_content: G+C content of forward reads (in %)
forward_x_content: Undetermined bases content of forward reads (in %)
forward_a_t_skew: A-T sequence skew across forward reads (in %)
forward_g_c_skew: G-C sequence skew across forward reads (in %)
reverse_length_average, reverse_length_standard_deviation, reverse_g_c_content: Same as above, for reverse reads
reverse_x_content: Undetermined bases content of reverse reads (in %)
reverse_a_t_skew: A-T sequence skew across reverse reads (in %)
reverse_g_c_skew: G-C sequence skew across reverse reads (in %)
raw_reads
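The read statistics listed above can be illustrated with a minimal Python sketch. The function name `read_stats` is hypothetical, and the formulas are assumptions rather than MiGA's confirmed implementation: skews are taken as (A-T)/(A+T) and (G-C)/(G+C), percentages of content are computed over all bases (including undetermined ones), and the standard deviation is the population form.

```python
from statistics import mean, pstdev


def read_stats(seqs):
    """Summarize a list of read sequences given as plain strings.

    Assumed formulas (not confirmed against MiGA):
      content = 100 * count / total bases
      skew    = 100 * (X - Y) / (X + Y)
    """
    lengths = [len(s) for s in seqs]
    counts = {base: 0 for base in "ACGT"}
    total = 0
    for s in seqs:
        s = s.upper()
        total += len(s)
        for base in counts:
            counts[base] += s.count(base)
    a, c, g, t = (counts[base] for base in "ACGT")
    # Anything that is not A, C, G, or T is treated as undetermined.
    undetermined = total - (a + c + g + t)

    def pct(num, den):
        # Guard against empty input or missing bases.
        return 100.0 * num / den if den else 0.0

    return {
        "reads": len(seqs),
        "length_average": mean(lengths) if lengths else 0.0,
        "length_standard_deviation": pstdev(lengths) if lengths else 0.0,
        "g_c_content": pct(g + c, total),
        "x_content": pct(undetermined, total),
        "a_t_skew": pct(a - t, a + t),
        "g_c_skew": pct(g - c, g + c),
    }
```

For paired reads, the same function could be applied separately to the forward and reverse sequences to obtain the forward_ and reverse_ values.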

single (req, gz): FastQ file containing trimmed/clipped reads
pair1 (req, gz): FastQ file containing trimmed/clipped forward reads
pair2 (req, gz): FastQ file containing trimmed/clipped reverse reads
single (req, gz): FastQ file containing trimmed/clipped reads with only one sister passing quality control
trimming_summary: Raw text file containing a summary of the trimmed sequences
trimmed_reads

solexaqa (dir): Folder containing the SolexaQA++ quality-control summaries
fastqc (dir): Folder containing the FastQC quality-control analyses
read_quality

coupled (req for coupled reads, unless pair1 and pair2 exist): Interposed FastA file containing quality-checked paired reads. If this file doesn't exist, it is automatically generated from pair1 and pair2
single (req for single reads, gz for coupled reads): FastA file with quality-checked single-end reads
pair1 (gz): FastA file containing forward sisters of quality-checked paired-end reads
pair2 (gz): FastA file containing reverse sisters of quality-checked paired-end reads
reads: Total number of reads
length_average: Average read length (in bp)
length_standard_deviation: Standard deviation of read length (in bp)
g_c_content: G+C content of all reads (in %)
x_content: Undetermined bases content of all reads (in %)
a_t_skew: A-T sequence skew across all reads (in %)
g_c_skew: G-C sequence skew across all reads (in %)
trimmed_fasta

largecontigs (req): FastA file containing large contigs or scaffolds (>1 Kbp)
allcontigs: FastA file containing all contigs or scaffolds (including large)
assembly_data (dir): Folder containing some intermediate files generated during the assembly
contigs: Total number of (large) contigs
n50: N50 of (large) contigs (in bp)
total_length: Total length of large contigs (in bp)
longest_sequence: Length of the longest contig (in bp)
n_content: Undetermined bases content of large contigs (in %)
g_c_content: G+C content of large contigs (in %)
x_content: Undetermined bases content of large contigs (in %)
a_t_skew: A-T sequence skew across large contigs (in %)
g_c_skew: G-C sequence skew across large contigs (in %)
assembly
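As an illustration of the length-based assembly statistics above, the following Python sketch computes them from a list of contig lengths. The function names are hypothetical. N50 follows the standard definition (the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, reaches half the total), and the strict >1000 bp cutoff for large contigs is an assumption based on the ">1 Kbp" wording.

```python
def n50(lengths):
    """Return the N50 of a list of sequence lengths (in bp), or 0 if empty."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


def assembly_stats(lengths, min_length=1000):
    """Summarize large contigs only (assumed: strictly longer than min_length)."""
    large = [length for length in lengths if length > min_length]
    return {
        "contigs": len(large),
        "n50": n50(large),
        "total_length": sum(large),
        "longest_sequence": max(large, default=0),
    }
```

For example, `assembly_stats([5000, 3000, 1500, 800, 200])` keeps only the three contigs above the cutoff before computing the summary.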

proteins (req): FastA file containing translated protein sequences
genes: FastA file containing putative gene sequences
gff3 (gz): GFF v3 file containing the coordinates of coding sequences. This file is not required, but MyTaxa depends on it (or gff2 or tab, whichever is available)
gff2 (gz): GFF v2 file containing the coordinates of coding sequences. This file is not produced by MiGA, but it's supported for backwards compatibility with earlier versions using MetaGeneMark
tab (gz): Tab-delimited file containing the columns: gene ID, gene length, and contig ID. This file is not produced by MiGA, but it's supported to allow MyTaxa to run when more detailed information about the gene prediction is missing
predicted_proteins: Total number of predicted proteins
average_length: Average length of predicted proteins (in aa)
coding_density: Coding density of the genome (in %)
codon_table: Optimal coding table (4 or 11)
cds

HMM.essential.rb from the Enveomics Collection is used to identify a set of genes typically present in single copy in bacterial and archaeal genomes. In this step, protein translations of those essential genes are extracted for other analyses in MiGA (e.g., hAAI in distances) or outside (e.g., phylogeny or MLSA for diversity analyses). In addition, this step generates a report that can be used for quality control, including estimates of completeness and contamination (for genomes) and the median number of copies of single-copy genes (for metagenomes and viromes).
ess_genes (req): FastA file containing all extracted protein translations from essential genes (.faa) or archived collection (proteins.tar.gz)
collection (req): Folder containing individual FastA files with protein translations from essential genes
report (req): Raw text report including derived statistics, as well as essential genes missing or detected in multiple copies (for genomes) or copy counts (for metagenomes and viromes)
alignments: Generated for all genomes (non-multi types). It contains the best matching protein for each detected model aligned to the model
bac_report: If present, this is the original report, and it indicates that a corrected report has been generated to accommodate particular features of the dataset
mean_copies: Average copy number across essential genes
median_copies: Median copy number across essential genes
completeness: Estimated completeness of the genome, based on presence of essential genes (in %)
contamination: Estimated contamination of the genome, based on copy number of essential genes (in %)
quality: Completeness - 5 x Contamination
essential_genes
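The quality score above is simple arithmetic, sketched below in Python together with the copy-number statistics. Only the quality formula comes from the text. The completeness and contamination estimators are assumptions made for illustration (the share of essential genes found at least once, and the share found more than once); the exact estimators used by HMM.essential.rb may differ. The function names are hypothetical.

```python
from statistics import mean, median


def quality(completeness, contamination):
    """Quality score as defined in the text: completeness - 5 x contamination."""
    return completeness - 5 * contamination


def essential_gene_summary(copies):
    """Summarize copy counts of essential genes.

    copies: dict mapping an essential gene model ID to its copy count.
    Completeness and contamination below are ASSUMED estimators:
      completeness  = % of models detected at least once
      contamination = % of models detected more than once
    """
    counts = list(copies.values())
    if not counts:
        raise ValueError("at least one essential gene model is required")
    n = len(counts)
    completeness = 100.0 * sum(1 for c in counts if c >= 1) / n
    contamination = 100.0 * sum(1 for c in counts if c > 1) / n
    return {
        "mean_copies": mean(counts),
        "median_copies": median(counts),
        "completeness": completeness,
        "contamination": contamination,
        "quality": quality(completeness, contamination),
    }
```

The factor of 5 means that contamination is penalized much more heavily than missing genes, so a genome that is 90% complete with 10% contamination scores only 40.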

longest_ssu_gene (req): FastA file containing the longest detected SSU gene
gff (gz): GFF v3 file containing the location of detected SSU genes
all_ssu_genes (gz): FastA file containing all the detected SSU genes
ssu: Total number of detected SSU fragments
complete_ssu: Number of complete SSU loci
max_length: Length of the longest detected SSU fragment
ssu

mytaxa (req): Output generated by MyTaxa
blast (gz): BLAST against the reference genomes database
mytaxain (gz): Re-formatted BLAST used as input for MyTaxa
nomytaxa: If it exists, MiGA assumes no support for MyTaxa modules, and none of the above files are required
species: Profile of species composition (in permil) as raw tab-delimited text
genus: Profile of genus composition (in permil) as raw tab-delimited text
phylum: Profile of phylum composition (in permil) as raw tab-delimited text
innominate: List of innominate taxa (groups without a name but containing lower-rank classifications) as raw text
kronain: Raw-text list of taxa used as input for Krona
krona: HTML output produced by Krona
mytaxa

mytaxa (req): MyTaxa output
report (req): PDF file containing the graphic report
regions_archive (gz): Archived folder containing FastA files with the sequences of the genes in regions identified as abnormal
nomytaxa: If it exists, MiGA assumes no support for MyTaxa modules, and none of the above files are required
wintax: Taxonomic distribution of each window
blast (gz): BLAST against the reference genomes database
mytaxain (gz): Re-formatted BLAST used as input for MyTaxa
regions (dir): Folder containing FastA files with the sequences of the genes in regions identified as abnormal
gene_ids: List of genes per window
region_ids: List of regions identified as abnormal
mytaxa_scan

haai_db (req): SQLite3 database containing hAAI values
aai_db: SQLite3 database containing AAI values
ani_db: SQLite3 database containing ANI values
aai_medoids (req except for clades projects): Best hits among medoids at different hierarchical levels in the AAI indexing
ani_medoids (req for clades projects): Best hits among medoids at different hierarchical levels in the ANI indexing
haai_db (req): SQLite3 database containing hAAI values
aai_db: SQLite3 database containing AAI values
ani_db: SQLite3 database containing ANI values
ref_tree: Newick file with the Bio-NJ tree including queried medoids and the query dataset
ref_tree_pdf: PDF rendering of ref_tree
intax: Raw text result of the taxonomy test against the reference genome
distances

(:ref_project in metadata). (:tax_pvalue in metadata).
intax: Raw text result of the taxonomy test against the reference genome
aai_medoids (req except for reference clades projects): Best hits among medoids at different hierarchical levels in the AAI indexing
ani_medoids (req for reference clades projects): Best hits among medoids at different hierarchical levels in the ANI indexing
haai_db (req): SQLite3 database containing hAAI values
aai_db: SQLite3 database containing AAI values
ani_db: SQLite3 database containing ANI values
ref_tree: Newick file with the Bio-NJ tree including queried medoids and the query dataset
ref_tree_pdf: PDF rendering of ref_tree
closest_relative: Name of the reference dataset with highest AAI
aai: AAI to the closest relative
domain_pvalue, phylum_pvalue, class_pvalue, order_pvalue, family_pvalue, genus_pvalue, species_pvalue, subspecies_pvalue: Empirical p-values for classification at each rank with respect to the closest relative, based on the observed AAI
taxonomy
stats

rds (req): Pairwise values in a data.frame for R
matrix (req): Pairwise values in a raw tab-delimited file
log (req): List of datasets included in the matrix
hist: Histogram of hAAI values as raw tab-delimited file
haai_distances

rds (req): Pairwise values in a data.frame for R
matrix (req): Pairwise values in a raw tab-delimited file
log (req): List of datasets included in the matrix
hist: Histogram of AAI values as raw tab-delimited file
aai_distances

rds (req): Pairwise values in a data.frame for R
matrix (req): Pairwise values in a raw tab-delimited file
log (req): List of datasets included in the matrix
hist: Histogram of ANI values as raw tab-delimited file
ani_distances

report (req for genomes): PDF file including a graphic report for the clustering
class_table (req for genomes): Tab-delimited file containing the classification of all datasets in AAI clusters
class_tree (req for genomes): Newick file containing the classification of all datasets in AAI clusters as a dendrogram
classif (req for genomes): Tab-delimited file containing the highest-level classification of each dataset, the medoid of the cluster, and the AAI against the corresponding medoid
medoids (req for genomes): List of medoids per cluster
aai_tree: Bio-NJ tree based on AAI distances in Newick format
aai_dist_rds: AAI-based distances in R data serialized format
proposal (req): Proposed species-level clades in the project, based on clades_ani95. One line per proposed clade, with tab-delimited dataset names. Only clades with 5 or more members are included
clades_aai90: Clades formed at AAI > 90%. One clade per line, with comma-delimited dataset names
clades_ani95: Clades formed at ANI > 95%. One clade per line, with comma-delimited dataset names
medoids_ani95: List of clades_ani95 datasets with the smallest ANI distance to all members of their own ANI95 clade. The list is in the same order as clades_ani95
clade_finding
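The relationship between clades_ani95, proposal, and medoids_ani95 described above can be sketched in Python as follows. The function names and the `dist` callable are hypothetical. The file layouts follow the descriptions in the list (comma-delimited clades, tab-delimited proposals, a minimum of 5 members), while choosing the medoid by the smallest summed distance to the other members is an assumption about how "smallest ANI distance to all members" is evaluated.

```python
def parse_clades(text):
    """Parse clades text: one clade per line, comma-delimited dataset names."""
    return [line.strip().split(",") for line in text.splitlines() if line.strip()]


def propose_clades(clades, min_members=5):
    """Keep clades with at least min_members datasets, as tab-delimited lines."""
    return ["\t".join(clade) for clade in clades if len(clade) >= min_members]


def medoid(clade, dist):
    """Return the member with the smallest summed distance to all others.

    dist: hypothetical callable (a, b) -> ANI distance between two datasets.
    Summing the distances is an ASSUMED criterion for this sketch.
    """
    return min(clade, key=lambda a: sum(dist(a, b) for b in clade if b != a))


def medoids_in_order(clades, dist):
    """One medoid per clade, in the same order as the input clades."""
    return [medoid(clade, dist) for clade in clades]
```

Because `medoids_in_order` iterates over the clades as parsed, the resulting list stays aligned line by line with the clades input.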

report (req): PDF file including a graphic report for the clustering
class_table (req): Tab-delimited file containing the classification of all datasets in ANI clusters
class_tree (req): Newick file containing the classification of all datasets in ANI clusters as a dendrogram
classif (req): Tab-delimited file containing the highest-level classification of each dataset, the medoid of the cluster, and the ANI against the corresponding medoid
medoids (req): List of medoids per cluster
ani_tree: Bio-NJ tree based on ANI distances in Newick format
ani_dist_rds: ANI-based distances in R data serialized format
subclades

ogs (req): Matrix of orthology groups, as tab-delimited raw file
stats (req): Summary statistics in JSON format
abc (gz): When available, it includes all the individual RBM files in ABC format. This file is typically produced as an intermediate result and removed before finishing, but it can be kept by running miga option -P . --key clean_ogs --value false in the project folder from the CLI
core_pan: Summary statistics of rarefied core-genome/pangenome sizes in tab-delimited format
core_pan_plot: Plot of rarefied core-genome/pangenome sizes in PDF
ogs

taxonomy_index (req): Index of datasets per taxonomy in JSON format
metadata_index (req): Searchable index of datasets metadata as SQLite3 database
project_stats