QUAST 4.1 manual

QUAST stands for QUality ASsessment Tool. The tool evaluates genome assemblies by computing various metrics. This document provides instructions for the general QUAST tool for genome assemblies, MetaQUAST, the extension for metagenomic datasets, and Icarus, the interactive visualizer for these tools.

You can find all project news and the latest version of the tool at http://quast.sf.net/.

QUAST utilizes E-MEM (an improvement over MUMmer), GeneMarkS, GeneMark-ES, GlimmerHMM, and GAGE. In addition, MetaQUAST uses MetaGeneMark, Krona tools, BLAST, and SILVA 16S rRNA database. Starting from version 3.2, QUAST package also includes reads processing tools for finding structural variants between the reference genome and actual organism. These tools are Bowtie2, SAMtools, and Manta.
All tools above are built into the QUAST package, which is ready for use by academic, non-profit institutions and U.S. Government agencies. If you are not in one of these categories, please refer to the LICENSE section 'Third-party tools incorporated into QUAST' for guidelines on how to complete the licensing process.

Version 4.1 of QUAST was released under GPL v2 on 26 May 2016. Note that some of the built-in third-party tools are not under GPL v2. See LICENSE for details.

Contents

  1. Installation
  2. Running QUAST
    1. For impatient people
    2. Input data
    3. GAGE mode
    4. Command line options
    5. Metagenomic assemblies
  3. QUAST output
    1. Metrics description
      1. Summary report
      2. Misassemblies report
      3. Unaligned report
    2. Plots descriptions
    3. MetaQUAST output
    4. Icarus output
  4. Adjusting QUAST reports and plots
  5. Citation
  6. Feedback and bug reports
  7. FAQ

1. Installation

QUAST can be run on Linux or Mac OS.

Its default pipeline requires:

In addition, QUAST submodules require:

All those tools are usually preinstalled on Linux.
Mac OS, however, initially misses make, g++ and ar, so you will have to install Xcode (or only Command Line Tools for Xcode) to make them available.

It is also highly recommended to install the Matplotlib Python library for drawing plots. We recommend using Matplotlib version 1.1 or higher (tested with Matplotlib v.1.3.1).
Installation can be done with Python pip-installer:
    pip install matplotlib
Or with the Easy Install Python module:
    easy_install matplotlib
Or on Ubuntu by typing:
    sudo apt-get install python-matplotlib

To download the QUAST source code tarball and extract it, type:
    wget https://downloads.sourceforge.net/project/quast/quast-4.1.tar.gz
    tar -xzf quast-4.1.tar.gz
    cd quast-4.1

QUAST automatically compiles all its sub-parts when needed (on the first use). Thus, there is no special installation command for QUAST. However, we recommend that you run:

    python quast.py --test (if you plan to use quast.py)
or/and
    python quast.py --test-sv  (if you plan to use quast.py or metaquast.py with SV calling)
or/and
    python metaquast.py --test  (if you plan to use metaquast.py with reference genomes)
or/and
    python metaquast.py --test-no-ref  (if you plan to use metaquast.py without reference genomes)
These commands run all QUAST and MetaQUAST modules and check correctness of their work on your platform.

We also provide ./install.sh and ./install_full.sh scripts. If you plan to use MetaQUAST without reference genomes or to use SV calling, you should use ./install_full.sh, which runs all four commands mentioned above, compiles tools for read alignment and SV detection (SAMtools, Bowtie2, and Manta), and downloads the necessary files (SILVA 16S rRNA gene database and BLAST). Otherwise, it is enough to use ./install.sh, which runs only the two simple --test commands.

Note: you should place the quast-4.1 directory in its final destination before the first use (e.g. before running with --test). If you want to move QUAST to a new location after it has been used, start from a clean copy of quast-4.1. This limitation is caused by auto-generation of absolute paths in compiled modules of QUAST.

2. Running QUAST

2.1 For impatient people


Running QUAST on test data from the installation tarball (reference genome, gene and operon annotations, and two assemblies of the first 10 kbp of E. coli):
    ./quast.py test_data/contigs_1.fasta \
               test_data/contigs_2.fasta \
               -R test_data/reference.fasta.gz \
               -G test_data/genes.gff
View the summary of the evaluation results with the less utility:
    less quast_results/latest/report.txt

2.2 Input data

The test_data directory contains examples of assembly, reference genome, genes and operons files.

Sequences
The tool accepts assemblies and reference genomes in FASTA format. Files may be compressed with zip, gzip, or bzip2.
A reference genome with multiple chromosomes can be provided as a single FASTA file with separate sequence for each chromosome inside.
 
Maximum assembly length is 4.29 Gbp.
Maximum length of a reference sequence (e.g. a chromosome) is 536 Mbp. The number of sequences in a reference file is not limited.
 
These restrictions come from Nucmer, a tool that QUAST uses to align contigs to a reference genome. The metrics that do not require alignment are computed in any case.

Genes and operons
One can also specify files with gene and operon positions in the reference genome. QUAST will count fully and partially aligned regions, and output total values and cumulative plots.
 
The following file formats are supported:

Note that the sequence name has to match a name in the reference file.
 
Coordinates are 1-based, i.e. the first nucleotide in the reference genome has position 1, not 0. If a start position is less than the corresponding end position, the gene or operon is located on the forward strand; otherwise it is located on the reverse-complement strand.
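
The coordinate convention above can be illustrated with a minimal Python sketch. The function names are hypothetical and are not part of QUAST; treating the end position as inclusive is an assumption made for the length calculation.

```python
def feature_strand(start, end):
    """Return '+' for the forward strand, '-' for the reverse-complement strand.

    Coordinates are 1-based, as in the gene/operon files described above.
    """
    if start < 1 or end < 1:
        raise ValueError("coordinates are 1-based and must be >= 1")
    # start < end means forward strand; anything else is reverse-complement.
    return "+" if start < end else "-"


def feature_length(start, end):
    """Length in bp, assuming both coordinates are inclusive (an assumption)."""
    return abs(end - start) + 1
```

For example, a gene listed with start 500 and end 200 would be reported on the reverse-complement strand by this sketch.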

2.3 GAGE mode

GAGE is a well-known assessment tool. However, it has limitations:

These issues are solved by QUAST in GAGE mode (run with --gage). QUAST filters contigs according to a specified threshold and runs GAGE on each assembly. GAGE statistics (see the GAGE website and the GAGE paper for the descriptions) are reported in addition to the standard QUAST report.
 
Note:

2.4 Command line options


QUAST runs from a command line as follows:
    python quast.py [options] <contig_file(s)>
Options:
-o <output_dir>
Output directory. The default value is quast_results/results_<date_time>.
Also, a symlink quast_results/latest is created.

Note: QUAST reuses Nucmer alignments if run repeatedly with the same output directory. Thus, you can efficiently reuse already computed results when running QUAST with different parameters, or adding more assemblies to an existing comparison.
-R <path>
Reference genome file. Optional. Many metrics can't be evaluated without a reference. If this is omitted, QUAST will only report the metrics that can be evaluated without a reference.
-G <path> (or --genes <path>)
File with gene positions in the reference genome. See details about the file format in section 2.2.
 
If you do not have gene positions, you can make QUAST predict genes with --gene-finding.
-O <path> (or --operons <path>)
File with operon positions in the reference genome. See details about the file format in section 2.2.
--min-contig (or -m) <int>
Lower threshold for a contig length. Shorter contigs won't be taken into account (except for some metrics, see section 3). The default value is 500.

Advanced options:
--threads (or -t) <int>
Maximum number of threads. The default value is 25% of all available CPUs but not less than 1. If QUAST fails to determine the number of CPUs, maximum threads number is set to 4.
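
As an illustration of the default described above, here is a minimal Python sketch. It is not QUAST's actual code: the function name is hypothetical, and rounding 25% down with integer division is an assumption.

```python
import multiprocessing

FALLBACK_THREADS = 4  # used when the CPU count cannot be determined


def default_threads(cpu_count=None):
    """Return 25% of the available CPUs, but never less than 1."""
    if cpu_count is None:
        try:
            cpu_count = multiprocessing.cpu_count()
        except NotImplementedError:
            # The platform cannot report its CPU count.
            return FALLBACK_THREADS
    # Integer division (rounding down) is an assumption of this sketch.
    return max(1, cpu_count // 4)
```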
--labels (or -l) <label,label...>
Human-readable assembly names. Those names will be used in reports, plots and logs. For example:
-l SPAdes,IDBA-UD
If your labels include spaces, use quotes:
-l SPAdes,"Assembly 2",Assembly3
-l "SPAdes 2.5, SPAdes 2.4, IDBA-UD"
-L
Take assembly names from their parent directory names.
--gene-finding (or -f)
Enables gene finding. Affects performance, thus disabled by default.
 
By default, we assume that the genome is prokaryotic, and apply GeneMarkS for gene finding. If the genome is eukaryotic, use --eukaryote to enable GeneMark-ES instead. If it is a metagenome, use --meta.
 
If a gene file is provided with -G as well, both # genes in the file covered by the assembly, and # predicted genes are reported. Note that operons are not predicted, but a file of known operon positions can be provided instead.
--glimmer
Use GlimmerHMM for gene finding for eukaryotes instead of GeneMark-ES.
--gene-thresholds <int,int,...>
Comma-separated list of thresholds for gene lengths to find with a finding tool. The default value is 0,300,1500,3000. Note: this list is used only if --gene-finding option is specified.
--eukaryote (or -e)
Genome is eukaryotic. Affects gene finding and contig alignment:
  1. For prokaryotes (which is default), GeneMarkS is used. For eukaryotes, GeneMark-ES is used.
  2. By default, QUAST assumes that a genome is circular and correctly processes its linear representation. This option indicates that the genome is not circular.
--meta
Use MetaGeneMark for gene finding, if --gene-finding is specified. If --eukaryote is also provided, MetaGeneMark still will be used.
 
Note: if you are working with metagenome assemblies, we recommend using metaquast.py instead of quast.py (it is in the same directory as quast.py).
--est-ref-size <int>
Estimated reference genome size (in bases) for computing NGx statistics. This value will be used only if a reference genome file is not specified (see the -R option).
--gage
Starts QUAST in "GAGE mode" (see section 2.3). Note: in this case, you also have to specify a reference genome with -R.
--contig-thresholds <int,int,...>
Comma-separated list of contig length thresholds. Used in # contigs ≥ x and total length (≥ x) metrics (see section 3). The default value is 0,1000.
--scaffolds (or -s)
The assemblies are scaffolds (rather than contigs). QUAST will add split versions of assemblies to the comparison (named <assembly_name>_broken). Assemblies are split at continuous fragments of N's of length ≥ 10. If the broken version is equal to the original assembly (i.e. nothing was split), it is not included in the comparison.
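
The splitting rule can be sketched in a few lines of Python. This is an illustration only, with hypothetical names, not QUAST's implementation; uppercasing the sequence first and dropping empty pieces are assumptions of the sketch.

```python
import re

# A run of 10 or more N's marks a scaffold gap.
N_RUN = re.compile(r"N{10,}")


def break_scaffold(sequence):
    """Split a scaffold into contigs at every run of >= 10 N's."""
    return [part for part in N_RUN.split(sequence.upper()) if part]


def was_split(sequence):
    """True if the broken version differs from the original scaffold."""
    return break_scaffold(sequence) != [sequence.upper()]
```

A run of nine N's stays inside the contig, while a run of ten or more separates two contigs.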
--use-all-alignments (or -u)
Compute genome fraction, # genes, # operons metrics in the manner used in QUAST v.1.*. By default, QUAST v.2.0 and higher filters out ambiguous and redundant alignments, keeping only one alignment per contig (or one set of non-overlapping or slightly overlapping alignments). This option makes QUAST count all alignments.
--min-alignment (or -i) <int>
Minimum length of alignment. It is Nucmer's parameter which filters all alignments shorter than the value. Default is 0 bp. In any case, Nucmer will not produce alignments shorter than 65 bp (default min cluster size).
--ambiguity-usage (or -a) <none|one|all>
Way of processing equally good alignments of a contig (probably repeats):
  none: skip all such alignments;
  one: take only one (the first one);
  all: use all alignments. Can cause a significant increase of # mismatches (repeats are almost always inexact due to accumulated SNPs, indels, etc.).
This option is also used for processing internal overlaps between adjacent aligned blocks of a misassembled contig:
  none: exclude (remove) overlapped fragments from both blocks;
  one: remove the overlapped fragment from only one block (the shorter one);
  all: use both blocks unchanged.
The default value is 'one'.
--strict-NA
Break contigs at every misassembly event (including local ones) to compute NAx and NGAx statistics. By default, QUAST breaks contigs only at extensive misassemblies (not local ones).
--extensive-mis-size <int>
Lower threshold for the relocation size (gap or overlap size between left and right flanking sequence, see section 3.1.2 for details). Shorter relocations are considered as local misassemblies. Does not affect other types of extensive misassemblies (inversions and translocations). The default value is 1000 bp. Note that the threshold should be greater than maximum indel length which is 85 bp (Nucmer default value).
--significant-part-size <int>
Lower threshold for detecting partially unaligned contigs with both significant aligned and unaligned parts, see section 3.1.3 for details. The default value is 500 bp.
--fragmented
Reference genome is fragmented (e.g. a scaffold reference). QUAST will try to detect misassemblies caused by the fragmentation and mark them fake (will be excluded from # misassemblies).
--plots-format <format>
File format for plots. Supported formats: emf, eps, pdf, png, ps, raw, rgba, svg, svgz. The default format is PDF.
--memory-efficient
Run Nucmer using one thread, separately per each assembly and each chromosome. This may significantly reduce memory consumption on large genomes.
--silent
Do not print detailed information about each step in standard output. This option does not affect quast.log file.

Structural variant (SV) calling and processing (experimental, please use it with care until we finalize the feature):
--reads1 (or -1)
File with forward reads in FASTQ format (may be gzipped).
--reads2 (or -2)
File with reverse reads in FASTQ format (may be gzipped).

Reads are used for SV detection: Reads are aligned to reference genome using bowtie2, then Manta SV calling tool is run on bowtie2 output. Found SVs are used for classifying QUAST misassemblies into true ones and fake ones (caused by structural differences between reference sequence and sequenced organism). Fake misassemblies are excluded from # misassemblies and reported as # structural variants.

--sv-bedpe
Use specified file in BEDPE format as a list of structural variations (SV). This option disables SV detection based on reads. Examples of BEDPE files for various types of SV are in FAQ section, question Q8.

Speedup options:
--no-check
Do not check and correct input FASTA files (both reference genome and assemblies). By default, QUAST corrects sequence names by replacing special characters (all symbols except Latin letters, numbers, underscores, dots, and minus signs) with underscores ("_"). QUAST also checks and corrects the sequences themselves: lowercase letters are changed to uppercase, and alternative nucleotide symbols (M, K, R, etc.) are replaced with N. If non-ACGTN characters are still present after these modifications, the whole FASTA file is excluded from further processing.
Caution: use this option at your own risk. Incorrect FASTA files may cause failures of third-party tools incorporated into QUAST, i.e. GAGE, Nucmer, GeneMark, GlimmerHMM. This option is useful for running QUAST without -R and --gene-finding (no third-party tools will be run) or if you are absolutely sure that your FASTA files are correct.
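
The default correction described for --no-check can be approximated by the following Python sketch. The function names are hypothetical, and the exact set of "alternative nucleotide symbols" is an assumption (the IUPAC ambiguity codes M, K, R, Y, S, W, B, V, H, D); QUAST's own rules may differ in detail.

```python
import re


def correct_name(name):
    """Replace every character except letters, digits, '_', '.', '-' with '_'."""
    return re.sub(r"[^A-Za-z0-9_.\-]", "_", name)


def correct_sequence(seq):
    """Uppercase the sequence and replace ambiguity codes with N.

    Returns None if characters other than A, C, G, T, N remain, which
    corresponds to skipping the whole FASTA file.
    """
    seq = seq.upper()
    # Assumed set of alternative nucleotide symbols (IUPAC ambiguity codes).
    seq = re.sub(r"[MKRYSWBVHD]", "N", seq)
    if re.search(r"[^ACGTN]", seq):
        return None
    return seq
```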
--no-plots
Do not draw plots.
--no-html
Do not build HTML reports and Icarus viewers.
--no-snps
Do not report SNP statistics. This may significantly reduce memory consumption on large genomes and speed up computation. However, SNP-related metrics will not be reported (e.g. # mismatches per 100 kbp).
--no-gc
Do not compute GC% and do not produce GC-distribution plots (both in HTML report and in PDF).
--no-sv
Do not run structural variant calling and processing (makes sense only if reads are specified).
--fast
A shortcut for using all of the speedup options except --no-check.

MetaQUAST only:
--test-no-ref
Run MetaQUAST on data from the test_data folder, but without reference genomes. The tool will download the SILVA 16S rRNA gene database (170 Mb) and BLAST binaries (55-75 Mb depending on your OS), which will be required if you plan to use MetaQUAST without references. See section 2.5 for details about the reference search algorithm.
--max-ref-num <int>
Maximum number of reference genomes (per each assembly) to download after searching in SILVA database. Default value is 30.
--unique-mapping <int>
Force --ambiguity-usage='one' for the combined reference genome ('all' is used by default).

Other:
--test
Run the tool on data from the test_data folder and check correctness of the evaluation process. Output is saved in quast_test_output.
--test-sv
Run the tool on data from the test_data folder using the reads for SV detection. The tool will compile the required programs (SAMtools, Bowtie2, and Manta Structural Variant Caller).
-h (or --help)
Print help.
-v (or --version)
Print version.

2.5 Metagenomic assemblies

The metaquast.py script accepts multiple reference genomes. One can provide several files or a directory with multiple reference files inside. The tool partitions all contigs into groups aligned to each reference genome. Note that a contig may belong to several groups simultaneously if it aligns to several references.
MetaQUAST runs quast.py for each of the following:

If you run MetaQUAST without providing reference genomes, the tool will try to identify genome content of the metagenome. MetaQUAST uses BLASTN for aligning contigs to SILVA rRNA database, i.e. FASTA file containing small subunit ribosomal RNA sequences. For each assembly, 30 reference genomes with top scores are chosen. Maximum number of references to download can be specified with --max-ref-number.

Reference genomes for the chosen genomes are downloaded from the NCBI database to <quast_output_dir>/quast_downloaded_references/. After the first run of quast.py, MetaQUAST removes reference genomes with low genome fraction (less than 10%) and runs quast.py on the remaining references. Note that MetaQUAST uses --ambiguity-usage 'all' when running quast.py on the concatenation of all input references ("combined reference") unless --unique-mapping is specified.

Usage:

    python metaquast.py contigs_1 contigs_2 ... -R reference_1,reference_2,reference_3,...

All options are the same as for quast.py, except for -R: it can accept multiple reference genomes (comma-separated list without spaces in between) or a directory with references.
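
If you launch MetaQUAST from a script, the comma-separated -R argument can be assembled as in the following Python sketch. The helper name and all file names are hypothetical; only the -R and -o options described in this manual are used, and the resulting list could be passed to subprocess.call.

```python
def metaquast_command(contigs, references, output_dir=None):
    """Build an argument list for metaquast.py (illustrative helper)."""
    for ref in references:
        # -R takes a comma-separated list without spaces in between, so a
        # comma or space inside a path would be ambiguous.
        if "," in ref or " " in ref:
            raise ValueError("reference path must not contain ',' or ' ': %r" % ref)
    cmd = ["python", "metaquast.py"] + list(contigs)
    cmd += ["-R", ",".join(references)]
    if output_dir:
        cmd += ["-o", output_dir]
    return cmd
```

For example, metaquast_command(["asm1.fasta"], ["ref_a.fasta", "ref_b.fasta"], "out") yields the arguments for python metaquast.py asm1.fasta -R ref_a.fasta,ref_b.fasta -o out.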

3. QUAST output

If an output path is not specified manually (with -o), QUAST puts its output into the quast_results/results_<date_time> directory and creates a latest symlink to it under the quast_results/ directory.

QUAST output contains:
report.txt: assessment summary in plain text format,
report.tsv: tab-separated version of the summary, suitable for spreadsheets (Google Docs, Excel, etc.),
report.tex: LaTeX version of the summary,
icarus.html: Icarus main menu with links to interactive viewers (see section 3.4 for details),
report.pdf: all other plots combined with all tables (created only if the Matplotlib Python library is installed),
report.html: HTML version of the report with interactive plots inside,
contigs_reports/
  misassemblies_report: detailed report on misassemblies (see section 3.1.2 for details),
  unaligned_report: detailed report on unaligned and partially unaligned contigs (see section 3.1.3 for details).
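
The tab-separated report is convenient for post-processing in scripts. The Python sketch below assumes a layout with one metric per row, the metric name in the first column, and one column per assembly with assembly names in the header row. That layout is an assumption of this example and should be checked against your own report.tsv. The sample values are made up for illustration.

```python
import csv
import io


def read_report_tsv(handle):
    """Return {assembly_name: {metric_name: value_as_string}}."""
    rows = list(csv.reader(handle, delimiter="\t"))
    header, body = rows[0], rows[1:]
    report = {}
    for column, assembly in enumerate(header[1:], start=1):
        # Skip rows that have no value in this column (e.g. blank lines).
        report[assembly] = {row[0]: row[column] for row in body if len(row) > column}
    return report


# Hypothetical content, not real QUAST output.
sample = io.StringIO("Assembly\tasm1\tasm2\n# contigs\t10\t12\nN50\t5000\t4200\n")
parsed = read_report_tsv(sample)
```

With a real run, the handle would come from open("quast_results/latest/report.tsv").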

Note:

3.1 Metrics description

3.1.1 Summary report

# contigs (≥ x bp) is total number of contigs of length ≥ x bp. Not affected by the --min-contig parameter (see section 2.4).

Total length (≥ x bp) is the total number of bases in contigs of length ≥ x bp. Not affected by the --min-contig parameter (see section 2.4).

All remaining metrics are computed for contigs that exceed the threshold specified with --min-contig (see section 2.4, default is 500 bp).

# contigs is the total number of contigs in the assembly.

Largest contig is the length of the longest contig in the assembly.

Total length is the total number of bases in the assembly.

Reference length is the total number of bases in the reference genome.

GC (%) is the total number of G and C nucleotides in the assembly, divided by the total length of the assembly.

Reference GC (%) is the percentage of G and C nucleotides in the reference genome.

N50 is the length for which the collection of all contigs of that length or longer covers at least half an assembly.

NG50 is the length for which the collection of all contigs of that length or longer covers at least half the reference genome.
This metric is computed only if the reference genome is provided.

N75 and NG75 are defined similarly to N50 but with 75 % instead of 50 %.

L50 (L75, LG50, LG75) is the number of contigs equal to or longer than N50 (N75, NG50, NG75).
In other words, L50, for example, is the minimal number of contigs that cover half the assembly.
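
The Nx and Lx definitions above can be written as a short Python sketch. The function name is hypothetical and this is not QUAST's own code. Passing the reference length as total gives the NGx and LGx variants.

```python
def nx(lengths, x=50, total=None):
    """Return (Nx, Lx) for a list of contig lengths.

    total defaults to the assembly length (Nx, Lx); pass the reference
    genome length instead to obtain NGx and LGx.
    Returns (None, None) if the contigs do not cover x % of total.
    """
    lengths = sorted(lengths, reverse=True)
    if total is None:
        total = sum(lengths)
    threshold = total * x / 100.0
    covered = 0
    for count, length in enumerate(lengths, start=1):
        covered += length
        if covered >= threshold:
            # 'length' is the shortest contig needed to reach the threshold,
            # 'count' is the minimal number of contigs needed.
            return length, count
    return None, None
```

For contigs of 80, 70, 50, 40, 30 and 20 bp (290 bp in total), the two largest contigs already cover 150 bp, which is at least half of the assembly, so N50 is 70 and L50 is 2.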

# misassemblies is the number of positions in the contigs that satisfy one of the following criteria:

This metric requires a reference genome. Note that default threshold of 1 kbp can be changed with --extensive-mis-size.

# misassembled contigs is the number of contigs that contain misassembly events.

Misassembled contigs length is the total number of bases in misassembled contigs.

# local misassemblies is the number of breakpoints that satisfy the following conditions:

  1. Two or more distinct alignments cover the breakpoint.
  2. The gap between left and right flanking sequences is less than 1 kbp.
  3. The left and right flanking sequences both are on the same strand of the same chromosome of the reference genome.

# unaligned contigs is the number of contigs that have no alignment to the reference sequence. The value "X + Y part" means X totally unaligned contigs plus Y partially unaligned contigs.

Unaligned length is the total length of all unaligned regions in the assembly (sum of lengths of fully unaligned contigs and unaligned parts of partially unaligned ones).

Genome fraction (%) is the percentage of aligned bases in the reference genome. A base in the reference genome is aligned if there is at least one contig with at least one alignment to this base. Contigs from repetitive regions may map to multiple places, and thus may be counted multiple times.

Duplication ratio is the total number of aligned bases in the assembly divided by the total number of aligned bases in the reference genome (see Genome fraction (%) for the 'aligned base' definition). If the assembly contains many contigs that cover the same regions of the reference, its duplication ratio may be much larger than 1. This may occur due to overestimating repeat multiplicities and due to small overlaps between contigs, among other reasons.
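
The relation between these two metrics can be sketched in Python as follows. The names are hypothetical, and the sketch simplifies matters by measuring each alignment by its span on the reference (1-based, inclusive coordinates), i.e. it ignores indels inside alignments. That is an assumption of this example, not a description of QUAST internals.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def genome_fraction_and_duplication(alignments, reference_length):
    """alignments: list of (ref_start, ref_end), one per aligned block."""
    # Every alignment counts in full, even where alignments overlap.
    aligned_assembly_bases = sum(end - start + 1 for start, end in alignments)
    # Each reference base counts once, however many alignments cover it.
    covered = sum(end - start + 1 for start, end in merge_intervals(alignments))
    fraction = 100.0 * covered / reference_length
    duplication = aligned_assembly_bases / float(covered) if covered else None
    return fraction, duplication
```

Three 100 bp alignments at 1-100, 51-150 and 301-400 on a 1000 bp reference cover 250 reference bases, giving a genome fraction of 25 % and a duplication ratio of 300 / 250 = 1.2.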

# N's per 100 kbp is the average number of uncalled bases (N's) per 100000 assembly bases.

# mismatches per 100 kbp is the average number of mismatches per 100000 aligned bases. True SNPs and sequencing errors are not distinguished and are counted equally.

# indels per 100 kbp is the average number of indels per 100000 aligned bases. Several consecutive single nucleotide indels are counted as one indel.
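
The three "per 100 kbp" metrics share the same normalization, sketched below in Python with hypothetical function names. The grouping of consecutive single-nucleotide indels is shown on a plain list of positions, which is a simplification made for this example.

```python
def per_100_kbp(count, bases):
    """Average number of events per 100000 bases."""
    return 100000.0 * count / bases


def count_indel_events(positions):
    """Count indels, treating runs of consecutive positions as one indel."""
    events = 0
    previous = None
    for pos in sorted(positions):
        if previous is None or pos != previous + 1:
            events += 1
        previous = pos
    return events
```

For instance, 25 mismatches over 5000000 aligned bases correspond to 0.5 mismatches per 100 kbp.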

# genes is the number of genes in the assembly (complete and partial), based on a user-provided list of gene positions in the reference genome. A gene is 'partially covered' if the assembly contains at least 100 bp of this gene but not the whole gene.
 
This metric is computed only if a reference genome and an annotated list of gene positions are provided (see section 2.4).

# operons is defined similarly to # genes, but an operon positions file is required instead.

# predicted genes is the number of genes in the assembly found by GeneMarkS, GeneMark-ES, GlimmerHMM or MetaGeneMark. See the description of --gene-finding option for details.

Largest alignment is the length of the largest continuous alignment in the assembly. A value can be smaller than a value of largest contig if the largest contig is misassembled.

NA50, NGA50, NA75, NGA75, LA50, LA75, LGA50, LGA75 ("A" stands for "aligned") are similar to the corresponding metrics without "A", but in this case aligned blocks instead of contigs are considered.
Aligned blocks are obtained by breaking contigs at misassembly events and removing all unaligned bases.

3.1.2 Misassemblies report

# misassemblies is the same as # misassemblies from section 3.1.1. However, this report also contains a classification of all misassembly events into three groups: relocations, translocations, and inversions (see below). For metagenomic assemblies, this classification also includes interspecies translocation.

Relocation is a misassembly event (breakpoint) where the left flanking sequence aligns over 1 kbp away from the right flanking sequence on the reference genome, or they overlap by more than 1 kbp, and both flanking sequences align on the same chromosome. Note that default threshold of 1 kbp can be changed by --extensive-mis-size.

Translocation is a misassembly event (breakpoint) where the flanking sequences align on different chromosomes.

Interspecies translocation is a misassembly event (breakpoint) where the flanking sequences align on different reference genomes (MetaQUAST only).

Inversion is a misassembly event (breakpoint) where the flanking sequences align on opposite strands of the same chromosome.
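
The definitions above can be summarized as a decision procedure. The Python sketch below is an illustration with hypothetical names, not QUAST's implementation. The order of the checks and the handling of a distance exactly equal to the threshold are assumptions, and very short gaps that are reported as indels are not modelled.

```python
from collections import namedtuple

# Where a flanking sequence aligns: reference genome, chromosome, strand.
Flank = namedtuple("Flank", "genome chromosome strand")


def classify_breakpoint(left, right, distance, extensive_mis_size=1000):
    """Classify a breakpoint between two flanking alignments.

    distance is the gap between the flanks on the reference in bp;
    a negative value denotes an overlap.
    """
    if left.genome != right.genome:
        return "interspecies translocation"  # MetaQUAST only
    if left.chromosome != right.chromosome:
        return "translocation"
    if left.strand != right.strand:
        return "inversion"
    if abs(distance) > extensive_mis_size:
        return "relocation"
    return "local misassembly"
```

The extensive_mis_size argument plays the role of the --extensive-mis-size option.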

Scaffold gap size misassembly is a misassembly event (breakpoint) where the flanking sequences are combined in a scaffold at the wrong distance (--scaffolds only). These misassemblies are not included in the total number of misassemblies.

# misassembled contigs and misassembled contigs length are the same as the metrics from section 3.1.1 and are counted among all contigs with any type of a misassembly event (relocation, translocation, interspecies translocation or inversion).

# possibly misassembled contigs is the number of contigs that contain large unaligned fragment and thus could possibly contain interspecies translocation with unknown reference (MetaQUAST only).

# local misassemblies is the same as # local misassemblies from section 3.1.1.

# structural variants is the number of misassemblies matched with structural variations of genome. These misassemblies are not included in the total number of misassemblies.

# mismatches is the number of mismatches in all aligned bases.

# indels is the number of indels in all aligned bases.

# short indels (≤ 5 bp) is the number of indels of length ≤ 5 bp.

# long indels (> 5 bp) is the number of indels of length > 5 bp.

Indels length is the total number of bases contained in all indels.

Note: Nucmer's default maximum length of indel is 85 bp. All indels larger than 85 bp are considered as local misassemblies.

3.1.3 Unaligned report

# fully unaligned contigs is the number of contigs that have no alignment to the reference sequence.

Fully unaligned length is the total number of bases in all unaligned contigs.

# partially unaligned contigs is the number of contigs that are not fully unaligned, but have fragments with no alignment to the reference sequence.

# with misassembly is the number of partially unaligned contigs that have a misassembly event in their aligned fragment. Note that such misassembly events are not counted in # misassemblies and other misassemblies statistics.

# both parts are significant is the number of partially unaligned contigs that have both aligned and unaligned fragments longer than the value of --significant-part-size.

Partially unaligned length is the total number of unaligned bases in all partially unaligned contigs.

# N's is the total number of uncalled bases (N's) in the assembly.

3.2 Plots description

Contig alignment plot shows alignment of contigs to the reference genome and the positions of misassembly events in these contigs. Contigs that align correctly are colored blue if the boundaries agree (within 2 kbp on each side, contigs are larger than 10 kbp) in at least half of the assemblies, and green otherwise. Blocks of misassembled contigs are colored orange if the boundaries agree in at least half of the assemblies, and red otherwise. Contigs are staggered vertically and are shown in different shades of their color in order to distinguish the separate contigs, including small ones. If the reference file consists of several sequences all of them are drawn on the single plot horizontally next to each other.

Cumulative length plot shows the growth of contig lengths. On the x-axis, contigs are ordered from the largest to smallest. The y-axis gives the size of the x largest contigs in the assembly.
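
The data behind this plot is a running sum over contigs sorted by decreasing length, as in this small Python sketch (the function name is hypothetical):

```python
from itertools import accumulate


def cumulative_lengths(lengths):
    """y-values of the plot: total size of the x largest contigs, x = 1, 2, ..."""
    return list(accumulate(sorted(lengths, reverse=True)))
```

For contigs of 30, 80 and 50 bp the plotted values are 80, 130 and 160.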

Nx plot shows Nx values as x varies from 0 to 100 %.

NGx plot shows NGx values as x varies from 0 to 100 %.

GC content plot shows the distribution of GC content in the contigs.
 
The x value is the GC percentage (0 to 100 %).
The y value is the number of non-overlapping 100 bp windows whose GC content equals x %.
 
For a single genome, the distribution is typically Gaussian. However, for assemblies with contaminants, the GC distribution appears to be a superposition of Gaussian distributions, giving a plot with multiple peaks.
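
The window counting behind this plot can be sketched in Python as follows. The function name is hypothetical. Dropping a trailing window shorter than 100 bp and counting N's as non-GC bases are assumptions of this sketch, and QUAST may treat these cases differently.

```python
from collections import Counter


def gc_window_histogram(contigs, window=100):
    """Map GC percentage (0..100) to the number of windows with that GC content."""
    histogram = Counter()
    for seq in contigs:
        seq = seq.upper()
        # Step through non-overlapping windows; an incomplete tail is skipped.
        for start in range(0, len(seq) - window + 1, window):
            chunk = seq[start:start + window]
            gc = chunk.count("G") + chunk.count("C")
            histogram[100 * gc // window] += 1
    return histogram
```

With 100 bp windows the GC percentage of a window is simply its number of G and C bases, so the x-axis takes integer values.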

Cumulative length plot for aligned contigs shows the growth of lengths of aligned blocks. If a contig has a misassembly event, QUAST breaks it into smaller pieces called aligned blocks.
 
On the x-axis, blocks are ordered from the largest to smallest. The y-axis gives the size of the x largest aligned blocks.
This plot is created only if a reference genome is provided.

NAx and NGAx plots
These plots are similar to the Nx and NGx plots but for the NAx and NGAx metrics respectively. These plots are created only if a reference genome is provided.

Genes plot shows the growth rate of full genes in assemblies.
The y-axis is the number of full genes in the assembly, and the x-axis is the number of contigs in the assembly (from the largest one to the smallest one).
This plot is created only if a reference genome and a gene annotation file are provided.

Operons plot is similar to the previous one but for operons.

3.3 MetaQUAST output

Output for combined reference genome is located inside the directory provided with -o (or in quast_results/latest). An output for each reference genome is placed into separate directory inside <quast_output_dir>/runs_per_reference directory. Also, plots and reports for each metric are saved to <quast_output_dir>/summary/. Combined HTML report is saved to <quast_output_dir>/report.html.

Metric-level plots
These plots are created for each metric to show its values for all assemblies vs all reference genomes. References on the plot are sorted by the mean value of this metric in all assemblies. References are always sorted from the best results to the worst ones, thus the plot can be descending or ascending depending on the metric.

Metric-level reports (TXT, TSV and TEX versions)
These files contain the same information as the metric-level plots, but in different formats: simple text format, tab-separated format, and LaTeX.

Summary HTML-report
Summary HTML-report is created on the basis of the HTML-report in combined_quast_output/. Each row is expandable and contains data for all reference genomes. You can view results separately for each reference genome by clicking on a row preceded by a plus sign.

Note that values for some metrics like # contigs may not sum up, because one contig may be aligned to multiple reference genomes.

Krona charts
Krona pie charts show assemblies and dataset taxonomic profiles. Relative species abundance is calculated based on the total length of contigs aligned to corresponding reference genome. Charts are created for each assembly and one additional chart is created for all assemblies altogether.
Note: these plots are created only in de novo evaluation mode (MetaQUAST without reference genomes).

3.4 Icarus output

Icarus generates contig size viewer and one or more contig alignment viewers (if reference genome/genomes are provided). All of them are located in <quast_output_dir>/icarus_viewers/. The links to the viewers and other auxiliary information are provided in Icarus main menu which is saved in <quast_output_dir>/icarus.html. Note that QUAST HTML report also contains a link to Icarus output.

All Icarus viewers contain a legend with color scheme description. For moving and zooming interactive window you can use mouse, Icarus controls (top panel) or keyboard shortcuts (+, -, ←, →, use Shift to speed up the action).

Contig size viewer
This type of viewer draws contigs ordered from longest to shortest. This ordering is suitable for comparing only largest contigs or number of contigs longer than a specific threshold. The viewer shows N50 and N75 with color and textual indication. If the reference genome is available or at least approximate genome length is known (see --est-ref-size), NG50 and NG75 are also shown. You can also tone down contigs shorter than a specified threshold using Icarus control panel.

Contig alignment viewer
This type of viewer is available only if a reference genome is provided. For large genomes (≥ 50 Mbp) each chromosome is displayed in a separate viewer. This is also true for multiple reference genomes (see section 2.5).
The viewer places contigs according to their mapping to the reference genome. The viewer can additionally visualize genes, operons, and read coverage distribution along the genome, if any of those were fed to QUAST.

Note: We recommend using Icarus in Chrome; however, it was tested in other popular web browsers as well (see FAQ, Q9 for the exact list with versions).

4. Adjusting QUAST reports and plots

You can easily change the content, order of metrics, and metric names in all QUAST reports. To do this, edit the CONFIGURABLE PARAMETERS section in libs/reporting.py. It contains many informative comments, which will help you adjust QUAST reports easily even if you are new to Python.

You can also adjust plot colors, style and width of lines, legend font, etc. See the CONFIGURABLE PARAMETERS section in libs/plotter.py.

Note: if you restart QUAST on the same directory with new parameters, it will reuse existing alignments and run much faster. See the description of the -o option in section 2.4.

5. Citation


If you use QUAST in your research, please include Gurevich et al., 2013 in your reference list:
Alexey Gurevich, Vladislav Saveliev, Nikolay Vyahhi and Glenn Tesler,
QUAST: quality assessment tool for genome assemblies,
Bioinformatics (2013) 29 (8): 1072-1075. doi: 10.1093/bioinformatics/btt086
First published online: February 19, 2013

If you use MetaQUAST in your research, please include Mikheenko et al., 2016 in your reference list:
Alla Mikheenko, Vladislav Saveliev, Alexey Gurevich,
MetaQUAST: evaluation of metagenome assemblies,
Bioinformatics (2016) 32 (7): 1088-1090. doi: 10.1093/bioinformatics/btv697
First published online: November 26, 2015

If you use Icarus visualizations in your research, please include Mikheenko et al., 2016 in your reference list:
Alla Mikheenko, Gleb Valin, Andrey Prjibelski, Vladislav Saveliev, Alexey Gurevich,
Icarus: visualizer for de novo assembly evaluation,
Submitted (2016)

6. Feedback and bug reports

We will be thankful if you help us make QUAST better by sending your comments, bug reports, and suggestions to quast.support@bioinf.spbau.ru.

We kindly ask you to attach the quast.log file from the output directory (or an archive of the entire folder) if you have trouble running QUAST.

Note that if you didn't specify the output directory manually, it is going to be automatically set to quast_results/results_<date_time>, with a symbolic link quast_results/latest to that directory.

7. FAQ

This section contains frequently asked questions about QUAST. Read the answers below for a deeper understanding of the results generated by the tool.

For simplicity, we refer below to the directory containing all results as <quast_output_dir>.
If you use the command-line version of QUAST, you can specify <quast_output_dir> with the -o option ("quast_results/latest" if not specified).
If you use http://quast.bioinf.spbau.ru/, you should download the full report by pressing the "Download report" button (at the top-right corner), decompress the result, and go to the "full_report" subdirectory.


Q1. It seems that QUAST is giving me a differing number of misassemblies and misassembled contigs. Does this imply that QUAST looks for multiple misassemblies within one contig?

Yes, you are right, QUAST looks for multiple misassembly events within one contig. Thus, the number of misassembled contigs is always less than or equal to the number of misassemblies.


Q2. Is there a way to get only misassembled contigs of the assembly?

Yes.
QUAST copies all misassembled contigs of the "<assembly_name>" assembly into the <quast_output_dir>/contigs_reports/<assembly_name>.mis_contigs.fa file.
E.g. if your assembly is "contigs.fasta" then the file is "contigs.mis_contigs.fa", if your assembly is "ecoli_assembly_1.fa.gz" then the file is "ecoli_assembly_1.mis_contigs.fa".
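The naming rule from the two examples above can be sketched as a small helper. This is an assumption-laden illustration: the lists of stripped extensions and both function names are hypothetical, and the exact rules QUAST uses to derive <assembly_name> may differ.

```python
import os

# Assumed suffix lists, chosen only to reproduce the two examples above.
COMPRESSION_EXTS = (".gz", ".bz2", ".zip")
FASTA_EXTS = (".fasta", ".fa", ".fna", ".fsa")


def assembly_label(assembly_path):
    """Strip the directory, one compression suffix, and one FASTA suffix."""
    name = os.path.basename(assembly_path)
    for exts in (COMPRESSION_EXTS, FASTA_EXTS):
        for ext in exts:
            if name.endswith(ext):
                name = name[:-len(ext)]
                break
    return name


def mis_contigs_path(quast_output_dir, assembly_path):
    """Expected location of the misassembled-contigs FASTA for one assembly."""
    return os.path.join(quast_output_dir, "contigs_reports",
                        assembly_label(assembly_path) + ".mis_contigs.fa")


print(mis_contigs_path("quast_results/latest", "ecoli_assembly_1.fa.gz"))
```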


Q3. Is it possible to find which misassembly corresponds to each contig and which kind of a misassembly event it is?

Yes, it is possible. QUAST produces a report with detailed information about the alignments of each contig, and a short version containing only the extensive misassembly records.

Let's start with the short one. It is saved to <quast_output_dir>/contigs_reports/contigs_report_<assembly_name>.mis_contigs.info. E.g. if your assembly is "contigs.fasta" then the file is "contigs_report_contigs.mis_contigs.info", if your assembly is "ecoli_assembly_1.fasta" then the file is "contigs_report_ecoli_assembly_1.mis_contigs.info".
The content of this file looks like this:

NODE_601
Extensive misassembly ( inversion ) between 287 575 and 296 1
Extensive misassembly ( relocation, inconsistency = 2655 ) between 16800 18907 and 18905 20382
In this example, we can see that contig named NODE_601 has two extensive misassemblies. The first is an inversion. It occurred between fragments 287 575 and 296 1 (coordinates on the contig). The first fragment (287-575 bp) aligned to the forward strand and the second one (1-296 bp) to the reverse strand (coordinates are descending). The second misassembly is a relocation. It occurred between fragments 16800-18907 and 18905-20382. They aligned to the reference genome with inconsistency of 2655 bp (gap in this case).
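If you want to process the short report programmatically, a parser along the following lines may help. It is a minimal sketch that assumes the layout of the sample above (a contig name on its own line, followed by its "Extensive misassembly" lines); the function name and the output structure are our own and are not part of QUAST.

```python
import re

# Matches lines such as:
#   Extensive misassembly ( relocation, inconsistency = 2655 ) between 16800 18907 and 18905 20382
MISASSEMBLY_RE = re.compile(
    r"Extensive misassembly \(\s*(?P<kind>\w+)"
    r"(?:,\s*inconsistency\s*=\s*(?P<inc>-?\d+))?\s*\)"
    r"\s+between\s+(?P<s1>\d+)\s+(?P<e1>\d+)\s+and\s+(?P<s2>\d+)\s+(?P<e2>\d+)"
)


def parse_mis_contigs_info(lines):
    """Return {contig name: [misassembly records]} from .mis_contigs.info lines."""
    result = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = MISASSEMBLY_RE.match(line)
        if match is None:
            # Assumption: any other non-empty line is a contig name.
            current = line
            result[current] = []
        elif current is not None:
            inc = match.group("inc")
            result[current].append({
                "type": match.group("kind"),
                "inconsistency": int(inc) if inc is not None else None,
                # Coordinates on the contig; descending means reverse strand.
                "fragment1": (int(match.group("s1")), int(match.group("e1"))),
                "fragment2": (int(match.group("s2")), int(match.group("e2"))),
            })
    return result


sample = [
    "NODE_601",
    "Extensive misassembly ( inversion ) between 287 575 and 296 1",
    "Extensive misassembly ( relocation, inconsistency = 2655 ) between 16800 18907 and 18905 20382",
]
for contig, events in parse_mis_contigs_info(sample).items():
    print(contig, [event["type"] for event in events])
```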


Let's move to the detailed report. Here you can find information about all misassembled, unaligned and correctly aligned contigs. This report is saved to the <quast_output_dir>/contigs_reports/contigs_report_<assembly_name>.stdout file. E.g. if your assembly is "contigs.fasta" then the file is "contigs_report_contigs.stdout", if your assembly is "ecoli_assembly_1.fasta" then the file is "contigs_report_ecoli_assembly_1.stdout".

To get information about misassemblies, search for the words "Extensive misassembly" in the report and look at the surrounding lines to find the name of the contig that corresponds to this misassembly.

Look at the following example:
CONTIG: NODE_772 (575bp)
Top Length: 296  Top ID: 100.0
    Skipping redundant alignment 1096745 1096882 | 138 1 | 138 138 | 98.55 | Escherichia_coli NODE_772
    This contig is misassembled. 3 total aligns.
        Real Alignment 1: 924846 925134 | 287 575 | 289 289 | 100.0 | Escherichia_coli NODE_772
            Extensive misassembly ( inversion ) between these two alignments
        Real Alignment 2: 924906 925201 | 296 1 | 296 296 | 100.0 | Escherichia_coli NODE_772
In this example, the contig name is NODE_772 and its length is 575 bp. The contig has two real alignments and one misassembly, which is an inversion. QUAST also reports relocations and translocations; see section 3.1.2 for details.

Here is another example:
CONTIG: Contig_753 (140518bp)
Top Length: 121089  Top ID: 99.98
    Skipping redundant alignments after choosing the best set of alignments
    Skipping redundant alignment 273398 273468 | 18977 18907 | 71 71 | 100.0 | Escherichia_coli Contig_753
    ....
    Skipping redundant alignment 3363797 3363867 | 18977 18907 | 71 71 | 100.0 | Escherichia_coli Contig_753
    This contig is misassembled. 14 total aligns.
        Real Alignment 1: 1425621 1426074 | 19431 18978 | 454 454 | 100.0 | Escherichia_coli Contig_753
            Gap between these two alignments (local misassembly). Inconsistency = 148
        Real Alignment 2: 1426295 1426818 | 18905 18382 | 524 524 | 100.0 | Escherichia_coli Contig_753
            Extensive misassembly ( relocation, inconsistency = 2224055 ) between these two alignments
        Real Alignment 3: 3650278 3650348 | 18977 18907 | 71 71 | 100.0 | Escherichia_coli Contig_753
            Extensive misassembly ( relocation, inconsistency = 236807 ) between these two alignments
        Real Alignment 4: 3765544 3886652 | 140518 19430 | 121109 121089 | 99.98 | Escherichia_coli Contig_753
            Extensive misassembly ( relocation, inconsistency = -1052 ) between these two alignments
        Real Alignment 5: 3886649 3905037 | 18381 1 | 18389 18381 | 99.96 | Escherichia_coli Contig_753
This contig is Contig_753 of length 140518 bp. It has 3 extensive misassemblies (all three are relocations) and one local misassembly.
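The counting done by eye in the two examples above can be automated. The following sketch relies only on the line patterns visible in these examples ("CONTIG:", "Real Alignment", "local misassembly", "Extensive misassembly"); other line types in a real report are simply ignored, and the function name is hypothetical.

```python
import re
from collections import Counter

CONTIG_RE = re.compile(r"^CONTIG:\s+(\S+)\s+\((\d+)bp\)")
EXTENSIVE_RE = re.compile(r"Extensive misassembly \(\s*(\w+)")


def summarize_contigs_report(lines):
    """Count real alignments and misassemblies per contig in a .stdout report."""
    summary = {}
    current = None
    for line in lines:
        stripped = line.strip()
        header = CONTIG_RE.match(stripped)
        if header:
            current = {"length": int(header.group(2)), "real_alignments": 0,
                       "local": 0, "extensive": Counter()}
            summary[header.group(1)] = current
        elif current is None:
            continue  # text before the first CONTIG line
        elif stripped.startswith("Real Alignment"):
            current["real_alignments"] += 1
        elif "local misassembly" in stripped:
            current["local"] += 1
        else:
            event = EXTENSIVE_RE.search(stripped)
            if event:
                current["extensive"][event.group(1)] += 1
    return summary


SAMPLE = """CONTIG: NODE_772 (575bp)
Top Length: 296  Top ID: 100.0
    Skipping redundant alignment 1096745 1096882 | 138 1 | 138 138 | 98.55 | Escherichia_coli NODE_772
    This contig is misassembled. 3 total aligns.
        Real Alignment 1: 924846 925134 | 287 575 | 289 289 | 100.0 | Escherichia_coli NODE_772
            Extensive misassembly ( inversion ) between these two alignments
        Real Alignment 2: 924906 925201 | 296 1 | 296 296 | 100.0 | Escherichia_coli NODE_772
""".splitlines()

for name, info in summarize_contigs_report(SAMPLE).items():
    print(name, info["length"], info["real_alignments"], dict(info["extensive"]))
```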


Q4. Could you explain the format of Real Alignments in contigs report files (see the answer for Q3)?

Yes, sure. Let's look at the following example:

    Real Alignment 1: 19796 20513 | 29511 30228 | 718 718 | 100.0 | ENA|U00096|U00096.2_Escherichia_coli contig-710
The first two numbers are the position in the target sequence (reference genome), and the second two are the position in the query sequence (assembled contig). Note that positions on the target are always ascending, while positions on the query can be ascending (forward strand) or descending (reverse-complement strand).

The next two numbers (in this case: 718 718) mean "the number of aligned bases on the target" and "the number of aligned bases on the query". They are usually equal to each other, but they can be slightly different because of short insertions and deletions. Strictly speaking, these numbers are redundant because they can be easily calculated from the first two pairs of numbers (positions on the target and positions on the query). However, sometimes it is convenient to look at these numbers.

The last number (in this case: 100.0) is the Nucmer aligner quality metric. It is called "identity %" (IDY %) and it describes the quality of the alignment (the number of mismatches and indels between the target and the query). If IDY% = 100.0 then the alignment is perfect, i.e. all bases on the target and on the query are equal to each other. If IDY% is less than 100.0 then the target and the query are slightly different. QUAST has a threshold on IDY%, which is 95%. Thus we don't use alignments with IDY% less than 95% (they are considered to be relatively bad).

And finally, the last two columns are the name of the target sequence (i.e. reference genome name) and the name of the query (i.e. contig name).
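The column layout described above can be read with a few lines of Python. This is a sketch under the assumption that columns are separated by " | " exactly as in the example; the tuple and function names are hypothetical. Note that the reference name itself may contain "|" characters, so the line is split at most four times.

```python
from collections import namedtuple

Alignment = namedtuple(
    "Alignment",
    "ref_start ref_end ctg_start ctg_end ref_len ctg_len idy ref_name ctg_name",
)


def parse_real_alignment(line):
    """Parse one 'Real Alignment N: ...' line from a contigs report."""
    body = line.split(":", 1)[1].strip()
    # maxsplit=4 keeps '|' characters inside the reference name intact.
    ref_pos, ctg_pos, lengths, idy, names = body.split(" | ", 4)
    ref_start, ref_end = map(int, ref_pos.split())
    ctg_start, ctg_end = map(int, ctg_pos.split())
    ref_len, ctg_len = map(int, lengths.split())
    # Assumption: the contig name is the last whitespace-separated token.
    ref_name, ctg_name = names.rsplit(None, 1)
    return Alignment(ref_start, ref_end, ctg_start, ctg_end,
                     ref_len, ctg_len, float(idy), ref_name, ctg_name)


def strand(alignment):
    """'+' for ascending query coordinates, '-' for descending ones."""
    return "+" if alignment.ctg_start <= alignment.ctg_end else "-"


line = ("    Real Alignment 1: 19796 20513 | 29511 30228 | 718 718 | 100.0 | "
        "ENA|U00096|U00096.2_Escherichia_coli contig-710")
aln = parse_real_alignment(line)
print(aln.ref_name, aln.ctg_name, strand(aln))
# The aligned lengths can be recomputed from the coordinates:
print(aln.ref_end - aln.ref_start + 1, abs(aln.ctg_end - aln.ctg_start) + 1)
```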


Q5. Where does QUAST save information about SNPs?

There are two output files containing SNP information. Both of them are saved in <quast_output_dir>/contigs_reports/nucmer_output/ directory.
The first one has extension ".all_snps" and it is raw Nucmer aligner output. Its format is:

     [P1]  [SUB] [SUB]  [P2]  [BUFF] [DIST] [R] [Q] [FRM] [TAGS]
     15383   T     G   3339560 1     15383   3   2    1     -1    Escherichia_coli contig_15
Where: P1 is position on the reference genome, first SUB is nucleotide in the reference genome, second SUB is nucleotide in the contig, P2 is position on the contig, BUFF is the distance from this SNP to the nearest mismatch (end of alignment, indel, SNP, etc) in the same alignment, while the [DIST] column specifies the distance from this SNP to the nearest sequence end.
R and Q specify the number of other alignments, which overlap this position (in Reference and Query (i.e. contig) respectively). FRM and TAGS are not documented in Nucmer help message, and the last two columns are reference name and contig name.
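A record of this file can be read as sketched below. The code assumes the 12 whitespace-separated tokens seen in the sample line; the two tokens that the description above leaves undocumented are kept as raw strings rather than interpreted. The function name and dictionary keys are our own.

```python
def parse_all_snps_line(line):
    """Parse one data line of an .all_snps file; return None for other lines."""
    tokens = line.split()
    # Assumption: data lines have exactly 12 tokens and start with a number.
    if len(tokens) != 12 or not tokens[0].isdigit():
        return None
    return {
        "p1": int(tokens[0]),        # position on the reference genome
        "ref_base": tokens[1],       # first SUB column
        "ctg_base": tokens[2],       # second SUB column
        "p2": int(tokens[3]),        # position on the contig
        "buff": int(tokens[4]),      # distance to the nearest mismatch
        "dist": int(tokens[5]),      # distance to the nearest sequence end
        "r": int(tokens[6]),         # overlapping alignments in the reference
        "q": int(tokens[7]),         # overlapping alignments in the contig
        "undocumented": (tokens[8], tokens[9]),  # kept as-is, not interpreted
        "ref_name": tokens[10],
        "ctg_name": tokens[11],
    }


sample_line = ("     15383   T     G   3339560 1     15383   3   2    1     -1"
               "    Escherichia_coli contig_15")
record = parse_all_snps_line(sample_line)
print(record["ref_name"], record["p1"], record["ref_base"], "->", record["ctg_base"])
```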

The second file ("*.used_snps") is generated by QUAST.
We analyse all alignments and filter them by skipping some "uninformative" alignments (redundant, duplicated). After that, we include in the ".used_snps" file only those SNPs that actually appear in the filtered alignments. Thus, the values of "# mismatches per 100 kbp" and "# indels per 100 kbp" reported by QUAST are computed from USED SNPs, not ALL SNPs.
In addition, we use our own format of ".used_snps" file.
  Escherichia_coli  contig_15  728803  C   .  3217983
where the columns are: reference genome name, contig name, position on the reference genome, nucleotide in the reference genome, nucleotide in the contig (in this case it is ".", i.e. an absence of a nucleotide in the contig which means a deletion) and the final column is position on the contig.
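The six columns above can be parsed and classified as sketched below. Only the deletion case ("." in the contig column) is stated in this manual; treating "." in the reference column as an insertion is our assumption by symmetry, and the function names are hypothetical.

```python
from collections import Counter


def parse_used_snps_line(line):
    """Parse one line of a .used_snps file into a dictionary."""
    ref_name, ctg_name, ref_pos, ref_base, ctg_base, ctg_pos = line.split()
    return {"ref_name": ref_name, "ctg_name": ctg_name,
            "ref_pos": int(ref_pos), "ref_base": ref_base,
            "ctg_base": ctg_base, "ctg_pos": int(ctg_pos)}


def classify(record):
    """Classify a record as 'deletion', 'insertion' or 'mismatch'."""
    if record["ctg_base"] == ".":
        return "deletion"    # nucleotide is absent in the contig
    if record["ref_base"] == ".":
        return "insertion"   # assumption: symmetric to the deletion case
    return "mismatch"


lines = ["  Escherichia_coli  contig_15  728803  C   .  3217983"]
counts = Counter(classify(parse_used_snps_line(line)) for line in lines)
print(dict(counts))
```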


Q6. What does "broken" version of an assembly refer to while assessing scaffolds' quality (--scaffolds option)?

The difference between the "broken" and the original assembly (scaffolds) is very simple. QUAST splits the input FASTA at continuous fragments of N's of length ≥ 10 and calls the result a "_broken" assembly. By doing this we try to reconstruct the "contigs" which were used for construction of the scaffolds. After that, the user can compare results for real scaffolds and "reconstructed contigs" and find out whether the scaffolding step was useful or not.
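The splitting step can be illustrated with a short sketch. It is not the QUAST implementation: the function name is hypothetical, and treating lowercase "n" like "N" is our assumption.

```python
import re


def break_scaffold(sequence, min_n_run=10):
    """Split a scaffold at runs of at least min_n_run N's.

    Shorter runs of N's are kept inside the resulting fragments.
    """
    # Assumption: lowercase 'n' is treated the same way as 'N'.
    parts = re.split(r"[Nn]{%d,}" % min_n_run, sequence)
    # Drop empty fragments produced by leading or trailing runs of N's.
    return [part for part in parts if part]


scaffold = "ACGT" + "N" * 10 + "GGCC" + "N" * 3 + "TT"
print(break_scaffold(scaffold))
```

Here the run of ten N's splits the scaffold, while the run of three N's stays inside the second fragment.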

If you have both contigs.fasta and scaffolds.fasta, it is better to pass both files to QUAST and not set the "--scaffolds" option. The comparison of real contigs vs real scaffolds is more honest and informative than scaffolds vs scaffolds_broken.

To sum up, you should use the "--scaffolds" option if you don't have the original file with contigs but want to compare your scaffolds with it.


Q7. Can I add new assemblies to existing QUAST report without need to realign already processed assemblies? Or can I at least rerun existing QUAST report with slightly modified options set?

Yes, sure! You just need to specify the existing QUAST output directory with the -o option. Our tool will reuse the already generated Nucmer alignments and will run the alignment process only for new assemblies. Note that all QUAST options except --min-contig do not affect the Nucmer alignment process, so you can rerun a previous QUAST command with modified options and QUAST will reuse the existing alignments as well.
Hint: if you did not specify the QUAST output directory with the -o option, you can rerun QUAST on the same directory with -o quast_results/latest.


Q8. Which types of structural variations (SV) are handled by QUAST? Can you give examples of correct BEDPE files for --sv-bedpe option?

QUAST can detect and correctly resolve inversions, deletions, and translocations. We also plan to add support for insertions soon.

The BEDPE format is specified in the BEDTools documentation. We process the first seven columns of the file (chrom1, start1, end1, chrom2, start2, end2, name); the rest are optional and are not read by QUAST. Note that columns should be tab-separated!
Chrom1, start1, end1 define the confidence interval around the SV start; chrom2, start2, end2 define the confidence interval around the SV end. Name defines the SV type and should contain the 'INV' substring for inversions or 'DEL' for deletions; translocations are automatically identified if chrom1 is not equal to chrom2.

Example of BEDPE line for inversion on positions 1000-1200 of 'E.coli' chromosome (confidence interval is 11 bp long):

    E.coli 995 1010 E.coli 1195 1205 This_is_INVersion The Rest Columns Are Optional
    
Example of BEDPE line for deletion of fragment between 1000 and 1200 of 'S.aureus' chromosome:
    S.aureus 995 1010 S.aureus 1195 1205 DEL
    
Example of BEDPE line for translocation from position 500 of 'chr1' chromosome to position 100 of 'chr2' chromosome (confidence interval is different for both ends):
    chr1 450 550 chr2 100 100 name_does_not_matter_here
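The rules above can be checked on your own BEDPE lines with a sketch like the following. It is an illustration, not the QUAST parser: the function names are hypothetical, and the order of the checks (chromosomes first, then the name) is our assumption for lines where both rules could apply.

```python
def parse_bedpe_line(line):
    """Read the first seven tab-separated columns of a BEDPE line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 7:
        raise ValueError("expected at least 7 tab-separated columns")
    chrom1, start1, end1, chrom2, start2, end2, name = fields[:7]
    return {"chrom1": chrom1, "start1": int(start1), "end1": int(end1),
            "chrom2": chrom2, "start2": int(start2), "end2": int(end2),
            "name": name}


def sv_type(record):
    """Return 'translocation', 'inversion', 'deletion' or None."""
    if record["chrom1"] != record["chrom2"]:
        return "translocation"   # assumption: checked before the name
    if "INV" in record["name"]:
        return "inversion"
    if "DEL" in record["name"]:
        return "deletion"
    return None


examples = [
    "E.coli\t995\t1010\tE.coli\t1195\t1205\tThis_is_INVersion\tThe\tRest",
    "S.aureus\t995\t1010\tS.aureus\t1195\t1205\tDEL",
    "chr1\t450\t550\tchr2\t100\t100\tname_does_not_matter_here",
]
for line in examples:
    print(sv_type(parse_bedpe_line(line)))
```

A line separated by spaces instead of tabs raises ValueError here, which is a quick way to catch the most common formatting mistake.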
    

Q9. Which versions of web browsers are suitable for Icarus output?

We recommend using Icarus in Chrome (tested with v49.0.x); however, it also works properly in Safari (tested with v8.0.x) and Firefox (tested with v41.0.x and v45.0.x). Most of the functionality works in Internet Explorer 9 and higher, but we do not recommend this browser due to slow animation.