This report was generated using the results located at /glittertind/home/carl/asscom2/tests/strachan_campylo/results_ac2 using the installation at /glittertind/home/carl/asscom2.

Samples

Table 1: Overview of the samples analysed in this batch. Because mashtree has run, the samples are arranged by the order of the mashtree output.

Report sections

Here is an overview of the number of result files found for each analysis. A report section is only rendered if relevant result files are present for that analysis. Each section can be triggered to run by calling assemblycomparator2 with a trailing --until <section>.

Table 2: Overview of sections that are rendered in this report. “n / expected” shows the number of analysis files found versus how many are expected to be present. Sections are only rendered if relevant files exist. Analyses that perform comparisons between samples generally output only one set of results, independent of the number of input files.


Assembly statistics

rule assembly_stats

Table 3: Assembly statistics are provided by assembly-stats. N50 is the length of the shortest contig that, together with all longer contigs, covers at least half of the total assembly length.
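The N50 definition above can be illustrated with a minimal sketch. The function name and the contig lengths are made up for illustration; this is not the code used by assembly-stats.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that pushes the running sum to at least half
        # of the total length is the N50 contig.
        if running >= half_total:
            return length
    return 0


# Hypothetical contig lengths, not taken from this batch.
example_lengths = [100, 80, 50, 30, 20, 10]
print(n50(example_lengths))
```

Sorting from longest to shortest means the loop stops at the shortest contig still needed to reach half of the total length.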


Contig sizes and GC-content

rule sequence_lengths

Fig. 1: Visualization of the length of each fasta record for each sample. The colors show the mean GC content for each record (contig).
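As a rough sketch of the two quantities plotted above, the length and GC content of each fasta record can be computed as follows. The helper names and the toy fasta text are hypothetical, and excluding non-ACGT characters from the denominator is an assumption; the pipeline may handle ambiguous bases differently.

```python
def parse_fasta(text):
    """Parse fasta-formatted text into a dict of record name -> sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def gc_content(sequence):
    """Fraction of G and C among the A/C/G/T bases (assumption: N etc. ignored)."""
    seq = sequence.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


# Toy input, not taken from this batch.
example = ">contig_1\nATGC\nGGCC\n>contig_2\nATATAT\n"
for record_name, seq in parse_fasta(example).items():
    print(record_name, len(seq), round(gc_content(seq), 2))
```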


BUSCO

rule busco

Table 4: BUSCO results. “BUSCO estimates the completeness and redundancy of processed genomic data based on universal single-copy orthologs.” The following columns are printed as percentages [%]: C: Complete, S: Complete and single-copy, D: Complete and duplicated, F: Fragmented, M: Missing. n is the total number of BUSCO groups searched. For each sample, only the best lineage match (in terms of completeness) is shown.

Fig. 2: BUSCO results visualized. Legend: S: Complete and single-copy; D: Complete and duplicated; F: Fragmented; M: Missing. For each sample, only the best lineage match (in terms of completeness) is shown.


CheckM2

rule checkm2

Table 5: CheckM2 results.


Kraken2

rule kraken2

Table 6: Kraken2 results. For each sample, only the best hit is shown. Taxonomic identification is provided by Kraken 2. The percentages indicate the proportion of fragments that are covered by the respective clade.


GTDB taxonomical classification

rule gtdbtk

GTDB uses several public repositories with reference sequences and assigns the most likely name by measuring the average nucleotide identity (ANI) and relative evolutionary divergence (RED).

Table 7: Species classification provided by the GTDB-tk classify_wf workflow.


MLST

rule mlst

Table 8: MLST (Multi-Locus Sequence Typing) results. Called with mlst, which incorporates components of the PubMLST database.

How to customize the mlst-analysis

Mlst automatically detects the best typing scheme, one sample at a time. If you disagree with the automatic detection, you can enforce a single scheme across all samples by (re)running assemblycomparator2 with the added command-line arguments: --config mlst_scheme=hpylori --forcerun mlst. Replace hpylori with the mlst scheme you wish to use. A full list of available schemes is in “results_ac2/mlst/mlst_schemes.txt”.
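If the rerun is scripted, the command line described above could be assembled along these lines. This is only a sketch: the helper name is hypothetical, the arguments are the ones quoted above, and the scheme name is not checked against mlst_schemes.txt because that file's layout is not described here.

```python
import shlex


def mlst_rerun_command(scheme):
    """Build the argument list for rerunning the mlst rule with a fixed scheme."""
    if not scheme or any(ch.isspace() for ch in scheme):
        raise ValueError("scheme must be a single non-empty token")
    return [
        "assemblycomparator2",
        "--config", f"mlst_scheme={scheme}",
        "--forcerun", "mlst",
    ]


# "hpylori" is the placeholder scheme used in the text above.
print(shlex.join(mlst_rerun_command("hpylori")))
```

The list form can be passed directly to subprocess.run if the rerun should be started from Python.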


Antimicrobial Resistance

rule abricate

Using Abricate, the assemblies are scanned for known resistance genes, plasmid replicons and virulence factors in the ncbi, card, plasmidfinder and vfdb databases.

NCBI AMRFinder

Table 9: NCBI resistance gene calls, made with NCBI AMRFinder.


VFDB

Table 10: Table of VFDB virulence factor calls: “An integrated and comprehensive online resource for curating information about virulence factors of bacterial pathogens”.


Genomic annotation

rule prokka

Table 11: Overview of the number of different gene types. Called using the Prokka genome annotator.


KEGG pathway enrichment analysis

rule kegg_pathway

For each genome, the amino-acid sequences called by Prokka (via Prodigal) are searched against the UniRef100-KO database. This is the same database that CheckM2 uses. For this analysis, the alignment criteria are stricter (>=85% coverage and >=50% identity). Using clusterProfiler's “enricher” function, Benjamini-Hochberg adjusted p-values for the pathway enrichment of the called genes are computed.
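To show what the Benjamini-Hochberg adjustment does to a set of raw p-values, here is a minimal pure-Python sketch. It is not clusterProfiler's implementation, and the function name and input p-values are made up for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, scaling each by m / rank and
    # enforcing that adjusted values never increase as the rank decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four pathways.
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```

A pathway would then be reported as enriched when its adjusted p-value falls below the chosen significance cutoff.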

Fig. 3: Summary of the KEGG-ortholog based pathway enrichment analysis results. The KEGG pathway hierarchy consists of a number of pathway classes that are listed on the vertical axis. n denotes the number of pathways from that class that are significantly enriched in each sample.

Table 12: Results from the KEGG-ortholog based pathway enrichment analysis produced with clusterProfiler::enricher. Only significant results are shown. The KOs can be entered directly into KEGG mapper search by setting mode to “Reference”.


Pan and Core genome

rule roary

Roary, the pan genome pipeline, computes the number of orthologous genes in a number of core/pan spectrum partitions.

The core genome denotes the genes which are conserved between all samples (intersection), whereas the pan genome is the union of all genes across all samples.
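The intersection and union described above map directly onto set operations. The sketch below uses a hypothetical presence table with made-up sample and gene names; it illustrates the definition only and does not reproduce Roary's clustering of orthologs.

```python
def core_and_pan(gene_sets):
    """Return (core, pan) gene sets from a dict of sample -> set of genes."""
    sets = list(gene_sets.values())
    if not sets:
        return set(), set()
    core = set.intersection(*sets)  # genes present in every sample
    pan = set.union(*sets)          # genes present in at least one sample
    return core, pan


# Hypothetical gene presence per sample.
presence = {
    "sample_A": {"dnaA", "gyrB", "recA", "flaA"},
    "sample_B": {"dnaA", "gyrB", "recA"},
    "sample_C": {"dnaA", "gyrB", "cdtB"},
}
core, pan = core_and_pan(presence)
print(sorted(core))
print(sorted(pan))
```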

Table 13: Distribution of genes in different core/pan spectrum partitions.

Fig. 4: Genes shared between samples. Each vertical line represents a gene, and all lines have the same width regardless of the size of the gene. The genes are colored by the number of samples sharing them.


SNP distances

rule snp_dists

Counts the number of differences between any pair of samples on the core genome alignment produced by Roary. SNP distances do not approximate the evolutionary distance, as they are not adjusted for, among other things, the different probabilities of transitions and transversions. Rather, they give a ballpark indication of the difference between the samples. Note that the SNP distances are highly sensitive to the core/pan genome size ratio.
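The counting described above can be sketched as follows for a toy alignment. The function names and sequences are made up, and counting only positions where both samples carry A, C, G or T (so gaps and ambiguous bases are skipped) is an assumption of this sketch rather than a statement about how the rule is configured.

```python
from itertools import combinations


def snp_distance(seq_a, seq_b):
    """Count aligned positions where two sequences differ (A/C/G/T only)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    valid = set("ACGT")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a in valid and b in valid
    )


def pairwise_snp_distances(alignment):
    """Return a dict of (sample_1, sample_2) -> SNP distance for all pairs."""
    return {
        (n1, n2): snp_distance(alignment[n1], alignment[n2])
        for n1, n2 in combinations(sorted(alignment), 2)
    }


# Toy core genome alignment, not taken from this batch.
alignment = {"s1": "ACGTACGT", "s2": "ACGTTCGT", "s3": "AC-TTCGA"}
for pair, distance in pairwise_snp_distances(alignment).items():
    print(pair, distance)
```

Because only aligned core positions are compared, a smaller core genome leaves fewer positions at which differences can be counted.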

Table 14: Pairwise SNP distances between all samples.

Fig. 5: Pairwise SNP distances between all samples. The color indicates the relative distance for the pair when considering the index positions in a phylogenetic tree of the samples, which is produced with mashtree. The index positions in a phylogenetic tree can be haphazard, but will always correlate with kinship.


Mashtree phylogeny

rule mashtree

Mashtree computes an approximation of ANI using the minhash distance measure. On these distances, a phylogenetic tree is then created using the neighbor-joining algorithm. The plotted tree is not rooted.
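The relation between shared k-mers and the distance used for the tree can be sketched with the Mash distance formula, D = -ln(2j / (1 + j)) / k, where j is the Jaccard index of the two k-mer sets. The sketch below is a simplification: it computes j exactly from all k-mers instead of estimating it from minhash sketches, it ignores reverse complements, and the function names, toy sequences and choice of k are illustrative only.

```python
import math


def kmers(sequence, k):
    """Return the set of all k-mers in a sequence."""
    seq = sequence.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(set_a, set_b):
    """Jaccard index: shared k-mers divided by all distinct k-mers."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def mash_distance(j, k):
    """Mash distance from a Jaccard index j and k-mer length k, clamped to [0, 1]."""
    if j <= 0:
        return 1.0
    return min(1.0, max(0.0, -math.log(2 * j / (1 + j)) / k))


# Toy sequences; real genomes and a longer k would be used in practice.
k = 4
a = kmers("ACGTACGTTGCA", k)
b = kmers("ACGTACGATGCA", k)
j = jaccard(a, b)
print(round(j, 3), round(mash_distance(j, k), 3))
```

Identical k-mer sets give a distance of 0, and the distance grows as fewer k-mers are shared, which is why it can stand in for 1-ANI.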

Fig. 6: Approximation of a phylogenetic tree calculated with mashtree. The horizontal axis is equivalent to 1-ANI.


assemblycomparator2 v2.5.4 genomes to report pipeline. Copyright (C) 2019-2023 Carl M. Kobel GNU GPL v3