Assembly report for run aquamis_test_data

Overview

  • Run name: aquamis_test_data
  • Number of completed samples: 4
  • Number of pipeline fails: 0
  • Number of QC fails: 0
  • Working directory: /cephfs/abteilung4/Projects_NGS/aquamis_contamination/resources/test_data/aquamis
  • Location of trimmed fastq files: /cephfs/abteilung4/Projects_NGS/aquamis_contamination/resources/test_data/aquamis/trimmed
  • Location of fasta files: /cephfs/abteilung4/Projects_NGS/aquamis_contamination/resources/test_data/aquamis/Assembly/assembly

Pipeline execution fails

Pipeline execution failed for 0 samples.

## [1] "No samples failed."

For more details, have a look at the log files in /cephfs/abteilung4/Projects_NGS/aquamis_contamination/resources/test_data/aquamis/logs.

Short summary table

Detailed assembly table

The table is searchable and sortable. The last columns provide clickable links to the created FASTA file, to the best NCBI reference and to the QUAST and Icarus reports.

Note that the best reference is selected from the set of all complete bacterial chromosome assemblies. Plasmids are therefore excluded, which may result in the genome length of the sample being larger than that of the reference. Furthermore, for bacteria with more than one chromosome, only the best matching chromosome is reported. Thus gene count, reference length etc. are properties of the best matching chromosome.

Detailed trimming table

Taxonomy results

Three most abundant species according to read-based taxonomic classification with kraken2 and abundance estimation with bracken, based on the kraken2 minikraken database.

Three most abundant species according to contig-based taxonomic classification with kraken2 based on the kraken2 minikraken database.

Three most abundant genera according to read-based taxonomic classification with kraken2 and abundance estimation with bracken, based on the kraken2 minikraken database.

Three most abundant genera according to contig-based taxonomic classification with kraken2 based on the kraken2 minikraken database.

Contamination results

Inter- and intra-species contamination is assessed using ConFindr. Sound detection of intra-species contamination is performed using a genus-specific cgMLST approach. A sample is marked as contaminated if more than 1 contaminating SNV per 10000 base pairs examined was found, or if there is cross contamination between genera. Missing values are reported as not determined (ND).

  • Sample: The name of the sample. ConFindr will take everything before the first underscore (_) character to be the name of the sample, as done with samples coming from an Illumina MiSeq.
  • Genus: The genus that ConFindr thinks your sample is. If ConFindr couldn’t figure out what genus your sample is from, this will be NA. If multiple genera were found, they will all be listed here, separated by a :
  • NumContamSNVs: The number of times ConFindr found sites with more than one base present.
  • ContamStatus: The most important of all! Will read True if contamination is present in the sample, and False if contamination is not present. The result will be True if either of the following conditions is met: (1) more than 1 contaminating SNV per 10000 base pairs examined was found; (2) there is cross contamination between genera.
  • PercentContam: Based on the depth of the minor variant for sites with multiple bases, ConFindr guesses at what percent of your reads come from a contaminant. The more sequencing depth you have, the more accurate this will get. For lower levels of contamination (around 5 percent) this tends to get overestimated, but the number gets more accurate as contamination level increases, as well as sequencing depth.
  • PercentContamStandardDeviation: The standard deviation of the percentage contamination estimate. Very high values may indicate something strange is going on.
  • BasesExamined: The number of bases ConFindr examined when making the contamination call. Will usually be around 20 kb for rMLST databases, and will vary when other databases are used.
  • DatabaseDownloadDate: Date that rMLST databases were downloaded, if you have them. As these are curated and updated regularly, it’s a good idea to re-run confindr_database_setup every now and then.
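
The field descriptions above can be turned into a small illustration. The following is a minimal sketch of the sample naming rule and the two ContamStatus conditions exactly as they are worded in this report; it is not ConFindr's own code, and ConFindr's actual implementation may differ. The function names and the example file name are hypothetical.

```python
def sample_name(file_name: str) -> str:
    """Everything before the first underscore, as described for the Sample field."""
    return file_name.split("_", 1)[0]


def is_contaminated(num_contam_snvs: int, bases_examined: int, genus: str) -> bool:
    """Apply the two conditions listed for ContamStatus (assumed reading of the text)."""
    # Multiple genera are listed in the Genus field separated by ':'.
    cross_genus = ":" in genus
    if bases_examined <= 0:
        # No bases examined: only the cross-genus condition can apply.
        return cross_genus
    snvs_per_10kb = num_contam_snvs * 10_000 / bases_examined
    return cross_genus or snvs_per_10kb > 1


# Hypothetical inputs, using roughly 20 kb of examined bases as for rMLST.
print(sample_name("SRR1206159_S1_L001_R1_001.fastq.gz"))
print(is_contaminated(1, 20_000, "Salmonella"))
print(is_contaminated(3, 20_000, "Salmonella"))
print(is_contaminated(0, 20_000, "Escherichia:Salmonella"))
```

With about 20000 bases examined, the SNV threshold in this sketch is crossed from 3 contaminating SNVs onward, since 2 SNVs give exactly 1 per 10000 base pairs and the rule asks for more than 1.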

Plots per run

Coverage depth

Figure 1: Average assembly coverage depth of all mapped reads (per sample).


Fraction of recovered reference genes

Figure 2: Fraction of reference genes fully or partially found in assembly. Note that if a genome consists of more than one chromosome, only the fraction belonging to the largest chromosome is displayed.


Fraction of reads mapped to contigs

Figure 3: Fraction of reads that map back to the assembly. A lower fraction indicates problems with the assembly and/or contamination in a sample.


Insert size distribution

Figure 4: Insert size distribution per sample (violin plot). The mean insert sizes are indicated by red diamonds. The insert size is the same as the fragment size without barcodes. Thus, if the insert size is smaller than two times the read length, the reads overlap.
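
The overlap rule in the caption can be checked with a few lines of arithmetic. This is a minimal sketch, assuming both reads of a pair have the same length; the function name and the example values (150 bp reads) are illustrative and not taken from the pipeline.

```python
def read_overlap(insert_size: int, read_length: int) -> int:
    """Number of fragment bases covered by both reads of a pair (0 if none)."""
    # Two reads of length L cover 2 * L bases; anything beyond the insert
    # size is covered twice. The overlap cannot exceed the insert itself.
    overlap = min(2 * read_length - insert_size, insert_size)
    return max(0, overlap)


for insert in (100, 250, 300, 400):
    print(insert, read_overlap(insert, read_length=150))
```

For 150 bp reads, inserts shorter than 300 bp yield overlapping reads, and inserts shorter than a single read are covered completely by both reads.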



Coverage depth distribution

Figure 6: Coverage depth distribution of each sample (violin plot), in log scale. Each separate bubble indicates the presence of a DNA molecule with a defined coverage depth distribution. Hence, the presence of more than one bubble may be associated with the presence of (high copy number) plasmids.


Plots per sample

Coverage depth distribution

In the following, the coverage depth distribution is shown for each sample (in log scale):

  • SRR1206159
  • SRR1609871
  • SRR2985019
  • SRR498433

Program versions/ log

Parameters

  • All parameter settings can be found in the config file: /cephfs/abteilung4/Projects_NGS/aquamis_contamination/resources/test_data/aquamis/config.yaml.
  • Program versions are available in the conda env files.
##   Created by AQUAMIS:"1.3.1"
##   version           :"v1.3.0-6-g7893989"
##   workdir           :"/cephfs/abteilung4/Projects_NGS/aquamis_contamination/resources/test_data/aquamis"
##   samples           :"/cephfs/abteilung4/Projects_NGS/aquamis_contamination/resources/test_data/samples.tsv"
##   params            :List of 12
##    ..$ threads    :10
##    ..$ docker     :""
##    ..$ run_name   :"aquamis_test_data"
##    ..$ remove_temp:FALSE
##    ..$ fastp      :List of 1
##    .. ..$ length_required:15
##    ..$ confindr   :List of 1
##    .. ..$ database:"/cephfs/abteilung4/Projects_NGS/aquamis_contamination/repo/AQUAMIS/reference_db/confindr"
##    ..$ kraken2    :List of 4
##    .. ..$ db_kraken         :"/cephfs/abteilung4/Projects_NGS/aquamis_contamination/repo/AQUAMIS/reference_db/kraken"
##    .. ..$ read_length       :150
##    .. ..$ taxonomic_qc_level:"G"
##    .. ..$ taxonkit_db       :"/cephfs/abteilung4/Projects_NGS/aquamis_contamination/repo/AQUAMIS/reference_db/taxonkit"
##    ..$ shovill    :List of 7
##    .. ..$ assembler     :"spades"
##    .. ..$ depth         :100
##    .. ..$ tmpdir        :"/tmp/shovill"
##    .. ..$ ram           :16
##    .. ..$ output_options:""
##    .. ..$ extraopts     :""
##    .. ..$ modules       :"--noreadcorr"
##    ..$ mash       :List of 3
##    .. ..$ mash_refdb     :"/cephfs/abteilung4/Projects_NGS/aquamis_contamination/repo/AQUAMIS/reference_db/mash/mashDB.msh"
##    .. ..$ mash_kmersize  :21
##    .. ..$ mash_sketchsize:1000
##    ..$ mlst       :List of 1
##    .. ..$ scheme:""
##    ..$ qc         :List of 1
##    .. ..$ thresholds:"/cephfs/abteilung4/Projects_NGS/aquamis_contamination/repo/AQUAMIS/resources/AQUAMIS_thresholds.json"
##    ..$ json_schema:List of 2
##    .. ..$ validation:"/cephfs/abteilung4/Projects_NGS/aquamis_contamination/repo/AQUAMIS/resources/AQUAMIS_schema_v20210226.json"
##    .. ..$ filter    :"/cephfs/abteilung4/Projects_NGS/aquamis_contamination/repo/AQUAMIS/resources/AQUAMIS_schema_filter_v20210226.json"

Software versions

  • Main programs are fastp, ConFindr, shovill/spades, mash and QUAST.
Software Version
fastp 0.20.1
ConFindr 0.7.4
Kraken 2.1.1
bracken 2.5
taxonkit 0.7.2
shovill 1.1.0
bwa 0.7.17-r1188
flash 1.2.11
java 11.0.8-internal
kmc 3.1.0
lighter 1.1.2
megahit 1.2.9
megahit_toolkit 1.2.9
pigz 2.5
pilon 1.23
samclip 0.4.0
samtools 1.11
seqtk 1.3-r106
skesa 2.4.0
spades 3.14.1
trimmomatic 0.39
velvetg 1.2.10
velveth 1.2.10
mash 2.2.2
QUAST 5.0.2
mlst 2.19.0

Logging

  • snakemake logs in folder /cephfs/abteilung4/Projects_NGS/aquamis_contamination/resources/test_data/aquamis/.snakemake.
  • Log files in folder /cephfs/abteilung4/Projects_NGS/aquamis_contamination/resources/test_data/aquamis/logs.

Help

Column Details
  • Sample Name: Name of sample
  • QC Vote: Recommended quality assessment based on all criteria (PASS or FAIL)
  • QC Fail: Number of fields falling below the fail threshold
  • QC Warn: Number of fields falling below the warning threshold
  • QC N.D.: Number of fields where no threshold could be applied, either because of a missing value or a missing genus/species-specific threshold
  • Reference: Best NCBI complete genome according to mash
  • Reference Accession: Accession number of best NCBI complete genome according to mash
  • Species: Species of best NCBI complete genome according to mash
  • # Reads: Number of reads after trimming
  • Megabases: Number of bases from all trimmed reads, in megabases
  • Q30 Base Fraction: Fraction of bases that have Q30 or higher
  • Coverage Depth: Average depth over all positions and contigs
  • # Contigs: Number of contigs larger than 0 base pairs
  • # Contigs >1000 bp: Number of contigs larger than 1000 base pairs
  • N50: N50 value in base pairs (indicator of average contig size and assembly quality)
  • Read Fraction Majority Taxon: Fraction of reads assigned to the most abundant taxon at the chosen rank, e.g. species or genus
  • Contig Fraction Majority Taxon: Fraction of contigs assigned to the most abundant taxon at the chosen rank, e.g. species or genus
  • Contam. Status: Contamination status
  • Contam. # SNVs: Number of contaminating single nucleotide variants
  • Single-Copy Orthologs: Fraction of universal single-copy orthologs that were found. Values below 1 indicate incompleteness
  • Duplicated Orthologs: Fraction of universal single-copy orthologs that were found in duplicate. Non-zero values indicate contamination
  • MLST Loci w/ Multiple Alleles: MLST loci with multiple alleles, an indicator of intra-species contamination
  • MLST Loci Missing: Missing MLST loci; compare to ST in scheme to confirm a true miss
  • MLST Schema: MLST schema, determined automatically (default) or chosen by user
  • MLST ST: MLST sequence type of associated MLST schema
  • # Full Genes: Number of reference genes found
  • # Partial Genes: Number of reference genes partially found
  • Fraction Genes Recovered: Fraction of genes found compared to all reference genes, includes partial matches
  • Reference Coverage: Genome coverage compared to reference
  • Duplication Ratio: Total number of aligned bases in the assembly divided by the total number of aligned bases in the reference genome
  • GC: Total number of G and C nucleotides in the assembly, divided by the total length of the assembly
  • Total Length: Sum of the lengths of all contigs
  • Reference Length: Length of the reference genome
  • Reference Similarity: Similarity to reference according to mash. The value describes the fraction of shared k-mers
  • Fraction Mapped Reads: Fraction of reads that map to contigs. Values below 1 indicate assembly issues
  • Insert Size: Calculated fragment length (without barcodes), i.e. the length of both reads plus the unsequenced distance between them
  • Trimming Details: fastp report with details related to read trimming and read QC
  • FASTA: Link to assembly file
  • NCBI: Link to reference NCBI entry
  • Ikarus: Link to Icarus report on structural comparison to reference
  • QUAST: Link to assembly quality report
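
Some of the columns above are simple functions of the contigs and can be recomputed by hand. The following is a minimal sketch of N50, GC and Fraction Genes Recovered following the definitions given in this help table; the function names and inputs are hypothetical, and the pipeline itself takes these values from its own tools rather than from code like this.

```python
def n50(contig_lengths):
    """Length of the contig at which the length-sorted contigs reach half the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def gc_fraction(contigs):
    """G and C nucleotides divided by the total assembly length."""
    total = sum(len(seq) for seq in contigs)
    if total == 0:
        return 0.0
    gc = sum(seq.upper().count("G") + seq.upper().count("C") for seq in contigs)
    return gc / total


def fraction_genes_recovered(full, partial, reference_genes):
    """Fully plus partially found genes, relative to all reference genes."""
    return (full + partial) / reference_genes if reference_genes else 0.0


print(n50([100, 200, 300, 400]))
print(gc_fraction(["GGCC", "ATAT"]))
print(fraction_genes_recovered(90, 5, 100))
```

In the first call the contigs sum to 1000 bp, and the two longest contigs (400 and 300 bp) are needed to reach half of that, so the N50 is 300.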