AQUAMIS - Assembly-based QUAlity assessment for Microbial Isolate Sequencing

Description

AQUAMIS is a pipeline for routine assembly and quality assessment of microbial isolate sequencing experiments.
It is based on Snakemake and includes, among others, the following tools: fastp, ConFindr, Kraken2, shovill, mlst, mash, QUAST and BUSCO.

AQUAMIS reads untrimmed raw data from your Illumina sequencing experiments as paired .fastq.gz files.
These are then trimmed, assembled and polished.
Besides generating ready-to-use contigs, AQUAMIS selects the closest reference genome from NCBI RefSeq and produces an intuitive, detailed report on your data and assemblies so that you can evaluate their reliability for further analyses.
The assessment relies on reference-based and reference-free measures such as coverage depth, gene content, genome completeness and contamination, assembly length and many more.
Based on experience from thousands of sequencing experiments, threshold sets for different species have been defined to flag potentially poor results.

Website

The AQUAMIS project website is https://gitlab.com/bfr_bioinformatics/AQUAMIS

There, you can find the latest version, source code and documentation.

Installation

You can install AQUAMIS from the bioconda package, from the Docker container, or by cloning this repository and installing all dependencies with conda.
AQUAMIS relies on the conda package manager for all dependencies.
Please set up conda on your system as explained here.
We advise using mamba instead of conda to resolve the software requirements (install it first with conda install mamba).

Path Placeholders in this manual

Placeholder Path
<path_to_conda> is the conda installation folder; type conda info --base to retrieve its absolute path, typically ~/anaconda3 or ~/miniconda3
<path_to_envs> is the folder that holds your conda environments, typically <path_to_conda>/envs
<path_to_installation> is the parent folder of the AQUAMIS repository
<path_to_aquamis> is the base folder of the AQUAMIS repository, i.e. <path_to_installation>/AQUAMIS
<path_to_databases> is the parent folder of your databases; by default, AQUAMIS uses <path_to_aquamis>/reference_db, but you are free to choose a custom location
<path_to_data> is the working directory for an AQUAMIS analysis, typically containing a subfolder <path_to_data>/raw with your fastq read files
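
If you prefer to copy and paste the commands in this manual, the placeholders can be kept as shell variables. The following is a minimal sketch; the variable names simply mirror the placeholders, and the values for the installation and data folders (as well as the ~/miniconda3 fallback) are assumptions that you should adapt to your system.

```shell
# Hypothetical variable names mirroring the placeholders in this manual.
# Falls back to ~/miniconda3 when conda is not on the PATH (assumption).
path_to_conda="$(conda info --base 2>/dev/null || echo "$HOME/miniconda3")"
path_to_envs="$path_to_conda/envs"
path_to_installation="$HOME/tools"                  # assumption: choose your own parent folder
path_to_aquamis="$path_to_installation/AQUAMIS"
path_to_databases="$path_to_aquamis/reference_db"   # default database location named above
path_to_data="$HOME/aquamis_run"                    # assumption: your analysis folder

echo "AQUAMIS repository: $path_to_aquamis"
echo "Databases:          $path_to_databases"
```

With these set, a command such as cd <path_to_installation> becomes cd "$path_to_installation".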

From source

To install the latest stable version of AQUAMIS, please clone the git repository on your system.

cd <path_to_installation>
git clone https://gitlab.com/bfr_bioinformatics/AQUAMIS.git

Next, please execute the script:

<path_to_aquamis>/scripts/aquamis_setup.sh

to install the conda dependency manager mamba, create the conda environment aquamis and install external databases within the default folder <path_to_aquamis>/reference_db.

Manual Conda Environment Setup

Alternatively, initialize a conda base environment containing snakemake and mamba (mamba resolves dependencies faster), then run:

mamba env create -c conda-forge -f <path_to_aquamis>/envs/aquamis.yaml

This creates an environment named aquamis containing all dependencies.
It is found under <path_to_conda>/envs/aquamis.

For custom database paths, please see the chapter Database setup.

From bioconda

mamba create -n aquamis -c bioconda aquamis   # coming soon

From docker

Prerequisite:
Install the Docker engine for your favourite operating system, e.g. Ubuntu Linux.

Download the latest version of AQUAMIS from Docker Hub and note down the Docker Image ID on your system (hereafter referred to as $docker_image_id) with the shell commands:

docker pull bfrbioinformatics/aquamis:latest
docker image list | grep "aquamis" | grep "latest" | awk '{ print $3 }'

To process data and write results, Docker needs a volume mapping from a host directory containing your sequence data (<path_to_data>) to the Docker container (/AQUAMIS/analysis).
Your sample list (samples.tsv) needs to be located within <path_to_data> and contain relative paths to your NGS reads in the same or another child directory.
You may generate a Docker-compatible sample list in your host directory (<path_to_data>/samples.tsv) by executing create_sampleSheet.sh from the container with the following terminal commands:

host:<path_to_data>$ ls raw/
sample1_R1.fastq   sample1_R2.fastq   sample2_R1.fastq   sample2_R2.fastq
docker run --rm \
  -v <path_to_data>:/AQUAMIS/analysis \
  -e HOST_PATH=<path_to_data> \
  -e LOCAL_USER_ID=$(id -u $USER) \
  --entrypoint bash $docker_image_id \
  /AQUAMIS/scripts/create_sampleSheet.sh --mode ncbi \
  --fastxDir /AQUAMIS/analysis/raw \
  --outDir /AQUAMIS/analysis
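
Before generating the sample list, it can help to confirm that every read file has its mate. The following is a small sketch under the assumption that your files follow the _R1/_R2 naming shown in the listing above; check_pairs is a hypothetical helper, not part of AQUAMIS, and other naming schemes are not handled.

```shell
# check_pairs DIR: succeed only if every *_R1.fastq* / *_R2.fastq* file in DIR has its mate.
# The _R1/_R2 naming is taken from the listing above (assumption for other data sets).
check_pairs() {
  cp_dir="$1"
  cp_missing=0
  for cp_file in "$cp_dir"/*_R[12].fastq*; do
    [ -e "$cp_file" ] || continue                 # glob matched nothing
    cp_base="${cp_file##*/}"                      # file name without the folder
    case "$cp_base" in
      *_R1.fastq*)
        cp_stem="${cp_base%%_R1.fastq*}"          # sample name in front of _R1
        cp_ext="${cp_base#"${cp_stem}_R1"}"       # .fastq or .fastq.gz
        cp_mate="${cp_stem}_R2${cp_ext}" ;;
      *)
        cp_stem="${cp_base%%_R2.fastq*}"
        cp_ext="${cp_base#"${cp_stem}_R2"}"
        cp_mate="${cp_stem}_R1${cp_ext}" ;;
    esac
    if [ ! -e "$cp_dir/$cp_mate" ]; then
      echo "missing mate: $cp_mate" >&2
      cp_missing=1
    fi
  done
  return "$cp_missing"
}
```

For example, check_pairs <path_to_data>/raw returns a non-zero exit status and names the missing file on stderr if a mate is absent.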

With the following command, AQUAMIS is started within the Docker container and will process any options appended:

docker run --rm \
  -v <path_to_data>:/AQUAMIS/analysis \
  -e HOST_PATH=<path_to_data> \
  -e LOCAL_USER_ID=$(id -u $USER) \
  $docker_image_id \
  --condaprefix /opt/conda/envs \
  --sample_list /AQUAMIS/analysis/samples.tsv \
  --working_directory /AQUAMIS/analysis \
  --<any_other_AQUAMIS_options>

Note: The container path /AQUAMIS/analysis is fixed and may not be altered.
Any subdirectories of <path_to_data> will be available as subdirectories under /AQUAMIS/analysis/.
Our container is able to write results with the Linux user and group ID of your choice (UID and GID, respectively) to blend into your host file permission setup.
With the above option -e LOCAL_USER_ID=$(id -u $USER), the UID of the currently executing user is inherited; change it according to your needs.
The absolute host path mapped to the container has to be provided as the environment variable $HOST_PATH, too.
It is used for correcting file paths in the result JSON files of each sample to match the host perspective.
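
To illustrate what this path correction means, the sketch below rewrites a container-side path to the host perspective by swapping the fixed prefix /AQUAMIS/analysis for the host directory. It is an illustration only and not the pipeline's actual code; to_host_path and the host folder /data/run01 are made up for this example.

```shell
# Sketch only, not the pipeline's actual implementation.
CONTAINER_ROOT="/AQUAMIS/analysis"     # fixed mount point inside the container

# to_host_path PATH HOST_PATH: print PATH with the container prefix replaced by HOST_PATH.
to_host_path() {
  case "$1" in
    "$CONTAINER_ROOT")   printf '%s\n' "$2" ;;
    "$CONTAINER_ROOT"/*) printf '%s\n' "$2${1#"$CONTAINER_ROOT"}" ;;
    *)                   printf '%s\n' "$1" ;;   # outside the mount: left unchanged
  esac
}

# /data/run01 is a hypothetical host folder used as HOST_PATH
to_host_path /AQUAMIS/analysis/raw/sample1_R1.fastq /data/run01
```

The call prints /data/run01/raw/sample1_R1.fastq, i.e. the location of the same file as seen from the host.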

Usage

Execution

To run AQUAMIS, activate the conda environment aquamis and call the wrapper script:

conda activate aquamis
python3 aquamis.py --help
usage: aquamis.py [-h] -l SAMPLE_LIST -d WORKING_DIRECTORY [-s SNAKEFILE]
                  [-m MASHDB] [--mash_kmersize MASH_KMERSIZE]
                  [--mash_sketchsize MASH_SKETCHSIZE] [--kraken2db KRAKEN2DB]
                  [--read_length READ_LENGTH]
                  [--min_trimmed_length MIN_TRIMMED_LENGTH]
                  [--assembler ASSEMBLER]
                  [--shovill_output_options SHOVILL_OUTPUT_OPTIONS]
                  [--shovill_extraopts SHOVILL_EXTRAOPTS]
                  [--shovill_modules SHOVILL_MODULES] [-t THREADS]
                  [--threads_sample THREADS_SAMPLE] [-c CONDAPREFIX] [-n]
                  [--forceall] [-f FORCE] [--fix_fails] [--unlock]
                  [--no_assembly]

optional arguments:
  -h, --help            show this help message and exit
  -l SAMPLE_LIST, --sample_list SAMPLE_LIST
                        List of samples to assemble, format as defined by ...
  -d WORKING_DIRECTORY, --working_directory WORKING_DIRECTORY
                        Working directory
  -s SNAKEFILE, --snakefile SNAKEFILE
                        Path to Snakefile of bakcharak pipeline, default is
                        path to Snakefile in same directory
  -m MASHDB, --mashdb MASHDB
                        Path to reference mash database
  --mash_kmersize MASH_KMERSIZE
                        kmer size for mash, must match size of database,
                        default 21
  --mash_sketchsize MASH_SKETCHSIZE
                        sketch size for mash, must match size of database,
                        default 1000
  --kraken2db KRAKEN2DB
                        Path to kraken2 database
  --read_length READ_LENGTH
                        Read length to be used in braken abundane estimation,
                        default 150
  --min_trimmed_length MIN_TRIMMED_LENGTH
                        Minimum length of a read to keep, default = 15
  --assembler ASSEMBLER
                        Assembler to use in shovill, choose from megahit
                        velvet skesa spades (default: spades)
  --shovill_output_options SHOVILL_OUTPUT_OPTIONS
                        Extra output options for shovill (default: "")
  --shovill_extraopts SHOVILL_EXTRAOPTS
                        Extra options for shovill (default: "")
  --shovill_modules SHOVILL_MODULES
                        Module options for shovill, choose from --noreadcorr
                        --trim --nostitch --nocorr --noreadcorr (default: "--
                        noreadcorr")
  -t THREADS, --threads THREADS
                        Number of Threads to use. Ideally multiple of 10,
                        default = 10
  --threads_sample THREADS_SAMPLE
                        Number of Threads to use per sample, default = 1
  -c CONDAPREFIX, --condaprefix CONDAPREFIX
                        Path of default conda environment, enables recycling
                        built environments. Must not be empty.
  -n, --dryrun          Snakemake dryrun. Only calculate graph without
                        executing anything
  --forceall            Snakemake force. Force recalculation of all steps
  -f FORCE, --force FORCE
                        Snakemake force. Force recalculation of output (rule
                        or file) speciefied here
  --fix_fails           Re-run snakemake after failure removing failed samples
  --unlock              Unlock a snakemake execution folder if it had been
                        interrupted
  --no_assembly         Only trimming and kraken analysis

For example:

<path_to_aquamis>/aquamis.py -l <path_to_data>/samples.tsv -s <path_to_aquamis>/Snakefile -c <path_to_envs> -m <path_to_databases>/mash/mash_db.msh -d <path_to_data>

You can also run snakemake directly:

snakemake -p --conda-prefix <path_to_envs> --keep-going --configfile <path_to_data>/config.yaml --snakefile <path_to_aquamis>/Snakefile --use-conda

Configuration

AQUAMIS is built to be used routinely.
To ensure maximum comparability of results, a default config.yaml file is generated when calling the aquamis.py wrapper script.
The wrapper itself only allows configuring basic functionality.
The config.yaml can be initialized by starting AQUAMIS with the dry-run flag -n.
Then, you can alter it to configure AQUAMIS in more detail.

Results

AQUAMIS provides an interactive, browser-based report that shows the most important measures of your data at first glance.
All tables in the report can be sorted and filtered.
The Short Summary Table shows the key values for a quick estimate of the success of your sequencing experiment and the assembly.
The Detailed Assembly Table provides many additional measures.
In addition to the tables, many measures are provided as graphical feedback.
Plots per Run summarize one complete sequencing experiment, whereas Plots per Sample show measures for one specific dataset.

JSON output

In addition, all results are stored in JSON format in the subfolders /json/qc and /json/full of your current working directory <path_to_data>.
The content of /json/qc files is a subset of /json/full and combines trimming, contamination assessment and read-based taxonomic classification results prior to the assembly stage.
It represents the final digest when assembly is omitted by enforcing the Snakemake rule all_trimming_only.
Each JSON file is named after its corresponding sample and has the following high-level structure:

.
├── sample/
│   ├── analysis
│   ├── summary
│   └── qc_assessment
└── pipelines/
    ├── fastp
    ├── confindr
    ├── kraken2/
    │   ├── read_based
    │   └── contig_based
    ├── shovill
    ├── samstats
    ├── mlst
    ├── mash
    ├── quast
    ├── busco
    └── aquamis

The node...

For easy data mining of multiple sample JSON files in R, please follow the methods used in the chunks "Import Sample JSONs and Deserialize" and "read_data" of <path_to_aquamis>/scripts/write_report.Rmd, which use the R packages jsonlite, rrapply and purrr.
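
Because each JSON file is named after its sample, a quick way to see which samples have results is to list the file names. The sketch below assumes that the files carry a .json extension (not stated above); list_samples is a hypothetical helper, not part of AQUAMIS.

```shell
# list_samples DIR: print one sample name per JSON file found in DIR.
# Assumes files are named <sample>.json (the extension is an assumption).
list_samples() {
  for ls_file in "$1"/*.json; do
    [ -e "$ls_file" ] || continue     # folder empty or missing: print nothing
    basename "$ls_file" .json
  done
}
```

For example, list_samples <path_to_data>/json/full prints the samples with a full result set, and comparing its output with that of list_samples <path_to_data>/json/qc shows which samples stopped before the assembly stage.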

Database Setup

ConFindr database

The ConFindr installation already provides databases for Listeria, Salmonella and E. coli.
Additional databases for Campylobacter, Bacillus, Brucella and Staphylococcus can be downloaded as follows:

cd <path_to_databases>   # free to choose
wget --output-document confindr_db.tar.gz https://seafile.bfr.berlin/f/ede87ec860624a0cb406/?dl=1
tar -xzvf confindr_db.tar.gz -C <path_to_databases>

Specify the path <path_to_databases>/confindr in the --confindr_database flag.

You may also consider using the species-agnostic rMLST database described here.

Kraken2 and bracken database

We propose using the latest minikraken2 database and its associated Bracken database; see here or here for details.
Alternatively, you can download a legacy version:

cd <path_to_databases>   # free to choose
wget --output-document minikraken2.tgz https://seafile.bfr.berlin/f/8ca1b4d2c97341498698/?dl=1
tar -zxvf minikraken2.tgz

Specify the path <path_to_databases>/minikraken2 in the --kraken2db flag.

Taxonomy database

cd <path_to_databases> && mkdir <path_to_databases>/taxonkit   # free to choose
wget --output-document taxdump.tar.gz https://seafile.bfr.berlin/f/1d51700ecfd241e4a6d4/?dl=1  # 54 MB; alternatively use ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -xzvf taxdump.tar.gz -C <path_to_databases>/taxonkit/

Specify the path <path_to_databases>/taxonkit in the --taxonkit_db flag.

mash database

cd <path_to_databases> && mkdir <path_to_databases>/mash  # free to choose
wget --output-document mashDB.tar.gz https://seafile.bfr.berlin/f/41f804a1eba541788530/?dl=1
tar -xzvf mashDB.tar.gz -C <path_to_databases>/mash/

Specify the path <path_to_databases>/mash in the --mashdb flag.

Quast module: BUSCO

cd <path_to_envs>/aquamis/lib/python3.7/site-packages/quast_libs/busco/   # exact path depends on conda installation
wget --output-document bacteria.tar.gz https://seafile.bfr.berlin/f/41cf8fdcfe2043d2800e/?dl=1
tar -xzvf bacteria.tar.gz

To detect the path of your Quast environment and associated Python library path, you may type:

find <path_to_envs>/aquamis -name quast

Quast module: Augustus

Augustus is an additional dependency of Quast v5 that should be downloaded and installed automatically.
In case there is a network issue, please install it manually by typing:

cd <path_to_envs>/aquamis/lib/python3.7/site-packages/quast_libs   # exact path depends on conda installation
wget -O augustus.tar.gz https://seafile.bfr.berlin/f/64cc5034fad74f50a2f0/?dl=1
tar -xzvf augustus.tar.gz

Test data

Test data can be downloaded as a tarball and unpacked into your working directory:

wget --output-document test_data.tar.gz https://seafile.bfr.berlin/f/b8b636dbe6bd4b39801c/?dl=1
tar -xzvf test_data.tar.gz -C <path_to_data>/
cd <path_to_data>
<path_to_aquamis>/scripts/create_sampleSheet.sh --help

Contact

Please consult the AQUAMIS project website for questions.

If this does not help, please feel free to consult: