AQUAMIS is a pipeline for routine assembly and quality assessment of microbial isolate sequencing experiments.
It is based on snakemake and includes the following tools:
It will read untrimmed raw data from your Illumina sequencing experiments as paired .fastq.gz-files.
These are then trimmed, assembled and polished.
Besides generating ready-to-use contigs, AQUAMIS will select the closest reference genome from NCBI RefSeq and produce an intuitive, detailed report on your data and assemblies to evaluate their reliability for further analyses.
It relies on reference-based and reference-free measures such as coverage depth, gene content, genome completeness and contamination, assembly length and many more.
Based on the experience from thousands of sequencing experiments, threshold sets for different species have been defined to detect potentially poor results.
The AQUAMIS project website is https://gitlab.com/bfr_bioinformatics/AQUAMIS
There, you can find the latest version, source code and documentation.
You can install AQUAMIS by installing the bioconda package, by installing the docker container or by cloning this repository and installing all dependencies with conda.
AQUAMIS relies on the conda package manager for all dependencies.
Please set up conda on your system as explained here.
It is advised to use mamba instead of conda for resolving all software requirements (install it via conda install mamba first).
Placeholder | Path |
---|---|
<path_to_conda> | is the conda installation folder, type conda info --base to retrieve its absolute path, typically ~/anaconda3 or ~/miniconda3 |
<path_to_envs> | is the folder that holds your conda environments, typically <path_to_conda>/envs |
<path_to_installation> | is the parent folder of the AQUAMIS repository |
<path_to_aquamis> | is the base folder of the AQUAMIS repository, i.e. <path_to_installation>/AQUAMIS |
<path_to_databases> | is the parent folder of your databases; by default, AQUAMIS uses <path_to_aquamis>/reference_db, but you are free to choose a custom location |
<path_to_data> | is the working directory for an AQUAMIS analysis, typically containing a subfolder <path_to_data>/raw with your fastq read files |
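The placeholders in the table recur in every command of this guide. As a minimal sketch of how they could be filled in programmatically, the helper below substitutes each <path_to_...> token in a command template. The function name expand_placeholders and all example locations are hypothetical and not part of AQUAMIS.

```python
import re

# Matches the placeholder tokens used in this guide, e.g. <path_to_data>.
PLACEHOLDER = re.compile(r"<(path_to_[a-z]+)>")


def expand_placeholders(template, paths):
    """Replace every <path_to_...> token in template with its configured path."""

    def substitute(match):
        name = match.group(1)
        if name not in paths:
            raise KeyError(f"no path configured for <{name}>")
        return paths[name]

    return PLACEHOLDER.sub(substitute, template)


# Hypothetical example locations; adjust them to your own system.
paths = {
    "path_to_conda": "/home/user/miniconda3",
    "path_to_envs": "/home/user/miniconda3/envs",
    "path_to_installation": "/home/user/tools",
    "path_to_aquamis": "/home/user/tools/AQUAMIS",
    "path_to_databases": "/home/user/tools/AQUAMIS/reference_db",
    "path_to_data": "/home/user/runs/run01",
}

print(expand_placeholders("<path_to_aquamis>/scripts/aquamis_setup.sh", paths))
```

Tokens that do not start with path_to_ are left untouched, and an unknown placeholder raises an error instead of silently producing a broken path.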
To install the latest stable version of AQUAMIS, please clone the git repository on your system.
cd <path_to_installation>
git clone https://gitlab.com/bfr_bioinformatics/AQUAMIS.git
Next, please execute the setup script below to install the conda dependency manager mamba, create the conda environment aquamis and install external databases within the default folder <path_to_aquamis>/reference_db:
<path_to_aquamis>/scripts/aquamis_setup.sh
Alternatively, please initialize a conda base environment containing snakemake and mamba (mamba is faster in resolving dependencies), then:
mamba env create -c conda-forge -f <path_to_aquamis>/envs/aquamis.yaml
This creates an environment named aquamis containing all dependencies.
It is found under <path_to_conda>/envs/aquamis.
For custom database paths, please see the chapter Database setup.
mamba create -n aquamis -c bioconda aquamis # coming soon
Prerequisite:
Install the Docker engine for your favourite operating system, e.g. Ubuntu Linux.
Download the latest version of AQUAMIS from Docker Hub and note down the Docker Image ID on your system (hereafter referred to as $docker_image_id) with the shell commands:
docker pull bfrbioinformatics/aquamis:latest
docker image list | grep "aquamis" | grep "latest" | awk '{ print $3 }'
To process data and write results, Docker needs a volume mapping from a host directory containing your sequence data (<path_to_data>) to the Docker container (/AQUAMIS/analysis).
Your sample list (samples.tsv) needs to be located within <path_to_data> and contain relative paths to your NGS reads in the same or another child directory.
You may generate a Docker-compatible sample list in your host directory (<path_to_data>/samples.tsv) by executing the create_sampleSheet.sh script from the container with the following terminal commands:
host:<path_to_data>$ ls raw/
sample1_R1.fastq sample1_R2.fastq sample2_R1.fastq sample2_R2.fastq
docker run --rm \
-v <path_to_data>:/AQUAMIS/analysis \
-e HOST_PATH=<path_to_data> \
-e LOCAL_USER_ID=$(id -u $USER) \
--entrypoint bash $docker_image_id \
/AQUAMIS/scripts/create_sampleSheet.sh --mode ncbi \
--fastxDir /AQUAMIS/analysis/raw \
--outDir /AQUAMIS/analysis
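To illustrate what a sample list with relative read paths might look like, here is a minimal Python sketch that pairs forward and reverse read files by name. It is not the logic of create_sampleSheet.sh. The column names sample, fq1 and fq2, the _R1/_R2 naming pattern and the function names are assumptions made for illustration only, so check the output of the actual script for the authoritative format.

```python
import re

# Assumed naming scheme: <sample>_R1.fastq / <sample>_R2.fastq, optionally gzipped.
READ_FILE = re.compile(r"^(?P<sample>.+)_R(?P<mate>[12])\.fastq(\.gz)?$")


def pair_reads(file_names):
    """Group read files by sample and keep only complete R1/R2 pairs."""
    pairs = {}
    for name in sorted(file_names):
        match = READ_FILE.match(name)
        if match:
            pairs.setdefault(match["sample"], {})[match["mate"]] = name
    return {s: mates for s, mates in pairs.items() if set(mates) == {"1", "2"}}


def sample_sheet_lines(file_names, read_dir="raw"):
    """Build tab-separated lines with paths relative to the working directory."""
    lines = ["sample\tfq1\tfq2"]  # assumed header, not confirmed by this guide
    for sample, mates in pair_reads(file_names).items():
        lines.append(f"{sample}\t{read_dir}/{mates['1']}\t{read_dir}/{mates['2']}")
    return lines


files = ["sample1_R1.fastq", "sample1_R2.fastq", "sample2_R1.fastq", "sample2_R2.fastq"]
print("\n".join(sample_sheet_lines(files)))
```

Keeping the paths relative to <path_to_data> is what lets the same sample list resolve both on the host and under /AQUAMIS/analysis inside the container.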
With the following command, AQUAMIS is started within the Docker container and will process any options appended:
docker run --rm \
-v <path_to_data>:/AQUAMIS/analysis \
-e HOST_PATH=<path_to_data> \
-e LOCAL_USER_ID=$(id -u $USER) \
$docker_image_id \
--condaprefix /opt/conda/envs \
--sample_list /AQUAMIS/analysis/samples.tsv \
--working_directory /AQUAMIS/analysis \
--<any_other_AQUAMIS_options>
Note: The container path /AQUAMIS/analysis is fixed and must not be altered.
Any subdirectories of <path_to_data> will be available as subdirectories under /AQUAMIS/analysis/.
Our container is able to write results with the Linux user and group ID of your choice (UID and GID, respectively) to blend into your host file permission setup.
With the above option -e LOCAL_USER_ID=$(id -u $USER), the UID of the currently executing user is inherited; change it according to your needs.
The absolute host path mapped to the container has to be provided as the environment variable $HOST_PATH, too.
It is used for correcting file paths in the result JSON files of each sample to match the host perspective.
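The idea behind this path correction can be sketched in a few lines of Python: a path below the fixed container mount point is re-rooted onto the host directory, and any other path is left unchanged. This is an illustration of the concept, not the code AQUAMIS runs internally. The function name to_host_path and the example host directory are hypothetical.

```python
from pathlib import PurePosixPath

# Fixed mount point inside the container, as stated above.
CONTAINER_MOUNT = PurePosixPath("/AQUAMIS/analysis")


def to_host_path(container_path, host_path):
    """Translate a path seen inside the container to the host perspective."""
    path = PurePosixPath(container_path)
    try:
        relative = path.relative_to(CONTAINER_MOUNT)
    except ValueError:
        # The path lies outside the mounted volume, so it has no host equivalent.
        return str(path)
    return str(PurePosixPath(host_path) / relative)


# Hypothetical host directory, standing in for the value of $HOST_PATH.
print(to_host_path("/AQUAMIS/analysis/raw/sample1_R1.fastq", "/data/run42"))
```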
To run AQUAMIS, activate the conda environment aquamis and call the wrapper script:
conda activate aquamis
python3 aquamis.py --help
usage: aquamis.py [-h] -l SAMPLE_LIST -d WORKING_DIRECTORY [-s SNAKEFILE]
[-m MASHDB] [--mash_kmersize MASH_KMERSIZE]
[--mash_sketchsize MASH_SKETCHSIZE] [--kraken2db KRAKEN2DB]
[--read_length READ_LENGTH]
[--min_trimmed_length MIN_TRIMMED_LENGTH]
[--assembler ASSEMBLER]
[--shovill_output_options SHOVILL_OUTPUT_OPTIONS]
[--shovill_extraopts SHOVILL_EXTRAOPTS]
[--shovill_modules SHOVILL_MODULES] [-t THREADS]
[--threads_sample THREADS_SAMPLE] [-c CONDAPREFIX] [-n]
[--forceall] [-f FORCE] [--fix_fails] [--unlock]
[--no_assembly]
optional arguments:
-h, --help show this help message and exit
-l SAMPLE_LIST, --sample_list SAMPLE_LIST
List of samples to assemble, format as defined by ...
-d WORKING_DIRECTORY, --working_directory WORKING_DIRECTORY
Working directory
-s SNAKEFILE, --snakefile SNAKEFILE
Path to Snakefile of bakcharak pipeline, default is
path to Snakefile in same directory
-m MASHDB, --mashdb MASHDB
Path to reference mash database
--mash_kmersize MASH_KMERSIZE
kmer size for mash, must match size of database,
default 21
--mash_sketchsize MASH_SKETCHSIZE
sketch size for mash, must match size of database,
default 1000
--kraken2db KRAKEN2DB
Path to kraken2 database
--read_length READ_LENGTH
Read length to be used in braken abundane estimation,
default 150
--min_trimmed_length MIN_TRIMMED_LENGTH
Minimum length of a read to keep, default = 15
--assembler ASSEMBLER
Assembler to use in shovill, choose from megahit
velvet skesa spades (default: spades)
--shovill_output_options SHOVILL_OUTPUT_OPTIONS
Extra output options for shovill (default: "")
--shovill_extraopts SHOVILL_EXTRAOPTS
Extra options for shovill (default: "")
--shovill_modules SHOVILL_MODULES
Module options for shovill, choose from --noreadcorr
--trim --nostitch --nocorr --noreadcorr (default: "--
noreadcorr")
-t THREADS, --threads THREADS
Number of Threads to use. Ideally multiple of 10,
default = 10
--threads_sample THREADS_SAMPLE
Number of Threads to use per sample, default = 1
-c CONDAPREFIX, --condaprefix CONDAPREFIX
Path of default conda environment, enables recycling
built environments. Must not be empty.
-n, --dryrun Snakemake dryrun. Only calculate graph without
executing anything
--forceall Snakemake force. Force recalculation of all steps
-f FORCE, --force FORCE
Snakemake force. Force recalculation of output (rule
or file) speciefied here
--fix_fails Re-run snakemake after failure removing failed samples
--unlock Unlock a snakemake execution folder if it had been
interrupted
--no_assembly Only trimming and kraken analysis
For example:
<path_to_aquamis>/aquamis.py -l <path_to_data>/samples.tsv -s <path_to_aquamis>/Snakefile -c <path_to_envs> -m <path_to_databases>/mash/mash_db.msh -d <path_to_data>
You can also run snakemake directly:
snakemake -p --conda-prefix <path_to_envs> --keep-going --configfile <path_to_data>/config.yaml --snakefile <path_to_aquamis>/Snakefile --use-conda
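For routine use, the wrapper call can also be assembled from a script. The following is a minimal sketch that builds the argument list from the long options shown in the help output above. The function name build_aquamis_command and the example paths are hypothetical, and the actual execution is left commented out.

```python
import shlex


def build_aquamis_command(aquamis_dir, data_dir, envs_dir, mash_db, dry_run=False):
    """Assemble the aquamis.py call as an argument list suitable for subprocess."""
    command = [
        f"{aquamis_dir}/aquamis.py",
        "--sample_list", f"{data_dir}/samples.tsv",
        "--working_directory", data_dir,
        "--snakefile", f"{aquamis_dir}/Snakefile",
        "--condaprefix", envs_dir,
        "--mashdb", mash_db,
    ]
    if dry_run:
        # A dry run only computes the job graph without executing anything.
        command.append("--dryrun")
    return command


# Hypothetical example paths; replace them with your own locations.
command = build_aquamis_command(
    aquamis_dir="/home/user/tools/AQUAMIS",
    data_dir="/home/user/runs/run01",
    envs_dir="/home/user/miniconda3/envs",
    mash_db="/home/user/databases/mash/mash_db.msh",
    dry_run=True,
)
print(shlex.join(command))
# To actually start the run from within the activated aquamis environment:
# import subprocess; subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.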
AQUAMIS is built to be used routinely.
To ensure maximum comparability of the results, a default config.yaml file is generated when calling the aquamis.py wrapper script.
The wrapper itself only allows configuring basic functionalities.
The config.yaml can be initialized by starting AQUAMIS with the dry-run flag -n.
Then, you can alter it to configure AQUAMIS in more detail.
AQUAMIS will provide you with an interactive, browser-based report that shows the most important measures of your data at first glance.
All tables in the report can be sorted and filtered.
The Short Summary Table shows the key values for a quick estimation of the success of your sequencing experiment and the assembly.
The Detailed Assembly Table provides many additional measures.
In addition to the tables, many measures are provided as graphical feedback.
Plots per Run are generated for one complete sequencing experiment, whereas Plots per Sample each show measures on one specific dataset.
In addition, all results are stored in JSON format in the subfolders /json/qc and /json/full of your current working directory <path_to_data>.
The content of the /json/qc files is a subset of /json/full and combines trimming, contamination assessment and read-based taxonomic classification results prior to the assembly stage.
It represents the final digest when assembly is omitted by enforcing the Snakemake rule all_trimming_only.
Each JSON file is named after its corresponding sample and has the following high-level structure:
.
├── sample/
│ ├── analysis
│ ├── summary
│ └── qc_assessment
└── pipelines/
├── fastp
├── confindr
├── kraken2/
│ ├── read_based
│ └── contig_based
├── shovill
├── samstats
├── mlst
├── mash
├── quast
├── busco
└── aquamis
The node...
- sample/analysis holds metadata on the sample raw data paths, times of analyses, version info and analysis parameters of each performed AQUAMIS call.
- sample/summary combines selected results of all modules, representing the Detailed Assembly Table, and is also available as a single line per sample in <path_to_data>/reports/summary_report.tsv.
- pipelines/ stores the detailed results of each bioinformatic module/tool in a full take approach.

For easy data mining of multiple sample JSON files in R, please follow the methods used in the markdown cells "Import Sample JSONs and Deserialize" and "read_data" of <path_to_aquamis>/scripts/write_report.Rmd using the R packages jsonlite, rrapply and purrr.
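Outside of R, the same files can be read with any JSON library. The sketch below collects the sample/summary node of every file in json/full using only the Python standard library. It assumes that each file carries the extension .json and has the top-level key sample shown in the tree above. The function names are hypothetical, and the keys inside summary are not listed in this guide, so they are passed through untouched.

```python
import json
from pathlib import Path


def extract_summary(document):
    """Return the sample/summary node of one parsed AQUAMIS JSON document."""
    return document.get("sample", {}).get("summary", {})


def load_summaries(working_directory):
    """Map each file name stem in json/full to its sample/summary node."""
    summaries = {}
    json_dir = Path(working_directory, "json", "full")
    for json_file in sorted(json_dir.glob("*.json")):  # assumed file extension
        with json_file.open(encoding="utf-8") as handle:
            summaries[json_file.stem] = extract_summary(json.load(handle))
    return summaries


# Usage, with your own working directory in place of the placeholder:
# summaries = load_summaries("<path_to_data>")
# for sample, summary in summaries.items():
#     print(sample, sorted(summary))
```

The same loop works for json/qc by changing the subfolder, since those files are described above as a subset of the full ones.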
The ConFindr installation already provides databases for Listeria, Salmonella and E. coli.
Additional databases for Campylobacter, Bacillus, Brucella and Staphylococcus can be found here:
cd <path_to_databases> # free to choose
wget --output-document confindr_db.tar.gz https://seafile.bfr.berlin/f/ede87ec860624a0cb406/?dl=1
tar -xzvf confindr_db.tar.gz -C <path_to_databases>
Specify the path <path_to_databases>/confindr in the --confindr_database flag.
You may also consider using the species agnostic rMLST database described here.
We propose using the latest minikraken2 and associated bracken database; see here or here for details.
Alternatively, you can download a legacy version:
cd <path_to_databases> # free to choose
wget --output-document minikraken2.tgz https://seafile.bfr.berlin/f/8ca1b4d2c97341498698/?dl=1
tar -zxvf minikraken2.tgz
Specify the path <path_to_databases>/minikraken2 in the --kraken2db flag.
cd <path_to_databases> && mkdir <path_to_databases>/taxonkit # free to choose
wget --output-document taxdump.tar.gz https://seafile.bfr.berlin/f/1d51700ecfd241e4a6d4/?dl=1 # 54MB or ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -xzvf taxdump.tar.gz -C <path_to_databases>/taxonkit/
Specify the path <path_to_databases>/taxonkit in the --taxonkit_db flag.
cd <path_to_databases> && mkdir <path_to_databases>/mash # free to choose
wget --output-document mashDB.tar.gz https://seafile.bfr.berlin/f/41f804a1eba541788530/?dl=1
tar -xzvf mashDB.tar.gz -C <path_to_databases>/mash/
Specify the path <path_to_databases>/mash in the --mashdb flag.
cd <path_to_envs>/aquamis/lib/python3.7/site-packages/quast_libs/busco/ # exact path depends on conda installation
wget --output-document bacteria.tar.gz https://seafile.bfr.berlin/f/41cf8fdcfe2043d2800e/?dl=1
tar -xzvf bacteria.tar.gz
To detect the path of your Quast environment and associated Python library path, you may type:
find <path_to_envs>/aquamis -name quast
Augustus is an additional dependency to Quast v5 that should be downloaded and installed automatically.
In case there is a network issue, please install it manually by typing:
cd <path_to_envs>/aquamis/lib/python3.7/site-packages/quast_libs # exact path depends on conda installation
wget -O augustus.tar.gz https://seafile.bfr.berlin/f/64cc5034fad74f50a2f0/?dl=1
tar -xzvf augustus.tar.gz
Test data is provided by downloading the following tarball:
wget --output-document raw.tar.gz https://seafile.bfr.berlin/f/b8b636dbe6bd4b39801c/?dl=1
tar -xzvf raw.tar.gz -C <path_to_data>/
cd <path_to_data>
<path_to_aquamis>/scripts/create_sampleSheet.sh --help
Please consult the AQUAMIS project website for questions.
If this does not help, please feel free to consult: