add manual and further details to README (5640dbf6) · Commits · github_fork / Universc

README.md

+187 −6

		@@ -9,7 +9,10 @@

		### UniverSC version 0.3.0

		Conversion script to run Nadia, iCELL8, and custom libraries with cellranger (10x Genomics analysis tool)
		Single-cell processing across technologies.

		Provides a conversion script to run multiple technologies and custom libraries with cellranger (10x Genomics analysis tool).


		#### Tom Kelly (RIKEN IMS) and Kai Battenberg (RIKEN CSRS/IMS)

		@@ -17,6 +20,55 @@ Conversion script to run Nadia, iCELL8, and custom libraries with cellranger (10

		We've developed a bash script that will run cellranger on FASTQ files for these technologies. See below for details on how to use it.

		If you use this tool, please [cite](#Citation) it to acknowledge the efforts of the authors. You can report problems and request
		new features to the maintainers with an [issue](#Issues) on GitHub. Details on how to [install](#Install) and [run](#Usage) are provided
		below. Please see the [help](#Help) and [examples](#Examples) to try to solve your problem before submitting an issue.

		### Supported Technologies

		In principle, any technology with a cell barcode and unique molecular identifier (UMI) can be supported.

		The following technologies have been tested to ensure that they give the expected results: 10x Genomics, Nadia (DropSeq), and iCELL8 version 3.

		We provide the following preset configurations for convenience, based on published data and configurations used by other pipelines
		(e.g., DropSeqPipe and Kallisto/Bustools). To add support for other technologies or troubleshoot problems, please submit an issue
		to the GitHub repository: https://github.com/TomKellyGenetics/universc/issues as described in [Bug Reports](#Issues) below.

		Some changes to the cellranger install are required to run other technologies. We therefore provide a 10x Genomics setting
		that restores the configuration for the Chromium instrument, and we recommend using 'convert' to process data from all
		technologies, as the tool manages these changes. Please note that multiple technologies cannot be run on the same install of cellranger
		at the same time (the tool checks for this to avoid causing problems with existing runs). Multiple samples of the same technology
		can be run simultaneously.

		#### Pre-set configurations

		- 10x Genomics (version automatically detected): 10x, chromium
		  - 10x Genomics version 2 (16bp barcode, 10bp UMI): 10x-v2, chromium-v2
		  - 10x Genomics version 3 (16bp barcode, 12bp UMI): 10x-v3, chromium-v3
		- CEL-Seq (8bp barcode, 4bp UMI): celseq
		- CEL-Seq2 (6bp UMI, 6bp barcode): celseq2
		- Drop-Seq (12bp barcode, 8bp UMI): nadia, dropseq
		- iCell8 version 3 (11bp barcode, 14bp UMI): icell8 or custom
		- inDrops version 1 (19bp barcode, 8bp UMI): indrops-v1, 1cellbio-v1
		- inDrops version 2 (19bp barcode, 8bp UMI): indrops-v2, 1cellbio-v2
		- inDrops version 3 (8bp barcode, 6bp UMI): indrops-v3, 1cellbio-v3
		- Quartz-Seq2 (14bp barcode, 8bp UMI): quartzseq2-384
		- Quartz-Seq2 (15bp barcode, 8bp UMI): quartzseq2-1536
		- Sci-Seq (8bp UMI, 10bp barcode): sciseq
		- SCRB-Seq (6bp barcode, 10bp UMI): scrbseq, mcscrbseq
		- SeqWell (12bp barcode, 8bp UMI): seqwell
		- Smart-seq2-UMI, Smart-seq3 (11bp barcode, 8bp UMI): smartseq
		- SureCell (18bp barcode, 8bp UMI): surecell, ddseq, biorad

		#### Custom inputs

		Custom inputs are also supported by giving the name "custom" and length of barcode and UMI separated by "_"

		e.g. Custom (16bp barcode, 10bp UMI): custom_16_10

		Custom barcode files are also supported for preset technologies. These are particularly useful for well-based
		technologies to demultiplex based on the wells.
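
As a rough illustration of the `custom_<barcode>_<UMI>` format described above, the sketch below splits such a string into its two lengths. The function name `parse_custom_technology` is hypothetical and is not taken from `launch_universc.sh`; the script itself may validate the string differently.

```shell
# Hypothetical helper (an assumption, not part of launch_universc.sh):
# split "custom_<barcode>_<UMI>" into barcode and UMI lengths.
parse_custom_technology() {
    local technology="$1"
    local rest barcode umi value
    case "$technology" in
        custom_*_*) ;;
        *) echo "not a custom technology string: $technology" >&2; return 1 ;;
    esac
    rest=${technology#custom_}   # e.g. "16_10"
    barcode=${rest%%_*}          # text before the first "_"
    umi=${rest#*_}               # text after the first "_"
    # Both fields must be non-empty and contain digits only.
    for value in "$barcode" "$umi"; do
        case "$value" in
            ''|*[!0-9]*)
                echo "lengths must be integers: $technology" >&2
                return 1 ;;
        esac
    done
    echo "barcode=$barcode umi=$umi"
}

parse_custom_technology "custom_16_10"   # prints: barcode=16 umi=10
```

A string such as `icell8` or `custom_16` is rejected with a non-zero exit status, so a caller can fall back to the preset names.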

		## Release

		At the moment, we have not released the script publicly but we do intend to. We welcome any feedback on it.
		@@ -24,7 +76,38 @@ Hopefully it will save people time as make it easier to compare technologies.

		We plan to make this open-source with the agreement of everyone in the project.

		## Installation
		### Citation <span id="Citation"></span>

		### Bug Reports <span id="Issues"></span>

		#### Reporting issues

		To add further support for other technologies or troubleshoot problems, please submit an Issue
		to the GitHub repository: https://github.com/TomKellyGenetics/universc/issues

		#### Requesting new technologies

		Where possible, please provide a minimal example of the first few lines of each FASTQ file for testing purposes.

		It is also helpful to describe the technology, such as:

		- length of barcode
		- length of UMI
		- which reads they're on
		- whether there is a known barcode whitelist available
		- whether adapters or linkers are required
		- whether a preprint, publication, or company specifications are available
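
One way to prepare the minimal FASTQ example requested above is sketched here. The helper names `fastq_head` and `fastq_read_lengths` are hypothetical and are not part of UniverSC; they rely only on standard tools (`head`, `gzip`, `awk`, `sort`, `uniq`).

```shell
# Hypothetical helpers (assumptions, not part of UniverSC).
# Print the first N records (4 lines each) of a FASTQ file,
# decompressing on the fly when the name ends in .gz.
fastq_head() {
    local file="$1"
    local records="${2:-2}"
    local lines=$((records * 4))
    case "$file" in
        *.gz) gzip -dc "$file" | head -n "$lines" ;;
        *)    head -n "$lines" "$file" ;;
    esac
}

# List the distinct sequence lengths among the first N records, which
# hints at the combined barcode + UMI length on Read 1.
fastq_read_lengths() {
    fastq_head "$1" "${2:-1000}" | awk 'NR % 4 == 2 { print length($0) }' | sort -n | uniq
}

# Demonstration on a one-record file: 16 bp barcode followed by a 10 bp UMI.
demo=$(mktemp)
barcode=ACGTACGTACGTACGT
umi=AAAAAAAAAA
quality=$(printf '%s%s' "$barcode" "$umi" | tr 'ACGT' 'IIII')
printf '@read1\n%s%s\n+\n%s\n' "$barcode" "$umi" "$quality" > "$demo"
fastq_head "$demo" 1
fastq_read_lengths "$demo"   # prints: 26
rm -f "$demo"
```

The output of `fastq_head` for Read 1 and Read 2 can be pasted directly into an issue.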

		Technologies that may be difficult to support are those with:

		- barcodes longer than 16bp or varying length
		- combinatorial indexing
		- dual indexing

		Please bear this in mind when submitting requests. We will consider adding further technologies, but
		supporting these could take significant resources.

		## Installation <span id="Install"></span>

		This script requires cellranger to be installed and exported to the PATH (version 3.0.0 or higher recommended).
		The script itself is executable and does not require installation to run but you can put it in your PATH or
		@@ -37,7 +120,7 @@ This script will run in bash on any OS (but it has only been tested on Linux Deb
		with this configuration requires a lot of memory (40 GB) so running on a server is recommended.
		SGE job modes are supported to run cellranger with multiple threads.
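
The installation requirements above (cellranger on the PATH, version 3.0.0 or higher recommended) can be checked before a run. This is a minimal sketch: `require_command` and `version_at_least` are hypothetical helpers, and the version string is passed in by hand because how cellranger reports its version is not covered here.

```shell
# Hypothetical preflight helpers (assumptions, not part of UniverSC).
# Succeed only when the named program can be found in PATH.
require_command() {
    local name="$1"
    if command -v "$name" >/dev/null 2>&1; then
        echo "found $name at $(command -v "$name")"
    else
        echo "$name not found in PATH" >&2
        return 1
    fi
}

# Succeed when version $1 is at least $2 (numeric major.minor.patch).
version_at_least() {
    awk -v have="$1" -v need="$2" 'BEGIN {
        split(have, h, "."); split(need, n, ".")
        for (i = 1; i <= 3; i++) {
            if (h[i] + 0 > n[i] + 0) exit 0
            if (h[i] + 0 < n[i] + 0) exit 1
        }
        exit 0
    }'
}

require_command cellranger || echo "add cellranger to PATH before running launch_universc.sh"
version_at_least "3.0.2" "3.0.0" && echo "3.0.2 meets the recommended minimum"
```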

		## Usage
		## Usage <span id="Usage"></span>

		The script will:

		@@ -61,6 +144,104 @@ The script will:

		Please note that this script alters the barcode whitelist. Known iCELL8 barcodes are supported but this is not possible with Nadia or DropSeq chemistry so 100% valid barcodes will be returned.
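
When no known barcode list exists, every sequence of the expected length has to count as valid, which is why all barcodes are reported as valid for such chemistries. The sketch below enumerates every barcode of a given length as one plausible way to build such a whitelist. The function name `all_barcodes` is hypothetical and the real script may generate its whitelist differently.

```shell
# Hypothetical helper (an assumption, not the script's own method):
# print every A/C/G/T sequence of the given length, one per line.
all_barcodes() {
    local length="$1"
    awk -v n="$length" 'BEGIN {
        split("A C G T", base, " ")
        total = 4 ^ n
        for (i = 0; i < total; i++) {
            code = ""
            rest = i
            # Write i in base 4, most significant base first.
            for (j = 0; j < n; j++) {
                code = base[rest % 4 + 1] code
                rest = int(rest / 4)
            }
            print code
        }
    }'
}

all_barcodes 2 | head -n 5   # AA, AC, AG, AT, CA
```

The output grows as 4^n, so a 12 bp barcode yields about 16.8 million lines; writing the result to a file once and reusing it avoids recomputing it for every sample.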

		This is a work-in-progress and documentation with examples will be added in the future. The script is stable and functional.
		Please send feedback, comments, or issues to Kai Battenberg <[kai.battenberg@riken.jp](mailto:kai.battenberg@riken.jp)>
		or Tom Kelly <[tom.kelly@riken.jp](mailto:tom.kelly@riken.jp)>
		### Manual <span id="Help"></span>

		```
		Usage:
		bash launch_universc.sh --testrun -t TECHNOLOGY
		bash launch_universc.sh -t TECHNOLOGY --setup
		bash launch_universc.sh -R1 FILE1 -R2 FILE2 -t TECHNOLOGY -i ID -r REFERENCE [--option OPT]
		bash launch_universc.sh -R1 READ1_LANE1 READ1_LANE2 -R2 READ2_LANE1 READ2_LANE2 -t TECHNOLOGY -i ID -r REFERENCE [--option OPT]
		bash launch_universc.sh -f SAMPLE_LANE -t TECHNOLOGY -i ID -r REFERENCE [--option OPT]
		bash launch_universc.sh -f SAMPLE_LANE1 SAMPLE_LANE2 -t TECHNOLOGY -i ID -r REFERENCE [--option OPT]
		bash launch_universc.sh -v
		bash launch_universc.sh -h

		Convert sequencing data (FASTQ) from Nadia or iCELL8 platforms for compatibility with 10x Genomics and run cellranger count

		Mandatory arguments to long options are mandatory for short options too.
		--testrun Initiates a test run with the test dataset
		-R1, --read1 FILE Read 1 FASTQ file to pass to cellranger (cell barcodes and umi)
		-R2, --read2 FILE Read 2 FASTQ file to pass to cellranger
		-f, --file NAME Path and the name of FASTQ files to pass to cellranger (prefix before R1 or R2)
		e.g. /path/to/files/Example_S1_L001

		-i, --id ID A unique run id, used to name output folder
		-d, --description TEXT Sample description to embed in output files.
		-r, --reference DIR Path of directory containing 10x-compatible reference.
		-t, --technology PLATFORM Name of technology used to generate data.
		Supported technologies:
		10x Genomics (version automatically detected): 10x, chromium
		10x Genomics version 2 (16bp barcode, 10bp UMI): 10x-v2, chromium-v2
		10x Genomics version 3 (16bp barcode, 12bp UMI): 10x-v3, chromium-v3
		CEL-Seq (8bp barcode, 4bp UMI): celseq
		CEL-Seq2 (6bp UMI, 6bp barcode): celseq2
		Drop-Seq (12bp barcode, 8bp UMI): nadia, dropseq
		iCell8 version 3 (11bp barcode, 14bp UMI): icell8 or custom
		inDrops version 1 (19bp barcode, 8bp UMI): indrops-v1, 1cellbio-v1
		inDrops version 2 (19bp barcode, 8bp UMI): indrops-v2, 1cellbio-v2
		inDrops version 3 (8bp barcode, 6bp UMI): indrops-v3, 1cellbio-v3
		Quartz-Seq2 (14bp barcode, 8bp UMI): quartzseq2-384
		Quartz-Seq2 (15bp barcode, 8bp UMI): quartzseq2-1536
		Sci-Seq (8bp UMI, 10bp barcode): sciseq
		SCRB-Seq (6bp barcode, 10bp UMI): scrbseq, mcscrbseq
		SeqWell (12bp barcode, 8bp UMI): seqwell
		Smart-seq2-UMI, Smart-seq3 (11bp barcode, 8bp UMI): smartseq
		SureCell (18bp barcode, 8bp UMI): surecell, ddseq, biorad
		Custom inputs are also supported by giving the name "custom" and length of barcode and UMI separated by "_"
		e.g. Custom (16bp barcode, 10bp UMI): custom_16_10
		-b, --barcodefile FILE Custom barcode list in plain text (with each line containing a barcode)

		-c, --chemistry CHEM Assay configuration, autodetection is not possible for converted files: SC3Pv2 (default), SC5P-PE, or SC5P-R2
		-n, --force-cells NUM Force pipeline to use this number of cells, bypassing the cell detection algorithm.
		-j, --jobmode MODE Job manager to use. Valid options: local (default), sge, lsf, or a .template file
		--localcores NUM Set max cores the pipeline may request at one time.
		Only applies when --jobmode=local.
		--localmem NUM Set max GB the pipeline may request at one time.
		Only applies when --jobmode=local.
		--mempercore NUM Set max GB each job may use at one time.
		Only applies in cluster jobmodes.

		-p, --per-cell-data Generates a file with basic run statistics along with per-cell data

		--setup Set up whitelists for compatibility with new technology and exit
		--as-is Skips the FASTQ file conversion if the file already exists

		-h, --help Display this help and exit
		-v, --version Output version information and exit
		--verbose Print additional outputs for debugging

		For each fastq file, follow the naming convention below:
		<SampleName>_<SampleNumber>_<LaneNumber>_<ReadNumber>_001.fastq
		e.g. EXAMPLE_S1_L001_R1_001.fastq
		Example_S4_L002_R2_001.fastq.gz

		For custom barcode and umi length, follow the format below:
		custom_<barcode>_<UMI>
		e.g. custom_16_10 (which is the same as 10x)

		Files will be renamed if they do not follow this format. File extension will be detected automatically.

		```
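
The naming convention in the manual above can be illustrated with a small sketch that maps a file name such as `SRR1873277_R1.fastq` to the cellranger-style name. The function `cellranger_fastq_name` is hypothetical; defaulting to sample number `S1` and lane `L001` and keeping the extension unchanged are assumptions, and the renaming done by `launch_universc.sh` may differ.

```shell
# Hypothetical helper (an assumption, not the script's own renaming code):
# print the cellranger-style name for a FASTQ file.
cellranger_fastq_name() {
    local base ext stem sample mate
    base=$(basename "$1")
    # Recognise the extensions listed in the manual.
    case "$base" in
        *.fastq.gz) ext=.fastq.gz ;;
        *.fq.gz)    ext=.fq.gz ;;
        *.fastq)    ext=.fastq ;;
        *.fq)       ext=.fq ;;
        *) echo "unrecognised FASTQ extension: $base" >&2; return 1 ;;
    esac
    stem=${base%"$ext"}
    case "$stem" in
        # Already <SampleName>_<SampleNumber>_<LaneNumber>_<ReadNumber>_001
        *_S[0-9]*_L[0-9][0-9][0-9]_R[12]_001) echo "$base"; return 0 ;;
        *_R1) mate=R1 ;;
        *_R2) mate=R2 ;;
        *) echo "cannot tell read number from: $base" >&2; return 1 ;;
    esac
    sample=${stem%_R[12]}
    # Assumed defaults: sample number S1 and lane L001.
    echo "${sample}_S1_L001_${mate}_001${ext}"
}

cellranger_fastq_name "SRR1873277_R1.fastq"              # prints: SRR1873277_S1_L001_R1_001.fastq
cellranger_fastq_name "Example_S4_L002_R2_001.fastq.gz"  # prints the name unchanged
```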

		### Examples <span id="Examples"></span>

		### Licensing

		This package is provided open source under the GPL-3 license. This means that you are free to use and
		modify this code, provided that derived works also carry this license.

		Please note that we are third-party developers releasing it for use by users like ourselves.
		We are not affiliated with 10x Genomics, Dolomite Bio, Takara Bio, or any other vendor of
		single-cell technologies. This software is not supported by 10x Genomics and only changes
		data formats so that other technologies can be used with the cellranger pipeline.

		Cellranger (versions 2.0.2, 2.1.0, and 3.0.2) has been released as open source under an MIT
		license on GitHub. We use these open-source releases of cellranger for testing and running our tools.
		Note that the code that generates the 'cloupe' files is not included in this release.
		The Cloupe browser uses files generated by proprietary closed-source software and is
		subject to the 10x Genomics End-User License Agreement which does not allow use with
		data generated from other platforms.

		Therefore 'launch_universc.sh' does not support Cloupe files and you should not use them with
		technologies other than 10x Genomics.

launch_universc.sh

+1 −1

		@@ -93,7 +93,7 @@ fi
		#####usage statement#####
		help='
		Usage:
		bash '$(basename $0)' --testrun -t THECHNOLOGY
		bash '$(basename $0)' --testrun -t TECHNOLOGY
		bash '$(basename $0)' -t TECHNOLOGY --setup
		bash '$(basename $0)' -R1 FILE1 -R2 FILE2 -t TECHNOLOGY -i ID -r REFERENCE [--option OPT]
		bash '$(basename $0)' -R1 READ1_LANE1 READ1_LANE2 -R2 READ2_LANE1 READ2_LANE2 -t TECHNOLOGY -i ID -r REFERENCE [--option OPT]

man/convert.sh

0 → 100644

+197 −0

		.\" Manpage for UniverSC
		.\" Contact tom.kelly@riken.jp to correct errors or typos.
		.TH man 1 "08 April 2020" "0.3" "launch_universc.sh man page"
		.SH NAME
		launch_universc.sh \- single-cell processing across technologies
		.SH SYNOPSIS
		bash launch_universc.sh [--version] [--help] [--setup] [-t <technology>] [-i <id>]
		[-r <reference>] [--option <OPT>]

		bash launch_universc.sh --testrun -t TECHNOLOGY
		bash launch_universc.sh -t TECHNOLOGY --setup
		bash launch_universc.sh -R1 FILE1 -R2 FILE2 -t TECHNOLOGY -i ID -r REFERENCE [--option OPT]
		bash launch_universc.sh -R1 READ1_LANE1 READ1_LANE2 -R2 READ2_LANE1 READ2_LANE2 -t TECHNOLOGY -i ID -r REFERENCE [--option OPT]
		bash launch_universc.sh -f SAMPLE_LANE -t TECHNOLOGY -i ID -r REFERENCE [--option OPT]
		bash launch_universc.sh -f SAMPLE_LANE1 SAMPLE_LANE2 -t TECHNOLOGY -i ID -r REFERENCE [--option OPT]
		bash launch_universc.sh -v
		bash launch_universc.sh -h
		.SH DESCRIPTION
		Provides a conversion script to run multiple technologies and custom libraries with cellranger (10x Genomics analysis tool).
		.SH OPTIONS
		--testrun
		Initiates a test run with the test dataset. The technology and id must be specified.

		e.g., bash launch_universc.sh -t "10x" -i "test-10x" --testrun

		-R1, --read1 FILE
		Read 1 FASTQ file to pass to cellranger (contains the cell barcodes and umi).
		Please provide the name of FASTQ file in the working directory or the path to it.
		String must match the name of an existing file. Files can have any of the
		following extensions:

		.fastq .fq .fastq.gz .fq.gz

		Compressed files will be opened automatically. Files will be renamed for
		compatibility with cellranger:

		e.g., SRR1873277_R1.fastq will be renamed to SRR1873277_S1_L001_R1_001.fastq

		Names for multiple files can be given, for example multiple lanes:

		--read1 Sample_S1_L001_R1_001.fastq Sample_S1_L002_R1_001.fastq

		Apart from inDrops-v2 or inDrops-v3, all technologies expect barcodes in Read 1.

		-R2, --read2 FILE
		Read 2 FASTQ file to pass to cellranger (contains the transcript reads).
		Please provide the name of FASTQ file in the working directory or the path to it.
		String must match the name of an existing file. Files can have any of the
		following extensions:

		.fastq .fq .fastq.gz .fq.gz

		Compressed files will be opened automatically. Files will be renamed for
		compatibility with cellranger:

		e.g., SRR1873277_R2.fastq will be renamed to SRR1873277_S1_L001_R2_001.fastq

		Names for multiple files can be given, for example multiple lanes:

		--read2 Sample_S1_L001_R2_001.fastq Sample_S1_L002_R2_001.fastq

		-f, --file NAME
		Path and the name of FASTQ files to pass to cellranger (prefix before R1 or R2)

		e.g. /path/to/files/Example_S1_L001

		This enables giving a prefix instead of "read1" and "read2". This requires
		that there are fastq files ending with the following suffixes:

		${NAME}_R1_001.fastq and ${NAME}_R2_001.fastq

		Automatic renaming of files and detection of file type may not work in this mode.
		Multiple inputs are still supported for multiple lanes:

		e.g., --file Example_S1_L001 Example_S2_L002

		for files: Example_S1_L001_R1_001.fastq Example_S2_L002_R1_001.fastq
		Example_S1_L001_R2_001.fastq Example_S2_L002_R2_001.fastq

		-i, --id ID
		A unique run id, used to name the output folder. Must be a string that contains no
		special characters and does not match an existing filename.

		-d, --description TEXT
		Sample description to embed in output files, passes to cellranger HTML output.

		-r, --reference DIR
		Path of directory containing 10x-compatible reference.
		See cellranger documentation on how to generate custom "transcriptomes" or
		download human and mouse references from the 10x Genomics website.

		-t, --technology PLATFORM
		Name of technology used to generate data.

		Supported technologies:

		10x Genomics (version automatically detected): 10x, chromium
		10x Genomics version 2 (16bp barcode, 10bp UMI): 10x-v2, chromium-v2
		10x Genomics version 3 (16bp barcode, 12bp UMI): 10x-v3, chromium-v3
		CEL-Seq (8bp barcode, 4bp UMI): celseq
		CEL-Seq2 (6bp UMI, 6bp barcode): celseq2
		Drop-Seq (12bp barcode, 8bp UMI): nadia, dropseq
		iCell8 version 3 (11bp barcode, 14bp UMI): icell8 or custom
		inDrops version 1 (19bp barcode, 8bp UMI): indrops-v1, 1cellbio-v1
		inDrops version 2 (19bp barcode, 8bp UMI): indrops-v2, 1cellbio-v2
		inDrops version 3 (8bp barcode, 6bp UMI): indrops-v3, 1cellbio-v3
		Quartz-Seq2 (14bp barcode, 8bp UMI): quartzseq2-384
		Quartz-Seq2 (15bp barcode, 8bp UMI): quartzseq2-1536
		Sci-Seq (8bp UMI, 10bp barcode): sciseq
		SCRB-Seq (6bp barcode, 10bp UMI): scrbseq, mcscrbseq
		SeqWell (12bp barcode, 8bp UMI): seqwell
		Smart-seq2-UMI, Smart-seq3 (11bp barcode, 8bp UMI): smartseq
		SureCell (18bp barcode, 8bp UMI): surecell, ddseq, biorad
		Custom inputs are also supported by giving the name "custom" and length of barcode and UMI separated by "_"
		e.g. Custom (16bp barcode, 10bp UMI): custom_16_10

		A barcode whitelist is provided for all beads or wells for the following technologies:

		10x Genomics, iCell8, inDrops-v2, and QuartzSeq2

		Where no known barcodes are available, all possible barcodes of the expected length are
		generated and converted, unless the permutations have already been computed.

		-b, --barcodefile FILE
		Custom barcode list in plain text (with each line containing a barcode). Please provide
		the name of a text file in the working directory or the path to it.

		-c, --chemistry CHEM
		Assay configuration, autodetection is not possible for converted files:

		SC3Pv2 (default), SC3Pv3, SC5P-PE, or SC5P-R2

		Chemistry can only be automatically detected for 10x Genomics Chromium as it relies
		on matches to a barcode whitelist. For other technologies we do not recommend changing
		the chemistry input. All samples are converted to contain the barcode and UMI in Read1
		as used for SC3Pv2. SC3Pv3 is only used for technologies with longer UMI.

		-n, --force-cells NUM
		Force pipeline to use this number of cells, bypassing the cell detection algorithm.

		-j, --jobmode MODE
		Job manager to use. Valid options: local (default), sge, lsf, or a .template file

		We recommend using a cluster configuration when submitting jobs to a job scheduler.
		DO NOT submit jobs in "local" mode to a slurm, SGE, or LSF cluster as cellranger runs
		multiple threads in parallel by default. Performance is significantly improved using a
		cluster mode with a job scheduler.

		See the cellranger documentation on how to set up a cluster mode with a template file.

		--localcores NUM
		Set max cores the pipeline may request at one time.
		Only applies when --jobmode=local.

		--localmem NUM
		Set max GB the pipeline may request at one time.
		Only applies when --jobmode=local.

		--mempercore NUM
		Set max GB each job may use at one time.
		Only applies in cluster jobmodes.

		-p, --per-cell-data
		Generates a file with basic run statistics along with per-cell data (additional output to cellranger).
		Recommended but disabled by default due to additional runtime required to parse BAM files.
		This provides more accurate summary statistics than cellranger (which uses an average across cells
		that are filtered out).

		--setup
		Set up whitelists for compatibility with new technology. Called automatically when a new
		technology is run and no other technology is running. Recommended to run before submitting
		multiple samples of the same technology. Example:

		bash launch_universc.sh -t "dropseq" --setup

		--as-is
		Skips the FASTQ file conversion if the file already exists and run cellranger on pre-converted file.

		-h, --help
		Prints the usage and a list of the most commonly used commands.

		-v, --version
		Prints the version of 'convert' and the 'cellranger' version that will be called from the PATH.

		--verbose
		Print additional information to standard out for debugging purposes.

		.SH SEE ALSO
		cellranger
		.SH BUGS
		No known bugs.
		.SH AUTHOR
		S. Thomas Kelly (tom.kelly [at] riken.jp)
		Kai Battenberg (kai.battenberg [at] riken.jp)
