Commit 5640dbf6 authored by TomKellyGenetics

add manual and further details to README

parent 157a6abb
+187 −6
@@ -9,7 +9,10 @@

### UniverSC version 0.3.0

Conversion script to run Nadia, iCELL8, and custom libraries with cellranger (10x Genomics analysis tool)
Single-cell processing across technologies.

Provides a conversion script to run multiple technologies and custom libraries with cellranger (10x Genomics analysis tool).


#### Tom Kelly (RIKEN IMS) and Kai Battenberg (RIKEN CSRS/IMS)

@@ -17,6 +20,55 @@ Conversion script to run Nadia, iCELL8, and custom libraries with cellranger (10

We've developed a bash script that will run cellranger on FASTQ files for these technologies. See below for details on how to use it.

If you use this tool, please [cite](#Citation) it to acknowledge the efforts of the authors. You can report problems and request
new features to the maintainers with an [issue](#Issues) on GitHub. Details on how to [install](#Install) and [run](#Usage) are provided
below. Please see the [help](#Help) and [examples](#Examples) to try to solve your problem before submitting an issue.

### Supported Technologies

In principle, any technology with a cell barcode and unique molecular identifier (UMI) can be supported.

The following technologies have been tested to ensure that they give the expected results: 10x Genomics, Nadia (DropSeq), and iCELL8 version 3.

We provide the following preset configurations for convenience, based on published data and configurations used by other pipelines
(e.g., DropSeqPipe and Kallisto/Bustools). To add support for other technologies or troubleshoot problems, please submit an Issue
to the GitHub repository (https://github.com/TomKellyGenetics/universc/issues) as described in [Bug Reports](#Issues) below.

Some changes to the cellranger install are required to run other technologies. We therefore provide a 10x Genomics setting
that restores the configuration for the Chromium instrument, and we recommend using 'convert' to process data from all
technologies, as the tool manages these changes. Please note that multiple technologies cannot be run on the same install of cellranger
at the same time (the tool checks for this to avoid causing problems with existing runs). Multiple samples of the same technology
can be run simultaneously.

#### Pre-set configurations

-  10x Genomics (version automatically detected): 10x, chromium
    -  10x Genomics version 2 (16bp barcode, 10bp UMI): 10x-v2, chromium-v2
    -  10x Genomics version 3 (16bp barcode, 12bp UMI): 10x-v3, chromium-v3
-  CEL-Seq (8bp barcode, 4bp UMI): celseq
-  CEL-Seq2 (6bp UMI, 6bp barcode): celseq2
-  Drop-Seq (12bp barcode, 8bp UMI): nadia, dropseq
-  iCell8 version 3 (11bp barcode, 14bp UMI): icell8 or custom
-  inDrops version 1 (19bp barcode, 8bp UMI): indrops-v1, 1cellbio-v1
-  inDrops version 2 (19bp barcode, 8bp UMI): indrops-v2, 1cellbio-v2
-  inDrops version 3 (8bp barcode, 6bp UMI): indrops-v3, 1cellbio-v3
-  Quartz-Seq2 (14bp barcode, 8bp UMI): quartzseq2-384
-  Quartz-Seq2 (15bp barcode, 8bp UMI): quartzseq2-1536
-  Sci-Seq (8bp UMI, 10bp barcode): sciseq
-  SCRB-Seq (6bp barcode, 10bp UMI): scrbseq, mcscrbseq
-  SeqWell (12bp barcode, 8bp UMI): seqwell
-  Smart-seq2-UMI, Smart-seq3 (11bp barcode, 8bp UMI): smartseq
-  SureCell (18bp barcode, 8bp UMI): surecell, ddseq, biorad

#### Custom inputs

Custom inputs are also supported by giving the name "custom" and length of barcode and UMI separated by "_"

 e.g. Custom (16bp barcode, 10bp UMI): custom_16_10
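
As an illustration of the `custom_<barcode>_<UMI>` format, the two lengths can be recovered by splitting the name on "_". This is a minimal sketch and not necessarily how `launch_universc.sh` parses the name; the function `parse_custom_technology` is hypothetical.

```shell
# Hypothetical helper: split "custom_<barcode>_<UMI>" into its two lengths.
parse_custom_technology() {
    local technology=$1
    local name barcode_length umi_length
    # Split on "_" into three fields; any extra fields end up in umi_length.
    IFS=_ read -r name barcode_length umi_length <<< "$technology"
    # Reject anything that is not "custom" followed by two integers.
    if [[ $name != custom || ! $barcode_length =~ ^[0-9]+$ || ! $umi_length =~ ^[0-9]+$ ]]; then
        echo "expected custom_<barcode>_<UMI>, got: $technology" >&2
        return 1
    fi
    echo "$barcode_length $umi_length"
}

parse_custom_technology custom_16_10   # prints "16 10"
```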

Custom barcode files are also supported for preset technologies. These are particularly useful for well-based
technologies to demultiplex based on the wells.
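
Before passing a custom barcode file, it can help to check that it is plain text with one barcode per line. The sketch below assumes that barcodes contain only A, C, G, T, or N and that all share one length; the function name `check_barcode_file` and both of these rules are illustrative assumptions rather than requirements stated by the tool.

```shell
# Hypothetical helper: verify a barcode list before using it with --barcodefile.
check_barcode_file() {
    local file=$1 expected_length=$2
    local line line_number=0
    # The "|| [[ -n $line ]]" clause also handles a final line without a newline.
    while IFS= read -r line || [[ -n $line ]]; do
        line_number=$((line_number + 1))
        # Assumption: a barcode is an uppercase nucleotide string.
        if [[ ! $line =~ ^[ACGTN]+$ ]]; then
            echo "line $line_number: not a nucleotide sequence" >&2
            return 1
        fi
        # Assumption: every barcode has the same length.
        if (( ${#line} != expected_length )); then
            echo "line $line_number: expected $expected_length bp, found ${#line}" >&2
            return 1
        fi
    done < "$file"
    return 0
}

# Example with a placeholder file name:
#   check_barcode_file barcodes.txt 11
```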

## Release

At the moment, we have not released the script publicly but we do intend to. We welcome any feedback on it. 
@@ -24,7 +76,38 @@ Hopefully it will save people time as make it easier to compare technologies.

We plan to make this open-source with the agreement of everyone in the project.

## Installation
### Citation <span id="Citation"></span>

### Bug Reports <span id="Issues"></span>

#### Reporting issues

To add further support for other technologies or troubleshoot problems, please submit an Issue 
to the GitHub repository: https://github.com/TomKellyGenetics/universc/issues

### Requesting new technologies

Where possible, please provide a minimal example of the first few lines of each FASTQ file for testing purposes.

It is also helpful to describe the technology, such as:

- length of barcode
- length of UMI
- which reads they're on
- whether there is a known barcode whitelist available
- whether adapters or linkers are required
- whether a preprint, publication, or company specifications are available

Technologies that may be difficult to support are those with:

- barcodes longer than 16bp or varying length 
- combinatorial indexing
- dual indexing 

Please bear this in mind when submitting requests. We will consider adding further technologies, but
supporting these could take significant resources.

## Installation <span id="Install"></span>

This script requires cellranger to be installed and exported to the PATH (version 3.0.0 or higher recommended).
The script itself is executable and does not require installation to run but you can put it in your PATH or 
@@ -37,7 +120,7 @@ This script will run in bash on any OS (but it has only been tested on Linux Deb
with this configuration requires a lot of memory (40 GB), so running on a server is recommended.
SGE job modes are supported to run cellranger with multiple threads.
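
A quick way to confirm the requirement that cellranger is exported to the PATH is to look it up with `command -v` before launching anything. This is a minimal sketch; the helper `require_on_path` is a hypothetical name and is not part of the script.

```shell
# Hypothetical helper: succeed only if the named program is found on the PATH.
require_on_path() {
    local program=$1
    if ! command -v "$program" > /dev/null 2>&1; then
        echo "$program not found on PATH" >&2
        return 1
    fi
}

# Report where cellranger would be called from, or explain what is missing.
if require_on_path cellranger 2> /dev/null; then
    echo "cellranger found at: $(command -v cellranger)"
else
    echo "cellranger is not on the PATH; install it or export its directory"
fi
```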

## Usage
## Usage <span id="Usage"></span>

The script will:

@@ -61,6 +144,104 @@ The script will:

Please note that this script alters the barcode whitelist. Known iCELL8 barcodes are supported, but no whitelist of known barcodes is available for Nadia or DropSeq chemistry, so every barcode is treated as valid and 100% valid barcodes will be reported.
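
Where no whitelist exists, one option is to enumerate every possible barcode of the expected length so that any observed barcode matches. The following is a minimal sketch of that idea and not the script's actual implementation; the function name `all_barcodes` is hypothetical.

```shell
# Hypothetical helper: print every nucleotide sequence of the given length.
all_barcodes() {
    local length=$1
    local barcodes=("")
    local i prefix base next
    for ((i = 0; i < length; i++)); do
        next=()
        # Extend each sequence built so far by one base.
        for prefix in "${barcodes[@]}"; do
            for base in A C G T; do
                next+=("${prefix}${base}")
            done
        done
        barcodes=("${next[@]}")
    done
    printf '%s\n' "${barcodes[@]}"
}

all_barcodes 1   # prints A, C, G, and T on separate lines
```

The output grows as 4 to the power of the length (a 12bp barcode gives 16,777,216 sequences), so a pure-bash loop like this is only practical for short lengths.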

This is a work-in-progress and documentation with examples will be added in the future. The script is stable and functional.
Please send feedback, comments, or issues to Kai Battenberg <[kai.battenberg@riken.jp](mailto:kai.battenberg@riken.jp)>
 or Tom Kelly <[tom.kelly@riken.jp](mailto:tom.kelly@riken.jp)>
### Manual <span id="Help"></span>

```
Usage:
  bash launch_universc.sh --testrun -t TECHNOLOGY
  bash launch_universc.sh -t TECHNOLOGY --setup
  bash launch_universc.sh -R1 FILE1 -R2 FILE2 -t TECHNOLOGY -i ID -r REFERENCE [--option OPT]
  bash launch_universc.sh -R1 READ1_LANE1 READ1_LANE2 -R2 READ2_LANE1 READ2_LANE2 -t TECHNOLOGY -i ID -r REFERENCE [--option OPT]
  bash launch_universc.sh -f SAMPLE_LANE -t TECHNOLOGY -i ID -r REFERENCE [--option OPT]
  bash launch_universc.sh -f SAMPLE_LANE1 SAMPLE_LANE2 -t TECHNOLOGY -i ID -r REFERENCE [--option OPT]
  bash launch_universc.sh -v
  bash launch_universc.sh -h

Convert sequencing data (FASTQ) from Nadia or iCELL8 platforms for compatibility with 10x Genomics and run cellranger count

Mandatory arguments to long options are mandatory for short options too.
       --testrun                Initiates a test run with the test dataset
  -R1, --read1 FILE             Read 1 FASTQ file to pass to cellranger (cell barcodes and umi)
  -R2, --read2 FILE             Read 2 FASTQ file to pass to cellranger
  -f,  --file NAME              Path and the name of FASTQ files to pass to cellranger (prefix before R1 or R2)
                                  e.g. /path/to/files/Example_S1_L001

  -i,  --id ID                  A unique run id, used to name output folder
  -d,  --description TEXT       Sample description to embed in output files.
  -r,  --reference DIR          Path of directory containing 10x-compatible reference.
  -t,  --technology PLATFORM    Name of technology used to generate data.
                                Supported technologies:
                                  10x Genomics (version automatically detected): 10x, chromium
                                  10x Genomics version 2 (16bp barcode, 10bp UMI): 10x-v2, chromium-v2
                                  10x Genomics version 3 (16bp barcode, 12bp UMI): 10x-v3, chromium-v3
                                  CEL-Seq (8bp barcode, 4bp UMI): celseq
                                  CEL-Seq2 (6bp UMI, 6bp barcode): celseq2
                                  Drop-Seq (12bp barcode, 8bp UMI): nadia, dropseq
                                  iCell8 version 3 (11bp barcode, 14bp UMI): icell8 or custom
                                  inDrops version 1 (19bp barcode, 8bp UMI): indrops-v1, 1cellbio-v1
                                  inDrops version 2 (19bp barcode, 8bp UMI): indrops-v2, 1cellbio-v2
                                  inDrops version 3 (8bp barcode, 6bp UMI): indrops-v3, 1cellbio-v3
                                  Quartz-Seq2 (14bp barcode, 8bp UMI): quartzseq2-384
                                  Quartz-Seq2 (15bp barcode, 8bp UMI): quartzseq2-1536
                                  Sci-Seq (8bp UMI, 10bp barcode): sciseq
                                  SCRB-Seq (6bp barcode, 10bp UMI): scrbseq, mcscrbseq
                                  SeqWell (12bp barcode, 8bp UMI): seqwell
                                  Smart-seq2-UMI, Smart-seq3 (11bp barcode, 8bp UMI): smartseq
                                  SureCell (18bp barcode, 8bp UMI): surecell, ddseq, biorad
                                Custom inputs are also supported by giving the name "custom" and length of barcode and UMI separated by "_"
                                  e.g. Custom (16bp barcode, 10bp UMI): custom_16_10
  -b,  --barcodefile FILE       Custom barcode list in plain text (with each line containing a barcode)

  -c,  --chemistry CHEM         Assay configuration, autodetection is not possible for converted files: SC3Pv2 (default), SC5P-PE, or SC5P-R2
  -n,  --force-cells NUM        Force pipeline to use this number of cells, bypassing the cell detection algorithm.
  -j,  --jobmode MODE           Job manager to use. Valid options: local (default), sge, lsf, or a .template file
       --localcores NUM         Set max cores the pipeline may request at one time.
                                    Only applies when --jobmode=local.
       --localmem NUM           Set max GB the pipeline may request at one time.
                                    Only applies when --jobmode=local.
       --mempercore NUM         Set max GB each job may use at one time.
                                    Only applies in cluster jobmodes.

  -p,  --per-cell-data          Generates a file with basic run statistics along with per-cell data

       --setup                  Set up whitelists for compatibility with new technology and exit
       --as-is                  Skips the FASTQ file conversion if the file already exists

  -h,  --help                   Display this help and exit
  -v,  --version                Output version information and exit
       --verbose                Print additional outputs for debugging

For each fastq file, follow the naming convention below:
  <SampleName>_<SampleNumber>_<LaneNumber>_<ReadNumber>_001.fastq
  e.g. EXAMPLE_S1_L001_R1_001.fastq
       Example_S4_L002_R2_001.fastq.gz

For custom barcode and umi length, follow the format below:
  custom_<barcode>_<UMI>
  e.g. custom_16_10 (which is the same as 10x)

Files will be renamed if they do not follow this format. File extension will be detected automatically.

```
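
The naming convention above can be sketched as a small function that derives the expected name from a file ending in `_R1` or `_R2`. It mirrors the renaming example given in the man page, but it is only an illustration: the function name `cellranger_fastq_name` and the default sample number `S1` and lane `L001` are assumptions, and only `.fastq` and `.fastq.gz` are handled here.

```shell
# Hypothetical helper: build <SampleName>_<SampleNumber>_<LaneNumber>_<ReadNumber>_001.fastq
cellranger_fastq_name() {
    local file=$1 sample_number=${2:-S1} lane=${3:-L001}
    local base stem extension read_number
    base=$(basename "$file")
    # Detect the extension (compressed files keep their .gz suffix).
    case $base in
        *.fastq.gz) extension=fastq.gz ;;
        *.fastq)    extension=fastq ;;
        *) echo "unsupported extension: $base" >&2; return 1 ;;
    esac
    stem=${base%".$extension"}
    # The read number is taken from the end of the remaining name.
    case $stem in
        *_R1) read_number=R1 ;;
        *_R2) read_number=R2 ;;
        *) echo "cannot find _R1 or _R2 in: $base" >&2; return 1 ;;
    esac
    echo "${stem%"_$read_number"}_${sample_number}_${lane}_${read_number}_001.${extension}"
}

cellranger_fastq_name SRR1873277_R1.fastq   # prints SRR1873277_S1_L001_R1_001.fastq
```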

### Examples <span id="Examples"></span>
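
As a starting point, a single-sample run can be assembled from the options listed in the manual. The file names, run id, and reference path below are placeholders and not files shipped with the tool; the sketch only prints the command so that it can be inspected before running.

```shell
# Placeholder inputs: replace with your own files, run id, and reference.
read1=Sample_S1_L001_R1_001.fastq
read2=Sample_S1_L001_R2_001.fastq
technology=dropseq
run_id=test-dropseq
reference=/path/to/reference

# Build the command as an array so that each argument stays intact.
cmd=(bash launch_universc.sh -R1 "$read1" -R2 "$read2"
     -t "$technology" -i "$run_id" -r "$reference")

# Print the command for inspection; run it with "${cmd[@]}" once the inputs exist.
printf '%s ' "${cmd[@]}"
echo
```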

### Licensing

This package is provided open-source under a GPL-3 license. This means that you are free to use and 
modify this code provided that modified versions also carry this license.

Please note that we are third-party developers releasing it for use by users like ourselves.
We are not affiliated with 10x Genomics, Dolomite Bio, Takara Bio, or any other vendor of
single-cell technologies. This software is not supported by 10x Genomics and only changes
data formats so that other technologies can be used with the cellranger pipeline.

Cellranger (versions 2.0.2, 2.1.0, and 3.0.2) has been released open source under an MIT
license on GitHub. We use these versions of cellranger for testing and running our tools.
Note that the code that generates the 'cloupe' files is not included in this release.
The Cloupe browser uses files generated by proprietary closed-source software and is
subject to the 10x Genomics End-User License Agreement which does not allow use with
data generated from other platforms.

Therefore 'launch_universc.sh' does not support Cloupe files and you should not use them with
technologies other than 10x Genomics.  
+1 −1
@@ -93,7 +93,7 @@ fi
#####usage statement#####
help='
Usage:
  bash '$(basename $0)' --testrun -t THECHNOLOGY
  bash '$(basename $0)' --testrun -t TECHNOLOGY
  bash '$(basename $0)' -t TECHNOLOGY --setup
  bash '$(basename $0)' -R1 FILE1 -R2 FILE2 -t TECHNOLOGY -i ID -r REFERENCE [--option OPT]
  bash '$(basename $0)' -R1 READ1_LANE1 READ1_LANE2 -R2 READ2_LANE1 READ2_LANE2 -t TECHNOLOGY -i ID -r REFERENCE [--option OPT]

man/convert.sh

0 → 100644
+197 −0
.\" Manpage for UniverSC
.\" Contact tom.kelly@riken.jp to correct errors or typos.
.TH man 1 "08 April 2020" "0.3" "launch_universc.sh man page"
.SH NAME
launch_universc.sh \- single-cell processing across technologies
.SH SYNOPSIS
 bash launch_universc.sh [--version] [--help] [--setup] [-t <technology>] [-i <id>]
           [-r <reference>] [--option <OPT>]

  bash launch_universc.sh --testrun -t TECHNOLOGY
  bash launch_universc.sh -t TECHNOLOGY --setup
  bash launch_universc.sh -R1 FILE1 -R2 FILE2 -t TECHNOLOGY -i ID -r REFERENCE [--option OPT]
  bash launch_universc.sh -R1 READ1_LANE1 READ1_LANE2 -R2 READ2_LANE1 READ2_LANE2 -t TECHNOLOGY -i ID -r REFERENCE [--option OPT]
  bash launch_universc.sh -f SAMPLE_LANE -t TECHNOLOGY -i ID -r REFERENCE [--option OPT]
  bash launch_universc.sh -f SAMPLE_LANE1 SAMPLE_LANE2 -t TECHNOLOGY -i ID -r REFERENCE [--option OPT]
  bash launch_universc.sh -v
  bash launch_universc.sh -h
.SH DESCRIPTION
Provides a conversion script to run multiple technologies and custom libraries with cellranger (10x Genomics analysis tool).
.SH OPTIONS
       --testrun
            Initiates a test run with the test dataset. The technology and id must be specified.

                e.g., bash launch_universc.sh -t "10x" -i "test-10x" --testrun

  -R1, --read1 FILE
            Read 1 FASTQ file to pass to cellranger (contains the cell barcodes and umi).
            Please provide the name of FASTQ file in the working directory or the path to it.
            String must match the name of an existing file. Files can have any of the
            following extensions:

                .fastq .fq .fastq.gz .fq.gz

            Compressed files will be opened automatically. Files will be renamed for
            compatibility with cellranger:

                e.g.,  SRR1873277_R1.fastq will be renamed to SRR1873277_S1_L001_R1_001.fastq

            Names for multiple files can be given, for example multiple lanes:

                --read1 Sample_S1_L001_R1_001.fastq Sample_S1_L002_R1_001.fastq

            Apart from inDrops-v2 or inDrops-v3, all technologies expect barcodes in Read 1.

  -R2, --read2 FILE
            Read 2 FASTQ file to pass to cellranger (contains the transcript reads).
            Please provide the name of FASTQ file in the working directory or the path to it.
            String must match the name of an existing file. Files can have any of the
            following extensions:

                 .fastq .fq .fastq.gz .fq.gz

            Compressed files will be opened automatically. Files will be renamed for
            compatibility with cellranger:

                 e.g.,  SRR1873277_R2.fastq will be renamed to SRR1873277_S1_L001_R2_001.fastq

            Names for multiple files can be given, for example multiple lanes:

                 --read2 Sample_S1_L001_R2_001.fastq Sample_S1_L002_R2_001.fastq

  -f,  --file NAME
            Path and the name of FASTQ files to pass to cellranger (prefix before R1 or R2)

                e.g. /path/to/files/Example_S1_L001

            This enables giving a prefix instead of "read1" and "read2". This requires
            that there are fastq files ending with the following suffixes:

               ${NAME}_R1_001.fastq and ${NAME}_R2_001.fastq

            Automatic renaming of files and detection of file type may not work in this mode.
            Multiple inputs are still supported for multiple lanes:

                  e.g., --file Example_S1_L001 Example_S2_L002

                 for files: Example_S1_L001_R1_001.fastq Example_S2_L002_R1_001.fastq
                            Example_S1_L001_R2_001.fastq Example_S2_L002_R2_001.fastq

  -i,  --id ID
            A unique run id, used to name output folder. Must be a string that doesn't
            contain special characters or an existing filename.

  -d,  --description TEXT
            Sample description to embed in output files, passes to cellranger HTML output.

  -r,  --reference DIR
            Path of directory containing 10x-compatible reference.
            See cellranger documentation on how to generate custom "transcriptomes" or
            download human and mouse references from the 10x Genomics website.

  -t,  --technology PLATFORM
            Name of technology used to generate data.

                Supported technologies:

                                  10x Genomics (version automatically detected): 10x, chromium
                                  10x Genomics version 2 (16bp barcode, 10bp UMI): 10x-v2, chromium-v2
                                  10x Genomics version 3 (16bp barcode, 12bp UMI): 10x-v3, chromium-v3
                                  CEL-Seq (8bp barcode, 4bp UMI): celseq
                                  CEL-Seq2 (6bp UMI, 6bp barcode): celseq2
                                  Drop-Seq (12bp barcode, 8bp UMI): nadia, dropseq
                                  iCell8 version 3 (11bp barcode, 14bp UMI): icell8 or custom
                                  inDrops version 1 (19bp barcode, 8bp UMI): indrops-v1, 1cellbio-v1
                                  inDrops version 2 (19bp barcode, 8bp UMI): indrops-v2, 1cellbio-v2
                                  inDrops version 3 (8bp barcode, 6bp UMI): indrops-v3, 1cellbio-v3
                                  Quartz-Seq2 (14bp barcode, 8bp UMI): quartzseq2-384
                                  Quartz-Seq2 (15bp barcode, 8bp UMI): quartzseq2-1536
                                  Sci-Seq (8bp UMI, 10bp barcode): sciseq
                                  SCRB-Seq (6bp barcode, 10bp UMI): scrbseq, mcscrbseq
                                  SeqWell (12bp barcode, 8bp UMI): seqwell
                                  Smart-seq2-UMI, Smart-seq3 (11bp barcode, 8bp UMI): smartseq
                                  SureCell (18bp barcode, 8bp UMI): surecell, ddseq, biorad
                                Custom inputs are also supported by giving the name "custom" and length of barcode and UMI separated by "_"
                                  e.g. Custom (16bp barcode, 10bp UMI): custom_16_10

           A barcode whitelist is provided for all beads or wells for the following technologies:

                 10x Genomics, iCell8, inDrops-v2, and QuartzSeq2

            Where no known barcodes are available all possible barcodes of the expected length are
            generated and converted if the permutations have not been computed already.

  -b,  --barcodefile FILE
            Custom barcode list in plain text (with each line containing a barcode). Please provide
            the name of a text file in the working directory or the path to it.

  -c,  --chemistry CHEM
            Assay configuration, autodetection is not possible for converted files:

                SC3Pv2 (default), SC3Pv3, SC5P-PE, or SC5P-R2

            Chemistry can only be automatically detected for 10x Genomics Chromium as it relies
            on matches to a barcode whitelist. For other technologies we do not recommend changing
            the chemistry input. All samples are converted to contain the barcode and UMI in Read1
            as used for SC3Pv2. SC3Pv3 is only used for technologies with longer UMI.

  -n,  --force-cells NUM
            Force pipeline to use this number of cells, bypassing the cell detection algorithm.

  -j,  --jobmode MODE
           Job manager to use. Valid options: local (default), sge, lsf, or a .template file

           We recommend using a cluster configuration when submitting jobs to a job scheduler.
           DO NOT submit jobs in "local" mode to a Slurm, SGE, or LSF cluster, as cellranger runs
           multiple threads in parallel by default. Performance is significantly improved using a
           cluster mode with a job scheduler.

          See the cellranger documentation on how to set up a cluster mode with a template file.

       --localcores NUM
           Set max cores the pipeline may request at one time.
           Only applies when --jobmode=local.

       --localmem NUM
           Set max GB the pipeline may request at one time.
           Only applies when --jobmode=local.

       --mempercore NUM
           Set max GB each job may use at one time.
           Only applies in cluster jobmodes.

  -p,  --per-cell-data
           Generates a file with basic run statistics along with per-cell data (additional output to cellranger).
           Recommended but disabled by default due to additional runtime required to parse BAM files.
           This provides more accurate summary statistics than cellranger (which uses an average across cells
           that are filtered out).

       --setup
           Set up whitelists for compatibility with new technology. Called automatically when a new
           technology is run and no other technology is running. Recommended to run before submitting
           multiple samples of the same technology. Example:

              bash launch_universc.sh -t "dropseq" --setup

       --as-is
           Skips the FASTQ file conversion if the file already exists and run cellranger on pre-converted file.

  -h,  --help
           Prints the usage and a list of the most commonly used commands.

  -v,  --version
           Prints the version of 'convert' and the 'cellranger' version that will be called from the PATH.

       --verbose
           Print additional information to standard out for debugging purposes.

.SH SEE ALSO
cellranger
.SH BUGS
No known bugs.
.SH AUTHOR
S. Thomas Kelly (tom.kelly [at] riken.jp)
Kai Battenberg (kai.battenberg [at] riken.jp)