Commit 585f1e5a authored by TomKellyGenetics

update documentation for dual indexes

parent e08345ed
+55 −8
@@ -236,15 +236,16 @@ techniques are available it is possible to specify which to use.
#### Single and dual indexed technologies

Where needed the cell barcode can be detected in the index I1 or I2 file.
Single indexes are supported for STRT-Seq, Quartz-Seq, and RamDA-Seq.
Dual indexes are supported for inDrops-v3, SCI-RNA-Seq, scifi-seq, and Smart-Seq.
Single indexes are supported for STRT-Seq and Quartz-Seq.
Dual indexes are supported for Fluidigm C1, ICELL8 full-length, 
inDrops-v3, RamDA-Seq, SCI-RNA-Seq, scifi-seq, and Smart-Seq.
Combinatorial indexing technologies have linkers between barcodes removed
automatically to match the barcode whitelist.

#### Demultiplexing for dual-indexing

For dual-indexed technologies such as inDrops-v3, Sci-Seq, SmartSeq3 it is advised to use "bcl2fastq"
before calling UniverSC:
For dual-indexed technologies such as Fluidigm C1, inDrops-v3, Sci-Seq,
and SmartSeq3, it is advised to use "bcl2fastq" before calling UniverSC:

```
   /usr/local/bin/bcl2fastq  -v --runfolder-dir "/path/to/illumina/bcls"  --output-dir "./Data/Intensities/BaseCalls"\
@@ -254,14 +255,60 @@ before calling UniverSC:
```

Please adjust the lengths for `--use-bases-mask` accordingly for read 1, index 1 (i7), index 2 (i5), and read 2.
Ensure that `--create-fastq-for-index-reads` is used where possible. If a sequencing facility has demultiplexed
the samples for you without this, UniverSC will attempt to extract index sequences from FASTQ headers in read 1.
Ensure that `--create-fastq-for-index-reads` is used where possible.
Using `--no-lane-splitting` is optional as UniverSC can process an arbitrary number of lanes.

There is no need to specify index sequences in the sample sheet for cell barcodes; using "NNNNNNNN" will match all
samples and the cell barcodes will be distinguished by the single-cell processing pipeline. Index sequences should
only be used to demultiplex samples and replicates (not cells).

#### Missing index sequences

If a sequencing facility has demultiplexed the samples for you without creating
index FASTQ files, UniverSC will attempt to extract index sequences from FASTQ
headers in read 1. If index sequences are not stored in the file headers and
samples have already been demultiplexed, a dummy index file with the same number
of reads as R1 and R2 will be required. As a workaround, you can generate this by
copying the R1 and R2 files and replacing the sequences with the first barcode in
the relevant whitelist. For example:

```
index1="TAAGGCGA"
index2="AAGGAGTA"

# create new files
cp R1_file.fastq I1_file.fastq
cp R2_file.fastq I2_file.fastq

# replace sequences
sed -i "2~4s/^.*$/${index1}/g" I1_file.fastq
sed -i "2~4s/^.*$/${index2}/g" I2_file.fastq

# replace quality scores
sed -i "4~4s/^.*$/IIIIIIII/g" I1_file.fastq I2_file.fastq
```
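
The `2~4` address used with `sed` above is a GNU extension, and BSD/macOS `sed` also expects a suffix argument after `-i`. As a hedged alternative, the same replacement can be sketched with POSIX `awk`. The function name `make_dummy_index`, the temporary directory, and the two-read input are hypothetical and exist only to make the sketch runnable; it is not part of UniverSC.

```shell
set -eu

index1="TAAGGCGA"   # index sequence taken from the example above

# Write a dummy index FASTQ: same headers as the input, constant sequence and quality.
# $1 = input FASTQ, $2 = index sequence, $3 = output FASTQ (all names are assumptions)
make_dummy_index() {
    awk -v idx="$2" '
        BEGIN { qual = idx; gsub(/./, "I", qual) }  # one "I" per index base
        NR % 4 == 2 { print idx;  next }            # sequence line
        NR % 4 == 0 { print qual; next }            # quality line
        { print }                                   # header and "+" lines pass through
    ' "$1" > "$3"
}

# Hypothetical two-read input, only to make the sketch runnable
demo_dir=$(mktemp -d)
printf '@read1\nACGTACGTACGT\n+\nFFFFFFFFFFFF\n@read2\nTTGGCCAATTGG\n+\nFFFFFFFFFFFF\n' \
    > "${demo_dir}/R1_file.fastq"

make_dummy_index "${demo_dir}/R1_file.fastq" "${index1}" "${demo_dir}/I1_file.fastq"

# The index file should contain as many lines (and therefore reads) as R1
r1_lines=$(wc -l < "${demo_dir}/R1_file.fastq")
i1_lines=$(wc -l < "${demo_dir}/I1_file.fastq")
[ "${r1_lines}" -eq "${i1_lines}" ] && echo "read counts match"
```

Deriving the quality string from the index length keeps the sequence and quality lines the same length if a longer or shorter index is used.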

This results in a new "sample index" for each demultiplexed sample.
To combine demultiplexed samples for dual-indexed techniques use the following:

```
# for fastq files
cat Sample1_R1_file.fastq Sample2_R1_file.fastq Sample3_R1_file.fastq > Combined_R1_file.fastq
cat Sample1_R2_file.fastq Sample2_R2_file.fastq Sample3_R2_file.fastq > Combined_R2_file.fastq
cat Sample1_I1_file.fastq Sample2_I1_file.fastq Sample3_I1_file.fastq > Combined_I1_file.fastq
cat Sample1_I2_file.fastq Sample2_I2_file.fastq Sample3_I2_file.fastq > Combined_I2_file.fastq

# for compressed files (no need to uncompress)
cat Sample1_R1_file.fastq.gz Sample2_R1_file.fastq.gz Sample3_R1_file.fastq.gz > Combined_R1_file.fastq.gz
cat Sample1_R2_file.fastq.gz Sample2_R2_file.fastq.gz Sample3_R2_file.fastq.gz > Combined_R2_file.fastq.gz
cat Sample1_I1_file.fastq.gz Sample2_I1_file.fastq.gz Sample3_I1_file.fastq.gz > Combined_I1_file.fastq.gz
cat Sample1_I2_file.fastq.gz Sample2_I2_file.fastq.gz Sample3_I2_file.fastq.gz > Combined_I2_file.fastq.gz
```
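
Concatenating compressed files works because the gzip format allows several compressed members in one file, and `gzip -d` decompresses them in sequence. A minimal sketch for checking that no reads were lost when combining is given below. The `count_reads` helper, the temporary directory, and the single-read demo files are hypothetical and only make the example runnable.

```shell
set -eu

# Hypothetical single-read samples, only to make the sketch runnable
demo_dir=$(mktemp -d)
printf '@s1_read1\nACGT\n+\nIIII\n' | gzip -c > "${demo_dir}/Sample1_R1_file.fastq.gz"
printf '@s2_read1\nTTGG\n+\nIIII\n' | gzip -c > "${demo_dir}/Sample2_R1_file.fastq.gz"

# Combine without uncompressing, as in the commands above
cat "${demo_dir}/Sample1_R1_file.fastq.gz" "${demo_dir}/Sample2_R1_file.fastq.gz" \
    > "${demo_dir}/Combined_R1_file.fastq.gz"

# Count reads (4 lines per record) in a gzip-compressed FASTQ file
count_reads() {
    echo $(( $(gzip -cd "$1" | wc -l) / 4 ))
}

expected=$(( $(count_reads "${demo_dir}/Sample1_R1_file.fastq.gz") \
           + $(count_reads "${demo_dir}/Sample2_R1_file.fastq.gz") ))
combined=$(count_reads "${demo_dir}/Combined_R1_file.fastq.gz")

echo "expected ${expected} reads, found ${combined}"
```

Running the same check on the R2, I1, and I2 files confirms that all four combined files hold the same number of reads.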

As this needs to be done on a case-by-case basis, it has not been implemented in the UniverSC core functions.
We provide this workaround for using published data and data already processed by sequencing facilities.
Please contact the maintainers or file an issue on GitHub if you are having problems with this case.


#### Custom inputs

Custom inputs are also supported by giving the name "custom" and length of barcode and UMI separated by a "_" character.
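
To illustrate the naming convention only, the sketch below splits a name such as `custom_16_10` (16 bp barcode, 10 bp UMI) into its two lengths using POSIX parameter expansion. The variable names are hypothetical, and this is not necessarily how UniverSC parses the argument internally.

```shell
technology="custom_16_10"          # example name: 16 bp barcode, 10 bp UMI

lengths=${technology#custom_}      # strip the "custom_" prefix, leaving "16_10"
barcode_length=${lengths%%_*}      # text before the first "_"
umi_length=${lengths#*_}           # text after the first "_"

echo "barcode: ${barcode_length} bp, UMI: ${umi_length} bp"
```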
+33 −6
@@ -57,7 +57,7 @@
<ul>
<li>ICELL8 version 2 (11 bp barcode, No UMI): icell8-non-umi, icell8-v2</li>
<li>ICELL8 version 3 (11 bp barcode, 14 bp UMI): icell8 or custom</li>
<li>ICELL8 5′ scRNA with TCR OR kit (10bp barcode, 8 bp UMI): icell8-5-prime</li>
<li>ICELL8 5′ scRNA with TCR OR kit (10bp barcode, No UMI): icell8-5-prime</li>
<li>ICELL8 full-length scRNA with Smart-Seq (16 bp barcode, No UMI): icell8-full-length</li>
</ul></li>
<li>inDrops
@@ -106,15 +106,42 @@
<p>By default, UMIs are supported where available, with the following exceptions for non-UMI technologies: ICELL8 v2, RamDA-Seq, Quartz-Seq, Smart-Seq, Smart-Seq2. While using UMIs is recommended, we provide a mock UMI for counting reads for these technologies (and data from previous versions).</p>
<p>Other techniques can be forced to replace the UMI with a mock sequence for counting reads only with <code>--non-umi</code> or <code>--read-only</code> arguments. Forcing non-UMI techniques is <em>not recommended</em> unless you are integrating non-UMI and UMI-based technologies. It is not necessary to specify <code>--non-umi</code> for non-UMI techniques as these will be used automatically when applicable. For ICELL8 and Smart-Seq where both non-UMI (icell8-v2, smartseq2) and UMI-based (icell8-v3, smartseq3) techniques are available it is possible to specify which to use.</p>
<h4 id="single-and-dual-indexed-technologies">Single and dual indexed technologies</h4>
<p>Where needed the cell barcode can be detected in the index I1 or I2 file. Single indexes are supported for STRT-Seq, Quartz-Seq, and RamDA-Seq. Dual indexes are supported for inDrops-v3, SCI-RNA-Seq, scifi-seq, and Smart-Seq. Combinatorial indexing technologies have linkers between barcodes removed automatically to match the barcode whitelist.</p>
<p>Where needed the cell barcode can be detected in the index I1 or I2 file. Single indexes are supported for STRT-Seq and Quartz-Seq. Dual indexes are supported for Fluidigm C1, ICELL8 full-length, inDrops-v3, RamDA-Seq, SCI-RNA-Seq, scifi-seq, and Smart-Seq. Combinatorial indexing technologies have linkers between barcodes removed automatically to match the barcode whitelist.</p>
<h4 id="demultiplexing-for-dual-indexing">Demultiplexing for dual-indexing</h4>
<p>For dual-indexed technologies such as inDrops-v3, Sci-Seq, SmartSeq3 it is advised to use &quot;bcl2fastq&quot; before calling UniverSC:</p>
<p>For dual-indexed technologies such as Fluidigm C1, inDrops-v3, Sci-Seq, and SmartSeq3, it is advised to use &quot;bcl2fastq&quot; before calling UniverSC:</p>
<pre><code>   /usr/local/bin/bcl2fastq  -v --runfolder-dir &quot;/path/to/illumina/bcls&quot;  --output-dir &quot;./Data/Intensities/BaseCalls&quot;\
                                --sample-sheet &quot;/path/to/SampleSheet.csv&quot; --create-fastq-for-index-reads\
                                --use-bases-mask Y26n,I8n,I8n,Y50n  --mask-short-adapter-reads 0\
                                --minimum-trimmed-read-length 0</code></pre>
<p>Please adjust the lengths for <code>--use-bases-mask</code> accordingly for read 1, index 1 (i7), index 2 (i5), and read 2. Ensure that <code>--create-fastq-for-index-reads</code> is used where possible. If a sequencing facility has demultiplexed the samples for you without this, UniverSC will attempt to extract index sequences from FASTQ headers in read 1. Using <code>--no-lane-splitting</code> is optional as UniverSC can process an arbitrary number of lanes.</p>
<p>There is no need to specify index sequences in the sample sheet for cell barcodes; using &quot;NNNNNNNN&quot; will match all samples and the cell barcodes will be distinguished by the single-cell processing pipeline. Index sequences should only be used to demultiplex samples and replicates (not cells).</p>
<p>Please adjust the lengths for <code>--use-bases-mask</code> accordingly for read 1, index 1 (i7), index 2 (i5), and read 2. Ensure that <code>--create-fastq-for-index-reads</code> is used where possible. Using <code>--no-lane-splitting</code> is optional as UniverSC can process an arbitrary number of lanes. There is no need to specify index sequences in the sample sheet for cell barcodes; using &quot;NNNNNNNN&quot; will match all samples and the cell barcodes will be distinguished by the single-cell processing pipeline. Index sequences should only be used to demultiplex samples and replicates (not cells).</p>
<h4 id="missing-index-sequences">Missing index sequences</h4>
<p>If a sequencing facility has demultiplexed the samples for you without creating index FASTQ files, UniverSC will attempt to extract index sequences from FASTQ headers in read 1. If index sequences are not stored in the file headers and samples have already been demultiplexed, a dummy index file with the same number of reads as R1 and R2 will be required. As a workaround, you can generate this by copying the R1 and R2 files and replacing the sequences with the first barcode in the relevant whitelist. For example:</p>
<pre><code>index1=&quot;TAAGGCGA&quot;
index2=&quot;AAGGAGTA&quot;

# create new files
cp R1_file.fastq I1_file.fastq
cp R2_file.fastq I2_file.fastq

# replace sequences
sed -i &quot;2~4s/^.*$/${index1}/g&quot; I1_file.fastq
sed -i &quot;2~4s/^.*$/${index2}/g&quot; I2_file.fastq

# replace quality scores
sed -i &quot;4~4s/^.*$/IIIIIIII/g&quot; I1_file.fastq I2_file.fastq</code></pre>
<p>This results in a new &quot;sample index&quot; for each demultiplexed sample. To combine demultiplexed samples for dual-indexed techniques use the following:</p>
<pre><code># for fastq files
cat Sample1_R1_file.fastq Sample2_R1_file.fastq Sample3_R1_file.fastq &gt; Combined_R1_file.fastq
cat Sample1_R2_file.fastq Sample2_R2_file.fastq Sample3_R2_file.fastq &gt; Combined_R2_file.fastq
cat Sample1_I1_file.fastq Sample2_I1_file.fastq Sample3_I1_file.fastq &gt; Combined_I1_file.fastq
cat Sample1_I2_file.fastq Sample2_I2_file.fastq Sample3_I2_file.fastq &gt; Combined_I2_file.fastq

# for compressed files (no need to uncompress)
cat Sample1_R1_file.fastq.gz Sample2_R1_file.fastq.gz Sample3_R1_file.fastq.gz &gt; Combined_R1_file.fastq.gz
cat Sample1_R2_file.fastq.gz Sample2_R2_file.fastq.gz Sample3_R2_file.fastq.gz &gt; Combined_R2_file.fastq.gz
cat Sample1_I1_file.fastq.gz Sample2_I1_file.fastq.gz Sample3_I1_file.fastq.gz &gt; Combined_I1_file.fastq.gz
cat Sample1_I2_file.fastq.gz Sample2_I2_file.fastq.gz Sample3_I2_file.fastq.gz &gt; Combined_I2_file.fastq.gz</code></pre>
<p>As this needs to be done on a case-by-case basis, it has not been implemented in the UniverSC core functions. We provide this workaround for using published data and data already processed by sequencing facilities. Please contact the maintainers or file an issue on GitHub if you are having problems with this case.</p>
<h4 id="custom-inputs">Custom inputs</h4>
<p>Custom inputs are also supported by giving the name &quot;custom&quot; and length of barcode and UMI separated by a &quot;_&quot; character.</p>
<p>e.g. Custom (16bp barcode, 10bp UMI): <code>custom_16_10</code></p>
@@ -516,7 +543,7 @@ Mandatory arguments to long options are mandatory for short options too.
                                  Drop-Seq (12 bp barcode, 8 bp UMI): dropseq
                                  ICELL8 version 2 (11 bp barcode, No UMI): icell8-non-umi, icell8-v2
                                  ICELL8 version 3 (11 bp barcode, 14 bp UMI): icell8 or custom
                                  ICELL8 5′ scRNA with TCR OR kit (10bp barcode, 8 bp UMI): icell8-5-prime
                                  ICELL8 5′ scRNA with TCR OR kit (10bp barcode, No UMI): icell8-5-prime
                                  ICELL8 full-length scRNA with Smart-Seq (16 bp barcode, No UMI): icell8-full-length
                                  inDrops version 1 (19 bp barcode, 6 bp UMI): indrops-v1, 1cellbio-v1
                                  inDrops version 2 (19 bp barcode, 6 bp UMI): indrops-v2, 1cellbio-v2
+58 −11
@@ -6,7 +6,7 @@ affiliations:
   index: 1
 - name: "RIKEN Center for Sustainable Resource Sciences, Suehiro-cho-1-7-22, Tsurumi Ward, Yokohama, Kanagawa 230-0045, Japan"
   index: 2
date: "Thursday 06 May 2021"
date: "Wednesday 12 May 2021"
output:
  prettydoc::html_pretty:
       theme: cayman
@@ -47,8 +47,8 @@ tags:
![GitHub issues](https://img.shields.io/github/issues/minoda-lab/universc)
![GitHub pull requests](https://img.shields.io/github/issues-pr/minoda-lab/universc)

[![HitCount](http://hits.dwyl.com/minoda-lab/universc.svg)](http://hits.dwyl.com/minoda-lab/universc)
[![HitCount](http://hits.dwyl.com/tomkellygenetics/universc.svg)](http://hits.dwyl.com/tomkellygenetics/universc)
[![GitHub Views](http://hits.dwyl.com/minoda-lab/universc.svg)](http://hits.dwyl.com/minoda-lab/universc)
[![GitHub Views](http://hits.dwyl.com/tomkellygenetics/universc.svg)](http://hits.dwyl.com/tomkellygenetics/universc)
![GitHub search hit counter](https://img.shields.io/github/search/minoda-lab/universc/master)
![GitHub forks](https://img.shields.io/github/forks/minoda-lab/universc?style=social)
![GitHub Repo stars](https://img.shields.io/github/stars/minoda-lab/universc?style=social)
@@ -236,15 +236,16 @@ techniques are available it is possible to specify which to use.
#### Single and dual indexed technologies

Where needed the cell barcode can be detected in the index I1 or I2 file.
Single indexes are supported for STRT-Seq, Quartz-Seq, and RamDA-Seq.
Dual indexes are supported for inDrops-v3, SCI-RNA-Seq, scifi-seq, and Smart-Seq.
Single indexes are supported for STRT-Seq and Quartz-Seq.
Dual indexes are supported for Fluidigm C1, ICELL8 full-length, 
inDrops-v3, RamDA-Seq, SCI-RNA-Seq, scifi-seq, and Smart-Seq.
Combinatorial indexing technologies have linkers between barcodes removed
automatically to match the barcode whitelist.

#### Demultiplexing for dual-indexing

For dual-indexed technologies such as inDrops-v3, Sci-Seq, SmartSeq3 it is advised to use "bcl2fastq"
before calling UniverSC:
For dual-indexed technologies such as Fluidigm C1, inDrops-v3, Sci-Seq,
and SmartSeq3, it is advised to use "bcl2fastq" before calling UniverSC:

```
   /usr/local/bin/bcl2fastq  -v --runfolder-dir "/path/to/illumina/bcls"  --output-dir "./Data/Intensities/BaseCalls"\
@@ -254,14 +255,60 @@ before calling UniverSC:
```

Please adjust the lengths for `--use-bases-mask` accordingly for read 1, index 1 (i7), index 2 (i5), and read 2.
Ensure that `--create-fastq-for-index-reads` is used where possible. If a sequencing facility has demultiplexed
the samples for you without this, UniverSC will attempt to extract index sequences from FASTQ headers in read 1.
Ensure that `--create-fastq-for-index-reads` is used where possible.
Using `--no-lane-splitting` is optional as UniverSC can process an arbitrary number of lanes.

There is no need to specify index sequences in the sample sheet for cell barcodes; using "NNNNNNNN" will match all
samples and the cell barcodes will be distinguished by the single-cell processing pipeline. Index sequences should
only be used to demultiplex samples and replicates (not cells).

#### Missing index sequences

If a sequencing facility has demultiplexed the samples for you without creating
index FASTQ files, UniverSC will attempt to extract index sequences from FASTQ
headers in read 1. If index sequences are not stored in the file headers and
samples have already been demultiplexed, a dummy index file with the same number
of reads as R1 and R2 will be required. As a workaround, you can generate this by
copying the R1 and R2 files and replacing the sequences with the first barcode in
the relevant whitelist. For example:

```
index1="TAAGGCGA"
index2="AAGGAGTA"

# create new files
cp R1_file.fastq I1_file.fastq
cp R2_file.fastq I2_file.fastq

# replace sequences
sed -i "2~4s/^.*$/${index1}/g" I1_file.fastq
sed -i "2~4s/^.*$/${index2}/g" I2_file.fastq

# replace quality scores
sed -i "4~4s/^.*$/IIIIIIII/g" I1_file.fastq I2_file.fastq
```

This results in a new "sample index" for each demultiplexed sample.
To combine demultiplexed samples for dual-indexed techniques use the following:

```
# for fastq files
cat Sample1_R1_file.fastq Sample2_R1_file.fastq Sample3_R1_file.fastq > Combined_R1_file.fastq
cat Sample1_R2_file.fastq Sample2_R2_file.fastq Sample3_R2_file.fastq > Combined_R2_file.fastq
cat Sample1_I1_file.fastq Sample2_I1_file.fastq Sample3_I1_file.fastq > Combined_I1_file.fastq
cat Sample1_I2_file.fastq Sample2_I2_file.fastq Sample3_I2_file.fastq > Combined_I2_file.fastq

# for compressed files (no need to uncompress)
cat Sample1_R1_file.fastq.gz Sample2_R1_file.fastq.gz Sample3_R1_file.fastq.gz > Combined_R1_file.fastq.gz
cat Sample1_R2_file.fastq.gz Sample2_R2_file.fastq.gz Sample3_R2_file.fastq.gz > Combined_R2_file.fastq.gz
cat Sample1_I1_file.fastq.gz Sample2_I1_file.fastq.gz Sample3_I1_file.fastq.gz > Combined_I1_file.fastq.gz
cat Sample1_I2_file.fastq.gz Sample2_I2_file.fastq.gz Sample3_I2_file.fastq.gz > Combined_I2_file.fastq.gz
```

As this needs to be done on a case-by-case basis, it has not been implemented in the UniverSC core functions.
We provide this workaround for using published data and data already processed by sequencing facilities.
Please contact the maintainers or file an issue on GitHub if you are having problems with this case.


#### Custom inputs

Custom inputs are also supported by giving the name "custom" and length of barcode and UMI separated by a "_" character.
+2 −1
@@ -249,7 +249,8 @@ Provides a conversion script to run multiple technologies and custom libraries w

            The following technologies require Index 1 or Index 2 sequences (see above):

                  inDrops-v3,  SCI-RNA-Seq, SCI-RNA-Seq3, scifi-seq, Smart-Seq2, Smart-Seq3, STRT-Seq-2i, STRT-Seq-C1
                  Fluidigm C1, ICELL8 full-length, inDrops-v3, Quartz-Seq, RamDA-Seq,
                  SCI-RNA-Seq, SCI-RNA-Seq3, scifi-seq, Smart-Seq2, Smart-Seq3, STRT-Seq-2i, STRT-Seq-C1


  -b,  --barcodefile FILE