Commit 669fcc54 authored by Christoph Ziegenhain

archive wiki

parent b5d08f88
+34 −0
![](https://d0.awsstatic.com/logos/aws/Powered-by-Amazon-Web-Services(!).png)


If you are a user of the Amazon Cloud (Amazon Web Services), we provide a machine image (AMI) with a complete, functional zUMIs installation.

You can start your EC2 instance with zUMIs in the **US East (N. Virginia)** region (select the region in the top-right corner of the console).
If you need to use a different region, the AMI can be copied through the AWS console.

In the **AWS Console**, click on **AMIs**.
Here, switch the search bar to **Public images** and find the zUMIs image by pasting
> **ami-5c98fe23** (AMI ID)

or

> **zUMIs_25052018** (image name).

Click the **Launch** button to start the instance. When selecting the instance type, remember that STAR needs at least 30 GB of RAM to load the genome index.
Follow the launch wizard to start the instance and you are good to go!

To use the new instance, connect to it through ssh. You will get instructions from Amazon if you click on the **"Connect"** button in the EC2 Dashboard. It should look like this:
```
ssh -i "your-key.pem" ec2-user@ec2-your-instance.compute-1.amazonaws.com
```


Once logged in, you can find the zUMIs installation in the following path of the EC2 instance:
```
~/programs/zUMIs/
```

The example dataset and its output are located in:
```
~/data/example/
```
 No newline at end of file
+48 −0
zUMIs provides three main options for selecting relevant barcodes, controlled by the `-b` switch:

* automatic detection
* number of barcodes with most reads
* barcode list annotation

Here is more information on each of the modes:
## Automatic barcode detection
zUMIs infers which barcodes mark good cells from the observed sequences. To this end, we fit a k-dimensional multivariate normal distribution to the number of reads per barcode using the R package mclust, where k is determined empirically by mclust via the Bayesian Information Criterion (BIC). We reason that only the kth normal distribution, the one with the largest mean, contains barcodes that identify reads originating from intact cells. Within this kth normal distribution, we exclude all barcodes that fall in the lower 1% tail to remove spurious barcodes.

## Number of barcodes with most reads
zUMIs computes a summary of all observed barcode sequences and their read frequencies. The user-specified number of barcodes is then selected in descending order of read count.

For example: `-b 1000`
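
As a rough illustration of this selection (not zUMIs' internal code), the sketch below ranks barcodes by read count and keeps the top n. The tab-separated `barcode<TAB>reads` input and the `top_barcodes` helper are assumptions made for this example.

```shell
# Hypothetical helper: read "barcode<TAB>read count" lines from stdin and
# print the n barcodes with the most reads, highest count first.
top_barcodes() {
    local n="$1"
    sort -t $'\t' -k2,2nr -k1,1 | head -n "$n" | cut -f1
}

# Made-up counts for four barcodes; keep the two with the most reads.
printf 'GGGGCA\t120\nTATTGT\t5000\nGCACGG\t15\nCAATAA\t4300\n' | top_barcodes 2
```

With these made-up counts, the two barcodes with 5000 and 4300 reads are printed; ties are broken alphabetically by barcode, which is a choice of this sketch only.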

## Barcode annotation
If expected barcodes are known a priori, it is usually advisable to provide these.
The format should be a plain text file without headers, where each line contains the exact barcode sequence.

For instance:
> GGGGCA
>
> TATTGT
>
> GCACGG
>
> CAATAA
>
> CGCGTG

Attention: If you have specified a 6-mer barcode range (e.g. `-c 1-6`), the annotation must also contain 6-mer reference barcodes!

If you are using an additional plate barcode in zUMIs, the expected barcode list should contain the concatenated strings of all possible plate + cell barcode combinations!

`[plateBC][cellBC]`

For instance, if the cell barcodes above all share the plate barcode `CGTACTAG`:
> CGTACTAGGGGGCA
>
> CGTACTAGTATTGT
>
> CGTACTAGGCACGG
>
> CGTACTAGCAATAA
>
> CGTACTAGCGCGTG

Attention: Make sure the annotation always contains reference barcodes with correct length (sum of plate and cell barcode lengths)!
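
Such a combined list can be generated rather than typed by hand. The following is a minimal sketch, assuming one plate barcode file and one cell barcode file with a single barcode per line; the file names and the helpers `combine_barcodes` and `check_barcode_length` are hypothetical and not part of zUMIs.

```shell
# Print every plate+cell barcode combination, one per line.
#   $1: file with one plate barcode per line (hypothetical file)
#   $2: file with one cell barcode per line (hypothetical file)
combine_barcodes() {
    awk 'NR == FNR { if (NF) cell[++n] = $1; next }
         NF        { for (i = 1; i <= n; i++) print $1 cell[i] }' "$2" "$1"
}

# List offending lines and return non-zero if any barcode is not $1 bases long.
check_barcode_length() {
    awk -v len="$1" 'length($0) != len { print "line " NR ": " $0; bad = 1 }
                     END { exit bad }' "$2"
}

# Example with the plate barcode and three of the cell barcodes shown above.
workdir="$(mktemp -d)"
printf 'CGTACTAG\n' > "$workdir/plate_barcodes.txt"
printf 'GGGGCA\nTATTGT\nGCACGG\n' > "$workdir/cell_barcodes.txt"
combine_barcodes "$workdir/plate_barcodes.txt" "$workdir/cell_barcodes.txt" \
    > "$workdir/expected_barcodes.txt"

# 8 plate bases + 6 cell bases = 14 bases per expected barcode.
check_barcode_length 14 "$workdir/expected_barcodes.txt" \
    && cat "$workdir/expected_barcodes.txt"
```

The resulting file can then be passed to `-b`; the length check mirrors the warning above about the sum of plate and cell barcode lengths.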
+24 −0
## Cell Barcodes

In order to be compatible with well-based and droplet-based scRNA-seq methods, zUMIs is flexible with handling of cell barcodes.
By default, zUMIs tries to estimate the relevant barcodes from the data by finding the cluster with the highest number of reads using model-based clustering.


![](https://github.com/sdparekh/zUMIs/blob/master/ExampleData/zUMIs_output/stats/detected_BC.png)


To override automatic detection of barcodes, users can either give a fixed number of barcodes to consider (e.g. `-b 100`) or refer to a plain text file containing known expected barcodes (e.g. `-b barcodefile.txt`).
The text file should contain a single column of barcodes, without headers and without sample names:

```
ATGAAT
ATCAAA
GGAGCC
TAAGAT
AAAACT
GCGCTG
CCAACC
CTTTAA
TCATAT
TACTAT
```
 No newline at end of file
+46 −0
zUMIs is also compatible with combinatorial indexing protocols, such as sci-RNA-seq (Cao et al., 2017) and SPLiT-seq (Rosenberg et al., 2017).

However, because of their structure, these protocols need a preprocessing step before they may be used in zUMIs.

We provide Perl scripts for preprocessing this type of data.

### SPLiT-seq
SPLiT-seq contains cell barcodes that are ligated after each split/pool step during the library preparation.
Thus, the final barcode read contains the actual barcode bases interspersed with fixed ligation linkers, which should be removed before invoking zUMIs.

Use the provided `preprocess_splitseq.pl` script as follows.

Example:
`preprocess_splitseq.pl read2.fq.gz 1-18 49-56 87-94 16 read2 /your/output/dir pigz`

The input arguments are defined as follows:
- Read2 fastq file
- Range of UMI sequence + first barcode segment (Round 1 RT barcode)
- Range of second barcode segment (Round 2 Ligation Barcodes)
- Range of third barcode segment (Round 3 Ligation Barcodes)
- Number of threads to use for zipping the output fastq file
- Output file prefix, `.barcoderead.preprocess.fastq.gz` will be added
- Output directory path
- path to the pigz executable

This example call will generate `read2.barcoderead.preprocess.fastq.gz` as output.
Pass this file to zUMIs with `-f read2.barcoderead.preprocess.fastq.gz` and define the barcode ranges according to the ranges selected in the preprocessing, e.g. `-m 1-10 -c 11-34`
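
The coordinate arithmetic behind these ranges can be sketched as follows. This does not reproduce `preprocess_splitseq.pl`; it only assumes, as the example ranges suggest, that the three segments are concatenated in the order given and that the UMI occupies the first 10 bases. The helpers `extract_ranges` and `rep` and the placeholder read are hypothetical.

```shell
# Concatenate 1-based inclusive ranges (e.g. 1-18) cut from a sequence string.
extract_ranges() {
    local seq="$1" out="" range start end
    shift
    for range in "$@"; do
        start="${range%-*}"
        end="${range#*-}"
        out+="${seq:start-1:end-start+1}"
    done
    printf '%s\n' "$out"
}

# Repeat character $1 exactly $2 times (used to build a placeholder read).
rep() { printf "%${2}s" "" | tr ' ' "$1"; }

# Placeholder 94-base read: A = UMI + round 1 barcode, C = round 2 barcode,
# G = round 3 barcode, N = ligation linkers.
raw_read="$(rep A 18)$(rep N 30)$(rep C 8)$(rep N 30)$(rep G 8)"

preprocessed="$(extract_ranges "$raw_read" 1-18 49-56 87-94)"
echo "preprocessed length: ${#preprocessed}"    # 18 + 8 + 8 bases

# Matching zUMIs ranges: -m 1-10 (UMI) and -c 11-34 (cell barcode).
umi="${preprocessed:0:10}"
cell_barcode="${preprocessed:10:24}"
echo "UMI length: ${#umi}, cell barcode length: ${#cell_barcode}"
```

If different ranges are chosen in the preprocessing, the `-m` and `-c` values shift accordingly.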


### sci-RNA-seq
sci-RNA-seq contains three levels of barcodes: RT barcodes and Illumina i7 and i5 indices.
Thus, the barcode sequence is split over several reads/fastq files that should be combined prior to invoking zUMIs.

Use the provided `cat3fq.pl` script as follows.

Example:
`cat3fq.pl R1.fastq.gz I1.fastq.gz I2.fastq.gz Combined_barcodeRead.fq 16`

The input arguments are defined as follows:
- Read1 fastq file (contains RT barcode)
- Index 1 fastq file (contains i7 barcode)
- Index 2 fastq file (contains i5 barcode)
- Output file prefix, `.gz` will be added
- Number of threads to use for zipping the output fastq file

This example call will generate `Combined_barcodeRead.fq.gz` as output.
Pass this file to zUMIs with `-f Combined_barcodeRead.fq.gz` and define the barcode ranges according to the read lengths of the input files used in the preprocessing, e.g. `-m 1-8 -c 9-48`
+25 −0
zUMIs has powerful downsampling capabilities. Independent of the downsampling mode, the full data is always exported as well.

- Adaptive downsampling: Following the recommendation of the Scater package (McCarthy et al., 2017), reads are downsampled to be within 3 median absolute deviations. This is the default setting (`-d 0`).
- Downsampling to a fixed depth: Reads are downsampled to a user-specified depth. Any barcodes that do not reach the requested depth are omitted. Example: `-d 10000`
- Downsampling to a depth range: Barcodes with a read depth above the maximum of the range are downsampled to this value. All barcodes within the range are reported without downsampling, and barcodes below the minimum specified read depth are omitted. Example: `-d 10000-20000`
- Downsampling to several depths: Several depths can be requested as a comma-separated list. Combinations of fixed depths and depth ranges may be given. Example: `-d 10000,10000-20000,30000`

```
bash zUMIs-master.sh -f barcoderead.fastq -r cdnaread.fastq -n test -g hg38_STAR5idx_noGTF/ -o ./ -a Homo_sapiens.GRCh38.84.gtf -p 8 -s 0 -d 10000,10000-20000,30000 -c 1-6 -m 7-16 -l 50 -b 384
```
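
To make the fixed-depth and range rules concrete, the sketch below reports what would happen to a single barcode under each entry of a `-d` specification. It only restates the rules described above; the `describe_downsampling` helper is hypothetical, it is not how zUMIs parses the option internally, and adaptive downsampling (`-d 0`) is not modelled.

```shell
# For one barcode with $2 reads, report the outcome under each entry of the
# comma-separated specification $1 (fixed depths "N" and ranges "MIN-MAX").
describe_downsampling() {
    local spec="$1" reads="$2" item min max
    local IFS=','
    for item in $spec; do
        if [[ "$item" == *-* ]]; then
            min="${item%-*}"; max="${item#*-}"
        else
            min="$item"; max="$item"    # fixed depth: range of width zero
        fi
        if (( reads < min )); then
            echo "$item: omitted"
        elif (( reads > max )); then
            echo "$item: downsampled to $max"
        else
            echo "$item: kept at $reads"
        fi
    done
}

# A barcode with 15000 reads under the specification from the example call.
describe_downsampling 10000,10000-20000,30000 15000
```

For this barcode, the fixed depth of 10000 leads to downsampling, the range 10000-20000 leaves it untouched, and the fixed depth of 30000 omits it.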

### Output
Downsampled count tables are reported in `<StudyName>.dgecounts.rds` for each feature type (exons, introns, intron.exon). The object is a list with one entry per requested downsampling, and each entry contains "readcounts" and "umicounts".

These tables can be saved as tab-delimited text files, for example:
```
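# Load the count object, then flatten the downsampled exon tables into one named list
AllCounts <- readRDS("zUMIs_output/expression/example.dgecounts.rds")
downsamp <- unlist(AllCounts$exons$downsampled, recursive = FALSE, use.names = TRUE)

# Write each table to its own tab-delimited text file, named after its list element
for (x in names(downsamp)) {
  write.table(downsamp[[x]], file = paste0("zUMIs_output/expression/", x, ".txt"),
              sep = "\t", row.names = TRUE, col.names = TRUE)
}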
```