Commit 1aa89fac authored by Weipeng Mo's avatar Weipeng Mo
Browse files

update README

parent 4c8bd083
Loading
Loading
Loading
Loading
+158 −1
Original line number Diff line number Diff line
# FLEP-seq Preprocessing Pipeline
# FLEP-seq Pipeline

FLEP-seq (full-length elongating and polyadenylated RNA sequencing) allows calculation of the kinetics of cotranscriptional splicing and detects polyadenylated transcripts with unspliced introns retained at specific positions posttranscriptionally.

This pipeline includes basecalling and preprocessing of the FLEP-seq data. 

## Software and package requirements

* Minimap2 (https://github.com/lh3/minimap2)
* SAMtools (http://www.htslib.org/)
* BLAST+ (https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/) (require for Nanopore data analysis)
* Python 3.7 or above, and following packages:
  * Pysam (https://github.com/pysam-developers/pysam)
  * ont_fast5_api (https://github.com/nanoporetech/ont_fast5_api)
  * pandas (https://pandas.pydata.org/)
  * NumPy (https://numpy.org/)
  * Matplotlib (https://matplotlib.org/)
  * Joblib (https://github.com/joblib/joblib)
  * click (https://click.palletsprojects.com/en/7.x/)
* R 3.5.2 or above, and following packages:
  * Tidyverse (https://www.tidyverse.org/)
  * optparse (https://cran.r-project.org/web/packages/optparse/index.html)


## Inputs

| File format | Information contained in file | File description | 
@@ -18,3 +36,142 @@ This pipeline includes basecalling and preprocessing of the FLEP-seq data.
| *.read.info.txt | long-read fastq file | Generated by `merge_read_info.R` |
| *.read.splicing_kinetics.pdf | splicing kinetics | Generated by `plot_intron_splicing_kinetics.R` |

## Step by step workflow

![](./pipeline.svg)

1.	Nanopore basecalling

You can use MinKNOW to perform real-time basecalling while sequencing, or use the GPU version of Guppy to speed up basecalling after sequencing. Both MinKNOW and Guppy are available via Nanopore community site (https://community.nanoporetech.com). Command-line for running Guppy basecalling is as follow:

```
$ guppy_basecaller -i raw_fast5_dir -s out_fast5_dir -c dna_r9.4.1_450bps_hac.cfg --recursive --fast5_out --disable_pings --qscore_filtering --device "cuda:all:100%"
```

2.	Convert FASTQ files to FASTA format

```
$ python fastqdir2fasta.py --indir path/to/fastq_pass --out all.fasta
```

3.	Use minimap2 to map reads to reference genome

```
$ minimap2 -ax splice --secondary=no genome.fasta all.fasta > tmp.sam
```

CAUTION: For the organisms with short introns, such as Arabidopsis, it might be better to use the parameter “-G” to set the max intron length, for example, “-G 12000”. You also can set “-t number_of_threads” to use more threads to speed up.

```
$ samtools sort -o mapped.bam tmp.sam
$ samtools index mapped.bam
$ rm tmp.sam
```

4.	(Optional) Remove rRNA and tRNA derived reads

```
$ python filter_rRNA_bam.py --inbam mapped.bam --inbed rRNAtRNAetc.bed --out clean.bam
$ samtools index clean.bam
```

5.	Find 3’ adapter in reads

```
$ python adapterFinder.py --inbam clean.bam --inseq all.fasta --out adapter.result.txt --threads 36
```

6.	Identify polyA tail and estimate its length

```
$ python PolyACaller.py --inadapter adapter.result.txt --summary sequencing_summary.txt --fast5dir fast5_dir --out polyA_tail.result.txt --threads 36
```

7.	Extract read information

This pipeline will produce a table containing intron retention information and Pol II position.

```
$ python extract_read_info.py --inbam clean.bam --inbed lib/exon_intron_pos.repr.bed --out read_info.result.txt
```

8.	Merge the above analysis results

```
$ Rscript merge_read_info.R --type Nanopore --inreadinfo read_info.result.txt --inadapter adapter.result.txt --inpolya polyA_tail.result.txt --out read.info.txt
```

9.	Analyze splicing kinetics

```
$ python prepare_data_for_splice_kinetics.py --inreadinfo read.info.txt --inbed lib/exon_intron_pos.repr.bed --out read.intron.pos.splicing.txt
$ Rscript plot_intron_splicing_kinetics.R --inrelpos read.intron.pos.splicing.txt --inreadinfo read.info.txt --inintron lib/select_introns.txt --out read.splicing_kinetics.txt --pdf read.splicing_kinetics.pdf 
```

10.	Calculate intron retention ratio of polyadenylated transcripts

```
$ Rscript cal_polya_transcript_ir.R --inrelpos read.intron.pos.splicing.txt --inreadinfo read.info.txt --outrna mRNA.incompletely_spliced_ratio.txt --outintron intron.unspliced_ratio.txt
```

## File details

```
.
├── aligned_data
│   ├── flep_seq_demo.adapter.result.txt
│   ├── flep_seq_demo.polyA_tail.result.txt
│   ├── flep_seq_demo.read_info.result.txt
│   ├── flep_seq_demo.sorted.bam
│   └── flep_seq_demo.sorted.bam.bai
├── basecalled_data
│   └── flep_seq_demo.fasta
├── config.yml
├── genome_data
│   ├── exon_intron_pos.repr.bed
│   ├── rRNAtRNAetc.bed
│   └── select_introns.txt
├── guppy_out
├── pipeline.svg
├── README.md
├── requirements.txt
├── results
│   ├── flep_seq_demo.read.info.txt
│   ├── flep_seq_demo.read.intron.ir.stat
│   ├── flep_seq_demo.read.intron.pos.splicing.txt
│   ├── flep_seq_demo.read.rna.ir.stat
│   ├── flep_seq_demo.read.splicing_kinetics.pdf
│   └── flep_seq_demo.read.splicing_kinetics.txt
├── script
│   ├── adapterFinder.py
│   ├── ASCaller.py
│   ├── cal_polya_transcript_ir.R
│   ├── extract_read_info.generate_seq.py
│   ├── extract_read_info.py
│   ├── fastqdir2fasta.py
│   ├── filter_rRNA_bam.py
│   ├── lima_bam2fasta.py
│   ├── merge_read_info.R
│   ├── merge_sequencing_summary.py
│   ├── pacbio_find_polyA.py
│   ├── plot_intron_splicing_kinetics.R
│   ├── PolyACaller.py
│   ├── prepare_data_for_splice_kinetics.py
│   ├── run_guppy2.sh
│   ├── run_nanoplot.sh
│   ├── run_toulligqc.sh
│   └── split_files_into_dirs.py
└── snakefile
```

## Demo data

The demo data is available at figshare:

https://doi.org/10.6084/m9.figshare.20217632.v1

## References

Long, Y., Jia, J., Mo, W. et al. FLEP-seq: simultaneous detection of RNA polymerase II position, splicing status, polyadenylation site and poly(A) tail length at genome-wide scale by single-molecule nascent RNA sequencing. Nat Protoc 16, 4355–4381 (2021). https://doi.org/10.1038/s41596-021-00581-7

Jia, J., Long, Y., Zhang, H. et al. Post-transcriptional splicing of nascent RNA contributes to widespread intron retention in plants. Nat. Plants 6, 780–788 (2020). https://doi.org/10.1038/s41477-020-0688-1
 No newline at end of file

pipeline.svg

0 → 100644
+152 −0
Original line number Diff line number Diff line
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
 "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<!-- Generated by graphviz version 2.49.1 (20211004.0028)
 -->
<!-- Title: snakemake_dag Pages: 1 -->
<svg width="272pt" height="548pt"
 viewBox="0.00 0.00 271.50 548.00" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
<g id="graph0" class="graph" transform="scale(1 1) rotate(0) translate(4 544)">
<title>snakemake_dag</title>
<polygon fill="white" stroke="transparent" points="-4,4 -4,-544 267.5,-544 267.5,4 -4,4"/>
<!-- 0 -->
<g id="node1" class="node">
<title>0</title>
<path fill="none" stroke="#d88d56" stroke-width="2" d="M128.5,-36C128.5,-36 98.5,-36 98.5,-36 92.5,-36 86.5,-30 86.5,-24 86.5,-24 86.5,-12 86.5,-12 86.5,-6 92.5,0 98.5,0 98.5,0 128.5,0 128.5,0 134.5,0 140.5,-6 140.5,-12 140.5,-12 140.5,-24 140.5,-24 140.5,-30 134.5,-36 128.5,-36"/>
<text text-anchor="middle" x="113.5" y="-15.5" font-family="sans" font-size="10.00">all</text>
</g>
<!-- 1 -->
<g id="node2" class="node">
<title>1</title>
<path fill="none" stroke="#56d8a9" stroke-width="2" d="M163.5,-252C163.5,-252 115.5,-252 115.5,-252 109.5,-252 103.5,-246 103.5,-240 103.5,-240 103.5,-228 103.5,-228 103.5,-222 109.5,-216 115.5,-216 115.5,-216 163.5,-216 163.5,-216 169.5,-216 175.5,-222 175.5,-228 175.5,-228 175.5,-240 175.5,-240 175.5,-246 169.5,-252 163.5,-252"/>
<text text-anchor="middle" x="139.5" y="-231.5" font-family="sans" font-size="10.00">merge_info</text>
</g>
<!-- 1&#45;&gt;0 -->
<g id="edge1" class="edge">
<title>1&#45;&gt;0</title>
<path fill="none" stroke="grey" stroke-width="2" d="M112.16,-215.77C100.39,-206.72 87.88,-194.48 81.5,-180 61.73,-135.14 82.97,-77.87 99.04,-45.13"/>
<polygon fill="grey" stroke="grey" stroke-width="2" points="102.18,-46.68 103.61,-36.18 95.95,-43.5 102.18,-46.68"/>
</g>
<!-- 7 -->
<g id="node8" class="node">
<title>7</title>
<path fill="none" stroke="#56c9d8" stroke-width="2" d="M176.5,-180C176.5,-180 102.5,-180 102.5,-180 96.5,-180 90.5,-174 90.5,-168 90.5,-168 90.5,-156 90.5,-156 90.5,-150 96.5,-144 102.5,-144 102.5,-144 176.5,-144 176.5,-144 182.5,-144 188.5,-150 188.5,-156 188.5,-156 188.5,-168 188.5,-168 188.5,-174 182.5,-180 176.5,-180"/>
<text text-anchor="middle" x="139.5" y="-159.5" font-family="sans" font-size="10.00">splicing_kinetics</text>
</g>
<!-- 1&#45;&gt;7 -->
<g id="edge12" class="edge">
<title>1&#45;&gt;7</title>
<path fill="none" stroke="grey" stroke-width="2" d="M139.5,-215.7C139.5,-207.98 139.5,-198.71 139.5,-190.11"/>
<polygon fill="grey" stroke="grey" stroke-width="2" points="143,-190.1 139.5,-180.1 136,-190.1 143,-190.1"/>
</g>
<!-- 8 -->
<g id="node9" class="node">
<title>8</title>
<path fill="none" stroke="#afd856" stroke-width="2" d="M251.5,-108C251.5,-108 153.5,-108 153.5,-108 147.5,-108 141.5,-102 141.5,-96 141.5,-96 141.5,-84 141.5,-84 141.5,-78 147.5,-72 153.5,-72 153.5,-72 251.5,-72 251.5,-72 257.5,-72 263.5,-78 263.5,-84 263.5,-84 263.5,-96 263.5,-96 263.5,-102 257.5,-108 251.5,-108"/>
<text text-anchor="middle" x="202.5" y="-87.5" font-family="sans" font-size="10.00">intron_retention_ratio</text>
</g>
<!-- 1&#45;&gt;8 -->
<g id="edge14" class="edge">
<title>1&#45;&gt;8</title>
<path fill="none" stroke="grey" stroke-width="2" d="M166.18,-215.84C177.95,-206.73 190.66,-194.42 197.5,-180 206.6,-160.83 207.39,-136.67 206.16,-118.36"/>
<polygon fill="grey" stroke="grey" stroke-width="2" points="209.64,-117.98 205.24,-108.34 202.66,-118.61 209.64,-117.98"/>
</g>
<!-- 2 -->
<g id="node3" class="node">
<title>2</title>
<path fill="none" stroke="#5692d8" stroke-width="2" d="M99.5,-324C99.5,-324 19.5,-324 19.5,-324 13.5,-324 7.5,-318 7.5,-312 7.5,-312 7.5,-300 7.5,-300 7.5,-294 13.5,-288 19.5,-288 19.5,-288 99.5,-288 99.5,-288 105.5,-288 111.5,-294 111.5,-300 111.5,-300 111.5,-312 111.5,-312 111.5,-318 105.5,-324 99.5,-324"/>
<text text-anchor="middle" x="59.5" y="-303.5" font-family="sans" font-size="10.00">identify_read_info</text>
</g>
<!-- 2&#45;&gt;1 -->
<g id="edge4" class="edge">
<title>2&#45;&gt;1</title>
<path fill="none" stroke="grey" stroke-width="2" d="M79.28,-287.7C89.25,-278.97 101.51,-268.24 112.36,-258.75"/>
<polygon fill="grey" stroke="grey" stroke-width="2" points="114.73,-261.32 119.95,-252.1 110.12,-256.06 114.73,-261.32"/>
</g>
<!-- 3 -->
<g id="node4" class="node">
<title>3</title>
<path fill="none" stroke="#d8cb56" stroke-width="2" d="M107,-468C107,-468 12,-468 12,-468 6,-468 0,-462 0,-456 0,-456 0,-444 0,-444 0,-438 6,-432 12,-432 12,-432 107,-432 107,-432 113,-432 119,-438 119,-444 119,-444 119,-456 119,-456 119,-462 113,-468 107,-468"/>
<text text-anchor="middle" x="59.5" y="-447.5" font-family="sans" font-size="10.00">mapping_to_genome</text>
</g>
<!-- 3&#45;&gt;2 -->
<g id="edge7" class="edge">
<title>3&#45;&gt;2</title>
<path fill="none" stroke="grey" stroke-width="2" d="M59.5,-431.87C59.5,-407.67 59.5,-363.21 59.5,-334.39"/>
<polygon fill="grey" stroke="grey" stroke-width="2" points="63,-334.19 59.5,-324.19 56,-334.19 63,-334.19"/>
</g>
<!-- 5 -->
<g id="node6" class="node">
<title>5</title>
<path fill="none" stroke="#56d873" stroke-width="2" d="M168,-396C168,-396 119,-396 119,-396 113,-396 107,-390 107,-384 107,-384 107,-372 107,-372 107,-366 113,-360 119,-360 119,-360 168,-360 168,-360 174,-360 180,-366 180,-372 180,-372 180,-384 180,-384 180,-390 174,-396 168,-396"/>
<text text-anchor="middle" x="143.5" y="-375.5" font-family="sans" font-size="10.00">find_3linker</text>
</g>
<!-- 3&#45;&gt;5 -->
<g id="edge9" class="edge">
<title>3&#45;&gt;5</title>
<path fill="none" stroke="grey" stroke-width="2" d="M80.26,-431.7C90.74,-422.97 103.61,-412.24 115,-402.75"/>
<polygon fill="grey" stroke="grey" stroke-width="2" points="117.53,-405.19 122.97,-396.1 113.05,-399.82 117.53,-405.19"/>
</g>
<!-- 4 -->
<g id="node5" class="node">
<title>4</title>
<path fill="none" stroke="#70d856" stroke-width="2" d="M157.5,-540C157.5,-540 49.5,-540 49.5,-540 43.5,-540 37.5,-534 37.5,-528 37.5,-528 37.5,-516 37.5,-516 37.5,-510 43.5,-504 49.5,-504 49.5,-504 157.5,-504 157.5,-504 163.5,-504 169.5,-510 169.5,-516 169.5,-516 169.5,-528 169.5,-528 169.5,-534 163.5,-540 157.5,-540"/>
<text text-anchor="middle" x="103.5" y="-525" font-family="sans" font-size="10.00">fastq_to_fasta</text>
<text text-anchor="middle" x="103.5" y="-514" font-family="sans" font-size="10.00">sample: flep_seq_demo</text>
</g>
<!-- 4&#45;&gt;3 -->
<g id="edge8" class="edge">
<title>4&#45;&gt;3</title>
<path fill="none" stroke="grey" stroke-width="2" d="M92.62,-503.7C87.51,-495.56 81.3,-485.69 75.66,-476.7"/>
<polygon fill="grey" stroke="grey" stroke-width="2" points="78.54,-474.71 70.25,-468.1 72.61,-478.43 78.54,-474.71"/>
</g>
<!-- 4&#45;&gt;5 -->
<g id="edge10" class="edge">
<title>4&#45;&gt;5</title>
<path fill="none" stroke="grey" stroke-width="2" d="M112.62,-503.76C117.66,-493.6 123.66,-480.35 127.5,-468 133.82,-447.67 137.94,-423.9 140.41,-406.04"/>
<polygon fill="grey" stroke="grey" stroke-width="2" points="143.89,-406.4 141.71,-396.03 136.95,-405.49 143.89,-406.4"/>
</g>
<!-- 5&#45;&gt;1 -->
<g id="edge5" class="edge">
<title>5&#45;&gt;1</title>
<path fill="none" stroke="grey" stroke-width="2" d="M143.02,-359.87C142.34,-335.67 141.08,-291.21 140.27,-262.39"/>
<polygon fill="grey" stroke="grey" stroke-width="2" points="143.76,-262.09 139.98,-252.19 136.77,-262.28 143.76,-262.09"/>
</g>
<!-- 6 -->
<g id="node7" class="node">
<title>6</title>
<path fill="none" stroke="#d85656" stroke-width="2" d="M227,-324C227,-324 180,-324 180,-324 174,-324 168,-318 168,-312 168,-312 168,-300 168,-300 168,-294 174,-288 180,-288 180,-288 227,-288 227,-288 233,-288 239,-294 239,-300 239,-300 239,-312 239,-312 239,-318 233,-324 227,-324"/>
<text text-anchor="middle" x="203.5" y="-303.5" font-family="sans" font-size="10.00">polyacaller</text>
</g>
<!-- 5&#45;&gt;6 -->
<g id="edge11" class="edge">
<title>5&#45;&gt;6</title>
<path fill="none" stroke="grey" stroke-width="2" d="M158.33,-359.7C165.52,-351.3 174.3,-341.07 182.19,-331.86"/>
<polygon fill="grey" stroke="grey" stroke-width="2" points="184.99,-333.97 188.84,-324.1 179.67,-329.42 184.99,-333.97"/>
</g>
<!-- 6&#45;&gt;1 -->
<g id="edge6" class="edge">
<title>6&#45;&gt;1</title>
<path fill="none" stroke="grey" stroke-width="2" d="M187.68,-287.7C179.93,-279.22 170.46,-268.86 161.98,-259.58"/>
<polygon fill="grey" stroke="grey" stroke-width="2" points="164.47,-257.12 155.14,-252.1 159.3,-261.85 164.47,-257.12"/>
</g>
<!-- 7&#45;&gt;0 -->
<g id="edge2" class="edge">
<title>7&#45;&gt;0</title>
<path fill="none" stroke="grey" stroke-width="2" d="M136.36,-143.87C131.93,-119.67 123.79,-75.21 118.52,-46.39"/>
<polygon fill="grey" stroke="grey" stroke-width="2" points="121.89,-45.4 116.65,-36.19 115.01,-46.66 121.89,-45.4"/>
</g>
<!-- 7&#45;&gt;8 -->
<g id="edge13" class="edge">
<title>7&#45;&gt;8</title>
<path fill="none" stroke="grey" stroke-width="2" d="M155.07,-143.7C162.7,-135.22 172.02,-124.86 180.38,-115.58"/>
<polygon fill="grey" stroke="grey" stroke-width="2" points="183.02,-117.88 187.11,-108.1 177.81,-113.2 183.02,-117.88"/>
</g>
<!-- 8&#45;&gt;0 -->
<g id="edge3" class="edge">
<title>8&#45;&gt;0</title>
<path fill="none" stroke="grey" stroke-width="2" d="M180.5,-71.7C169.29,-62.88 155.5,-52.03 143.35,-42.47"/>
<polygon fill="grey" stroke="grey" stroke-width="2" points="145.27,-39.54 135.25,-36.1 140.94,-45.04 145.27,-39.54"/>
</g>
</g>
</svg>