Commit c3739133 (unverified), authored by Sai Ma, committed by GitHub

Update reprocess_deposited_data.md

parent e8d90dc3
+6 −1
@@ -4,10 +4,15 @@ The SHARE-seqV2 alignment pipeline generates a pair of fastqs for each sample. T
We have deposited SHARE-seqV2 data on [GEO](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE207308)
As an example, we can download the species mixing ATAC SRA record with `fasterq-dump -A SRR19912835 --split-files -p`.
This will generate two fastq files: SRR19912835_1.fastq and SRR19912835_2.fastq

The read header looks like:\
head -4 SRR19912835_1.fastq\
@SRR19912835.1 A01389:111:H2Y5KDMXY:1:1101:1127:1000_R1.003,R2.032,R3.081,P1.06 length=50\
GGGCTACACAGAGAAACCCTGTCTCGAAAAACAAACAAAACAAAACAAAA\
+SRR19912835.1 A01389:111:H2Y5KDMXY:1:1101:1127:1000_R1.003,R2.032,R3.081,P1.06 length=50\
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:

Run these lines to modify the header and convert the format to be compatible with the SHARE-seq pipeline.
cat SRR19912835_1.fastq | awk '{if(NR%4==1) print "@"$2; else if(NR%4==2) print; else if(NR%4==3) print "+"; else if(NR%4==0) print $0}' | bgzip > speciesmix.ATAC.R1.fastq.gz
cat SRR19912835_2.fastq | awk '{if(NR%4==1) print "@"$2; else if(NR%4==2) print; else if(NR%4==3) print "+"; else if(NR%4==0) print $0}' | bgzip > speciesmix.ATAC.R2.fastq.gz
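
The header conversion can also be wrapped in a small shell function and checked on a single record before running it on the full files. The sketch below is only an illustration: the function name `reformat_fastq` and the variable `sample_out` are hypothetical and not part of the SHARE-seq pipeline, and it assumes each FASTQ record spans exactly four lines, as in the `fasterq-dump` output shown above.

```shell
# Hypothetical helper (not part of the SHARE-seq pipeline).
# Reads a FASTQ with 4 lines per record on stdin, writes the converted FASTQ to stdout.
reformat_fastq() {
    awk '{
        if (NR % 4 == 1)      print "@" $2   # keep only the second header field (read name plus barcodes)
        else if (NR % 4 == 2) print          # sequence line, unchanged
        else if (NR % 4 == 3) print "+"      # bare separator line
        else                  print          # quality line, unchanged
    }'
}

# Self-check on the example record from SRR19912835_1.fastq shown above.
sample_out=$(printf '%s\n' \
    '@SRR19912835.1 A01389:111:H2Y5KDMXY:1:1101:1127:1000_R1.003,R2.032,R3.081,P1.06 length=50' \
    'GGGCTACACAGAGAAACCCTGTCTCGAAAAACAAACAAAACAAAACAAAA' \
    '+SRR19912835.1 A01389:111:H2Y5KDMXY:1:1101:1127:1000_R1.003,R2.032,R3.081,P1.06 length=50' \
    'FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:' \
    | reformat_fastq)
printf '%s\n' "$sample_out"
```

If the printed record looks right (four lines, a header starting with `@A01389`, and a bare `+` separator), the same function could be applied to the real files, for example `reformat_fastq < SRR19912835_1.fastq | bgzip > speciesmix.ATAC.R1.fastq.gz`.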