Initial Commit v0.9 (bb8570e2) · Commits · Chaos / chaos.juicer

AWS/README.md

0 → 100755

+52 −0

Original line number	Diff line number	Diff line
		#Quick Start

		We've provided a step-by-step guide to showcase some of the features of
		Juicer. If you run into problems, see below for more detailed documentation.
		This example runs on Amazon Web Services, but you can install the pipeline
		on any LSF, Univa Grid Engine, or SLURM cluster.

		1. Make sure you're in the top-level directory, with the Juicer_AWS.pem file.
		2. You were given an anonymous IP address. At a command line prompt, type:
		ssh -i Juicer_AWS.pem ubuntu@<given IP address>
		3. This will log you into an AWS instance that contains all the software
		needed to run the pipeline. Type
		cd /opt/juicer/work/
		4. We will run the pipeline on a test dataset of a single chromosome of the primary+
		replicate map from (Rao+Huntley et al., 2014). Type:
		cd MBR19
		5. Run the Juicer pipeline on the raw data, which is stored in the fastq
		directory:
		/opt/juicer/scripts/juicer.sh -g hg19 -s MboI
		6. You will see a series of messages sending jobs to the cluster. Do not
		kill the script or close the server connection until you see:
		“(-: Finished adding all jobs... please wait while processing.”
		7. At this point you can close the connection and come back later.
		To see the progress of the pipeline as it works, type:
		bjobs -w
		7. Eventually the bjobs command will report “No unfinished job found”. Type:
		tail lsf.out
		You should see “(-: Pipeline successfully completed (-:”
		8. Results are available in the aligned directory. The Hi-C maps are in
		inter.hic (for MAPQ > 0) and inter_30.hic (for MAPQ >= 30). The Hi-C maps
		can be loaded in Juicebox and explored. They can also be used for
		automatic feature annotation and to extract matrices at specific
		resolutions.
		These results also include automatic feature annotation. The output files include
		a genome-wide annotation of loops and, whenever possible, the CTCF motifs that anchor
		them (identified using the HiCCUPS algorithm). The files also include a genome-wide
		annotation of contact domains (identified using the Arrowhead algorithm). The formats
		of these files are described in the Juicebox tutorial online; both files can be loaded
		into Juicebox as a 2D annotation.
		9. To download a file (e.g. inter.hic) from AWS to load into Juicebox, type:
		sftp -i Juicer_AWS.pem ubuntu@<given IP address>
		cd /opt/juicer/work/MBR19/aligned
		get inter.hic
		get inter_30.hic
		get ... (each of hiccups, apa, motifs, and arrowhead output files)
		10. You can also run the pipeline on genome-wide dataset that is lower resolution. Type
		cd /opt/juicer/work/HIC003
		Then
		/opt/juicer/scripts/juicer.sh -g hg19 -s MboI
		Again the pipeline will run. The results will be available in the aligned directory.
		Because this is not a deeply sequenced map, loop lists and domain lists will not be
		produced.

AWS/scripts/LibraryComplexity.class

0 → 100755

+4.55 KiB

File added.

No diff preview for this file type.

View file

AWS/scripts/check.sh

0 → 100755

+87 −0

Original line number	Diff line number	Diff line
		#!/bin/bash

		##########
		#The MIT License (MIT)
		#
		# Copyright (c) 2015 Aiden Lab
		#
		# Permission is hereby granted, free of charge, to any person obtaining a copy
		# of this software and associated documentation files (the "Software"), to deal
		# in the Software without restriction, including without limitation the rights
		# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
		# copies of the Software, and to permit persons to whom the Software is
		# furnished to do so, subject to the following conditions:
		#
		# The above copyright notice and this permission notice shall be included in
		# all copies or substantial portions of the Software.
		#
		# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
		# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
		# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
		# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
		# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
		# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
		# THE SOFTWARE.
		##########
		#
		# Sanity check once pipeline is complete to be sure number of lines adds up, deduping
		# proceeded appropriately, and that files were created

		if [ -z $ARG1 ]
		then
		# Start by checking the statistics to see if they add up
		res1=`cat ${splitdir}/normres*`
		check1=`cat ${splitdir}/normres* \| awk '{s2+=$2; s3+=$3; s4+=$4; s5+=$5; s6+=$6}END{if (s2 != s3+s4+s5+s6){print 0}else{print 1}}'`
		if [ $check1 -eq 0 ] \|\| [ -z "$res1" ]
		then
		echo "***! Error! The statistics do not add up. Alignment likely failed to complete on one or more files. Run relaunch_prep.sh"
		exit 100
		fi
		fi

		# Check the sizes of merged_sort versus the dups/no dups files to be sure
		# no reads were lost
		total=1
		total2=0
		total=`ls -l ${outputdir}/merged_sort.txt \| awk '{print $5}'`
		total2=`ls -l ${outputdir}/merged_nodups.txt ${outputdir}/dups.txt ${outputdir}/opt_dups.txt \| awk '{sum = sum + $5}END{print sum}'`

		if [ -z $total ] \|\| [ -z $total2 ] \|\| [ $total -ne $total2 ]
		then
		echo "***! Error! The sorted file and dups/no dups files do not add up, or were empty. Merge or dedupping likely failed, restart pipeline with -S merge or -S dedup"
		exit 100
		fi

		wctotal=`cat ${splitdir}/*_linecount.txt \| awk '{sum+=$1}END{print sum/4}'`
		check2=`cat ${splitdir}/normres* \| awk '{s2+=$2;}END{print s2}'`
		check4=`cat ${splitdir}/normres* \| awk '{s4+=$4;}END{print s4}'`

		if [ -z $ARG1 ]
		then
		if [ $wctotal -ne $check2 ]
		then
		echo "***! Error! The number of reads in the fastqs (${wctotal}) is not the same as the number of reads reported in the stats (${check2}), likely due to a failure during alignment"
		exit 100
		fi
		else
		# Alternate aligner check
		unmapamb=`awk 'NR%4==1{split($1,a,"_");split(a[1],b,"@"); c[b[2]]++}END{print length(c)}' ${splitdir}/*.fq`
		calctotal=`expr $check4 + $unmapamb`
		if [ $wctotal -ne $calctotal ]
		then
		echo "***! Error! The number of reads in the fastqs (${wctotal}) is not the same as the number of alignable reads reported in the stats (${check4}) plus unmapped/ambiguous (${unmapamb}), likely due to a failure during alignment. Run relaunch_prep.sh"
		exit 100
		else
		echo "Total: $wctotal"
		echo "Unalignable: $unmapamb"
		fi
		fi

		if [ -f ${outputdir}/inter.hic ] && [ -s ${outputdir}/inter.hic ] && [ -f ${outputdir}/inter_30.hic ] && [ -s ${outputdir}/inter_30.hic ]
		then
		echo "(-: Pipeline successfully completed (-:";
		echo "Run cleanup.sh to remove the splits directory";
		else
		echo "***! Error! Either inter.hic or inter_30.hic were not created"
		exit 100
		fi

AWS/scripts/chimeric_blacklist.awk

0 → 100755

+501 −0

Original line number	Diff line number	Diff line
		#!/usr/bin/awk -f
		##########
		#The MIT License (MIT)
		#
		# Copyright (c) 2015 Aiden Lab
		#
		# Permission is hereby granted, free of charge, to any person obtaining a copy
		# of this software and associated documentation files (the "Software"), to deal
		# in the Software without restriction, including without limitation the rights
		# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
		# copies of the Software, and to permit persons to whom the Software is
		# furnished to do so, subject to the following conditions:
		#
		# The above copyright notice and this permission notice shall be included in
		# all copies or substantial portions of the Software.
		#
		# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
		# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
		# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
		# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
		# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
		# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
		# THE SOFTWARE.
		##########
		# Takes a SAM file, looks for chimeric reads and normal reads. Outputs
		# only those reads mapped to chromosomes 1-24 with MAPQ > 0.
		#
		# Output to fname1 is of the form:
		# strand1 chromosome1 position1 frag1 strand2 chromosome2 position2 frag2
		# mapq1 cigar1 seq1 mapq2 cigar2 seq2
		# where chr1 is always <= chr2
		#
		# Output to fname2 retains SAM format.
		#
		# Chimeric reads are treated as follows:
		# "Normal" chimeras (with 3 reads, two with the ligation junction), where the
		# third read (w/o ligation junction) is within 20Kbp of one of the ligation
		# junction reads, are sent to fname1. All other chimeras (where either there
		# are more than 3 reads, or they don't contain the ligation junction, or the
		# one non-ligation junction end is far from the others, are sent to fname2.
		#
		# awk -f chimeric_blacklist.awk -v fname1="norm_chimera" fname2="abnorm_chimera" fname3="unmapped"

		# returns absolute value
		function abs(value)
		{
		return (value<0?-value:value);
		}
		# returns minimum of two values
		function min(value1,value2)
		{
		return (value1<value2?value1:value2);
		}
		# examines read1 (s1,c1,p1) versus read2 and returns true if
		# the first read comes before the second read
		# this is so duplicates can be found after sorting even when the strand and
		# chromosome are the same
		function less_than(s1,c1,p1,s2,c2,p2)
		{

		if (c1 < c2) return 1;
		if (c1 > c2) return 0;
		# c1 == c2
		if (s1 < s2) return 1;
		if (s1 > s2) return 0;
		# s1 == s2 && c1 == c2
		if (p1 < p2) return 1;
		if (p1 > p2) return 0;
		# all are equal, doesn't matter
		return 1;
		}
		BEGIN{
		OFS="\t";
		tottot = -1; # will count first non-group
		}
		{
		# input file is sorted by read name. Look at read name to group
		# appropriately
		split($1,a,"/");
		if(a[1]==prev){
		# move on to next record. look below this block for actions that occur
		# regardless.
		count++;
		}
		else {
		# deal with read pair group
		tottot++;
		if (count==3) {
		# chimeric read
		for (j=1; j <= 3; j++) {
		split(c[j], tmp);
		split(tmp[1],readname,"/");
		read[j] = readname[2];
		name[j] = tmp[1];

		# strand; Bit 16 set means reverse strand
		str[j] = and(tmp[2],16);
		# chromosome
		chr[j] = tmp[3];
		# position
		pos[j] = tmp[4];
		# mapq score
		m[j] = tmp[5];
		# cigar string
		cigarstr[j] = tmp[6];
		# sequence
		seq[j] = tmp[10];
		qual[j] = tmp[11];

		# get rid of soft clipping to know correct position
		if (str[j] == 0 && tmp[6] ~/^[0-9]+S/) {
		split(tmp[6], cigar, "S");
		pos[j] = pos[j] - cigar[1];
		if (pos[j] <= 0) {
		pos[j] = 1;
		}
		}
		# get rid of hard clipping to know correct position
		else if (str[j] == 0 && tmp[6] ~/^[0-9]+H/) {
		split(tmp[6], cigar, "H");
		pos[j] = pos[j] - cigar[1];
		if (pos[j] <= 0) {
		pos[j] = 1;
		}
		}
		else if (str[j] == 16) {
		# count Ms,Ds,Ns,Xs,=s for sequence length
		seqlength=0;
		currstr=tmp[6];
		# can look like 15M10S20M, need to count all not just first
		where=match(currstr, /[0-9]+[M\|D\|N\|X\|=]/);
		while (where>0) {
		seqlength+=substr(currstr,where,RLENGTH-1)+0;
		currstr=substr(currstr,where+RLENGTH);
		where=match(currstr, /[0-9]+[M\|D\|N\|X\|=]/);
		}
		pos[j] = pos[j] + seqlength - 1;
		# add soft clipped bases in for proper position
		if (tmp[6] ~ /[0-9]+S$/) {
		where = match(tmp[6],/[0-9]+S$/);
		cigloc = substr(tmp[6],where,RLENGTH-1) + 0;
		pos[j] = pos[j] + cigloc;
		}
		else if (tmp[6] ~ /[0-9]+H$/) {
		where = match(tmp[6],/[0-9]+H$/);
		cigloc = substr(tmp[6],where,RLENGTH-1) + 0;
		pos[j] = pos[j] + cigloc;
		}
		# Mitochrondria loops around
		if (chr[j] ~ /MT/ && pos[j] >= 16569) {
		pos[j] = pos[j] - 16569;
		}
		}
		# blacklist - if 3rd bit set (=4) it means unmapped
		mapped[j] = and(tmp[2],4) == 0;
		}

		dist[12] = abs(chr[1]-chr[2])*10000000 + abs(pos[1]-pos[2]);
		dist[23] = abs(chr[2]-chr[3])*10000000 + abs(pos[2]-pos[3]);
		dist[13] = abs(chr[1]-chr[3])*10000000 + abs(pos[1]-pos[3]);

		if (min(dist[12],min(dist[23],dist[13])) < 1000) {
		# The paired ends look like A/B...B. Make sure we take A and B.
		if (read[1] == read[2]) {
		# take the unique one "B" for sure
		read2 = 3;
		# take the end of "A/B" that isn't close to "B"
		read1 = dist[13] > dist[23] ? 1:2;
		}
		else if (read[1] == read[3]) {
		read2 = 2;
		read1 = dist[12] > dist[23] ? 1:3;
		}
		else if (read[2] == read[3]) {
		read2 = 1;
		read1 = dist[12] > dist[13] ? 2:3;
		}
		else {
		printf("reads strange\n") > "/dev/stderr"
		exit 1
		}

		if (mapped[read1] && mapped[read2]) {
		count_norm++;
		if (less_than(str[read1],chr[read1],pos[read1],str[read2],chr[read2],pos[read2])) {
		print str[read1],chr[read1],pos[read1],str[read2],chr[read2],pos[read2],m[read1],cigarstr[read1],seq[read1],m[read2],cigarstr[read2],seq[read2],name[read1],name[read2] > fname1;
		}
		else {
		print str[read2],chr[read2],pos[read2],str[read1],chr[read1],pos[read1],m[read2],cigarstr[read2],seq[read2],m[read1],cigarstr[read1],seq[read1],name[read2],name[read1] > fname1;
		}
		}
		else {
		for (i in c) {
		print c[i] > fname3;
		}
		count_unmapped++;
		}
		}
		else {
		# chimeric read with the 3 ends > 1KB apart
		count_abnorm++;
		for (i in c) {
		print c[i] > fname2;
		}
		}
		}
		else if (count > 3) {
		# chimeric read > 3, too many to deal with
		count_abnorm++;
		for (i in c) {
		print c[i] > fname2;
		}
		}
		else if (count == 2) {
		# code here should be same as above, but it's a "normal" read
		j = 0;
		for (i in c) {
		split(c[i], tmp);
		split(tmp[1],readname,"/");
		str[j] = and(tmp[2],16);
		chr[j] = tmp[3];
		pos[j] = tmp[4];
		m[j] = tmp[5];
		cigarstr[j] = tmp[6];
		seq[j] = tmp[10];
		qual[j] = tmp[11];
		name[j] = tmp[1];

		# blacklist - if 3rd bit set (=4) it means unmapped
		mapped[j] = and(tmp[2],4) == 0;

		if (str[j] == 0 && tmp[6] ~/^[0-9]+S/) {
		split(tmp[6], cigar, "S");
		pos[j] = pos[j] - cigar[1];
		if (pos[j] <= 0) {
		pos[j] = 1;
		}
		}
		else if (str[j] == 0 && tmp[6] ~/^[0-9]+H/) {
		split(tmp[6], cigar, "H");
		pos[j] = pos[j] - cigar[1];
		if (pos[j] <= 0) {
		pos[j] = 1;
		}
		}
		else if (str[j] == 16) {
		# count Ms,Ds,Ns,Xs,=s for sequence length
		seqlength=0;
		currstr=tmp[6];
		where=match(currstr, /[0-9]+[M\|D\|N\|X\|=]/);
		while (where>0) {
		seqlength+=substr(currstr,where,RLENGTH-1)+0;
		currstr=substr(currstr,where+RLENGTH);
		where=match(currstr, /[0-9]+[M\|D\|N\|X\|=]/);
		}
		pos[j] = pos[j] + seqlength - 1;
		if (tmp[6] ~ /[0-9]+S$/) {
		where = match(tmp[6],/[0-9]+S$/);
		cigloc = substr(tmp[6],where,RLENGTH-1) + 0;
		pos[j] = pos[j] + cigloc;
		}
		if (tmp[6] ~ /[0-9]+H$/) {
		where = match(tmp[6],/[0-9]+H$/);
		cigloc = substr(tmp[6],where,RLENGTH-1) + 0;
		pos[j] = pos[j] + cigloc;
		}
		if (chr[j] ~ /MT/ && pos[j] >= 16569) {
		pos[j] = pos[j] - 16569;
		}
		}
		j++;
		}
		if (mapped[0] && mapped[1]) {
		count_reg++;
		if (less_than(str[0],chr[0],pos[0],str[1],chr[1],pos[1])) {
		# ideally we'll get rid of printing out cigar string at some point
		print str[0],chr[0],pos[0],str[1],chr[1],pos[1],m[0],cigarstr[0],seq[0],m[1],cigarstr[1],seq[1],name[0],name[1] > fname1;
		}
		else {
		print str[1],chr[1],pos[1],str[0],chr[0],pos[0],m[1],cigarstr[1],seq[1],m[0],cigarstr[0],seq[0],name[1],name[0] > fname1;
		}
		}
		else {
		for (i in c) {
		print c[i] > fname3;
		}
		count_unmapped++;
		}
		}
		# reset variables
		delete c;
		count=1;
		prev=a[1];
		}
		# these happen no matter what, after the above processing
		c[count] = $0;
		}
		END{
		# deal with read pair group
		tottot++;
		if (count==3) {
		# chimeric read
		for (j=1; j <= 3; j++) {
		split(c[j], tmp);
		split(tmp[1],readname,"/");
		read[j] = readname[2];
		name[j] = tmp[1];

		# strand
		str[j] = and(tmp[2],16);
		# chromosome
		chr[j] = tmp[3];
		# position
		pos[j] = tmp[4];
		# mapq score
		m[j] = tmp[5];
		# cigar string
		cigarstr[j] = tmp[6];
		# sequence
		seq[j] = tmp[10];
		qual[j] = tmp[11];
		# get rid of soft clipping to know correct position
		if (str[j] == 0 && tmp[6] ~/^[0-9]+S/) {
		split(tmp[6], cigar, "S");
		pos[j] = pos[j] - cigar[1];
		if (pos[j] <= 0) {
		pos[j] = 1;
		}
		}
		else if (str[j] == 0 && tmp[6] ~/^[0-9]+H/) {
		split(tmp[6], cigar, "H");
		pos[j] = pos[j] - cigar[1];
		if (pos[j] <= 0) {
		pos[j] = 1;
		}
		}
		else if (str[j] == 16) {
		# count Ms,Ds,Ns,Xs,=s for sequence length
		seqlength=0;
		currstr=tmp[6];
		# can look like 15M10S20M, need to count all not just first
		where=match(currstr, /[0-9]+[M\|D\|N\|X\|=]/);
		while (where>0) {
		seqlength+=substr(currstr,where,RLENGTH-1)+0;
		currstr=substr(currstr,where+RLENGTH);
		where=match(currstr, /[0-9]+[M\|D\|N\|X\|=]/);
		}
		pos[j] = pos[j] + seqlength - 1;
		# add soft clipped bases in for proper position
		if (tmp[6] ~ /[0-9]+S$/) {
		where = match(tmp[6],/[0-9]+S$/);
		cigloc = substr(tmp[6],where,RLENGTH-1) + 0;
		pos[j] = pos[j] + cigloc;
		}
		if (tmp[6] ~ /[0-9]+H$/) {
		where = match(tmp[6],/[0-9]+H$/);
		cigloc = substr(tmp[6],where,RLENGTH-1) + 0;
		pos[j] = pos[j] + cigloc;
		}
		# Mitochrondria loops around
		if (chr[j] ~ /MT/ && pos[j] >= 16569) {
		pos[j] = pos[j] - 16569;
		}
		}
		mapped[j] = and(tmp[2],4) == 0;
		}

		dist[12] = abs(chr[1]-chr[2])*10000000 + abs(pos[1]-pos[2]);
		dist[23] = abs(chr[2]-chr[3])*10000000 + abs(pos[2]-pos[3]);
		dist[13] = abs(chr[1]-chr[3])*10000000 + abs(pos[1]-pos[3]);

		if (min(dist[12],min(dist[23],dist[13])) < 1000) {
		# The paired ends look like A/B...B. Make sure we take A and B.
		if (read[1] == read[2]) {
		# take the unique one "B" for sure
		read2 = 3;
		# take the end of "A/B" that isn't close to "B"
		read1 = dist[13] > dist[23] ? 1:2;
		}
		else if (read[1] == read[3]) {
		read2 = 2;
		read1 = dist[12] > dist[23] ? 1:3;
		}
		else if (read[2] == read[3]) {
		read2 = 1;
		read1 = dist[12] > dist[13] ? 2:3;
		}
		else {
		printf("reads strange\n") > "/dev/stderr"
		exit 1
		}

		if (mapped[read1] && mapped[read2]) {
		count_norm++;
		if (less_than(str[read1],chr[read1],pos[read1],str[read2],chr[read2],pos[read2])) {
		print str[read1],chr[read1],pos[read1],str[read2],chr[read2],pos[read2],m[read1],cigarstr[read1],seq[read1],m[read2],cigarstr[read2],seq[read2],name[read1],name[read2] > fname1;
		}
		else {
		print str[read2],chr[read2],pos[read2],str[read1],chr[read1],pos[read1],m[read2],cigarstr[read2],seq[read2],m[read1],cigarstr[read1],seq[read1],name[read2],name[read1] > fname1;
		}
		}
		else {
		for (i in c) {
		print c[i] > fname3;
		}
		count_unmapped++;
		}
		}
		else {
		# chimeric read with the 3 ends > 1KB apart
		count_abnorm++;
		for (i in c) {
		print c[i] > fname2;
		}
		}
		}
		else if (count > 3) {
		# chimeric read > 3, too many to deal with
		count_abnorm++;
		for (i in c) {
		print c[i] > fname2;
		}
		}
		else if (count == 2) {
		# code here should be same as above, but it's a "normal" read
		j = 0;
		for (i in c) {
		split(c[i], tmp);
		split(tmp[1],readname,"/");
		str[j] = and(tmp[2],16);
		chr[j] = tmp[3];
		pos[j] = tmp[4];
		m[j] = tmp[5];
		cigarstr[j] = tmp[6];
		seq[j] = tmp[10];
		qual[j] = tmp[11];
		name[j] = tmp[1];


		if (str[j] == 0 && tmp[6] ~/^[0-9]+S/) {
		split(tmp[6], cigar, "S");
		pos[j] = pos[j] - cigar[1];
		if (pos[j] <= 0) {
		pos[j] = 1;
		}
		}
		else if (str[j] == 0 && tmp[6] ~/^[0-9]+H/) {
		split(tmp[6], cigar, "H");
		pos[j] = pos[j] - cigar[1];
		if (pos[j] <= 0) {
		pos[j] = 1;
		}
		}
		else if (str[j] == 16) {
		# count Ms,Ds,Ns,Xs,=s for sequence length
		seqlength=0;
		currstr=tmp[6];
		where=match(currstr, /[0-9]+[M\|D\|N\|X\|=]/);
		while (where>0) {
		seqlength+=substr(currstr,where,RLENGTH-1)+0;
		currstr=substr(currstr,where+RLENGTH);
		where=match(currstr, /[0-9]+[M\|D\|N\|X\|=]/);
		}
		pos[j] = pos[j] + seqlength - 1;
		if (tmp[6] ~ /[0-9]+S$/) {
		where = match(tmp[6],/[0-9]+S$/);
		cigloc = substr(tmp[6],where,RLENGTH-1) + 0;
		pos[j] = pos[j] + cigloc;
		}
		if (tmp[6] ~ /[0-9]+H$/) {
		where = match(tmp[6],/[0-9]+H$/);
		cigloc = substr(tmp[6],where,RLENGTH-1) + 0;
		pos[j] = pos[j] + cigloc;
		}
		if (chr[j] ~ /MT/ && pos[j] >= 16569) {
		pos[j] = pos[j] - 16569;
		}
		}
		mapped[j] = and(tmp[2],4) == 0;
		j++;
		}
		if (mapped[0] && mapped[1]) {
		count_reg++;
		if (less_than(str[0],chr[0],pos[0],str[1],chr[1],pos[1])) {
		# ideally we'll get rid of printing out cigar string at some point
		print str[0],chr[0],pos[0],str[1],chr[1],pos[1],m[0],cigarstr[0],seq[0],m[1],cigarstr[1],seq[1],name[0],name[1] > fname1;
		}
		else {
		print str[1],chr[1],pos[1],str[0],chr[0],pos[0],m[1],cigarstr[1],seq[1],m[0],cigarstr[0],seq[0],name[1],name[0] > fname1;
		}
		}
		else {
		for (i in c) {
		print c[i] > fname3;
		}
		count_unmapped++;
		}
		}

		printf("%d %d %d %d %d\n", tottot, count_unmapped, count_reg, count_norm, count_abnorm) >> fname1".res.txt";
		}

AWS/scripts/cleanup.sh

0 → 100755

+31 −0

Original line number	Diff line number	Diff line
		#!/bin/bash
		##########
		#The MIT License (MIT)
		#
		# Copyright (c) 2015 Aiden Lab
		#
		# Permission is hereby granted, free of charge, to any person obtaining a copy
		# of this software and associated documentation files (the "Software"), to deal
		# in the Software without restriction, including without limitation the rights
		# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
		# copies of the Software, and to permit persons to whom the Software is
		# furnished to do so, subject to the following conditions:
		#
		# The above copyright notice and this permission notice shall be included in
		# all copies or substantial portions of the Software.
		#
		# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
		# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
		# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
		# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
		# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
		# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
		# THE SOFTWARE.
		##########

		# Script to clean up big repetitive files and zip fastqs. Run after you are
		# sure the pipeline ran successfully. Run from top directory (HIC001 e.g.).
		total=`ls -l aligned/merged_sort.txt \| awk '{print $5}'`
		total2=`ls -l aligned/merged_nodups.txt aligned/dups.txt aligned/opt_dups.txt \| awk '{sum = sum + $5}END{print sum}'`
		if [ $total -eq $total2 ]; then rm aligned/merged_sort.txt; rm -r splits; else echo "Problem: The sum of merged_nodups and the dups files is not the same size as merged_sort.txt"; fi
		#bzip2 fastq/*.fastq

Admin message