updated README with summary file information (e2435e26) · Commits · github_fork / Chromap

README.md

+50 −0

		@@ -25,6 +25,10 @@ cd chromap && make
		- [Map ChIP-seq short reads](#map-chip)
		- [Map ATAC-seq/scATAC-seq short reads](#map-atac)
		- [Map Hi-C short reads](#map-hic)
		- [Summarizing mapping statistics/quality control](#atacseq-qc)
		- [Summary File](#summaryfile)
		- [Estimating FRiP](#estfrip)
		- [Features to assist in doublet detection](#doublet)
		- [Getting help](#help)
		- [Citing Chromap](#cite)

		@@ -124,6 +128,52 @@ chromap --preset hic -x index -r ref.fa -1 read1.fa -2 read2.fa -o aln.pairs
		```
		Chromap will perform split alignment (--split-alignment) on Hi-C reads and output mappings in [pairs][pairs] format (--pairs), which is used in the [4DN Hi-C data processing pipeline][4DN]. Some Hi-C data analysis pipelines may require that reads be sorted in a specific chromosome order other than the one in the index. Therefore, Chromap provides the option --chr-order to specify the alignment order, and --pairs-natural-chr-order for flipping the pair in the pairs format.

		### <a name="atacseq-qc"></a>Summarizing mapping statistics/quality control

		Chromap can summarize a dataset's mapping statistics and quality metrics at either the bulk or the single-cell level. To enable this feature, pass a file path with the option --summary [FILE]; Chromap will save a CSV file at that path.

		The summary file reports a series of metrics for each barcode (or for the whole dataset if the data is bulk). It contains the following columns:

		```sh
		barcode,total,duplicate,unmapped,lowmapq,cachehit,fric,estfrip,numcacheslots
		```

		- `barcode` - Barcode label for cell
		- `total` - Total number of fragments
		- `duplicate` - Number of duplicate fragments
		- `unmapped` - Number of unmapped fragments
		- `lowmapq` - Number of fragments with a low MAPQ
		- `cachehit` - Number of fragments that were found in the chromap cache during alignment
		- `fric` - Fraction of fragments in the chromap cache
		- `estfrip` - Estimated FRiP value based on a linear model ([See below for more details](#estfrip))
		- `numcacheslots` - Number of unique associated cache slots for this barcode (Relevant feature for doublet detection, [see below for more](#doublet))
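
As a rough illustration of working with these columns, the sketch below parses a summary file with Python's standard `csv` module and derives a per-barcode duplicate fraction. The sample rows are invented for the example, and the helper names `load_summary` and `duplicate_fraction` are hypothetical; treating every column other than `barcode` as numeric is also an assumption.

```python
import csv
import io

# Made-up rows for illustration only; a real file comes from --summary.
SAMPLE = """\
barcode,total,duplicate,unmapped,lowmapq,cachehit,fric,estfrip,numcacheslots
AAAC,1000,100,50,25,400,0.4,0.35,300
TTTG,200,10,20,5,40,0.2,0.15,35
"""


def load_summary(handle):
    """Parse a summary CSV into a list of dicts, one per barcode."""
    rows = []
    for row in csv.DictReader(handle):
        parsed = {"barcode": row["barcode"]}
        for name, value in row.items():
            if name != "barcode":
                # Assumption: every column except `barcode` is numeric.
                parsed[name] = float(value)
        rows.append(parsed)
    return rows


def duplicate_fraction(row):
    """Duplicate fragments as a fraction of all fragments (0.0 if total is 0)."""
    return row["duplicate"] / row["total"] if row["total"] else 0.0


rows = load_summary(io.StringIO(SAMPLE))
for row in rows:
    print(row["barcode"], round(duplicate_fraction(row), 3))
```

For a real run, replace the `io.StringIO(SAMPLE)` handle with `open(path, newline="")`, where `path` is whatever was passed to --summary.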

		Most of the summary's metrics describe the mappability of fragments from each barcode.
		It also contains two metrics (`estfrip` and `numcacheslots`) relevant to quality control for chromatin profiling assays such as ATAC-seq, ChIP-seq and Hi-C.
		The next two sections briefly describe these two metrics and how they can be useful.

		#### <a name="estfrip"></a>Estimating FRiP

		The `estfrip` column in Chromap's summary file is Chromap's estimate of the FRiP score (Fraction of Reads in Peak Regions).
		Chromap estimates the FRiP for each barcode with a simple multivariate linear model whose features are `fric`, `duplicate`, `unmapped` and `lowmapq`.
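
To make the shape of such a model concrete, here is a minimal sketch of a linear combination over the four features named above. It is not Chromap's implementation: the fitted coefficients and intercept are not given in this README, so every number below is a placeholder with no meaning, and whether the count columns are normalized before entering the model is likewise an assumption. The function name `linear_estimate` is hypothetical.

```python
FEATURES = ("fric", "duplicate", "unmapped", "lowmapq")


def linear_estimate(features, coefficients, intercept):
    """Return intercept + sum of coefficient * feature over FEATURES."""
    return intercept + sum(coefficients[name] * features[name] for name in FEATURES)


# Placeholder numbers, NOT Chromap's fitted values; they only show the arithmetic.
example_features = {"fric": 0.4, "duplicate": 0.1, "unmapped": 0.05, "lowmapq": 0.025}
example_coefficients = {"fric": 1.0, "duplicate": -0.5, "unmapped": -0.2, "lowmapq": -0.1}

estimate = linear_estimate(example_features, example_coefficients, intercept=0.05)
print(round(estimate, 4))
```

In practice there is no need to recompute anything: the `estfrip` column already holds Chromap's own estimate.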

		The FRiP score is typically used to assess the quality of chromatin profiles; in general, the higher the FRiP score, the better.

		Users can quickly gauge the quality of the data by plotting all `estfrip` values in a histogram and checking whether the distribution is multi-modal.
		In addition, when combining Chromap with downstream analysis tools that perform clustering, such as [SnapATAC2](https://github.com/kaizhang/SnapATAC2), `estfrip` can be used to quickly identify clusters that are of lower quality than the rest.
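
The histogram check can be sketched without any plotting library. The snippet below bins a list of `estfrip` values into fixed-width bins and prints a text histogram. The values are invented for illustration, the function name `histogram` is hypothetical, and the default range of 0 to 1 is an assumption (values outside it are clipped into the end bins).

```python
def histogram(values, n_bins=10, lo=0.0, hi=1.0):
    """Count values in n_bins equal-width bins over [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for value in values:
        index = int((value - lo) / width)
        # Clip out-of-range values (and value == hi) into the end bins.
        index = max(0, min(index, n_bins - 1))
        counts[index] += 1
    return counts


# Made-up estfrip values; in practice, read them from the summary file.
estfrip_values = [0.05, 0.12, 0.15, 0.55, 0.58, 0.61, 0.95]
counts = histogram(estfrip_values)

for i, count in enumerate(counts):
    print(f"{i / 10:.1f}-{(i + 1) / 10:.1f} {'#' * count}")
```

Two or more well-separated groups of bars would suggest a multi-modal distribution worth a closer look.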

		Important: the `estfrip` value of an individual barcode should not be taken on its own as the true FRiP score.
		These estimates are mainly intended for dataset-level quality control, where `estfrip` values are compared with each other.

		#### <a name="doublet"></a>Features to assist in doublet detection

		The `numcacheslots` column in Chromap's summary file estimates the number of unique cache slots queried for each barcode during alignment. This feature can help with doublet detection and filtering.

		In single-cell datasets, a simple and naive metric for identifying potential doublets is the number of fragments per cell (i.e. the more reads, the more likely a doublet).

		Chromap relies on the simple intuition that barcodes with more peaks than usual could be doublets. The number of unique cache slots queried can be seen as a proxy for the number of peaks. In our experiments, using `numcacheslots` yields a larger AUC than using `total` for binary classification of doublets. Users can therefore use this metric as an additional check or feature alongside dedicated doublet-detection methods.
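
For readers who want to run this kind of comparison on their own data, the sketch below computes the AUC of a single score column against known doublet labels, using the pairwise (Mann-Whitney) definition. It is an illustration only, not the evaluation code behind the experiments mentioned above: the function name `auc` is hypothetical, the numbers are invented toy values that demonstrate the mechanics and nothing else, and it assumes ground-truth doublet labels are available from some other source.

```python
def auc(scores, labels):
    """Probability that a random doublet scores higher than a random singlet.

    Ties count as half a win. `labels` uses 1 for doublet and 0 for singlet.
    """
    doublets = [s for s, y in zip(scores, labels) if y]
    singlets = [s for s, y in zip(scores, labels) if not y]
    if not doublets or not singlets:
        raise ValueError("need at least one doublet and one singlet")
    wins = 0.0
    for d in doublets:
        for s in singlets:
            if d > s:
                wins += 1.0
            elif d == s:
                wins += 0.5
    return wins / (len(doublets) * len(singlets))


# Invented toy values, one entry per barcode; not real results.
labels = [0, 0, 1, 0, 1]
numcacheslots = [300, 320, 610, 290, 580]
total = [1000, 2100, 1900, 950, 2500]

auc_slots = auc(numcacheslots, labels)
auc_total = auc(total, labels)
print(round(auc_slots, 3), round(auc_total, 3))
```

The pairwise loop is quadratic in the number of barcodes, which is fine for a quick check but may be slow on large datasets.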


		### <a name="help"></a>Getting help

		A detailed description of Chromap's command-line options and optional tags can be displayed by running Chromap with -h or found on the [manpage][manpage]. If you encounter bugs or have further questions or requests, you can raise an issue on the [issue page][issue].
