Introduction

‘minimapR’ is a wrapper for ‘Minimap2’. ‘Minimap2’ is a very valuable long read aligner for the Pacbio and Oxford Nanopore Technologies sequencing platforms. ‘minimapR’ is an R wrapper for ‘minimap2’ which was developed by Heng Li me@liheng.org. This tool aligns long reads to a reference genome and is used in many different bioinformatics workflows.

library(minimapR)
#> Loading required package: Rsamtools
#> Loading required package: GenomeInfoDb
#> Loading required package: BiocGenerics
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#>     as.data.frame, basename, cbind, colnames, dirname, do.call,
#>     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#>     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#>     pmin.int, rank, rbind, rownames, sapply, setdiff, table, tapply,
#>     union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> Loading required package: stats4
#> 
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:utils':
#> 
#>     findMatches
#> The following objects are masked from 'package:base':
#> 
#>     I, expand.grid, unname
#> Loading required package: IRanges
#> Loading required package: GenomicRanges
#> Loading required package: Biostrings
#> Loading required package: XVector
#> 
#> Attaching package: 'Biostrings'
#> The following object is masked from 'package:base':
#> 
#>     strsplit
#> Loading required package: git2r
#> 
#> Attaching package: 'git2r'
#> The following objects are masked from 'package:Biostrings':
#> 
#>     as.data.frame, head
#> The following object is masked from 'package:XVector':
#> 
#>     as.data.frame
#> The following objects are masked from 'package:GenomicRanges':
#> 
#>     as.data.frame, merge
#> The following objects are masked from 'package:GenomeInfoDb':
#> 
#>     as.data.frame, merge
#> The following objects are masked from 'package:IRanges':
#> 
#>     as.data.frame, diff, merge
#> The following objects are masked from 'package:S4Vectors':
#> 
#>     as.data.frame, head, merge
#> The following object is masked from 'package:BiocGenerics':
#> 
#>     as.data.frame
#> Loading required package: pafr
#> Loading required package: ggplot2

Installation

‘minimap2’ and ‘samtools’ must be installed on your system along with the R packages ‘Rsamtools’, ‘git2r’, and ‘pafr’. ‘minimap2’ can be installed on various operating systems by running the following function or following the instruction from the output of the function. “/your/path/to/directory/for/install” should be replaced with the path to the directory where you want to install ‘minimap2’.

minimap2_installation("/your/path/to/directory/for/install")

check if dependencies were successfully installed.

minimap2_check()
#> minimap2 is installed.
#>                                minimap2 
#> "/home/jake/software/minimap2/minimap2"
samtools_check()
#> 
#> samtools is installed.
#>                            samtools 
#> "/home/jake/anaconda3/bin/samtools"

Usage

Downloading data for example

We will download the reference genomes and the query sequences for the example. The reference genomes are the human genome (GRCh38) and the yeast genome (S288C). The query sequences are the human ONT reads and yeast HIFI reads.

tmp_folder <- tempdir()
cat("Temporary folder is:", tmp_folder, "\n")
#> Temporary folder is: /tmp/Rtmp8GTYNH
href_url <- "https://github.com/jake-bioinfo/minimapR/raw/master/inst/extdata/GRCh38_chr1_50m.fa"
hfq_url <- "https://github.com/jake-bioinfo/minimapR/raw/master/inst/extdata/ont_hs_sample.fastq.gz"
yref_url <- "https://github.com/jake-bioinfo/minimapR/raw/master/inst/extdata/S288C_ref_genome.fasta"
yfq_url <- "https://github.com/jake-bioinfo/minimapR/raw/master/inst/extdata/yeast_sample_hifi.fastq.gz"
url_list <- c(href_url, hfq_url, yref_url, yfq_url)
lapply(url_list, function(x) download.file(x, destfile = file.path(tmp_folder, basename(x))))
#> [[1]]
#> [1] 0
#> 
#> [[2]]
#> [1] 0
#> 
#> [[3]]
#> [1] 0
#> 
#> [[4]]
#> [1] 0

# Contents of the temporary folder
cat("Contents of the temporary folder are:", "\n")
#> Contents of the temporary folder are:
fa_list <- list.files(tmp_folder, pattern = ".fa", full.names = TRUE)
fa_list
#> [1] "/tmp/Rtmp8GTYNH/GRCh38_chr1_50m.fa"        
#> [2] "/tmp/Rtmp8GTYNH/S288C_ref_genome.fasta"    
#> [3] "/tmp/Rtmp8GTYNH/ont_hs_sample.fastq.gz"    
#> [4] "/tmp/Rtmp8GTYNH/yeast_sample_hifi.fastq.gz"

Aligning reads to the reference genomes

We will align the human ONT reads to the human genome and the yeast HIFI reads to the yeast genome using ‘minimap2’. The ‘minimap2’ function returns a data frame with the alignment information.

# Human ONT alignment
h_out <- file.path(tmp_folder, "ont_hs_sample")
h_alignment <- minimap2(reference = fa_list[1], 
                        query_sequences = fa_list[2], 
                        output_file_prefix = h_out, 
                        preset_string = "map-ont", 
                        threads = 4, 
                        return = TRUE, 
                        verbose = FALSE)

# Yeast HIFI alignment
y_out <- file.path(tmp_folder, "yeast_sample_hifi")
y_alignment <- minimap2(reference = fa_list[3], 
                        query_sequences = fa_list[4], 
                        output_file_prefix = y_out, 
                        preset_string = "map-hifi", 
                        threads = 4, 
                        return = TRUE, 
                        verbose = TRUE)
#> Generating output file: /tmp/Rtmp8GTYNH/yeast_sample_hifi
#> Running minimap2...
#> Running minimap with the following command:
#> /home/jake/software/minimap2/minimap2 -ax map-hifi -t 4 /tmp/Rtmp8GTYNH/ont_hs_sample.fastq.gz /tmp/Rtmp8GTYNH/yeast_sample_hifi.fastq.gz -o /tmp/Rtmp8GTYNH/yeast_sample_hifi.sam

# Check the alignment
cat("Alignment for the human sample is:\n")
#> Alignment for the human sample is:
h_alignment[1:5, 1:7]
#>            qname flag                   rname strand     pos qwidth mapq
#> 1 ref|NC_001135|  256 NC_000001.11_1-50000000      + 3929397 316620    0
#> 2 ref|NC_001224| 2064 NC_000001.11_1-50000000      - 4099809    611    1
#> 3 ref|NC_001135|  272 NC_000001.11_1-50000000      - 5386833 316620    0
#> 4 ref|NC_001224| 2064 NC_000001.11_1-50000000      - 9123181    665    1
#> 5 ref|NC_001143|  256 NC_000001.11_1-50000000      + 9223252 666816    0
cat("Alignment for the yeast sample is:\n")
#> Alignment for the yeast sample is:
y_alignment[1:5, 1:7]
#>             qname flag rname strand pos qwidth mapq
#> 1  SRR13577847.50    4  <NA>   <NA>  NA     NA   NA
#> 2  SRR13577847.52    4  <NA>   <NA>  NA     NA   NA
#> 3 SRR13577847.144    4  <NA>   <NA>  NA     NA   NA
#> 4 SRR13577847.203    4  <NA>   <NA>  NA     NA   NA
#> 5 SRR13577847.251    4  <NA>   <NA>  NA     NA   NA

Clean up

# Clean up
unlink(tmp_folder, recursive = TRUE)
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] minimapR_0.0.1.0     pafr_0.0.2           ggplot2_3.5.1       
#>  [4] git2r_0.33.0         Rsamtools_2.20.0     Biostrings_2.72.0   
#>  [7] XVector_0.44.0       GenomicRanges_1.56.0 GenomeInfoDb_1.40.0 
#> [10] IRanges_2.38.0       S4Vectors_0.42.0     BiocGenerics_0.50.0 
#> 
#> loaded via a namespace (and not attached):
#>  [1] sass_0.4.9              utf8_1.2.4              generics_0.1.3         
#>  [4] bitops_1.0-7            stringi_1.8.4           digest_0.6.36          
#>  [7] magrittr_2.0.3          evaluate_0.24.0         grid_4.4.1             
#> [10] fastmap_1.2.0           jsonlite_1.8.8          httr_1.4.7             
#> [13] fansi_1.0.6             UCSC.utils_1.0.0        scales_1.3.0           
#> [16] codetools_0.2-20        jquerylib_0.1.4         cli_3.6.3              
#> [19] rlang_1.1.4             crayon_1.5.3            munsell_0.5.1          
#> [22] withr_3.0.0             cachem_1.1.0            yaml_2.3.9             
#> [25] tools_4.4.1             parallel_4.4.1          BiocParallel_1.38.0    
#> [28] dplyr_1.1.4             colorspace_2.1-0        GenomeInfoDbData_1.2.12
#> [31] vctrs_0.6.5             R6_2.5.1                lifecycle_1.0.4        
#> [34] zlibbioc_1.50.0         stringr_1.5.1           pkgconfig_2.0.3        
#> [37] bslib_0.7.0             pillar_1.9.0            gtable_0.3.5           
#> [40] glue_1.7.0              xfun_0.45               tibble_3.2.1           
#> [43] tidyselect_1.2.1        knitr_1.48              htmltools_0.5.8.1      
#> [46] rmarkdown_2.27          compiler_4.4.1