DNA Sequencing is the process of determining the sequence of base-pairs composing DNA. Shotgun sequencing splits DNA into small chunks called reads, and reads need to be reassembled or mapped into the overall DNA. Mapper speed remains a bottleneck in DNA read mapping. Processing and aligning billions of reads take a substantial amount of time, and there is a quest to build faster and better algorithms to improve runtime and throughput. In order to further the use DNA sequencing technology, it’s important to make it accessible and efficient.
Approximate String Matching is a pivotal problem in the field of computer science. It serves as an integral component for many string algorithms, most notably, DNA read mapping and alignment. The improved LV algorithm proposes an improved dynamic programming strategy over the banded Smith-Waterman algorithm but suffers from support of a limited selection of scoring schemes. In this paper, we propose the Leaping Toad problem, a generalization of the approximate string matching problem, as well as LEAP, a generalization of the Landau-Vishkin’s algorithm that solves the Leaping Toad problem under a broader selection of scoring schemes.
Single processor compute capabilities have not improved significantly, and with the advent of multi-core and multi-threaded machines, parallelism is important in achieving the next generation of performance. Many aspects of DNA read mapping are inherently parallel, so this makes it a viable approach for achieving improvements in runtime. This research also focuses on optimizing and parallelizing a component of DNA read mapping pipeline, seed selection. We want to determine how efficiently we can distribute work over cores to achieve a large speedup.
In this research, we focus on improving the complexity and runtime of seed-and-extend based mappers. Seed-and-extend mappers align reads by splitting reads into smaller seeds. These seeds are searched in the reference and used to align the reads. The complexity of the alignment is in correlation with the how often the seeds occur in the reference, or the frequency of the seeds. If seeds occur more frequently, a larger number of locations need to checked for potential mapping sites of the read, which increases complexity. Therefore it is imperative that we choose seeds which have the lowest sum of frequencies to minimize the number of mapping locations, and hence computation.
Therefore the goal is to develop state-of-the-art seed selection schemes which are able to select seeds from a read in a memory efficient and cache efficient manner while minimizing the sum of the frequency of the seeds. These new heuristics and algorithms will be compared with existing seed selection schemes found in seed-and-extend mappers such as Hobbes and FastHash, among others.
Optimizing seed selection is an important problem in read mapping. The number of non-overlapping seeds a mapper selects determines the sensitivity of the mapper while the total frequency of all selected seeds determines the speed of the mapper. Modern seed-and-extend mappers usually select seeds with either an equal and fixed-length scheme or with an inflexible placement scheme, both of which limit the ability of the mapper in selecting less frequent seeds to speed up the mapping process. Therefore, it is crucial to develop a new algorithm that can adjust both the individual seed length and the seed placement, as well as derive less frequent seeds.
We published a paper in the Oxford Bioinformatics Journal.
Began work in DNA Read Mapping.
This work was done by Sunny Nahar under Prof. Onur Mutlu and Hongyi Xin as part of the CMU Safari Group.