Indiana University Bloomington

Luddy School of Informatics, Computing, and Engineering

Technical Report TR709:
A New Module in RAPSearch2 for Fast Protein Similarity Search of Paired-end Sequences

Xiaoqian Zhang, Haixu Tang
(Dec 2013), 16 pages
[Master's capstone report]
Abstract:
Protein similarity search is a fundamental step in the taxonomic classification and functional annotation of sequencing data from metagenomic and metatranscriptomic projects. Currently, the most popular tool for similarity search is BLAST (specifically blastx), which has proved very efficient in aligning conventional sequencing data such as Sanger reads. The application and extension of Next Generation Sequencing (NGS) technology, which generates massive amounts of sequencing data, poses new challenges for classical algorithms of sequence comparison and similarity search. If we use BLAST to process NGS sequences, the search is too slow. To address this challenge, RAPSearch [12,13] has been developed. It is a fast protein similarity search tool that uses a reduced amino acid alphabet to speed up the similarity search by a few orders of magnitude and meet the demands of NGS sequence analysis. Paired-end sequencing is a common technique used in NGS. It produces two reads from proximal locations of a target DNA or RNA molecule, one in the forward and one in the reverse direction, which could potentially be used to enhance alignment precision and coverage. RAPSearch has two versions (RAPSearch and RAPSearch2), both of which can only handle single-end sequences. Here, I present a method for RAPSearch2 that combines paired-end reads into one hit and evaluates its significance in the similarity search to improve the sensitivity of alignment. Based on the RAPSearch2 algorithm, I built a new module that can process the paired-end reads simultaneously. By using paired-end sequences that align to proximal locations on the same subject sequence, the method can increase search sensitivity by about 0.5% to 0.6% compared to a similarity search that uses each of the paired-end sequences individually.
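
The idea of combining paired-end reads into one hit can be illustrated with a minimal Python sketch. It is not the RAPSearch2 module itself: the `Hit` record, the `pair_hits` function, the `max_gap` threshold, and the use of summed bit scores as the combined score are all assumptions made for illustration. The sketch only shows the general pattern of keeping pairs of hits, one from each mate, that fall on the same subject sequence at proximal locations.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Hit:
    """One alignment of a single read against a subject protein (hypothetical record)."""
    subject_id: str
    start: int        # 0-based start on the subject sequence
    end: int          # exclusive end on the subject sequence
    bit_score: float


def pair_hits(hits1: List[Hit], hits2: List[Hit],
              max_gap: int = 300) -> List[Tuple[Hit, Hit, float]]:
    """Combine hits of two mates that land close together on the same subject.

    max_gap is an assumed proximity threshold in residues, and the combined
    score is simply the sum of the two bit scores (an illustrative choice).
    """
    pairs = []
    for h1 in hits1:
        for h2 in hits2:
            if h1.subject_id != h2.subject_id:
                continue  # mates must hit the same subject sequence
            # Distance between the two aligned regions; negative if they overlap.
            gap = max(h1.start, h2.start) - min(h1.end, h2.end)
            if gap <= max_gap:
                pairs.append((h1, h2, h1.bit_score + h2.bit_score))
    # Best combined score first.
    pairs.sort(key=lambda p: p[2], reverse=True)
    return pairs


if __name__ == "__main__":
    mate1 = [Hit("P1", 10, 60, 40.0), Hit("P2", 0, 50, 35.0)]
    mate2 = [Hit("P1", 100, 150, 30.0), Hit("P3", 0, 50, 50.0)]
    for h1, h2, score in pair_hits(mate1, mate2):
        print(h1.subject_id, h1.start, h2.start, score)
```

In this toy input only the two hits on subject `P1` are paired, because the other hits fall on different subjects. A real implementation would also need a statistical significance estimate for the combined hit, which this sketch does not attempt.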

Available as: