Algorithms for Suffix Arrays and Compressed Suffix Arrays

Participants: Kunihiko Sadakane, Tak-Wah Lam, Keng Sung, Swee Seong Wong.

Background

Suffix arrays and compressed suffix arrays are important full-text indexing technologies frequently used for indexing texts without word boundary, such as DNA or protein sequences. However, constructing these arrays requires a lot of space, making these data-structures impractical. For example, constructing CSA for human genome requires at least 24G RAM. In this project, we aim to develop efficient algorithms for searching and constructing suffix arrays, compressed suffix arrays, and other types of full-text indices.

Achievements

We are pioneer in efficient algorithms for searching and constructing full-text indices like suffix arrays and compressed suffix arrays. Up to now, no known algorithm can build a full-text index for a big dataset using reasonable computing resources. For instance, we are the only group who can construct a full-text index for the whole human genome using a PC with 2G RAM within 2 hours.

We have achieved good results in computer science (e.g., our FOCS paper in 2003). We have also applied our algorithms to solve real biology problems (e.g., our Nature Method paper in 2005 and Cell paper in 2006). One of our algorithms won the Japan Forum on Information Technology Award in 2003. We have also worked on pattern searching. Utilizing our full-text indices, we can do approximate searching more efficiently than many previous solutions (e.g., our ESA in 2006).

Selected Publications

Wing-Kai Hon, Kunihiko Sadakane, Wing-Kin Sung. Breaking a Time-and-Space Barrier in Constructing Full-Text Indices. Proc. 44th IEEE Symp. on Foundations of Computer Science (FOCS 2003), pages 251--260, October 2003. PS
Tak Wah Lam, Wing-Kin Sung, Swee-Seong Wong. Improved Approximate String Matching Using Compressed Suffix Data Structures. Proc. 16th International Symposium on Algorithms and Computation (ISAAC 2005), pages 339-348, 2005.
Ho-Leung Chan, Tak-Wah Lam, Wing-Kin Sung, Siu-Lung Tam, Swee-Seong Wong. Compressed Indexes for Approximate String Matching. Proc. 16th European Symp. on Algorithms (ESA 2006), pages 208-219, September 2006.
Partick Ng, Chia-Lin Wei, Wing-Kin Sung, Kuo Ping Chiu, Leonard Lipovich, Chin Chin Ang, Sanjay Gupta, Atif Shahab, Azmi Ridwan, Chee Hong Wong, Edison T. Liu, Yijun Ruan. Gene Identification Signature (GIS) Analysis for Transcriptome Characterization and Genome Annotation. Nature Methods, 2:105-111, 2005.
Trinh N. D. Huynh, Wing-Kai Hon, Tak Wah Lam, Wing-Kin Sung. Approximate string matching using compressed suffix arrays. Theor. Comput. Sci., 352(1-3):240-249, 2006.
Ho-Leung Chan, Tak Wah Lam, Wing-Kin Sung, Siu-Lung Tam, Swee-Seong Wong. A Linear Size Index for Approximate Pattern Matching. Proc. 17th Symp. on Combinatorial Pattern Matching (CPM 2006), pages 49-59, July 2006.
Jesper Jansson, Kunihiko Sadakane, Wing-Kin Sung. Ultra-succinct Representation of Ordered Trees. Proceedings of ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 575-584, New Orleans, Luisiana, January 2007.
Swee-Seong Wong, Wing-Kin Sung, Limsoon Wong. CPS-Tree: A Compact Partitioned Suffix Tree for Disk-Based Indexing on Large Genome Sequences. Proceedings of 23rd IEEE International Conference on Data Engineering, pages 1350--1354, Istanbul, Turkey, April 2007.

Last updated: 1/5/07, Limsoon Wong.