
Fast Breaking Comments

By Weizhong Li

ESI Special Topics, December 2007
Citing URL - http://www.esi-topics.com/fbp/2007/december07-WeizhongLi.html

Weizhong Li answers a few questions about this month's fast breaking paper in the field of Computer Science. The author has also sent along images of their work.



Field: Computer Science
Article Title: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences
Authors: Li, WZ; Godzik, A
Journal: BIOINFORMATICS
Volume: 22
Issue: 13
Page: 1658-1659
Year: JUL 1 2006
* Burnham Inst Med Res, La Jolla, CA 92037 USA.

ST:  Why do you think your paper is highly cited?

CD-HIT is an ultra-fast program for the clustering of biological sequences. This paper describes significant improvements and new developments of CD-HIT. I see three primary reasons for its being highly cited. 1) There is a great need for a fast sequence clustering program to support various kinds of sequence-based research, and there are not many such programs available. CD-HIT may be the only one that can handle a huge dataset of tens of millions of sequences. 2) A previous version of CD-HIT had already attracted many users and citations. 3) The CD-HIT program is reasonably well maintained, distributed, and documented, so that a user can easily download and apply it without experiencing technical problems.

ST:  Does it describe a new discovery, methodology, or synthesis of knowledge?


This paper describes a computer program and the underlying algorithm for rapid sequence clustering and comparison. Some earlier work had been published previously (see Bioinformatics 17:282 (2001) and Bioinformatics 18:77 (2002)), but several significant improvements were first described in this paper.

ST:  Would you summarize the significance of your paper in layman’s terms?

As a result of high-throughput genome sequencing projects, researchers are facing serious challenges and problems from the explosive growth of public sequence databases. Routine sequence analysis is getting more computationally expensive and more complex. Also, the general growth of databases is quite uneven, which may lead to biased conclusions.

An efficient clustering method is the key to addressing these challenges and overcoming the problems. However, sequence clustering is also quite computationally intensive. Our contribution here is an algorithm that can speed up the clustering calculation by two to three orders of magnitude. So, users can easily apply our method to their sequences, even with a very large dataset.
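
To make the idea concrete, here is a minimal sketch of greedy incremental clustering with a short-word (k-mer) filter, the general approach described in the CD-HIT publications. It is an illustration only, not the CD-HIT implementation: the function names (`greedy_cluster`, `kmer_counts`, `passes_identity`) and parameters are hypothetical, and the identity measure is a toy, left-aligned, ungapped comparison assumed here for simplicity rather than a real sequence alignment.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-letter word in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def passes_identity(seq, rep, min_identity_pct):
    """Toy identity check (an assumption for this sketch): compare the two
    sequences position by position from the left, with no gaps, and require
    matches / len(shorter) >= min_identity_pct / 100."""
    shorter, longer = sorted((seq, rep), key=len)
    matches = sum(a == b for a, b in zip(shorter, longer))
    return matches * 100 >= min_identity_pct * len(shorter)


def greedy_cluster(seqs, min_identity_pct=80, k=2):
    """Process sequences longest first. Each sequence joins the first
    cluster whose representative it resembles; otherwise it becomes the
    representative of a new cluster."""
    clusters = []    # list of (representative, members)
    rep_words = []   # k-mer counts of each representative
    for seq in sorted(seqs, key=len, reverse=True):
        words = kmer_counts(seq, k)
        n = len(seq)
        # Largest number of mismatches that still meets the threshold.
        max_mismatch = n - (min_identity_pct * n + 99) // 100
        # Each mismatch can destroy at most k words, so any qualifying
        # pair must share at least this many words.
        required = (n - k + 1) - k * max_mismatch
        for (rep, members), rw in zip(clusters, rep_words):
            if sum((words & rw).values()) < required:
                continue  # cheap filter: skip the costly comparison
            if passes_identity(seq, rep, min_identity_pct):
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))
            rep_words.append(words)
    return clusters


if __name__ == "__main__":
    demo = ["MKTAYIAKQR", "MKTAYIAKQA", "GGGGGGGGGG", "MKTAYIAK"]
    for rep, members in greedy_cluster(demo):
        print(rep, members)
```

The word-count filter is where the time is saved in this sketch: most dissimilar pairs are rejected by counting shared short words, so the expensive comparison runs only on plausible candidates.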

ST:  How did you become involved in this research, and were there any problems along the way?

I have witnessed the accelerating growth of public sequence databases since I first began working in the bioinformatics field 10 years ago. In our research, in order to reduce the effort of sequence analysis, we clustered the sequence datasets and used only their representatives. We noticed that this practice actually improved the results, and similar findings were reported in the literature.

With the databases continuing to grow, sequence clustering itself became a challenge. This is why I began to search for an efficient algorithm for rapid sequence clustering and to concentrate on writing the CD-HIT program. The development has been fairly smooth, and I’ve received considerable support, help, and suggestions from many CD-HIT users. The only problem has been the dearth of grant support: much of the program development was completed in my spare time.

ST:  Where do you see your research leading in the future?

Sequence clustering methods will have numerous applications in various fields for as long as large numbers of biological sequences are being analyzed. Among the unique features of our CD-HIT program are its ultra-high speed and its inherent capability of handling a huge number of sequences. I envision several new and important applications, such as the analysis of metagenomic sequences. I can also foresee potential developments which will further improve the clustering method, and which will become invaluable throughout the worldwide research community.

Weizhong Li, Ph.D.
Senior Scientist
California Institute for Telecommunications and Information Technology (Calit2)
University of California, San Diego
La Jolla, CA, USA


A Closer Look...

Below are images sent in by Weizhong Li that correspond to the featured paper or current research.

Figure 1: The CD-HIT program (http://cd-hit.org) is available from http://www.bioinformatics.org/ as an open source package.

  

  

Figure 2: A clustered database not only reduces the search time but also helps to improve the sensitivity of PSI-BLAST. NR is the NCBI non-redundant database. NR80 and NR50 are databases from which sequences similar at 80% and 50% identity, respectively, have been removed; see Protein Engineering 15:643 (2002).

   
