Why do you think your paper is highly cited?
CD-HIT is an ultra-fast program for the clustering of
biological sequences. This paper describes significant
improvements and new developments of CD-HIT. I see three
primary reasons for its being highly cited. 1) There is a
great need for a fast sequence clustering program to support
many kinds of sequence-based research, and few such
programs are available. CD-HIT may be the only one that can
handle a huge dataset of tens of millions of sequences. 2) An
earlier version of CD-HIT had already attracted many
users and citations. 3) The CD-HIT program is reasonably
well maintained, distributed, and documented, so that a user
can easily download and apply it without experiencing
technical problems.
Does it describe a new discovery, methodology, or synthesis
of knowledge?
This paper describes a computer program and the
underlying algorithm for rapid sequence clustering and
comparison. Some earlier work was published previously
(see Bioinformatics 17:282 (2001) and
Bioinformatics 18:77 (2002)), but several significant
improvements were first described in this paper.
Would you summarize the significance of your paper in
layman’s terms?
As a result of high-throughput genome sequencing
projects, researchers are facing serious challenges and
problems from the explosive growth of public sequence
databases. Routine sequence analysis is getting more
computationally expensive and more complex. Also, the
general growth of databases is quite uneven, which may lead
to biased conclusions.
An efficient clustering method is the key to addressing
these challenges and overcoming the problems. However,
sequence clustering is also quite computationally intensive.
Our contribution here is an algorithm that can speed up
the clustering calculation by two to three orders of
magnitude, so a user can easily apply our method to his or
her sequences, even with a very large dataset.
How did you become involved in this research, and were
there any problems along the way?
I have witnessed the increasing growth of public sequence
databases since I first began working in the bioinformatics
field 10 years ago. In our research, to reduce the
effort of sequence analysis, we clustered the sequence
datasets and used only their representatives. We noticed that
this practice actually improved the results, and similar
findings were reported in the literature.
As the databases continued to grow, sequence clustering
itself became a challenge. This is why I began searching for
an efficient algorithm for rapid sequence clustering and
concentrated on writing the CD-HIT program. The development
has been fairly smooth, and I’ve received considerable
support, help, and suggestions from many CD-HIT users. The
only problem has been the dearth of grant support: much of
the program development was completed in my spare time.
Where do you see your research leading in the future?
Sequence clustering methods will have numerous
applications in various fields for as long as large numbers
of biological sequences are being analyzed. Among the unique
features of our CD-HIT program are its ultra-high speed and
its inherent capability of handling huge numbers of
sequences. I envision several new and important
applications, such as the analysis of metagenomic sequences.
I can also foresee developments that will further
improve the clustering method and that will prove
invaluable to the worldwide research community.
Weizhong Li, Ph.D.
Senior Scientist
California Institute for Telecommunications and Information
Technology (Calit2)
University of California, San Diego
La Jolla, CA, USA