On-line Evolving Clustering of Web Documents

Computing and Communications

Associated organisational units

Keywords

Clustering, Information Retrieval, Data Mining, Incremental Clustering, On-line Clustering

Research output: Contribution to conference - Without ISBN/ISSN › Conference paper › peer-review

Published

Standard

On-line Evolving Clustering of Web Documents. / Evans, A; Angelov, Plamen; Zhou, Xiaowei.
2006. Paper presented at 2nd Annual Symposium on Nature Inspired Smart Adaptive Systems, Santa Cruz, Tenerife, Spain.

Harvard

Evans, A, Angelov, P & Zhou, X 2006, 'On-line Evolving Clustering of Web Documents', Paper presented at 2nd Annual Symposium on Nature Inspired Smart Adaptive Systems, Santa Cruz, Tenerife, Spain, 29/11/06 - 1/12/06. <http://www.nisis.risk-technologies.com/(S(kzkfkqfrzhknwsrjwt3j1jfm))/Filedown.aspx?File=237>

APA

Evans, A., Angelov, P., & Zhou, X. (2006). On-line Evolving Clustering of Web Documents. Paper presented at 2nd Annual Symposium on Nature Inspired Smart Adaptive Systems, Santa Cruz, Tenerife, Spain. http://www.nisis.risk-technologies.com/(S(kzkfkqfrzhknwsrjwt3j1jfm))/Filedown.aspx?File=237

Vancouver

Evans A, Angelov P, Zhou X. On-line Evolving Clustering of Web Documents. 2006. Paper presented at 2nd Annual Symposium on Nature Inspired Smart Adaptive Systems, Santa Cruz, Tenerife, Spain.

Author

Evans, A ; Angelov, Plamen ; Zhou, Xiaowei. / On-line Evolving Clustering of Web Documents. Paper presented at 2nd Annual Symposium on Nature Inspired Smart Adaptive Systems, Santa Cruz, Tenerife, Spain. 6 p.

Bibtex

@conference{4dfb117b74f040b9bdeed2e39b977565,

title = "On-line Evolving Clustering of Web Documents",

abstract = "This paper proposes an approach that uses evolving, incremental (on-line) clustering to automatically group relevant Web-based documents. It is centred on a recently introduced evolving fuzzy rule-based clustering approach and borrows heavily from Nature in the sense that it is evolution-inspired. That is, the structure and number of the clusters are not predefined but self-develop: clusters grow and shrink as new web documents are accessed. Existing Web-based search engine technology returns long lists of web pages that contain the user's search terms but are not presented in order of contextual similarity. For example, results for a search term that has more than one meaning, such as {"}cold{"}, are presented to the user in a single list containing documents relating to both the Cold War and the common cold. If these results could be clustered “on the fly”, this improved presentation would allow the end user to find relevant documents more easily by inspecting one cluster of contextually similar documents rather than the entire list, which contains documents pertaining to irrelevant contexts. Special attention is paid to the similarity measure between textual documents. Euclidean, Levenshtein, and cosine measures have been used, with cosine dissimilarity/distance performing best and addressing the problem of a different number of features in each document. The proposed evolving classifier also has a learning capability: it improves the result on-line with each new document that is accessed. Finally, the proposed approach is characterized by low complexity. This paper reports the results of research that has been going on for more than two years at Lancaster University on the development of a novel clustering method that is suitable for real-time implementations. 
It is based on evolution principles and tries to address the limitations of existing clustering algorithms, which cannot cope with high-dimensional datasets in an on-line mode. This evolution-inspired and Nature-inspired approach introduces the new concept of potential values, which describe the fitness of a new sample (web document) to be the prototype of a new cluster. The potential is calculated without the need to store each previously encountered document, yet it takes into account the contextual similarity density between all previous documents in a recursive and thus computationally efficient way (thereby reducing memory requirements and improving speed compared with existing approaches). This paper examines the clustering of documents by contextual similarity using extracted keywords represented in a vector space model.",

keywords = "Clustering, Information Retrieval, Data Mining, Incremental Clustering, On-line Clustering",

author = "A Evans and Plamen Angelov and Xiaowei Zhou",

year = "2006",

month = dec,

language = "English",

note = "2nd Annual Symposium on Nature Inspired Smart Adaptive Systems ; Conference date: 29-11-2006 Through 01-12-2006",

}

RIS

TY - CONF

T1 - On-line Evolving Clustering of Web Documents

AU - Evans, A

AU - Angelov, Plamen

AU - Zhou, Xiaowei

PY - 2006/12

Y1 - 2006/12

N2 - This paper proposes an approach that uses evolving, incremental (on-line) clustering to automatically group relevant Web-based documents. It is centred on a recently introduced evolving fuzzy rule-based clustering approach and borrows heavily from Nature in the sense that it is evolution-inspired. That is, the structure and number of the clusters are not predefined but self-develop: clusters grow and shrink as new web documents are accessed. Existing Web-based search engine technology returns long lists of web pages that contain the user's search terms but are not presented in order of contextual similarity. For example, results for a search term that has more than one meaning, such as "cold", are presented to the user in a single list containing documents relating to both the Cold War and the common cold. If these results could be clustered “on the fly”, this improved presentation would allow the end user to find relevant documents more easily by inspecting one cluster of contextually similar documents rather than the entire list, which contains documents pertaining to irrelevant contexts. Special attention is paid to the similarity measure between textual documents. Euclidean, Levenshtein, and cosine measures have been used, with cosine dissimilarity/distance performing best and addressing the problem of a different number of features in each document. The proposed evolving classifier also has a learning capability: it improves the result on-line with each new document that is accessed. Finally, the proposed approach is characterized by low complexity. This paper reports the results of research that has been going on for more than two years at Lancaster University on the development of a novel clustering method that is suitable for real-time implementations. 
It is based on evolution principles and tries to address the limitations of existing clustering algorithms, which cannot cope with high-dimensional datasets in an on-line mode. This evolution-inspired and Nature-inspired approach introduces the new concept of potential values, which describe the fitness of a new sample (web document) to be the prototype of a new cluster. The potential is calculated without the need to store each previously encountered document, yet it takes into account the contextual similarity density between all previous documents in a recursive and thus computationally efficient way (thereby reducing memory requirements and improving speed compared with existing approaches). This paper examines the clustering of documents by contextual similarity using extracted keywords represented in a vector space model.

AB - This paper proposes an approach that uses evolving, incremental (on-line) clustering to automatically group relevant Web-based documents. It is centred on a recently introduced evolving fuzzy rule-based clustering approach and borrows heavily from Nature in the sense that it is evolution-inspired. That is, the structure and number of the clusters are not predefined but self-develop: clusters grow and shrink as new web documents are accessed. Existing Web-based search engine technology returns long lists of web pages that contain the user's search terms but are not presented in order of contextual similarity. For example, results for a search term that has more than one meaning, such as "cold", are presented to the user in a single list containing documents relating to both the Cold War and the common cold. If these results could be clustered “on the fly”, this improved presentation would allow the end user to find relevant documents more easily by inspecting one cluster of contextually similar documents rather than the entire list, which contains documents pertaining to irrelevant contexts. Special attention is paid to the similarity measure between textual documents. Euclidean, Levenshtein, and cosine measures have been used, with cosine dissimilarity/distance performing best and addressing the problem of a different number of features in each document. The proposed evolving classifier also has a learning capability: it improves the result on-line with each new document that is accessed. Finally, the proposed approach is characterized by low complexity. This paper reports the results of research that has been going on for more than two years at Lancaster University on the development of a novel clustering method that is suitable for real-time implementations. 
It is based on evolution principles and tries to address the limitations of existing clustering algorithms, which cannot cope with high-dimensional datasets in an on-line mode. This evolution-inspired and Nature-inspired approach introduces the new concept of potential values, which describe the fitness of a new sample (web document) to be the prototype of a new cluster. The potential is calculated without the need to store each previously encountered document, yet it takes into account the contextual similarity density between all previous documents in a recursive and thus computationally efficient way (thereby reducing memory requirements and improving speed compared with existing approaches). This paper examines the clustering of documents by contextual similarity using extracted keywords represented in a vector space model.

KW - Clustering

KW - Information Retrieval

KW - Data Mining

KW - Incremental Clustering

KW - On-line Clustering

M3 - Conference paper

T2 - 2nd Annual Symposium on Nature Inspired Smart Adaptive Systems

Y2 - 29 November 2006 through 1 December 2006

ER -
