On-line Evolving Clustering of Web Documents

Computing and Communications

Associated organisational units

Keywords

Clustering, Information Retrieval, Data Mining, Incremental Clustering, On-line Clustering, DCS-publications-id, inproc-536, DCS-publications-credits, dsp, DCS-publications-personnel-id, 90, 82, 102

View graph of relations

Research output: Contribution to conference - Without ISBN/ISSN › Conference paper › peer-review

Published

A Evans
Plamen Angelov
Xiaowei Zhou

More...

Publication date	12/2006
Number of pages	6
<mark>Original language</mark>	English
Event	2nd Annual Symposium on Nature Inspired Smart Adaptive Systems - Santa Cruz, Tenerife, Spain Duration: 29/11/2006 → 1/12/2006

Conference

Conference	2nd Annual Symposium on Nature Inspired Smart Adaptive Systems
City	Santa Cruz, Tenerife, Spain
Period	29/11/06 → 1/12/06

Abstract

In this paper an approach that is using evolving, incremental (on-line) clustering to automatically group relevant Web-based documents is proposed. It is centred on a recently introduced evolving fuzzy rule-based clustering approach and borrows heavily from the Nature in the sense that it is evolution-inspired. That is, the structure of the clusters and their number is not predefined, but it self-develops, they grow and shrink when new web documents are accessed. Existing Web-based search engine technology returns long lists of web pages that contain the user's search term query but are not presented in order of contextual similarity. For example, search terms that have more than one meaning such as "cold" are presented to the user in a list containing documents relating to the Cold War and the common cold. If these results could be clustered “on the fly” then this improved presentation of results would allow the end user to find relevant documents more easily by requiring the inspection of one cluster of contextually similar documents rather than entire list of documents containing information pertaining to irrelevant contexts. An issue that is paid a special attention to is the similarity measure between the textual documents. Euclidean, Levenstein, and Cosine similarity measures have been used with cosine dissimilarity/distance performing best and addressing the problem of different number of features in each document. The proposed evolving classifier has also learning capability – it improves the result on-line with any new document that has been accessed. Finally, the proposed approach is characterized by low complexity. This paper reports the results of research that is going on for more than two years at Lancaster University on development of a novel clustering method that is suitable for real-time implementations. It is based on evolution principles and tries to address the limitations of existing clustering algorithms which cannot cope in an online mode with high dimensional datasets. This evolution-inspired and Nature-inspired approach introduces the new concept of potential values which describes the fitness of a new sample (web document) to be the prototype of a new cluster without the need to store each previously encountered documents but taking into account the contextual similarity density between all previous documents in a recursive and thus computationally efficient way (thereby reducing memory requirements and improving speed compared with existing approaches). This paper examines the clustering of documents by contextual similarity using extracted keywords represented in a vector space model.

Research

Associated organisational units

Links

Keywords