Home > Research > Publications & Outputs > Projection methods for clustering and semi-supe...

Electronic data

  • 2016hofmeyrphd

    Final published version, 6.04 MB, PDF document

    Available under license: CC BY-ND: Creative Commons Attribution-NoDerivatives 4.0 International License

Text available via DOI:

View graph of relations

Projection methods for clustering and semi-supervised classification

Research output: ThesisDoctoral Thesis

Published
Publication date2016
Number of pages233
QualificationPhD
Awarding Institution
Supervisors/Advisors
Award date14/12/2016
Publisher
  • Lancaster University
<mark>Original language</mark>English

Abstract

This thesis focuses on data projection methods for the purposes of clustering
and semi-supervised classification, with a primary focus on clustering.
A number of contributions are presented which address this problem in a
principled manner; using projection pursuit formulations to identify subspaces
which contain useful information for the clustering task. Projection
methods are extremely useful in high dimensional applications, and situations
in which the data contain irrelevant dimensions which can be counterinformative
for the clustering task. The final contribution addresses high
dimensionality in the context of a data stream. Data streams and high dimensionality
have been identified as two of the key challenges in data clustering.
The first piece of work is motivated by identifying the minimum density
hyperplane separator in the finite sample setting. This objective is directly
related to the problem of discovering clusters defined as connected regions
of high data density, which is a widely adopted definition in non-parametric
statistics and machine learning. A thorough investigation into the theoretical
aspects of this method, as well as the practical task of solving the associated
optimisation problem efficiently is presented. The proposed methodology is
applied to both clustering and semi-supervised classification problems, and
is shown to reliably find low density hyperplane separators in both contexts.
The second and third contributions focus on a different approach to clustering
based on graph cuts. The minimum normalised graph cut objective has
gained considerable attention as relaxations of the objective have been developed,
which make them solvable for reasonably well sized problems. This
has been adopted by the highly popular spectral clustering methods. The
second piece of work focuses on identifying the optimal subspace in which
to perform spectral clustering, by minimising the second eigenvalue of the
graph Laplacian for a graph defined over the data within that subspace. A
rigorous treatment of this objective is presented, and an algorithm is proposed
for its optimisation. An approximation method is proposed which
allows this method to be applied to much larger problems than would otherwise
be possible. An extension of this work deals with the spectral projection
pursuit method for semi-supervised classification.
iii
The third body of work looks at minimising the normalised graph cut using
hyperplane separators. This formulation allows for the exact normalised
cut to be computed, rather than the spectral relaxation. It also allows for a
computationally efficient method for optimisation. The asymptotic properties
of the normalised cut based on a hyperplane separator are investigated, and
shown to have similarities with the clustering objective based on low density
separation. In fact, both the methods in the second and third works are
shown to be connected with the first, in that all three have the same solution
asymptotically, as their relative scaling parameters are reduced to zero.
The final body of work addresses both problems of high dimensionality
and incremental clustering in a data stream context. A principled statistical
framework is adopted, in which clustering by low density separation again
becomes the focal objective. A divisive hierarchical clustering model is proposed,
using a collection of low density hyperplanes. The adopted framework
provides well founded methodology for determining the number of
clusters automatically, and also identifying changes in the data stream which
are relevant to the clustering objective. It is apparent that no existing methods
can make both of these claims.