Home > Research > Publications & Outputs > Projection methods for clustering and semi-supe...

Electronic data

  • 2016hofmeyrphd

    Final published version, 6.04 MB, PDF document

    Available under license: CC BY-ND: Creative Commons Attribution-NoDerivatives 4.0 International License

Text available via DOI:

View graph of relations

Projection methods for clustering and semi-supervised classification

Research output: ThesisDoctoral Thesis

Published

Standard

Projection methods for clustering and semi-supervised classification. / Hofmeyr, David.
Lancaster University, 2016. 233 p.

Research output: ThesisDoctoral Thesis

Harvard

APA

Vancouver

Hofmeyr D. Projection methods for clustering and semi-supervised classification. Lancaster University, 2016. 233 p. doi: 10.17635/lancaster/thesis/58

Author

Bibtex

@phdthesis{9364da435f324c1f8c107f79181d0b22,
title = "Projection methods for clustering and semi-supervised classification",
abstract = "This thesis focuses on data projection methods for the purposes of clusteringand semi-supervised classification, with a primary focus on clustering.A number of contributions are presented which address this problem in aprincipled manner; using projection pursuit formulations to identify subspaceswhich contain useful information for the clustering task. Projectionmethods are extremely useful in high dimensional applications, and situationsin which the data contain irrelevant dimensions which can be counterinformativefor the clustering task. The final contribution addresses highdimensionality in the context of a data stream. Data streams and high dimensionalityhave been identified as two of the key challenges in data clustering.The first piece of work is motivated by identifying the minimum densityhyperplane separator in the finite sample setting. This objective is directlyrelated to the problem of discovering clusters defined as connected regionsof high data density, which is a widely adopted definition in non-parametricstatistics and machine learning. A thorough investigation into the theoreticalaspects of this method, as well as the practical task of solving the associatedoptimisation problem efficiently is presented. The proposed methodology isapplied to both clustering and semi-supervised classification problems, andis shown to reliably find low density hyperplane separators in both contexts.The second and third contributions focus on a different approach to clusteringbased on graph cuts. The minimum normalised graph cut objective hasgained considerable attention as relaxations of the objective have been developed,which make them solvable for reasonably well sized problems. Thishas been adopted by the highly popular spectral clustering methods. Thesecond piece of work focuses on identifying the optimal subspace in whichto perform spectral clustering, by minimising the second eigenvalue of thegraph Laplacian for a graph defined over the data within that subspace. Arigorous treatment of this objective is presented, and an algorithm is proposedfor its optimisation. An approximation method is proposed whichallows this method to be applied to much larger problems than would otherwisebe possible. An extension of this work deals with the spectral projectionpursuit method for semi-supervised classification.iiiThe third body of work looks at minimising the normalised graph cut usinghyperplane separators. This formulation allows for the exact normalisedcut to be computed, rather than the spectral relaxation. It also allows for acomputationally efficient method for optimisation. The asymptotic propertiesof the normalised cut based on a hyperplane separator are investigated, andshown to have similarities with the clustering objective based on low densityseparation. In fact, both the methods in the second and third works areshown to be connected with the first, in that all three have the same solutionasymptotically, as their relative scaling parameters are reduced to zero.The final body of work addresses both problems of high dimensionalityand incremental clustering in a data stream context. A principled statisticalframework is adopted, in which clustering by low density separation againbecomes the focal objective. A divisive hierarchical clustering model is proposed,using a collection of low density hyperplanes. The adopted frameworkprovides well founded methodology for determining the number ofclusters automatically, and also identifying changes in the data stream whichare relevant to the clustering objective. It is apparent that no existing methodscan make both of these claims.",
author = "David Hofmeyr",
year = "2016",
doi = "10.17635/lancaster/thesis/58",
language = "English",
publisher = "Lancaster University",
school = "Lancaster University",

}

RIS

TY - BOOK

T1 - Projection methods for clustering and semi-supervised classification

AU - Hofmeyr, David

PY - 2016

Y1 - 2016

N2 - This thesis focuses on data projection methods for the purposes of clusteringand semi-supervised classification, with a primary focus on clustering.A number of contributions are presented which address this problem in aprincipled manner; using projection pursuit formulations to identify subspaceswhich contain useful information for the clustering task. Projectionmethods are extremely useful in high dimensional applications, and situationsin which the data contain irrelevant dimensions which can be counterinformativefor the clustering task. The final contribution addresses highdimensionality in the context of a data stream. Data streams and high dimensionalityhave been identified as two of the key challenges in data clustering.The first piece of work is motivated by identifying the minimum densityhyperplane separator in the finite sample setting. This objective is directlyrelated to the problem of discovering clusters defined as connected regionsof high data density, which is a widely adopted definition in non-parametricstatistics and machine learning. A thorough investigation into the theoreticalaspects of this method, as well as the practical task of solving the associatedoptimisation problem efficiently is presented. The proposed methodology isapplied to both clustering and semi-supervised classification problems, andis shown to reliably find low density hyperplane separators in both contexts.The second and third contributions focus on a different approach to clusteringbased on graph cuts. The minimum normalised graph cut objective hasgained considerable attention as relaxations of the objective have been developed,which make them solvable for reasonably well sized problems. Thishas been adopted by the highly popular spectral clustering methods. Thesecond piece of work focuses on identifying the optimal subspace in whichto perform spectral clustering, by minimising the second eigenvalue of thegraph Laplacian for a graph defined over the data within that subspace. Arigorous treatment of this objective is presented, and an algorithm is proposedfor its optimisation. An approximation method is proposed whichallows this method to be applied to much larger problems than would otherwisebe possible. An extension of this work deals with the spectral projectionpursuit method for semi-supervised classification.iiiThe third body of work looks at minimising the normalised graph cut usinghyperplane separators. This formulation allows for the exact normalisedcut to be computed, rather than the spectral relaxation. It also allows for acomputationally efficient method for optimisation. The asymptotic propertiesof the normalised cut based on a hyperplane separator are investigated, andshown to have similarities with the clustering objective based on low densityseparation. In fact, both the methods in the second and third works areshown to be connected with the first, in that all three have the same solutionasymptotically, as their relative scaling parameters are reduced to zero.The final body of work addresses both problems of high dimensionalityand incremental clustering in a data stream context. A principled statisticalframework is adopted, in which clustering by low density separation againbecomes the focal objective. A divisive hierarchical clustering model is proposed,using a collection of low density hyperplanes. The adopted frameworkprovides well founded methodology for determining the number ofclusters automatically, and also identifying changes in the data stream whichare relevant to the clustering objective. It is apparent that no existing methodscan make both of these claims.

AB - This thesis focuses on data projection methods for the purposes of clusteringand semi-supervised classification, with a primary focus on clustering.A number of contributions are presented which address this problem in aprincipled manner; using projection pursuit formulations to identify subspaceswhich contain useful information for the clustering task. Projectionmethods are extremely useful in high dimensional applications, and situationsin which the data contain irrelevant dimensions which can be counterinformativefor the clustering task. The final contribution addresses highdimensionality in the context of a data stream. Data streams and high dimensionalityhave been identified as two of the key challenges in data clustering.The first piece of work is motivated by identifying the minimum densityhyperplane separator in the finite sample setting. This objective is directlyrelated to the problem of discovering clusters defined as connected regionsof high data density, which is a widely adopted definition in non-parametricstatistics and machine learning. A thorough investigation into the theoreticalaspects of this method, as well as the practical task of solving the associatedoptimisation problem efficiently is presented. The proposed methodology isapplied to both clustering and semi-supervised classification problems, andis shown to reliably find low density hyperplane separators in both contexts.The second and third contributions focus on a different approach to clusteringbased on graph cuts. The minimum normalised graph cut objective hasgained considerable attention as relaxations of the objective have been developed,which make them solvable for reasonably well sized problems. Thishas been adopted by the highly popular spectral clustering methods. Thesecond piece of work focuses on identifying the optimal subspace in whichto perform spectral clustering, by minimising the second eigenvalue of thegraph Laplacian for a graph defined over the data within that subspace. Arigorous treatment of this objective is presented, and an algorithm is proposedfor its optimisation. An approximation method is proposed whichallows this method to be applied to much larger problems than would otherwisebe possible. An extension of this work deals with the spectral projectionpursuit method for semi-supervised classification.iiiThe third body of work looks at minimising the normalised graph cut usinghyperplane separators. This formulation allows for the exact normalisedcut to be computed, rather than the spectral relaxation. It also allows for acomputationally efficient method for optimisation. The asymptotic propertiesof the normalised cut based on a hyperplane separator are investigated, andshown to have similarities with the clustering objective based on low densityseparation. In fact, both the methods in the second and third works areshown to be connected with the first, in that all three have the same solutionasymptotically, as their relative scaling parameters are reduced to zero.The final body of work addresses both problems of high dimensionalityand incremental clustering in a data stream context. A principled statisticalframework is adopted, in which clustering by low density separation againbecomes the focal objective. A divisive hierarchical clustering model is proposed,using a collection of low density hyperplanes. The adopted frameworkprovides well founded methodology for determining the number ofclusters automatically, and also identifying changes in the data stream whichare relevant to the clustering objective. It is apparent that no existing methodscan make both of these claims.

U2 - 10.17635/lancaster/thesis/58

DO - 10.17635/lancaster/thesis/58

M3 - Doctoral Thesis

PB - Lancaster University

ER -