Electronic data

  • 2018katieyatesphd

    Final published version, 14.7 MB, PDF document

    Available under license: CC BY-NC: Creative Commons Attribution-NonCommercial 4.0 International License

Text available via DOI: 10.17635/lancaster/thesis/204

Low-Density Cluster Separators for Large, High-Dimensional, Mixed and Non-Linearly Separable Data.

Research output: Thesis › Doctoral Thesis

Published

Standard

Low-Density Cluster Separators for Large, High-Dimensional, Mixed and Non-Linearly Separable Data. / Yates, Katie.
Lancaster: Lancaster University, 2018. 218 p.

Vancouver

Yates K. Low-Density Cluster Separators for Large, High-Dimensional, Mixed and Non-Linearly Separable Data. Lancaster: Lancaster University, 2018. 218 p. doi: 10.17635/lancaster/thesis/204

Bibtex

@phdthesis{970f6221666442bf9255c42fd2fb2166,
title = "Low-Density Cluster Separators for Large, High-Dimensional, Mixed and Non-Linearly Separable Data.",
abstract = "The location of groups of similar observations (clusters) in data is a well-studied problem, and has many practical applications. There are a wide range of approaches to clustering, which rely on different definitions of similarity, and are appropriate for datasets with different characteristics. Despite a rich literature, there exist a number of open problems in clustering, and limitations to existing algorithms. This thesis develops methodology for clustering high-dimensional, mixed datasets with complex clustering structures, using low-density cluster separators that bi-partition datasets using cluster boundaries that pass through regions of minimal density, separating regions of high probability density, associated with clusters. The bi-partitions arising from a succession of minimum density cluster separators are combined using divisive hierarchical and partitional algorithms, to locate a complete clustering, while estimating the number of clusters. The proposed algorithms locate cluster separators using one-dimensional arbitrarily oriented subspaces, circumventing the challenges associated with clustering in high-dimensional spaces. This requires continuous observations; thus, to extend the applicability of the proposed algorithms to mixed datasets, methods for producing an appropriate continuous representation of datasets containing non-continuous features are investigated. The exact evaluation of the density intersected by a cluster boundary is restricted to linear separators. This limitation is lifted by a non-linear mapping of the original observations into a feature space, in which a linear separator permits the correct identification of non-linearly separable clusters in the original dataset. In large, high-dimensional datasets, searching for one-dimensional subspaces, which result in a minimum density separator is computationally expensive. Therefore, a computationally efficient approach to low-density cluster separation using approximately optimal projection directions is proposed, which searches over a collection of one-dimensional random projections for an appropriate subspace for cluster identification. The proposed approaches produce high-quality partitions, that are competitive with well-established and state-of-the-art algorithms.",
author = "Katie Yates",
year = "2018",
doi = "10.17635/lancaster/thesis/204",
language = "English",
publisher = "Lancaster University",
school = "Lancaster University",

}

RIS

TY - BOOK

T1 - Low-Density Cluster Separators for Large, High-Dimensional, Mixed and Non-Linearly Separable Data.

AU - Yates, Katie

PY - 2018

Y1 - 2018

N2 - The location of groups of similar observations (clusters) in data is a well-studied problem, and has many practical applications. There are a wide range of approaches to clustering, which rely on different definitions of similarity, and are appropriate for datasets with different characteristics. Despite a rich literature, there exist a number of open problems in clustering, and limitations to existing algorithms. This thesis develops methodology for clustering high-dimensional, mixed datasets with complex clustering structures, using low-density cluster separators that bi-partition datasets using cluster boundaries that pass through regions of minimal density, separating regions of high probability density, associated with clusters. The bi-partitions arising from a succession of minimum density cluster separators are combined using divisive hierarchical and partitional algorithms, to locate a complete clustering, while estimating the number of clusters. The proposed algorithms locate cluster separators using one-dimensional arbitrarily oriented subspaces, circumventing the challenges associated with clustering in high-dimensional spaces. This requires continuous observations; thus, to extend the applicability of the proposed algorithms to mixed datasets, methods for producing an appropriate continuous representation of datasets containing non-continuous features are investigated. The exact evaluation of the density intersected by a cluster boundary is restricted to linear separators. This limitation is lifted by a non-linear mapping of the original observations into a feature space, in which a linear separator permits the correct identification of non-linearly separable clusters in the original dataset. In large, high-dimensional datasets, searching for one-dimensional subspaces, which result in a minimum density separator is computationally expensive. Therefore, a computationally efficient approach to low-density cluster separation using approximately optimal projection directions is proposed, which searches over a collection of one-dimensional random projections for an appropriate subspace for cluster identification. The proposed approaches produce high-quality partitions, that are competitive with well-established and state-of-the-art algorithms.

U2 - 10.17635/lancaster/thesis/204

DO - 10.17635/lancaster/thesis/204

M3 - Doctoral Thesis

PB - Lancaster University

CY - Lancaster

ER -