Electronic data

  • Peng2018_final

    Rights statement: ©2018 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

    Accepted author manuscript, 110 KB, PDF document

    Available under license: CC BY-NC: Creative Commons Attribution-NonCommercial 4.0 International License

Subspace Clustering of Very Sparse High-Dimensional Data

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN > Conference contribution/Paper > peer-review

Published
Publication date: 24/01/2019
Host publication: 2018 IEEE International Conference on Big Data (Big Data)
Publisher: IEEE
Pages: 3780-3783
Number of pages: 4
ISBN (electronic): 9781538650356
Original language: English
Event: Advances in High-Dimensional Big Data in conjunction with the 2018 IEEE International Conference on Big Data (IEEE BigData 2018) - Seattle, United States
Duration: 10/12/2018 to 13/12/2018
https://sites.google.com/site/adhdbigdata3/home

Workshop

Workshop: Advances in High-Dimensional Big Data in conjunction with the 2018 IEEE International Conference on Big Data (IEEE BigData 2018)
Country/Territory: United States
City: Seattle
Period: 10/12/18 to 13/12/18

Abstract

In this paper we consider the problem of clustering collections of very short texts using subspace clustering. This problem arises in many applications such as product categorisation, fraud detection, and sentiment analysis. The main challenge lies in the fact that the vectorial representation of short texts is both high-dimensional, due to the large number of unique terms in the corpus, and extremely sparse, as each text contains a very small number of words with no repetition. We propose a new, simple subspace clustering algorithm that relies on linear algebra to cluster such datasets. Experimental results on identifying product categories from product names obtained from the US Amazon website indicate that the algorithm can be competitive against state-of-the-art clustering algorithms.
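
To make the representation described in the abstract concrete, the following minimal Python sketch builds binary term vectors for a handful of short texts and measures how sparse they are. It illustrates only the data representation and is not the clustering algorithm proposed in the paper. The product names and the helper functions `build_binary_vectors` and `density` are hypothetical, introduced purely for illustration.

```python
def build_binary_vectors(texts):
    """Return (vocabulary, rows) for a list of short texts.

    vocabulary maps each distinct term to a column index.
    rows[i] lists the column indices that are 1 for text i; every other
    entry of the vector is 0, so the full matrix is never stored.
    """
    vocabulary = {}
    rows = []
    for text in texts:
        # A set discards repeated words: each entry is either 0 or 1.
        terms = set(text.lower().split())
        for term in sorted(terms):
            vocabulary.setdefault(term, len(vocabulary))
        rows.append(sorted(vocabulary[term] for term in terms))
    return vocabulary, rows


def density(vocabulary, rows):
    """Fraction of non-zero entries in the texts-by-terms matrix."""
    nonzeros = sum(len(row) for row in rows)
    return nonzeros / (len(rows) * len(vocabulary))


# Hypothetical product names, not taken from the paper's dataset.
product_names = [
    "stainless steel water bottle",
    "wireless optical mouse",
    "organic green tea bags",
    "leather wallet",
]

vocabulary, rows = build_binary_vectors(product_names)
print(len(vocabulary), "dimensions;", "density =", density(vocabulary, rows))
```

In this toy example no term is shared between texts, so every new text adds new dimensions while contributing only a few non-zero entries. On a real corpus the vocabulary grows to many thousands of terms while each text still has only a handful of non-zeros, which is the high-dimensional, very sparse regime the abstract refers to.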
