Empirical data analytics

Electronic data

EAD_accepted_version
Rights statement: This is the peer reviewed version of the following article: Angelov, P., Gu, X. and Kangin, D. (2017), Empirical Data Analytics. Int. J. Intell. Syst., 32: 1261–1284. doi:10.1002/int.21899 which has been published in final form at http://onlinelibrary.wiley.com/doi/10.1002/int.21899/abstract. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for self-archiving.
Accepted author manuscript, 1.46 MB, PDF document
Available under license: CC BY-NC: Creative Commons Attribution-NonCommercial 4.0 International License

Text available via DOI:

https://doi.org/10.1002/int.21899
Final published version

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published

Standard

Empirical data analytics. / Angelov, Plamen Parvanov; Gu, Xiaowei; Kangin, Dmitry.
In: International Journal of Intelligent Systems, Vol. 32, No. 12, 12.2017, p. 1261-1284.

Harvard

Angelov, PP, Gu, X & Kangin, D 2017, 'Empirical data analytics', International Journal of Intelligent Systems, vol. 32, no. 12, pp. 1261-1284. https://doi.org/10.1002/int.21899

APA

Angelov, P. P., Gu, X., & Kangin, D. (2017). Empirical data analytics. International Journal of Intelligent Systems, 32(12), 1261-1284. https://doi.org/10.1002/int.21899

Vancouver

Angelov PP, Gu X, Kangin D. Empirical data analytics. International Journal of Intelligent Systems. 2017 Dec;32(12):1261-1284. Epub 2017 Mar 21. doi: 10.1002/int.21899

Author

Angelov, Plamen Parvanov; Gu, Xiaowei; Kangin, Dmitry. / Empirical data analytics. In: International Journal of Intelligent Systems. 2017; Vol. 32, No. 12. pp. 1261-1284.

Bibtex

@article{142cb1e3ddca4881bb3d54cad3d3fe24,

title = "Empirical data analytics",

abstract = "In this paper, we propose an approach to data analysis, which is based entirely on the empirical observations of discrete data samples and the relative proximity of these points in the data space. At the core of the proposed new approach is the typicality—an empirically derived quantity that resembles probability. This nonparametric measure is a normalized form of the square centrality (centrality is a measure of closeness used in graph theory). It is also closely linked to the cumulative proximity and eccentricity (a measure of the tail of the distributions that is very useful for anomaly detection and analysis of extreme values). In this paper, we introduce and study two types of typicality, namely its local and global versions. The local typicality resembles the well-known probability density function (pdf), probability mass function, and fuzzy set membership but differs from all of them. The global typicality, on the other hand, resembles well-known histograms but also differs from them. A distinctive feature of the proposed new approach, empirical data analysis (EDA), is that it is not limited by restrictive impractical prior assumptions about the data generation model as the traditional probability theory and statistical learning approaches are. Moreover, it does not require an explicit and binary assumption of either randomness or determinism of the empirically observed data, their independence, or even their number (it can be as low as a couple of data samples). The typicality is considered as a fundamental quantity in the pattern analysis, which is derived directly from data and is stated in a discrete form in contrast to the traditional approach where a continuous pdf is assumed a priori and estimated from data afterward. The typicality introduced in this paper is free from the paradoxes of the pdf. Typicality is objectivist while the fuzzy sets and the belief-based branch of the probability theory are subjectivist. 
The local typicality is expressed in a closed analytical form and can be calculated recursively, thus, computationally very efficiently. The other nonparametric ensemble properties of the data introduced and studied in this paper, namely, the square centrality, cumulative proximity, and eccentricity, can also be updated recursively for various types of distance metrics. Finally, a new type of classifier called na{\"i}ve typicality-based EDA class is introduced, which is based on the newly introduced global typicality. This is only one of the wide range of possible applications of EDA including but not limited to anomaly detection, clustering, classification, prediction, control, rare events analysis, etc., which will be the subject of further research.",

author = "Angelov, {Plamen Parvanov} and Xiaowei Gu and Dmitry Kangin",

note = "This is the peer reviewed version of the following article: Angelov, P., Gu, X. and Kangin, D. (2017), Empirical Data Analytics. Int. J. Intell. Syst., 32: 1261–1284. doi:10.1002/int.21899 which has been published in final form at http://onlinelibrary.wiley.com/doi/10.1002/int.21899/abstract. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for self-archiving.",

year = "2017",

month = dec,

doi = "10.1002/int.21899",

language = "English",

volume = "32",

pages = "1261--1284",

journal = "International Journal of Intelligent Systems",

issn = "0884-8173",

publisher = "John Wiley and Sons Ltd",

number = "12",

}

RIS

TY - JOUR

T1 - Empirical data analytics

AU - Angelov, Plamen Parvanov

AU - Gu, Xiaowei

AU - Kangin, Dmitry

N1 - This is the peer reviewed version of the following article: Angelov, P., Gu, X. and Kangin, D. (2017), Empirical Data Analytics. Int. J. Intell. Syst., 32: 1261–1284. doi:10.1002/int.21899 which has been published in final form at http://onlinelibrary.wiley.com/doi/10.1002/int.21899/abstract. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for self-archiving.

PY - 2017/12

Y1 - 2017/12

N2 - In this paper, we propose an approach to data analysis, which is based entirely on the empirical observations of discrete data samples and the relative proximity of these points in the data space. At the core of the proposed new approach is the typicality—an empirically derived quantity that resembles probability. This nonparametric measure is a normalized form of the square centrality (centrality is a measure of closeness used in graph theory). It is also closely linked to the cumulative proximity and eccentricity (a measure of the tail of the distributions that is very useful for anomaly detection and analysis of extreme values). In this paper, we introduce and study two types of typicality, namely its local and global versions. The local typicality resembles the well-known probability density function (pdf), probability mass function, and fuzzy set membership but differs from all of them. The global typicality, on the other hand, resembles well-known histograms but also differs from them. A distinctive feature of the proposed new approach, empirical data analysis (EDA), is that it is not limited by restrictive impractical prior assumptions about the data generation model as the traditional probability theory and statistical learning approaches are. Moreover, it does not require an explicit and binary assumption of either randomness or determinism of the empirically observed data, their independence, or even their number (it can be as low as a couple of data samples). The typicality is considered as a fundamental quantity in the pattern analysis, which is derived directly from data and is stated in a discrete form in contrast to the traditional approach where a continuous pdf is assumed a priori and estimated from data afterward. The typicality introduced in this paper is free from the paradoxes of the pdf. Typicality is objectivist while the fuzzy sets and the belief-based branch of the probability theory are subjectivist. 
The local typicality is expressed in a closed analytical form and can be calculated recursively, thus, computationally very efficiently. The other nonparametric ensemble properties of the data introduced and studied in this paper, namely, the square centrality, cumulative proximity, and eccentricity, can also be updated recursively for various types of distance metrics. Finally, a new type of classifier called naïve typicality-based EDA class is introduced, which is based on the newly introduced global typicality. This is only one of the wide range of possible applications of EDA including but not limited to anomaly detection, clustering, classification, prediction, control, rare events analysis, etc., which will be the subject of further research.

AB - In this paper, we propose an approach to data analysis, which is based entirely on the empirical observations of discrete data samples and the relative proximity of these points in the data space. At the core of the proposed new approach is the typicality—an empirically derived quantity that resembles probability. This nonparametric measure is a normalized form of the square centrality (centrality is a measure of closeness used in graph theory). It is also closely linked to the cumulative proximity and eccentricity (a measure of the tail of the distributions that is very useful for anomaly detection and analysis of extreme values). In this paper, we introduce and study two types of typicality, namely its local and global versions. The local typicality resembles the well-known probability density function (pdf), probability mass function, and fuzzy set membership but differs from all of them. The global typicality, on the other hand, resembles well-known histograms but also differs from them. A distinctive feature of the proposed new approach, empirical data analysis (EDA), is that it is not limited by restrictive impractical prior assumptions about the data generation model as the traditional probability theory and statistical learning approaches are. Moreover, it does not require an explicit and binary assumption of either randomness or determinism of the empirically observed data, their independence, or even their number (it can be as low as a couple of data samples). The typicality is considered as a fundamental quantity in the pattern analysis, which is derived directly from data and is stated in a discrete form in contrast to the traditional approach where a continuous pdf is assumed a priori and estimated from data afterward. The typicality introduced in this paper is free from the paradoxes of the pdf. Typicality is objectivist while the fuzzy sets and the belief-based branch of the probability theory are subjectivist. 
The local typicality is expressed in a closed analytical form and can be calculated recursively, thus, computationally very efficiently. The other nonparametric ensemble properties of the data introduced and studied in this paper, namely, the square centrality, cumulative proximity, and eccentricity, can also be updated recursively for various types of distance metrics. Finally, a new type of classifier called naïve typicality-based EDA class is introduced, which is based on the newly introduced global typicality. This is only one of the wide range of possible applications of EDA including but not limited to anomaly detection, clustering, classification, prediction, control, rare events analysis, etc., which will be the subject of further research.

U2 - 10.1002/int.21899

DO - 10.1002/int.21899

M3 - Journal article

VL - 32

SP - 1261

EP - 1284

JO - International Journal of Intelligent Systems

JF - International Journal of Intelligent Systems

SN - 0884-8173

IS - 12

ER -
