Data quality measures for identity resolution

Associated organisational unit

UCREL - University Centre for Computer Corpus Research on Language

Electronic data

2018edwardsphd
Final published version, 1.16 MB, PDF document
Available under license: CC BY-NC-ND: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License

Text available via DOI:

https://doi.org/10.17635/lancaster/thesis/259
Final published version

Research output: Thesis › Doctoral Thesis

Published

Matthew Edwards

Publication date	2018
Number of pages	252
Qualification	PhD
Awarding Institution	Lancaster University
Supervisors/Advisors	Rashid, Awais (Supervisor); Rayson, Paul (Supervisor)
Publisher	Lancaster University
Original language	English

Abstract

The explosion in popularity of online social networks has led to increased interest in identity resolution from security practitioners. Being able to link the multiple online accounts of a user can be of use in verifying identity attributes and in tracking the activity of malicious users. At the same time, privacy researchers are studying the same phenomenon to identify the privacy risks posed by re-identification attacks.
Existing literature has explored how particular components of an online identity may be used to connect profiles, but few, if any, studies have attempted to assess the comparative value of information attributes. In addition, few of the methods being reported are easily comparable, due to difficulties with obtaining and sharing ground-truth data. Attempts to gain a comprehensive understanding of the identifiability of profile attributes are hindered by these issues.
With a focus on overcoming these hurdles to effective research, this thesis first develops a methodology for sampling ground-truth data from online social networks. Building on this with reference to both existing literature and samples of real profile data, this thesis describes and grounds a comprehensive matching schema of profile attributes. The work then defines data quality measures which are important for identity resolution, and measures the availability, consistency and uniqueness of the schema’s contents. The developed measurements are then applied in a feature selection scheme to reduce the impact of missing data issues common in identity resolution.
Finally, this thesis addresses the purposes to which identity resolution may be applied, defining the further application-oriented data quality measurements of novelty, veracity and relevance, and demonstrating their calculation and application for a particular use case: evaluating the social engineering vulnerability of an organisation.
