The World Wide Web has evolved into an interactive information network, allowing web users to collaborate and share information on a massive scale. A large amount of the information now available on the World Wide Web is personal information, which has been disseminated either voluntarily (e.g. a personal web page, profile page) or involuntarily (e.g. telephone directory, electoral register). The sensitive nature of personal information and its widespread visibility have led to a rise in malevolent web practices such as lateral surveillance and identity theft. In order to avoid falling victim to these practices, web users are forced to find web resources (web pages, data feeds) which may contain their personal information and then decide which of those web resources do. This decision process disambiguates web resources which are identity web references for a given person; however, performing this process manually is time consuming and costly. Furthermore, as more information is published on a regular basis, the process must be repeated constantly and must handle an ever increasing information load.
In order to overcome the need for the manual discovery of identity web references, automated techniques are a necessity. To function effectively, such techniques require seed data - background knowledge describing the identity of a given person (e.g. their biographical information, their social network) - however, producing this seed data manually is expensive and restricted by access to resources: one of the hard problems within the machine learning community. In this paper we explain how the Social Web can be utilised as a source for this seed data. In recent years Social Web platforms such as Facebook, MySpace and Twitter have provided environments in which web users can construct bespoke digital identity representations. Profile pages are a common feature of such platforms, together with the social networks compiled by users. We investigate the hypothesis that users of Social Web platforms construct identity representations which mirror their real-world equivalents. We explore this hypothesis through a detailed user study which quantifies the overlap between digital and real-world identity information. The results of this study provide empirical evidence to support our hypothesis and also support the theoretical arguments made in similar sociological studies, which explore the relationship between real-world identities and the digital identity representations constructed on Social Web platforms.
Automating the process of disambiguation attempts to replicate the cognitive practice of decision making given prior knowledge about an individual. In the case of disambiguating identity web references, automated techniques use the provided seed data - leveraged from the Social Web - to detect whether a web resource cites a given individual. In this paper we present an overview of two distinct disambiguation techniques and explain the science behind such approaches.
Our first technique constructs a rule base from the provided seed data; the rules are then applied to web resources in order to infer whether a given web resource is an identity web reference. The use of rules allows a logical deduction to be made by inferencing over available information. This process mimics the approach by which humans decide whether a web resource cites them, e.g. "if a web resource contains my name and my email address then it refers to me". The digital identity representation gathered from the Social Web provides the necessary background knowledge to facilitate such decisions, by associating information present within a web resource with known information about an individual. Our second technique uses a machine learning strategy known as self-training, which first trains a classifier using background knowledge about a given person and then applies the classifier to a set of web resources. The classifier is then retrained on the strongest classifications, and the process is repeated, hence the term self-training. Unlike the use of rules, this technique allows features of web resources to be reused for future disambiguation decisions. For instance, if a web resource contains information about a person's work colleagues, then the technique can learn from this information and therefore enhance future disambiguation decisions.
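The two techniques described above can be illustrated with a minimal Python sketch. It is an assumption-laden illustration rather than the implementation evaluated in this paper: the seed data, the single name-and-email rule, the toy token-count scorer standing in for a classifier, and the names `SEED`, `rule_match`, `self_train` and `threshold` are all hypothetical.

```python
from collections import Counter

# Hypothetical seed data, as might be gathered from a Social Web profile.
SEED = {"name": "alice example", "email": "alice@example.org"}


def rule_match(text, seed):
    """Illustrative rule: name AND email present => identity web reference."""
    lowered = text.lower()
    return seed["name"] in lowered and seed["email"] in lowered


def tokens(text):
    """Lower-cased, whitespace-separated tokens of a web resource."""
    return set(text.lower().split())


def score(text, pos_counts, neg_counts):
    """Toy classifier score in [-1, 1] based on token evidence (an assumption,
    standing in for whichever classifier is actually used)."""
    toks = tokens(text)
    pos = sum(pos_counts[t] for t in toks)
    neg = sum(neg_counts[t] for t in toks)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


def self_train(pos_seed, neg_seed, unlabeled, threshold=0.5, max_rounds=10):
    """Self-training loop: train, classify, keep the strongest
    classifications as new training data, then retrain."""
    positives, negatives = list(pos_seed), list(neg_seed)
    remaining = list(unlabeled)
    for _ in range(max_rounds):
        # (Re)train: count in how many labelled resources each token occurs.
        pos_counts = Counter(t for doc in positives for t in tokens(doc))
        neg_counts = Counter(t for doc in negatives for t in tokens(doc))
        confident, rest = [], []
        for doc in remaining:
            s = score(doc, pos_counts, neg_counts)
            (confident if abs(s) >= threshold else rest).append((doc, s))
        if not confident:
            break  # nothing classified strongly enough; stop iterating
        for doc, s in confident:
            (positives if s > 0 else negatives).append(doc)
        remaining = [doc for doc, _ in rest]
    return positives, negatives, remaining


if __name__ == "__main__":
    print(rule_match("Contact Alice Example at alice@example.org", SEED))
    pos, neg, rest = self_train(
        ["alice example alice@example.org"],
        ["weather forecast sports"],
        ["alice example works with bob sample",
         "bob sample project report",
         "sports results today"],
    )
    print(pos, neg, rest)
```

In this toy run, the resource mentioning only the hypothetical colleague "bob sample" cannot be labelled in the first round, but is labelled as a reference in the second round once a resource linking that colleague to the person has been added to the training data. This mirrors the reuse of features that distinguishes self-training from the fixed rule.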
By conducting an extensive evaluation of our automated disambiguation techniques, we have found that they can achieve accuracy levels that surpass human processing. Web users have web presence levels which vary greatly depending on their visibility and propensity to disseminate their personal information. From our results we discuss the implications of web presence levels for automated disambiguation techniques compared to humans performing the same task. The findings indicate that automated techniques significantly outperform humans when disambiguating identity web references for individuals who have low web presence levels, whilst accuracy levels for individuals with high levels of web presence are similar. We believe that these results demonstrate empirically that automated techniques can reliably detect sparse identity web references and are suitable for application over large information spaces.