Electronic data

  • 2018peersmanphd

    Final published version, 2.15 MB, PDF document

    Available under license: CC BY-ND: Creative Commons Attribution-NoDerivatives 4.0 International License

Detecting deceptive behaviour in the wild: text mining for online child protection in the presence of noisy and adversarial social media communications

Research output: Thesis (Doctoral Thesis)

Published

Bibtex

@phdthesis{cc97d78aeab8457093513a74cc6b0426,
title = "Detecting deceptive behaviour in the wild: text mining for online child protection in the presence of noisy and adversarial social media communications",
abstract = "A real-life application of text mining research “in the wild”, i.e. in online social media, differs from more general applications in that its defining characteristics are both domain and process dependent. This gives rise to a number of challenges of which contemporary research has only scratched the surface. More specifically, a text mining approach applied in the wild typically has no control over the dataset size. Hence, the system has to be robust towards limited data availability, a variable number of samples across users and a highly skewed dataset. Additionally, the quality of the data cannot be guaranteed. As a result, the approach needs to be tolerant to a certain degree of linguistic noise. Finally, it has to be robust towards deceptive behaviour or adversaries. This thesis examines the viability of a text mining approach for supporting cybercrime investigations pertaining to online child protection. The main contributions of this dissertation are as follows. A systematic study of different aspects of methodological design of a state-of-the-art text mining approach is presented to assess its scalability towards a large, imbalanced and linguistically noisy social media dataset. In this framework, three key automatic text categorisation tasks are examined, namely the feasibility to (i) identify a social network user{\textquoteright}s age group and gender based on textual information found in only one single message; (ii) aggregate predictions on the message level to the user level without neglecting potential clues of deception and detect false user profiles on social networks and (iii) identify child sexual abuse media among thousands of legal other media, including adult pornography, based on their filename. Finally, a novel approach is presented that combines age group predictions with advanced text clustering techniques and unsupervised learning to identify online child sex offenders{\textquoteright} grooming behaviour. The methodology presented in this thesis was extensively discussed with law enforcement to assess its forensic readiness. Additionally, each component was evaluated on actual child sex offender data. Despite the challenging characteristics of these text types, the results show high degrees of accuracy for false profile detection, identifying grooming behaviour and child sexual abuse media identification.",
author = "Claudia Peersman",
year = "2018",
doi = "10.17635/lancaster/thesis/553",
language = "English",
publisher = "Lancaster University",
school = "Lancaster University",

}

RIS

TY - THES

T1 - Detecting deceptive behaviour in the wild

T2 - text mining for online child protection in the presence of noisy and adversarial social media communications

AU - Peersman, Claudia

PY - 2018

Y1 - 2018

N2 - A real-life application of text mining research “in the wild”, i.e. in online social media, differs from more general applications in that its defining characteristics are both domain and process dependent. This gives rise to a number of challenges of which contemporary research has only scratched the surface. More specifically, a text mining approach applied in the wild typically has no control over the dataset size. Hence, the system has to be robust towards limited data availability, a variable number of samples across users and a highly skewed dataset. Additionally, the quality of the data cannot be guaranteed. As a result, the approach needs to be tolerant to a certain degree of linguistic noise. Finally, it has to be robust towards deceptive behaviour or adversaries. This thesis examines the viability of a text mining approach for supporting cybercrime investigations pertaining to online child protection. The main contributions of this dissertation are as follows. A systematic study of different aspects of methodological design of a state-of-the-art text mining approach is presented to assess its scalability towards a large, imbalanced and linguistically noisy social media dataset. In this framework, three key automatic text categorisation tasks are examined, namely the feasibility to (i) identify a social network user’s age group and gender based on textual information found in only one single message; (ii) aggregate predictions on the message level to the user level without neglecting potential clues of deception and detect false user profiles on social networks and (iii) identify child sexual abuse media among thousands of legal other media, including adult pornography, based on their filename. Finally, a novel approach is presented that combines age group predictions with advanced text clustering techniques and unsupervised learning to identify online child sex offenders’ grooming behaviour. The methodology presented in this thesis was extensively discussed with law enforcement to assess its forensic readiness. Additionally, each component was evaluated on actual child sex offender data. Despite the challenging characteristics of these text types, the results show high degrees of accuracy for false profile detection, identifying grooming behaviour and child sexual abuse media identification.

AB - A real-life application of text mining research “in the wild”, i.e. in online social media, differs from more general applications in that its defining characteristics are both domain and process dependent. This gives rise to a number of challenges of which contemporary research has only scratched the surface. More specifically, a text mining approach applied in the wild typically has no control over the dataset size. Hence, the system has to be robust towards limited data availability, a variable number of samples across users and a highly skewed dataset. Additionally, the quality of the data cannot be guaranteed. As a result, the approach needs to be tolerant to a certain degree of linguistic noise. Finally, it has to be robust towards deceptive behaviour or adversaries. This thesis examines the viability of a text mining approach for supporting cybercrime investigations pertaining to online child protection. The main contributions of this dissertation are as follows. A systematic study of different aspects of methodological design of a state-of-the-art text mining approach is presented to assess its scalability towards a large, imbalanced and linguistically noisy social media dataset. In this framework, three key automatic text categorisation tasks are examined, namely the feasibility to (i) identify a social network user’s age group and gender based on textual information found in only one single message; (ii) aggregate predictions on the message level to the user level without neglecting potential clues of deception and detect false user profiles on social networks and (iii) identify child sexual abuse media among thousands of legal other media, including adult pornography, based on their filename. Finally, a novel approach is presented that combines age group predictions with advanced text clustering techniques and unsupervised learning to identify online child sex offenders’ grooming behaviour. The methodology presented in this thesis was extensively discussed with law enforcement to assess its forensic readiness. Additionally, each component was evaluated on actual child sex offender data. Despite the challenging characteristics of these text types, the results show high degrees of accuracy for false profile detection, identifying grooming behaviour and child sexual abuse media identification.

U2 - 10.17635/lancaster/thesis/553

DO - 10.17635/lancaster/thesis/553

M3 - Doctoral Thesis

PB - Lancaster University

ER -