The use of saturated count models for synthesis of large confidential administrative databases

Research output: Thesis › Doctoral Thesis

Published

Standard

The use of saturated count models for synthesis of large confidential administrative databases. / Jackson, James.
Lancaster University, 2022. 151 p.

Vancouver

Jackson J. The use of saturated count models for synthesis of large confidential administrative databases. Lancaster University, 2022. 151 p. doi: 10.17635/lancaster/thesis/1860

Bibtex

@phdthesis{c8d2b1acc7c7492da2dd8903dc00a20a,
title = "The use of saturated count models for synthesis of large confidential administrative databases",
abstract = "Synthetic data sets are being increasingly used to protect data confidentiality. In the three decades since they were first introduced, methods for synthetic data generation have evolved, but mainly within the domain of survey data sets. As greater interest is being taken in utilising administrative data for statistical purposes, there is inevitably greater interest in creating synthetic administrative databases. Yet there are characteristics of these databases that require special attention from a synthesis perspective, such as their size and the presence of structural zeros. This thesis, through the fitting of saturated models in conjunction with overdispersed count distributions, presents a mechanism that allows large administrative databases to be synthesized efficiently. This thesis also proposes a concept of satisfying risk and utility metrics a priori - that is, prior to synthetic data generation - using the synthesis mechanism{\textquoteright}s tuning parameters, allowing a more formalized approach to synthesis. The methods are demonstrated empirically throughout, primarily through synthesizing a database that can be viewed as a close substitute to the English School Census.",
keywords = "Synthetic data, Statistical disclosure control, count distributions, tabular data",
author = "James Jackson",
year = "2022",
doi = "10.17635/lancaster/thesis/1860",
language = "English",
publisher = "Lancaster University",
school = "Lancaster University",

}

RIS

TY - BOOK

T1 - The use of saturated count models for synthesis of large confidential administrative databases

AU - Jackson, James

PY - 2022

Y1 - 2022

N2 - Synthetic data sets are being increasingly used to protect data confidentiality. In the three decades since they were first introduced, methods for synthetic data generation have evolved, but mainly within the domain of survey data sets. As greater interest is being taken in utilising administrative data for statistical purposes, there is inevitably greater interest in creating synthetic administrative databases. Yet there are characteristics of these databases that require special attention from a synthesis perspective, such as their size and the presence of structural zeros. This thesis, through the fitting of saturated models in conjunction with overdispersed count distributions, presents a mechanism that allows large administrative databases to be synthesized efficiently. This thesis also proposes a concept of satisfying risk and utility metrics a priori - that is, prior to synthetic data generation - using the synthesis mechanism’s tuning parameters, allowing a more formalized approach to synthesis. The methods are demonstrated empirically throughout, primarily through synthesizing a database that can be viewed as a close substitute to the English School Census.

KW - Synthetic data

KW - Statistical disclosure control

KW - count distributions

KW - tabular data

U2 - 10.17635/lancaster/thesis/1860

DO - 10.17635/lancaster/thesis/1860

M3 - Doctoral Thesis

PB - Lancaster University

ER -
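
The abstract describes synthesis by fitting saturated models together with overdispersed count distributions. The sketch below is one possible reading of that idea, not the thesis's implementation. It assumes a saturated model reproduces each observed cell count as its fitted mean, and it assumes a negative binomial (drawn as a Gamma-Poisson mixture) with a hypothetical tuning parameter `sigma`, so that a cell with mean mu has variance mu + sigma * mu^2. The table, its cell labels, and all function names are invented for illustration.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw from Poisson(lam) with Knuth's method (only suitable for small lam)."""
    if lam <= 0:
        return 0
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def sample_negative_binomial(mu, sigma, rng):
    """Draw a count with mean mu and variance mu + sigma * mu**2."""
    if mu == 0:
        return 0  # a zero fitted mean can only produce a zero count
    if sigma == 0:
        return sample_poisson(mu, rng)  # no overdispersion: plain Poisson
    # Gamma(shape=1/sigma, scale=sigma*mu) has mean mu and variance sigma*mu**2.
    lam = rng.gammavariate(1.0 / sigma, sigma * mu)
    return sample_poisson(lam, rng)


def synthesize_counts(observed, sigma, seed=None):
    """Replace each observed cell count with a noisy draw centred on it.

    Assumption: under a saturated model the fitted mean of a cell equals
    its observed count, so no model fitting step is needed here.
    """
    rng = random.Random(seed)
    return {
        cell: sample_negative_binomial(count, sigma, rng)
        for cell, count in observed.items()
    }


# Hypothetical toy table; the cell labels are illustrative only.
observed = {
    ("year7", "F"): 12,
    ("year7", "M"): 9,
    ("year8", "F"): 0,  # treated here as a structural zero
    ("year8", "M"): 15,
}

synthetic = synthesize_counts(observed, sigma=0.5, seed=1)
for cell, count in synthetic.items():
    print(cell, observed[cell], "->", count)
```

In this sketch a larger `sigma` adds more noise to each cell, which is one way a tuning parameter could trade utility against disclosure risk before any data are generated. Cells with an observed count of zero stay at zero, so structural zeros are never populated.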