On integrating the number of synthetic data sets m into the a priori synthesis approach

School Of Mathematical Sciences

Associated organisational unit

Medical and Social Statistics

Electronic data

Multiple_Data_Sets (19)
Accepted author manuscript, 845 KB, PDF document
Available under license: CC BY: Creative Commons Attribution 4.0 International License

Text available via DOI:

https://doi.org/10.1007/978-3-031-13945-1_15
Final published version

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review

Published

James Jackson
Robin Mitra
Brian Francis
Iain Dove

Publication date	14/09/2022
Host publication	Privacy in Statistical Databases: International Conference, PSD 2022, Paris, France, September 21–23, 2022, Proceedings
Editors	Josep Domingo-Ferrer, Maryline Laurent
Place of Publication	Cham
Publisher	Springer
Pages	205-219
Number of pages	15
ISBN (electronic)	9783031139451
ISBN (print)	9783031139444
Original language	English

Publication series

Name	Lecture Notes in Computer Science
Publisher	Springer Cham
ISSN (Print)	0302-9743
ISSN (electronic)	1611-3349

Abstract

The synthesis mechanism given in Jackson et al. (2022) uses saturated models, along with overdispersed count distributions, to generate synthetic categorical data. The mechanism is controlled by tuning parameters, which can be set according to a specific risk or utility metric. Thus, expected properties of synthetic data sets can be determined analytically a priori, that is, before they are generated. While Jackson et al. (2022) considered the case of generating m = 1 data set, this paper considers generating m > 1 data sets. In effect, m becomes a tuning parameter, and the role of m in the risk-utility trade-off can be shown analytically. The paper introduces a pair of risk metrics, τ₃(k,d) and τ₄(k,d), that are suited to m > 1 data sets. It also considers the more general issue of how best to analyse multiple categorical data sets: average the data sets pre-analysis or average the results post-analysis. Finally, the methods are demonstrated empirically with the synthesis of a constructed data set that is used to represent the English School Census.
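
The difference between averaging the data sets pre-analysis and averaging results post-analysis can be illustrated with a minimal Python sketch. This is not the mechanism of Jackson et al. (2022). It assumes a negative binomial distribution, drawn as a gamma-Poisson mixture, centred on each original cell count (the fitted values of a saturated model equal the observed counts). The function names, the `shape` parameter, the example counts and the choice of a cell proportion as the analysis are all hypothetical.

```python
import math
import random


def poisson(lam, rng):
    """Draw a Poisson variate by Knuth's method (fine for the small means used here)."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def synthesise(counts, m, shape, rng):
    """Return m synthetic tables of overdispersed counts.

    Assumption: each cell is negative binomial with mean equal to the
    original count and variance mu + mu**2 / shape (gamma-Poisson mixture).
    """
    tables = []
    for _ in range(m):
        table = []
        for mu in counts:
            if mu == 0:
                table.append(0)  # a zero mean gives a zero count
                continue
            lam = rng.gammavariate(shape, mu / shape)  # gamma with mean mu
            table.append(poisson(lam, rng))
        tables.append(table)
    return tables


def proportion(table, cell=0):
    """Hypothetical analysis: share of the total count falling in one cell."""
    total = sum(table)
    return table[cell] / total if total else float("nan")


def average_pre_analysis(tables, cell=0):
    """Average the m tables cell by cell, then analyse the mean table once."""
    m = len(tables)
    mean_table = [sum(column) / m for column in zip(*tables)]
    return proportion(mean_table, cell)


def average_post_analysis(tables, cell=0):
    """Analyse each of the m tables separately, then average the m results."""
    return sum(proportion(t, cell) for t in tables) / len(tables)


if __name__ == "__main__":
    rng = random.Random(0)
    original = [50, 30, 15, 5]  # made-up cell counts, not from the paper
    tables = synthesise(original, m=5, shape=5.0, rng=rng)
    print("original proportion:", proportion(original))
    print("pre-analysis average:", average_pre_analysis(tables))
    print("post-analysis average:", average_post_analysis(tables))
```

Because a proportion is a nonlinear function of the cell counts, the two averages generally differ when m > 1 and coincide when m = 1.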
