
Electronic data

  • boffin

    Rights statement: ©2020 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

    Accepted author manuscript, 615 KB, PDF document

    Available under license: CC BY-NC: Creative Commons Attribution-NonCommercial 4.0 International License


BOFFIN TTS: Few-Shot Speaker Adaptation by Bayesian Optimization

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review

Published

Standard

BOFFIN TTS: Few-Shot Speaker Adaptation by Bayesian Optimization. / Moss, Henry; Aggarwal, Vatsal; Prateek, Nishant et al.
ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2020 ed. IEEE, 2020. p. 7639-7643 (ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); Vol. 2020).


Harvard

Moss, H, Aggarwal, V, Prateek, N, González, J & Barra-Chicote, R 2020, BOFFIN TTS: Few-Shot Speaker Adaptation by Bayesian Optimization. in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2020 edn, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 2020, IEEE, pp. 7639-7643.

APA

Moss, H., Aggarwal, V., Prateek, N., González, J., & Barra-Chicote, R. (2020). BOFFIN TTS: Few-Shot Speaker Adaptation by Bayesian Optimization. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2020 ed., pp. 7639-7643). (ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); Vol. 2020). IEEE.

Vancouver

Moss H, Aggarwal V, Prateek N, González J, Barra-Chicote R. BOFFIN TTS: Few-Shot Speaker Adaptation by Bayesian Optimization. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2020 ed. IEEE. 2020. p. 7639-7643. (ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)). Epub 2020 May 4.

Author

Moss, Henry ; Aggarwal, Vatsal ; Prateek, Nishant et al. / BOFFIN TTS : Few-Shot Speaker Adaptation by Bayesian Optimization. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2020 ed. IEEE, 2020. pp. 7639-7643 (ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)).

Bibtex

@inproceedings{f0f59cb199f5416b99176ca4f713e880,
title = "BOFFIN TTS: Few-Shot Speaker Adaptation by Bayesian Optimization",
abstract = "We present BOFFIN TTS (Bayesian Optimization For FIne-tuning Neural Text To Speech), a novel approach for few-shot speaker adaptation. Here, the task is to fine-tune a pre-trained TTS model to mimic a new speaker using a small corpus of target utterances. We demonstrate that there does not exist a one-size-fits-all adaptation strategy, with convincing synthesis requiring a corpus-specific configuration of the hyper-parameters that control fine-tuning. By using Bayesian optimization to efficiently optimize these hyper-parameter values for a target speaker, we are able to perform adaptation with an average 30% improvement in speaker similarity over standard techniques. Results indicate, across multiple corpora, that BOFFIN TTS can learn to synthesize new speakers using less than ten minutes of audio, achieving the same naturalness as produced for the speakers used to train the base model.",
author = "Henry Moss and Vatsal Aggarwal and Nishant Prateek and Javier Gonz{\'a}lez and Roberto Barra-Chicote",
note = "{\textcopyright}2020 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE. ",
year = "2020",
month = may,
day = "14",
language = "English",
isbn = "9781509066322",
series = "ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)",
publisher = "IEEE",
pages = "7639--7643",
booktitle = "ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)",
edition = "2020",

}

RIS

TY - GEN

T1 - BOFFIN TTS

T2 - Few-Shot Speaker Adaptation by Bayesian Optimization

AU - Moss, Henry

AU - Aggarwal, Vatsal

AU - Prateek, Nishant

AU - González, Javier

AU - Barra-Chicote, Roberto

N1 - ©2020 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

PY - 2020/5/14

Y1 - 2020/5/14

N2 - We present BOFFIN TTS (Bayesian Optimization For FIne-tuning Neural Text To Speech), a novel approach for few-shot speaker adaptation. Here, the task is to fine-tune a pre-trained TTS model to mimic a new speaker using a small corpus of target utterances. We demonstrate that there does not exist a one-size-fits-all adaptation strategy, with convincing synthesis requiring a corpus-specific configuration of the hyper-parameters that control fine-tuning. By using Bayesian optimization to efficiently optimize these hyper-parameter values for a target speaker, we are able to perform adaptation with an average 30% improvement in speaker similarity over standard techniques. Results indicate, across multiple corpora, that BOFFIN TTS can learn to synthesize new speakers using less than ten minutes of audio, achieving the same naturalness as produced for the speakers used to train the base model.

AB - We present BOFFIN TTS (Bayesian Optimization For FIne-tuning Neural Text To Speech), a novel approach for few-shot speaker adaptation. Here, the task is to fine-tune a pre-trained TTS model to mimic a new speaker using a small corpus of target utterances. We demonstrate that there does not exist a one-size-fits-all adaptation strategy, with convincing synthesis requiring a corpus-specific configuration of the hyper-parameters that control fine-tuning. By using Bayesian optimization to efficiently optimize these hyper-parameter values for a target speaker, we are able to perform adaptation with an average 30% improvement in speaker similarity over standard techniques. Results indicate, across multiple corpora, that BOFFIN TTS can learn to synthesize new speakers using less than ten minutes of audio, achieving the same naturalness as produced for the speakers used to train the base model.

M3 - Conference contribution/Paper

SN - 9781509066322

T3 - ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

SP - 7639

EP - 7643

BT - ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

PB - IEEE

ER -
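
Illustrative sketch

The abstract describes selecting a corpus-specific configuration of fine-tuning hyper-parameters with Bayesian optimization. The sketch below shows one way such a loop could look using Gaussian-process Bayesian optimization from scikit-optimize. It is an assumption-laden illustration, not the authors' implementation: the search space and the finetune_and_score objective are hypothetical placeholders for fine-tuning the pre-trained TTS model on the target speaker's few-shot corpus and scoring the result.

# Illustrative sketch only (not the paper's code): choosing corpus-specific
# fine-tuning hyper-parameters with Gaussian-process Bayesian optimization
# via scikit-optimize.
from skopt import gp_minimize
from skopt.space import Real, Integer

# Hypothetical hyper-parameters controlling fine-tuning; the paper's actual
# search space is not reproduced here.
search_space = [
    Real(1e-5, 1e-2, prior="log-uniform", name="learning_rate"),
    Integer(8, 64, name="batch_size"),
    Integer(100, 5000, name="finetune_steps"),
]

def finetune_and_score(params):
    learning_rate, batch_size, finetune_steps = params
    # In a real system this would fine-tune the pre-trained multi-speaker TTS
    # model on the target speaker's corpus with these settings, then return a
    # loss to minimise (e.g. negated speaker similarity). A cheap synthetic
    # stand-in keeps this sketch runnable end to end:
    return (learning_rate - 1e-3) ** 2 + abs(batch_size - 32) / 100 + finetune_steps / 1e6

# Gaussian-process surrogate with a standard acquisition function, spending
# only a small budget of fine-tuning runs per new speaker.
result = gp_minimize(finetune_and_score, search_space, n_calls=20, random_state=0)
print("best hyper-parameters found:", result.x, "objective:", result.fun)

Because each objective evaluation corresponds to a full fine-tuning run on less than ten minutes of audio, a sample-efficient optimizer of this kind keeps the per-speaker adaptation budget small while still tailoring the configuration to each target corpus.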