Making Mitochondrial Haplogroup and DNA Sequence Predictions from Low-Density Genotyping data

Biomedical and Life Sciences

Electronic data

2021DrummondMScBiomed
Final published version, 7.06 MB, PDF document

Text available via DOI:

https://doi.org/10.17635/lancaster/thesis/1391
Final published version

Keywords

mitochondria, mtDNA, GENOTYPES, imputation, R software

Research output: Thesis › Master's Thesis

Published

Standard

Making Mitochondrial Haplogroup and DNA Sequence Predictions from Low-Density Genotyping data. / Drummond, Emma.
Lancaster University, 2021. 149 p.

Bibtex

@mastersthesis{d7475f40e07d4dff94f0e9ae9cadd3d3,

title = "Making Mitochondrial Haplogroup and DNA Sequence Predictions from Low-Density Genotyping data",

abstract = "The mitochondrial genome (mtDNA) is inherited differently and mutates more frequently than the genetic material residing in the cell{\textquoteright}s nucleus. Whilst the mitochondrial genome is small, at only 16.5 kilobases, it contains key components of the metabolic chain, and must communicate in a precise and timely way with the genes in the nuclear genome and sense the minute-to-minute needs of its host cell. MtDNA is an underexplored place to search for health-related variants. Unlike the time-consuming and expensive methods of whole genome sequencing, genotyping examines certain positions in the genome, allowing imputation of the other variants typically linked to these positions. Current methods, which use nuclear genome data to model their predictions, do not tailor imputation to take advantage of the different inheritance patterns of the mtDNA. I present a novel method, using an open-source library of fully sequenced mtDNA samples with manually assigned haplogroups, to take genotyping data and predict the other variants present in the sample{\textquoteright}s mtDNA sequence, a two-stage method referred to as in silico genotyping and barcode matching. The method has been assessed for performance on a test data set to explore inconsistencies across the mitochondrial genome and the human mtDNA phylogeny. The first use of in silico genotyping and barcode matching is presented, extending the use of UKBiobank{\textquoteright}s data [22]. The UKBiobank represents data which is not only rich in detail but also covers a large population of individuals aged between 51 and 84 in 2021. The phenotypic data is health-focussed, including general health records, and is being augmented by new diagnoses or events in the participants{\textquoteright} medical history. Extensive use is being made of the data in UKBiobank, with the exception of the mitochondrial DNA (mtDNA).
The scale of the phenotypic data collected by the UKBiobank is proving a valuable resource, valued all the more because of the difficulty and expense of its collection. Making further use of phenotyping by extending potential associations into the mtDNA is vital, and likely to offer substantial rewards. Using the method described below to transform genotypes into predicted mtDNA sequence opens the doors for mitochondrial variation to be put to considerable use too. The introduction presents evidence that: (a) the mitochondrion is essential for cell and organism function, (b) mtDNA can harbour variations associating with phenotypes, and (c) the current methods of mtDNA imputation can be improved upon. The method presented mimics any genotyping microarray to produce a library of data transformed to appear as if it had been genotyped by the physical array. The effectiveness and accuracy of this transformation have been investigated and the results are presented. Finally, the transformed library is used to predict the UKBiobank participant data, greatly extending a data set with huge reserves of potential, especially for mitochondrial data. My development of in silico genotyping and barcode matching has allowed me to make weighted predictions for test samples, guessing their haplogroups and the variants they carry. Whilst I admit to the significant potential to improve the algorithms, the overall accuracy of these predictions is at a level high enough to search for links between UKBiobank samples and their phenotypic data in a GWAS-style search.",

keywords = "mitochondria, mtDNA, GENOTYPES, imputation, R software",

author = "Emma Drummond",

year = "2021",

month = jul,

day = "30",

doi = "10.17635/lancaster/thesis/1391",

language = "English",

publisher = "Lancaster University",

school = "Lancaster University",

}

RIS

TY - THES

T1 - Making Mitochondrial Haplogroup and DNA Sequence Predictions from Low-Density Genotyping data

AU - Drummond, Emma

PY - 2021/7/30

Y1 - 2021/7/30

AB - The mitochondrial genome (mtDNA) is inherited differently and mutates more frequently than the genetic material residing in the cell’s nucleus. Whilst the mitochondrial genome is small, at only 16.5 kilobases, it contains key components of the metabolic chain, and must communicate in a precise and timely way with the genes in the nuclear genome and sense the minute-to-minute needs of its host cell. MtDNA is an underexplored place to search for health-related variants. Unlike the time-consuming and expensive methods of whole genome sequencing, genotyping examines certain positions in the genome, allowing imputation of the other variants typically linked to these positions. Current methods, which use nuclear genome data to model their predictions, do not tailor imputation to take advantage of the different inheritance patterns of the mtDNA. I present a novel method, using an open-source library of fully sequenced mtDNA samples with manually assigned haplogroups, to take genotyping data and predict the other variants present in the sample’s mtDNA sequence, a two-stage method referred to as in silico genotyping and barcode matching. The method has been assessed for performance on a test data set to explore inconsistencies across the mitochondrial genome and the human mtDNA phylogeny. The first use of in silico genotyping and barcode matching is presented, extending the use of UKBiobank’s data [22]. The UKBiobank represents data which is not only rich in detail but also covers a large population of individuals aged between 51 and 84 in 2021. The phenotypic data is health-focussed, including general health records, and is being augmented by new diagnoses or events in the participants’ medical history. Extensive use is being made of the data in UKBiobank, with the exception of the mitochondrial DNA (mtDNA).
The scale of the phenotypic data collected by the UKBiobank is proving a valuable resource, valued all the more because of the difficulty and expense of its collection. Making further use of phenotyping by extending potential associations into the mtDNA is vital, and likely to offer substantial rewards. Using the method described below to transform genotypes into predicted mtDNA sequence opens the doors for mitochondrial variation to be put to considerable use too. The introduction presents evidence that: (a) the mitochondrion is essential for cell and organism function, (b) mtDNA can harbour variations associating with phenotypes, and (c) the current methods of mtDNA imputation can be improved upon. The method presented mimics any genotyping microarray to produce a library of data transformed to appear as if it had been genotyped by the physical array. The effectiveness and accuracy of this transformation have been investigated and the results are presented. Finally, the transformed library is used to predict the UKBiobank participant data, greatly extending a data set with huge reserves of potential, especially for mitochondrial data. My development of in silico genotyping and barcode matching has allowed me to make weighted predictions for test samples, guessing their haplogroups and the variants they carry. Whilst I admit to the significant potential to improve the algorithms, the overall accuracy of these predictions is at a level high enough to search for links between UKBiobank samples and their phenotypic data in a GWAS-style search.

KW - mitochondria

KW - mtDNA

KW - GENOTYPES

KW - imputation

KW - R software

U2 - 10.17635/lancaster/thesis/1391

DO - 10.17635/lancaster/thesis/1391

M3 - Master's Thesis

PB - Lancaster University

ER -
