Learning to combine multiple string similarity metrics for effective toponym matching

Electronic data

Manusc_Combining_Multiple_String_Similarity_Metrics_for_Effective_Toponym_Matching
Rights statement: This is an Accepted Manuscript of an article published by Taylor & Francis in International Journal of Digital Earth on 06/09/2017, available online: http://www.tandfonline.com/doi/full/10.1080/17538947.2017.1371253
Accepted author manuscript, 490 KB, PDF document
Available under license: CC BY-NC: Creative Commons Attribution-NonCommercial 4.0 International License

Text available via DOI:

https://doi.org/10.1080/17538947.2017.1371253
Final published version

Keywords

duplicate detection, ensemble learning, geographic information retrieval, string similarity metrics, supervised learning, Toponym matching

Research output: Contribution to Journal/Magazine › Journal article › peer-review

E-pub ahead of print

Standard

Learning to combine multiple string similarity metrics for effective toponym matching. / Santos, Rui; Murrieta-Flores, Patricia; Martins, Bruno.
In: International Journal of Digital Earth, 06.09.2017.

Harvard

Santos, R, Murrieta-Flores, P & Martins, B 2017, 'Learning to combine multiple string similarity metrics for effective toponym matching', International Journal of Digital Earth. https://doi.org/10.1080/17538947.2017.1371253

APA

Santos, R., Murrieta-Flores, P., & Martins, B. (2017). Learning to combine multiple string similarity metrics for effective toponym matching. International Journal of Digital Earth. Advance online publication. https://doi.org/10.1080/17538947.2017.1371253

Vancouver

Santos R, Murrieta-Flores P, Martins B. Learning to combine multiple string similarity metrics for effective toponym matching. International Journal of Digital Earth. 2017 Sep 6. Epub 2017 Sep 6. doi: 10.1080/17538947.2017.1371253

Author

Santos, Rui ; Murrieta-Flores, Patricia ; Martins, Bruno. / Learning to combine multiple string similarity metrics for effective toponym matching. In: International Journal of Digital Earth. 2017.

Bibtex

@article{6675e427223740d78fa80071ff44b074,

title = "Learning to combine multiple string similarity metrics for effective toponym matching",

abstract = "Several tasks related to geographical information retrieval and to the geographical information sciences involve toponym matching, that is, the problem of matching place names that share a common referent. In this article, we present the results of a wide-ranging evaluation on the performance of different string similarity metrics over the toponym matching task. We also report on experiments involving the usage of supervised machine learning for combining multiple similarity metrics, which has the natural advantage of avoiding the manual tuning of similarity thresholds. Experiments with a very large dataset show that the performance differences for the individual similarity metrics are relatively small, and that carefully tuning the similarity threshold is important for achieving good results. The methods based on supervised machine learning, particularly when considering ensembles of decision trees, can achieve good results on this task, significantly outperforming the individual similarity metrics.",

keywords = "duplicate detection, ensemble learning, geographic information retrieval, string similarity metrics, supervised learning, Toponym matching",

author = "Rui Santos and Patricia Murrieta-Flores and Bruno Martins",

note = "This is an Accepted Manuscript of an article published by Taylor \& Francis in International Journal of Digital Earth on 06/09/2017, available online: http://www.tandfonline.com/doi/full/10.1080/17538947.2017.1371253",

year = "2017",

month = sep,

day = "6",

doi = "10.1080/17538947.2017.1371253",

language = "English",

journal = "International Journal of Digital Earth",

issn = "1753-8947",

publisher = "Taylor and Francis Ltd.",

}

RIS

TY - JOUR

T1 - Learning to combine multiple string similarity metrics for effective toponym matching

AU - Santos, Rui

AU - Murrieta-Flores, Patricia

AU - Martins, Bruno

N1 - This is an Accepted Manuscript of an article published by Taylor & Francis in International Journal of Digital Earth on 06/09/2017, available online: http://www.tandfonline.com/doi/full/10.1080/17538947.2017.1371253

PY - 2017/9/6

Y1 - 2017/9/6

N2 - Several tasks related to geographical information retrieval and to the geographical information sciences involve toponym matching, that is, the problem of matching place names that share a common referent. In this article, we present the results of a wide-ranging evaluation on the performance of different string similarity metrics over the toponym matching task. We also report on experiments involving the usage of supervised machine learning for combining multiple similarity metrics, which has the natural advantage of avoiding the manual tuning of similarity thresholds. Experiments with a very large dataset show that the performance differences for the individual similarity metrics are relatively small, and that carefully tuning the similarity threshold is important for achieving good results. The methods based on supervised machine learning, particularly when considering ensembles of decision trees, can achieve good results on this task, significantly outperforming the individual similarity metrics.

AB - Several tasks related to geographical information retrieval and to the geographical information sciences involve toponym matching, that is, the problem of matching place names that share a common referent. In this article, we present the results of a wide-ranging evaluation on the performance of different string similarity metrics over the toponym matching task. We also report on experiments involving the usage of supervised machine learning for combining multiple similarity metrics, which has the natural advantage of avoiding the manual tuning of similarity thresholds. Experiments with a very large dataset show that the performance differences for the individual similarity metrics are relatively small, and that carefully tuning the similarity threshold is important for achieving good results. The methods based on supervised machine learning, particularly when considering ensembles of decision trees, can achieve good results on this task, significantly outperforming the individual similarity metrics.

KW - duplicate detection

KW - ensemble learning

KW - geographic information retrieval

KW - string similarity metrics

KW - supervised learning

KW - Toponym matching

U2 - 10.1080/17538947.2017.1371253

DO - 10.1080/17538947.2017.1371253

M3 - Journal article

AN - SCOPUS:85029430534

JO - International Journal of Digital Earth

JF - International Journal of Digital Earth

SN - 1753-8947

ER -
