Rights statement: This is an Accepted Manuscript of an article published by Taylor & Francis in International Journal of Digital Earth on 06/09/2017, available online: http://www.tandfonline.com/doi/full/10.1080/17538947.2017.1371253
Accepted author manuscript, 490 KB, PDF document
Available under license: CC BY-NC: Creative Commons Attribution-NonCommercial 4.0 International License
Research output: Contribution to Journal/Magazine › Journal article › peer-review
TY - JOUR
T1 - Learning to combine multiple string similarity metrics for effective toponym matching
AU - Santos, Rui
AU - Murrieta-Flores, Patricia
AU - Martins, Bruno
N1 - This is an Accepted Manuscript of an article published by Taylor & Francis in International Journal of Digital Earth on 06/09/2017, available online: http://www.tandfonline.com/doi/full/10.1080/17538947.2017.1371253
PY - 2017/9/6
Y1 - 2017/9/6
N2 - Several tasks related to geographical information retrieval and to the geographical information sciences involve toponym matching, that is, the problem of matching place names that share a common referent. In this article, we present the results of a wide-ranging evaluation on the performance of different string similarity metrics over the toponym matching task. We also report on experiments involving the usage of supervised machine learning for combining multiple similarity metrics, which has the natural advantage of avoiding the manual tuning of similarity thresholds. Experiments with a very large dataset show that the performance differences for the individual similarity metrics are relatively small, and that carefully tuning the similarity threshold is important for achieving good results. The methods based on supervised machine learning, particularly when considering ensembles of decision trees, can achieve good results on this task, significantly outperforming the individual similarity metrics.
AB - Several tasks related to geographical information retrieval and to the geographical information sciences involve toponym matching, that is, the problem of matching place names that share a common referent. In this article, we present the results of a wide-ranging evaluation on the performance of different string similarity metrics over the toponym matching task. We also report on experiments involving the usage of supervised machine learning for combining multiple similarity metrics, which has the natural advantage of avoiding the manual tuning of similarity thresholds. Experiments with a very large dataset show that the performance differences for the individual similarity metrics are relatively small, and that carefully tuning the similarity threshold is important for achieving good results. The methods based on supervised machine learning, particularly when considering ensembles of decision trees, can achieve good results on this task, significantly outperforming the individual similarity metrics.
KW - duplicate detection
KW - ensemble learning
KW - geographic information retrieval
KW - string similarity metrics
KW - supervised learning
KW - toponym matching
U2 - 10.1080/17538947.2017.1371253
DO - 10.1080/17538947.2017.1371253
M3 - Journal article
AN - SCOPUS:85029430534
JO - International Journal of Digital Earth
JF - International Journal of Digital Earth
SN - 1753-8947
ER -
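
The abstract describes combining several string similarity metrics as input to a supervised classifier instead of hand-tuning a threshold for each one. A minimal sketch of the feature-extraction step might look like the following. The three metrics and all function names here are illustrative assumptions, not necessarily the metrics evaluated in the article, and the classifier itself is left out.

```python
from difflib import SequenceMatcher


def levenshtein_similarity(a: str, b: str) -> float:
    """Edit distance normalised to [0, 1], where 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return 1.0 - previous[-1] / max(len(a), len(b))


def bigram_jaccard(a: str, b: str) -> float:
    """Jaccard overlap between the sets of character bigrams of two strings."""
    grams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    grams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    if not grams_a and not grams_b:
        # Strings shorter than two characters have no bigrams at all.
        return 1.0 if a == b else 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def similarity_features(name_a: str, name_b: str) -> list[float]:
    """Hypothetical helper: one feature vector per candidate toponym pair."""
    a, b = name_a.casefold(), name_b.casefold()
    return [
        levenshtein_similarity(a, b),
        bigram_jaccard(a, b),
        SequenceMatcher(None, a, b).ratio(),
    ]
```

Each labelled pair of place names would yield one such vector, and a supervised model (for example an ensemble of decision trees, as the abstract mentions) could then learn how to weigh the metrics jointly.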