Tracing science-technology-linkages: A machine learning pipeline for extracting and matching patent in-text references to scientific publications

Research output: Contribution to Journal/Magazine › Journal article › peer-review

E-pub ahead of print

Standard

Tracing science-technology-linkages: A machine learning pipeline for extracting and matching patent in-text references to scientific publications. / Abbasiantaeb, Zahra; Verberne, Suzan; Wang, Jian.
In: Information Processing & Management, Vol. 62, No. 6, 104264, 30.11.2025.

Vancouver

Abbasiantaeb Z, Verberne S, Wang J. Tracing science-technology-linkages: A machine learning pipeline for extracting and matching patent in-text references to scientific publications. Information Processing & Management. 2025 Nov 30;62(6):104264. Epub 2025 Jul 2. doi: 10.1016/j.ipm.2025.104264

Author

Abbasiantaeb, Zahra; Verberne, Suzan; Wang, Jian. / Tracing science-technology-linkages: A machine learning pipeline for extracting and matching patent in-text references to scientific publications. In: Information Processing & Management. 2025; Vol. 62, No. 6.

Bibtex

@article{3a7cc39ad39945928f40889653321118,
title = "Tracing science-technology-linkages: A machine learning pipeline for extracting and matching patent in-text references to scientific publications",
abstract = "Patent references to science provide a valuable paper trail for investigating the knowledge flow from science to technological innovation. Research on patent–paper links has mostly concentrated on front-page references, often neglecting the more complex in-text references. Therefore, we developed a three-stage machine-learning pipeline to extract and match patent in-text references to scientific publications. Our pipeline performs the following tasks: (1) extracting reference strings from patent texts, (2) parsing fields from these reference strings, and (3) matching references to publications in the Web of Science (WoS) database. We developed a training dataset consisting of 3,900 (and 3,901) manually annotated references from 392 (and 319) randomly selected EPO (and USPTO) patents. The first stage, reference extraction, achieved almost perfect results with a precision of 98.9% and a recall of 97.7% at the reference level. Overall, the pipeline demonstrated robust performance, with a precision of 96.8% and a recall of 91.9% at the unique patent-paper-pair level. Applying this pipeline to EPO and USPTO patents granted between 1990 and 2022, we identified 5,438,836 (and 20,432,189) references from 492,469 (and 1,449,398) EPO (and USPTO) patents, 2,763,779 (and 11,069,995) of which are matched to WoS publications. This extensive dataset is a valuable resource for studying science-technology linkages. We offer open access to this dataset, along with the associated code and training data.",
keywords = "Text mining, Reference extraction, Science technology linkage, Citation analysis, Patent analysis",
author = "Zahra Abbasiantaeb and Suzan Verberne and Jian Wang",
year = "2025",
month = jul,
day = "2",
doi = "10.1016/j.ipm.2025.104264",
language = "English",
volume = "62",
journal = "Information Processing \& Management",
publisher = "Elsevier",
number = "6",
pages = "104264",
}

RIS

TY  - JOUR
T1  - Tracing science-technology-linkages
T2  - A machine learning pipeline for extracting and matching patent in-text references to scientific publications
AU  - Abbasiantaeb, Zahra
AU  - Verberne, Suzan
AU  - Wang, Jian
PY  - 2025/7/2
Y1  - 2025/7/2
N2  - Patent references to science provide a valuable paper trail for investigating the knowledge flow from science to technological innovation. Research on patent–paper links has mostly concentrated on front-page references, often neglecting the more complex in-text references. Therefore, we developed a three-stage machine-learning pipeline to extract and match patent in-text references to scientific publications. Our pipeline performs the following tasks: (1) extracting reference strings from patent texts, (2) parsing fields from these reference strings, and (3) matching references to publications in the Web of Science (WoS) database. We developed a training dataset consisting of 3,900 (and 3,901) manually annotated references from 392 (and 319) randomly selected EPO (and USPTO) patents. The first stage, reference extraction, achieved almost perfect results with a precision of 98.9% and a recall of 97.7% at the reference level. Overall, the pipeline demonstrated robust performance, with a precision of 96.8% and a recall of 91.9% at the unique patent-paper-pair level. Applying this pipeline to EPO and USPTO patents granted between 1990 and 2022, we identified 5,438,836 (and 20,432,189) references from 492,469 (and 1,449,398) EPO (and USPTO) patents, 2,763,779 (and 11,069,995) of which are matched to WoS publications. This extensive dataset is a valuable resource for studying science-technology linkages. We offer open access to this dataset, along with the associated code and training data.
AB  - Patent references to science provide a valuable paper trail for investigating the knowledge flow from science to technological innovation. Research on patent–paper links has mostly concentrated on front-page references, often neglecting the more complex in-text references. Therefore, we developed a three-stage machine-learning pipeline to extract and match patent in-text references to scientific publications. Our pipeline performs the following tasks: (1) extracting reference strings from patent texts, (2) parsing fields from these reference strings, and (3) matching references to publications in the Web of Science (WoS) database. We developed a training dataset consisting of 3,900 (and 3,901) manually annotated references from 392 (and 319) randomly selected EPO (and USPTO) patents. The first stage, reference extraction, achieved almost perfect results with a precision of 98.9% and a recall of 97.7% at the reference level. Overall, the pipeline demonstrated robust performance, with a precision of 96.8% and a recall of 91.9% at the unique patent-paper-pair level. Applying this pipeline to EPO and USPTO patents granted between 1990 and 2022, we identified 5,438,836 (and 20,432,189) references from 492,469 (and 1,449,398) EPO (and USPTO) patents, 2,763,779 (and 11,069,995) of which are matched to WoS publications. This extensive dataset is a valuable resource for studying science-technology linkages. We offer open access to this dataset, along with the associated code and training data.
KW  - Text mining
KW  - Reference extraction
KW  - Science technology linkage
KW  - Citation analysis
KW  - Patent analysis
U2  - 10.1016/j.ipm.2025.104264
DO  - 10.1016/j.ipm.2025.104264
M3  - Journal article
VL  - 62
JO  - Information Processing & Management
JF  - Information Processing & Management
IS  - 6
M1  - 104264
ER  -