Tracing science-technology-linkages - Research Portal

LANCASTER UNIVERSITY LEIPZIG

Text available via DOI:

https://doi.org/10.1016/j.ipm.2025.104264
Final published version
Available under license: CC BY: Creative Commons Attribution 4.0 International License

Keywords

Text mining, Reference extraction, Science technology linkage, Citation analysis, Patent analysis


Tracing science-technology-linkages: A machine learning pipeline for extracting and matching patent in-text references to scientific publications

Research output: Contribution to Journal/Magazine › Journal article › peer-review

E-pub ahead of print

Standard

Tracing science-technology-linkages: A machine learning pipeline for extracting and matching patent in-text references to scientific publications. / Abbasiantaeb, Zahra; Verberne, Suzan; Wang, Jian.
In: Information Processing & Management, Vol. 62, No. 6, 104264, 30.11.2025.


Bibtex

@article{3a7cc39ad39945928f40889653321118,

title = "Tracing science-technology-linkages: A machine learning pipeline for extracting and matching patent in-text references to scientific publications",

abstract = "Patent references to science provide a valuable paper trail for investigating the knowledge flow from science to technological innovation. Research on patent–paper links has mostly concentrated on front-page references, often neglecting the more complex in-text references. Therefore, we developed a three-stage machine-learning pipeline to extract and match patent in-text references to scientific publications. Our pipeline performs the following tasks: (1) extracting reference strings from patent texts, (2) parsing fields from these reference strings, and (3) matching references to publications in the Web of Science (WoS) database. We developed a training dataset consisting of 3,900 (and 3,901) manually annotated references from 392 (and 319) randomly selected EPO (and USPTO) patents. The first stage, reference extraction, achieved almost perfect results with a precision of 98.9% and a recall of 97.7% at the reference level. Overall, the pipeline demonstrated robust performance, with a precision of 96.8% and a recall of 91.9% at the unique patent-paper-pair level. Applying this pipeline to EPO and USPTO patents granted between 1990 and 2022, we identified 5,438,836 (and 20,432,189) references from 492,469 (and 1,449,398) EPO (and USPTO) patents, 2,763,779 (and 11,069,995) of which are matched to WoS publications. This extensive dataset is a valuable resource for studying science-technology linkages. We offer open access to this dataset, along with the associated code and training data.",

keywords = "Text mining, Reference extraction, Science technology linkage, Citation analysis, Patent analysis",

author = "Zahra Abbasiantaeb and Suzan Verberne and Jian Wang",

year = "2025",

month = jul,

day = "2",

doi = "10.1016/j.ipm.2025.104264",

language = "English",

volume = "62",

journal = "Information Processing \& Management",

publisher = "Elsevier",

number = "6",

}

RIS

TY - JOUR

T1 - Tracing science-technology-linkages

T2 - A machine learning pipeline for extracting and matching patent in-text references to scientific publications

AU - Abbasiantaeb, Zahra

AU - Verberne, Suzan

AU - Wang, Jian

PY - 2025/7/2

Y1 - 2025/7/2

N2 - Patent references to science provide a valuable paper trail for investigating the knowledge flow from science to technological innovation. Research on patent–paper links has mostly concentrated on front-page references, often neglecting the more complex in-text references. Therefore, we developed a three-stage machine-learning pipeline to extract and match patent in-text references to scientific publications. Our pipeline performs the following tasks: (1) extracting reference strings from patent texts, (2) parsing fields from these reference strings, and (3) matching references to publications in the Web of Science (WoS) database. We developed a training dataset consisting of 3,900 (and 3,901) manually annotated references from 392 (and 319) randomly selected EPO (and USPTO) patents. The first stage, reference extraction, achieved almost perfect results with a precision of 98.9% and a recall of 97.7% at the reference level. Overall, the pipeline demonstrated robust performance, with a precision of 96.8% and a recall of 91.9% at the unique patent-paper-pair level. Applying this pipeline to EPO and USPTO patents granted between 1990 and 2022, we identified 5,438,836 (and 20,432,189) references from 492,469 (and 1,449,398) EPO (and USPTO) patents, 2,763,779 (and 11,069,995) of which are matched to WoS publications. This extensive dataset is a valuable resource for studying science-technology linkages. We offer open access to this dataset, along with the associated code and training data.

AB - Patent references to science provide a valuable paper trail for investigating the knowledge flow from science to technological innovation. Research on patent–paper links has mostly concentrated on front-page references, often neglecting the more complex in-text references. Therefore, we developed a three-stage machine-learning pipeline to extract and match patent in-text references to scientific publications. Our pipeline performs the following tasks: (1) extracting reference strings from patent texts, (2) parsing fields from these reference strings, and (3) matching references to publications in the Web of Science (WoS) database. We developed a training dataset consisting of 3,900 (and 3,901) manually annotated references from 392 (and 319) randomly selected EPO (and USPTO) patents. The first stage, reference extraction, achieved almost perfect results with a precision of 98.9% and a recall of 97.7% at the reference level. Overall, the pipeline demonstrated robust performance, with a precision of 96.8% and a recall of 91.9% at the unique patent-paper-pair level. Applying this pipeline to EPO and USPTO patents granted between 1990 and 2022, we identified 5,438,836 (and 20,432,189) references from 492,469 (and 1,449,398) EPO (and USPTO) patents, 2,763,779 (and 11,069,995) of which are matched to WoS publications. This extensive dataset is a valuable resource for studying science-technology linkages. We offer open access to this dataset, along with the associated code and training data.

KW - Text mining

KW - Reference extraction

KW - Science technology linkage

KW - Citation analysis

KW - Patent analysis

U2 - 10.1016/j.ipm.2025.104264

DO - 10.1016/j.ipm.2025.104264

M3 - Journal article

VL - 62

JO - Information Processing & Management

JF - Information Processing & Management

IS - 6

M1 - 104264

ER -
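
The precision, recall, and reference counts quoted in the abstract can be combined into a few derived figures. The sketch below is only an illustration: the input numbers are copied from the abstract, while the F1 scores and match rates are computed here and are not reported in the record above. The names `f1_score`, `match_rate`, `STAGE_ONE`, `FULL_PIPELINE`, and `REFERENCE_COUNTS` are hypothetical and do not come from the authors' code.

```python
"""Derived metrics from the figures quoted in the abstract (illustrative only)."""


def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def match_rate(matched: int, extracted: int) -> float:
    """Return the share of extracted references that were matched to WoS."""
    return matched / extracted


# Precision and recall as stated in the abstract.
STAGE_ONE = {"precision": 0.989, "recall": 0.977}      # reference level
FULL_PIPELINE = {"precision": 0.968, "recall": 0.919}  # unique patent-paper pairs

# (extracted references, references matched to WoS), as stated in the abstract.
REFERENCE_COUNTS = {
    "EPO": (5_438_836, 2_763_779),
    "USPTO": (20_432_189, 11_069_995),
}

if __name__ == "__main__":
    # F1 is an assumption of this sketch; the abstract reports only P and R.
    print(f"Stage 1 F1:  {f1_score(**STAGE_ONE):.3f}")
    print(f"Pipeline F1: {f1_score(**FULL_PIPELINE):.3f}")
    for office, (extracted, matched) in REFERENCE_COUNTS.items():
        print(f"{office} match rate: {match_rate(matched, extracted):.1%}")
```

Under these figures, roughly half of the extracted in-text references in each office are matched to a WoS publication; the remainder may include references to material outside WoS, which the abstract does not break down.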
