Tracing science-technology-linkages: A machine learning pipeline for extracting and matching patent in-text references to scientific publications

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Article number: 104264
Journal publication date: 30/11/2025
Journal: Information Processing & Management
Issue number: 6
Volume: 62
Number of pages: 12
Publication status: E-pub ahead of print
Early online date: 2/07/25
Original language: English

Abstract

Patent references to science provide a valuable paper trail for investigating the knowledge flow from science to technological innovation. Research on patent–paper links has mostly concentrated on front-page references, often neglecting the more complex in-text references. Therefore, we developed a three-stage machine-learning pipeline to extract and match patent in-text references to scientific publications. Our pipeline performs the following tasks: (1) extracting reference strings from patent texts, (2) parsing fields from these reference strings, and (3) matching references to publications in the Web of Science (WoS) database. We developed a training dataset consisting of 3,900 (and 3,901) manually annotated references from 392 (and 319) randomly selected EPO (and USPTO) patents. The first stage, reference extraction, achieved almost perfect results with a precision of 98.9% and a recall of 97.7% at the reference level. Overall, the pipeline demonstrated robust performance, with a precision of 96.8% and a recall of 91.9% at the unique patent-paper-pair level. Applying this pipeline to EPO and USPTO patents granted between 1990 and 2022, we identified 5,438,836 (and 20,432,189) references from 492,469 (and 1,449,398) EPO (and USPTO) patents, 2,763,779 (and 11,069,995) of which are matched to WoS publications. This extensive dataset is a valuable resource for studying science-technology linkages. We offer open access to this dataset, along with the associated code and training data.
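
The pair-level figures quoted above (precision and recall over unique patent-paper pairs) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the tuple layout `(patent_id, publication_id)`, and the toy identifiers are all assumptions made for illustration. It shows only how duplicate references to the same paper within one patent collapse into a single pair before scoring, and how a matched share can be derived from the counts reported in the abstract.

```python
from typing import Iterable, Set, Tuple

# Hypothetical identifier layout: (patent_id, publication_id).
Pair = Tuple[str, str]


def pair_level_scores(predicted: Iterable[Pair], gold: Iterable[Pair]) -> Tuple[float, float]:
    """Return (precision, recall) computed over unique patent-paper pairs."""
    pred_set: Set[Pair] = set(predicted)  # repeated references collapse to one pair
    gold_set: Set[Pair] = set(gold)
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall


def matched_share(matched: int, extracted: int) -> float:
    """Fraction of extracted references that were matched to a publication."""
    return matched / extracted


# Toy data (invented): the pipeline output cites paper W1 twice in patent EP1.
predicted = [("EP1", "W1"), ("EP1", "W1"), ("EP1", "W2"), ("US9", "W7")]
gold = [("EP1", "W1"), ("US9", "W7"), ("US9", "W8")]
precision, recall = pair_level_scores(predicted, gold)

# Counts taken from the abstract: matched references / extracted references.
epo_share = matched_share(2_763_779, 5_438_836)
uspto_share = matched_share(11_069_995, 20_432_189)

print(f"precision={precision:.3f} recall={recall:.3f}")
print(f"EPO matched share={epo_share:.3f} USPTO matched share={uspto_share:.3f}")
```

Scoring on sets rather than lists is the design choice that distinguishes pair-level evaluation from reference-level evaluation: a patent that cites the same paper several times contributes one pair, so repeated correct (or incorrect) extractions are not counted more than once.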