Automatic extraction of multiword expressions (MWEs) presents a tough challenge for the NLP community and corpus linguistics. Although numerous knowledge-based symbolic approaches and statistically driven algorithms have been proposed, efficient MWE extraction remains an unsolved issue. In this paper, we evaluate the Lancaster UCREL Semantic Analysis System (henceforth USAS; Rayson, P., Archer, D., Piao, S., McEnery, T., 2004. The UCREL semantic analysis system. In: Proceedings of the LREC-04 Workshop, Beyond Named Entity Recognition Semantic labelling for NLP tasks, Lisbon, Portugal. pp. 7–12) for MWE extraction, and explore the possibility of improving USAS by incorporating a statistical algorithm. Developed at Lancaster University, the USAS system automatically annotates English corpora with semantic category information. Employing a large-scale, semantically classified multiword expression template database, the system is also capable of detecting many multiword expressions and of assigning semantic field information to the MWEs extracted. USAS therefore offers a unique tool for MWE extraction, allowing us both to extract and to semantically classify MWEs, but it can sometimes suffer from low recall. Consequently, we have compared USAS, which employs a symbolic approach, with a statistical tool based on collocational information, in order to determine the pros and cons of these different tools and, more importantly, to examine the possibility of improving MWE extraction by combining them. As we report in this paper, we have found a highly complementary relation between the two tools: USAS missed many domain-specific MWEs (law/court terms in this case), and the statistical tool missed many commonly used MWEs that occur at low frequencies (lower than three in this case).
Given this complementary relation, we propose that MWE coverage can be significantly increased by combining a lexicon-based symbolic approach with a collocation-based statistical approach.
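The idea of combining the two approaches can be illustrated with a minimal sketch. It is not the USAS system or the statistical tool evaluated in the paper: the function names, the toy lexicon, and the use of a raw bigram frequency threshold in place of a real collocation statistic are all assumptions made for illustration. The threshold of three mirrors the frequency cut-off mentioned above.

```python
from collections import Counter


def lexicon_mwes(tokens, lexicon):
    """Symbolic side (hypothetical): keep bigrams listed in a hand-built MWE lexicon."""
    return {bigram for bigram in zip(tokens, tokens[1:]) if bigram in lexicon}


def statistical_mwes(tokens, min_freq=3):
    """Statistical side (hypothetical): keep bigrams seen at least min_freq times.

    A raw frequency cut-off is only a crude stand-in for a collocation measure.
    """
    counts = Counter(zip(tokens, tokens[1:]))
    return {bigram for bigram, n in counts.items() if n >= min_freq}


def combined_mwes(tokens, lexicon, min_freq=3):
    """Union of both candidate sets, so each side covers the other's misses."""
    return lexicon_mwes(tokens, lexicon) | statistical_mwes(tokens, min_freq)


# Toy corpus: a frequent domain term and a rare general-language MWE.
tokens = "crown court sat . crown court rose . crown court adjourned . he gave up".split()
lexicon = {("gave", "up")}  # assumed lexicon entry; the domain term is deliberately absent

by_lexicon = lexicon_mwes(tokens, lexicon)
by_statistics = statistical_mwes(tokens)
combined = combined_mwes(tokens, lexicon)
```

In this toy case the lexicon finds only the low-frequency general expression, the frequency filter finds only the repeated domain term, and the union contains both.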
The study of multiword expressions (MWEs) has been a hot topic in computational linguistics in the last five years. This cross-disciplinary paper linked research in corpus-based natural language processing with corpus linguistics and showed that rule-based semantic heuristics and statistical extraction techniques are complementary. The paper was the culmination of a body of work submitted to Association for Computational Linguistics (ACL) conferences over a period of three years. It appeared in a special issue of the journal Computer Speech and Language (CSL) and was consistently among the top twenty CSL articles downloaded for twelve months from April 2005.