Natural-Annotation-Based Malay Multiword Expressions Extraction and Clustering

Wuying Liu; Lin Wang

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2023) 13396 LNCS 143-152

DOI: 10.1007/978-3-031-23793-5_13

Abstract

A multiword expression (MWE) is an optimal granularity for language reuse. However, the lack of explicit boundaries between MWEs and other words poses a serious problem for the automatic identification of MWEs in some less commonly taught languages. This paper addresses Malay MWE extraction and clustering and proposes a novel unsupervised extraction and clustering algorithm based on natural annotations. In our algorithm, we first apply a binary classification to each space character to extract Malay MWEs of varying length, then transfer natural document-level category annotations to the MWE level to cluster the extracted MWEs, and finally distill a general MWE resource and several domain-specific resources. Experimental results on a Malay dataset of 272,783 text documents show that our algorithm extracts MWEs precisely and dispatches them into domain clusters efficiently.
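
The two steps named in the abstract can be illustrated with a minimal sketch. The abstract does not say how each space is classified or how category annotations are transferred, so everything specific here is an assumption made for illustration: pointwise mutual information (PMI) with a fixed threshold stands in for the binary space classifier, a dominant-category share rule stands in for the annotation transfer, and the function names, parameters (`min_pmi`, `min_count`, `min_share`), and toy corpus are hypothetical and not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def build_glue(token_lists, min_pmi=1.0, min_count=2):
    """Return a binary classifier for the space between two adjacent words.

    Assumption: the decision uses PMI over corpus counts; the paper's actual
    features are not given in the abstract.
    """
    unigrams, bigrams = Counter(), Counter()
    for tokens in token_lists:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total = sum(unigrams.values())

    def glue(left, right):
        # True: the space joins two words of one MWE. False: it is a boundary.
        count = bigrams[(left, right)]
        if count < min_count:
            return False
        pmi = math.log(count * total / (unigrams[left] * unigrams[right]))
        return pmi >= min_pmi

    return glue


def split_into_mwes(tokens, glue):
    """Collect maximal runs of words whose separating spaces are all 'glue'."""
    mwes, run = [], tokens[:1]
    for left, right in zip(tokens, tokens[1:]):
        if glue(left, right):
            run.append(right)
        else:
            if len(run) > 1:
                mwes.append(" ".join(run))
            run = [right]
    if len(run) > 1:
        mwes.append(" ".join(run))
    return mwes


def cluster_mwes(labelled_docs, glue, min_share=0.8):
    """Transfer document-level categories to MWEs.

    Assumption: an MWE goes to a domain cluster when one category accounts for
    at least `min_share` of its occurrences, and to 'general' otherwise.
    """
    per_mwe = defaultdict(Counter)
    for category, tokens in labelled_docs:
        for mwe in split_into_mwes(tokens, glue):
            per_mwe[mwe][category] += 1
    clusters = {}
    for mwe, counts in per_mwe.items():
        top_category, top_count = counts.most_common(1)[0]
        share = top_count / sum(counts.values())
        clusters[mwe] = top_category if share >= min_share else "general"
    return clusters


# Invented toy corpus: (document category, whitespace-tokenized text).
toy_docs = [
    ("sports", "pasukan bola sepak menang".split()),
    ("sports", "bola sepak hari ini".split()),
    ("news", "perdana menteri hadir hari ini".split()),
    ("news", "perdana menteri bercakap".split()),
]
toy_glue = build_glue([tokens for _, tokens in toy_docs])
toy_clusters = cluster_mwes(toy_docs, toy_glue)
```

On this toy corpus, expressions that occur under a single document category land in that domain cluster, while an expression spread evenly across categories falls into the general resource. Because every run of glued spaces is kept whole, the extracted expressions can be of any length, which is the property the abstract attributes to per-space classification.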

Cite

Liu, W., & Wang, L. (2023). Natural-Annotation-Based Malay Multiword Expressions Extraction and Clustering. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 13396 LNCS, pp. 143–152). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-23793-5_13
