Czech dataset for semantic textual similarity

Lukáš Svoboda; Tomáš Brychcín

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2018) 11107 LNAI 213-221

DOI: 10.1007/978-3-030-00794-2_23

Abstract

Semantic textual similarity is the core shared task at the International Workshop on Semantic Evaluation (SemEval). It focuses on comparing the meaning of sentences. So far, most of the research has been devoted to English. In this paper we present the first Czech dataset for semantic textual similarity. The dataset contains 1425 manually annotated sentence pairs. Czech is a highly inflected language and is considered challenging for many natural language processing tasks. The dataset is publicly available to the research community. In 2016 we participated in the SemEval competition, where our UWB system was ranked second among 113 submitted systems in the monolingual subtask and first among 26 systems in the cross-lingual subtask. We adapt the UWB system, originally built for English, to Czech and experiment with the new Czech dataset. Our system achieves very promising results and can serve as a strong baseline for future research.
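To make the task concrete, the following is a minimal sketch of how a system can be scored on annotated sentence pairs. It is not the UWB system: the token-overlap scorer `jaccard_similarity` is a toy stand-in, the three Czech pairs and their gold scores are invented for illustration, and the 0 to 5 gold scale is assumed from the SemEval STS convention rather than taken from the abstract. SemEval STS tasks rank systems by Pearson correlation between system and gold scores; whether the paper reports the same metric is likewise an assumption here.

```python
import math


def jaccard_similarity(a: str, b: str) -> float:
    """Toy similarity in [0, 1]: shared tokens divided by all tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty sentences are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


# Invented pairs with invented gold scores (assumed 0-5 scale).
pairs = [
    ("pes běží po louce", "pes běží po louce", 5.0),
    ("pes běží po louce", "pes běží po poli", 3.5),
    ("pes běží po louce", "kočka spí na gauči", 0.0),
]

system_scores = [jaccard_similarity(a, b) for a, b, _ in pairs]
gold_scores = [gold for _, _, gold in pairs]
correlation = pearson(system_scores, gold_scores)
print(f"Pearson correlation: {correlation:.3f}")
```

Exact token matching of this kind treats every inflected form of a Czech word as a different token, which is one reason a highly inflected language is harder for such surface-level baselines.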

Cite

Svoboda, L., & Brychcín, T. (2018). Czech dataset for semantic textual similarity. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11107 LNAI, pp. 213–221). Springer Verlag. https://doi.org/10.1007/978-3-030-00794-2_23
