Semantic textual similarity is the core shared task at the International Workshop on Semantic Evaluation (SemEval). It focuses on comparing the meaning of sentences. So far, most of the research has been devoted to English. In this paper, we present the first Czech dataset for semantic textual similarity. The dataset contains 1425 manually annotated pairs. Czech is a highly inflected language and is considered challenging for many natural language processing tasks. The dataset is publicly available to the research community. In 2016, we participated in the SemEval competition, and our UWB system was ranked second among 113 submitted systems in the monolingual subtask and first among 26 systems in the cross-lingual subtask. We adapt the UWB system, originally built for English, to Czech and experiment with the new Czech dataset. Our system achieves very promising results and can serve as a strong baseline for future research.
Citation
Svoboda, L., & Brychcín, T. (2018). Czech dataset for semantic textual similarity. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11107 LNAI, pp. 213–221). Springer Verlag. https://doi.org/10.1007/978-3-030-00794-2_23