A hybrid approach toward biomedical relation extraction training corpora: Combining distant supervision with crowdsourcing

This article is free to access.

Abstract

Biomedical relation extraction (RE) datasets are vital for constructing knowledge bases and for enabling the discovery of new interactions. There are several ways to create biomedical RE datasets, some, such as annotation by domain experts, more reliable than others. However, the emerging use of crowdsourcing platforms, such as Amazon Mechanical Turk (MTurk), can potentially reduce the cost of RE dataset construction, even if the same level of quality cannot be guaranteed. Researchers have little power to control who engages in crowdsourcing platforms, how they do so, and in what context. Hence, combining distant supervision with crowdsourcing can be a more reliable alternative. The crowdsourcing workers are asked only to rectify or discard already existing annotations, which makes the process less dependent on their ability to interpret complex biomedical sentences. In this work, we use a previously created distantly supervised human phenotype–gene relations (PGR) dataset to perform crowdsourcing validation. We divided the original dataset into two annotation tasks: Task 1, in which 70% of the dataset was annotated by one worker, and Task 2, in which the remaining 30% was annotated by seven workers. For Task 2, we also added an extra on-site rater and a domain expert to further assess the quality of the crowdsourcing validation. Here, we describe a detailed pipeline for RE crowdsourcing validation, create a new release of the PGR dataset with partial domain expert revision, and assess the quality of the MTurk platform. We applied the new dataset to two state-of-the-art deep learning systems (BiOnt and BioBERT) and compared its performance with that of the original PGR dataset, as well as with combinations of the two, achieving a 0.3494 increase in average F-measure. The code supporting our work and the new release of the PGR dataset are available at https://github.com/lasigeBioTM/PGR-crowd.
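The two-task design described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the PGR-crowd repository: the function names, the label values ("true", "false", "discard"), the random shuffling and the use of simple majority voting to combine the seven Task 2 answers are all assumptions made for illustration, since the abstract does not state how worker answers were aggregated.

```python
import random
from collections import Counter


def split_tasks(items, task1_fraction=0.7, seed=0):
    """Shuffle the items and split them into Task 1 and Task 2 subsets.

    The 70/30 proportion comes from the abstract; shuffling with a fixed
    seed is an assumption made here for reproducibility.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * task1_fraction)
    return shuffled[:cut], shuffled[cut:]


def majority_label(votes):
    """Return the most frequent label among the votes, or None on a tie.

    Majority voting is a hypothetical aggregation rule, not necessarily
    the one used by the authors.
    """
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # no single winner among the top labels
    return counts[0][0]


# Hypothetical candidate relations produced by distant supervision.
relations = [f"relation_{i}" for i in range(10)]
task1, task2 = split_tasks(relations)

# Hypothetical answers from the seven Task 2 workers for one relation.
seven_votes = ["true", "true", "false", "true", "discard", "true", "false"]

print(len(task1), len(task2))
print(majority_label(seven_votes))
```

With an odd number of workers and only two labels, a tie cannot occur; the `None` branch matters only because this sketch allows a third "discard" answer.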

Cite

Sousa, D., Lamurias, A., & Couto, F. M. (2020). A hybrid approach toward biomedical relation extraction training corpora: Combining distant supervision with crowdsourcing. Database, 2020. https://doi.org/10.1093/DATABASE/BAAA104
