Determining similarity of scientific entities in annotation datasets

Abstract

Linked Open Data initiatives have made available a diversity of scientific collections in which scientists have annotated entities with controlled vocabulary terms from ontologies. These annotations encode scientific knowledge, which is captured in annotation datasets. Determining relatedness between annotated entities is a building block for pattern mining; for example, identifying drug-drug relationships may depend on the similarity of the targets that interact with each drug. A diversity of similarity measures has been proposed in the literature to compute relatedness between a pair of entities. Each measure exploits some kind of knowledge, including the name, function, relationships with other entities, taxonomic neighborhood and semantic knowledge. We propose a novel general-purpose annotation similarity measure called 'AnnSim' that measures the relatedness between two entities based on the similarity of their annotations. We model AnnSim as a 1-1 maximum weight bipartite match and exploit properties of existing solvers to provide an efficient solution. We empirically study the performance of AnnSim on real-world datasets of drugs and disease associations from clinical trials and relationships between drugs and (genomic) targets. Using baselines that include a variety of measures, we identify where AnnSim can provide a deeper understanding of the semantics underlying the relatedness of a pair of entities, where it could lead to predicting new links, and where it could identify potential novel patterns. Although AnnSim does not exploit knowledge or properties of a particular domain, its performance compares well with a variety of state-of-the-art domain-specific measures.
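To make the matching idea concrete, the following is a minimal sketch of an annotation similarity computed through a 1-1 maximum weight bipartite matching. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the normalization, so the score here is assumed to be twice the matched weight divided by the total number of annotations; the term names `t1` to `t5` and their pairwise similarities are invented; and the matching is found by exhaustive search, which is only practical for small annotation sets (the abstract mentions existing solvers for an efficient solution).

```python
from itertools import permutations

def max_weight_matching(weights):
    """Exhaustive 1-1 maximum weight bipartite matching (small inputs only).

    weights[i][j] is the similarity between left node i and right node j.
    Returns (total_weight, [(left_index, right_index), ...]).
    """
    n_left = len(weights)
    n_right = len(weights[0]) if weights else 0
    if n_left == 0 or n_right == 0:
        return 0.0, []
    # Enumerate assignments from the smaller side into the larger side.
    transposed = n_left > n_right
    if transposed:
        weights = [list(col) for col in zip(*weights)]
        n_left, n_right = n_right, n_left
    best_total, best_pairs = float("-inf"), []
    for perm in permutations(range(n_right), n_left):
        total = sum(weights[i][j] for i, j in enumerate(perm))
        if total > best_total:
            best_total, best_pairs = total, list(enumerate(perm))
    if transposed:
        best_pairs = [(j, i) for i, j in best_pairs]
    return best_total, best_pairs

def annsim(annotations_a, annotations_b, term_sim):
    """Assumed normalization: 2 * matched weight / (|A| + |B|)."""
    if not annotations_a or not annotations_b:
        return 0.0
    weights = [[term_sim(a, b) for b in annotations_b] for a in annotations_a]
    total, _ = max_weight_matching(weights)
    return 2.0 * total / (len(annotations_a) + len(annotations_b))

# Hypothetical pairwise term similarities, invented for illustration.
TOY_SIM = {("t1", "t3"): 0.9, ("t1", "t4"): 0.8,
           ("t2", "t3"): 0.7, ("t2", "t4"): 0.1}

def toy_term_sim(a, b):
    """Symmetric lookup; identical terms score 1.0, unknown pairs 0.0."""
    if a == b:
        return 1.0
    return TOY_SIM.get((a, b), TOY_SIM.get((b, a), 0.0))

score = annsim(["t1", "t2"], ["t3", "t4", "t5"], toy_term_sim)
print(round(score, 3))
```

In this toy input the best matching pairs `t1` with `t4` and `t2` with `t3` (total weight 1.5), whereas greedily taking the single heaviest edge first (`t1` with `t3`, 0.9) would leave `t2` with at most 0.1. That difference is why the measure is modelled as a global matching problem rather than a sum of per-term best matches.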

Figures

  • Figure 1. Annotation graph of Clinical Trials from LinkedCT (blue ovals). Interventions are green rectangles; conditions are pink rectangles and CV terms from the NCIt are red ovals.
  • Figure 2. Annotation subgraph representing the annotations of Brentuximab vedotin and Catumaxomab. Interventions are green rectangles; conditions are pink rectangles and ontology terms in the NCIt are red circles.
  • Figure 3. Bipartite graphs for drugs Brentuximab vedotin and Catumaxomab. For legibility, only the value of the highest matching edges is shown in (a). (a) Weighted bipartite graph for Brentuximab vedotin and Catumaxomab. (b) 1-1 maximum weight bipartite matching for Brentuximab vedotin and Catumaxomab.
  • Table 1. Description of the datasets
  • Table 2. Description of the ontologies used in the evaluation datasets
  • Table 3. Similarity measures for pairs of proteins in dataset 3
  • Table 4. Statistics of dataset 4 obtained from Perlman et al. (20)
  • Table 6. Statistics of dataset 5 downloaded from http://web.kuicr.kyoto-u.ac.jp/supp/yoshi/drugtarget/ (21)

Cite

Palma, G., Vidal, M. E., Haag, E., Raschid, L., & Thor, A. (2015). Determining similarity of scientific entities in annotation datasets. Database, 2015. https://doi.org/10.1093/database/bau123
