Synthetic protein sequence oversampling method for classification and remote homology detection in imbalanced protein data

Abstract

Many classifiers are designed on the assumption that datasets are well balanced. In real problems such as protein classification and remote homology detection, however, binary classifiers such as support vector machines (SVMs) and other kernel methods face imbalanced data: there are few protein sequences in the positive (minor) class compared with the negative (major) class. A widely used remedy in protein classification is to assign different error costs or decision thresholds to positive and negative data in order to control the sensitivity of the classifier. Our experiments show that when the datasets are highly imbalanced, and especially when the classes overlap, the efficiency and stability of that approach decrease. This paper shows that combining it with our proposed oversampling method for protein sequences can increase both the sensitivity and the stability of the classifier. The Synthetic Protein Sequence Oversampling (SPSO) method creates synthetic protein sequences of the minor class, taking into account the distribution of that class and of the major class, and it operates in data space rather than feature space. We used G-protein-coupled receptor families as real data, classifying them at the subfamily and sub-subfamily levels (where few sequences are available), and obtained better accuracy and Matthews correlation coefficient than previously published methods. We also generated artificial data with different distributions and degrees of overlap between the minor and major classes to measure the efficiency of our method. The method was evaluated by the area under the receiver operating characteristic (ROC) curve. © Springer-Verlag Berlin Heidelberg 2007.
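
The abstract refers to class-dependent error costs, the Matthews correlation coefficient, and the area under the ROC curve without defining them. The following is a minimal, standard-library sketch of those three quantities, not the authors' implementation. The function names and the toy labels are hypothetical, and the cost rule (costs inversely proportional to class size) is a common heuristic assumed here for illustration; the abstract does not state which ratio the authors used.

```python
import math


def class_costs(n_pos, n_neg, base_cost=1.0):
    """Return (cost_pos, cost_neg) with cost_pos / cost_neg = n_neg / n_pos.

    Assumption: an inverse-class-frequency heuristic, not taken from the paper.
    """
    if n_pos <= 0 or n_neg <= 0:
        raise ValueError("both classes need at least one example")
    return base_cost * n_neg / n_pos, base_cost


def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def matthews_corrcoef(y_true, y_pred):
    """MCC for binary labels (1 = positive/minor class, 0 = negative/major class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is reported as 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy imbalanced set: 2 positives, 18 negatives (hypothetical labels).
y_true = [1, 1] + [0] * 18
all_negative = [0] * 20  # a classifier that ignores the minor class

print(class_costs(n_pos=2, n_neg=18))           # positive errors cost 9x more
print(accuracy(y_true, all_negative))            # high despite missing every positive
print(matthews_corrcoef(y_true, all_negative))   # 0.0: no better than chance
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

On the toy labels, the classifier that predicts only the major class scores high accuracy but an MCC of zero, which is why imbalanced studies such as this one report MCC and ROC area alongside accuracy.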

Cite

Beigi, M. M., & Zell, A. (2007). Synthetic protein sequence oversampling method for classification and remote homology detection in imbalanced protein data. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 4414 LNBI, pp. 263–277). Springer Verlag. https://doi.org/10.1007/978-3-540-71233-6_21