Pairwise document similarity measure based on present term set

Abstract

Measuring pairwise document similarity is an essential operation in various text mining tasks. Most similarity measures judge the similarity between two documents based on the term weights and the information content that the two documents share. However, they are insufficient when several documents have an identical degree of similarity to a particular document. This paper introduces a novel text document similarity measure based on the term weights and the number of terms that appear in at least one of the two documents. The effectiveness of our measure is evaluated on two real-world document collections for a variety of text mining tasks, such as text document classification, clustering, and near-duplicate detection. The performance of our measure is compared with that of some popular measures. The experimental results show that our proposed similarity measure yields more accurate results.
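
The abstract names two ingredients: term weights, and the set of terms that appear in at least one of the two documents. The sketch below only illustrates those ingredients and is not the formula from the paper, which the abstract does not state. The function names are hypothetical, raw term frequency is an assumed weighting scheme, and the min/max ratio is a standard weighted Jaccard ratio used here as a stand-in.

```python
from collections import Counter


def term_weights(text):
    """Raw term-frequency weights (an assumed, simplistic weighting scheme)."""
    return Counter(text.lower().split())


def present_term_set(w1, w2):
    """Terms that appear in at least one of the two documents."""
    return set(w1) | set(w2)


def weighted_overlap(w1, w2):
    """Weighted Jaccard ratio: sum of min weights over sum of max weights.

    Illustrative stand-in only; not the measure proposed in the paper.
    """
    terms = present_term_set(w1, w2)
    if not terms:
        return 0.0
    # Counter returns 0 for a term that is absent from a document.
    shared = sum(min(w1[t], w2[t]) for t in terms)
    total = sum(max(w1[t], w2[t]) for t in terms)
    return shared / total


if __name__ == "__main__":
    a = term_weights("data mining text")
    b = term_weights("data mining text clustering similarity")
    print(len(present_term_set(a, b)), weighted_overlap(a, b))
```

Dividing by the weights summed over the present term set means that terms found in only one document lower the score, so two documents that share the same terms with a query can still be told apart by how many extra terms each one carries.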

Citation (APA)

Oghbaie, M., & Mohammadi Zanjireh, M. (2018). Pairwise document similarity measure based on present term set. Journal of Big Data, 5(1). https://doi.org/10.1186/s40537-018-0163-2
