Efficient Clustering of Emails into Spam and Ham: The Foundational Study of a Comprehensive Unsupervised Framework

Asif Karim; Sami Azam; Bharanidharan Shanmugam; Krishnan Kannoorpatti

Journal Article, Open Access

IEEE Access (2020) 8 154759-154788

DOI: 10.1109/ACCESS.2020.3017082

Abstract

The use of spam emails in malicious activities such as information and identity theft, malware propagation, and monetary and reputational damage is on the rise, with increasing effectiveness and diversification. These criminal acts undoubtedly endanger the privacy of many users and businesses. Several research initiatives have addressed the issue, but no complete solution exists so far; we believe an intelligent, automated methodology is the way forward. To date, however, only limited studies have applied purely unsupervised frameworks and algorithms to the problem. To investigate the possibilities, we propose an anti-spam framework that relies fully on unsupervised methodologies through a multi-algorithm clustering approach. This article presents an in-depth analysis of the methodologies of the first component of the framework, which examines only the domain- and header-related information found in email headers. A novel method of feature reduction using an ensemble of 'unsupervised' feature selection algorithms is also investigated. In addition, a comprehensive novel dataset of 100,000 records of ham and spam emails has been developed and used as the data source. The key findings are as follows: I) of the six clustering algorithms used, Spectral and K-means demonstrated acceptable performance, while OPTICS produced the optimum clustering, with on average 3.5% better efficiency than Spectral and K-means, as confirmed through a range of validation processes; II) the other three algorithms (BIRCH, HDBSCAN and K-modes) did not fare well enough; III) the average balanced accuracy of the best three algorithms was found to be ≈94.91%; and IV) the proposed feature reduction framework achieved its goal with high confidence.
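The abstract reports balanced accuracy for clustering output, which requires turning arbitrary cluster ids into spam/ham predictions first. The sketch below illustrates one common way to do this and is not taken from the paper: the majority-label mapping, the function names, and the toy data are all assumptions made for illustration. Balanced accuracy itself is the mean of the per-class recalls, so it is not inflated when one class outnumbers the other.

```python
from collections import Counter


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


def map_clusters_to_labels(cluster_ids, y_true):
    """Give each cluster the majority ground-truth label of its members.

    Assumption: this mapping is used for evaluation only; the paper's own
    cluster-to-label procedure is not described in the abstract.
    """
    members = {}
    for cid, label in zip(cluster_ids, y_true):
        members.setdefault(cid, Counter())[label] += 1
    mapping = {cid: counts.most_common(1)[0][0] for cid, counts in members.items()}
    return [mapping[cid] for cid in cluster_ids]


# Hypothetical toy data: 6 ham and 4 spam emails, with arbitrary cluster ids.
y_true = ["ham"] * 6 + ["spam"] * 4
cluster_ids = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

y_pred = map_clusters_to_labels(cluster_ids, y_true)
score = balanced_accuracy(y_true, y_pred)
print(f"balanced accuracy = {score:.4f}")
```

On this toy input, ham recall is 5/6 and spam recall is 3/4, so the score is their mean. Plain accuracy on the same data would be 8/10, which shows how the two measures diverge once the classes are unbalanced.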

Citation (APA)

Karim, A., Azam, S., Shanmugam, B., & Kannoorpatti, K. (2020). Efficient Clustering of Emails into Spam and Ham: The Foundational Study of a Comprehensive Unsupervised Framework. IEEE Access, 8, 154759–154788. https://doi.org/10.1109/ACCESS.2020.3017082
