Efficient Clustering of Emails into Spam and Ham: The Foundational Study of a Comprehensive Unsupervised Framework


Abstract

The use of spam emails in malicious activities such as information and identity theft, malware propagation, and monetary and reputational damage is on the rise, with increasing effectiveness and diversification. These criminal acts undoubtedly endanger the privacy of many users and businesses. Several research initiatives have addressed the issue, but no complete solution exists so far, and we believe an intelligent and automated methodology is the way forward. To date, however, only limited studies have been conducted on the application of purely unsupervised frameworks and algorithms to the problem. To explore the possibilities, we aim to propose an anti-spam framework that relies fully on unsupervised methodologies through a multi-algorithm clustering approach. This article presents an in-depth analysis of the methodologies of the first component of the framework, which examines only the domain- and header-related information found in email headers. A novel method of feature reduction using an ensemble of 'unsupervised' feature selection algorithms is also investigated. In addition, a comprehensive novel dataset of 100,000 records of ham and spam emails has been developed and used as the data source. The key findings are as follows: (I) of the six clustering algorithms used, Spectral and K-means demonstrated acceptable performance, while OPTICS produced the best clustering, with on average 3.5% better efficiency than Spectral and K-means, as validated through a range of validation processes; (II) the other three algorithms, BIRCH, HDBSCAN and K-modes, did not perform well enough; (III) the average balanced accuracy of the three best algorithms was ≈94.91%; and (IV) the proposed feature reduction framework achieved its goal with high confidence.
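The abstract reports balanced accuracy for clustering output but does not say how cluster ids were matched to the ham and spam labels. The sketch below shows one common way this could be done, and it is an assumption rather than the authors' procedure: try each one-to-one assignment of cluster ids to labels and keep the best balanced accuracy (the mean of per-class recall). The function names and the toy data are hypothetical.

```python
from itertools import permutations


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


def best_cluster_mapping_score(y_true, cluster_ids):
    """Try every one-to-one assignment of cluster ids to class labels and
    return the highest balanced accuracy.

    Assumes there are no more clusters than classes; brute force is only
    practical for a handful of clusters (two in the ham/spam case).
    """
    labels = sorted(set(y_true))
    clusters = sorted(set(cluster_ids))
    best = 0.0
    for perm in permutations(labels, len(clusters)):
        mapping = dict(zip(clusters, perm))
        y_pred = [mapping[c] for c in cluster_ids]
        best = max(best, balanced_accuracy(y_true, y_pred))
    return best


# Toy data (invented for illustration): 4 ham and 4 spam emails,
# with one ham email placed in the "wrong" cluster.
y_true = ["ham"] * 4 + ["spam"] * 4
cluster_ids = [0, 0, 0, 1, 1, 1, 1, 1]

score = best_cluster_mapping_score(y_true, cluster_ids)
print(score)  # 0.875 = (3/4 ham recall + 4/4 spam recall) / 2
```

Balanced accuracy is less sensitive to class imbalance than plain accuracy, which matters when ham and spam are not equally represented in a dataset.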


Citation (APA)

Karim, A., Azam, S., Shanmugam, B., & Kannoorpatti, K. (2020). Efficient Clustering of Emails into Spam and Ham: The Foundational Study of a Comprehensive Unsupervised Framework. IEEE Access, 8, 154759–154788. https://doi.org/10.1109/ACCESS.2020.3017082

