Everything is in the name – a URL based approach for phishing detection

Harshal Tupsamudre; Ajeet Kumar Singh; Sachin Lodha

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2019) 11527 LNCS 231-248

DOI: 10.1007/978-3-030-20951-3_21

Abstract

A phishing attack, in which a user is tricked into revealing sensitive information on a spoofed website, is one of the most common threats to cybersecurity. Most modern web browsers counter phishing attacks using a blacklist of confirmed phishing URLs. However, one major disadvantage of the blacklist method is that it is ineffective against newly generated phishes. Machine learning based techniques that rely on features extracted from the URL (e.g., URL length and bag-of-words) or from the web page (e.g., TF-IDF and form fields) are considered to be more effective in identifying new phishing attacks. The main benefit of using URL based features over page based features is that the machine learning model can classify new URLs on the fly, even before the page is loaded by the web browser, thus avoiding other potential dangers such as drive-by download attacks and cryptojacking attacks. In this work, we focus on improving the performance of URL based detection techniques. We show that, although a classifier trained on traditional bag-of-words features (tokenized using special characters) works well in many cases, it fails to recognize, among others, a very prevalent class of phishing URLs that combine a popular brand with one or more words (e.g., www.paypalloginsecure.com and paypalhelpservice.simdif.com). To overcome these flaws, we explore various alternative feature extraction techniques based on word segmentation and $n$-grams. We also construct and use a phishy-list of popular words that are highly indicative of phishing attacks. We verify the efficacy of each of these feature sets by training a logistic regression classifier on a large dataset consisting of 100,000 URLs. Our experimental results reveal that the combination of word segmentation features, phishy-list features and numerical features (e.g., URL length) performs better than all other feature sets, as measured by misclassification and false negative rates.
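The gap between special-character tokenization and word segmentation can be illustrated with a short sketch. The abstract does not say which segmentation algorithm or dictionary the authors use, so the dictionary-based dynamic program below, the toy vocabulary `VOCAB`, and the function names `tokenize`, `segment` and `segmented_tokens` are all assumptions made for illustration only.

```python
import re

# Hypothetical toy vocabulary (an assumption; the paper's dictionary is not given here).
VOCAB = {"paypal", "login", "secure", "help", "service", "www", "com"}


def tokenize(url):
    """Traditional bag-of-words tokens: split on any non-alphanumeric character."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


def segment(token, vocab=VOCAB):
    """Split a token into the fewest vocabulary words.

    Returns [token] unchanged when no complete split into vocabulary words exists.
    """
    # best[i] is the shortest known segmentation of token[:i], or None if there is none.
    best = [None] * (len(token) + 1)
    best[0] = []
    for i in range(1, len(token) + 1):
        for j in range(i):
            if best[j] is not None and token[j:i] in vocab:
                candidate = best[j] + [token[j:i]]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[-1] if best[-1] is not None else [token]


def segmented_tokens(url):
    """Tokenize on special characters, then segment each token into words."""
    return [word for token in tokenize(url) for word in segment(token)]


if __name__ == "__main__":
    for url in ("www.paypalloginsecure.com", "paypalhelpservice.simdif.com"):
        print(url)
        print("  special-character tokens:", tokenize(url))
        print("  segmented tokens:        ", segmented_tokens(url))
```

With special-character tokenization alone, the brand name stays buried inside a single token such as `paypalloginsecure`, which a bag-of-words classifier is unlikely to have seen during training. After segmentation, `paypal`, `login` and `secure` appear as separate features.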

Cite

Tupsamudre, H., Singh, A. K., & Lodha, S. (2019). Everything is in the name – a URL based approach for phishing detection. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11527 LNCS, pp. 231–248). Springer Verlag. https://doi.org/10.1007/978-3-030-20951-3_21
