HIMALAYAS: Hierarchical Machine Learning Stack for Fine-Grained Analysis of Malware Domain Groups
Team: Vinod Yegneswaran, Shalini Ghosh (SRI International), Arindam Banerjee (University of Minnesota), Guofei Gu (Texas A&M University)
Abstract
The domain name system (DNS) protocol plays a significant role in the
operation of the Internet by enabling the bi-directional association
of domain names with IP addresses. It is also increasingly abused by
malware, particularly botnets, through (1) automated domain
generation algorithms for rendezvous with a command-and-control (C&C)
server, (2) DNS fast flux as a way to hide the location of malicious
servers, and (3) DNS as a carrier channel for C&C communications. This
project explores the development of a scalable, hierarchical
machine-learning stack, called HIMALAYAS, which specializes in
algorithms for automatically mining DNS data for malware activity. In
particular, we are interested in isolating both ordered and unordered
sets of malware domain groups whose access patterns are temporally and
logically correlated.
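To make the idea of temporally correlated domain groups concrete, the following is a minimal sketch and not the project's actual algorithm. It assumes a hypothetical query log of (timestamp, client, domain) tuples, buckets queries into fixed time windows, and merges domains whose sets of (client, window) pairs overlap strongly under Jaccard similarity. The function name, the window size, and the threshold are all illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations


def co_access_groups(queries, window=60, min_jaccard=0.5):
    """Group domains queried by the same clients in the same time windows.

    queries: iterable of (timestamp_seconds, client_id, domain) tuples.
    This is an illustrative heuristic, not the HIMALAYAS method.
    """
    # Map each domain to the set of (client, window index) pairs that hit it.
    seen = defaultdict(set)
    for ts, client, domain in queries:
        seen[domain].add((client, int(ts // window)))

    # Union-find over domains; similar domains end up in the same component.
    parent = {d: d for d in seen}

    def find(d):
        while parent[d] != d:
            parent[d] = parent[parent[d]]  # path halving
            d = parent[d]
        return d

    for a, b in combinations(sorted(seen), 2):
        inter = len(seen[a] & seen[b])
        union = len(seen[a] | seen[b])
        if union and inter / union >= min_jaccard:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for d in seen:
        groups[find(d)].add(d)
    # Report only groups with more than one domain, in a stable order.
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


# Hypothetical log: two clients each query a.example and b.example together.
queries = [
    (0, "10.0.0.1", "a.example"), (5, "10.0.0.1", "b.example"),
    (70, "10.0.0.2", "a.example"), (75, "10.0.0.2", "b.example"),
    (200, "10.0.0.3", "c.example"),
]
print(co_access_groups(queries))
```

The pairwise comparison is quadratic in the number of domains, so a sketch like this only suits small samples; a scalable system would need a cheaper candidate-selection step first.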
HIMALAYAS performs tasks of increasing complexity at each level,
starting from scalable clustering and feature selection at the lower
levels and moving to more advanced malware domain subsequence
identification algorithms at the higher levels. It has multiple benefits,
including speed, accuracy, interpretability, and the ability to use domain
knowledge, which make it well suited for malware analysis and
related tasks. The analysis by HIMALAYAS should accelerate the
identification and takedown of malware domains on the Internet and
improve services such as Google SafeSearch.
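As a rough illustration of what ordered domain subsequence identification could look like at a higher level of such a stack, the sketch below counts contiguous domain n-grams that recur across client sessions. It is an assumption-laden example rather than the HIMALAYAS algorithm: the `sessions` structure, the function name, and the support threshold are hypothetical.

```python
from collections import Counter


def frequent_domain_sequences(sessions, length=2, min_support=2):
    """Return ordered domain n-grams seen in at least min_support sessions.

    sessions: dict mapping a client id to its time-ordered list of domains.
    Illustrative only; the project's actual algorithms are not shown here.
    """
    support = Counter()
    for domains in sessions.values():
        # Use a set so each n-gram counts at most once per session.
        grams = {
            tuple(domains[i:i + length])
            for i in range(len(domains) - length + 1)
        }
        support.update(grams)
    return {g: n for g, n in support.items() if n >= min_support}


# Hypothetical sessions: two clients visit x.example then y.example in order.
sessions = {
    "client-1": ["x.example", "y.example", "z.example"],
    "client-2": ["news.example", "x.example", "y.example"],
    "client-3": ["y.example", "x.example"],
}
print(frequent_domain_sequences(sessions))
```

Because order matters here, client-3's reversed visit does not add support for the (x.example, y.example) pair; an unordered variant would compare sets of domains instead.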
The machine-learning stack developed as part of the HIMALAYAS project
has broader application to many important data mining problems, such
as financial data analysis and mining user patterns from web access
logs. The project provides opportunities for students to participate
in the development and transition of the technology.
Relevant Publications
Shalini Ghosh, Ariyam Das, Phil Porras, Vinod Yegneswaran, Ashish Gehani.
Automated Categorization of Onion Sites for Analyzing the Darkweb Ecosystem.
In Proceedings of 23rd ACM SIGKDD Conference on Knowledge Discovery and Data Mining, August 2017.
Shalini Ghosh, Phillip Porras, Vinod Yegneswaran, Ken Nitz, Ariyam Das.
ATOL: A Framework for Automated Analysis and Categorization of the Darkweb Ecosystem.
Proceedings of AAAI Workshop on Artificial Intelligence for Cyber Security (AICS), February 2017.
Xiang Pan, Vinod Yegneswaran, Yan Chen, Phillip Porras, Seungwon Shin.
HogMap: Using SDNs to Incentivize Collaborative Security Monitoring.
Proceedings of SDN-NFV Security Workshop, March 2016.
Jialong Zhang, Sabyasachi Saha, Guofei Gu, Sung-Ju Lee, and Marco Mellia. "Systematic Mining of Associated Server Herds for Malware Campaign Discovery." In Proc. of the 35th International Conference on Distributed Computing Systems (ICDCS'15), Columbus, OH, June 2015. Best Paper Award.
Hongyu Gao, Vinod Yegneswaran, Jian Jiang, Yan Chen, Phillip Porras, Shalini Ghosh and Haixin Duan. Reexamining DNS from a Global Recursive Resolver Perspective. IEEE/ACM Transactions on Networking, 2014.
Jialong Zhang, Jayant Notani and Guofei Gu. Characterizing Google Hacking: A First Large-Scale Quantitative Study. Proceedings of Securecomm, 2014.
Huahua Wang and Arindam Banerjee. Bregman Alternating Direction Method of Multipliers. Proceedings of NIPS, 2014.
Hongyu Gao, Vinod Yegneswaran, Yan Chen, Phillip Porras, Shalini Ghosh, Jian Jiang and Haixin Duan. An Empirical Reexamination of Global DNS Behavior. Proceedings of ACM SIGCOMM, August 2013.
Acknowledgments
This project is funded by a grant from the National Science
Foundation (Award Number CNS-1314956). Any opinions, findings, and
conclusions or recommendations expressed in this material are those of
the authors and do not necessarily reflect the views of NSF.