Recently, malware (short for malicious software) has evolved greatly and has become
a major threat to home users, enterprises, and even governments.
Despite the extensive use and availability of various anti-malware tools such as antiviruses,
intrusion detection systems, firewalls etc., malware authors can readily evade
these precautions by using obfuscation techniques. To mitigate this problem, malware
researchers have proposed various data mining and machine learning approaches for
detecting and classifying malware samples according to their static or dynamic
feature sets. Although the proposed methods are effective over small sample sets, their
scalability to large datasets is still under investigation and remains an open
problem.
Moreover, it is well-known that the majority of malware is a variant of previously
known samples. Consequently, the volume of new variants created far outpaces the
current capacity of malware analysis. Thus, developing a malware classification system to
cope with the increasing number of malware samples is essential for the security community.
The key challenge in identifying the family of a malware sample is to achieve a balance between
the increasing number of samples and classification accuracy. To overcome this
limitation, unlike existing classification schemes, which apply machine learning algorithms
to stored data (i.e., they are offline algorithms), we propose a new malware
classification system employing online machine learning algorithms that can provide
an instantaneous update for each new malware sample as soon as it is introduced to
the classification scheme.
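The difference between offline and online learning can be illustrated with a minimal sketch. The code below is a plain multiclass perceptron, chosen only for brevity; it is not necessarily the algorithm used in this work, and the feature names and family labels are hypothetical. The point it shows is that the model is updated one sample at a time, without revisiting stored data.

```python
from collections import defaultdict


class OnlinePerceptron:
    """Multiclass perceptron that learns from one labeled sample at a time."""

    def __init__(self):
        # weights[label][feature] -> float; labels are registered when first seen
        self.weights = defaultdict(dict)

    def _score(self, label, features):
        w = self.weights[label]
        return sum(w.get(name, 0.0) * value for name, value in features.items())

    def predict(self, features):
        """Return the best-scoring known label, or None if nothing was learned yet."""
        if not self.weights:
            return None
        return max(self.weights, key=lambda label: self._score(label, features))

    def update(self, features, label):
        """Incorporate a single sample; weights change only on a misprediction."""
        predicted = self.predict(features)
        true_w = self.weights[label]  # registers the label if it is new
        if predicted != label:
            for name, value in features.items():
                true_w[name] = true_w.get(name, 0.0) + value
                if predicted is not None:
                    wrong_w = self.weights[predicted]
                    wrong_w[name] = wrong_w.get(name, 0.0) - value
        return predicted


# Hypothetical stream of (behavioral features, family label) pairs.
stream = [
    ({"registry:set_value": 1.0, "network:connect": 1.0}, "backdoor"),
    ({"file:copy_self": 1.0, "network:scan": 1.0}, "worm"),
]

model = OnlinePerceptron()
for features, label in stream:
    model.update(features, label)  # the model is usable right after each update
```

Each call to `update` touches only the weights of the features present in the incoming sample, which is what makes this style of learner attractive when samples arrive continuously.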
To achieve our goal, firstly we developed a portable, scalable and transparent malware
analysis system called VirMon for dynamic analysis of malware targeting the
Windows OS. VirMon collects the behavioral activities of analyzed samples at a low kernel
level through its custom-developed mini-filter driver. Secondly, we set up a cluster of three
machines for our online learning framework module (i.e., Jubatus), which allows us to
handle large-scale data. This configuration allows each analysis machine to perform
its tasks and deliver the obtained results to the cluster manager.
Essentially, the proposed framework consists of three major stages. The first stage
consists of extracting the behavior of the sample file under scrutiny and observing its
interactions with the OS resources. At this stage, the sample file is run in a sandboxed
environment. Our framework supports two sandbox environments: VirMon
and Cuckoo. During the second stage, we apply feature extraction to the analysis report.
The label of each sample is determined by using VirusTotal, an online
anti-virus scanning service comprising 46 engines. Then, at the final stage, the
malware dataset is partitioned into training and testing sets. The training set is used
to obtain a classification model and the testing set is used for evaluation purposes.
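The second and third stages can be sketched as follows. This is an illustrative outline under stated assumptions, not the feature extraction actually used in this work: the report format (a list of events with `category` and `action` fields), the function names, and the 80/20 split ratio are all hypothetical.

```python
import random
from collections import Counter


def extract_features(report):
    """Turn a behavior report (a list of event dicts) into sparse count features."""
    counts = Counter()
    for event in report:
        counts[f"{event['category']}:{event['action']}"] += 1.0
    return dict(counts)


def split_dataset(samples, test_ratio=0.2, seed=0):
    """Shuffle reproducibly and return (training_set, testing_set)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predict, labeled_samples):
    """Fraction of (features, label) pairs for which predict(features) == label."""
    if not labeled_samples:
        return 0.0
    correct = sum(1 for features, label in labeled_samples if predict(features) == label)
    return correct / len(labeled_samples)


# Hypothetical report fragment, as a sandbox might emit it.
report = [
    {"category": "file", "action": "write"},
    {"category": "file", "action": "write"},
    {"category": "registry", "action": "set_value"},
]
features = extract_features(report)
```

Any classifier that exposes a `predict(features)` callable can be passed to `accuracy` together with the testing set returned by `split_dataset`.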
To validate the effectiveness and scalability of our method, we evaluated it
on 18,000 recent malicious files, including viruses, trojans, backdoors,
worms, etc., obtained from VirusShare. Our experimental results show that our
method performs malware classification with 92% accuracy.
Keywords: Malware classification, dynamic analysis, online machine learning, behavior
modeling