Packed malware variants detection using deep belief networks

Malware is one of the most serious network security threats. To detect unknown malware variants, many researchers have proposed machine-learning-based detection methods in recent years. However, modern malware is often protected by software packers, obfuscation, and other techniques, which complicate malware analysis and detection. In this paper, we propose a system-call-based malware detection technique. By comparing malware and benign software in a sandbox environment, we extract a sensitive system call context based on information gain, which reduces the obfuscation caused by normal system calls. Using a deep belief network, we train a malware detection model on the sensitive system call context to improve detection accuracy.


Introduction
With the rapid growth of malicious software (malware) variants, traditional signature-based detection methods fail to detect new malware. In recent years, a series of machine-learning-based methods for detecting malware variants have been proposed.
However, modern anti-detection techniques such as software packing and obfuscation hinder detection by compressing, encrypting, and obfuscating malicious programs, which reduces detection accuracy. Although some unpacking techniques can recover the original programs, limitations remain: because a wide variety of public and private packers exist, unpacking techniques are not always effective.
To address this problem, we adopt a dynamic analysis method that detects malicious behaviors at runtime. However, in packed malware the malicious behavior is hidden and obfuscated by normal behavior and by the packing/unpacking routines, which makes packed malware challenging to detect.
To detect packed malware efficiently, we propose a packed malware variants detection method based on sensitive system calls and a deep belief network. We first obtain the system call sequences of executables in a sandbox, then extract the sensitive system calls by using information gain to reduce the obfuscation caused by normal behaviors and packers, and finally use the deep belief network to adaptively train a detection model that detects abnormal malicious behaviors.
The main contributions of this paper are as follows: 1) We propose a packed malware detection method based on Deep Belief Networks (DBN). 2) We propose a sensitive system call extraction method to reduce the obfuscation caused by packing/unpacking behaviors and normal behaviors. 3) Theoretical analysis and experimental results show that the proposed method can detect packed malware with more than 92% accuracy and less than 0.001 seconds of detection time.
The rest of this paper is organized as follows: Section 2 introduces the related works. Section 3 proposes our methods. Section 4 presents the experiments and Section 5 concludes this paper.

Static analysis-based detection methods
Static analysis-based detection methods usually extract operation codes (op-codes) with disassembly tools and detect malware by analyzing the features of the code. McLaughlin et al. [1] embedded malware op-codes and trained a malware detection model with an N-gram Convolutional Neural Network (CNN). Ming et al. [2] proposed a malware classification method that extracts API calls to construct API call subgraphs and classifies the malware family based on the features of these subgraphs. Zhang et al. [3] proposed a feature-hybrid malware detection method that integrates op-codes and API calls by merging the hidden layers of a CNN and a back-propagation neural network. Cesare et al. [4] proposed to extract the control flow graphs of executables and to search for similarities between unknown software and malware by string edit distance. Zhang et al. [5] proposed an Android malware detection method that builds a graph of op-codes and extracts its global topology features.

Dynamic analysis-based detection methods
Dynamic analysis-based detection methods usually analyze malicious behaviors in a sandbox, a virtual machine, or a similar environment. Huang et al. [6] proposed to analyze the behaviors of users and the behaviors of programs, and to detect stealing behavior in the Android system by searching for similarities. Canzanese et al. [7] used N-tuples of system calls to represent the system call sequence and trained the malware detection model with a support vector machine. Yang et al. [8] proposed to extract the sensitive behaviors of software that affect system security and to detect malware by comparing them with the sensitive behaviors of normal software and malware. Shabtai et al. [9] proposed to extract device states that affect system security and to compare normal and abnormal states at runtime to detect malware. Rieck et al. [10] proposed an automatic malware detection method based on system calls and machine learning.

Overview of our method
The overview of our malware detection method is shown in Figure 1. To capture the behavior of executables, we first run them in the Cuckoo sandbox and collect a log for each executable. The log contains a time series of system calls that represents the interactions between the executable and the operating system. Then we use a bi-gram model to build system call bi-grams that represent the local semantics among system calls. Based on information theory, we use information gain to extract the sensitive system call bi-grams. Finally, we use Deep Belief Networks (DBN) to train a malware detection model with these sensitive system calls.
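The bi-gram step can be illustrated with a minimal sketch. It assumes that a trace has already been reduced to an ordered list of system call names; the function name `syscall_bigrams` and the sample trace are hypothetical and are not taken from our implementation.

```python
from collections import Counter


def syscall_bigrams(calls):
    """Return the adjacent (previous, next) system call pairs of a trace."""
    return list(zip(calls, calls[1:]))


# Hypothetical trace; a real one would be parsed from the sandbox log.
trace = ["NtOpenFile", "NtReadFile", "NtClose", "NtOpenFile", "NtReadFile"]

bigrams = syscall_bigrams(trace)
counts = Counter(bigrams)  # how often each bi-gram occurs in this trace
print(counts.most_common(1))  # [(('NtOpenFile', 'NtReadFile'), 2)]
```

Each bi-gram keeps the order of two consecutive calls, so it carries slightly more context than a single system call while remaining cheap to count.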

Sensitive system call extraction
We collect the log of runtime system calls through the Cuckoo sandbox. The log includes the system call, time stamp, input and output data, etc. Since we find that the probability distribution of the system call context in malware is quite different from that in benign software, we propose a sensitive system call extraction method based on information gain.
To extract the sensitive system call context, we analyze the probability distribution of the system call bi-grams in malware and benign software, and calculate the information gain of each system call bi-gram according to Eq. (1), where $x_{ij}$ is the system call bi-gram and $y$ is the class of an executable. Let $p(x_{ij})$ be the probability of the system call bi-gram $x_{ij}$, according to Eq. (2), where $p(y)$ is the probability of the malware samples and $p(x_{ij} \mid y)$ is the conditional probability of $x_{ij}$ when the executable is malware. When the gain approaches 1, the system call bi-gram is sensitive. To reduce the obfuscating system calls caused by normal behaviors and packing/unpacking behaviors, we keep the sensitive system call context and remove the remaining system calls. After that, we represent the sensitive system call context by a vector of the probabilities of the sensitive system calls. Let $\{p(x_1), \ldots, p(x_n)\}$ be this vector, where $p(x_k)$ is the probability of the $k$-th sensitive system call in the sensitive system call context.
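The extraction step can be sketched as follows. The sketch assumes the textbook definition of information gain, $H(y) - H(y \mid x)$, with a binary feature that records whether a bi-gram occurs in a sample; this may differ in detail from Eq. (1) and Eq. (2). The helper names, the toy samples, and the normalization used in `probability_vector` are illustrative assumptions.

```python
import math


def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information_gain(samples, bigram):
    """samples: list of (set_of_bigrams, label), label 1 = malware, 0 = benign."""
    n = len(samples)
    p_mal = sum(label for _, label in samples) / n
    gain = entropy([p_mal, 1 - p_mal])  # H(y)
    for present in (True, False):
        subset = [label for grams, label in samples if (bigram in grams) == present]
        if subset:
            p = sum(subset) / len(subset)
            gain -= len(subset) / n * entropy([p, 1 - p])  # weighted H(y | x)
    return gain


def probability_vector(trace_bigrams, sensitive):
    """Relative frequency of each sensitive bi-gram within one trace (assumed form)."""
    total = len(trace_bigrams) or 1
    return [trace_bigrams.count(g) / total for g in sensitive]


# Toy data: the registry pair occurs only in malware, the file pair in both classes.
reg_pair = ("RegOpenKeyExW", "RegCloseKey")
file_pair = ("NtOpenFile", "NtClose")
samples = [({reg_pair, file_pair}, 1), ({reg_pair}, 1), ({file_pair}, 0), ({file_pair}, 0)]

print(information_gain(samples, reg_pair))  # 1.0
sensitive = [g for g in (reg_pair, file_pair) if information_gain(samples, g) >= 0.5]
```

The cut-off of 0.5 is a hypothetical threshold chosen for the toy data. With it, only the registry pair is kept as sensitive, which matches the rule that a gain close to 1 marks a sensitive bi-gram.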

Training and detection
Once we have represented the sensitive system calls, we use Deep Belief Networks (DBN) to detect packed malware.

Architecture
The DBN consists of a Restricted Boltzmann Machine (RBM) and a multi-layer perceptron (MLP). The RBM embeds the original statistical representation of the sensitive system calls to optimize the inputs of the MLP, which improves convergence accuracy and speed.
The RBM has two layers: the visible layer and the hidden layer. The visible layer takes the statistical representation of the sensitive system calls as input, and the hidden layer outputs its embedding vector. The neurons of the visible layer and the hidden layer are fully connected. The two layers share the connection weights when information is transmitted in either direction. In this paper, the RBM is trained by N steps of Gibbs sampling, in which the neurons in each layer are randomly activated. The activation of each neuron is calculated by the sigmoid function, shown in Eq. (3). The output of the hidden layer of the RBM is passed to the MLP.
The MLP used in this paper has three layers: an input layer, a hidden layer and an output layer. The input layer is fully connected to the hidden layer, and the hidden layer is fully connected to the output layer, both using the sigmoid function shown in Eq. (3). The connection weights between two adjacent layers are initialized with random values. The output layer outputs the probability of malware and the probability of benign.
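One Gibbs sampling step of such an RBM can be sketched in plain Python. This is an illustration under assumptions rather than our implementation: the layer sizes, weights and biases below are made up, and the sketch only shows how the shared weight matrix is used in both directions with sigmoid activation probabilities.

```python
import math
import random


def sigmoid(z):
    """Logistic function used as the activation probability."""
    return 1.0 / (1.0 + math.exp(-z))


def sample_layer(inputs, weights, biases, rng):
    """Compute activation probabilities of a layer and sample binary states."""
    probs = [sigmoid(b + sum(x * w for x, w in zip(inputs, row)))
             for row, b in zip(weights, biases)]
    states = [1 if rng.random() < p else 0 for p in probs]
    return probs, states


def gibbs_step(visible, W, b_hidden, b_visible, rng):
    """One visible -> hidden -> visible pass; W[j][i] links visible i to hidden j."""
    h_probs, h_states = sample_layer(visible, W, b_hidden, rng)
    W_T = [list(col) for col in zip(*W)]  # same weights, used in reverse
    v_probs, _ = sample_layer(h_states, W_T, b_visible, rng)
    return h_probs, v_probs


rng = random.Random(0)
v = [0.2, 0.0, 0.8]                        # toy probability vector (3 visible units)
W = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1]]   # 2 hidden units, hypothetical weights
h_probs, v_probs = gibbs_step(v, W, [0.0, 0.0], [0.0, 0.0, 0.0], rng)
```

Here `h_probs` plays the role of the embedding vector that is passed on to the MLP, and `v_probs` is the reconstruction of the visible layer that a training procedure would compare with the input.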
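The forward pass of this three-layer MLP can be sketched as below. The layer sizes, the initialization range and the parameter names `W1`, `b1`, `W2`, `b2` are assumptions made for illustration only.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases):
    """Fully connected layer followed by the sigmoid function."""
    return [sigmoid(b + sum(x * w for x, w in zip(inputs, row)))
            for row, b in zip(weights, biases)]


def init_weights(n_out, n_in, rng):
    """Random initial weights; the range [-0.5, 0.5] is an arbitrary choice."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]


def mlp_forward(x, params):
    """Input layer -> hidden layer -> output layer (malware, benign)."""
    hidden = dense(x, params["W1"], params["b1"])
    return dense(hidden, params["W2"], params["b2"])


rng = random.Random(1)
params = {
    "W1": init_weights(3, 4, rng), "b1": [0.0] * 3,  # 4 inputs, 3 hidden units
    "W2": init_weights(2, 3, rng), "b2": [0.0] * 2,  # 2 outputs
}
p_malware, p_benign = mlp_forward([0.1, 0.4, 0.0, 0.5], params)
```

Because both outputs pass through independent sigmoid units, they need not sum to 1; they are two scores in (0, 1) that are compared at detection time.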

Training process
Since the output of the RBM is the input of the MLP and the training processes of the RBM and the MLP are independent, we train the RBM and the MLP separately.
We train the MLP by gradient descent. We use the squared error as the loss function of the MLP, according to Eq. (6), Eq. (7) and Eq. (8), where $x$ is the input, $y$ is the label, $h(x)$ is the output computed by the neural network, $w$ is the connection weight between two adjacent layers of neurons, and the step length of each iteration acts as the learning rate. Through back propagation, we update the weights between two adjacent layers based on the chain rule, according to Eq. (9).
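The weight update can be illustrated for a single sigmoid neuron. The sketch assumes the loss $\frac{1}{2}(h(x) - y)^2$ and a fixed step length `lr`; these choices, the training example and the number of iterations are assumptions and may differ from Eq. (6) to Eq. (9).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_step(w, b, x, y, lr):
    """One gradient descent step on the loss 0.5 * (h(x) - y)^2."""
    h = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
    loss = 0.5 * (h - y) ** 2
    # Chain rule: dL/dz = (h - y) * h * (1 - h), and dz/dw_i = x_i.
    delta = (h - y) * h * (1.0 - h)
    new_w = [wi - lr * delta * xi for wi, xi in zip(w, x)]
    new_b = b - lr * delta
    return new_w, new_b, loss


w, b = [0.0, 0.0], 0.0
x, y = [0.5, 0.1], 1.0   # hypothetical training example labelled as malware
losses = []
for _ in range(200):
    w, b, loss = train_step(w, b, x, y, lr=0.5)
    losses.append(loss)
```

In a multi-layer network the same `delta` term is propagated backwards through each layer, which is what back propagation does with the chain rule.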

Detection process
After the RBM and the MLP are trained, the detection model is formed by fixing the weights between adjacent layers of neurons, the number of network layers, the number of neurons in each layer, and the other parameters. To detect an unknown packed executable, we first extract its sensitive system call context and represent the probabilities of the sensitive system calls as a statistical vector, and then feed this vector to the detection model. The detection model outputs the probabilities of malware and benign through a forward pass. If the probability of malware is large enough and greater than the probability of benign, the packed executable is classified as malware.
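The decision rule can be written as a short sketch. The function name `is_malware` and the default threshold of 0.5 are assumptions; the text only requires the malware probability to be "large enough" and greater than the benign probability.

```python
def is_malware(p_malware, p_benign, threshold=0.5):
    """Flag a sample when the malware score passes both conditions."""
    return p_malware >= threshold and p_malware > p_benign


print(is_malware(0.9, 0.1))  # True
print(is_malware(0.4, 0.3))  # False: larger than benign, but below the threshold
print(is_malware(0.6, 0.7))  # False: above the threshold, but benign scores higher
```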

Experiments
We design several groups of experiments, each analyzed on a different group of sample data. We use 10-fold cross-validation for verification, and the reported results are the averages over the 10 folds.
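The 10-fold split can be sketched with the standard library only. The shuffling seed, the sample count and the way the folds are formed are assumptions for illustration; any partition into 10 disjoint folds would serve the same purpose.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds of near-equal size
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


splits = list(k_fold_indices(25))
print(len(splits))  # 10
```

A model would be trained on each `train` list and evaluated on the matching `test` list, and the 10 resulting accuracies would then be averaged.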

Experimental setup
All compared methods are implemented in the same environment with the same configuration (CPU, memory, hard disk, operating system (OS), and Java virtual machine), as shown in Table 1.

Data sets
The malware data sets used in this paper are collected from the VX Heaven website [15], and the benign data sets are collected from our personal computers. Some of the malware samples and some of the benign samples are packed with several packers, such as ASPack [11], UPX [12], VMProtect [13] and ZProtect [14], and are used for packed malware detection. Part of the malware samples and part of the benign samples are used for training, and the others are used for detection, as shown in Table 2.

Pre-processing
For each malware sample and benign sample in the data sets, we capture the log of runtime system calls as the behavior of the executable by using the Cuckoo sandbox. From our data sets, we capture 139 kinds of Windows system calls. By analyzing the probability of each system call in malware and benign software, we find that the system calls whose distribution is significantly skewed toward malware mainly include RegOpenKeyExW, RegCloseKey, RegOpenKeyExA, RegQueryValueExA, RegQueryValueExW, FindFirstFileExW, NtDelayExecution, etc.

Accuracy analysis
We compare the DBN method with other machine learning methods, namely the DBSCAN clustering method and the support vector machine (SVM) method. The experimental results are shown in Table 3. The DBN method achieves an accuracy of 92.6%, while the DBSCAN method achieves only 78.3% and the SVM method achieves 86.3%. These results show that our proposed method can detect packed malware from different malware families.

Time cost analysis
The training and detection times of the compared machine learning methods are shown in Table 4. The DBN method takes less than 0.001 seconds of detection time and 277.4 seconds of training time, while the DBSCAN method takes less than 3.1 seconds of detection time and needs no training time, and the SVM method takes less than 0.04 seconds of detection time and 46.0 seconds of training time.

Conclusion
In this paper, we propose a packed malware detection method that first captures the runtime system call sequences of executables in the Cuckoo sandbox, then extracts the sensitive system calls by using information gain to reduce the obfuscation caused by normal behaviors and packing/unpacking behaviors, and finally uses the deep belief network to adaptively train a detection model that detects packed malware. Theoretical analysis and experimental results show that the proposed method can detect packed malware with more than 92% accuracy and less than 0.001 seconds of detection time.