Permission-based Analysis of Android Applications Using Categorization and Deep Learning Scheme

As mobile devices grow in popularity, they have become indispensable in people’s daily lives, keeping us connected to social networks, breaking news, and the entire Internet. While there are multiple competing platforms, Google’s Android is currently the most popular operating system for mobile devices. This popularity has drawn the attention of hackers as well. Thus far, research works have analysed Android permissions individually, which makes the analysis complex and time consuming. In this work, we propose categorizing Android permissions based on Google’s recommendation and analysing the resulting data with a long short-term memory (LSTM) network. The datasets used are Drebin and AndroZoo, which are among the most complete and well-respected datasets in the research community. The experimental results show that the LSTM achieved a true positive rate of 91%.


Introduction
As mobile devices grow in popularity, they have become indispensable in people's daily lives, keeping us connected to social networks, breaking news, and the entire Internet. While there are multiple competing platforms, Google's Android is currently the most popular operating system for mobile devices.
Google takes an open stance toward Android. Third-party applications are not rigorously scrutinized before they are made available to users on the Play Store. This open approach has led to security challenges and privacy concerns, and Google relies on community reviews (e.g., user ratings and application flagging) to mitigate security threats. Google can also remove any detected or reported malware from the Play Store and remotely from affected devices [1].
To better defend against malware, Google launched Bouncer [2], a server-side security service that performs a series of analyses to detect hidden, malicious behaviour when applications are uploaded to Google Play. However, it is found that Bouncer may be bypassed by sophisticated malware from knowledgeable adversaries [3]. More recently, Google launched a new application scanning service as an extension to Bouncer with the release of Android 4.2. This new security service is built into the platform to detect potentially malicious or harmful code when side-loading applications onto devices.
To implement rich features and achieve better user experience, applications frequently collect and compile sensitive data available on mobile devices. Many of these applications access users' personal information, such as geolocation, device identifiers and contact lists. Ideally, such accesses should be controlled to protect user privacy.
Although applications must ask for permission to use sensitive information, it may not be clear to the end user how such information is actually used once access has been granted. For instance, a photo-sharing application needs to access the network to transmit photos and to acquire the user's location for geotagging them. In this example, the user needs to trust that the application uses this privacy-sensitive information only as indicated, which is not always the case. If the application is ad-supported, it may send the user's location along with other privacy-related data to advertisers' servers to display targeted ads. To mitigate this issue, recent legislation mandates that any application that collects personal data from users must conspicuously post a privacy policy describing how such information is collected, used and shared [4]. Unfortunately, privacy policies often contain broad, vague legal language, and can be inaccurate or out of date [5,6]. Therefore, we select permissions as the feature for this work.
The use of Android permissions in this work differs from other works: we group the requested permissions according to Google's categories, such as normal and dangerous. This makes the analysis of the data faster and more effective. Additionally, this work analyses 100,000 Android applications, more than other related works, which makes the results more reliable.

Related works
The Android operating system is built on a Linux core, from which it inherits important parts of the Linux security architecture. Prior to the installation of an application, Android presents the list of requested permissions to the user. Once the permissions are granted, the application is installed on the device. There are 130 official Android permissions [7]. Google categorizes them into four groups, namely normal, dangerous, signature, and signatureOrSystem [8].
Researchers take different approaches in analysing Android permissions.
Android builds part of its security on a "permission restricted access model" for sensitive resources (e.g., SD card, contacts). This means that applications, to gain access to such resources, must declare the appropriate permissions in the manifest, which users grant during installation. However, in such a model an application might "manipulate" the requested permissions and gain access to private information without the user's consent.
A typical example is applications that request more permissions than they actually need, known as over-privileged applications [9,10]. Such applications can be silently transformed into malware whenever an operating system or application update occurs. Furthermore, the growth in the number of Android permissions from the first version (100 permissions) to the latest version (170), as Figure 1 illustrates, enlarges the attack surface exposed to an adversary.

Fig. 1. Android permission evolution
RiskMon [11] introduces an automated service that assesses the security and privacy risk of a given application, taking into account the normal behaviour of legitimate users. RiskMon leverages (a) machine learning and (b) various run-time features of trusted applications to build a model of the user's legitimate behaviour. RiskRanker [12] detects zero-day Android malware by using static analysis to determine whether a particular application exhibits dangerous behaviour.
DroidAnalytics [13] scrutinizes Android applications at the bytecode level and generates the corresponding signatures, which can be used by antivirus software. In the same direction, Shahzad et al. feed bi-gram sequences of opcodes into machine learning classifiers to detect malware [14], while Permlyzer analyses an application's permission usage based on both static and dynamic analysis [15]. Barrera et al. perform a permission analysis based on the Self-Organising Map (SOM) [16], while Xuetao et al. study the evolution of Android permissions [17].
Other solutions such as Whyper [18] reason about why an application needs to request a specific permission. To do so, Whyper relies on Natural Language Processing (NLP), extracting information from the keywords and description defined in the application. Similarly, TatWing et al. build a permission-based anomaly model that leverages an application's description and its permissions [19].
Yajin et al. introduce a tool for the systematic study of applications that might passively leak private information due to vulnerabilities stemming from built-in Android components, such as read/write operations on content providers [20]. Applications are statically analysed to identify such data flows.
Analogously, Liang et al. propose a malware detection engine that relies on the semantic analysis of an examined application [21]. Sbirlea et al. develop techniques for statically detecting Android application vulnerabilities to attacks that aim at obtaining unauthorized access to permission-protected information [22].
Thus, we believe that permissions and other related information (i.e., APIs) residing in applications should be considered an important information source for detecting malicious applications. In this work, we introduce an analysis approach that enhances the performance of machine-learning-based anomaly detection techniques used to assess whether an application is malicious, based on the application's permission-related information. The approach builds on previous research works [23][24][25] and aims at higher accuracy. To do so, we group permissions based on Google's documentation (footnote a).

Dataset analysis
This section introduces our datasets and describes them. In addition, we present some insights into the data using statistical analysis.

a. http://developer.android.com/guide/topics/manifest/permission-element.html

Datasets
Every study requires a dataset against which the authors evaluate their proposed system. Android malware is a relatively new research area. The first Android malware was discovered in 2010 [26]. Initially, researchers did not have a solid, standard dataset to work with.
The Drebin dataset was published in 2014 by Arp et al. [24]. It is a collection of 5,560 Android malware samples categorized into 179 different families, collected between August 2010 and October 2012. The authors scanned the Drebin samples with antivirus applications. They report that while the best scanners detected over 90% of the malware, others detected less than 10%. Drebin has been well accepted among researchers [27,28], and we acquired it for this study.
Unlike the dataset described above, AndroZoo is a growing collection of Android applications from several sources, including the official Google Play. At the time of writing, AndroZoo contains more than five million Android applications. It contains not only Android malware but also clean applications [29]. Crawling of the different sources began in late 2011 and has continued since.
The 14 sources include Google Play, Anzhi, AppChina, 1mobile, AnnGeeks, Slideme, HiApk, ProAndroid, and others. AndroZoo sends every downloaded application to VirusTotal for scanning. The number of antivirus products that flag an application as malicious is stored in the metadata file as vt_detection.
The metadata file is available on the AndroZoo website and is updated regularly. If vt_detection is zero, the application is considered clean; otherwise, it is considered malware. This feature enables researchers to use AndroZoo as a malware repository as well as a clean application repository. For these reasons, we chose this dataset and Drebin for our experiment.
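The labelling rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: it assumes the metadata is a CSV file with a `sha256` column identifying each application and the `vt_detection` column named in the text, and it assumes that rows without a scan result are skipped. The sample rows are invented.

```python
import csv
import io


def label_apps(metadata_csv):
    """Map each application hash to 'clean' or 'malware' using vt_detection."""
    labels = {}
    for row in csv.DictReader(metadata_csv):
        raw = row["vt_detection"].strip()
        if raw == "":
            # Assumption: applications without a scan result are skipped.
            continue
        # Zero detections means clean; any positive count means malware.
        labels[row["sha256"]] = "clean" if int(raw) == 0 else "malware"
    return labels


# Invented sample rows, for illustration only.
sample = io.StringIO(
    "sha256,vt_detection\n"
    "aaa,0\n"
    "bbb,7\n"
    "ccc,\n"
)
labels = label_apps(sample)
print(labels)
```

A stricter threshold (for example, requiring several detections before labelling an application as malware) would be a one-line change, but the text uses the zero versus non-zero rule shown here.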

Data exploration
In this section, we analyze our dataset in terms of static analysis and in particular, permissions. Permissions are classified according to protection levels. The purpose of a protection level is twofold. First, it characterizes the general risk that is implied by the respective permission. Second, it determines how the Android platform handles the process when applications request the respective permission.
In this work, we use the four protection levels to assign each permission to its respective group. For each application, we determine which of the four groups each requested permission belongs to. Then, we analyze the application to classify it as clean or malware. This contribution makes the analysis of Android applications faster, since we deal with only four groups of permission levels instead of all 130 Android permissions. Based on our analysis, the applications requested 6,079 permissions in total, an average of 5.5 permissions per application. However, this set includes many duplicates, meaning that many applications request the same permissions. In total, the installed applications requested 283 different permissions.
However, more than 40% of them were requested by only one of the installed applications. This is because many permissions are vendor- or application-specific. For example, the permission com.symantec.permission.ACCESS_NORTON_SECURITY was requested by only a single app in the whole dataset.
In contrast to the permissions that are requested only once, some permissions are requested far more often than others. The ten most frequently requested permissions are listed in Table 1.
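The statistics reported in this section (total requests, average per application, number of distinct permissions, share requested by a single application, and the most frequent permissions) can all be derived from one frequency count. The sketch below shows the idea on three invented applications; the application names and the permission `com.example.permission.CUSTOM` are hypothetical and the numbers are not those of the study.

```python
from collections import Counter

# Invented toy data: application name -> set of requested permissions.
apps = {
    "app_a": {"android.permission.INTERNET", "android.permission.READ_CONTACTS"},
    "app_b": {"android.permission.INTERNET", "com.example.permission.CUSTOM"},
    "app_c": {"android.permission.INTERNET", "android.permission.READ_CONTACTS"},
}

# Count how many applications request each permission.
usage = Counter(p for perms in apps.values() for p in perms)

total_requests = sum(usage.values())        # all requests, duplicates included
average = total_requests / len(apps)        # requests per application
distinct = len(usage)                       # number of different permissions
once = [p for p, n in usage.items() if n == 1]
once_share = len(once) / distinct           # share requested by a single app
top = usage.most_common(2)                  # most frequently requested

print(total_requests, average, distinct, once_share)
print(top)
```

On real data, `usage.most_common(10)` would produce the kind of ranking that Table 1 reports.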

Experimentation
This section presents the experiment and the evaluation of the proposed method. First, the deep learning scheme is discussed. Then, the results of the experiment are presented and compared with related works.

Deep learning Scheme
Since this work uses a deep learning scheme in its experiments, it is useful to explain the history of deep learning, its structure, and the available algorithms.
Recurrent neural networks (RNNs) and long short-term memory (LSTM) networks were proposed to deal with series of related data. One of the appeals of RNNs is that they can connect previous information to the present task; for example, previous video frames might inform the understanding of the present frame [30].
The general structure of a deep learning scheme is shown in Figure 2. It consists of an input layer, which receives the input data; hidden layers, in which the algorithm performs numerous mathematical calculations; and an output layer, which holds the results of these calculations.
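To make the idea of carrying previous information forward concrete, the following sketch implements a single-unit plain RNN (not an LSTM, which adds gating on top of the same recurrence). The weights `w_x`, `w_h` and the bias `b` are arbitrary values chosen for illustration, not parameters from this study.

```python
import math


def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One recurrent step: the new state mixes the input with the old state."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)


def run_rnn(sequence, w_x=0.5, w_h=0.8, b=0.0):
    """Run the recurrence over a sequence, starting from a zero state."""
    h = 0.0
    states = []
    for x_t in sequence:
        h = rnn_step(x_t, h, w_x, w_h, b)
        states.append(h)
    return states


# The second input is zero, yet the second state is non-zero:
# the hidden state still carries information from the first input.
states = run_rnn([1.0, 0.0])
print(states)
```

The second state depends only on the first input through `h_prev`, which is the "memory" that the text refers to.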

Fig. 2. Structure of Deep Learning Scheme
As stated before, the concept of deep learning has been around for decades. However, deep learning algorithms have gained momentum in recent years for two reasons. The first is the availability of massive data. Deep learning algorithms work well with massive amounts of data, such as credit card transactions and geospatial data. The more data is available, the better they are trained. In this study, we gather the network traffic and permissions of millions of applications, which is considered good input for deep learning algorithms.
The second reason is the availability of computing power. In recent years, the emergence of cloud computing has dramatically increased the available computing power. It is possible to rent up to 64 CPU cores from Google Cloud or Amazon AWS at an affordable price. Deep learning algorithms can process terabytes of data using massive computing power to achieve high accuracy.
In this work, we experiment with two types of deep learning algorithms: the classic deep neural network (DNN) and the recurrent neural network (RNN). The DNN follows the architecture depicted in Figure 2, with input, hidden, and output layers. The hidden layers are where the calculations happen. Each neural network involves the following functions and calculations.
The first step in a neural network is forward propagation, in which the inputs are propagated across the layers and the network predicts the output based on the inputs. Accordingly, the following functions are defined in forward propagation, written here in standard notation.

$$z = W x + b, \qquad a = g(z), \qquad \hat{y} = \sigma(z) \qquad (1)$$

The equations $z$ and $a$ are functions that take $x$ as input and use $W$ and $b$ as weight and bias, respectively. The function $g$ is an activation function that takes $z$ as input and passes the result to the next layer. In the output layer, the function $\sigma$ is used to calculate $\hat{y}$, which is the prediction of the neural network.
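The forward propagation step described above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the network used in the study: the source does not name the activation functions, so ReLU is assumed for the hidden layer and a sigmoid for the output layer, and the layer sizes and weight values in `params` are invented.

```python
import math


def sigmoid(z):
    """Output activation (assumed): squashes z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    """Hidden activation (assumed): passes positive values, zeroes the rest."""
    return max(0.0, z)


def dense(inputs, weights, biases, activation):
    """One layer: weighted sum plus bias, followed by the activation."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(x, params):
    """Propagate x through one hidden layer and one output unit."""
    hidden = dense(x, params["W1"], params["b1"], relu)
    output = dense(hidden, params["W2"], params["b2"], sigmoid)
    return output[0]


# Invented weights: four inputs, two hidden units, one output.
params = {
    "W1": [[0.5, -0.2, 0.0, 0.0], [0.1, 0.4, 0.0, 0.0]],
    "b1": [0.0, 0.0],
    "W2": [[1.0, 1.0]],
    "b2": [0.0],
}
y_hat = forward([1, 2, 0, 0], params)
print(y_hat)
```

The four inputs mirror a vector of per-group permission counts; the result is a score between 0 and 1 that can be thresholded to decide between clean and malicious.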

Results and Discussion
This experiment starts with extracting permissions from Android applications. This process is done using APKtools. The permissions are extracted for both normal and malicious applications. The end result is a dataset of all permissions labelled as normal or malicious.
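As an illustration of the extraction step, the sketch below reads the requested permissions from a decoded manifest. It assumes that the APK has already been decoded so that AndroidManifest.xml is available as plain-text XML; the manifest content and the package name `com.example.app` are invented, and the function name `extract_permissions` is our own.

```python
import xml.etree.ElementTree as ET

# XML namespace used for android:* attributes in the manifest.
ANDROID_NS = "http://schemas.android.com/apk/res/android"


def extract_permissions(manifest_xml):
    """Return the names listed in <uses-permission> elements."""
    root = ET.fromstring(manifest_xml)
    name_attr = "{%s}name" % ANDROID_NS
    return [
        element.get(name_attr)
        for element in root.iter("uses-permission")
        if element.get(name_attr)
    ]


# Invented manifest, for illustration only.
manifest = """<manifest xmlns:android="http://schemas.android.com/apk/res/android"
          package="com.example.app">
    <uses-permission android:name="android.permission.INTERNET"/>
    <uses-permission android:name="android.permission.READ_CONTACTS"/>
    <application android:label="Example"/>
</manifest>"""

permissions = extract_permissions(manifest)
print(permissions)
```

Running this over every decoded application, together with its clean or malicious label, yields the kind of labelled permission dataset described above.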
The next stage is to assign each permission to the normal or dangerous category. This reduces the processing time and, consequently, the detection time. This method is the contribution of this study.
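The categorization step can be sketched as a lookup followed by a count per group. The mapping below is a small illustrative subset, not the full table used in the study; a complete mapping would be taken from the Android documentation. How permissions outside the mapping are handled is not stated in the text, so ignoring them here is an assumption, and `com.example.permission.CUSTOM` is a hypothetical vendor permission.

```python
from collections import Counter

# The four protection levels named in the text.
LEVELS = ("normal", "dangerous", "signature", "signatureOrSystem")

# Illustrative subset of the permission -> protection level mapping.
PROTECTION_LEVEL = {
    "android.permission.INTERNET": "normal",
    "android.permission.ACCESS_NETWORK_STATE": "normal",
    "android.permission.READ_CONTACTS": "dangerous",
    "android.permission.SEND_SMS": "dangerous",
    "android.permission.ACCESS_FINE_LOCATION": "dangerous",
    "android.permission.BIND_ACCESSIBILITY_SERVICE": "signature",
}


def group_counts(requested):
    """Turn a list of requested permissions into one count per level."""
    counts = Counter(
        PROTECTION_LEVEL[p]
        for p in requested
        if p in PROTECTION_LEVEL  # assumption: unmapped permissions are ignored
    )
    return [counts[level] for level in LEVELS]


app = [
    "android.permission.INTERNET",
    "android.permission.READ_CONTACTS",
    "android.permission.SEND_SMS",
    "com.example.permission.CUSTOM",
]
features = group_counts(app)
print(features)  # [1, 2, 0, 0]
```

Each application is thus reduced to a fixed-length vector of four numbers, regardless of how many individual permissions it requests, which is what keeps the later processing fast.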
The data is divided into a training set and a test set. The training set is used to train the algorithm, and the test set is used to evaluate the trained model. Researchers commonly dedicate 70% of the data to training and 30% to testing, and we followed the same procedure. The training data is then fed to the deep learning algorithms. Table 2 shows the results of the experiment in the form of true positive rate (TPR), false positive rate (FPR), and loss. The LSTM is analogous to the human brain, in which decisions are based on data from earlier encounters.
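The 70/30 split and the two rates reported in the results can be sketched as follows. The labels and predictions are invented for illustration (1 denotes malware, 0 denotes clean), and the fixed shuffling seed is an assumption made only so that the example is reproducible.

```python
import random


def split_70_30(samples, seed=0):
    """Shuffle a copy of the samples and split it 70% / 30%."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10
    return shuffled[:cut], shuffled[cut:]


def tpr_fpr(y_true, y_pred):
    """True positive rate and false positive rate for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr


train, test = split_70_30(list(range(10)))

# Invented labels and predictions: 1 = malware, 0 = clean.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
tpr, fpr = tpr_fpr(y_true, y_pred)
print(len(train), len(test), tpr, fpr)  # 7 3 0.75 0.25
```

TPR is the fraction of malicious applications that are detected, and FPR is the fraction of clean applications that are wrongly flagged; a reported TPR of 91% therefore means that 91 out of every 100 malicious test applications were identified.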
To establish the contribution of this work, it is necessary to compare the results with the most closely related research works. Table 3 shows the comparison of recent works with this study.

Table 3, row for this study: detection rate 91%, dataset size 100,000 applications.

As can be seen from Table 3, this study achieved a high detection rate compared to other related works. Although the difference between this study and other related works looks small, it is necessary to consider dataset size as an important factor.
As the dataset size grows, the experimental results become more general and more promising, since the dataset includes more clean and malicious data. This work used 50,000 clean applications and 50,000 malicious applications. It therefore provides more general results compared to other related works.
On the other hand, this work focuses on the complexity of data analysis and detection. In order to analyze 100,000 applications and develop a detection method for them, we proposed categorizing Android permissions based on Google's suggestions. This improves the effectiveness of the detection method, as the results in Table 3 show.

Conclusion
In this work, we proposed an analysis method for Android applications based on permissions. This method uses the categorization suggested by Google to facilitate the analysis of Android application data. This makes it easier to analyze 100,000 applications, since we categorize their permissions as normal or dangerous. The dataset was gathered from the Drebin and AndroZoo datasets, and both clean and malicious applications are included. The experiment was performed using a deep learning scheme, in particular the LSTM algorithm, since it is capable of making decisions based on previous data, much like the human brain. The results show that the experiment achieved a true positive rate of 91%.