Target function location based on code coverage analysis

How to locate the target function faster and more accurately is a key problem in the automatic reverse engineering of software programs. To address this problem, a target function location method based on code coverage analysis is proposed. First, the method obtains the function call information of the program and calculates the suspicious rate of each function. A stability factor is then introduced to reduce the number of noise functions. Finally, the target function is located. The experimental results show that the proposed method has linear time complexity. For software programs with function calls on the order of millions, it can accurately locate the target function within several minutes, and both performance and accuracy are greatly improved compared with the contrast methods.


Introduction
In software reverse engineering, the first task is to determine the cut-in point of the reverse analysis. Usually, the function containing the cut-in point is regarded as the target function. Locating the target function faster and more accurately is an important guarantee for completing reverse analysis efficiently. In recent years, several automatic location methods have been developed, including difference set calculation, API sequence alignment, dependency analysis, and functional feature recognition.
Among methods based on difference set calculation, Renieris et al. [1] proposed the "nearest neighbor" model, which locates the target position by calculating the difference set of execution paths. Wang et al. [2] execute the program 1+N (N ∈ N+) times and find the target function by clustering the difference set. Huang et al. [3] filter redundant execution paths by clustering and improve the locating efficiency. However, this kind of method produces many noise functions.
Among methods based on API sequence alignment, Ye et al. [4] proposed a method of generating a harmonic sequence, which avoids the "blind area" caused by nearest neighbor path selection and reduces the space of the subsequent search. However, this kind of method has higher time complexity than difference set calculation.
Among methods based on dependency analysis, Chen et al. [5] proposed using a DG (Dependence Graph) to analyze the function call dependencies and data dependencies of a program, so that the programmer can search the DG to locate the feature code. Zhang et al. [6] proposed the CP (Capture Propagation) method, which builds the CFG (Control Flow Graph) and locates the target code by calculating the suspicious rate of every edge. Baah et al. [7] proposed the PPDG (Probabilistic Program Dependence Graph), which gathers statistics on the conditions between program procedures and entities and compares the differences between cases to locate the target module. Similar dependence-based methods [8], [9] include program slicing.
Among methods based on feature recognition, Eisenberg et al. [10] rank functions by their feature correlation over program execution traces to locate feature code; Xie et al. [11] use the working mechanism and encapsulation characteristics of the MFC (Microsoft Foundation Classes) framework to quickly locate the message response functions of MFC programs. In addition, there are many studies on feature-based encryption algorithm identification [12], [13]. This kind of method works well for identifying specific functions, but it is not universal.
To address the shortcomings of the above methods, we propose a code coverage analysis method to locate the target function. It brings the whole locating process down to linear time complexity and shortens the locating time. In addition, we propose a stability factor based on the modular design characteristics of programs. It reduces the interference of noise functions and improves the accuracy of the location range. Because our method depends only on the function call path, it is universal.

Principle
The location principle is to find, in the set of traces without the target function, the trace closest to the trace that contains the target function and to compare the two; ideally, the part in which they differ is the target function.
First, we trigger the target function while running the program and obtain a function call trace by using DBI (Dynamic Binary Instrumentation). Then, without triggering the target function, we run the program several times following the same operations and obtain the corresponding function call traces. We assume that every function may be the target function, and that a function appearing on the trigger path is more likely to be the target function. We count the coverage times of all function nodes in the CG (Call Graph) and calculate the suspect probability of each function node according to the designed suspicious rate formula. From the perspective of probability analysis, we then select the candidate list of the target function. According to the location principle, the target function must be in the candidate list, but the list also contains some noise functions. To filter them, the stability factor is designed to reduce the influence of noise functions on target function location. The whole locating process is shown in figure 1.
By analyzing the program execution information, we can obtain the control flow and data flow of the program. DBI is transparent to the monitored program and does not affect its results. We use DynamoRIO [14] to obtain the program information.
The nodes of the CG are described by $E = \{\, e(x, y, depth, od) \mid e \in E,\ depth, od \in Z \,\}$, where x is the set of predecessor nodes and y is the set of successor nodes. depth is the minimum number of nodes passed through from the initial node to e, and od is the out-degree of the node, that is, the size of the node set y.
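The node attributes depth and od can be illustrated with a short Python sketch. It assumes the call graph is given as a dictionary from caller to callees and interprets depth as the number of edges on the shortest path from the entry node; both the data layout and the function name `node_attributes` are illustrative assumptions, not part of the original method.

```python
from collections import deque

def node_attributes(call_graph, root):
    """Return {node: (depth, od)} for every node reachable from root.

    depth is the breadth-first distance from root (assumed interpretation);
    od is the number of distinct callees of the node.
    """
    depth = {root: 0}
    queue = deque([root])
    while queue:
        caller = queue.popleft()
        for callee in call_graph.get(caller, ()):
            if callee not in depth:  # first visit gives the minimum depth
                depth[callee] = depth[caller] + 1
                queue.append(callee)
    return {
        node: (depth[node], len(set(call_graph.get(node, ()))))
        for node in depth
    }

# Hypothetical call graph used only for illustration.
graph = {"main": ["f1", "f2", "f4"], "f2": ["f5"], "f1": [], "f4": [], "f5": []}
attrs = node_attributes(graph, "main")
```

Because breadth-first search visits nodes in order of distance, each node receives its minimum depth the first time it is reached.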

Fig. 2. A sample of CG.
The relationship between the source code and the CG is shown in figure 2. The main function calls f1 first, then calls f2 or f3 depending on the condition, and finally calls f4. However, this is only a static CG based on the source code. During actual execution, part of the code may not run, and the functions in that part will not appear in the CG. For example, if the condition is true, f3 is not executed, so there is no f3 node in the dynamic CG.
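The difference between the static and the dynamic CG can be reproduced with a small Python simulation of the sample program. The helper names `run_program` and `dynamic_call_graph` are hypothetical, and the recorded edges stand in for what a DBI tool would report.

```python
def run_program(condition):
    """Simulate the sample program and return the observed (caller, callee) edges."""
    edges = []
    edges.append(("main", "f1"))
    edges.append(("main", "f2" if condition else "f3"))  # only one branch runs
    edges.append(("main", "f4"))
    return edges

def dynamic_call_graph(edges):
    """Build {caller: set of callees} from the edges seen at runtime."""
    graph = {}
    for caller, callee in edges:
        graph.setdefault(caller, set()).add(callee)
        graph.setdefault(callee, set())  # make sure leaf nodes appear too
    return graph

static_nodes = {"main", "f1", "f2", "f3", "f4"}
dynamic = dynamic_call_graph(run_program(condition=True))
missing = static_nodes - set(dynamic)
print(missing)  # {'f3'}
```

With the condition set to true, f3 never runs, so it is the only static node absent from the dynamic graph.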
Define 3 FCP (Function Call Path): in the CG, each node is a function of the program. The state transfer of the program from the initial node to the end node forms a node sequence i, called an FCP and denoted $W_i = (e_0, e_1, \ldots, e_n)$.
To obtain the FCP of the program at runtime, the following three kinds of instructions need to be processed while the program is running:
(1) call instruction. Function identification needs to match each call instruction with its ret instruction. When the program meets a call instruction, only a push operation is performed: we record the value of the stack pointer register ESP and push it, together with the address of the next instruction, onto the shadow stack. If the target address is in system DLL space, set Flag = true.
(2) ret instruction. If the current value of ESP equals the ESP value on top of the shadow stack, the stack is balanced and the function is identified; record the function information. Otherwise, check whether the return address on the program's stack equals the target address of the ret instruction. If they are equal, a stack-imbalanced function is identified; if not, the ret instruction is judged to be an ordinary jump instruction. If the destination address is in system DLL space, set Flag = true.
(3) other instructions. When the first instruction of a trace is neither call nor ret and Flag = true, we first set Flag to false and check whether the return address on top of the shadow stack equals the address of the current instruction. If they are equal, an API call is identified. Otherwise, a callback function is identified, and the current ESP and the return address it points to are pushed onto the shadow stack.
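The call/ret matching described above can be sketched in Python on a synthetic trace. This is a simplified illustration under stated assumptions: the event format is hypothetical, the saved return address is compared with the ret target for the imbalance check, a ret on an empty shadow stack is treated as a jump, and the Flag handling for system DLL space, API calls, and callbacks is omitted.

```python
def classify_returns(trace):
    """Classify each ret event in a synthetic trace using a shadow stack.

    Hypothetical event format:
      ("call", esp, return_address)  -> pushed onto the shadow stack
      ("ret",  esp, target_address)  -> classified against the stack top
    """
    shadow = []
    results = []
    for kind, esp, addr in trace:
        if kind == "call":
            shadow.append((esp, addr))
        elif kind == "ret":
            if not shadow:
                results.append("jump")  # assumption: nothing to match against
                continue
            saved_esp, saved_return = shadow[-1]
            if esp == saved_esp:
                shadow.pop()
                results.append("balanced")    # stack balanced, function identified
            elif addr == saved_return:
                shadow.pop()
                results.append("imbalanced")  # stack-imbalanced function
            else:
                results.append("jump")        # ret used as an ordinary jump
    return results

trace = [
    ("call", 0x1000, 0x401005),
    ("call", 0x0FF0, 0x402010),
    ("ret", 0x0FE0, 0x777777),   # matches neither ESP nor return address
    ("ret", 0x0FF0, 0x402010),   # ESP matches the top entry
    ("ret", 0x0FF8, 0x401005),   # ESP differs, return address matches
]
results = classify_returns(trace)
```

Keeping the saved ESP next to the return address is what lets the sketch tell a normally returning function from one that leaves the stack unbalanced.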
The pseudocode of the CG generation algorithm is as follows.

Algorithm 1: FunctionCallGraphGeneration
Suppose that each function in the program P may be the target function. The degree of similarity between each function and the target function is then called the suspicious rate of the function, denoted S(f). S(f) ∈ [0,1], and the greater the value, the more likely the function f is the target function f*.
For the test case set T, we can obtain WT by DBI and count the total number of times each function e is called, denoted N(e). Let Np(e) = N(e | e ∈ WTP) be the total number of calls to e in the test cases that trigger the target function, and let Nf(e) = N(e | e ∈ WTN) be the total number of calls to e in the test cases that do not trigger it, so that N(e) = Np(e) + Nf(e).
For each function node e, there is a two-element feature e(Executed, WT), where Executed is of Boolean type and WT ∈ {WTP, WTN}. From it we obtain a tuple e' = (e1', e2', e3', e4'), whose components correspond to the following counts for the feature e(x, y):
if x = 1 and y = 1, aep is the number of test cases that trigger the target function and execute the function node e;
if x = 1 and y = 0, aef is the number of test cases that do not trigger the target function and execute the function node e;
if x = 0 and y = 1, anp is the number of test cases that trigger the target function and do not execute the function node e;
if x = 0 and y = 0, anf is the number of test cases that do not trigger the target function and do not execute the function node e.
The relationships between them are shown in table 1.
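The four counts can be computed directly from the traces, as the following Python sketch shows. Each trace is assumed to be a set of function names, and the Boolean list plays the role of the result set F. Because the suspicious rate formula itself is not reproduced in this text, the sketch substitutes the well-known Ochiai coefficient from spectrum-based fault localization as a stand-in; it is an assumption chosen only because it also yields values in [0, 1], and all function names here are hypothetical.

```python
from math import sqrt

def split_traces(traces, results):
    """Split traces into (WTP, WTN) using the Boolean results."""
    wtp = [t for t, r in zip(traces, results) if r]
    wtn = [t for t, r in zip(traces, results) if not r]
    return wtp, wtn

def coverage_counts(wtp, wtn):
    """Return {function: (aep, aef, anp, anf)} over all observed functions."""
    functions = set().union(*wtp, *wtn)
    counts = {}
    for f in functions:
        aep = sum(1 for t in wtp if f in t)  # executed, target triggered
        aef = sum(1 for t in wtn if f in t)  # executed, target not triggered
        counts[f] = (aep, aef, len(wtp) - aep, len(wtn) - aef)
    return counts

def ochiai(aep, aef, anp, anf):
    """Stand-in suspicious rate in [0, 1]; NOT the formula from the text."""
    denom = sqrt((aep + anp) * (aep + aef))
    return aep / denom if denom else 0.0

# Hypothetical traces: one run triggers the target, two runs do not.
traces = [
    {"main", "f1", "enc"},
    {"main", "f1"},
    {"main", "f1", "f2"},
]
results = [True, False, False]
wtp, wtn = split_traces(traces, results)
counts = coverage_counts(wtp, wtn)
scores = {f: ochiai(*c) for f, c in counts.items()}
```

In this toy data, the function that appears only in the triggering run receives the highest score, while functions shared by all runs score lower.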

The time complexity of the algorithm is O(k*n) + O(s*n). The sum of k and s is a constant, denoted m, so the final time complexity is m*O(n), which is linear.

Stability Factor
In practice, modular software design gives functional modules good independence: the division of functions is clear, and public modules are reused more frequently. Based on these characteristics, we propose the stability factor to describe the stability of software function modules and to find the target function in the candidate function list.
We expect a functional module to have little impact on other modules, which requires it to be strongly independent. Therefore, the functions in such a module should have a larger depth and a smaller od, like the leaves and nearby nodes deep in the call tree. Unlike other functions, they have relatively little influence on other functions during the whole execution of the software, which satisfies the independence of functional modules. On the other hand, the target function is usually not a leaf node, whereas public functions have a very high reuse rate. According to the call count, we can filter out this kind of noise function and reduce the size of the candidate function list. The stability factor is given as follows:

Functional testing
This section gives a detailed test on WinRAR 3.7 Beta. WinRAR provides a window interface and a command line interface. In this paper, we use the command line interface to locate the original encryption function in WinRAR.
First, we disable ASLR (Address Space Layout Randomization) and let the software perform 1 encryption operation and 3 compression operations on two *.txt files. The function call information of WinRAR is shown in table 2. After collecting the function coverage statistics, we calculate the suspicious rate S(f) of each function and set the threshold to 0.75; every function whose S(f) is greater than 0.75 enters the candidate list. A candidate list with 7 elements is obtained, as shown in table 3. We then use DBI to obtain the function call information and build the CG, calculate the stability factor of the 7 candidate functions, and sort them (here θ = -0.5). The results are shown in table 4. According to the sorted results, the encryption function is either 0x40b604 or 0x40d088. Further analysis with OllyDbg shows that sub_40b604 is responsible for importing the password for file encryption, and that sub_40d088, which runs after sub_40b604, is the target function that implements file encryption.
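The thresholding step can be sketched in a few lines of Python. The scores and function names below are hypothetical placeholders, not the measured WinRAR values, and `candidate_list` is an illustrative helper name.

```python
def candidate_list(scores, threshold=0.75):
    """Return (function, S) pairs with S strictly above threshold, highest first."""
    selected = [(f, s) for f, s in scores.items() if s > threshold]
    return sorted(selected, key=lambda pair: pair[1], reverse=True)

# Hypothetical suspicious rates used only for illustration.
hypothetical_scores = {"sub_a": 0.92, "sub_b": 0.75, "sub_c": 0.40, "sub_d": 0.81}
candidates = candidate_list(hypothetical_scores)
```

The comparison is strict, so a function whose suspicious rate equals the threshold exactly is left out of the candidate list.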
To verify the versatility and comprehensiveness of the method, we test more software of different types; the test list is shown in table 5. Table 6 shows that, in general, the larger the software, the larger the candidate list. When the interaction between the user and the software becomes more complex, the difference between WTP and WTN increases and the candidate list becomes larger. However, the stability factor can further filter the candidate list and reduce the number of noise functions. For example, the candidate list of FileZilla Client has 204 functions, but its target function ranks 27th by stability factor.
The effect of DBI on program running time is analyzed by recording, for each test case, the free running time and the running time under DynamoRIO; the results are shown in figure 3. In figure 3, the multiplier on the vertical axis represents how many times the running time is extended. In general, DBI extends the running time by less than a factor of 10, and the total running time is only a few minutes, which satisfies practical locating needs.
The experimental results and a comparison of our method with other locating methods are shown in figure 4. The upper picture shows our method, whose time overhead is linear. The other picture compares the three methods: both our method and difference set calculation run in linear time, but the latter produces many noise functions. The time complexity of API sequence alignment is O(n1*n2*…*ni), where i is the number of sequences. Dependency analysis has a higher time complexity because it needs to parse the dependencies between elements.

Conclusions
Experiments show that our method can determine the candidate list in linear time and that the stability factor can reduce the number of noise functions. However, the method has some limitations. The selection of test cases, that is, how to trigger the target function and how to avoid triggering it, is a problem worthy of further study. In addition, when the interaction between the user and the program is too complex, the candidate list becomes too large, and much work remains even after the stability factors are calculated. How to achieve faster and more precise target function location in view of these problems needs further research.

Define 1 Function Call f: $f: a \rightarrow b$ means that function a is the caller and function b is the callee; a and b are function names.
Define 2 CG (Call Graph): expressed as G<V,E>, where V is the node set of G and E is the edge set of G.

MATEC Web of Conferences 189, 04005 (2018) https://doi.org/10.1051/matecconf/201818904005 MEAMT 2018
Define 4 The set of FCPs: in the software testing process, there is a relationship between the program P and the test cases T, denoted WT. The program P = {f1, f2, ..., fn} is the program in which the target function is to be located, and the test case set T = {t1, t2, ..., tm} is input into P one by one. According to whether each test case triggers the target function, we get a result set F = {r1, r2, ..., rm}, where ri (i = 1, 2, ..., m) is of Boolean type. When ri is True, the target function is triggered; otherwise, it is not triggered. Correspondingly, WTP denotes the FCPs in which the target function is triggered, and WTN denotes those in which it is not.

if instruction is call then
    push ESP and address of the next instruction;
    if instruction's target address is in System Dll Space then
        set flag to true;
    end if
else if instruction is ret then
    pop ESP;
    if ESP match current instruction's ESP then
        recognize function and record its information; //we can use Graphviz to draw FCG, information including length and od
        if instruction's target address is in System Dll Space then

Table 1. Relationship between function node and test case.
EP - Number of test samples that execute f* when function node e is executed;
NP - Number of test samples that execute f* when function node e is not executed;
EF - Number of test samples that don't execute f* when function node e is executed;
NF - Number of test samples that don't execute f* when function node e is not executed.

According to S(e), a suspicious candidate function list can be generated. We sort the suspicious functions in the candidate list and finally find the target function. The pseudocode of the code coverage analysis algorithm is shown as follows.

Table 2. Operation information of WinRAR.

Table 3. Candidate list of WinRAR.

Table 4. Location result of WinRAR.

Table 5. Test list.
The coverage analysis method is used to locate the target function of each test case, and the experimental results are shown in table 6.

Table 6. Location results of test software programs.