An Approach to Source Code Plagiarism Detection Based on Abstract Implementation Structure Diagram

Source-code plagiarism detection concerns the identification of source-code files that contain similar or identical source-code fragments. Based on an analysis of the characteristics and defects of existing program code similarity detection systems, a method of source code similarity detection based on the Abstract Implementation Structure Diagram (AISD) is proposed. The source code is modelled and formatted into an abstract implementation structure diagram, and structural feature strings and variable reference relationship sequences are formed by extracting structural features and variable position features. The overall similarity is then calculated from the structural similarity and the variable similarity. The results demonstrate that the proposed AISD-based approach outperforms other approaches on the same source code datasets, and indicate that it is a promising, efficient and reliable approach to source-code plagiarism detection.


Introduction
Plagiarism of source code is a growing problem due to the growth of source-code repositories and digital documents available on the Internet. In computer science education, copying between students is widespread, which seriously hampers the development of students' abilities. In recent studies abroad, about 33% of students admitted to having plagiarized [1]. Plagiarism has seriously affected the quality of computer science education. To curb this poor academic practice, it has become increasingly necessary for researchers to study code plagiarism detection methods.
The Abstract Implementation Structure Diagram (AISD) [2], as an external representation of the implementation layer of the process blueprint [2], uses the logic control constructs and operational expressions of the programming language to accurately represent the control flow and data flow of a process, including the generation process. Liang [3] carried out a series of studies on duplicate code detection based on the process blueprint and proposed a detection method built on it, which avoids the complicated process of transforming source code into a suffix tree and thus reduces complexity. The static information on structure statements and variable positions in the program source code can be obtained directly by analyzing the node types of the abstract implementation structure diagram and the operation expressions that carry the data flow.

Related work
A lot of work on plagiarism detection has been done by researchers in this domain. In this section, we summarize the different types of approaches and tools for plagiarism detection that exist in the literature. The main categories of plagiarism detection approaches discussed here are attribute-based and structure-based.
As early as 1976, Ottenstein [4] used the basic Halstead metrics and identified only four key properties, $H(n_1, n_2, N_1, N_2)$. Faidhi et al. [5] introduced a minimum set of 23 different metrics. The indicators used include the average identifier length, the number of comment lines, the number of code blocks, the proportion of conditional statements, and more complex structural indicators such as McCabe's cyclomatic complexity. Measurement methods based on attribute counting do not consider the structural information of the program when abstracting the program code, and the false positive rate and false negative rate of the detection result cannot be reduced by increasing the vector dimension [6].
Structure-based approaches add the internal structure of the program to the analysis and comparison. The common methods are the token-based method and the abstract syntax tree (AST) based method. Plagiarism detection systems of this kind include JPlag [7], Sherlock [8], MOSS [9], Plaggie [10], XPlag [11] and PGDT [12]. Because the structural token string extracted by the token-based method is oversimplified and does not consider the statement information of the program, it cannot flexibly adapt to advanced code obfuscation methods such as expression splitting. Guo [13] constructed an AST with lex and yacc, calculated a hash value for each node by a bottom-up cumulative operation while traversing the AST nodes, and computed the similarity from node hash matching and the proportion of matching nodes. Resmi [14] used a modified grammar to construct an AST, generated a sequence of nodes by a preorder traversal of the AST, and then used the Needleman-Wunsch algorithm and the LCS algorithm to measure the similarity.
To summarize, the literature offers various tools and approaches for detecting plagiarism in source code, but they remain sensitive to code transformations such as those described above. Therefore, a more robust approach is necessary to handle these code transformations during plagiarism detection.

Detection process and algorithms
This section introduces a computational framework for analyzing source code in the context of source-code plagiarism detection. We implement our approach in four steps. In the first step, we preprocess the source program. In the second step, we construct structural feature strings by extracting structural features from the AISD. In the third step, we identify variables and extract variable position features: for each program statement, Algorithm 2 identifies and analyzes the variables in the statement and constructs a sequence of variable references. Finally, the structural similarity is calculated by the GST algorithm, the variable similarity is calculated by Algorithm 3, and the overall similarity is calculated from the structural similarity and the variable similarity. Figure 1 details this process. The following sections describe our approach in detail.

Source Code Preprocessing
The object from which program structure features are extracted is the abstract implementation structure diagram of the process blueprint: the program code is decomposed into a process blueprint, and its abstract implementation structure diagram view is taken. Since the program statement features contain the hierarchical relationship of program actions, they are represented by the positional relationship between the nodes in the abstract implementation structure diagram. A depth-first traversal expresses the hierarchical relationship of the program statements and preserves the complete meaning of the program, whereas a breadth-first traversal loses the nested structure of the program code. Therefore, the abstract implementation structure diagram is traversed depth-first, and the resulting sequence of nodes is represented with parentheses.
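The difference between the two traversals can be illustrated with a minimal Python sketch. The `Node` class is a hypothetical stand-in for an AISD node (the actual AISD data structure is defined in [2] and is not reproduced here), and the node labels are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for an AISD node: a label plus ordered children."""
    label: str
    children: List["Node"] = field(default_factory=list)


def to_parenthesized(node: Node) -> str:
    """Depth-first (preorder) traversal; nesting is kept as parentheses."""
    if not node.children:
        return node.label
    inner = " ".join(to_parenthesized(child) for child in node.children)
    return f"{node.label}({inner})"


def breadth_first_labels(root: Node) -> List[str]:
    """Level-order traversal, shown only for contrast: nesting is lost."""
    order, queue = [], [root]
    while queue:
        current = queue.pop(0)
        order.append(current.label)
        queue.extend(current.children)
    return order


# Illustrative tree: a method containing a loop (with a conditional and a call)
# followed by a return statement.
program = Node("method", [
    Node("for", [Node("if", [Node("assign")]), Node("call")]),
    Node("return"),
])

print(to_parenthesized(program))
print(breadth_first_labels(program))
```

The parenthesized string still shows that `assign` is nested inside `if`, which is nested inside `for`; the breadth-first list contains the same labels but no longer records which statement encloses which.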

Construction of structural feature strings
The program code is formatted into an abstract implementation structure diagram, whose nodes are traversed depth-first to extract structural features. The specific steps of converting a program to the AISD are detailed in [2] and are not elaborated in this paper. In an abstract implementation structure diagram whose implementation nodes carry statement expressions showing the data flow, statement expressions with the same variable types may use different variable names. This paper therefore formalizes the structural features and introduces a parameterization of statement elements when extracting the structural features of a program. It considers the characteristics of each statement, the information on the statement elements, and the nesting relationship between statements, so that identifiers do not influence the grammatical structure of a statement. The resulting structural feature string is the basis for comparing program structures.
Definition 1 (Structural Feature String). A structural feature string consists of the hierarchical relationship between program statements and the sequence of statement features.
Algorithm 1 describes the construction of a structural feature string from an abstract implementation structure diagram. The node sequence contains the hierarchical relationships, and each node is processed as follows. The control structure type, attribute, and serial number are extracted to obtain the control structure of the program (lines 3-6). The statement expression carrying the data flow of each node is then processed: each element of the statement expression is traversed, looked up in the variable dictionary, and mapped one-to-one to obtain the statement pattern (lines 7-9). The control structure type together with the sequence of statement element types in the statement pattern forms the node statement content, and the sequence formed by the parenthesis representation of the implementation nodes and the node statement content is the program structure feature string (line 11).
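The parameterization step can be sketched as follows. This is not Algorithm 1 itself but a simplified illustration of the idea that identifiers are replaced by their types. The tokenizer, the `statement_pattern` function, the `NUM` placeholder and the example variable dictionaries are all assumptions made for this sketch.

```python
import re
from typing import Dict

# Simplified tokenizer (an assumption): identifiers, integer literals,
# and any other single non-blank character.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def statement_pattern(expression: str, var_types: Dict[str, str]) -> str:
    """Replace each declared identifier by its type and each number by NUM."""
    parts = []
    for token in TOKEN.findall(expression):
        if token in var_types:
            parts.append(var_types[token])   # identifier -> its declared type
        elif token.isdigit():
            parts.append("NUM")              # literal value is irrelevant
        else:
            parts.append(token)              # operators and brackets are kept
    return " ".join(parts)


# Two statements that differ only in the identifiers chosen by the author.
original = statement_pattern("sum = sum + a[i]",
                             {"sum": "int", "a": "int[]", "i": "int"})
renamed = statement_pattern("total = total + v[k]",
                            {"total": "int", "v": "int[]", "k": "int"})

print(original)
print(original == renamed)
```

Because both statements map to the same pattern, renaming variables does not change the structural feature string built from such patterns.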

Variable reference sequence construction
This paper uses the depth-first search algorithm to recursively traverse each node of the AISD and collect the information of interest, such as the type, the number of sequence types, the tags, and the number of positions, and uses this information to construct the variable position set of the variable position features. From this set, the variable patterns that satisfy the constraints are obtained, and a variable reference sequence is constructed. An element of the variable feature set belongs to the variable pattern set if its type is a custom type; otherwise it does not. Within the variable pattern set, if the types of the elements are the same, the sequence formed by the position numbers of those elements is a variable reference sequence. Algorithm 2 describes the construction of a variable reference sequence from an abstract implementation structure diagram. It is mainly divided into three stages. The first stage records the positions of custom variables: the variable pattern is constructed from the elements of the variable position features that are variable identifiers (lines 2-6). The second stage is the scan-and-mark stage, which scans all elements of the variable position feature set. If the type of the j-th element of the variable pattern Str equals the type of an element of the variable position feature set and that element is not yet marked, the element is put into varloc and marked (lines 8-20). The third stage is the build stage, which builds the variable reference sequence varSeq (lines 17-18).
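A much simplified sketch of this idea is shown below. It does not reproduce the three stages of Algorithm 2; it only groups the positions of custom identifiers, under the assumption that each feature is a hypothetical `(kind, name, position)` triple and that `kind == "custom"` marks a user-defined identifier.

```python
from typing import Dict, List, Tuple

Feature = Tuple[str, str, int]  # (kind, name, position), a hypothetical layout


def variable_reference_sequences(features: List[Feature]) -> List[List[int]]:
    """Collect, for every custom identifier, the positions where it is referenced."""
    sequences: Dict[str, List[int]] = {}
    for kind, name, position in features:
        if kind != "custom":
            continue  # keywords, literals, etc. carry no variable reference
        sequences.setdefault(name, []).append(position)
    # Names are dropped on purpose: only the reference positions are kept.
    return list(sequences.values())


features = [
    ("custom", "sum", 1), ("keyword", "for", 2), ("custom", "i", 3),
    ("custom", "sum", 4), ("custom", "i", 5), ("literal", "0", 6),
]
print(variable_reference_sequences(features))
```

Since the output contains only positions, two programs that use different variable names but reference their variables at the same positions yield the same sequences.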

Similarity calculation
The overall similarity is calculated from the structural similarity and the variable similarity of the program code, as shown in (1):

$$sim(X, Y) = w_1 \cdot sim_{struct}(X, Y) + w_2 \cdot sim_{var}(X, Y) \tag{1}$$

where $w_1$ and $w_2$ are the weights of the two components in the comparison, $sim_{struct}(X, Y)$ is the structural similarity of the program files X and Y, and $sim_{var}(X, Y)$ is the variable similarity of the program files X and Y.
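As a small numerical illustration, the sketch below assumes that the overall similarity is a weighted sum whose weights add up to 1. The paper does not state the weight values; the values 0.375 and 0.625 used here are inferred only because they reproduce the example reported in the experiment section (structural similarity 55.48%, variable similarity 45%, overall similarity 48.93%).

```python
def overall_similarity(struct_sim: float, var_sim: float,
                       w1: float, w2: float) -> float:
    """Weighted combination of structural and variable similarity (assumed form)."""
    if abs(w1 + w2 - 1.0) > 1e-9:
        raise ValueError("this sketch assumes that the weights sum to 1")
    return w1 * struct_sim + w2 * var_sim


# Inferred weights, not stated in the paper: they reproduce the reported example.
example = overall_similarity(55.48, 45.0, w1=0.375, w2=0.625)
print(round(example, 2))
```

With other weights the same two component similarities would give a different overall value, so the choice of weights directly affects which pairs exceed the cut-off threshold.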
The Greedy String Tiling (GST) algorithm [15] is a greedy string matching algorithm that finds the largest common substrings of two strings by greedy search. The structural similarity is calculated as shown in (2):

$$sim_{struct}(X, Y) = \frac{2 \cdot \sum_{Match(x, y, length)} length}{|X| + |Y|} \tag{2}$$

where $|X|$ and $|Y|$ are the lengths of the structural feature strings of the program files X and Y, x and y are the starting positions of a match in the two structural feature strings, and $Match(x, y, length)$ denotes a common substring of the given length that starts at position x in X and at position y in Y.
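The following is a minimal, unoptimized sketch of greedy string tiling and of a similarity computed from its matches. It assumes that the similarity is twice the total matched length divided by the sum of the two string lengths, as in JPlag [7]; the parameter `min_match` (minimum match length) and the function names are choices made for this sketch, and the Karp-Rabin speed-up used in practice is omitted.

```python
from typing import List, Sequence, Tuple

Tile = Tuple[int, int, int]  # (start in a, start in b, length)


def greedy_string_tiling(a: Sequence, b: Sequence, min_match: int = 2) -> List[Tile]:
    """Repeatedly mark the longest unmarked common substrings as tiles."""
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    tiles: List[Tile] = []
    while True:
        max_match = min_match
        matches: List[Tile] = []
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k == max_match:
                    matches.append((i, j, k))
                elif k > max_match:
                    matches = [(i, j, k)]  # longer match found: restart the list
                    max_match = k
        if not matches:
            return tiles
        for i, j, k in matches:
            # Skip matches that overlap a tile placed earlier in this round.
            if any(marked_a[i:i + k]) or any(marked_b[j:j + k]):
                continue
            for offset in range(k):
                marked_a[i + offset] = True
                marked_b[j + offset] = True
            tiles.append((i, j, k))


def structural_similarity(a: Sequence, b: Sequence, min_match: int = 2) -> float:
    """Twice the tiled length over the total length (assumed form of the formula)."""
    if not a and not b:
        return 1.0
    tiles = greedy_string_tiling(a, b, min_match)
    return 2 * sum(length for _, _, length in tiles) / (len(a) + len(b))


print(greedy_string_tiling("abcdefg", "efgabcd"))
print(structural_similarity("abcdefg", "efgabcd"))
```

In the usage example the second string is the first one with two blocks swapped; both blocks are still found as tiles, which is why tiling-based comparison is robust against reordering of code blocks.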
A sequence is an array of elements arranged in a fixed order, and a string is likewise a sequence of individual characters. String matching tests whether single characters of one string are equal to single characters of another string; a variable reference sequence has the same properties and supports the same operations.
Definition 3 (Variable Similarity). Variable similarity is the ratio of twice the matching length of the variable reference sequences to the sum of the lengths of the two sequences. It is calculated as shown in (3):

$$sim_{var}(X, Y) = \frac{2 \cdot maxslen(X, Y)}{|X| + |Y|} \tag{3}$$

where $|X|$ and $|Y|$ are the lengths of the variable reference sequences of the program files X and Y, and $maxslen$ returns the maximum number of matching elements in the two sets of sequences.
Algorithm 3 describes the calculation of the variable similarity. It is mainly divided into three phases. The first phase is the construction phase: the variable position sequence is constructed from the node information of the data flow, and the set varSeq of variable reference relationship sequences of the custom identifiers in the program statements is obtained (lines 2-9). The second phase is the comparison phase, which checks whether the elements of the two sequence sets are equal (lines 12-16). The third phase is the calculation phase, which calculates the variable similarity from the number of identical elements sim (line 17).
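The comparison and calculation phases can be sketched as below. This is one possible reading, not Algorithm 3 itself: it assumes that the matching elements are identical reference sequences, counted as a multiset intersection, that the size of each program's set is its number of sequences, and that the result is normalized as twice the number of matches over the total number of sequences. The function names are hypothetical.

```python
from collections import Counter
from typing import List


def max_matching(seqs_x: List[List[int]], seqs_y: List[List[int]]) -> int:
    """Number of reference sequences that occur in both programs (with multiplicity)."""
    count_x = Counter(tuple(seq) for seq in seqs_x)
    count_y = Counter(tuple(seq) for seq in seqs_y)
    return sum((count_x & count_y).values())  # multiset intersection


def variable_similarity(seqs_x: List[List[int]], seqs_y: List[List[int]]) -> float:
    """Twice the number of matching sequences over the total number of sequences."""
    total = len(seqs_x) + len(seqs_y)
    if total == 0:
        return 1.0
    return 2 * max_matching(seqs_x, seqs_y) / total


# Made-up reference sequences for two programs; two of three sequences agree.
program_x = [[1, 4], [3, 5], [2]]
program_y = [[1, 4], [3, 6], [2]]
print(variable_similarity(program_x, program_y))
```

Two programs with identical reference sequences obtain a similarity of 1, and programs without any common sequence obtain 0, which matches the range of the structural similarity.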

Datasets and Metrics
The proposed AISD-based system was tested on two Java source-code datasets, which are described in subsection 4.1.1. The performance of the proposed system is evaluated against JPlag by means of the evaluation measures described in subsection 4.1.2.

Datasets
The evaluation collection consists of two Java source-code datasets, A and B. Basic information about the datasets is shown in Table 1.

Performance evaluation measures for plagiarism detection
Because the similarity values produced by different approaches are not directly comparable, this section describes the performance evaluation measures used to compare the proposed system with JPlag. Table 1 shows the characteristics of each dataset. To verify the validity of the detection results, the measurement methods proposed in [7] were combined with manual analysis. The evaluation is mostly based on the measures "precision" (P), "recall" (R) and "F-measure" (F), defined as follows.
Assume we have a set of n programs. This set allows $p = n \cdot (n - 1) / 2$ pairs to be formed. Assume further that g of these pairs are plagiarism pairs, i.e., one program was plagiarized from the other or both were (directly or indirectly) plagiarized from some common ancestor that is also part of the program set. Now assume that we let our fully automatic plagiarism detector run and it returns f pairs of programs flagged as plagiarism pairs. If t of these pairs are true plagiarism pairs and the others are not, then we define precision, recall and F-measure as

$$P = \frac{t}{f}, \qquad R = \frac{t}{g}, \qquad F = \frac{2 \cdot P \cdot R}{P + R}$$

Precision is the proportion of flagged pairs that are actual plagiarism pairs, and recall is the proportion of all plagiarism pairs that are actually flagged; both are reported as percentages. The F-measure summarizes the performance of the system by combining precision and recall.
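These measures can be computed with a few lines of Python. The sketch below uses made-up program names and pair sets purely for illustration; pairs are stored as `frozenset` so that the order of the two programs does not matter, and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, Set, Tuple

Pair = FrozenSet[str]


def all_pairs(programs: Iterable[str]) -> Set[Pair]:
    """All n * (n - 1) / 2 unordered pairs of programs."""
    return {frozenset(pair) for pair in combinations(programs, 2)}


def evaluate(flagged: Set[Pair], actual: Set[Pair]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) as fractions between 0 and 1."""
    t = len(flagged & actual)  # flagged pairs that are true plagiarism pairs
    f = len(flagged)           # all flagged pairs
    g = len(actual)            # all true plagiarism pairs
    precision = t / f if f else 0.0
    recall = t / g if g else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Made-up example: four programs, two true plagiarism pairs, three flagged pairs.
programs = ["a", "b", "c", "d"]
actual = {frozenset(("a", "b")), frozenset(("c", "d"))}
flagged = {frozenset(("a", "b")), frozenset(("a", "c")), frozenset(("b", "d"))}

print(len(all_pairs(programs)))
print(evaluate(flagged, actual))
```

In this example one of the three flagged pairs is correct and one of the two true pairs is found, so precision is 1/3, recall is 1/2 and the F-measure is 0.4.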

Experiment result
This section describes the results of the experiments performed on the two datasets. The test results are shown in Table 2 and Table 3. The retrieved pairs are those whose similarity, as computed by the plagiarism detection system, exceeds a defined cut-off threshold (CT). The CT is a manually set threshold that separates plagiarized program pairs from non-plagiarized ones: if the similarity value of two programs is larger than CT, the program pair is marked as suspect. We used CT values from 10% to 90%. The last row gives the average of the P, R and F values.
For the P and F values, the detection results of the AISD-based system are higher than those of the JPlag system. In particular, for dataset B, whose programs have a simple structure, the average precision of JPlag is 36.3%. To further explore the reasons for this, we had two students independently write a simple "Two Exchange" program. Although neither plagiarized from the other, the similarity detected by JPlag is 100%, which means that the pair is misjudged as plagiarism even if the CT value is set to 90. The root cause is that the structure and logic of this program are very simple and it defines few variables, so analyzing only the structural similarity inevitably leads to misjudgment. With the system presented in this paper, the structural similarity and variable similarity are 55.48% and 45%, respectively, and the overall similarity is 48.93%, so that an accurate result can be obtained by setting an appropriate CT value, such as CT = 50.
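The effect of the cut-off threshold on this example can be reproduced with a short sketch. The similarity values are those reported above for the two independently written programs; the function name and the dictionary layout are assumptions made for illustration.

```python
from typing import Dict, List


def flagged_systems(similarity_by_system: Dict[str, float], ct: float) -> List[str]:
    """Systems whose similarity (in percent) for the pair is larger than the cut-off CT."""
    return [name for name, sim in similarity_by_system.items() if sim > ct]


# Similarities reported in the text for the two independently written programs.
reported = {"JPlag": 100.0, "AISD-based": 48.93}

for ct in range(10, 100, 10):  # CT values from 10% to 90%
    print(ct, flagged_systems(reported, ct))
```

For every CT value in the tested range the JPlag similarity of 100% exceeds the threshold, whereas the overall similarity of 48.93% no longer exceeds it from CT = 50 upwards.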

Conclusion and future work
This paper proposes a method for detecting code similarity based on the abstract implementation structure diagram. The experimental results show that the method can effectively detect plagiarism between program pairs, avoiding the misjudgments caused by previous detection methods and improving precision. However, converting the source code into an abstract implementation structure diagram and quantifying the features are computationally expensive, and future work will study how to improve the efficiency of the method.