Multicriteria methods for identifying patterns in the analysis of the flow of "dangerous financial documents"

The article outlines a concept for applying a pattern identification methodology to the detection of documents that suggest the execution of criminal financial transactions. The method draws on analogies with diagnostic processes for disease classification in medicine. The idea of the described method consists in defining model patterns of financial documents that suggest criminal activity in the form of the financial flow, and in developing mathematical models of actual financial documents, which are compared with the patterns at a later stage of the process. The next step is to develop indicators of the similarity of documents to the appropriate patterns, to define and develop a multicriteria detection area for the documents, and to develop a method for dividing the set of monitored documents into classes of similar documents. The final stage is the development of multicriteria rankings that make it possible to order the set of transaction documents according to the degree of similarity to the relevant patterns and to determine the optimal cut-off threshold in the ranking of documents intended for a more detailed analysis. The described method may be used in counteracting financial crimes, in particular in combating money laundering.


Introduction
The detection of criminal financial transactions (financial crimes, supporting of terrorism, money laundering) is a difficult and complex issue. The total number of financial transactions in the national financial market reaches several million operations per day, and the diversity of these transactions is extensive. Financial transactions are executed in different transactional environments: physical environments, local networks, postal networks and global networks such as the Internet. To "fish out" criminal transactions is, therefore, a complex but important process, as it may lead to the quick and efficient identification of criminal groups and to the prediction of financial crimes, hence allowing their earlier detection and prevention.

Therefore, particular countries (including the whole European Union) introduce a number of regulations (including regulations equivalent to Acts of Parliament) imposing certain obligations (procedures) on financial institutions to ensure that efficient actions are undertaken to prevent such events [1]. In compliance with the provisions of the relevant Acts, the financial institutions shall, as part of applying financial security measures, undertake the following actions: "monitor economic relations on an ongoing basis, including the inspection of transactions executed during such relations to ensure that the transactions are in accordance with the knowledge of a given institution regarding the client, business profile and the risk, including, where possible, the sources of funds, and to ensure that the documents, data or information held by the institutions are being updated regularly" [1].

Due to the scope and complexity of the financial flow system and, above all, due to the special role of detecting criminal financial transactions, it has become necessary to use appropriately designed IT systems supporting such activities. Appropriately designed and implemented IT systems may turn out to be an efficient tool supporting the detection of criminal financial transactions. The main module of such systems is the subsystem for detecting patterns of transactional documents used in criminal financial activities, which includes the multicriteria module for the similarity analysis [2][3][4][5]. The automatically selected set of "suspicious financial documents" may be then

General concept of the method for identifying financial documents
The methods for identifying the financial flow of funds, as well as the parties and criminal groups involved in financial transactions, may come down to methods for identifying financial documents related to criminal activity. The identification and detection of such documents usually make it possible to quickly and clearly define the offending parties and the types of financial crimes. Therefore, this study is mainly devoted to the so-called computer methods for identifying financial documents. Such documents are identified on the basis of the degree of similarity of the monitored actual documents to specific patterns of financial documents suggesting criminal activity [1][2][6]. The method consists in:
1) defining model patterns of financial documents suggesting criminal activity in the form of the financial flow [2][3][4][7][8], i.e. the "dangerous documents";
2) developing mathematical models of actual financial documents, which are used for the "model comparisons" with the patterns at a later stage of the process [9];
3) developing similarity indicators for comparing the documents with the appropriate patterns [4][5][10];
4) defining and developing the so-called multicriteria detection area for the documents [2-3,11];
5) developing a method for dividing the set of monitored documents into clusters (classes) of similar documents (using the so-called Recurrent Pareto Filter (RPF) [3,12]);
6) developing multicriteria rankings that make it possible to order the set of transactional documents according to the degree of similarity to the appropriate patterns [6, 10, 12-13];
7) determining the optimum cut-off threshold in the ranking of the documents meant for a more detailed analysis [3][4][14][15].

Modeling of patterns of the selected transactional document classes
In its essence, the general concept of the methodology for identifying patterns of criminal financial documents is very similar to medical diagnostic procedures, and those procedures may therefore be successfully adapted [4][5][14]. They consist in identifying "disease patterns" on the basis of: a) external symptoms (manifestation), b) risk factors (circumstances), c) additional specialist studies.
In the case of procedures for identifying (detecting) criminal financial documents, a similar methodology is used. The disease patterns correspond to the appropriate classes of criminal financial documents. Such patterns must also be defined in terms of the characteristic attributes of a given class of documents: a) external similarity of the documents (external symptoms), i.e. the characteristics of the transaction, b) risk factors (circumstances of the financial transaction), c) additional specialist studies (arrangements of the expert).
Similarly, the "current transactional documents" monitored in order to find unusual documents correspond, in this case, to patients (in medical diagnosis). The diagnosis in the process of detecting dangerous documents consists in selecting a subset of the documents most similar, within their classes, to specific patterns of dangerous (criminal) documents [3][4][14].
Therefore, a typical model (pattern) of a dangerous document should contain three segments of characteristic information:
• a description of the characteristics (symptoms) typical for a given type of document [1,6,11,16],
• a description of the risk factors (circumstances) accompanying the generation and circulation procedure of such documents [1,6,17],
• a description of the types of potential additional specialist studies (forensic examination).
Formally, the mathematical model of a document of type $m \in M$ may be written as a triple of sets [3]:

$$\langle S_m, R_m, P_m \rangle, \quad m \in M,$$

where:

$S_m \subset S$ is the set of numbers of the characteristics (symptoms) of the $m \in M$ type pattern. $S$ is the set of numbers of all characteristics (i.e. symptoms), and $K(m)$ is the number of symptoms for the $m \in M$ type pattern [1].

$R_m \subset R$ is the set of numbers of the risk factors (circumstances) accompanying the generation of the $m \in M$ document [1]. $R$ is the set of numbers of all types of circumstances in which the dangerous documents included in the register are generated.

$P_m \subset P$ is the set of numbers of the types of specialist studies for the $m \in M$ document [1]. $P$ is the set of numbers of all types of studies (whose values may be determined during the specialist studies), and $N(m)$ is the number of all types of specialist studies concerning the $m \in M$ type document.

https://doi.org/10.1051/matecconf/201821004010 CSCC 2018
When identifying each type of dangerous document, particular features (symptoms), risk factors and results of the appropriate specialist studies carry different weight (different "characteristic gravity") [1, 5-6, 11, 16-18]. Therefore, the experts specify a number for each of them.

For a monitored document $x$, $w(x, s)$ denotes the rate of "occurrence" of the symptom $s \in S$, determined by the inspector (expert) on a $[0, 1]$ scale (often the rate takes binary values: 0 or 1). Similarly, $w(x, r)$ denotes the rate of occurrence of risk factor no. $r$ (also on a $[0, 1]$ scale).
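The pattern model and the occurrence rates described above can be held in simple data structures. The following Python sketch is illustrative only: the class names `DocumentPattern` and `MonitoredDocument`, their fields, the sample values, and the rule that a symptom or risk factor "occurs" when its rate is above zero are assumptions made for this example and are not taken from the article.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class DocumentPattern:
    """Hypothetical container for the pattern of dangerous document type m."""
    pattern_id: int
    symptoms: FrozenSet[int]       # S_m: numbers of characteristic symptoms
    risk_factors: FrozenSet[int]   # R_m: numbers of risk factors
    studies: FrozenSet[int]        # P_m: numbers of specialist studies

    @property
    def symptom_count(self) -> int:
        """K(m): number of symptoms of the pattern."""
        return len(self.symptoms)

    @property
    def study_count(self) -> int:
        """N(m): number of specialist studies of the pattern."""
        return len(self.studies)


@dataclass
class MonitoredDocument:
    """Hypothetical container for a monitored document x."""
    # w(x, s) and w(x, r): occurrence rates on a [0, 1] scale
    symptom_rates: Dict[int, float] = field(default_factory=dict)
    risk_rates: Dict[int, float] = field(default_factory=dict)

    def occurring_symptoms(self) -> FrozenSet[int]:
        """Symptoms with a non-zero rate (assumed rule, not from the article)."""
        return frozenset(s for s, w in self.symptom_rates.items() if w > 0)

    def occurring_risk_factors(self) -> FrozenSet[int]:
        """Risk factors with a non-zero rate (assumed rule, not from the article)."""
        return frozenset(r for r, w in self.risk_rates.items() if w > 0)


# Invented sample data
pattern = DocumentPattern(4, frozenset({1, 2, 5}), frozenset({3}), frozenset({7, 8}))
doc = MonitoredDocument(symptom_rates={1: 1.0, 2: 0.0, 5: 0.4}, risk_rates={3: 1.0})
print(pattern.symptom_count, sorted(doc.occurring_symptoms()))
```

Keeping the pattern immutable (`frozen=True`) reflects the idea that patterns are reference models, while monitored documents are filled in by the inspector.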

Multicriteria model of the process for detecting patterns of "dangerous" documents
First, the set of documents suggested by the set of occurring symptoms is defined. Similarly, the set of documents related to the occurring risk factors is defined.

The next step is to determine the total set of dangerous documents. This set may constitute a preliminary estimation. However, such an approach to the initial identification of documents is quite risky, because risk factors or symptoms may occur simultaneously for several types of dangerous documents and are difficult to define precisely. With the data on the document $x \in X$ regarding the occurrence of the risk symptoms and factors in the form of the numbers $w(x, s)$ and $w(x, r)$, it is possible to determine the "distance of document $x$" from the appropriate patterns of dangerous documents included in $M_o$. It may be done in the following way.
The model of the document $x \in X$, defined on the basis of the occurring risk symptoms and factors, takes the form of a pair. The set $M(S_o(x))$ of "the most probable" patterns "matching" the symptoms is established first. On the other hand, the set $M(R_o(x))$ of "the most probable" patterns in terms of the occurring risk factors is established. The common part (intersection) of such sets is often an empty set [7].

Below is an algorithm for the preliminary determination of similarity, based on the idea of multidimensional similarity described in the previous section of this study. The similarity indices are defined as properly understood distances of the document from the patterns of dangerous documents. They constitute a certain modification of the Jaccard distance (similarity).

The computer assistance process is executed on the basis of software algorithms for diagnostic conclusion. The basis for the construction of such algorithms are document models and models (patterns) of dangerous documents. The result of the implemented algorithm is a suggestion (proposal) of subsequent detection activities (if necessary). The general idea of the supporting mechanism, depending on the adopted modeling concept (e.g. Bayesian networks [12,19], fuzzy sets [3][20][21], rough sets, cobweb models or the concept of patterns [4]), consists in selecting the list of the most probable identified documents and then choosing the optimum set of additional specialist studies.

The theory of multicriteria optimization and relational structures [7,22] is an interesting proposal for identifying the sets of patterns that are most probable from the point of view of the set of occurring risk symptoms and factors. When an appropriate model of "detection preferences" $R$ is established, such a task may be defined by means of:
$d(m) = (d_1(m), d_2(m))$: a vector function measuring the distance (similarity) of the document from the pattern of dangerous document no. $m$; the distances for document $x$ are defined as in [3];
$R$: a model of detection preferences (a similarity relation, e.g. Pareto [7]).

In practice, the following three options of the detection preferences are taken into account:
1) risk symptoms and risk factors are equally important (Pareto relation),
2) risk symptoms are more important (hierarchical relationship),
3) risk factors are more important (hierarchical relationship).
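The role of the distance vector and of the three preference options can be illustrated in Python. This is a minimal sketch under stated assumptions: the plain Jaccard distance stands in for the article's modified index (whose formula is not reproduced here), the hierarchical options are read as lexicographic ordering, and all function names and sample data are hypothetical.

```python
from typing import Dict, Set, Tuple

Vector = Tuple[float, float]


def jaccard_distance(a: Set[int], b: Set[int]) -> float:
    """Plain Jaccard distance 1 - |a & b| / |a | b| (0.0 for two empty sets)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance_vector(doc_symptoms: Set[int], doc_risks: Set[int],
                    pattern: Tuple[Set[int], Set[int]]) -> Vector:
    """(d1, d2): distance on symptoms and distance on risk factors."""
    pattern_symptoms, pattern_risks = pattern
    return (jaccard_distance(doc_symptoms, pattern_symptoms),
            jaccard_distance(doc_risks, pattern_risks))


def dominates(u: Vector, v: Vector) -> bool:
    """u is at least as close as v on every criterion and closer on one."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))


def pareto_set(distances: Dict[int, Vector]) -> Set[int]:
    """Patterns for which no other pattern is 'more similar' (non-dominated)."""
    return {m for m, d in distances.items()
            if not any(dominates(other, d)
                       for k, other in distances.items() if k != m)}


# Invented patterns: pattern number -> (symptom numbers, risk factor numbers)
patterns = {
    1: ({1, 2, 3}, {10, 11}),
    2: ({1, 2}, {12}),
    3: ({4, 5}, {10, 11}),
    4: ({6}, {13}),
}
doc_symptoms, doc_risks = {1, 2}, {10, 11}
distances = {m: distance_vector(doc_symptoms, doc_risks, p)
             for m, p in patterns.items()}

# Option 1: both criteria equally important (Pareto relation)
equal_importance = pareto_set(distances)
# Option 2: symptoms more important (lexicographic order d1, then d2)
symptoms_first = min(distances, key=lambda m: distances[m])
# Option 3: risk factors more important (lexicographic order d2, then d1)
risks_first = min(distances, key=lambda m: distances[m][::-1])
print(equal_importance, symptoms_first, risks_first)
```

In this invented example the Pareto relation keeps two candidate patterns, while each hierarchical option singles out a different one, which shows why the choice of the preference model matters.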
In the case of two criteria and a relatively small set $M_o$, the above task may be easily illustrated in graphic form (Fig. 1).
The image of the set of patterns $M_o$ in terms of distance from document $x$ is the set $Y = \{ d(m) : m \in M_o \}$ (Fig. 1). Therefore, the solution to the task is the so-called Pareto set [7], i.e. the set of patterns from the set of initial estimation $M_o$ with respect to which no "more similar" objects can be found. This set is denoted by $M^R_N$. In this case, the so-called "compromise solution" [22], which usually leads to an unambiguous solution, may be the final determinative factor. Along with the "most probable" pattern (in terms of the ascertained risk symptoms and factors), it is possible to create a ranking of potential patterns for further detection activities.
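One common reading of a "compromise solution" is the non-dominated point closest to the ideal point, i.e. the vector of the smallest distance achieved on each criterion. The Python sketch below ranks patterns in that way. It is an assumption made for illustration: the article does not state which compromise rule from [22] it uses, and the Euclidean norm, the function name and the sample distance vectors are all hypothetical.

```python
import math
from typing import Dict, List, Tuple


def compromise_ranking(distances: Dict[int, Tuple[float, ...]]) -> List[int]:
    """Order patterns by Euclidean distance of d(m) from the ideal point."""
    dimension = len(next(iter(distances.values())))
    # Ideal point: best (smallest) value of each criterion taken separately
    ideal = tuple(min(d[i] for d in distances.values()) for i in range(dimension))
    return sorted(distances, key=lambda m: math.dist(distances[m], ideal))


# Invented distance vectors (d1, d2) of four mutually non-dominated patterns
pareto_distances = {
    1: (0.2, 0.6),
    2: (0.5, 0.3),
    4: (0.3, 0.35),
    7: (0.8, 0.1),
}
ranking = compromise_ranking(pareto_distances)
print(ranking)  # the first element is the compromise pattern
```

The full ordering, not only its first element, is returned so that an expert can inspect the remaining candidates in sequence.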
The closest "most probable pattern", as resulting from the ascertained risk symptoms and factors, is the pattern of dangerous document type no. 4. However, in practice, for the expert to be able to make the final decision, the "whole Pareto set" and the ranking of its elements are necessary.

An important aspect of the modeling process is the choice of the forms of the similarity functions (the distances $d_1$ and $d_2$) and the decision on whether to accept the appropriate preference model $R$. The specific mathematical formulas defining the so-called "distance functions" are based on the adopted modeling concepts [2][3][19]. For example, in the case of models based on Bayesian networks, such concepts take the form of appropriate distributions of conditional probabilities. In the case of models based on the theory of fuzzy sets [20][21][23][24], the concepts refer to the membership functions of the set of initially identified documents, and in the case of models based on patterns, to the appropriately defined metrics in the so-called detection area [2][3]. In special cases, the models of detection preferences do not have to be based on Pareto relations or lexicographic ordering. The relations may be based on the pessimist (optimist) model or on the so-called "collective preference relations".

This article was aimed at presenting the model of initial detection in such a manner as to enable the use of the broad and efficient set of possibilities offered by the theory of multicriteria optimization in further studies. The procedure outlined in the article may be considered the preliminary detection process initiating each identification process of dangerous documents [25][26][27]. The procedure leads to the generation of a (relatively small) set of patterns with respect to which there are no more probable documents. The next step of the detection process is to choose, if necessary, the "optimum set" of additional specialist studies that make it possible to reach a final decision on identifying the dangerous document, and then to choose the optimum strategy for further operational (preventive) activities.

Summary
The above-described methodology for filtering the set of actual documents (the application of the so-called Pareto filter [3,6]) makes it possible to obtain a detection area in which documents are compared with particular patterns. The result of such a procedure is the so-called "Pareto front" [2][3][7], which constitutes the subset of identified patterns of criminal documents with respect to which there are no documents more similar (in terms of the Pareto relation) to the analyzed document. In practice, it is usually quite an extensive set. In such a case, it seems reasonable to use a methodology for filtering transactional documents based on an appropriately designed multicriteria ranking algorithm [3,6,14]. The final subset of the "most suspicious" documents may be selected using the so-called optimum cut-off threshold [3][5][6]. The multicriteria ranking methods outlined in the article [2][3][6][7] may be successfully used, upon modification, for the monitored filtering of the set of transactional documents in terms of their similarity to the applied patterns of dangerous documents.
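The idea of filtering followed by ranking and a cut-off can be sketched in Python as repeated extraction of Pareto fronts. This is only one possible interpretation of a recurrent Pareto filter: the article does not give the algorithm, so the layer-peeling scheme, the cut-off expressed as a number of layers, the function names and the sample distance vectors are all assumptions.

```python
from typing import Dict, List, Tuple

Vector = Tuple[float, ...]


def dominates(u: Vector, v: Vector) -> bool:
    """u is at least as close as v on every criterion and closer on one."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))


def recurrent_pareto_layers(distances: Dict[str, Vector]) -> List[List[str]]:
    """Peel off successive Pareto fronts; earlier layers are 'more similar'."""
    remaining = dict(distances)
    layers: List[List[str]] = []
    while remaining:
        front = [k for k, d in remaining.items()
                 if not any(dominates(o, d)
                            for j, o in remaining.items() if j != k)]
        layers.append(sorted(front))
        for k in front:
            del remaining[k]
    return layers


def select_suspicious(distances: Dict[str, Vector], max_layers: int) -> List[str]:
    """Keep documents from the first max_layers fronts (assumed cut-off rule)."""
    layers = recurrent_pareto_layers(distances)
    return [doc for layer in layers[:max_layers] for doc in layer]


# Invented distance vectors of five documents from the applied patterns
documents = {
    "A": (0.1, 0.2),
    "B": (0.2, 0.1),
    "C": (0.3, 0.3),
    "D": (0.15, 0.6),
    "E": (0.9, 0.9),
}
layers = recurrent_pareto_layers(documents)
selected = select_suspicious(documents, max_layers=2)
print(layers, selected)
```

A layer-based cut-off keeps mutually incomparable documents together; a threshold on a scalar ranking score would be an alternative design.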
of risk factors for the $m \in M$ type document.
$P_m$: a set of numbers of the types of specialist studies
mean the "priority" of particular parameters concerning the characteristics, risk factors and additional studies in the field of detection of documents of type $m \in M$.
Let us assume that, as a result of the preliminary stage, the set of symptoms $S_o(x) \subset S$ and the set of risk factors were identified in document $x$.

*M
symbols are used to designate patterns no. $m$, respectively, in terms of the risk symptoms and factors. The symbol $d_1(m)$ is used to designate the distance (similarity) of document $x$ (as resulting from the occurring symptoms) from the pattern $m \in M$, defined on the basis of the symptoms, and $d_2(m)$ marks the distance of document $x$ (as resulting from the occurring risk factors) from the pattern of a dangerous document of type $m \in M$, defined on the basis of risk factors.

Fig. 1. The concept of determining the set $M^R_N$ of patterns of dangerous documents with respect to which there are no more similar documents [6].

Fig. 2. Typical image of the set of patterns of the documents with respect to which there are no "more similar" documents, using a computer example.