Human action recognition based on mixed Gaussian hidden Markov model

Human action recognition has been a challenging research field in recent years, and many traditional signal processing and machine learning methods are gradually being applied to it. This paper uses a hidden Markov model based on mixed Gaussian distributions to address the problem of human action recognition. The model treats the observed human actions as samples drawn from Gaussian mixture models, where each Gaussian mixture model is determined by a state variable. The model is trained by estimating its parameters with the expectation maximization algorithm. The simulation results show that the hidden Markov model based on the mixed Gaussian distribution performs well in human action recognition.


Introduction
With the development of sensors, the data available for human action recognition have become more abundant [1], and the field is attracting growing attention. Many methods have been applied to the problem, such as random forest [1], variants of random forest [2], graph convolutional neural networks [3], Deep Progressive Reinforcement Learning [4], and Directed Graph Neural Networks [5]. Most of these methods construct action features and use discriminative methods to complete the recognition task. The hidden Markov model instead establishes the joint probability distribution of the hidden variables and the observation variables, and introduces a mixed Gaussian distribution to approximate the real distribution of human movements, so it can better express the distribution of the data themselves. It can therefore perform well in human action recognition.

Mixed Gaussian hidden Markov model
The hidden Markov model is a kind of dynamic Bayesian network used to process time series data. The model consists of the initial state probability vector π, the state transition probability matrix A={aij} (i, j=1,…,m) and the observation probability matrix B={bjk} (j=1,…,m; k=1,…,n), where m is the number of state variables Z={z1,…,zm} in the model. When the hidden Markov model is based on a Gaussian mixture distribution, n is the number of Gaussian distributions in the Gaussian mixture model determined by the state variable zj (j=1,…,m). Training the model means determining the values of π, A, and B. In view of the complexity of human actions, this paper applies the Gaussian mixture model to approximate the real distribution of human action data. The Gaussian mixture model is given by formula (1).
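For reference, the state-conditional Gaussian mixture density is commonly written in the following form. This is a sketch in the notation of this section, assuming that bjk is the weight of the k-th Gaussian component of state zj; the mean μjk and covariance Σjk are symbols introduced here for illustration.

```latex
p(x \mid z_j) = \sum_{k=1}^{n} b_{jk}\, \mathcal{N}\!\left(x \mid \mu_{jk}, \Sigma_{jk}\right),
\qquad b_{jk} \ge 0, \quad \sum_{k=1}^{n} b_{jk} = 1 .
```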
The Gaussian mixture distribution is composed of n Gaussian distributions. The hidden Markov model assumes that human actions are controlled by a set of unobserved hidden variables, namely the state variables Z. In practice the state variables may correspond to a set of postures, or to other unobservable factors. π and A determine the state time series Z={z1,…,zt}, and B determines the observation time series X={x1,…,xt}. The observation sequence is generated as follows. The initial state is drawn from the initial state probability vector π. At a given time t, the state variable zt selects a specific Gaussian mixture model, and the observation sample xt is generated from the selected Gaussian mixture distribution. At time t+1, zt and the state transition probability matrix A first determine zt+1, and xt+1 is then generated from the Gaussian mixture distribution of zt+1. The hidden Markov model thus explains the generation of observation sequences as random sampling controlled by a hidden Markov chain.
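The generation process described above can be sketched in a few lines of Python. This is a minimal illustration, assuming 1-dimensional observations and hand-set toy parameters; the function name sample_gmm_hmm and all parameter values are hypothetical and are not taken from the paper's experiments.

```python
import random

def sample_gmm_hmm(pi, A, weights, means, stds, T, seed=0):
    """Draw a state path and an observation sequence of length T.

    pi[j]         : initial probability of state j
    A[i][j]       : probability of moving from state i to state j
    weights[j][k] : weight of mixture component k in state j
    means[j][k], stds[j][k] : parameters of that 1-D Gaussian component
    """
    rng = random.Random(seed)
    m = len(pi)
    states, observations = [], []
    for t in range(T):
        if t == 0:
            # The initial state is drawn from pi.
            z = rng.choices(range(m), weights=pi)[0]
        else:
            # Later states depend only on the previous state and A.
            z = rng.choices(range(m), weights=A[states[-1]])[0]
        # The state selects a mixture; pick a component, then sample from it.
        k = rng.choices(range(len(weights[z])), weights=weights[z])[0]
        x = rng.gauss(means[z][k], stds[z][k])
        states.append(z)
        observations.append(x)
    return states, observations

# Toy parameters (illustrative only): 2 states, 2 components per state.
pi = [0.6, 0.4]
A = [[0.9, 0.1], [0.2, 0.8]]
weights = [[0.5, 0.5], [0.3, 0.7]]
means = [[-2.0, 0.0], [3.0, 5.0]]
stds = [[0.5, 0.5], [0.5, 0.5]]
states, observations = sample_gmm_hmm(pi, A, weights, means, stds, T=10)
```

In the setting of this paper each observation would be a 60-dimensional vector, so the 1-D Gaussian draw would be replaced by a draw from a multivariate Gaussian.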

Expectation maximization algorithm
To estimate the model parameters, this paper uses the expectation maximization (EM) algorithm. The EM algorithm is an iterative update algorithm in which each iteration is divided into an E step and an M step. The E step computes the expectation of the complete-data log-likelihood function with respect to the conditional probability distribution of the hidden variable Z, as in formula (2). The M step finds the parameter value θ that maximizes the expectation computed in the E step, as in formula (3).
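The two steps are commonly written in the following standard form. This is a sketch consistent with the description above rather than a reproduction of the paper's typesetting; the symbol Q for the expected complete-data log-likelihood is an assumption introduced here.

```latex
\begin{align}
Q\!\left(\theta, \theta^{(i)}\right)
  &= \mathbb{E}_{Z \mid X, \theta^{(i)}}\!\left[\log P(X, Z \mid \theta)\right]
   = \sum_{Z} P\!\left(Z \mid X, \theta^{(i)}\right) \log P(X, Z \mid \theta) \\
\theta^{(i+1)}
  &= \arg\max_{\theta}\; Q\!\left(\theta, \theta^{(i)}\right)
\end{align}
```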
θ^(i) represents the parameter value obtained in the i-th iteration in equations (2) and (3).
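To make the E step and M step concrete, the following sketch runs EM on a plain 1-dimensional Gaussian mixture. It is a simplified illustration and not the full training procedure for the hidden Markov model, where the E step additionally requires the forward-backward recursions over the state sequence. The function names, the toy data, and the variance floor are assumptions made for this example.

```python
import math

def gauss_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gmm_log_likelihood(data, weights, means, variances):
    """Log-likelihood of the data under the mixture."""
    return sum(math.log(sum(w * gauss_pdf(x, mu, v)
                            for w, mu, v in zip(weights, means, variances)))
               for x in data)

def em_step(data, weights, means, variances):
    """One EM iteration for a 1-D Gaussian mixture."""
    n = len(weights)
    # E step: posterior probability (responsibility) of each component
    # for each sample, under the current parameters.
    resp = []
    for x in data:
        joint = [weights[k] * gauss_pdf(x, means[k], variances[k]) for k in range(n)]
        total = sum(joint)
        resp.append([p / total for p in joint])
    # M step: closed-form parameters that maximize the expected
    # complete-data log-likelihood.
    new_w, new_mu, new_var = [], [], []
    for k in range(n):
        nk = sum(r[k] for r in resp)
        mu = sum(r[k] * x for r, x in zip(resp, data)) / nk
        var = sum(r[k] * (x - mu) ** 2 for r, x in zip(resp, data)) / nk
        new_w.append(nk / len(data))
        new_mu.append(mu)
        new_var.append(max(var, 1e-6))  # assumed floor to keep the variance positive
    return new_w, new_mu, new_var

# Toy data with two well-separated clusters (illustrative only).
data = [-2.1, -1.9, -2.0, -1.8, 2.0, 2.2, 1.9, 2.1]
params = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
history = [gmm_log_likelihood(data, *params)]
for _ in range(20):
    params = em_step(data, *params)
    history.append(gmm_log_likelihood(data, *params))
```

The list history records the log-likelihood after each iteration, which should not decrease from one iteration to the next.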

Simulation of the algorithm
The dataset used in this simulation is the MSRAction3D dataset [6], which consists of 3-dimensional skeleton data of human actions. The posture of an action at a given moment is composed of 20 joint points, and each joint point is described by 3-dimensional space coordinates, so the sample at a given moment of a human motion is a 60-dimensional vector. The visualization of the skeleton data is shown in Figure 1, where the human skeleton can be seen to consist of 20 joint points. Using skeleton sequences to express human actions greatly reduces the data cost. This paper uses 8 actions as the training set, and the corresponding actions performed by the participants not in the training set as the test set. Actions are identified by numeric labels, and the selected action labels are 2, 3, 5, 6, 10, 13, 18, and 20. For example, the action labeled 2, shown in Figure 2, is "hand wave".
Because the number of hidden states and the number of Gaussian mixture components per state must be specified in advance in the mixed Gaussian hidden Markov model, this paper sets the number of states to 5 and the number of mixture components to 3, and uses the same values when training the models of all actions. The simulation trains one model for each type of action, and then evaluates the likelihood of each test action under every model. The test action is recognized as the action corresponding to the model with the highest likelihood. Comparing the recognized label of each test action with its real label gives the recognition accuracy of each action and the overall accuracy. Table 1 shows the recognition accuracy of each action, and Table 2 shows the recognition accuracy of human actions by different methods.
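The recognition rule described above, scoring a test sequence under each action model and choosing the model with the highest likelihood, can be sketched as follows. This is a minimal illustration, assuming 1-dimensional observations instead of the 60-dimensional skeleton vectors, strictly positive probabilities, and hand-set toy parameters rather than models trained on MSRAction3D. The function names and the labels action_a and action_b are hypothetical.

```python
import math

def log_gauss(x, mean, std):
    """Log density of a 1-D Gaussian."""
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    if top == -math.inf:
        return -math.inf
    return top + math.log(sum(math.exp(v - top) for v in values))

def log_emission(x, weights, means, stds):
    """Log density of x under one state's Gaussian mixture."""
    return logsumexp([math.log(w) + log_gauss(x, mu, sd)
                      for w, mu, sd in zip(weights, means, stds)])

def log_likelihood(obs, model):
    """Forward algorithm in log space: log P(obs | model)."""
    pi, A, W, M, S = model
    m = len(pi)
    # Initialization with the initial state probabilities.
    alpha = [math.log(pi[j]) + log_emission(obs[0], W[j], M[j], S[j])
             for j in range(m)]
    # Recursion over the remaining observations.
    for x in obs[1:]:
        alpha = [logsumexp([alpha[i] + math.log(A[i][j]) for i in range(m)])
                 + log_emission(x, W[j], M[j], S[j]) for j in range(m)]
    return logsumexp(alpha)

def classify(obs, models):
    """Return the label whose model gives the highest likelihood."""
    return max(models, key=lambda label: log_likelihood(obs, models[label]))

# Two hypothetical, hand-set models with 2 states and 2 components per state.
shared = ([0.5, 0.5], [[0.8, 0.2], [0.3, 0.7]], [[0.6, 0.4], [0.5, 0.5]])
model_a = shared + ([[0.0, 1.0], [2.0, 3.0]], [[1.0, 1.0], [1.0, 1.0]])
model_b = shared + ([[10.0, 11.0], [12.0, 13.0]], [[1.0, 1.0], [1.0, 1.0]])
models = {"action_a": model_a, "action_b": model_b}
predicted = classify([0.2, 1.1, 2.4, 2.9], models)
```

In the experiment described here there would be 8 such models, one per selected action label, each with 5 states and 3 mixture components.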
The main reason for the large differences in recognition accuracy between actions is that some actions have few features that distinguish them from other actions. For example, in the simulation action 5 is easily judged as action 2, 3, or 18, because it has few unique features. One way to improve accuracy is to extract new and more detailed features from the action, for example by treating the skeleton as a directed graph to model the relationship between points and edges [5], that is, between joint points and limbs. Such features may be more conducive to correctly judging the type of action. For the overall recognition accuracy, Table 2 lists the accuracy of the three methods and shows that the model performs well on human action recognition.

Conclusion
The mixed Gaussian hidden Markov model performs well in human action recognition. Applying the method requires the number of states and the number of Gaussian mixture components to be specified in advance. In future work, non-parametric methods can be introduced to determine the numbers of states and Gaussian mixture components suitable for the model, in order to improve the recognition accuracy.