Construction and Application of Indoor Video Surveillance System Based on Human Activity Recognition

With the growth of building monitoring networks, increasing human resources and funds have been invested in building monitoring systems. Computer vision technology has recently been widely used in image recognition, and it has gradually been applied to action recognition as well, yet traditional monitoring systems still have many disadvantages. In this paper, a human activity recognition system based on a convolution neural network is proposed. The human activity recognition engine is constructed using a 3D convolution neural network and transfer learning technology. The Spring MVC framework is used to build the server side, and the system pages are designed in HBuilder. The system not only enhances the efficiency and functionality of the building monitoring system, but also improves the level of building safety.


Introduction
With the growth of monitoring networks and the development of computer image processing technology, intelligent monitoring systems that include a human activity recognition function are gradually maturing, and such recognition can be applied in every aspect of a monitoring system. This way of monitoring aims to ensure safety in buildings. Nowadays, most buildings are equipped with video surveillance cameras in rooms and corridors, and making use of the monitoring system in the building is of great significance for preventing violence. Combined with human activity recognition, an alarm will ring out when abnormal actions such as fighting and falling down are detected. Because the monitoring network grows daily, it is increasingly difficult for human observers to carry out the observation task. Moreover, the use of human activity recognition technology to monitor abnormal actions will visibly improve the security level. This paper proposes a solution for the construction and application of an indoor intelligent monitoring system, with the help of convolution neural networks, human activity recognition, Spring MVC and other technologies.

System functional requirements analysis
Through previous research, it is found that violence usually occurs in certain corners of a building, such as the end of a corridor. Through a comprehensive analysis of the factors of indoor violence and the existing security systems in buildings, the functional modules of this system should include four parts: the display of monitoring information; the monitoring settings; the options of the human activity recognition engine; and the system settings. The system function module diagram is shown in Figure 1.

Monitoring information display
All kinds of information collected by the system, together with the current state of its various settings, need to be visible to the security administrator. In this module, the security administrator can view the current pictures taken by the monitoring cameras, and can also look over the history of activity recognition records and warning records.

Other settings
The monitoring settings include the selection of monitoring cameras, switching cameras on and off, and so on. The human activity recognition settings include the switch for activity recognition warnings, updates of the human activity recognition library, the sensitivity of activity identification, etc. The system settings include the setup and management of account information, such as the unit name, address, login password, and so on.

System structure design
The indoor security video surveillance system based on human activity recognition mainly includes four parts: video data acquisition by cameras; construction of the human activity recognition and analysis engine; processing and storage of video data by the server; and design of the front-end pages. The core, as well as the main difficulty, of the system lies in the construction of the human activity analysis engine. The four parts of the whole system are independent but closely related. The first step is data acquisition: cameras are used to record a large number of experimental personnel simulating punching, kicking, falling down, waving, jogging, walking, talking and jumping. These clips are then combined with the public KTH data set, and each video clip is placed in a folder according to its category. In the human activity recognition engine, the training set is fed into the network for training. When model training is completed, the generated H5 file is deployed in the program on the server. On the server side, the page configuration file and the database configuration file are written, and the Java files are written according to the Spring MVC framework. When the system is running, the video stream is fed into the server in real time, where the video is cached and analysed. The results are stored in the database and read on the web page. The structure of the campus safety monitoring system based on human activity recognition is shown in Figure 2.
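The folder-per-category layout described above can be sketched as follows. This is a minimal illustration, not the system's actual code; the function names and the `.avi` extension are assumptions (KTH clips are distributed as AVI files), and a temporary directory stands in for the real data set.

```python
from pathlib import Path
import tempfile

def build_label_map(dataset_root):
    """Map each class folder name (e.g. 'kicking') to an integer label."""
    classes = sorted(p.name for p in Path(dataset_root).iterdir() if p.is_dir())
    return {name: idx for idx, name in enumerate(classes)}

def list_clips(dataset_root, label_map):
    """Return (clip_path, label) pairs for every clip under its class folder."""
    samples = []
    for name, label in label_map.items():
        for clip in sorted(Path(dataset_root, name).glob("*.avi")):
            samples.append((str(clip), label))
    return samples

# Demonstration with a temporary directory standing in for the real data set.
root = tempfile.mkdtemp()
for action in ["walking", "kicking", "jogging"]:
    d = Path(root, action)
    d.mkdir()
    (d / "clip_001.avi").touch()

label_map = build_label_map(root)
samples = list_clips(root, label_map)
```

Sorting the class folders makes the name-to-label mapping deterministic across runs, which matters when the trained model is later reloaded from the H5 file.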

Data acquisition architecture design
The main source of data in the system is the existing video surveillance network. The main purpose of the system is to warn of abnormal human actions, so the real-time requirement is very high. Early on, most installations used analogue video surveillance systems; now digital video surveillance systems are widely used, and digital video can be transmitted wirelessly or over wires [1]. The digital video surveillance system uses an embedded video web server [2] for data collection. The video signal sent by the camera can be transmitted directly to the server after compression. In this way, every device is identified by an IP address and connected directly to the LAN, without being limited by cable length. This makes it easier to arrange a complex monitoring network in a building, and it offers good expansibility, because adding a device only means adding an IP address. The architecture of data acquisition is shown in Figure 3.

Human activity recognition engine architecture design
This system needs to analyse the video stream and uses the ideas of image recognition. At present, the mainstream human activity recognition methods are based on local feature representation and deep neural networks [3]. The former is subdivided into interest point detection, local feature extraction and local feature aggregation. The latter includes the space-time network, the multi-stream network [4], the deep generative network and the temporal consistency network. By comparing the results of different networks on different data sets [5], VGG16 is chosen as the main body of the system's neural network.
The comparison of the performance of different networks is shown in Table 1. VGG16 is a large convolution network; the number of neurons it contains is shown in Table 2. As shown in the table, the VGG network has more than 138 million parameters in total. Confronted with such a large number of parameters, and in order to build this system quickly, all layers in the VGG16 network are frozen and loaded with a weight file that does not contain the top (fully connected) weights. Finally, the classification layers are added on top.
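The 138-million figure can be verified by hand. The sketch below (an illustration, not part of the system) counts the weights and biases of the standard ImageNet VGG16 configuration: thirteen 3 * 3 convolution layers followed by three fully connected layers ending in 1000 classes.

```python
def conv_params(k, c_in, c_out):
    # k*k*c_in weights per output channel, plus one bias per output channel
    return k * k * c_in * c_out + c_out

def fc_params(n_in, n_out):
    # dense layer: full weight matrix plus one bias per output unit
    return n_in * n_out + n_out

# VGG16 convolution stack: (in_channels, out_channels) of each 3x3 layer
conv_layers = [(3, 64), (64, 64),
               (64, 128), (128, 128),
               (128, 256), (256, 256), (256, 256),
               (256, 512), (512, 512), (512, 512),
               (512, 512), (512, 512), (512, 512)]

total = sum(conv_params(3, c_in, c_out) for c_in, c_out in conv_layers)
# fully connected head: 7*7*512 -> 4096 -> 4096 -> 1000
total += fc_params(7 * 7 * 512, 4096) + fc_params(4096, 4096) + fc_params(4096, 1000)
# total == 138,357,544 parameters, i.e. "more than 138 million"
```

Note that roughly 90% of the parameters sit in the fully connected head, which is exactly the part that is discarded when the top weights are not loaded; this is why freezing the convolutional base keeps training cheap.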

Server architecture design
The system architecture uses the MVC pattern to separate the model, view and controller layers, and is implemented [6] on the Spring MVC framework. Spring MVC is a lightweight Web framework that is easy to use, well documented and highly extensible, and it is often used to build high-quality Web applications. Tomcat [7] is selected as the underlying Java EE Servlet server. The system has both a user database and a video database. The user database stores users' usernames, passwords and other related information. As for the video database, the system does not store video files in the server's storage; only the paths of the video files are saved in the database. Finally, the database configuration file is configured on the server to hand the database connection over to Spring. The basic flow of the whole service program is shown in Figure 4.
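The paper's server side is Java with Spring MVC; purely to illustrate the path-only storage scheme, here is a minimal sketch using Python's built-in sqlite3. The table and column names are hypothetical, and an in-memory database stands in for the system's real video database.

```python
import sqlite3

# In-memory database standing in for the system's video database.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE video_record (
    id INTEGER PRIMARY KEY,
    camera_id TEXT NOT NULL,
    file_path TEXT NOT NULL,   -- only the path is stored, never the video bytes
    recorded_at TEXT NOT NULL
)""")

conn.execute(
    "INSERT INTO video_record (camera_id, file_path, recorded_at) VALUES (?, ?, ?)",
    ("cam-01", "/data/video/2023-05-01/cam-01_0830.mp4", "2023-05-01T08:30:00"),
)
path, = conn.execute(
    "SELECT file_path FROM video_record WHERE camera_id = ?", ("cam-01",)
).fetchone()
```

Keeping only paths in the database avoids bloating it with binary blobs; the trade-off is that the database and the file system must be kept consistent (a deleted file leaves a dangling path).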

System implementation
The modules implemented on the embedded video server include the video scheduling and transmission module, the real-time acquisition module, the control information processing module and so on. The design of the data acquisition part is shown in Figure 5. The video transmission module is the most important part of the data collection: it selects the corresponding scheduling strategy according to the service type to create the video stream, and the video data is packaged and sent to the server. UDP is chosen as the transport protocol for the video data. UDP is often used to transmit data in monitoring systems, since its low overhead and lack of retransmission delay help meet the real-time requirement. The multicast mode of UDP can send packets to a specific group of users, so this system adopts IP multicast technology.
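The packaging-and-sending step can be sketched with Python's standard socket module. This is an illustration only: the real system uses IP multicast to a camera group, but plain unicast on the loopback interface is used here so the sketch runs anywhere, and the 1 KB payload is a placeholder for compressed frame data.

```python
import socket

# Receiver: an ordinary UDP socket bound on the loopback interface.
recv_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
recv_sock.bind(("127.0.0.1", 0))          # let the OS pick a free port
port = recv_sock.getsockname()[1]

# Sender: package a (fake) video frame and send it as a single datagram.
send_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
frame_payload = b"\x00" * 1024            # placeholder for compressed frame data
send_sock.sendto(frame_payload, ("127.0.0.1", port))

data, addr = recv_sock.recvfrom(2048)     # one datagram = one packaged frame
send_sock.close()
recv_sock.close()
```

Because UDP has no retransmission, a lost datagram is simply a dropped frame; for surveillance video this is usually preferable to the queueing delay TCP retransmission would introduce.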

Build an engine for activity recognition
The main technologies used in this system are 3D convolution, transfer learning, the VGG-16 network and data enhancement. Because the data set at hand is limited, we supplement it with the public KTH data set, and use transfer learning and data enhancement technology to help train a better network.
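Data enhancement enlarges the limited training set by applying label-preserving transformations to the clips' frames. As a minimal, framework-free sketch (the paper does not list its exact transformations, so horizontal flipping is chosen here as a typical example), a frame represented as rows of RGB pixels can be mirrored like this:

```python
def hflip(image):
    """Horizontally flip an image given as rows of (R, G, B) pixel tuples."""
    return [row[::-1] for row in image]

# 2x3 toy frame: each pixel is an (R, G, B) tuple
img = [[(255, 0, 0), (0, 255, 0), (0, 0, 255)],
       [(10, 10, 10), (20, 20, 20), (30, 30, 30)]]

flipped = hflip(img)  # same frame, mirrored left-to-right
```

Flipping is safe for actions like walking or waving whose label does not depend on left/right orientation; transformations must be chosen so the action class is unchanged.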

3D convolution
When the original image is input, it is not converted to a grayscale image. A grayscale image is two-dimensional, but an RGB image is a stack of three colour layers (red, green and blue), which forms a three-dimensional block. Therefore a plain 2D CNN cannot be used here; we construct a convolution neural network that convolves across the three dimensions, using the idea of 3D CNN [9]. The original image can be seen as a large cuboid composed of three stacked slices, and the filter is a small cube likewise composed of three stacked slices. The filter must have the same number of channels as the image, so the small cube can be slid over the large cuboid and the products computed.
The height and width of the picture in the system are both 224 pixels, the height and width of the filter are both 3 pixels, and each layer is a two-dimensional matrix. Assume the three layers of the image are R, G and B, with elements R_{i,j}, G_{i,j} and B_{i,j} (i = 1, 2, ..., 224; j = 1, 2, ..., 224), and that the three layers of the filter corresponding to R, G and B are r, g and b, with elements r_{i,j}, g_{i,j} and b_{i,j} (i = 1, 2, 3; j = 1, 2, 3). Denoting the output matrix by S, the value of its element S_{1,1} can be calculated as (1):

S_{1,1} = Σ_{i=1}^{3} Σ_{j=1}^{3} (R_{i,j} r_{i,j} + G_{i,j} g_{i,j} + B_{i,j} b_{i,j})   (1)

Assuming that the stride is 1 pixel, the filter is moved one grid at a time and the same calculation is carried out after each move. Finally, a 222 * 222 two-dimensional matrix is obtained. A single filter can only extract one feature from the picture, so multiple filters are used to convolve separately, producing multiple two-dimensional maps. Stacking these two-dimensional maps generates a new multidimensional image whose number of channels equals the number of filters used.
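The calculation above can be checked with a small pure-Python sketch. A 6 * 6 three-channel image stands in for the 224 * 224 input so the example stays readable; with a 3 * 3 * 3 filter and stride 1, the output shrinks by 2 in each dimension, mirroring 224 to 222.

```python
def conv3d_patch(image, filt, top, left):
    """Apply a 3x3x3 filter at one position: equation (1), summed over
    the three channels and the 3x3 window."""
    return sum(
        image[c][top + i][left + j] * filt[c][i][j]
        for c in range(3) for i in range(3) for j in range(3)
    )

def convolve(image, filt):
    """Valid convolution with stride 1: an HxW image shrinks to (H-2)x(W-2)."""
    height, width = len(image[0]), len(image[0][0])
    return [[conv3d_patch(image, filt, top, left)
             for left in range(width - 2)]
            for top in range(height - 2)]

# Toy stand-in for the 224x224 RGB input: a 6x6 image with 3 channels.
H = W = 6
image = [[[1] * W for _ in range(H)] for _ in range(3)]  # all pixels = 1
filt = [[[1] * 3 for _ in range(3)] for _ in range(3)]   # all weights = 1

out = convolve(image, filt)
# Output is (6-2) x (6-2) = 4 x 4; each element sums 3*3*3 = 27 ones.
```

With all pixels and weights equal to 1, every output element is 27, which makes the channel-and-window summation of equation (1) easy to verify by eye.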

Transfer learning
This system uses the open-source VGG16 convolution neural network [5] for human activity identification, which uses the idea of three-dimensional convolution when building the network. The number of channels in the image is three, corresponding to the three colours red, green and blue. The network has 16 weight layers, and the input image size is 224 * 224 * 3. At the beginning there are two convolution layers, each using 64 filters of size 3 * 3 with a stride of 1. Next, the picture is compressed by a 2 * 2 pooling layer, giving an output of size 112 * 112 * 64, after which convolution layers and pooling layers alternate. After 5 such rounds, the feature map is passed through fully connected layers and finally activated by Softmax. The network structure of VGG16 is shown in Table 3. The system uses PyCharm to build the convolution neural network. Keras, an open-source deep learning library based on Python, makes it convenient to build the network structure in PyCharm, and this system uses Keras to help build the convolution neural network.
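The alternation of convolution and pooling described above can be traced with a short sketch (an illustration, assuming the standard VGG16 configuration: padded 3 * 3 convolutions that preserve spatial size, and 2 * 2 max-pooling that halves it):

```python
def vgg16_shapes(size=224):
    """Trace the spatial size through VGG16's five blocks: the padded 3x3
    convolutions keep the size, each 2x2 max-pool halves it."""
    shapes = []
    channels = [64, 128, 256, 512, 512]  # output channels of the five blocks
    for c in channels:
        shapes.append((size, size, c))   # after the block's conv layers
        size //= 2                       # the block's pooling layer
        shapes.append((size, size, c))
    return shapes

shapes = vgg16_shapes()
# First pool output: 112 x 112 x 64, matching the text;
# final feature map entering the fully connected layers: 7 x 7 x 512.
```

The final 7 * 7 * 512 map is flattened to the 25,088 inputs of the first fully connected layer, which is also where the bulk of the network's parameters reside.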
The transfer learning method in this system is to freeze all layers in the base model and to train and adjust the weights only in the newly added layers; the new classification layers are added on top of the frozen VGG16 network. There are eleven types of human action that need to be identified in this system: hugging, jogging, walking, talking, jumping, punching, kicking, falling down, waving, asking for help and pushing. In recognition, the input video is divided into frames, and the OpenCV library function cv2.imwrite is called in Python to save the video stream frame by frame. The picture location is set to the location where the video frames are saved.
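The newly added classification layer outputs one score per action class, so each training label must be encoded as a vector matching the Softmax output. A minimal sketch of that encoding (illustrative only; the class ordering is an assumption):

```python
# The eleven action classes recognised by the system.
ACTIONS = ["hugging", "jogging", "walking", "talking", "jumping", "punching",
           "kicking", "falling down", "waving", "asking for help", "pushing"]

def one_hot(action):
    """Encode an action name as the one-hot vector the Softmax output is
    trained against: all zeros except a 1 at the class's index."""
    vec = [0] * len(ACTIONS)
    vec[ACTIONS.index(action)] = 1
    return vec

label = one_hot("kicking")
```

The length of this vector (eleven) fixes the width of the classification layer added on top of the frozen base network.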