Behavior monitoring model of kitchen staff based on YOLOv5l and DeepSort techniques

Although monitoring systems are widely used, the actual monitoring task still requires considerable manpower. This paper uses the YOLOv5l model and the DeepSort algorithm as the basic framework to identify and track staff in a kitchen environment. We construct relations between detected items and people, then label the relations that correspond to behaviors violating kitchen regulations, such as staff not wearing a mask or hat. We train our model, and the experimental results show that it can correctly identify inappropriate staff behaviors. The model achieves a time-constrained accuracy of 95.32% in identifying whether staff wear a hat, and a time-constrained accuracy of 96.32% in identifying whether staff wear a mask correctly. The results show that the proposed model can fulfil the monitoring task in this kitchen environment.


Introduction
Although monitoring systems are widespread, the actual monitoring task still requires manpower. Existing video monitoring systems usually only record video, providing information without interpreting it, so the footage can only be used to extract evidence after an event. Recently, deep learning algorithms have been used for target detection and recognition, which allows computers to perform monitoring tasks automatically and intelligently [1][2][3][4][5][6][7][8]. This provides a basis for our research.
This paper focuses on the application of intelligent video monitoring in a kitchen environment. An algorithm based on YOLOv5l and DeepSort is used to detect the people and items in kitchen surveillance video and to identify whether staff wear masks and hats correctly.

YOLOv5l
The network structure of YOLOv5 is divided into four parts: input, backbone, neck and prediction. The input part performs basic processing such as data augmentation, adaptive image scaling and anchor box calculation. In the backbone, the CSP (Cross Stage Partial) structure is used to extract the main features from the input samples for subsequent use. The neck adopts the FPN (Feature Pyramid Network) and PAN (Path Aggregation Network) structures and uses the features extracted by the backbone to enhance feature fusion.
In our model, the loss function for bounding-box prediction is GIOU_Loss. For two boxes A and B, we first compute their minimum convex set C (the smallest bounding box enclosing both A and B). Then, using C, we compute GIOU and GIOU_Loss as follows:
$$\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|}, \qquad \mathrm{GIOU} = \mathrm{IoU} - \frac{|C \setminus (A \cup B)|}{|C|}, \qquad \mathrm{GIOU\_Loss} = 1 - \mathrm{GIOU}$$
The YOLOv5l model scales the YOLOv5 architecture in network depth and width. The backbone uses the CSP structure three times, with 3, 9 and 9 residual components. The neck uses the CSP structure five times, and YOLOv5l uses three residual components in each of these CSP structures.
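To make the GIOU_Loss used for bounding-box prediction concrete, the following is a minimal sketch in plain Python. It assumes axis-aligned boxes given as (x1, y1, x2, y2) tuples with positive area; the function names are illustrative and are not taken from the YOLOv5 code base.

```python
def giou(a, b):
    """Generalized IoU of two axis-aligned boxes (x1, y1, x2, y2).

    Assumes both boxes have positive area (an assumption of this sketch).
    """
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)

    # Intersection of A and B (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter

    # C: smallest box enclosing both A and B.
    area_c = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))

    iou = inter / union
    # Penalize the part of C that is covered by neither A nor B.
    return iou - (area_c - union) / area_c


def giou_loss(a, b):
    """GIOU_Loss = 1 - GIOU; ranges from 0 (identical boxes) towards 2."""
    return 1.0 - giou(a, b)
```

Unlike plain IoU, the value stays informative for non-overlapping boxes: two disjoint boxes still receive a loss that shrinks as they move closer together, because the enclosing box C shrinks.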

DeepSort
DeepSort is a multi-target tracking algorithm that uses motion and appearance information for data association. The algorithm takes the objects detected in each frame and matches them with previous detections.
The matching degree is a weighted sum of the Mahalanobis distance between positions and the appearance similarity of the image regions within the bounding boxes. When calculating the Mahalanobis distance, a Kalman filter is used to predict the covariance matrix of the motion distribution. The minimum cosine distance is calculated from the appearance features, so the matching degree combines motion and appearance information. For track i and detection j, the matching degree is defined as:
$$c_{i,j} = \lambda\, d^{(1)}(i,j) + (1-\lambda)\, d^{(2)}(i,j)$$
where $d^{(1)}$ is the Mahalanobis distance, $d^{(2)}$ is the cosine distance and $\lambda$ is the weight coefficient.
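The weighted combination of the motion and appearance distances can be sketched as follows. This is an illustration under stated assumptions, not the DeepSort implementation: the covariance is taken to be diagonal, the Mahalanobis term is used in squared form, and all function names and the way features are stored are hypothetical.

```python
import math


def mahalanobis_sq(measurement, mean, variances):
    """Squared Mahalanobis distance, assuming a diagonal covariance.

    `mean` and `variances` stand in for the Kalman-filter prediction of a track.
    """
    return sum((z - m) ** 2 / v for z, m, v in zip(measurement, mean, variances))


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm


def min_cosine_distance(detection_feature, track_gallery):
    """Smallest cosine distance between a detection and a track's stored features."""
    return min(cosine_distance(detection_feature, f) for f in track_gallery)


def matching_degree(d1, d2, lam):
    """Weighted sum: lam * d1 (motion) + (1 - lam) * d2 (appearance)."""
    return lam * d1 + (1.0 - lam) * d2
```

A detection would then be compared against every track, for example `matching_degree(mahalanobis_sq(z, mean, var), min_cosine_distance(feat, gallery), lam=0.5)`, and the lowest-cost pairs are associated. The value 0.5 is only a placeholder; the weight used in the experiments is not stated in the text.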

Behavior recognition
First, YOLOv5l is used to recognize objects (people, hats and masks), and then the bounding boxes of people are passed to the system to detect violations. The system detects behavior violations based on two rules:
Hat wearing: Hat wearing is identified from the bounding boxes of the person and the hat. If the bounding box of the hat lies above the line marking the top quarter of the height of the person's bounding box, it is defined as appropriate hat wearing.
Mask wearing: Inappropriate mask wearing, such as wearing the mask on the chin, is identified by the "B mask" label. Not wearing a mask is identified by the "C mask" label.
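The two rules above can be sketched as a small check on detector output. The sketch makes several assumptions that the text does not specify: boxes are (x1, y1, x2, y2) in image coordinates with y growing downward, the hat is tested by the position of its box center, and the hat must also fall horizontally inside the person's box. The names `Box`, `hat_worn` and `violations` are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x1: float
    y1: float  # top edge; image y grows downward
    x2: float
    y2: float  # bottom edge


def hat_worn(person: Box, hat: Box, top_fraction: float = 0.25) -> bool:
    """True if the hat's center lies in the top quarter of the person's box."""
    hat_cx = (hat.x1 + hat.x2) / 2
    hat_cy = (hat.y1 + hat.y2) / 2
    # y coordinate of the line marking the top quarter of the person's height.
    limit_y = person.y1 + top_fraction * (person.y2 - person.y1)
    return person.x1 <= hat_cx <= person.x2 and hat_cy <= limit_y


def violations(person: Box, hats: list, mask_label: str) -> list:
    """Collect rule violations for one person from hat boxes and a mask label."""
    found = []
    if not any(hat_worn(person, h) for h in hats):
        found.append("no hat")
    if mask_label == "B mask":
        found.append("mask worn incorrectly")
    elif mask_label == "C mask":
        found.append("no mask")
    return found
```

For a person box `Box(100, 50, 200, 450)`, a hat box `Box(120, 40, 180, 90)` and the label "A mask", `violations` returns an empty list; with no hat boxes and the label "C mask" it reports both violations.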

Dataset
The behavior dataset was collected in a kitchen environment. A total of five different scenes were recorded, each for one hour, giving 5 hours of video in total. The labeled dataset consists of 2000 pictures captured from the recorded video. The items are labeled with five types: "person", "hat", "A mask", "B mask" and "C mask". The label "person" marks the location of a staff member in the camera area. The label "hat" marks a hat worn by a staff member in the camera area. The label "A mask" indicates that the staff member wears a mask. The label "B mask" indicates that the staff member does not wear the mask properly. The label "C mask" indicates that the staff member does not wear a mask. A labeling example is shown in Figure 1.

Experimental implementation
The experiments require a GPU for computation. Table 1 lists the hardware configuration used in this experiment.

Results
The performance of item recognition in the kitchen environment is evaluated with the labeled dataset. The experiment uses 2000 labeled pictures, of which 90% are used for training and 10% for testing. The number of training iterations is set to 300. The results on the training dataset are shown in Figure 2. The inappropriate behavior recognition is also evaluated. The model achieves an accuracy of 95.32% in identifying whether staff wear a hat, and an accuracy of 96.32% in identifying whether staff wear a mask appropriately. Accuracy is defined as the proportion of inappropriate behaviors in the video that are correctly recognized within a duration of half a minute. An example of inappropriate mask wearing is shown in Figure 3.
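One possible reading of this half-minute accuracy is sketched below. It is an assumption of the sketch, not a statement of the evaluation protocol, that each inappropriate behavior is an event with a start time and that it counts as correct when the system flags it no later than 30 seconds after it begins. The function name and the event format are hypothetical, and the sample data are invented for illustration only.

```python
def time_constrained_accuracy(events, window_s=30.0):
    """Fraction of events flagged within `window_s` seconds of their start.

    `events` is a list of (start_time_s, detection_time_s) pairs;
    detection_time_s is None when the event was never flagged.
    """
    if not events:
        raise ValueError("at least one event is required")
    hits = sum(
        1
        for start, detected in events
        if detected is not None and 0 <= detected - start <= window_s
    )
    return hits / len(events)


# Illustrative data only: two of these four events are flagged in time.
sample = [(0, 5), (100, 140), (200, None), (300, 330)]
print(time_constrained_accuracy(sample))
```

Under this reading, a late detection (40 seconds in the second sample event) is scored the same as a missed one.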

Conclusion
In this paper we proposed a hybrid model that combines YOLOv5l, DeepSort and a violation identification function. The model can effectively detect inappropriate behavior of kitchen staff. Our research can effectively reduce human labor in kitchen monitoring and realize automatic supervision.