Online semi-supervised multi-person tracking with Gaussian process regression

Most existing multi-person tracking approaches are affected by lighting conditions, abrupt changes in pedestrian pose, scale changes, and real-time processing constraints, to name a few, resulting in detection errors, drift and other issues. To cope with these challenges, we propose an enhanced multi-person tracking framework that introduces a new observation model, which is updated adaptively and fully online to avoid the loss of sample diversity, and is learnt in a semi-supervised manner. We fuse prior information into the tracking decision, while knowledge extracted from the current frame assists that decision, which can be viewed as a transfer learning strategy; both aspects ameliorate the tendency to drift. The new approach does not need any calibration or batch processing. Experimental results show that the approach yields comparable or better performance than state-of-the-art methods that rely on calibration or batch processing.


Introduction
Tracking-by-detection approaches have become very popular in recent years [1] [2]. In practice, most tracking-by-detection approaches are still limited to special scenarios and are affected by occlusion, scale change, real-time processing requirements, etc. Moreover, because they are harmed by false and missing detections, some methods employ occlusion reasoning to smooth the trajectories [3]. However, these methods are sensitive to detection errors because they build the trajectory from two consecutive frames. Thus, during long-term occlusion or abrupt changes in pose, there is a danger that the tracked target tends to drift.
To deal with this problem, dynamic and observation models are used in combination. The dynamic model takes pedestrian behaviour into account and is often used to estimate the new location of a pedestrian. However, most existing dynamic models use only the single previous state for prediction and do not exploit prior information, so they tend to produce incorrect estimates when pedestrian motion changes abruptly. The observation model represents changes in the pedestrian's appearance and, particularly when adapted online, accounts for gradual appearance change. Most existing observation models gather past appearance information over time but do not use the current state of the pedestrian's appearance, which inevitably leads to drift.
In this paper, we introduce a new observation model that addresses the problems mentioned above in several respects. First, we fuse prior information into tracking decisions. Second, the observation model is learnt in a semi-supervised manner using both labelled and unlabelled samples. Third, background information is taken into consideration when updating the observation model. Fourth, re-weighting knowledge is used for the tracking decision, which can be viewed as a transfer learning strategy. All of these aspects tend to alleviate drift. The main contributions of our work are:
- The new observation model is updated adaptively, which avoids the loss of sample diversity, and is learnt in a semi-supervised manner.
- We extract re-weighting knowledge from the current pedestrian status information and use it for tracking inference, which can be viewed as a transfer learning strategy.
We test our method on two multi-person tracking benchmark sequences, where it achieves promising results, better than the previously tested state-of-the-art methods. The rest of the paper is organized as follows. Section 2 presents the new observation model used for tracking. Section 3 evaluates the performance of the observation model in comparison with a number of typical methods. Section 4 concludes the paper and points out some future work.

New Observation Model
In this section, we present the tracking process. At each frame $f_t$, the current location of a tracker is stored in a bounding box. We use the Kalman filter's prediction as input to stochastically generate a set of candidate pedestrian locations in the current frame. The tracking result of tracker $T_i$ can then be estimated by MAP, as shown in equation (2).
For each sample, we introduce an indicator variable. We also introduce two real-valued latent vectors, $l_A$ and $l_U$, corresponding to the labels $y_A$ and $y_U$ respectively. We connect regression and classification by using a sigmoid output model. The Gaussian process model restricted to the auxiliary data and unlabelled data is shown in equation (4).
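The candidate generation and MAP selection described here can be sketched as follows. This is only an illustration under stated assumptions: the Gaussian perturbation of position and scale, the parameter names (`pos_sigma`, `scale_sigma`) and the toy scoring function are hypothetical and stand in for the proposal distribution and the posterior of equation (2), which are not reproduced here.

```python
import random

def generate_candidates(predicted_box, n, pos_sigma, scale_sigma, rng):
    """Draw n candidate boxes (x, y, w, h) around a predicted box.

    Assumption: position and scale are perturbed with Gaussian noise.
    """
    x, y, w, h = predicted_box
    candidates = []
    for _ in range(n):
        # Clamp the scale factor so that width and height stay positive.
        s = max(0.1, rng.gauss(1.0, scale_sigma))
        candidates.append((rng.gauss(x, pos_sigma),
                           rng.gauss(y, pos_sigma),
                           w * s, h * s))
    return candidates

def map_estimate(candidates, posterior):
    """Return the candidate with the highest (unnormalised) posterior score."""
    return max(candidates, key=posterior)

# Hypothetical usage: the predicted box would come from the Kalman filter.
rng = random.Random(0)
predicted = (100.0, 50.0, 40.0, 80.0)
candidates = generate_candidates(predicted, 200, 5.0, 0.05, rng)

# Toy score that prefers boxes near (103, 48); a real tracker would
# evaluate the observation model here instead.
def toy_score(box):
    return -((box[0] - 103.0) ** 2 + (box[1] - 48.0) ** 2)

best = map_estimate(candidates, toy_score)
```

Because the estimate is a maximum over a finite candidate set, its quality depends on how many candidates are drawn and how widely they are spread.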

Graph Laplacians
We construct the prior covariance matrix based on a weighted graph with node set V and edge set E, whose nodes correspond to all samples, in a way similar to [4]. This lets us explore the manifold structure of all samples. Furthermore, we define the weight matrix W of this graph using the method proposed in [5]. Finally, the prior covariance matrix is defined as the inverse of the graph Laplacian.
Because of the sigmoid noise label output model, the posterior over $l_A$ and $l_U$ is no longer Gaussian and has no closed-form solution. Assuming it is a uni-modal function, we use its Laplace approximation to obtain the optimal estimates of $l_A$ and $l_U$. Because the prior covariance matrix is constructed from all samples, the correlation structure of the labelled and unlabelled samples has a significant effect on the latent real-valued output. The latent variable $l_A$ carries the extracted re-weighting knowledge. Regression can serve as a soft replacement for the indicator label $y_A$; it is better at ameliorating the sample misalignment problem and is less sensitive to noise than the indicator variable.
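A minimal sketch of this construction follows. The k-nearest-neighbour graph with Gaussian edge weights, the parameters `k`, `sigma` and `eps`, and the sample points are assumptions made for illustration; the weighting actually used follows [5] and may differ. The unnormalised Laplacian D - W is singular because its rows sum to zero, so the sketch adds a small multiple of the identity before inverting; whether the paper regularises in this way is not stated.

```python
import math

def weight_matrix(samples, k, sigma):
    """Symmetric k-nearest-neighbour graph with Gaussian edge weights."""
    n = len(samples)
    d2 = [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in samples]
          for p in samples]
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        nearest = sorted((j for j in range(n) if j != i),
                         key=lambda j: d2[i][j])[:k]
        for j in nearest:
            W[i][j] = W[j][i] = math.exp(-d2[i][j] / (2.0 * sigma ** 2))
    return W

def graph_laplacian(W):
    """Unnormalised Laplacian D - W, with D the diagonal degree matrix."""
    n = len(W)
    return [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]

def invert(M):
    """Gauss-Jordan inverse with partial pivoting (small matrices only)."""
    n = len(M)
    A = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        pivot = A[c][c]
        A[c] = [v / pivot for v in A[c]]
        for r in range(n):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
    return [row[n:] for row in A]

def prior_covariance(L, eps):
    """Inverse of the regularised Laplacian L + eps * I (assumed remedy)."""
    n = len(L)
    return invert([[L[i][j] + (eps if i == j else 0.0) for j in range(n)]
                   for i in range(n)])

# Hypothetical 2-D feature vectors for labelled and unlabelled samples.
samples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0)]
W = weight_matrix(samples, k=2, sigma=1.0)
L = graph_laplacian(W)
K = prior_covariance(L, eps=0.01)
```

Samples that are close on the graph receive strongly correlated prior values in `K`, which is how unlabelled samples can influence the latent outputs of labelled ones.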
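For orientation, the generic form of a Laplace approximation for a Gaussian process with a logistic output model can be written as below. This is a textbook sketch under assumed notation, not a reproduction of the paper's equations: $l$ stacks the latent values of all samples, $K$ denotes the prior covariance, $\mathcal{A}$ indexes the samples with known labels $y_i \in \{-1, +1\}$, and the sigmoid is taken in its plain logistic form without a scale parameter.

```latex
\begin{align}
  p(y_i \mid l_i) &= \sigma(y_i l_i) = \frac{1}{1 + e^{-y_i l_i}}, \\
  \Psi(l) &= \sum_{i \in \mathcal{A}} \log \sigma(y_i l_i)
             - \tfrac{1}{2}\, l^{\top} K^{-1} l + \text{const}, \\
  \hat{l} &= \arg\max_{l} \Psi(l), \\
  p(l \mid y_{\mathcal{A}}) &\approx
     \mathcal{N}\!\big(\hat{l},\, (K^{-1} + W)^{-1}\big), \qquad
  W_{ii} =
  \begin{cases}
    \sigma(\hat{l}_i)\big(1 - \sigma(\hat{l}_i)\big), & i \in \mathcal{A}, \\
    0, & \text{otherwise}.
  \end{cases}
\end{align}
```

Unlabelled samples contribute no likelihood term, so they enter the estimate only through the prior covariance.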

Tracker's birth and death
To maintain the trackers, we divide them into two groups based on the templates they own. Once a tracker is born, we call it a Novice; it accumulates templates throughout the tracking process. After K templates have been accumulated over a period of robust tracking, the Novice is promoted to Expert. Conversely, an Expert is demoted to Novice when it loses templates and holds fewer than K. We set K to 5. Each tracker keeps at most Nmax reliable templates by discarding the lowest-scoring template; we set Nmax to 10.
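The promotion and demotion rule can be illustrated with a small sketch. The class and method names are hypothetical, and the way a template is scored or lost is not specified here, so `add_template` simply receives a score and `drop_lowest` removes the weakest template. Only the values K = 5 and Nmax = 10 come from the text.

```python
class Tracker:
    K = 5        # templates required for Expert status (value from the text)
    N_MAX = 10   # maximum number of templates kept (value from the text)

    def __init__(self):
        # List of (score, template) pairs, kept sorted by descending score.
        self.templates = []

    @property
    def status(self):
        """Expert when at least K templates are held, otherwise Novice."""
        return "Expert" if len(self.templates) >= self.K else "Novice"

    def add_template(self, score, template):
        """Add a template and discard the lowest-scoring ones beyond N_MAX."""
        self.templates.append((score, template))
        self.templates.sort(key=lambda pair: pair[0], reverse=True)
        del self.templates[self.N_MAX:]

    def drop_lowest(self):
        """Remove the lowest-scoring template, if any (assumed loss rule)."""
        if self.templates:
            self.templates.pop()

# Hypothetical usage: a tracker that accumulates five templates is promoted.
tracker = Tracker()
for i in range(5):
    tracker.add_template(0.5 + 0.1 * i, "template_%d" % i)
current_status = tracker.status
```

Deriving the status from the template count, rather than storing it, keeps promotion and demotion consistent with the number of templates held.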
A tracker candidate is activated when its detection rate is above the initialization threshold. Conversely, a tracker is killed when its detection rate falls below the termination threshold. Both thresholds are given in equations (5) and (6),
where the two scale factors are set to 1 and 2 respectively. Each tracker's detection rate is defined in equation (7),
where $N_i^{matched}$ is the number of detections matched with $T_i$ in a sliding window of a given length.
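A sliding-window detection rate with birth and death tests might look like the sketch below. The class name, the window length and the threshold values are hypothetical; the thresholds of equations (5) and (6) are not reproduced, so they are passed in as plain numbers. The rate is assumed to be the number of matched detections divided by the window length.

```python
from collections import deque

class DetectionRate:
    """Fraction of frames in a sliding window with a matched detection."""

    def __init__(self, window):
        self.window = window
        self.history = deque(maxlen=window)  # old frames are dropped automatically

    def update(self, matched):
        self.history.append(1 if matched else 0)

    def rate(self):
        # Dividing by the full window length keeps the rate of a new
        # tracker low until enough evidence has accumulated.
        return sum(self.history) / self.window

def should_activate(rate, init_threshold):
    return rate > init_threshold

def should_terminate(rate, term_threshold):
    return rate < term_threshold

# Hypothetical usage with a window of 5 frames.
dr = DetectionRate(window=5)
for matched in (True, True, False, True):
    dr.update(matched)
rate_after_four = dr.rate()
```

Using two separate thresholds gives hysteresis: a tracker that has been activated is not killed by a single missed detection.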

Experiments
Several experiments show the advantages of our approach over state-of-the-art methods.

Datasets and Ground Truth
We currently test our algorithm on two sequences. The first is S2L1, taken from the VS-PETS 2009 benchmark [6]; representative tracking results can be seen in Fig 1. This sequence is filmed by 7 cameras and shows up to 8 people at a resolution of 768×576; we use only the first viewpoint. Most people wear similar dark clothes, which makes tracking with a colour-based observation model difficult. We also test the new algorithm in a crowded environment (S2L2), where many pedestrians within a confined space make tracking, and even the detection of individuals, difficult. A brief description of the two sequences is given in Table 1. The ground truth used for evaluation is publicly available.

Experimental Environments
All experiments were run on a computer with a 2.8GHz octa-core CPU and 16GB of memory. Our implementation is in C++ and relies on the OpenCV and Eigen libraries. Runtime performance is about 2 frames per second with the new observation model employed for tracking. We believe that a GPU implementation or more optimized code could achieve real-time performance.

Evaluation Metrics
There is no established standard protocol for measuring multi-object tracking performance, so we follow current best practice and compute the CLEAR-MOT metrics proposed in [7]. FP denotes false positives, MS the number of missed detections, and ID.S the number of identity switches. The multiple object tracking accuracy (MOTA) is defined in equation (8).
The multiple object tracking precision (MOTP) is given by equation (9).
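As an illustration, the two scores can be computed from per-frame counts as sketched below, following the standard CLEAR-MOT definitions of [7]: MOTA is one minus the ratio of all errors (false positives, misses and identity switches) to the total number of ground-truth objects, and MOTP averages a per-match score over all matches. The function and argument names are hypothetical, and MOTP is assumed here to average bounding-box overlap, which is consistent with higher values being better.

```python
def mota(false_positives, misses, id_switches, num_ground_truth):
    """MOTA from per-frame error counts and per-frame ground-truth counts."""
    errors = sum(false_positives) + sum(misses) + sum(id_switches)
    return 1.0 - errors / sum(num_ground_truth)

def motp(overlap_sums, num_matches):
    """MOTP as the mean overlap of matched pairs (assumed overlap-based).

    overlap_sums[t] is the summed overlap of all matches in frame t and
    num_matches[t] is the number of matches in frame t.
    """
    return sum(overlap_sums) / sum(num_matches)

# Hypothetical counts for three frames with ten ground-truth objects each.
example_mota = mota([1, 0, 2], [0, 1, 1], [0, 0, 1], [10, 10, 10])
# Hypothetical overlaps for two frames with five matches each.
example_motp = motp([4.0, 3.5], [5, 5])
```

Note that MOTA can become negative when the number of errors exceeds the number of ground-truth objects.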
Note that for both MOTA and MOTP, higher values indicate better performance (see [7] for details). As shown in Table 2, we compare our method with [8] on the PETS2009 sequences. The results of [8] were obtained by ourselves; for a fair comparison we use the same detector as in our method and the same protocol to evaluate the outputs. Note that our results for [8] differ slightly from those in the original paper, which may be due to parameter tuning, preprocessing optimization and other factors. We also show the results of [9] when available. Although [9] performs calibration and batch processing of the data, our method achieves a higher MOTA score and surpasses both [8] and [9] on the S2L1 and S2L2 sequences. Compared with [8], our method has fewer missed detections and a potentially higher number of false positives. We believe the slightly lower MOTP score is caused by the sample set update not adapting perfectly to scale change over time. We also noticed that there are fewer veterans as the density of people in the scene increases: the ratio of veterans is much higher in sequence S2L1 than in S2L2, which can be explained by there being more occlusion in S2L2 than in S2L1.
For the comparison with other methods, including [18], the results are shown in Table 3 and Table 4. The Recall and Precision of our method surpass those of most methods on both the S2L1 and S2L2 sequences. We also obtain the best MOTA score among the compared methods. Because the tracking decision depends mainly on the observation model, the observation model update process has a significant impact on the MOTP score; we will carry out further optimization to improve MOTP performance. The speed of the algorithm depends directly on the number of samples drawn. When reducing the number of samples, we therefore balance tracking accuracy against algorithm speed.
To test the sensitivity of our algorithm to its parameters, we conduct experiments in which each parameter is varied up and down by 40% from its baseline setting while the other parameters are kept fixed. As shown in Figure 2, the performance of our algorithm changes within a reasonable range on all sequences. This indicates that our algorithm is relatively robust to the parameter settings.
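The one-at-a-time protocol described here can be sketched as follows. The parameter names and baseline values in the example are assumptions chosen for illustration (the text does not list which parameters were varied), and a real study would run the tracker and record its scores for each generated setting.

```python
def one_at_a_time(baseline, rel_change=0.4):
    """Generate settings that vary one parameter at a time.

    Returns a list of (name, factor, params) tuples in which only the
    named parameter is scaled by the factor and all others keep their
    baseline values.
    """
    settings = []
    for name, value in baseline.items():
        for factor in (1.0 - rel_change, 1.0 + rel_change):
            params = dict(baseline)   # copy, so the baseline is untouched
            params[name] = value * factor
            settings.append((name, factor, params))
    return settings

# Hypothetical baseline; integer-valued parameters would need rounding
# before being passed to a real tracker.
baseline = {"K": 5, "N_max": 10, "scale_1": 1.0, "scale_2": 2.0}
settings = one_at_a_time(baseline)
```

With P parameters this produces 2P settings, which is far cheaper than a full grid but does not reveal interactions between parameters.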

Conclusion
In this paper, we have presented a novel multi-person tracking algorithm. The new observation model adopts the graph Laplacian, and the prior Gram matrix is constructed from all samples. In this way, unlabelled samples have a strong influence on the prior, which can be viewed as a transfer learning strategy. We divide trackers into two categories based on the number of templates they hold. Experimental results show that our algorithm has a clear advantage over other methods.
In future work, we will investigate incorporating a re-identification scheme into our algorithm to handle people re-identification. We will also extend this framework to multi-target, multi-camera tracking.