Do tracking by clustering anchors output from region proposal network

Most existing clustering algorithms are limited by the computation of the similarity function and by the representation of each object. In this paper, we propose a clustering tracker based on the region proposal network (RPN-C), which tracks objects by clustering the anchors output by the region proposal network into potential centers. We first cut off the second stage of Faster RCNN and then apply clustering algorithms, including K-means, mean shift and density peak clustering, in the feature space of the anchors, using their centroid and scale information. Without the fully connected layers, the RPN-C tracker lowers the computational cost by up to 60% while still maintaining an accurate prediction of the object location in the next frame. To evaluate the robustness of this tracker, we establish a dataset containing over 2000 training images and 7 testing sequences covering 8 kinds of fruit. The experimental results on our own dataset demonstrate that the proposed tracker performs very well in both object localization and scale estimation, and that it remains stable under occlusion and complicated backgrounds.


Introduction
Visual object tracking, an important branch of computer vision, has attracted researchers because of its many applications, such as video surveillance, human-computer interaction, and autonomous driving [1][2][3]. The way tracking is done has changed greatly with the rise of numerous advanced technologies in recent decades, particularly deep learning, but several challenging problems remain unresolved. By using a convolutional neural network (CNN) as the descriptor, object detection networks benefit the tracking task in both accuracy and robustness, particularly against complex backgrounds. Among them, the most commonly used networks are RCNN [4], Fast RCNN [5], Faster RCNN [6], and Mask-RCNN [7]. These CNN-based networks provide localization and classification simultaneously during tracking, but tracking with them alone has two drawbacks: some objects are missed in a few frames, and the networks are inefficient.
In our proposed tracker, we tackle these two issues by (i) using only the region proposal network (RPN), one part of Faster RCNN, to generate a large number of anchors with rich centroid and scale information; and (ii) applying adaptive clustering algorithms to the output anchors to find the potential centers (including centroids and scales). Specifically, we combine the RPN with clustering, a combination we call RPN-C, to replace the remaining part of Faster RCNN, as shown in Fig 1. To begin with, a model is built on VGG16 for each object, and the RPN then outputs the candidate boxes, i.e. the anchors. Afterwards, our RPN-C tracker compares 3 clustering algorithms applied to the anchors: K-means [8], mean shift [9] and density peak clustering [10]. Finally, we established a dataset of 8 kinds of fruit containing over 2000 training images and 7 testing sequences that cover illumination change, scale change, camera shaking, multiple targets, etc. [11]. We trained models both on the full Faster RCNN and on the part of it used here, and then carried out extensive experiments on the testing sequences. The results demonstrate that our RPN-C trackers maintain tracking accuracy while reducing the running time.

Figure 1. Overall structure. The features extracted by the CNN are fed into the RPN to obtain the anchor information, followed by one of two cases. In the initialization part, we use Fast RCNN to initialize the boxes; in the tracking procedure, we apply clustering algorithms to find the centers of the boxes.

Initialization network
In the initialization part, Faster RCNN takes a fixed-height image as input for feature extraction by VGG16, and the output is fed into the RPN to obtain the box offsets and the objectness scores at every position in the feature space. From the RPN output, we obtain rich proposals for the target locations and scale them to the feature size, so that the VGG16 features can be reshaped into batches of the same size. All the batches are fed into the FC layers to give the final centroids and scales. Afterwards, we combine the results of the first several frames by voting to obtain a more reliable initialization (Fig 1). After initialization, the tracking procedure is maintained using part of Faster RCNN (VGG16 and RPN) combined with clustering. Compared with clustering in the image space, clustering in the feature subspace has clear advantages in efficiency and accuracy. First, the subspace of an image is much smaller than the original image, so only a small number of iterations is needed to complete the whole process. Second, replacing pixels with the anchors extracted by the RPN, together with their information in the subspace, is much more accurate than using the color or texture space, as the original algorithm does.
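As an illustration of how the RPN output becomes the 4-dimensional anchor vectors (centroid and scale) that are clustered later, the sketch below decodes box offsets with the standard Faster RCNN parameterization and keeps only anchors with a high objectness score. The function name `decode_anchors`, the tuple layout and the threshold `min_score=0.7` are assumptions made for this example, not settings taken from the paper.

```python
import math

def decode_anchors(anchors, offsets, scores, min_score=0.7):
    """Turn RPN outputs into (cx, cy, w, h) boxes.

    anchors: reference boxes as (cx, cy, w, h)
    offsets: predicted (dx, dy, dw, dh), one tuple per anchor
    scores:  objectness scores in [0, 1], one per anchor
    min_score is a hypothetical threshold chosen for illustration.
    """
    boxes = []
    for (ax, ay, aw, ah), (dx, dy, dw, dh), s in zip(anchors, offsets, scores):
        if s < min_score:
            continue  # drop anchors that are unlikely to cover an object
        cx = ax + dx * aw        # shift the centre by a fraction of the anchor size
        cy = ay + dy * ah
        w = aw * math.exp(dw)    # width and height are scaled in log space
        h = ah * math.exp(dh)
        boxes.append((cx, cy, w, h))
    return boxes
```

Each returned tuple carries both the centroid and the scale of one candidate box, which is the form assumed by the clustering step.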

Mean shift and K-means
Clustering can be viewed as a kind of density estimation in which a kernel function is convolved with the data to find the centers [8][9].
The key idea of mean shift is to estimate the density by computing kernel-weighted means. Since finding the centers of the centroids and of the box scales involves the same operation, we let $\{P_i^t\}$ denote the anchors. Given the predicted center $P_j^t$, the next center is computed as the kernel-weighted mean of the anchors:

$$P_j^{t+1} = \frac{\sum_i K\left(P_i^t - P_j^t\right) P_i^t}{\sum_i K\left(P_i^t - P_j^t\right)}$$

where $K(\cdot)$ is the Gaussian kernel function.
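The mean shift update described above can be sketched in a few lines. This is a minimal illustration assuming a Gaussian kernel on the squared Euclidean distance; the bandwidth, the convergence tolerance and the iteration cap are hypothetical values, not parameters reported in the paper.

```python
import math

def gaussian_kernel(sq_dist, bandwidth):
    """Gaussian weight for a squared distance (bandwidth is an assumed parameter)."""
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mean_shift_step(center, anchors, bandwidth):
    """One update: the kernel-weighted mean of the anchors around `center`."""
    dim = len(center)
    weighted_sum = [0.0] * dim
    total_weight = 0.0
    for a in anchors:
        sq_dist = sum((a[k] - center[k]) ** 2 for k in range(dim))
        w = gaussian_kernel(sq_dist, bandwidth)
        total_weight += w
        for k in range(dim):
            weighted_sum[k] += w * a[k]
    if total_weight == 0.0:
        return list(center)  # no anchor has any weight: stay in place
    return [v / total_weight for v in weighted_sum]

def mean_shift(center, anchors, bandwidth=10.0, tol=1e-3, max_iter=50):
    """Repeat the update until the shift vector is shorter than `tol`."""
    for _ in range(max_iter):
        new_center = mean_shift_step(center, anchors, bandwidth)
        shift = math.sqrt(sum((n - c) ** 2 for n, c in zip(new_center, center)))
        center = new_center
        if shift < tol:
            break
    return center
```

The same functions work for 2-dimensional vectors (centroids or scales alone) and for 4-dimensional vectors, since the dimension is read from the starting center.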
More precisely, the data to be clustered are the anchors output by the RPN, each with 4 dimensions, and the quantities updated during estimation are the mean vector and the shift vector. The centers can therefore be found in two ways: the anchors can be treated as a single 4-dimensional vector, so that centroids and scales are considered simultaneously, or as two 2-dimensional vectors, so that centroids and scales are clustered independently. In practice, the number of boxes to be clustered is known, and the clustering can be initialized with the positions from the last frame. In this case, the mean shift algorithm can be regarded as a K-means algorithm with a fixed initial number of centers. In our work, both kinds of algorithm are considered for comparison.
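The two clustering styles and the initialization from the last positions can be illustrated with a small K-means sketch. It assumes anchors and boxes are `(cx, cy, w, h)` tuples; the function names, the fixed iteration count, and the pairing of centroid and scale centers by index in the independent mode are assumptions made for this example.

```python
def kmeans(points, init_centers, n_iter=10):
    """Lloyd's K-means; the number of centres is fixed by `init_centers`."""
    centers = [list(c) for c in init_centers]
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for p in points:
            # assign each point to its nearest centre (squared Euclidean distance)
            j = min(range(len(centers)),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            groups[j].append(p)
        for i, g in enumerate(groups):
            if g:  # keep the old centre when no anchor was assigned to it
                centers[i] = [sum(col) / len(g) for col in zip(*g)]
    return centers

def cluster_simultaneous(anchors, last_boxes):
    """Cluster (cx, cy, w, h) as one 4-dimensional vector."""
    return kmeans(anchors, last_boxes)

def cluster_independent(anchors, last_boxes):
    """Cluster centroids (cx, cy) and scales (w, h) separately, then pair by index."""
    centroids = kmeans([a[:2] for a in anchors], [b[:2] for b in last_boxes])
    scales = kmeans([a[2:] for a in anchors], [b[2:] for b in last_boxes])
    return [c + s for c, s in zip(centroids, scales)]
```

Passing the boxes of the previous frame as `last_boxes` fixes the number of centers and gives the starting point, which is the sense in which mean shift reduces to K-means here.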

Density peak clustering
Besides mean shift, another popular and attractive clustering algorithm, density peak clustering [10], allows us to do the tracking in a different way. The algorithm rests on the assumptions that cluster centers are surrounded by neighbors with lower local density and that they lie at a relatively large distance from any point with a higher local density. For the anchor information, every center with a high density of votes can be considered an object location. To this end, we compute two quantities for each anchor $P_i^t$: its local density $\rho_i^t$ and its distance $\delta_i^t$ from anchors of higher density. We use this strategy only to find the global and local maxima on the basis of the distances between anchors. It is simpler than the original algorithm because the members assigned to each cluster need not be selected, which makes the procedure much more efficient. Again, the centers of centroids and scales can be clustered together or independently.
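A minimal sketch of this simplified density peak strategy is given below. It follows the usual definitions from [10] (a cutoff-based local density, and the distance to the nearest point of higher density) and ranks anchors by the product of the two quantities; the cutoff value, the ranking rule and the tie-breaking by index are assumptions, and the paper's exact variant may differ.

```python
import math

def density_peaks(points, cutoff, n_centers):
    """Return the `n_centers` points with the largest density * distance score."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # local density: number of other points closer than the cutoff
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < cutoff)
           for i in range(n)]
    # order by decreasing density; the stable sort breaks ties by index
    order = sorted(range(n), key=lambda i: -rho[i])
    delta = [0.0] * n
    for rank, i in enumerate(order):
        if rank == 0:
            # the densest point gets its largest distance to any other point
            delta[i] = max(dist[i])
        else:
            # distance to the nearest point that ranks higher in density
            delta[i] = min(dist[i][j] for j in order[:rank])
    best = sorted(range(n), key=lambda i: rho[i] * delta[i], reverse=True)
    return [points[i] for i in best[:n_centers]]
```

No cluster membership is computed: only the peaks are returned, which mirrors the remark that the assigned members need not be selected.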

Results and discussion
Our proposed algorithm was evaluated on the 7 sequences of our own dataset [11] in terms of overlap ratio (OR), center location error (CLE) and operating speed (OS). The 7 sequences are divided into 4 categories according to the interference from the objects themselves and from the background: grape with a static background, mango/strawberry with multiple objects, apple and pear with camera shaking, and orange, banana and kiwifruit with scale change under occlusion. In the tracking part of our RPN-C trackers, 3 clustering algorithms were compared: K-means (KM), mean shift (MS) and density peak clustering (DP). In addition, we ran the tracking by clustering anchors in 2 fashions, with centroids and scales clustered either simultaneously (S) or independently (I). We implemented our tracker in Python on an Intel E5-2620 2.10 GHz CPU with 8 GB RAM and a GeForce GTX 1080Ti GPU.

Evaluation of three factors
As for the first two performance factors, CLE is calculated as the error in pixels between the center of the ground-truth bounding box (GT) and the center of the bounding box predicted by the tracker (PRE):

$$\mathrm{CLE} = \sqrt{\left(x_{GT} - x_{PRE}\right)^2 + \left(y_{GT} - y_{PRE}\right)^2}$$

while OR is calculated as

$$\mathrm{OR} = \frac{\mathrm{area}\left(GT \cap PRE\right)}{\mathrm{area}\left(GT \cup PRE\right)}$$

grape               59   59  133  177  190  255  184
mango/strawberry    59   62  120  164  179  239  184
apple               59   60   79  107   83   94  187
pear                57   60   71   82   64   67  188
banana              58   58  102  139  130  162  184
orange              58   59   92  103   88  102  189
kiwifruit           59   60   81   87   69   75  183
average             58   60   97  123  115  142  186

Tables 1 and 2 show the average CLE and average OS of our trackers across the 7 testing image sequences. In general, our RPN-C trackers perform better in object localization and running speed while maintaining the accuracy of the boxes and scales. Across all three evaluation factors, RPN+MS stands out in the simultaneous mode, but the other trackers work very well in particular situations. DP clustering, for example, performs well under scale change and occlusion, as in the kiwifruit and orange sequences. For a single clustering algorithm, there is no significant difference in accuracy between clustering centroids and scales together and clustering them independently, but the choice strongly affects the operating speed. We could therefore use the RPN-C trackers in the simultaneous mode and add some bias to the centroids and scales when computing the distance between anchors. Compared with tracking directly with Faster RCNN, the RPN-C trackers have a strong advantage in running speed, evidently as a result of cutting off the FC part. If the trackers run on the CPU only, the comparison is even more impressive. As for tracking accuracy, Faster RCNN scores poorly because of its missing boxes, since we assign a high error value whenever no box is detected. Nonetheless, when Faster RCNN returns a box with very high confidence, the accuracy of both centroid and scale is very high. However, it loses boxes frequently and detects some false boxes, so it is less robust, which results in its poor overall performance.
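The two accuracy measures used in this evaluation can be computed as in the sketch below. It assumes boxes given as `(cx, cy, w, h)`, the Euclidean distance between centers for CLE, and intersection over union for OR; the function names are chosen for this example.

```python
import math

def center_location_error(gt, pre):
    """Euclidean distance in pixels between the two box centres."""
    return math.hypot(gt[0] - pre[0], gt[1] - pre[1])

def overlap_ratio(gt, pre):
    """Intersection area divided by union area of two (cx, cy, w, h) boxes."""
    def corners(box):
        cx, cy, w, h = box
        return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

    gx1, gy1, gx2, gy2 = corners(gt)
    px1, py1, px2, py2 = corners(pre)
    # width and height of the overlap, clamped at zero for disjoint boxes
    iw = max(0.0, min(gx2, px2) - max(gx1, px1))
    ih = max(0.0, min(gy2, py2) - max(gy1, py1))
    inter = iw * ih
    union = gt[2] * gt[3] + pre[2] * pre[3] - inter
    return inter / union if union > 0 else 0.0
```

When a tracker returns no box for a frame, a penalty value has to be substituted for CLE, as the discussion of the Faster RCNN baseline describes; the sketch leaves that choice to the caller.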

Visualization of tracking
For a clearer visualization, we select the best sequence from each situation (apple for camera shaking, grape for color change, kiwifruit for scale change with occlusion, and mango/strawberry for multiple objects) and show the tracking results in images. The boxes detected by the different trackers are shown in different colors. The cyan boxes are detected by Faster RCNN itself; for the four situations, we chose RPN+MS+S, RPN+DP+I, RPN+DP+S and RPN+KM+S respectively, each marked in a different color. As Figure 2 shows, many boxes are missed by Faster RCNN, especially when the targets have a poor appearance. In fact, the testing sequences were collected one year after the training set with a different camera, so the two differ considerably, particularly in illumination. In other words, our RPN-C trackers are more robust in tracking.

Conclusion
This paper presents state-of-the-art RPN-C trackers that combine part of Faster RCNN with clustering algorithms and produce very promising results on our testing sequences in terms of CLE, OR and OS. The algorithm divides the tracking procedure into two parts: the initialization of the bounding boxes using the full Faster RCNN, and the main tracking task using part of Faster RCNN combined with clustering. The primary contribution of this paper is the use of modified clustering algorithms to select among the anchors output by the RPN. For the evaluation, we compared 3 related algorithms, each in the simultaneous and the independent mode for centroids and scales. The evaluation results show that they outperform Faster RCNN owing to their stronger robustness. In addition, since DP clustering does not perform as well as we expected, in future work we need to modify the way the global and local centers are found and devise a new strategy to exploit the local density.