Multi-layer attention for person re-identification

Person re-identification is a significant application in the field of video surveillance analysis, yet it remains a challenging task to recognize the person of interest across disjoint cameras with different viewpoints. The factors affecting identification results include variations in background, illumination conditions and human body pose. Existing person re-identification methods mainly focus on feature extraction over the whole frame and on metric learning functions. However, most of these algorithms treat different image areas without distinction. It is worth emphasizing that different local regions make different contributions to the image representation, which is exactly the intuition behind the attention mechanism. In this paper, we introduce a novel attention network which explores spatial attention in a convolutional neural network. Our algorithm learns visual attention over multi-layer feature maps. The proposed model not only attends to the spatial probabilities of local regions, but also takes features at different levels into consideration. We evaluate this multi-layer spatial attention model on three benchmark person re-identification datasets: Market-1501, CUHK03, and DukeMTMC-reID. The experimental results validate the advantages of our network in comparison with state-of-the-art baselines.


Introduction
Recently, the person re-identification (Re-ID) task, as an indispensable part of the video behaviour analysis field, has received widespread attention. Person re-identification aims at recognizing the same person across multiple cameras with non-overlapping views. It has broad application prospects in many scenarios, especially security systems. It remains challenging because the appearance of the same person can change greatly under different shooting conditions. As can be seen from the image sets in figure 1, variations in pose, illumination and background have a great impact on the recognition results.
When dealing with the person re-identification task, a given image captured by camera A (the probe image) is compared with all images from camera B (the gallery images). The results are ranked according to the similarity between the probe image and each gallery image. In order to achieve good performance, two steps are essential: (i) extracting features that better describe the images; (ii) finding a proper similarity measurement. Many attempts have been made to improve these two steps. Some algorithms [1,2,3,4] focus on feature extraction schemes, including the representation of images in color and texture and the fusion of different levels of features. Regarding similarity learning, several metric learning methods [5,2,6] have been proposed to learn a feature space in which the distance between feature vectors of the same person is smaller than the distance between those of different pedestrians. With the success of deep learning, many methods [7,8,9,10,11,12,13,14] rely on deep neural networks to tackle both the feature extraction and the metric learning steps. Deep network architectures achieve more representative feature expressions than handcrafted features. The most commonly employed convolutional neural network (CNN) models can be categorized into two types: (i) classification networks [8,12,13], originally used for image classification and object detection; (ii) siamese networks [7,10,11,14], which take image pairs or triplets as network inputs. However, few algorithms take the human attention mechanism into consideration. Based on the assumption that the human visual system selectively focuses on important regions instead of treating all regions equally when processing images, the attention mechanism dynamically allows different levels of features to gain different weights. The adoption of attention models has proved effective in various tasks such as image captioning [15,16,17], machine translation [18,19], and question answering [20,21].
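To make the probe-gallery protocol above concrete, the following minimal sketch ranks gallery images by cosine similarity to a single probe feature vector. The function name, the 2048-dimensional features and the random inputs are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def rank_gallery(probe_feat, gallery_feats):
    """Rank gallery images by cosine similarity to one probe feature.

    probe_feat:    (d,)   feature vector of the probe image
    gallery_feats: (n, d) feature vectors of the gallery images
    Returns gallery indices sorted from most to least similar.
    """
    probe = F.normalize(probe_feat.unsqueeze(0), dim=1)   # (1, d), unit length
    gallery = F.normalize(gallery_feats, dim=1)           # (n, d), unit length
    sims = (probe @ gallery.t()).squeeze(0)               # (n,) cosine similarities
    return torch.argsort(sims, descending=True)

# Toy usage with random features (the feature extractor itself is not shown).
probe = torch.randn(2048)
gallery = torch.randn(100, 2048)
ranking = rank_gallery(probe, gallery)   # ranking[0] indexes the best match
```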
In this paper, we adopt a novel algorithm which explores visual attention models in a CNN. Our contributions are: (1) We propose a spatial attention-based convolutional neural network for the person re-identification task. Existing models generally ignore the selection of attentive regions; the formulated attention over spatial regions conforms to the natural visual attention mechanism, resulting in a better representation of the whole image. (2) We explore the attention mechanism on multi-layer feature maps, and the attention-weighted features from multiple layers are fused; the proposed attention network thus takes full advantage of the multi-layer CNN architecture. (3) The attention-based deep model completes the person re-identification procedure in an end-to-end manner. We validate the effectiveness of our framework by comparing the results with state-of-the-art algorithms on three Re-ID benchmark datasets: Market-1501 [22], DukeMTMC-reID [23] and CUHK03 [9].
The rest of this paper is organized as follows. Section 2 briefly reviews related work on the Re-ID task. Section 3 describes the framework of our attention-based approach in detail. Section 4 presents the experimental results on three public benchmark datasets and gives the corresponding analysis. Section 5 concludes our work.

Related Work
For an image-based person re-identification system, feature extraction and distance learning are typically considered the two main components. Various algorithms have been proposed to solve the Re-ID problem. Some of them complete the above two steps separately, while others utilize deep learning-based methods to achieve a joint learning procedure.
In order to obtain a better representation of given images, the basic principle is to find features insensitive to illumination, pose, background and viewpoint. Many existing algorithms extract more discriminative and viewpoint-invariant features. To capture richer statistics of pixel attributes, Matsukawa et al. [3] exploit a hierarchical Gaussian descriptor (GoG), modelling both means and covariances. In [24], An et al. perform matching by projecting the original features into a reference subspace instead of matching them directly; experiments validate the effectiveness of the reference descriptors (RDs) generated from correlations with the reference sets. To enhance Re-ID performance, [1,4] utilize different fusion schemes to combine features obtained at different levels, including pixel-based low-level features (e.g. SIFT (scale-invariant feature transform) [25]), mid-level features (e.g. BoW (Bag of Words) [22]), and high-level features (e.g. features jointly learned in a CNN [14]).
In addition to methods concerned with image representation, many approaches aim at finding a proper similarity measurement that makes the distance metric discriminative. In [26], a null space is learned in which the within-class distance is minimized and the between-class gap is maximized. Yu et al. [27] propose an unsupervised approach to deal with the lack of labelled sample images; the algorithm learns an asymmetric metric with a separate projection for each view. Considering the cross-view distortion of features, Chen et al. [28] formulate a Camera coRrelation Aware Feature augmenTation (CRAFT) framework, which automatically calculates the cross-view camera correlation. With the learned features projected into an adaptive subspace, the CRAFT framework obtains view-specific features for person Re-ID. [29,30,31] take advantage of deep learning architectures and jointly optimize feature selection, similarity learning and ranking.
Most Re-ID methods are proposed under the assumption that the probe and gallery images are well aligned. However, due to different viewpoints and changes of pedestrian pose, misalignment is a key issue to be considered. To address this problem, visual attention schemes and saliency-based techniques are adopted. Some studies focus on saliency learning. Zhao et al. [32] apply pedestrian saliency distribution learning and estimate matching scores based on constrained patch matching; the whole learning and matching procedure is unified into a RankSVM framework. In [33], a weighted integration scheme is adopted, combining human salience information with SDALF (Symmetry-Driven Accumulation of Local Features) [34]. With rotation-invariant attributes, the experimental performance is improved. The attention mechanism can also be viewed as a focus on salient regions. With the recent success of attention modules in other fields, a few attention-based deep networks have been adopted to tackle the misalignment problem in Re-ID. A gradient-based attention mechanism is exploited in [35]. In [36], both CNN-extracted features and color histograms are fed into a recurrent attention architecture to perform a coarse-to-fine selection process. Liu et al. [37] formulate a Comparative Attention Network (CAN) architecture which takes image triplets as training input. The CAN algorithm compares different local regions repeatedly instead of taking just one glimpse of the whole image; the global features learned from CNNs are delivered to an LSTM-based attention module to obtain visual attentive features. The model in [38] also utilizes LSTM units to generate the spatial attention encoding. In [39], a Harmonious Attention CNN (HA-CNN) model is proposed. HA-CNN adopts a multi-branch scheme to process attention at both the local and global levels; for each branch, soft attention and hard attention are combined to fully explore the complementary information.
Different from the attention-related algorithms mentioned above, our proposed attention network not only explores spatial attention, but also fuses the attention-weighted feature maps learned at different layers for a richer representation. Consequently, the adopted method outperforms the state-of-the-art approaches.

Overview
We adopt an attention-based model for the person Re-ID task. The overall framework is illustrated in figure 2. We formulate an end-to-end convolutional neural network to learn attentive features. By exploiting spatial attention on multiple layers, our proposed network makes the original CNN feature maps adapt to the more discriminative local regions and attributes. The spatial attention encodes "where" to pay attention to, and the feature integration scheme results in a more robust feature representation. For attention selection, we formulate an attention module which learns spatial attention. At the $l$-th layer of the base CNN network, the feature map is denoted as $V^{l}$, and the $l$-th layer attention weights are learned from $V^{l}$. The functions are formulated as
$$\alpha^{l} = F_{att}(V^{l}), \qquad A^{l} = \phi(V^{l}, \alpha^{l}),$$
where $F_{att}$ refers to the attention learning function, which will be elaborated in the following attention sections, and $\phi(\cdot)$ is a linear weighting calculation which combines the original feature representation $V^{l}$ with the learned weights $\alpha^{l}$. $A^{l}$ is the weighted map, considered as the output of an attention module.
Note that the obtained output map $A^{l} \in \mathbb{R}^{h \times w \times c}$, where $h$, $w$ and $c$ denote the height, width and channel dimensions of the feature map. The spatial attention maps are generated from the representation map $V^{l}$.
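The following minimal PyTorch sketch mirrors this formulation: the attention-learning function $F_{att}$ is passed in as a callable and the weighting $\phi$ is a simple broadcast multiplication. The class name, the (batch, channels, height, width) tensor layout and the toy uniform attention in the usage example are our assumptions; a concrete learned $F_{att}$ is sketched in the spatial attention section below.

```python
import torch
import torch.nn as nn

class AttentionLayer(nn.Module):
    """Per-layer attention wrapper: alpha^l = F_att(V^l), A^l = phi(V^l, alpha^l),
    where phi is a simple broadcast multiplication in this sketch."""

    def __init__(self, attention_fn):
        super().__init__()
        # attention_fn plays the role of F_att; any callable mapping a
        # (batch, c, h, w) map to (batch, 1, h, w) spatial weights works here.
        self.attention_fn = attention_fn

    def forward(self, v):                  # v: (batch, c, h, w) feature map V^l
        alpha = self.attention_fn(v)       # (batch, 1, h, w) spatial weights alpha^l
        return v * alpha                   # weighted map A^l, same shape as V^l

# Toy usage with a uniform F_att (every location weighted equally).
uniform = lambda v: torch.full_like(v[:, :1], 1.0 / (v.shape[2] * v.shape[3]))
layer = AttentionLayer(uniform)
out = layer(torch.randn(2, 256, 10, 4))    # (2, 256, 10, 4)
```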

Spatial Attention
Spatial attention can be intuitively explained and conforms to the visual processing mechanism naturally. Instead of treating all the local regions equally, the spatial attention focuses on the selected regions and helps to enhance the representation. Inspired by the region enhancement characteristic, we adopt the spatial attention module in our formulated network to tackle the misalignment problem for person Re-ID.
Given the original feature map $V^{l} \in \mathbb{R}^{h \times w \times c}$, the spatial attention module aims at calculating a probability map that indicates the attention probabilities of all local regions. The spatial attention weighted feature can be formulated as
$$f^{att}_{i,j} = \alpha_{i,j} \, f^{V}_{i,j},$$
where $f^{V}_{i,j}$ represents the feature at location $(i, j)$ in the feature map $V^{l}$, $\alpha_{i,j}$ represents the corresponding attention probability at the same location, and $f^{att}_{i,j}$ denotes the learned spatially attended feature at $(i, j)$.

Fig. 3. The adopted CNN network structure. ResNet-50 is adopted as the base CNN network for the Re-ID task. For the last three conv-layers res5a, res5b and res5c, the attention module is applied to obtain attention-based feature maps. After pooling, the mid-level and high-level features are concatenated for final classification.
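As one concrete, assumed instantiation of the attention-learning function, the sketch below scores every location of $V^{l}$ with a 1x1 convolution and normalizes the scores into a spatial probability map. The paper does not commit to this particular scoring head, so treat the choice as illustrative.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of one possible attention-learning function F_att: score every
    spatial location of V^l and normalize the scores into a probability map."""

    def __init__(self, channels):
        super().__init__()
        # A 1x1 convolution scores each location from its c-dimensional feature;
        # this particular scoring head is our assumption, the paper only requires
        # that the weights are learned from V^l itself.
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, v):                        # v: (batch, c, h, w)
        b, _, h, w = v.shape
        scores = self.score(v).view(b, h * w)    # one score per spatial location
        alpha = torch.softmax(scores, dim=1)     # probabilities over all h*w locations
        return alpha.view(b, 1, h, w)            # broadcastable against v

# Usage: f_att[i, j] = alpha[i, j] * f_V[i, j], applied to a toy feature map.
attn = SpatialAttention(channels=256)
v = torch.randn(2, 256, 10, 4)                   # hypothetical V^l
a = v * attn(v)                                  # attention-weighted map A^l
```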

Multi-layer Fusion
We choose ResNet-50 [40] as the base CNN network for the person Re-ID task. Figure 3 shows the network structure of our proposed method. In ResNet-50, res5c is the last convolutional layer and contains the high-level image representation. To take full advantage of mid-level semantic information, we also select layers res5a and res5b as a supplement for a better description. The feature maps of these layers are denoted as $V_{res5a}$, $V_{res5b}$ and $V_{res5c}$. Instead of fusing the original feature maps directly, we combine the attention-weighted maps produced by the attention modules, denoted as $A_{5a}$, $A_{5b}$ and $A_{5c}$.
By adopting an appropriate pooling scheme, feature vectors are constructed from the multiple attention-weighted layers. For the maps $A_{5a}$, $A_{5b}$ and $A_{5c}$ corresponding to the selected layers, we leverage global average pooling to generate the vectors $f_{5a}$, $f_{5b}$ and $f_{5c}$. The concatenation of the mid-level feature vectors $f_{5a}$ and $f_{5b}$ is passed through a fully connected layer, which reduces the feature dimension. After that, the generated mid-level vector $f_{mid}$ is fused with $f_{5c}$ to obtain the final feature vector for classification.
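A minimal sketch of this fusion head is given below, assuming 2048-channel maps from res5a/res5b/res5c (as in ResNet-50), a hypothetical 512-dimensional mid-level vector after the fully connected reduction, and a linear classifier on the fused descriptor; the exact dimensions are not specified in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLayerFusion(nn.Module):
    """Sketch of the fusion head: global-average-pool the attention-weighted
    maps A_5a, A_5b, A_5c, reduce the concatenated mid-level vectors with a
    fully connected layer, then fuse with the high-level vector f_5c."""

    def __init__(self, num_classes, channels=2048, mid_dim=512):
        super().__init__()
        # res5a/res5b/res5c of ResNet-50 all emit 2048-channel maps; the
        # 512-dimensional mid-level size is an assumption, not from the paper.
        self.reduce_mid = nn.Linear(2 * channels, mid_dim)      # [f_5a, f_5b] -> f_mid
        self.classifier = nn.Linear(mid_dim + channels, num_classes)

    def forward(self, a5a, a5b, a5c):       # attention-weighted maps, (b, c, h, w)
        f5a = F.adaptive_avg_pool2d(a5a, 1).flatten(1)          # (b, c)
        f5b = F.adaptive_avg_pool2d(a5b, 1).flatten(1)
        f5c = F.adaptive_avg_pool2d(a5c, 1).flatten(1)
        f_mid = self.reduce_mid(torch.cat([f5a, f5b], dim=1))   # (b, mid_dim)
        feat = torch.cat([f_mid, f5c], dim=1)                   # final descriptor
        return feat, self.classifier(feat)

# Toy usage: 750 training identities, as in the Market-1501 split used below.
fusion = MultiLayerFusion(num_classes=750)
maps = [torch.randn(2, 2048, 7, 3) for _ in range(3)]
feat, logits = fusion(*maps)
```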

Experiments
In this section, we evaluate the performance of our proposed attention-based multi-layer integration method and validate the effectiveness of the multi-layer fusion strategy. The assessment is conducted on three benchmark datasets in comparison with state-of-the-art baseline algorithms.

Market-1501 Dataset
The Market-1501 dataset [22] contains 1,501 pedestrians captured by 6 non-overlapping cameras, 5 of high resolution and 1 of low resolution. There are altogether 32,668 annotated bounding boxes, detected with the DPM (Deformable Part Model) pedestrian detector. In the experiments, we use the provided split, with 750 identities in the training set and 751 in the testing set.

DukeMTMC-reID Dataset
The DukeMTMC-reID dataset [23] involves 1,404 identities and 36,411 bounding-box images. All images are cropped from video frames recorded by 8 different high-resolution cameras. The whole set is partitioned into two parts, with 702 identities forming the training set and the other 702 identities forming the testing set.

CUHK03 Dataset
The CUHK03 dataset [9] consists of 13,164 images of 1,360 individuals captured by 6 cameras. Part of the images are manually cropped and the rest are detected by DPM. There are 767 individuals in the training set and 700 identities in the test set.

Experimental Setup
In our formulated network, we adopt the widely used ResNet-50 [40] as the base CNN network. The method is implemented with the deep learning framework PyTorch. All pedestrian images are resized to 160×64, which keeps the human body aspect ratio approximately undistorted. For optimization, we choose the Adam optimizer; the initial learning rate and the decay factor are set to 0.0005 and 0.95, respectively. We set the batch size to 32 and the maximum number of training epochs to 100. For performance evaluation, we use the standard CMC (Cumulative Matching Characteristic) curve, which represents the correct identification rate at each rank, and mAP (mean Average Precision), which measures the overall retrieval precision. The whole attention-based framework is trained in an end-to-end manner.
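For reference, the sketch below computes CMC and mAP from a probe-by-gallery distance matrix in a simplified single-shot form; the official Market-1501 protocol additionally removes same-camera and junk gallery entries per probe, which is omitted here, and the function name and array layout are our assumptions.

```python
import numpy as np

def cmc_and_map(dist, probe_ids, gallery_ids, max_rank=20):
    """Simplified CMC / mAP from a probe-by-gallery distance matrix.

    dist:        (num_probe, num_gallery) numpy array of distances
    probe_ids:   (num_probe,)  identity labels of the probe images
    gallery_ids: (num_gallery,) identity labels of the gallery images
    """
    cmc = np.zeros(max_rank)
    aps = []
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])                      # closest gallery first
        matches = gallery_ids[order] == probe_ids[i]     # boolean hit list
        if not matches.any():
            continue                                     # no valid match for this probe
        first_hit = np.argmax(matches)                   # rank of first correct match
        if first_hit < max_rank:
            cmc[first_hit:] += 1
        hits = np.cumsum(matches)
        precision = hits / (np.arange(matches.size) + 1)
        aps.append((precision * matches).sum() / matches.sum())
    num_valid = len(aps)
    return cmc / num_valid, float(np.mean(aps))
```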

Evaluation of Multi-layer Attention
In this section, we investigate the improvement brought by our multi-layer attention fusion mechanism. We conduct experiments by reducing the number of layers followed by the attention module. For the adopted ResNet-50 network, the 1-st, 2-nd and 3-rd layers refer to res5a, res5b and res5c, respectively. Table 1 shows the experimental results. In the table, 1-layer refers to applying the attention module to the res5c layer only, 2-layer represents the fusion of learned attention features on res5b and res5c, and 3-layer denotes our full approach. As can be observed from table 1, compared to applying the attention module only to the last convolutional layer, the 2-layer fusion method gains 3.5% for Rank-1, 2.3% for Rank-5, 2.8% for Rank-10, 0.9% for Rank-20 and 3.2% for mAP on Market-1501. The 3-layer fusion method brings a further increase of 0.5% in Rank-1 accuracy. The experiments indicate that the multi-layer attention mechanism has a positive effect on image representation.

Evaluation on Market-1501
We compare the results of our method with recent state-of-the-art methods on the Market-1501 dataset, as shown in table 2. The compared algorithms include feature extraction methods (CRAFT [28], GLAD [41], Zhao et al. [42]), metric learning methods (SCSP [43], DNS [26]) and deep network-based methods (CAN [37], PIE+Kissme [44], PDC [45]). As can be observed from table 2, our method outperforms the second-best algorithm, PDC, by 3.0% in Rank-1 accuracy and 6.7% in mAP. Among all the compared approaches, CAN and PDC also use attention mechanisms for the person Re-ID task. The statistics indicate that our attention-based deep network outperforms other approaches using attention information. It is worth noting that PIE also uses ResNet-50 as the base network architecture, with a part-aligned representation based on a pose estimator. Our proposed method surpasses PIE by 8.4% in Rank-1 accuracy. These statistics show the advantage of our multi-layer attention model over the state-of-the-art baselines.

Evaluation on DukeMTMC-reID
Table 3 indicates the superiority of our multi-layer attention integration method compared to the state-of-the-art on the DukeMTMC-reID dataset. Compared to Market-1501, DukeMTMC-reID has more complex scenes and more background variation. Except for LOMO+XQDA [2], all the other compared approaches, i.e. SVDNet [46], APR [47] and PAN [48], are built on deep learning networks. On DukeMTMC-reID, we achieve a Rank-1 accuracy of 78.8% and an mAP of 60.0%, surpassing the second-best method SVDNet (ResNet-50) by 0.1% and 3.2%, respectively. This suggests that the attention mechanism plays a significant role in feature representation and that our proposed network adapts well to complex scenes.

Evaluation on CUHK03
Table 4 shows the CMC rank accuracies and mAP of several recently proposed algorithms on the CUHK03 dataset. We conduct the evaluation on both manually cropped bounding boxes and automatically detected images. Compared to the manually labelled images, the detected ones suffer from more misalignment, presenting a more difficult task. In our evaluation, the compared counterparts include HA-CNN [39], LOMO+XQDA [2], LOMO+XQDA+re-rank [49], Ahmed et al. [7] and IDE+Re-rank [49]. HA-CNN is a harmonious network which combines soft attention at the pixel level and hard attention at the region level, exploiting the attention mechanism in a deep learning network. Our attention method achieves Rank-1 = 59.2%, mAP = 57.9% for the manually labelled images, and Rank-1 = 56.3%, mAP = 54.8% for the detected images, showing the superiority of the adopted attention learning approach over other attention-driven mechanisms. Despite the disparities among these three Re-ID benchmark datasets, i.e. scenes, camera views and image processing methods, our proposed approach consistently improves Re-ID performance. The experimental results demonstrate the effectiveness of our attention-based framework.

Conclusion
In this paper, we present a novel attention-based method for the person re-identification task. The whole framework is formulated in an end-to-end manner. Our network benefits from both the attention mechanism and the multi-layer integration mechanism. For attention selection, we explore spatial attention over different local regions; the introduced attention module takes full advantage of the complementary attention information. For the multi-layer fusion module, the concatenation of mid-level and high-level features yields a better image representation, and a detailed analysis of the feature fusion scheme on multiple layers is provided. We conduct the performance evaluation on three benchmark Re-ID datasets, i.e. Market-1501, DukeMTMC-reID and CUHK03. Experiments validate the effectiveness of our proposed model when compared with state-of-the-art approaches.