Fusion of Deep Features and Weighted VLAD Vectors based on Multiple Features for Image Retrieval

Abstract. In the traditional vector of locally aggregated descriptors (VLAD) method, the final VLAD vector is formed by summing the residuals between each descriptor and its corresponding visual word. The norm of these residuals varies significantly, which can cause "visual bursts". This happens because the descriptors do not contribute equally to the VLAD vector. To address this problem, we add a different weight to each residual so that the contributions of the descriptors to the VLAD vector become more even. In addition, the traditional VLAD method uses only local gradient features of images and therefore has low discriminative power. In this paper, local color features are also extracted and used in the VLAD method. Moreover, we fuse deep features with the multiple VLAD vectors based on local gradient and color information. To reduce running time and improve retrieval accuracy, PCA and whitening operations are applied to the VLAD vectors. Our proposed method is evaluated on three benchmark datasets, i.e., Holidays, UKbench and Oxford5k. Experimental results show that it achieves good performance.


Introduction
In this paper we consider the task of large-scale image retrieval. In the past few years, the Bag-of-Visual-Words (BOW) [1] [2] method has achieved great success in the image retrieval area. Generally, a relatively large vocabulary is required to ensure retrieval recall, which leads to long retrieval times and high memory consumption.
Recently, Jégou et al. [3] proposed the vector of locally aggregated descriptors (VLAD) model, which aggregates descriptors based on a locality criterion in feature space. In fact, VLAD is a kind of non-probabilistic Fisher vector representation, and its implementation is very similar to the BOW model. VLAD is also very cheap in both time and memory. In the traditional VLAD method, the final VLAD vector is formed by summing the residuals between each descriptor and its corresponding visual word. The norm of these residuals varies significantly, which can cause "visual bursts" [4]. To address this problem, we add a weight to each residual so that the contributions of the descriptors to the VLAD vector become more even.
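For concreteness, the standard (unweighted) VLAD aggregation can be sketched as follows. This is a minimal NumPy sketch, not the authors' implementation; the function and variable names are illustrative:

```python
import numpy as np

def vlad(descriptors, centroids):
    """Standard VLAD: sum, per visual word, the residuals of the
    descriptors assigned to that word, then L2-normalize.

    descriptors: (n, d) local descriptors of one image
    centroids:   (K, d) visual vocabulary (e.g. from k-means)
    Returns a vector of length K * d.
    """
    K, d = centroids.shape
    # assign each descriptor to its nearest visual word
    dists = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
    nearest = np.argmin(dists, axis=1)
    v = np.zeros((K, d))
    for i, k in enumerate(nearest):
        v[k] += descriptors[i] - centroids[k]  # residual x - c_k
    v = v.ravel()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```

Note that descriptors with large residual norms dominate the sums; this is the imbalance the weighting scheme below is meant to reduce.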
Originally, SIFT [5] descriptors were adopted in the VLAD method and showed good performance. As is well known, the SURF descriptor [6] is faster than the SIFT descriptor, and the performance of SURF and SIFT is comparable in most cases. [7] verified that the SURF descriptor is not only more efficient but also more accurate than the SIFT and rootSIFT descriptors. However, both SIFT and SURF represent only local gradient information and miss important color information. Many works therefore combine gradient and color information. For example, [7] proposed the CSURF feature, which adds color information to SURF; [8] fused the CLOG [9] features and SURF features at the similarity-measurement stage; and [10] proposed "color-SURF" descriptors, which combine SURF with approximate color local kernel histograms. In this paper, color names (CN) [11] and SURF features are used in the VLAD method.
In recent years, deep features have become popular in many tasks, such as image classification [12], object detection [13] and speech recognition [14]. In this paper, we adopt pre-trained networks to obtain deep features of images. To improve retrieval accuracy, the multiple VLAD vectors and the image representation based on deep features are fused.

Framework of our proposed method
The framework of our proposed method is shown in Fig. 1. Vocabularies based on local features (SURF and CN) are trained on an independent training dataset. For an image, SURF and CN features are extracted and quantized on the corresponding vocabularies. We improve the traditional VLAD method by adding a weight to each residual, which we call "weighted VLAD". The CN and SURF features are each fed into the weighted VLAD method, yielding weighted VLAD vectors for the two feature types. Moreover, deep features are extracted from the image, and an image representation based on deep features is computed. The vectors are then fused into a single vector that represents the image. Finally, similarity scores between the query image and the dataset images are measured, and the retrieval results are returned.
The residuals between each visual word and the descriptors quantized to it are computed and summed: for visual word c_k, the aggregated residual is v_k = Σ_{x: NN(x)=c_k} (x − c_k). Concatenating the v_k gives a vector of length L1 = K × d, which is called the VLAD vector (step (3)). (4) Power-law normalization is applied to the vector obtained in step (3). It contains two steps: first, a signed power, f(z) = sign(z)·|z|^α; then L2 normalization. Step (3) may cause the "visual burst" phenomenon, because the contribution of each descriptor to the VLAD vector is not the same, i.e., the closer a descriptor is to its cluster center, the greater the contribution, and vice versa. In the similarity-measurement stage this imbalance is reflected directly in the contributions, since the Euclidean distance is used. To address the problem, we add a different weight to each residual. The weight is set to the normalized distance between the descriptor and its nearest visual word, denoted by Eq. (1) and Eq. (2): the smaller (greater) the distance between the descriptor and its nearest visual word, the smaller (greater) the weight. The validity of this choice is verified in the experimental section (Section 3). It should be noted that in [4] the same weight is added to all residuals of a visual word, whereas our algorithm assigns a different weight to each residual, which makes it more flexible and adaptive for image retrieval.
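The weighted aggregation described above can be sketched as follows. Since Eq. (1) and Eq. (2) are not reproduced here, normalizing each descriptor-to-word distance by the maximum distance is an assumption standing in for the paper's exact normalization; all names are illustrative:

```python
import numpy as np

def weighted_vlad(descriptors, centroids, alpha=0.5):
    """Weighted VLAD sketch: each residual is scaled by the normalized
    distance between the descriptor and its nearest visual word, then
    power-law and L2 normalization are applied."""
    K, d = centroids.shape
    dists = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
    nearest = np.argmin(dists, axis=1)
    nearest_dist = dists[np.arange(len(descriptors)), nearest]
    # per-descriptor weight: small distance -> small weight
    # (max-normalization is an assumed stand-in for Eq. (2))
    w = nearest_dist / (nearest_dist.max() + 1e-12)
    v = np.zeros((K, d))
    for i, k in enumerate(nearest):
        v[k] += w[i] * (descriptors[i] - centroids[k])  # weighted residual
    v = v.ravel()
    # step (4): signed power-law normalization, then L2
    v = np.sign(v) * np.abs(v) ** alpha
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```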
In order to improve retrieval accuracy, deep features are extracted using pre-trained deep convolutional neural network (CNN) models; the length of this representation is L3.

Feature fusion and similarity measurement
For an image, the three image representations are fused into one vector, denoted as f = [λ1 f1, λ2 f2, λ3 f3], where λ1, λ2, λ3 are the weight parameters and f1, f2, f3 are the wVLAD-SURF, wVLAD-CN and CNN-based representations, respectively. Euclidean distance is used to compute similarities between the query image and the dataset images. To reduce running time, we adopt the PCA and whitening method, which also suppresses the co-occurrence problem during dimensionality reduction [15].
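A minimal sketch of the PCA-whitening and fusion steps follows. The weighted-concatenation form of the fusion and all function names are illustrative assumptions, not the authors' code:

```python
import numpy as np

def pca_whiten(X, out_dim=128, eps=1e-8):
    """Learn a PCA + whitening projection from a matrix X of shape (n, D).
    Returns the mean and a (D, out_dim) projection matrix; applying
    (x - mean) @ P yields a decorrelated, variance-equalized vector."""
    mean = X.mean(axis=0)
    Xc = X - mean
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:out_dim].T / (S[:out_dim] + eps)  # divide by singular values to whiten
    return mean, P

def fuse(f1, f2, f3, lam=(1.0, 1.0, 1.0)):
    """Weighted concatenation of the three representations, then L2
    normalization so Euclidean distance behaves sensibly."""
    f = np.concatenate([lam[0] * f1, lam[1] * f2, lam[2] * f3])
    return f / (np.linalg.norm(f) + 1e-12)
```

Given the fused vectors, dataset images can be ranked simply by `np.linalg.norm(dataset_vectors - query_vector, axis=1)`.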
Our proposed algorithm is summarized as follows: (1) train SURF-based and CN-based vocabularies on an independent dataset; (2) extract SURF and CN features from each image and compute the corresponding weighted VLAD vectors; (3) extract deep features with a pre-trained CNN; (4) apply PCA and whitening to the VLAD vectors; (5) fuse the three representations into one vector; (6) rank the dataset images by their Euclidean distance to the query.

Experimental results
In this section we verify our proposed method on three benchmark datasets, i.e., Holidays [16], UKbench [17] and Oxford5k [18]. Paris60k [19] is used to train the vocabularies for Oxford5k; the vocabularies for the other datasets are trained on Mirflickr25k [20]. All experiments are run on a computer with 8GB memory and a 3.3GHz CPU (Intel(R) Core(TM) i5-4590).

Selection of parameters
In our experiments, dense SURF descriptors are extracted from each image, and each CN descriptor is computed on an image patch of size 4 × 4. CN-based and SURF-based vocabularies of size 64 are used. CNN features are obtained with the VGG-f model [21]: for each image, the CNN-based representation is taken from the second fully-connected layer, giving a 4096-D vector.
The parameter α is the power applied to the absolute values of the VLAD vector during power-law normalization. We find that the best value of α lies between 0.1 and 0.6. The accuracies of weighted VLAD with different α and different features on the three datasets are shown in Fig. 2.
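The effect of α can be illustrated with a small sketch: a smaller α damps large components more strongly, compressing the dynamic range of the VLAD vector and thus suppressing bursty contributions:

```python
import numpy as np

def power_law(v, alpha):
    """Signed power-law normalization followed by L2 normalization.
    Large components are damped more strongly for smaller alpha."""
    v = np.sign(v) * np.abs(v) ** alpha
    return v / np.linalg.norm(v)

# A component 100x larger than another is only 10x larger after alpha = 0.5.
u = power_law(np.array([100.0, 1.0]), alpha=0.5)
```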

Effectiveness of weighted VLAD
In this subsection, we verify the effectiveness of the proposed weighted VLAD model. Fig. 3 shows two examples on the UKbench dataset; the results of wVLAD-SURF are better than those of traditional VLAD-SURF. In addition, we compare our weighted VLAD with VLAD-LCR-RN [4] in Table 1.
On Holidays, the results are the same, but our vector is only about half the length of the VLAD-LCR-RN vector. On Oxford5k, when the vectors are reduced to 128-D, wVLAD-SURF achieves a better result.

Fusion of multiple features
In the experiments, wVLAD-SURF and wVLAD-CN are fused, denoted wVLAD-SURF+wVLAD-CN. The multiple weighted VLAD vectors and the deep features are fused into one vector, denoted wVLAD-SURF+wVLAD-CN+CNN. Table 2 lists the retrieval accuracies on the different datasets, where L = 128 denotes the length of the VLAD vectors after the PCA and whitening operations. When L = 128, on Holidays, the mAP of wVLAD-SURF+wVLAD-CN increases by nearly 13% compared with wVLAD-SURF and wVLAD-CN individually, and the mAP of wVLAD-SURF+wVLAD-CN+CNN reaches 0.8306. On UKbench and Oxford5k, the N-S score and the mAP of wVLAD-SURF+wVLAD-CN+CNN reach 3.6916 and 0.4322, respectively. Thus, feature fusion clearly improves retrieval accuracy. We compare our method with various other methods in Table 3. In particular, [22] embedded hand-crafted features such as SIFT with a scheme called triangulation embedding. The mAP of wVLAD-SURF+wVLAD-CN is higher than those of the other methods on Holidays. We also fuse VLAD vectors and deep features; [23] proposed the multi-scale orderless pooling (MOP-CNN) scheme, which combines deep features and VLAD. The results compared with [23] are listed in Table 3; our method achieves better results on all three datasets.

Conclusion
In the traditional VLAD method the contribution of each descriptor to the VLAD vector is not the same, which results in the visual burst phenomenon. To address this problem, we added a different weight to every residual to balance the contributions of the descriptors to the VLAD vector. SURF features describe the local gradient information of an image, while CN features represent local color information. To improve image retrieval accuracy, we therefore proposed a simple and effective method that fuses the proposed weighted VLAD vectors based on local gradient and local color features. To improve accuracy further, deep features are extracted and fused with the multiple weighted VLAD vectors, and PCA and whitening operations are adopted to reduce running time. Finally, our experiments show better results than the compared methods.