Deep Supervised Hashing for Fast Multi-Label Image Retrieval

Abstract. Most existing hashing methods map hand-crafted features to binary codes and design the loss function using image labels. However, hand-crafted features and loss functions that fail to account for all sources of network error reduce retrieval accuracy. Supervised hashing methods improve the agreement between samples and hash codes by training on data together with image labels. In this paper, we propose a novel deep hashing method whose loss function combines three terms: a pairwise-label term, where the pairwise labels are derived from the Hamming distance between the binary label vectors of image pairs; a quantization-error term; and a term that keeps each hash bit balanced. Experimental results show that the proposed method is more accurate than most current retrieval methods.


INTRODUCTION
Researchers have proposed many efficient retrieval techniques over the past ten years. The most successful include tree-based image retrieval, hashing-based retrieval, whose representative method is locality sensitive hashing (LSH) [1], and retrieval based on vector quantization. Compared with the alternatives, hashing offers highly efficient Hamming-distance computation and low storage cost, so it is very popular for large-scale similar-image retrieval.
The Semantic Hashing method based on deep learning, proposed by Hinton's research group in 2009, opened the door to deep hashing [2]. Inspired by the powerful representational ability of convolutional neural networks (CNNs) [3], many researchers have since applied CNN-based deep hashing in a variety of fields. Hashing methods fall into two categories: data-independent and data-dependent. The first category randomly maps data to binary codes; typical examples are locality sensitive hashing (LSH) [4] and its modified versions. The second category learns a hash function from training data and uses it to map samples to binary codes. Compared with the first category, data-dependent methods can achieve higher accuracy with shorter codes.
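The data-independent idea behind LSH can be illustrated with random hyperplane projections; the following is a minimal numpy sketch, where the hyperplane count and feature dimension are arbitrary illustration values, not taken from the paper:

```python
import numpy as np

def lsh_hash(x, planes):
    """Random-hyperplane LSH: one bit per hyperplane, chosen without
    looking at the data (hence data-independent)."""
    return (x @ planes.T > 0).astype(np.uint8)

rng = np.random.default_rng(0)
planes = rng.standard_normal((16, 512))   # 16 random hyperplanes, 512-d features
x = rng.standard_normal(512)
code = lsh_hash(x, planes)                # 16-bit binary code
```

Because the hyperplanes are random, no training is needed, but longer codes are typically required to match the accuracy of learned (data-dependent) hash functions.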
Transforming high-dimensional image features into low-dimensional binary hash codes leaves a gap between image semantics and hash codes, so much research is still needed. In this paper, building on many previous methods, we propose a hashing method that learns from multi-labels as supervised information. The method combines deep learning with hashing to form a deep supervised hashing method for multi-label images. Its basic ideas are:
- Encode the labels of each image as a binary vector. From the Hamming distance between any pair of label vectors we obtain pairwise labels, and hence a pairwise-label matrix; this simplifies the multiple labels of an image and is more convenient as supervised information.
- Design a loss function with three components: the discrepancy between hash codes and image semantics, the quantization error incurred when image features are quantized into hash codes, and the balance of each bit between 0 and 1.
- Add a hidden layer with a sigmoid activation to the model, which pushes the input of the hashing step closer to 0 or 1.

RELATED WORKS
Hashing methods based on learning: Learning-based hashing methods can be classified into three categories: unsupervised, semi-supervised and supervised. The first category requires no label information for the training data; typical unsupervised methods include spectral hashing (SH) [5] and iterative quantization (ITQ) [6]. Supervised hashing methods use image labels as supervised information to learn more effective hash functions. Their supervised information comes in three types: pointwise, pairwise and list-wise. Pointwise and pairwise approaches convert the ranking problem into regression, classification, or ordinal classification. A typical pointwise method is supervised discrete hashing (SDH) [7]. Typical pairwise methods include sequential projection learning for hashing (SPLH) [8], fast supervised hashing (FastH) [9], latent factor hashing (LFH) [10], convolutional neural network hashing (CNNH) [11], deep supervised hashing with pairwise labels (DPSH) [12] and supervised semantics-preserving deep hashing (SSDH) [13]. Typical list-wise methods include ranking-based supervised hashing [14] and ranking preserving hashing [15].
Deep learning: Deep learning improves classification or prediction accuracy by building machine learning frameworks trained on massive data. The loss function is the key: it should be designed to cover all sources of error as far as possible, and its minimum is then found by training the network, so researchers design the loss function according to their model. Deep semantic ranking hashing (DSRH) [16] uses the ordering of image relevance, from most to least relevant, as supervised information. Deep similarity comparison hashing (DSCH) [17] replaces the standard Hamming distance with a weighted Hamming distance.

MODEL AND LEARNING
Recently, the hashing method CNNH showed that binary codes can be predicted with a convolutional neural network (CNN) and a cross-entropy loss, improving on earlier methods. Subsequently, deep neural network hashing (DNNH) [18] and DPSH developed end-to-end models that update the binary codes from the learned image representation and better exploit the power of deep learning. This section introduces our model, an end-to-end framework combining feature learning and hash code learning. Let X = {x_i}_{i=1}^N be the training set of N samples, where x_i is the i-th sample in X. Assume there are K labels in total, and let L = {l_i}_{i=1}^N be the K-bit binary label vectors of the images, where l_ik ∈ {0, 1} indicates whether the k-th label applies to the current image. We convert the multi-labels to pairwise labels. Let l_i = (l_i1, l_i2, …, l_iK) and l_j = (l_j1, l_j2, …, l_jK) be the label vectors of the i-th and j-th samples in X; the pairwise label of the i-th and j-th samples is computed as

s_ij = (sgn(σ(K/2 − dist_H(l_i, l_j)) − 0.5) + 1) / 2,

where dist_H(l_i, l_j) is the Hamming distance between l_i and l_j, and σ(z) = 1/(1 + exp(−z)) is the sigmoid function, with z a real value. The sigmoid ranges over (0, 1), so its output can express the probability that two samples are similar, and the sign function then transforms this probability into 0 or 1. If s_ij is 1, samples i and j are similar; if 0, they are not.
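The pairwise-label construction above can be sketched in numpy as follows; the K/2 threshold inside the sigmoid is an assumption made for illustration, since only the Hamming-distance input and the 0.5 sign threshold are fixed by the text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pairwise_labels(L):
    """L: (N, K) binary label vectors. Returns an (N, N) 0/1 similarity
    matrix: s_ij = 1 when the label Hamming distance is below K/2
    (the K/2 threshold inside the sigmoid is an assumption)."""
    N, K = L.shape
    dist = (L[:, None, :] != L[None, :, :]).sum(axis=2)  # pairwise Hamming distances
    prob = sigmoid(K / 2.0 - dist)          # similarity probability in (0, 1)
    return (np.sign(prob - 0.5) + 1) / 2    # threshold at 0.5 -> {0, 1}

L = np.array([[1, 0, 1, 0],     # three images, K = 4 labels each
              [1, 0, 1, 1],
              [0, 1, 0, 1]])
S = pairwise_labels(L)
```

Images 0 and 1 differ in one label bit and come out similar; image 2 differs from both in at least three bits and comes out dissimilar.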
Let S = {s_ij} be the pairwise label matrix. Figure 1 shows the end-to-end deep learning framework of this method. The framework is composed of two parts: a convolutional neural network that extracts image features and a loss function module that drives the learning.

Feature learning part
Many general frameworks, such as AlexNet and VGG, integrate deep learning with the construction of hash functions and make them convenient to use. This paper builds the feature learning part on AlexNet. The model consists of two CNNs with identical structure and shared weights. Each network has five convolutional layers (F1-F5) and two fully-connected layers (F6-F7), the same as AlexNet. In the convolutional layers, units are organized into feature maps and are connected locally to patches in the outputs of the previous layer. The fully-connected layers map the distributed feature representation to the label space and act as the classifier of the network. Layers F1, F2 and F5 are followed by max pooling, which takes the maximum value of the previous layer as output, reducing the feature dimension while preserving the important features of the image. Every convolutional and fully-connected layer uses the ReLU (rectified linear unit) activation function, which makes training faster.
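The weight sharing between the two branches can be illustrated with a toy numpy sketch: both branches apply the same weight matrix, ReLU and max pooling to their own inputs. The layer sizes here are arbitrary illustration values, far smaller than AlexNet:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """1-D max pooling over non-overlapping windows, a simplified
    stand-in for the spatial pooling in AlexNet."""
    n = (len(x) // size) * size
    return x[:n].reshape(-1, size).max(axis=1)

def branch(x, W):
    """One branch of the two-stream network; both branches call this
    with the SAME weights W, which is what weight sharing means."""
    return max_pool(relu(W @ x))

rng = np.random.default_rng(1)
W = rng.standard_normal((8, 16))            # one weight matrix, shared by both branches
xi, xj = rng.standard_normal(16), rng.standard_normal(16)
ui, uj = branch(xi, W), branch(xj, W)       # same parameters, different inputs
```

Because only one copy of W exists, a gradient step computed from either branch updates the features produced by both, which is what keeps the two streams consistent during training.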
Let θ be the weights of the feature learning part and x_i, x_j the inputs of the two networks. The outputs of F7 are u_i = φ(x_i; θ) and u_j = φ(x_j; θ), which are fed into the latent layer to predict the hash function from the pairwise label matrix.

Loss Function part
Learning consists of designing an appropriate loss function and then optimizing the network weights by stochastic gradient descent (SGD). As shown in Figure 1, the loss function is made up of two parts: the target loss function and the loss function on the hash codes. The target loss function determines whether two images are similar from the Hamming distance between their hash codes. The hash-code loss is designed to generate effective hash codes and itself contains two parts: the deviation of the hash function's input from 0.5, and the probability of each hash bit being 0 or 1. The F7 outputs u_i and u_j enter the loss function part, whose first layer is a fully-connected layer with a sigmoid activation that pushes the output closer to 0 or 1.

The Target Loss Function
The output of the sigmoid latent layer for input u_i is a_i = σ(W^T u_i + v), where W is the weight matrix between F7 and the latent layer and v is the bias. The binary output of the latent layer is

b_i = (sgn(a_i − 0.5) + 1) / 2.    (3)

The hash codes of all images are collected into the binary matrix B = {b_i}_{i=1}^N. Because discrete optimization is NP-hard, the target function with pairwise labels can be defined as in LFH, using a_i = σ(W^T u_i + v) in place of b_i:

J1 = −∑_{s_ij∈S} (s_ij Θ_ij − log(1 + exp(Θ_ij))),  with Θ_ij = (1/2) a_i^T a_j,    (4)

and the equality constraint between b_i and a_i can be added to (4) as a regularization term:

J1' = J1 + η ∑_{i=1}^N ||b_i − a_i||^2.    (5)
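A minimal numpy sketch of the latent layer, the sign binarization, and an LFH-style pairwise term; the similarity Θ_ij = (1/2) a_i·a_j and all dimensions are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def latent_layer(u, W, v):
    """a = sigmoid(W u + v): activations squeezed toward (0, 1)."""
    return sigmoid(W @ u + v)

def binarize(a):
    """b = (sgn(a - 0.5) + 1) / 2: the final hash bits in {0, 1}."""
    return (np.sign(a - 0.5) + 1) / 2

def pairwise_loss(ai, aj, s_ij):
    """LFH-style negative log-likelihood of a pairwise label, with
    theta = 0.5 * ai . aj (the exact form of theta is an assumption)."""
    theta = 0.5 * ai @ aj
    return np.log1p(np.exp(theta)) - s_ij * theta

rng = np.random.default_rng(2)
W, v = rng.standard_normal((12, 8)), rng.standard_normal(12)  # 8-d feature -> 12 bits
u = rng.standard_normal(8)      # stand-in for an F7 feature vector
a = latent_layer(u, W, v)       # relaxed code in (0, 1)
b = binarize(a)                 # discrete code in {0, 1}
```

Minimizing the pairwise term drives Θ_ij up for similar pairs (s_ij = 1) and down for dissimilar ones, which is exactly how the relaxed codes come to reflect the pairwise labels.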

The Loss Function about Hashing Code
Besides the target loss function, we need to optimize the effectiveness of the hash codes themselves. This optimization consists of two parts: the quantization error between the sigmoid output and the hash code, and a term that keeps the hash codes balanced.
The hash code is a binary code generated from the image features through the latent layer, but the features range widely, so optimizing the quantization error makes the sigmoid output closer to 0 or 1. The first part of the hash-code optimization therefore maximizes the sum of squared deviations of the latent layer activations from 0.5, that is, ∑_{i=1}^N ||a_i − 0.5||^2. To keep the hash codes balanced, ideally half of the bits of every hash code are 0 and the other half are 1. We optimize the model by SGD, which randomly divides the training data into mini-batches, and it is difficult for each batch to keep the hash codes balanced. Given an image, a_i defines a discrete probability distribution on {0, 1}. We want the probabilities of randomly generating 0 and 1 to be equal, so the second part of the hash-code optimization is ∑_{i=1}^N (mean(a_i) − 0.5)^2, where mean(·) is the average of all elements of the vector. This term encourages the same number of 0s and 1s in each learned code. At the same time, the minimum Hamming distance between two hash codes with equal numbers of 0s and 1s becomes 2, which makes the hash codes more separated.
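The two hash-code terms can be checked numerically. In this numpy sketch, a balanced near-binary activation vector scores better on both terms than an unbalanced, mid-range one; the example vectors are invented for illustration:

```python
import numpy as np

def quantization_term(A):
    """-sum_i ||a_i - 0.5||^2: minimizing this MAXIMIZES the distance of
    activations from 0.5, pushing them toward 0 or 1."""
    return -np.sum((A - 0.5) ** 2)

def balance_term(A):
    """sum_i (mean(a_i) - 0.5)^2: small when each code is half 0s, half 1s."""
    return np.sum((A.mean(axis=1) - 0.5) ** 2)

good = np.array([[0.05, 0.95, 0.02, 0.98]])  # near-binary and balanced (mean ~ 0.5)
bad  = np.array([[0.90, 0.80, 0.95, 0.85]])  # near-binary but all bits lean to 1
```

`good` beats `bad` on the balance term (its mean sits at 0.5) and on the quantization term (its entries sit further from 0.5), so minimizing both terms favors codes like `good`.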
Finally, the two constraints are combined so that each bit of a_i has a 50% probability of being 1 or 0:

J2 = −∑_{i=1}^N ||a_i − 0.5||^2 + ∑_{i=1}^N (mean(a_i) − 0.5)^2,    (6)

where each quantized bit takes a value p ∈ {0, 1}. Optimizing (6) covers most of the loss in the network, and by minimizing the loss function we obtain effective hash codes that match the image semantics.

Learning
The final loss function combines all the loss terms proposed above, i.e. the target loss (4) with its regularization term and the hash-code loss (6):

J = −∑_{s_ij∈S} (s_ij Θ_ij − log(1 + exp(Θ_ij))) + η ∑_{i=1}^N ||b_i − a_i||^2 + α (−∑_{i=1}^N ||a_i − 0.5||^2 + ∑_{i=1}^N (mean(a_i) − 0.5)^2),    (7)

where Θ_ij = (1/2) a_i^T a_j and α, η are trade-off weights. We optimize one parameter while the other parameters are fixed in the backward pass. The parameters to be optimized by BP learning are W, v and θ. Differentiating J first with respect to a_i and propagating the gradient through the latent layer and the network yields the gradients with respect to W, v and θ, with which W, v and θ are updated by gradient descent.
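The combined objective and one SGD step can be sketched end to end in numpy, with numerical gradients standing in for back-propagation; the weights alpha and beta, the omission of the ||b − a||^2 regularizer, and all sizes are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def total_loss(W, v, U, S, alpha=1.0, beta=1.0):
    """Pairwise term + alpha * quantization term + beta * balance term.
    alpha, beta and the exact pairwise form are assumptions."""
    A = sigmoid(U @ W.T + v)                 # (N, bits) latent activations
    theta = 0.5 * A @ A.T
    pair = np.sum(np.log1p(np.exp(theta)) - S * theta)
    quant = -np.sum((A - 0.5) ** 2)          # push activations away from 0.5
    bal = np.sum((A.mean(axis=1) - 0.5) ** 2)  # keep codes half 0s, half 1s
    return pair + alpha * quant + beta * bal

def sgd_step(W, v, U, S, lr=0.001, eps=1e-5):
    """One gradient step on W via central finite differences (real training
    would use back-propagation; this keeps the sketch self-contained)."""
    g = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        Wp = W.copy(); Wp[idx] += eps
        Wm = W.copy(); Wm[idx] -= eps
        g[idx] = (total_loss(Wp, v, U, S) - total_loss(Wm, v, U, S)) / (2 * eps)
    return W - lr * g

rng = np.random.default_rng(3)
U = rng.standard_normal((4, 6))              # four toy F7 feature vectors
S = (rng.random((4, 4)) > 0.5).astype(float)
S = np.triu(S, 1); S = S + S.T + np.eye(4)   # symmetric 0/1 pairwise labels
W, v = rng.standard_normal((8, 6)) * 0.1, np.zeros(8)
W2 = sgd_step(W, v, U, S)                    # one update on W, v and theta fixed
```

Holding v (and, in the full model, θ) fixed while stepping W mirrors the alternating optimization described above; the loss decreases after the step.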

Dataset and equipment
The CIFAR-10 dataset contains 60,000 color images of size 32×32 in 10 categories (6,000 images per category). The dataset is split into 50,000 training images and 10,000 test images. NUS-WIDE contains 269,648 images and 81 labels; it is a dataset with multiple labels per image.
In the experiments, the proposed method is compared with other methods in four categories:
- Traditional data-dependent methods with hand-crafted features (512-dimensional GIST descriptors), including SH, ITQ, SPLH, FastH, LFH and SDH.
All experiments were implemented in MATLAB using MatConvNet toolbox with Intel(R) HD Graphics 4600.

Evaluation protocols
We use two evaluation metrics to evaluate the performance of all the above methods; they capture different aspects of the hashing methods:
- Mean average precision (mAP): we rank all images by their Hamming distance to the query and compute the mAP, which measures the accuracy of the hashing methods.
- Mean training speed (MTS, images/s): how many images the method can train per second.
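The mAP metric described above can be computed as in this numpy sketch, where relevance is defined by label equality (single-label case, for simplicity; all names are illustrative):

```python
import numpy as np

def average_precision(relevant):
    """AP for one ranked list; `relevant` is a 0/1 array in rank order."""
    relevant = np.asarray(relevant, dtype=float)
    if relevant.sum() == 0:
        return 0.0
    cum = np.cumsum(relevant)                 # relevant items seen so far
    ranks = np.arange(1, len(relevant) + 1)
    return float(np.sum(relevant * cum / ranks) / relevant.sum())

def mean_ap(query_codes, db_codes, query_labels, db_labels):
    """mAP over queries: rank the database by Hamming distance to each
    query code, then average the per-query APs."""
    aps = []
    for q, ql in zip(query_codes, query_labels):
        dist = np.sum(q != db_codes, axis=1)  # Hamming distances
        order = np.argsort(dist, kind="stable")
        aps.append(average_precision(db_labels[order] == ql))
    return float(np.mean(aps))

queries = np.array([[0, 0], [1, 1]])
db = np.array([[0, 0], [0, 1], [1, 1]])
map_value = mean_ap(queries, db, np.array([0, 1]), np.array([0, 0, 1]))
```

In this tiny example every relevant database item is ranked ahead of every irrelevant one, so the mAP is perfect (1.0).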

Experimental Results
In the first experiment, we randomly select 1,000 images (100 per class) from CIFAR-10 as the query set. For the supervised methods, we randomly select 5,000 images (500 per class) as the training set. From NUS-WIDE we randomly sample 2,100 query images from the 21 most frequent labels (100 per label); for the supervised methods we randomly select 500 images per label as the training set. For the unsupervised methods, we use the remaining images of CIFAR-10 and NUS-WIDE as training sets. The results of the first experiment are shown in Table 1 and Table 2. As Table 1 shows, our method outperforms all baselines, including data-dependent methods with hand-crafted features and deep hashing methods with pairwise or triplet labels. Table 2 shows that the training speed of our method is also better than that of DPSH, on which our method is based. From Tables 1 and 2 together, our method outperforms DPSH in both mAP and training speed. Consequently, the modified framework and loss function yield more effective hash codes, speed up hash code generation, and strengthen the ability to distinguish similar from dissimilar codes.
Table 3 reports the second experiment, which compares hashing methods based on pairwise labels with those based on ranking labels. The training and query data differ from the first experiment: we randomly select 10,000 images (1,000 per class) from CIFAR-10 as the query set and use the rest for training, and we randomly select 2,100 images (100 per label) from NUS-WIDE as the query set and use the rest for training. Our method and DPSH are based on pairwise labels; the others are based on ranking labels. Our method and DPSH perform better than the other methods, and our method gives significant improvements over the state-of-the-art deep hashing methods across different code lengths.
Therefore, our method can effectively improve both the speed and the precision of image retrieval systems.

Conclusions
This paper proposed a new learning hashing method in the end-to-end model which combines the target loss function and the loss function of hashing code. Our method not only carry on feature learning and hashing code learning simultaneously, but also we can get better effective hashing code. Experiments on specific datasets show that our method improves the accuracy of image retrieval and the speed of training models compared with others.