Multi-scale fusion and non-local attention mechanism based salient object detection

With the development of deep learning, research in the field of computer vision is attracting increasing attention. As a pre-processing operation for many visual tasks, a saliency model should keep its architecture simple. This paper proposes a new multi-scale fusion network that enriches high-level information through an enlarged receptive field. Under the guidance of an attention mechanism, the framework captures more effective spatial and channel correlations. Short connections between the high-level features and the features at each level transmit contextual information. The model can be applied to a variety of complex scenes for end-to-end image detection, with a simple structure and strong versatility. Experimental results on multiple common datasets show that the proposed model achieves better performance, both in visual quality and in accuracy, for small-object and multi-target detection.


Introduction
Salient object detection aims to identify the visually distinctive objects or regions in an image and then distinguish them from their surroundings [1]. While conventional convolutional neural networks have achieved excellent performance on salient object detection, several problems still leave them suboptimal.
During feature extraction in a convolutional neural network, the image resolution is generally reduced by repeated pooling operations. High-level features matter for semantically accurate salient object detection; however, the quality of the prediction cannot be guaranteed by up-sampling alone.
For an end-to-end model, prediction guided by an attention module performs better. Building on the result of dilated multi-scale fusion, the proposed attention module refines the non-local network so that the spatial and channel dimensions of each layer can be enhanced separately, in a manner similar to ResNet. Multi-scale fusion enhances the spatial features without requiring pre-processed images of flexible shapes. Meanwhile, the non-local attention mechanism extracts more effective features along the separate spatial and channel dimensions.
In this paper, we propose a novel deep neural network for salient object detection: a multi-scale fusion network based on non-local attention with short connections (FNASNet). Previous work has shown that the multi-scale fusion mechanism and the attention mechanism contribute greatly to extracting flexible features. FNASNet adopts multi-scale fusion with dilation for a more complete extraction of global and local features. In addition, the non-local attention module infers global features along two separate dimensions, spatial and channel, making the network concentrate on more effective information. Short connections transfer features from high-level to low-level layers, which helps the network capture context.

Network overview
Based on VGG-16, the multi-scale fusion module extracts multi-scale contextual features from images under the guidance of the non-local attention module, which selects efficient features along the spatial and channel dimensions. See Figure 1.

Multi-scale fusion module
This network adopts multi-scale fusion within the convolutional block, using a 1×1 kernel to fuse the multi-scale features after dilation. Multi-scale features of an image can be cascaded together using different kernels. Figure 2 shows the multi-scale fusion module adopted in this paper. Inspired by dilation in morphology [2], the network increases redundancy by applying a dilation strategy on top of the extracted high-level features, compensating for the loss of high-level detail while avoiding the introduction of so much redundant information that it would affect the prediction. Morphological dilation ($\oplus$) is implemented by a specific multi-scale convolution, written as

$$f^{(i)} = x^{(i)} \oplus k^{(i)} \tag{1}$$

where $x^{(i)}$ represents the $i$-th feature map, $k^{(i)}$ is the dilation kernel of the $i$-th layer, and the dilation rate is set to 1, 3, 5, 7.
$F^{(i)}$ is used to represent the multi-scale fusion features, which can be obtained from equation (2):

$$F^{(i)} = f^{(i)} \circ k_m^{(i)} \tag{2}$$

where $\circ$ represents the morphology operation; when $i = 6$, this operation acts as dilation. $k_m^{(i)}$ represents a multi-scale fusion kernel.
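As a rough illustration of the idea above, the following NumPy sketch applies 3×3 convolutions at the dilation rates 1, 3, 5 and 7 to a single-channel feature map and fuses the branches with 1×1-style weights. The function names and the uniform fusion weights are our own assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def dilated_conv2d(x, kernel, rate):
    """3x3 convolution with dilation `rate` and 'same' zero padding (single channel)."""
    H, W = x.shape
    pad = rate  # a dilated 3x3 kernel reaches `rate` pixels from its centre
    xp = np.pad(x, pad, mode="constant")
    out = np.zeros_like(x, dtype=float)
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * xp[i * rate : i * rate + H, j * rate : j * rate + W]
    return out

def multi_scale_fusion(x, kernels, rates=(1, 3, 5, 7), fusion_weights=None):
    """Run one dilated branch per rate, then fuse with 1x1-style weights."""
    branches = [dilated_conv2d(x, k, r) for k, r in zip(kernels, rates)]
    stacked = np.stack(branches, axis=0)  # branches stacked along a pseudo-channel axis
    if fusion_weights is None:            # uniform 1x1 fusion weights (an assumption)
        fusion_weights = np.full(len(branches), 1.0 / len(branches))
    return np.tensordot(fusion_weights, stacked, axes=1)
```

In a real network the per-branch kernels and the 1×1 fusion weights would be learned; here they are passed in explicitly to keep the sketch self-contained.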

Non-local attention module
All pixel positions and channels are processed with a softmax operation over the feature map, giving effective guidance for assigning global contextual attention to each pixel. Figure 3 shows the non-local spatial and channel attention module. The non-local attention module generates an attention map over each pixel and channel within its context region and constructs a contextual feature with attention, enhancing the representational power of the network.
At the $i$-th feature map, the obtained feature vectors for spatial and channel, denoted $x_{w,h} \in \mathbb{R}^{W \times H}$ and $x_c$, $c \in C$, are passed through a softmax function to generate the global attention weights $a_{w,h}$ and $a_c$ for spatial and channel, respectively:

$$a_{w,h} = \frac{\exp(x_{w,h})}{\sum_{w'=1}^{W}\sum_{h'=1}^{H} \exp(x_{w',h'})}, \qquad a_c = \frac{\exp(x_c)}{\sum_{c'=1}^{C} \exp(x_{c'})}$$

where $C$ represents the number of channels for each pixel $(w, h)$. For the pixel $(w, h)$, the features at all locations are weighted by $a_{w,h}$ to construct the attended spatial contextual feature $y_{w,h}$. For the channel $c$, the features at all channels are weighted by $a_c$ to construct the attended channel contextual feature $y_c$. Finally, the non-local attention at each spatial and channel position is the sum $y = y_{w,h} + y_c$.
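A minimal NumPy sketch of this spatial/channel attention idea follows; it is our own simplification rather than the paper's exact module. Softmax weights are computed over positions and over channels, the input is reweighted by each, and the two attended features are summed.

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax over a flat vector."""
    e = np.exp(v - v.max())
    return e / e.sum()

def non_local_attention(x):
    """x: feature map of shape (C, H, W).
    Spatial branch: softmax over all H*W positions of the channel-averaged map.
    Channel branch: softmax over the C spatially-averaged responses.
    Returns the sum of the two attended contextual features (same shape as x)."""
    C, H, W = x.shape
    a_spatial = softmax(x.mean(axis=0).ravel()).reshape(H, W)  # weights sum to 1 over positions
    a_channel = softmax(x.mean(axis=(1, 2)))                   # weights sum to 1 over channels
    y_spatial = x * a_spatial[None, :, :]     # attended spatial contextual feature
    y_channel = x * a_channel[:, None, None]  # attended channel contextual feature
    return y_spatial + y_channel
```

The real module would derive the attention logits through learned projections; averaging over the other dimension is just the simplest stand-in that keeps the two softmax branches visible.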

Short connection method
Multi-scale responses are learned from different layers with increasingly large receptive fields, and these responses are concatenated to output the final saliency. To obtain more information, we use short connections to transmit features from high-level to low-level layers. Figure 4 shows the architecture of the short connection module.
Features are first generated via dilation-based multi-scale fusion, and information is then merged from deep layers into shallow layers to enhance the power of the network.
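The deep-to-shallow merge can be sketched as follows in NumPy. This is a simplified stand-in for the learned connections, assuming each level halves the resolution of the previous one and using nearest-neighbour upsampling plus addition as the merge.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of an (H, W) map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def short_connections(features):
    """features: list of per-level responses ordered shallow -> deep,
    each level at half the resolution of the previous one.
    Every deeper response is upsampled to the shallower level's
    resolution and added in, so each level receives context from
    all levels above it."""
    fused = [f.copy() for f in features]
    for lo in range(len(features)):
        for hi in range(lo + 1, len(features)):
            up = features[hi]
            for _ in range(hi - lo):  # upsample once per resolution gap
                up = upsample2x(up)
            fused[lo] = fused[lo] + up
    return fused
```

In the actual network the merge would involve convolutions rather than plain addition; the sketch only shows the connection pattern.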

Datasets
In the experiments in this paper, the DUTS [3] dataset was used for training; it is currently the largest public dataset for salient object detection. In addition, the method is evaluated on five benchmark datasets: ECSSD [4], PASCAL-S [5], DUT-O [6], HKU-IS [7] and SOD [8].

Evaluation metrics
The detection performance of the model was evaluated with the Mean Absolute Error (MAE) and the F-measure.
F-measure. This is an overall performance measurement computed from the Precision and Recall values, as follows:

$$F_\beta = \frac{(1+\beta^2)\,\mathrm{Precision} \times \mathrm{Recall}}{\beta^2\,\mathrm{Precision} + \mathrm{Recall}}$$

Mean Absolute Error. To analyze the similarity between the saliency map and the ground truth, and to indicate the impact of non-salient pixels, the MAE is given by the following equation:

$$\mathrm{MAE} = \frac{1}{W \times H}\sum_{x=1}^{W}\sum_{y=1}^{H} \left| S(x, y) - G(x, y) \right|$$

where $G(x, y)$ represents the ground truth value at the pixel $(x, y)$ and $S(x, y)$ is the saliency map.
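Both metrics can be computed directly. The following NumPy sketch illustrates them; the fixed threshold, the value of $\beta^2$ (0.3 is the convention in salient object detection), and the helper names are our own choices.

```python
import numpy as np

def mae(sal, gt):
    """Mean absolute error between a saliency map and the ground truth, both in [0, 1]."""
    return np.abs(sal.astype(float) - gt.astype(float)).mean()

def f_measure(sal, gt, beta2=0.3, thresh=0.5):
    """F-measure from precision/recall of the thresholded saliency map."""
    pred = sal >= thresh
    gt = gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    precision = tp / max(pred.sum(), 1)  # guard against an empty prediction
    recall = tp / max(gt.sum(), 1)       # guard against an empty ground truth
    denom = beta2 * precision + recall
    return (1 + beta2) * precision * recall / denom if denom > 0 else 0.0
```

In practice the F-measure is often reported as a maximum over many thresholds; the single-threshold version above keeps the formula visible.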

Comparisons with the State-of-the-Arts
In this section, we compare our FNASNet with previous state-of-the-art methods, including MCDL [9], DS [10], DLS [11], SBF [12] and RSDNet [13]. The results are shown in Table 1.
Quantitative Evaluation. As shown in Table 1, FNASNet constructs a new global multi-scale fusion module, and most of the comparison algorithms in the experiment are also based on multi-scale fusion. Compared with the previous algorithms, ours performs better: the model focuses on more effective information, tightening the relationship between pixels, and the F-measure is also high, with only small fluctuations in precision error, improving the detection performance of the model. The model performs well on PASCAL-S and SOD, and shows strong ability on small datasets for multi-target detection. The MCDL, DLS, SBF and RSDNet algorithms in the 1st, 3rd, 4th and 5th rows, respectively, all fuse context information with feature maps at different levels. Our algorithm uses short connections to cascade high-level features with the features at each level, enhancing the contextual information of the image and achieving good performance. The DS model in the 2nd row constructs more than one network, and, as the results show, our method achieves a better effect than the DS algorithm.
Qualitative Evaluation. Figure 5 illustrates the visual comparison of our method with other approaches. The first row shows a reflection problem in the image. The results show that MCDL and DLS perform poorly on reflection problems, and DS does not preserve the integrity of the target. SBF and the algorithm presented in this paper give good results. The second row is a case in which the object is similar in color to the background. Except for the DLS and SBF algorithms, the detection results of the other methods are relatively accurate.
The 3rd, 4th and 5th rows show the detection performance of the different algorithms on multi-target, single-target and multi-category targets, respectively. It is worth noting that the last three rows show reflection, small object detection and transparent object detection, respectively. Our algorithm has strong detection performance for small objects and the reflection problem, but on transparent objects the integrity of the target is reduced to a certain extent; the other algorithms all have limitations in these situations and generally perform poorly on the reflection problem.

Summary
In order to enhance the ability of the network to extract effective features, this paper proposes a new multi-scale fusion strategy, which uses a dilation operation to increase the redundant information of high-level feature maps. Under the guidance of the attention mechanism across spatial positions and channels, the contextual feature information is effectively increased through the cascade of short connections between features at different scales. Quantitative experiments and qualitative analysis show that the proposed algorithm has high robustness and accuracy. From the detection results, it can be seen that the algorithm performs better on complex scenes, single object detection and reflection. The experimental results show that a multi-scale fusion algorithm based on an attention mechanism is highly credible and valuable for constructing an end-to-end salient object detection algorithm at arbitrary scale. In the future, on the basis of this work, more attention will be paid to detailed features, such as image edges, to achieve a greater breakthrough in the performance of the model.