DNS: A multi-scale deconvolution semantic segmentation network for joint detection and segmentation

Real-time semantic segmentation has become crucial in many applications such as medical image analysis and autonomous driving. In this paper, we introduce a single semantic segmentation network, called DNS, for the joint object detection and segmentation task. We take advantage of a multi-scale deconvolution mechanism to perform real-time computation. To this end, down-scale and up-scale streams are used to combine multi-scale features for the final detection and segmentation task. The proposed DNS balances both accuracy against cost and detection against segmentation performance. Experimental results on the PASCAL VOC datasets show competitive performance on the joint object detection and segmentation task.


Introduction
Semantic segmentation is a technique that assigns semantic or object-class labels to individual pixels in images [1]. It is usually cast as a pixel-wise classification problem and focuses on the connection between semantics and location. The natural approach in semantic segmentation is to use global information to resolve what is present and local information to resolve where it is [2].
Despite the attention it has received, semantic segmentation is still limited by two factors: global context, and the interplay between labelling and the detection of object instances. Before Simultaneous Detection and Segmentation (SDS) [3] was proposed, detection and segmentation were usually treated as separate tasks. In particular, most existing semantic segmentation approaches focus on designing a single-task network for higher accuracy, which makes them difficult to extend to other types of tasks. The effectiveness of these convolutional nets largely depends on sophisticated model design in terms of depth and width, which involves many operations and parameters.
Given the remarkable progress of recent deep convolutional neural networks (DCNNs), we also adopt a DCNN for the semantic segmentation task. Moreover, using a single network to handle multiple tasks has been pursued repeatedly in the deep learning era. In [4] a DCNN is used for recognition, localization and detection; in [5] a DCNN is trained for surface normal estimation, depth estimation and semantic segmentation; and in [6] for joint detection, pose estimation and region proposal generation. Unlike these holistic approaches, we are interested in exploiting semantic segmentation to improve both object detection and segmentation performance. We use deconvolution with shared weights to combine the two tasks. To this end, our convolutional network is trained under the supervision of both bounding boxes and segmentation maps.
Meanwhile, real-time semantic segmentation has become crucial in many practical applications, which raises the fundamental difficulty of reducing the computation needed for pixel-wise label inference. Prevalent detection and segmentation pipelines remain expensive and rely on a region-based strategy that makes the network architecture ill-suited to semantic segmentation. Learning and inference in our model are efficient because we reason at the detection and segment level. We extensively evaluate the proposed model, called 'DNS', on the PASCAL detection and segmentation benchmarks. Moreover, the proposed method degrades gracefully compared with its counterpart.
Our main contributions are summarized below:
1. We focus on building a deep model for joint detection and semantic segmentation that runs at a decent speed. To this end, we introduce a multi-scale deconvolution mechanism that is direct and easy to apply. Specifically, we learn a multi-layer deconvolution network composed of a down-scale and an up-scale stream. In these streams, we combine multi-scale features instead of using multi-scale inputs, which has been shown to outperform average- and max-pooling and to achieve excellent performance.
2. The 'DNS' network makes it possible to train a single net for multiple tasks (detection and segmentation). It achieves competitive results on the PASCAL VOC benchmark. In addition to the trade-off between accuracy and inference cost, our DNS, trained only on the PASCAL VOC dataset, balances detection and segmentation performance.
The paper is organized as follows: Section 2 discusses related work; Section 3 presents our real-time multi-task framework; Section 4 reports our experiments; and Section 5 concludes the paper.

Related work
Before introducing our approach, we review techniques for both detection and semantic segmentation.
Representative methods [7][8][9][10][11] treat the semantic segmentation task as simultaneous detection and segmentation (SDS), which is introduced in [7]. Semantic segmentation has recently witnessed rapid progress, but many leading methods are unable to identify object instances. To encourage research on this problem, Multi-task Network Cascades (MNCs) [11] were proposed for instance-aware semantic segmentation. This model consists of three networks that respectively differentiate instances, estimate masks, and categorize objects. Although multi-scale CNNs and their variants have been strikingly successful at modelling the global scene structure of an image, they are limited in labelling fine-grained local structures such as pixels and patches, since spatial contexts may be blindly mixed without customizing their scales. Convolutional Feature Masking [10], connected Markov random field models [8], and Mask R-CNN [9] are also designed to address the context of object labels.
In recent years, neural networks have been driving advances in semantic segmentation, in which each pixel is labelled with the class of its enclosing region. Most convolutional versions of existing networks obtain precise segmentation from fixed-size inputs in a particular dataset. The works [12,3,13] bring together DCNN methods and traditional computer vision algorithms to address the pixel-wise segmentation problem.
'Deep CRFs' [14] use contextual information to improve semantic segmentation by combining the strength of deep CNNs in learning powerful feature representations with Conditional Random Fields (CRFs), which can model contextual relations. This method avoids repeated inference and so is computationally tractable.
Incorporating multi-scale features in fully convolutional neural networks (FCNs) [2] has been a key element in achieving state-of-the-art performance on semantic segmentation. One common way to extract multi-scale features is to feed multiple resized input images to a shared deep network and then merge the resulting features for pixel-wise classification. FCNs use large receptive fields and many pooling layers, both of which cause blurring and low spatial resolution in the deep layers. As a result, FCNs tend to produce segmentations that are poorly localized around object boundaries. Using a color-based CRF on top of the FCN prediction [13] is one way to address this issue in a post-processing step. Although post-processing the output of an FCN with a fully-connected CRF can increase segmentation accuracy near object boundaries, mean-field inference in a fully-connected CRF model is expensive in terms of both memory and CPU time. To address this, a task-specific edge detection model [15] using CNNs and a discriminatively trained domain transform has been proposed. This domain transform can equivalently be seen as a recurrent neural network (RNN), and it is a special case of the recently proposed RNN with gated recurrent units.
In addition, using a CRF on top of an FCN requires additional parameters and low-level features that are difficult to tune and to integrate into the original network architecture. To overcome these problems, the Boundary Neural Field (BNF) [16] was proposed. It is a global energy model that integrates FCN predictions with boundary cues. Further, some DNN architectures, such as the decoupled DNN [1] and the Deconvolution net [17], are designed for precise per-pixel label prediction.
While the discrete CRF is a natural fit for the labelling tasks of semantic segmentation, a new end-to-end trainable deep network, referred to as the Gaussian Mean Field (GMF) network [18], has been proposed whose layers perform mean field inference over a Gaussian CRF. The Gaussian CRF is composed of three sub-networks: a CNN-based unary network for generating unary potentials, a CNN-based pairwise network for generating pairwise potentials, and a GMF network for performing Gaussian CRF inference. This method outperforms various recent semantic segmentation approaches that combine CNNs with CRF models.
Meanwhile, some related works deal with detection and semantic segmentation jointly [19][20][21][22]. Yao and Fidler [19] propose a traditional approach with high-order potentials to holistic scene understanding that reasons jointly about regions, location, class and spatial extent of objects. Fidler and Mottaghi [20] focus on how semantic segmentation can help object detection; their model blends the detector and the segmentation model by boosting object hypotheses on the segments. Neither of these two methods exploits the strength of DCNNs. Teichmann and Weber [21] introduce an approach (MultiNet) to joint classification, detection and semantic segmentation via a unified CNN architecture in which the encoder is shared among the three tasks. However, MultiNet is trained and evaluated only on the KITTI dataset and is mainly designed for autonomous driving. Kokkinos [22] introduces a convolutional neural network (UberNet) that jointly handles low-, mid-, and high-level vision tasks in a unified architecture that is trained end-to-end. It should be noted that UberNet is initialized from a network trained on MS-COCO data and thus requires more training data than our DNS.

Detection and semantic segmentation with DNS
In this paper, we propose an efficient and effective semantic segmentation architecture, called DNS, to jointly reason about object detection and semantic segmentation. Figure 1 presents the DNS architecture. In DNS, we add convolutional feature layers to the end of the truncated base net. These layers decrease in size progressively and allow detections to be predicted at multiple scales. Object detection is performed by a single convolutional layer that predicts the class and the bounding-box coordinates from the feature maps of the up-scale stream. Similarly, for segmentation we upscale all the activations of the up-scale stream and concatenate them to predict the pixel labels and produce segmentation maps.
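To make the size of such a single-layer detection head concrete, the following minimal sketch counts its output channels and the number of candidate boxes per image. It assumes an SSD-style head in which every feature-map cell predicts class scores and four box coordinates for each anchor; the number of anchors per cell and the feature-map sizes are hypothetical values chosen for illustration and are not stated in the text.

```python
# Hypothetical sizing of a single-convolution detection head.
# Only NUM_CLASSES comes from the text (20 VOC classes + background);
# ANCHORS_PER_CELL and FEATURE_SIZES are illustrative assumptions.

NUM_CLASSES = 21
ANCHORS_PER_CELL = 6            # assumption
FEATURE_SIZES = [19, 10, 5, 3]  # assumption: side lengths of square feature maps


def detection_head_channels(num_classes: int, anchors_per_cell: int) -> int:
    """Output channels of one conv layer that predicts, for every anchor,
    num_classes scores plus 4 bounding-box coordinates."""
    return anchors_per_cell * (num_classes + 4)


def predictions_per_image(feature_sizes, anchors_per_cell: int) -> int:
    """Total number of candidate boxes over all (square) feature maps."""
    return sum(size * size * anchors_per_cell for size in feature_sizes)


if __name__ == "__main__":
    print(detection_head_channels(NUM_CLASSES, ANCHORS_PER_CELL))
    print(predictions_per_image(FEATURE_SIZES, ANCHORS_PER_CELL))
```

Under these assumptions the head has 150 output channels, and the four feature maps yield 2970 candidate boxes per image.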
As shown in Figure 1, DNS is composed of two parts: a convolution network and a deconvolution network. First, the input image is processed by the convolution network to produce a map of high-level features. We employ ResNet-50 or VGG-16 as the base net for the convolutional part. Taking VGG-16 as an example, the convolution network discards the fully-connected and softmax layers of VGG-16. We call its last layer Conv5-3; it is followed by the deconvolution part, which consists of a Down-Scale and an Up-Scale stream.
Fig. 1. The DNS architecture, which performs semantic segmentation with a fully convolutional network. We adopt deconvolution layers to build the segmentation maps. Multi-upscale feature maps are used to make pixel predictions.
In the Down-Scale stream, we use a 3 × 3 and a 1 × 1 convolution to obtain the so-called 'fc-6' and 'fc-7' layers, respectively. Given the features produced by fc-7, we apply three similar blocks, each consisting of a 1 × 1 and a 3 × 3 convolutional layer as discussed in [23], followed by a pooling layer, to produce more precise predictions. After these three convolution blocks, the Up-Scale stream applies the deconvolution pattern of [24] with skip connections. These skip connections likewise keep the gradient from affecting the backbone network too strongly and ensure the stability of the network. After three deconvolutions, the feature maps are concatenated in order to predict precise object masks and segmentation maps.
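The spatial bookkeeping of the two streams can be sketched as follows. This is an illustration under stated assumptions rather than the exact configuration: it assumes that each of the three down-scale blocks halves the feature-map side (rounding up), that each deconvolution restores the size of the matching down-scale map so the skip connection can be merged, and that the fc-7 map has side 19; none of these values is given in the text.

```python
import math


def down_scale_sizes(base_size: int, num_blocks: int = 3):
    """Side lengths after each down-scale block, starting from the fc-7 map.
    Assumption: every block halves the side, rounding up."""
    sizes = [base_size]
    for _ in range(num_blocks):
        sizes.append(math.ceil(sizes[-1] / 2))
    return sizes


def up_scale_sizes(down_sizes):
    """Side lengths along the up-scale stream.
    Assumption: each deconvolution restores the size of the matching
    down-scale map, so the stream mirrors the down-scale sizes."""
    return list(reversed(down_sizes))


def skip_pairs(down_sizes):
    """(up-scale size, matching down-scale size) for each skip connection."""
    up = up_scale_sizes(down_sizes)[1:]           # outputs of the deconvolutions
    skips = list(reversed(down_sizes[:-1]))       # maps they are merged with
    return list(zip(up, skips))


if __name__ == "__main__":
    down = down_scale_sizes(19)   # 19 is a hypothetical fc-7 side length
    print(down)                   # [19, 10, 5, 3]
    print(up_scale_sizes(down))   # [3, 5, 10, 19]
    print(skip_pairs(down))       # [(5, 5), (10, 10), (19, 19)]
```

Each pair returned by `skip_pairs` has equal sides, which is the condition for merging an up-scale map with its down-scale counterpart.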
In the training stage, given training data annotated with bounding boxes and segmentation maps, we design a loss function that is simply the sum of the loss functions of the two tasks. Our training objective is expressed as:
$$\mathcal{L}(\theta_{b}, \theta_{det}, \theta_{seg}) = \mathcal{L}_{base}(\theta_{b}) + \ell(\theta_{b}, \theta_{det}, \theta_{seg}) \qquad (1)$$
In equation (1), we use $det$ and $seg$ to index the two tasks; $\theta_{b}$ denotes the weights of the base net; $\theta_{det}$ and $\theta_{seg}$ are the task-specific weights; $\mathcal{L}_{base}$ is the loss function of the base CNN model (ResNet-50 or VGG-16); and $\ell(\theta_{b}, \theta_{det}, \theta_{seg})$ is the task-specific loss function. This task-specific loss is written as follows:
$$\ell(\theta_{b}, \theta_{det}, \theta_{seg}) = \sum_{t \in \{det, seg\}} \sum_{i} \ell_{t}\big(f_{t}^{i}(\theta_{b}, \theta_{t}),\, y_{t}^{i}\big)$$
where we use $i$ to index training samples, denote by $f_{t}^{i}$ and $y_{t}^{i}$ the task-specific network prediction and ground truth of task $t$ at the $i$-th example respectively, and by $\theta_{t}$ the task-specific network parameters.
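The structure of this objective, a base-network term plus the detection and segmentation losses summed over training samples, can be sketched as below. The function and argument names are hypothetical, and the per-sample loss values are assumed to be computed elsewhere (for example by the detection and segmentation heads).

```python
def task_loss(per_sample_losses):
    """Sum one task's loss over the training samples (index i in the text)."""
    return sum(per_sample_losses)


def joint_loss(base_loss, det_losses, seg_losses):
    """Total objective: base-network term plus the sum of the two task losses.

    base_loss  -- scalar loss term of the base CNN (hypothetical input)
    det_losses -- per-sample detection losses
    seg_losses -- per-sample segmentation losses
    """
    return base_loss + task_loss(det_losses) + task_loss(seg_losses)


if __name__ == "__main__":
    # Two training samples with made-up loss values.
    total = joint_loss(0.5, det_losses=[1.0, 2.0], seg_losses=[0.25, 0.25])
    print(total)  # 4.0
```

Because the two task terms are simply added, neither task is weighted above the other in this sketch; any re-weighting would be an additional design choice not described in the text.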
To implement detection, we follow an approach similar to that proposed in [23]. The objective of the detection task is to minimize the error between the ground-truth bounding boxes and the boxes predicted from the anchor boxes on the input image. For the segmentation task, the loss function is the cross-entropy between the predicted and target class distributions of pixels. Specifically, we use a 1 × 1 convolution with 64 channels to map each layer of the up-scale stream to an intermediate representation. Each layer is then up-scaled to the size of the last layer using bilinear interpolation, and all maps are concatenated. This representation is mapped to c feature maps, where c is the number of classes, by 3 × 3 convolutions to predict posterior class probabilities.
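The up-scaling and concatenation step of the segmentation branch can be illustrated with a small pure-Python sketch. It is not the TensorFlow implementation: feature maps are plain nested lists, the bilinear interpolation uses an align-corners convention (an assumption, since the text does not specify one), and the 1 × 1 and 3 × 3 convolutions are omitted.

```python
def bilinear_resize(grid, out_h, out_w):
    """Bilinearly resize a 2-D grid (list of rows) to out_h x out_w.
    Assumption: align-corners mapping between input and output pixels."""
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for y in range(out_h):
        sy = y * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(sy)
        y1 = min(y0 + 1, in_h - 1)
        wy = sy - y0
        row = []
        for x in range(out_w):
            sx = x * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(sx)
            x1 = min(x0 + 1, in_w - 1)
            wx = sx - x0
            top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
            bottom = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def concat_upscaled(feature_maps, out_size):
    """Resize every channel of every feature map to out_size x out_size and
    concatenate along the channel axis.
    feature_maps: list of maps, each a list of 2-D channels."""
    channels = []
    for fmap in feature_maps:
        for channel in fmap:
            channels.append(bilinear_resize(channel, out_size, out_size))
    return channels


if __name__ == "__main__":
    small = [[[1.0]], [[2.0]]]              # 2 channels of size 1 x 1
    large = [[[0.0, 2.0], [4.0, 6.0]]]      # 1 channel of size 2 x 2
    merged = concat_upscaled([small, large], out_size=2)
    print(len(merged))                      # 3 channels, each 2 x 2
```

With 64 intermediate channels per layer, as in the text, concatenating the layers of the up-scale stream would give 64 channels times the number of layers before the final 3 × 3 convolution.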

Experiments
We now present experiments conducted on the PASCAL VOC 2007 and 2012 datasets, for which both bounding box annotations and segmentation maps are available. Section 4.1 presents the datasets and metrics in more detail; Section 4.2 presents the technical details needed to make our work reproducible. The last section discusses the inference speed of the network architecture.
Our experiments have two objectives. The first is to explore how the DNS architecture addresses the two individual tasks. The second is to examine how to balance the detection and segmentation tasks in one single net (DNS). To this end, we compare results in three respects. First, we compare with prevalent models designed only for detection. Then we compare with the primary models used only for segmentation. Finally, we compare with five representative approaches designed for the joint detection and segmentation task.

Experimental setup
Datasets and Metrics: We use the PASCAL VOC07 and VOC12 datasets. All images in the VOC datasets are annotated with ground-truth bounding boxes of objects. Both VOC07 and VOC12 consist of 20 foreground object classes and one background class. The VOC07 dataset is divided into two subsets, train-val (5011 images) and test (4952 images). The PASCAL VOC12-train subset contains 5717 images annotated for detection, 1464 of which also have segmentation ground truth, while VOC12-val has 5823 images for detection and 1449 images for segmentation. We train our DNS on different subsets: 'voc07-trainval-seg', 'voc12-train-seg', 'voc12-val', and 'voc12-val-seg'. Of these four subsets, 'voc07-trainval-seg' includes 5011 segmentation images from PASCAL VOC07, while 'voc12-train-seg' includes 1464 segmentation images, 'voc12-val' 5823 images, and 'voc12-val-seg' 1449 segmentation images from PASCAL VOC12.
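The metrics themselves are not spelled out above; results are reported later as mAP for detection and mean IOU for segmentation. As a minimal sketch of the latter, assuming the usual definition (per-class intersection over union, averaged over the classes that occur), the computation on a single label map looks as follows. The official PASCAL VOC evaluation accumulates counts over the whole test set and ignores void pixels, both of which this simplified version omits.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union between two label maps (lists of rows).
    Simplification: computed on one image, no 'ignore' label."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == t:
                inter[p] += 1
                union[p] += 1
            else:
                # A mismatched pixel enlarges the union of both classes.
                union[p] += 1
                union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious)


if __name__ == "__main__":
    pred = [[0, 0], [1, 1]]
    target = [[0, 1], [1, 1]]
    # Class 0: IoU 1/2, class 1: IoU 2/3, mean 7/12.
    print(mean_iou(pred, target, num_classes=2))
```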
Optimization: Our DNS is implemented in Python and TensorFlow. The experiments were conducted on a Tesla K40c GPU with 11439 MB of memory. In all experiments, we use the Adam algorithm instead of SGD, with a mini-batch size of 32 images. The initial learning rate is set to 10^-4 and decreased twice during training by a factor of 10. We also use a weight decay parameter of 5 × 10^-4. As already mentioned, we use ResNet-50 as the feature extractor, 512 feature maps for each layer in the down-scale and up-scale streams, and 64 channels for the intermediate representations in the segmentation branch. We evaluate our proposed method on the PASCAL VOC 2007 and 2012 test sets and compare our test-set results with other competing methods.
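The step-wise learning-rate schedule described above can be sketched as a small function. It assumes the initial rate reads as 10^-4; the iterations at which the two decays happen are not given in the text, so the `decay_steps` values below are purely hypothetical.

```python
def learning_rate(step, base_lr=1e-4, decay_steps=(40_000, 60_000), factor=10.0):
    """Piecewise-constant schedule: divide the rate by `factor` at each boundary.

    base_lr     -- initial learning rate (assumed to be 1e-4)
    decay_steps -- hypothetical iterations at which the rate is decreased
    factor      -- decay factor (10 in the text)
    """
    lr = base_lr
    for boundary in decay_steps:
        if step >= boundary:
            lr /= factor
    return lr


if __name__ == "__main__":
    for step in (0, 40_000, 60_000):
        print(step, learning_rate(step))
```

With two boundaries the rate takes three values over training: the initial rate, one tenth of it, and one hundredth of it.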

Individual object detection
We start by verifying that the choice of PASCAL VOC training data makes a substantial difference for our DNS. In these experiments, the default image size used to train DNS is 300 × 300. The detection performance for different PASCAL VOC training sets is reported in Table 1. Tables 2 and 3 compare the detection accuracy of DNS with state-of-the-art models on PASCAL VOC07 and VOC12, respectively. As shown in Table 1, DNS achieves its best results when trained on the joint subset of voc07-trainval-seg, voc12-train-seg, and voc12-val, with 69.0 in the 07mAP column and 73.3 in the 12mAP column. Note that the ResNet-50 result on this joint subset is better than the VGG-16 result on any subset. However, on all the remaining subsets, the VGG-16 base net gives better results than ResNet-50 in both the 07mAP and 12mAP columns. As Table 3 shows, detection accuracy on VOC12 is more than 4.3% higher than on VOC07. Meanwhile, our detection results on VOC12 are similar to those of Faster RCNN and better than those of YOLO and SSD. We argue that this result is competitive even though it is still 3.3% below R-FCN.

Individual semantic segmentation
The second task we consider is semantic segmentation. Although a broad range of techniques has been designed for this problem, we compare only with state-of-the-art methods. Table 4 compares segmentation results on the PASCAL VOC12 test set.
Table 4. Semantic segmentation evaluation results on PASCAL VOC 2012 test set.

Joint detection and semantic segmentation
The previous sections explored how the DNS architecture addresses the two individual tasks. For the joint task, we compare in Table 5 with five representative methods that similarly use a single net to address joint detection and segmentation (our final goal). We also show visualizations of detection and segmentation results on PASCAL VOC 2007 test images in Figure 2.
From Table 5, we observe that very few works address these two tasks jointly. Among the methods evaluated on the PASCAL VOC dataset, our DNS is considerably better than SDS, Holistic Scene Understanding and segDPM. Importantly, our DNS achieves the second-best performance on the joint detection and segmentation task. Compared with DAG, whose overall performance is slightly below ours, DNS reaches a detection performance of 73.3, about 6.2 mAP higher, and a segmentation performance of 69.8, about 2.3 mean IOU lower. The main reason is that the object detection architecture of DAG is based on Faster-RCNN, which is a little better than ours, and the semantic segmentation architecture of DAG is based on FCN, which is a little worse than ours. In particular, the accuracy of DAG (both mAP and mean IOU) drops significantly after adversarial perturbations are added.
UberNet, which achieves the best performance, works in conjunction with Faster-RCNN and uses the VOC 2007 dataset to fine-tune an MS-COCO pretrained network for the detection task. For the segmentation task, UberNet deviates from the Deeplab-FOV architecture by using linear operations on top of skip layers to reach results similar to DeepLab-v2. Even so, the mAP of our DNS is only about 5.5 below UberNet, and its mean IOU just 1.3 below.

Speed comparison
To examine the balance between accuracy and inference cost, we compare speed with other state-of-the-art pipelines in Figure 3. Working at 24 frames per second (FPS), our approach is the most accurate among these five detectors; in the setting close to real time (19 FPS), it still provides real-time detections among its counterparts while also producing semantic segmentation masks.

Conclusions
This paper adapts a deep deconvolution semantic segmentation model (DNS) to handle both the detection and segmentation tasks. Experiments on the PASCAL VOC dataset have shown that: (1) our DNS network, based on weight sharing, benefits both detection and segmentation; (2) merging the down-scale and up-scale features not only improves performance over deconvolution baselines but also speeds up detection; (3) our network demonstrates competitive performance on the PASCAL VOC detection and segmentation benchmarks.