Robust Robot Grasp Detection via Multimodal Fusion

Abstract. Accurate grasp detection for model-free objects plays an important role in robotics. With the development of RGB-D sensors, object perception technology has made great progress. Achieving a rich feature expression from the colour and the depth data is a critical problem that must be addressed to accomplish the grasping task. To solve this data fusion problem, this paper proposes a convolutional neural network (CNN) based approach that combines regression and classification. In the CNN model, the colour and the depth modalities are deeply fused to achieve an accurate feature expression. Additionally, the Welsch function is introduced into the approach to enhance the robustness of the training process. Experimental results demonstrate the superiority of the proposed method.


Introduction
With the development of visual sensors, robot vision perception technology has made great progress. Especially in recent years, robots can perceive the colour and the distance properties of the environment due to the development of RGB-D sensors like Kinect and Xtion. In the field of home service robot applications, vision based robot grasp detection has been an important research direction in robot technology for the reason that it can improve the level of human-computer interaction.
For a service robot, the objects to be manipulated fall into two categories: known and unknown objects. A known object is one whose model the robot has already stored. In this case, grasp detection can be separated into two processes: the first is to detect the specific object, and the second is to estimate the object's pose and find a proper grasp point. However, the robot cannot store models for all objects. For example, when a service robot enters an unfamiliar environment, the objects there are unknown to it. Grasping such unknown, model-free objects is a difficult task for the robot.
During the last few years, several significant approaches [1][2][3][4][5][6][7][8][9] have been proposed to solve the model-free grasp detection problem. In 2006, a 2D grasping point representation was proposed by A. Saxena [1]. In 2010, Q. V. Le [2] introduced a method using multiple contact points to represent grasp locations. In 2011, Y. Jiang [3] presented a new representation that describes a robot grasp as a 5-dimensional oriented rectangle; an SVM ranking algorithm was introduced to learn this representation. With the development of deep learning, convolutional neural networks have been introduced as powerful visual models [4][5]. In 2013, Lenz [6] proposed a convolutional network based method for robot grasp detection, which first uses sliding windows to generate multiple candidate grasps and then retains the true grasps with a classifier. In 2016, J. Wei [7] proposed a multimodal fusion based deep extreme learning machine for robot grasp recognition, and L. Trottier [8] presented a dictionary learning method in the same year. However, these three methods suffer from high complexity. In 2015, J. Redmon [9] introduced a real-time detection method using single-stage regression. The end-to-end solution reduced the training difficulty, but the coarse data fusion leads to low detection accuracy, and outliers slow the convergence of the training process.
To improve robot grasp detection accuracy, we adopt an improved approach. The contributions are the following. First, we propose a robust loss function based on the Welsch function. Second, we introduce the atrous convolution algorithm into our architecture to improve the local expression ability of features. Both contributions improve robot grasp detection accuracy.
The rest of the paper is organised as follows. Section 2 describes the deep regression model for robot grasp detection. Details of the proposed method are discussed in Section 3. Section 4 evaluates the performance of the proposed method. Finally, Section 5 gives the summary and conclusion.

Related works
Generally, object models must be constructed before robotic grasping. For instance, a specific object can be expressed by invariant local features [10], a sparse 3D point cloud model [11] or a dense 3D point cloud [12]. However, building object models is difficult and time consuming, which limits the robot's ability to adapt to its environment. Recently, category based object detection [13][14][15] has been applied to robot applications, but it remains difficult to determine an object's pose.
Y. Jiang [3] proposed a rectangle representation for robot grasps, which skips the object detection and pose estimation processes. Each grasp is described by a rectangle with its central coordinates, size and orientation. This representation greatly simplifies the complexity of the model. With the development of deep learning, a convolutional neural network based grasp detection method was proposed by I. Lenz [6]; however, its sliding window approach decreases detection efficiency. L. Trottier [8] proposed a dictionary learning method for robot grasp detection. Although it achieves high accuracy, its proposal procedure makes it inefficient. J. Redmon [9] presented an end-to-end solution for real-time grasp detection. Nevertheless, the shallow data fusion architecture affects detection accuracy, and the fully-connected layers reduce the local feature representation ability. D. Guo [16] introduced a pipeline using reference rectangles on the feature map; since only colour information is employed, its detection accuracy is still not high enough.

Architecture
In our work, the robot grasp is represented by a grasping rectangle, a 5-dimensional vector {x, y, w, h, θ}, the same as in [5]. In this vector, x and y denote the central coordinates, w and h denote the rectangle size, and θ represents the angle between the rectangle and the horizontal direction. Inspired by [9], the robot grasp is detected in a regression manner. The proposed architecture is shown in Figure 1. It is a modification of the VGG-16 network [17]: the fully-connected layers of VGG-16 were replaced by three convolutional layers, aiming at improving the expression ability of local information according to the hole filling (atrous) algorithm [18]. In the architecture, all convolutional layers have the same kernel size 3×3 and the same stride 1. After each convolutional layer, batch normalization is employed to improve detection accuracy. All pooling layers are max-pooling layers with stride 2. The input colour image size is 224×224×3 and the output matrix size is 7×7×7.
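The 7×7 output grid follows directly from the layer arithmetic: 3×3 stride-1 convolutions (with padding 1, our assumption) preserve spatial size, while five 2×2 stride-2 max-pools each halve it, so a 224×224 input reduces to 7×7. A minimal bookkeeping sketch:

```python
def conv_out(size, k=3, s=1, p=1, d=1):
    """Output size of a convolution (standard size formula)."""
    return (size + 2 * p - d * (k - 1) - 1) // s + 1

def pool_out(size, k=2, s=2):
    """Output size of a max-pooling layer."""
    return (size - k) // s + 1

size = 224
for _ in range(5):              # five VGG-16 blocks, each ending in a 2x2 max-pool
    size = pool_out(conv_out(size))
print(size)                     # 7 -> the 7x7 detection grid

# An atrous (dilated) 3x3 convolution with padding 2 also preserves the 7x7 map,
# which is why the replacement head can stay fully convolutional:
print(conv_out(7, k=3, p=2, d=2))  # 7
```

This also illustrates why atrous convolution enlarges the receptive field without shrinking the feature map.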

Detection methodology
The detection architecture is designed to combine classification with regression. For each input image, we detect 49 (7×7) results, and each result is a 7-dimensional vector. As can be seen in Figure 2, the vector indicates the graspable probability of the detection result and, at the same time, the location (x_r, y_r, w_r, h_r, θ_r) of the graspable rectangle.

Fig. 2. Detection result in each vector
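As a concrete illustration of how such a 7×7×7 output could be decoded into a grasp, the sketch below assumes a per-cell channel layout of [p_no-grasp, p_grasp, x_r, y_r, w_r, h_r, θ_r], with centre offsets normalised to the 32-pixel cell and sizes normalised by w_max and h_max; the channel order, the normalisation constants and the angle encoding are our assumptions for illustration, not stated in the paper.

```python
import numpy as np

def decode(pred, cell=32, wmax=150.0, hmax=150.0):
    """Pick the highest-scoring cell and recover its grasp rectangle.

    pred: (7, 7, 7) array; assumed per-cell layout:
    [p_no_grasp, p_grasp, x_r, y_r, w_r, h_r, theta_r].
    """
    scores = pred[..., 1]                         # graspable probability map
    i, j = np.unravel_index(scores.argmax(), scores.shape)
    p, v = scores[i, j], pred[i, j]
    x = (j + v[2]) * cell                         # offset within the 32-px cell
    y = (i + v[3]) * cell
    w = v[4] * wmax                               # sizes normalised by wmax/hmax
    h = v[5] * hmax
    theta = v[6] * 180.0 - 90.0                   # assumed angle encoding, degrees
    return p, (x, y, w, h, theta)
```

For example, a single activated cell at grid position (3, 4) with offsets (0.5, 0.5) decodes to the image-space centre (144, 112).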
The loss function of the proposed architecture is as follows.
In equation (1), the loss function is composed of two parts: a classification part and a regression part. In equation (2), the value 32 indicates the size of each separate block. In equation (3), w_max and h_max are the largest width and height of the graspable rectangles.
During the training process, back propagation adjusts the parameters to meet the detection requirements. Outliers can lead to slow convergence rates, while small residuals contribute little to the back propagation process. Inspired by [19], we introduced a more robust loss function, the Welsch function, to retain relatively large residuals while bounding the influence of outliers:

ρ(r) = (α²/2)(1 − exp(−(r/α)²))    (4)

In equation (4), α = 2.9846. The function and its derivatives are shown in Figure 3.
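The standard Welsch penalty and its derivative can be written down directly; the influence of a residual grows roughly linearly for small r and decays to zero for outliers, which is what makes the training robust:

```python
import numpy as np

ALPHA = 2.9846  # tuning constant used in the paper

def welsch(r, alpha=ALPHA):
    """Welsch penalty: bounded above by alpha^2/2, so outliers cannot dominate."""
    return (alpha ** 2 / 2.0) * (1.0 - np.exp(-(r / alpha) ** 2))

def welsch_grad(r, alpha=ALPHA):
    """Derivative r * exp(-(r/alpha)^2): ~linear near 0, vanishes for outliers."""
    return r * np.exp(-(r / alpha) ** 2)
```

Constants near 2.9846 are commonly cited as giving high efficiency on Gaussian residuals, which is the usual motivation for this choice of α.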

Multimodal fusion
Nowadays, RGB-D sensors make it possible for robots to obtain distance information about the scene. For robot grasping, colour and depth data fusion has become a research hotspot. In our work, we propose a data fusion method based on feature fusion. We first train the RGB based neural network. In the same way, we feed depth images to a similar network; to simplify training, each depth image is replicated to three channels. Finally, we concatenate the features from the RGB image and the depth image and retrain layers Conv6, Conv7 and Conv8. The feature fusion methodology is illustrated in Figure 4.
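The fusion step itself amounts to a channel-wise concatenation of the two feature maps before the retrained layers. A minimal sketch, where the 512-channel 7×7 feature shapes are our assumption based on the VGG-16 backbone:

```python
import numpy as np

# Depth is replicated to three channels so the depth stream can reuse
# the RGB network structure and its pre-trained weights.
depth = np.random.rand(224, 224)
depth_3ch = np.repeat(depth[None, :, :], 3, axis=0)      # (3, 224, 224)

# Feature maps produced by the two streams (shapes are illustrative).
rgb_feat = np.random.rand(512, 7, 7)
depth_feat = np.random.rand(512, 7, 7)

# Channel-wise concatenation; Conv6-Conv8 are then re-trained on this tensor.
fused = np.concatenate([rgb_feat, depth_feat], axis=0)   # (1024, 7, 7)
```

Concatenation (rather than element-wise addition) lets the retrained layers learn how to weight the two modalities.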

Experimental methodology
We adopted the Cornell Grasping Dataset to evaluate the proposed method. The dataset contains 240 different objects and 885 samples in total. For each sample, an RGB image and a 3D dense point cloud are provided; the point clouds were first transformed into depth images. Because each image contains multiple graspable and non-graspable locations, the proposed method can be applied to this multi-label detection task. To improve the robustness of the network, we augmented the original dataset by image cropping and rotation, extending the training set to 3540 examples with over 20,440 graspable rectangles. The dataset was separated into two parts: 80% for training and the remaining 20% for testing. Moreover, all images were cropped to the size of 224×224.
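When augmenting by rotation, the grasp annotations must be rotated with the image. A small sketch of the label transform; the image-centre pivot at (112, 112) and the angle range convention are our assumptions, not the authors' exact pipeline:

```python
import numpy as np

def rotate_grasp(x, y, theta, angle_deg, cx=112.0, cy=112.0):
    """Rotate a grasp's centre and orientation about the image centre.

    theta and angle_deg are in degrees; the returned orientation is wrapped
    into [-90, 90), since a grasp rectangle is symmetric under 180° rotation.
    """
    a = np.deg2rad(angle_deg)
    dx, dy = x - cx, y - cy
    xr = cx + dx * np.cos(a) - dy * np.sin(a)
    yr = cy + dx * np.sin(a) + dy * np.cos(a)
    tr = (theta + angle_deg + 90.0) % 180.0 - 90.0
    return xr, yr, tr
```

Cropping is simpler: the centre coordinates are shifted by the crop offset while w, h and θ are unchanged.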
To compare the proposed method with conventional methods, the grasp intersection over union (IoU) metric was employed, defined as:

IoU = area(R_p ∩ R_t) / area(R_p ∪ R_t)

where R_p is the predicted rectangle and R_t is the ground truth; the numerator and the denominator denote the overlap area and the union area respectively. A grasp was counted as correct only when the IoU was above 0.25 and the angle difference between the detected grasp and the ground truth was less than 30°. Two experiments were conducted to evaluate image-wise and object-wise detection accuracy.
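The metric can be sketched as follows; for simplicity this version computes the IoU of axis-aligned rectangles, a common approximation, while rotation is handled by the separate 30° angle test:

```python
def grasp_iou(a, b):
    """IoU between two axis-aligned rectangles given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))   # intersection width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))   # intersection height
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def is_correct(pred, truth, pred_theta, truth_theta):
    """Paper's criterion: IoU > 0.25 and angle difference under 30 degrees."""
    angle_ok = abs((pred_theta - truth_theta + 90.0) % 180.0 - 90.0) < 30.0
    return grasp_iou(pred, truth) > 0.25 and angle_ok
```

The modular wrap in the angle test accounts for the 180° symmetry of a grasp rectangle.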

Training details
As mentioned previously, our proposed architecture is a modification of the VGG-16 network. Parameters of the last three layers were generated randomly; the other parameters were initialized from the pre-trained VGG-16.
Then we converted the model to perform robot grasp detection on the RGB and depth image dataset respectively. Features corresponding to the RGB and depth images were extracted and concatenated to perform multiple modalities fusion. The last three convolutional neural network layers were re-trained accordingly.
During the training and validation process, the Caffe deep learning library was employed. The experiments were performed on two NVIDIA Tesla K80 GPUs with 24GB of memory. To speed up training, the NVIDIA cuDNN library was used. In the training procedure, stochastic gradient descent was used to optimize the model.

Results and analysis
We compared our proposed method with the algorithms proposed by Y. Jiang [3], I. Lenz [6] and J. Redmon [9]. Experimental results can be seen in Table 1: the proposed method achieves the best detection accuracy, 88.90% in the image-wise split and 88.20% in the object-wise split, with a detection speed of 117 ms per image. Some detection results are shown in Figure 5. According to the experimental results, the detection accuracy is higher than that of traditional convolutional neural networks, owing to the robust loss function and the modified network. The proposed method is slightly slower than Redmon's because its architecture is more complex, but the speed is still acceptable. Moreover, the end-to-end convolutional structure has a speed advantage over auto-encoder based learning methods.
We also applied the proposed method in our own application, where the detection accuracy reached 82%.
Some detection examples are shown in Figure 6.

Conclusions
Robot grasping improves the quality of human-robot interaction. This paper proposes a robust method to detect robot grasps. The main contributions of the paper are: (1) We propose a new robust loss function for robot grasp detection with deep neural networks. By incorporating the Welsch function, the influence of outliers on back propagation is reduced while residuals of moderate size are retained, which leads to a more robust training process.
(2) The hole filling (atrous) algorithm is introduced to improve the expression ability of local features. Experimental results show that the proposed method achieves superior performance in robot grasp detection.