CNN and RNN mixed model for image classification

In this paper, we propose a mixed model of CNNs (convolutional neural networks) and RNNs (recurrent neural networks) for image classification, which we call the CNN-RNN model. Image data can be viewed as two-dimensional wave data, and a convolution is a filtering operation: it removes non-critical frequency-band information from an image while retaining the image's important features. The CNN-RNN model uses an RNN to compute dependency and continuity features from the intermediate-layer outputs of the CNN, and connects these intermediate-layer features to the final fully connected network for classification prediction, which yields better classification accuracy. At the same time, to satisfy the RNN's restriction on input sequence length and to prevent exploding or vanishing gradients in the network, we filter the input data with the wavelet transform (WT), a relative of the Fourier transform. We test the proposed CNN-RNN model on the widely used CIFAR-10 dataset. The results show that the proposed method classifies better than the original CNN network, and that further investigation is warranted.


Introduction
Current CNN [1][2] networks have become a standard machine-learning method for 'mesh data' (pictures, videos, etc.). Researchers have built many network models with different structures on the basis of the CNN, and these have proved successful at a variety of image-related problems, including handwritten digit recognition [3], natural image recognition, and more. Among them, [4] is one of the most classic networks currently used for image recognition. AlexNet [5], a neural network based on the convolution + pooling structure, is one of the origins of deep learning. After AlexNet, many more CNN models appeared. ZF-Net [6] began using deconvolution to visualize the features of the middle layers of a CNN, which lets people understand more intuitively how a CNN works internally: shallow CNN layers detect characteristics such as edges and colors, while deeper layers detect the shape features of the objects to be identified. Meanwhile, the RNN has become the standard machine-learning method for sequence data (audio, natural language, etc.); NLP [7] and machine translation [8][9] are its most common current applications. Research that combines RNNs for sequence processing with CNNs for image data includes image tagging [10], target detection [11], video behavior detection, and so on. [12] innovatively proposed ReNet, a network model for image classification that replaces the convolution + pooling layers of a CNN with four one-dimensional RNN layers. In this paper, we use a structure similar to that of the ReNet model and propose our own model, the CNN-RNN model, for image classification.
The model has some similarities with ResNet [13]: borrowing ResNet's skip-connection method, we use RNN layers to compute continuity features from the CNN layers and then connect these features to the final fully connected network for classification prediction. In this work, we test the proposed CNN-RNN model on the widely used object-recognition dataset CIFAR-10 [14]. Experiments show that the CNN-RNN model achieves better accuracy and calls for more research in the future.

Model description
In this paper, we make some changes to the RNN layer of [12] and no longer compute sequence results in four directions. We compute sequence results in only two directions, as shown in Figure 1 (the example picture is for reference only and is not from the experiment dataset), which depicts one layer of RNN calculations; it can be seen as a bidirectional RNN model. The RNN model can be an LSTM (Long Short-Term Memory) [15], GRU (Gated Recurrent Unit) [16], IRNN [17], or ConvLSTM (Convolutional LSTM Network) [18]. Denote the input data of the RNN layer as x, where w and h denote the width and height of x; if x is the original picture, c represents the number of color channels, otherwise c represents the number of filters (i.e., the number of convolution kernels). According to [6], the output of a convolutional layer, after a deconvolution operation, reflects the edges, texture, and shape of the image. The output of a convolution layer is therefore the original image after filter processing, and it retains part of the original image's feature attributes. Define the identification area (the patch in Figure 1).
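A minimal numpy sketch of the two-direction row scan described above: a simple tanh RNN runs over each row of a feature map left-to-right and right-to-left, and the two hidden sequences are concatenated. All names and sizes here are hypothetical, for illustration only; in practice the RNN cell would be one of the LSTM/GRU/IRNN/ConvLSTM variants listed above.

```python
import numpy as np

def bidirectional_rnn_rows(x, W_xh, W_hh, b_h):
    """Run a simple tanh RNN over each row of a feature map, left-to-right
    and right-to-left, and concatenate the two hidden sequences.

    x: feature map of shape (h, w, c); hidden size d is W_xh.shape[1].
    Returns an array of shape (h, w, 2 * d).
    """
    h, w, c = x.shape
    d = W_xh.shape[1]

    def scan(row):                      # row: (w, c) -> (w, d)
        state = np.zeros(d)
        states = []
        for t in range(row.shape[0]):
            state = np.tanh(row[t] @ W_xh + state @ W_hh + b_h)
            states.append(state)
        return np.stack(states)

    fwd = np.stack([scan(x[i]) for i in range(h)])              # left-to-right
    bwd = np.stack([scan(x[i][::-1])[::-1] for i in range(h)])  # right-to-left
    return np.concatenate([fwd, bwd], axis=-1)

rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 5, 3))          # toy 4x5 map with 3 channels
out = bidirectional_rnn_rows(feat,
                             rng.standard_normal((3, 6)) * 0.1,
                             rng.standard_normal((6, 6)) * 0.1,
                             np.zeros(6))
print(out.shape)   # (4, 5, 12)
```

Each output position thus carries context from both directions of its row, which is the "continuity feature" the CNN-RNN model feeds forward.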

Model analysis
Filtering is a basic operation in image processing: the filtered value is the sum of the products of image pixels and filter coefficients. Convolution is largely the same operation, so it can be seen as a filtering process, and the convolution operation can also be implemented through signal processing. An image, viewed as a signal, is a non-periodic discrete signal in the spatial domain, so the Fourier transform can be used to operate on it. Convolution is the most important computation in a CNN, and if the images in a CNN are regarded as non-periodic discrete signals, the convolution of an image is the process of extracting waves in different frequency bands. The image can be viewed as a discrete two-dimensional function f(x, y), and the one-dimensional Fourier transform extends to two dimensions. The transformation from the spatial domain to the frequency domain is the Fourier transform:

F(u, v) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y) e^{-i 2\pi (ux/M + vy/N)}

The essence of a convolution kernel is also a two-dimensional function, which has a corresponding spectral function.
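The claim that convolution in the spatial domain equals multiplication in the frequency domain can be checked numerically. The sketch below (illustrative only; the kernel and image values are arbitrary) computes a circular 2-D convolution directly and via the FFT and confirms the two agree.

```python
import numpy as np

def circular_conv2d(img, ker):
    """Circular 2-D convolution computed directly: each kernel tap
    contributes a shifted copy of the image."""
    out = np.zeros_like(img, dtype=float)
    for dy in range(ker.shape[0]):
        for dx in range(ker.shape[1]):
            out += ker[dy, dx] * np.roll(np.roll(img, dy, axis=0),
                                         dx, axis=1)
    return out

rng = np.random.default_rng(1)
img = rng.standard_normal((8, 8))
ker = np.zeros((8, 8))
ker[:3, :3] = rng.standard_normal((3, 3))  # 3x3 kernel zero-padded to image size

direct = circular_conv2d(img, ker[:3, :3])
# Convolution theorem: pointwise product of the two spectra.
via_fft = np.real(np.fft.ifft2(np.fft.fft2(img) * np.fft.fft2(ker)))
print(np.allclose(direct, via_fft))   # True
```

This is exactly the sense in which a convolution kernel "takes" a frequency band: its spectrum weights each frequency of the image.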
The spectral function in the formula is that of the convolution kernel. The convolution of the image thus extracts the characteristics of different frequency bands in the image: convolution in the time (spatial) domain equals multiplication in the frequency domain. However, the Fourier transform has limitations, because it cannot characterize the local behavior of a signal in the time domain and does not work well for sudden, non-stationary signals. Using a wavelet transform is therefore a better choice:

W(a, b) = \frac{1}{\sqrt{a}} \int f(t)\, \psi\!\left(\frac{t - b}{a}\right) dt

where \psi represents a wavelet basis; unlike the basis e^{i\omega t} of the ordinary Fourier transform, it is not a simple sine or cosine wave but a wavelet basis function satisfying certain conditions. Two-dimensional waves have not only local features but also sequence features. Therefore, this paper uses the RNN network to extract the sequence features of the two-dimensional wave, and uses these sequence features as an important signal for judging the image classification.
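As a concrete example of wavelet filtering of input data, the sketch below implements one level of the 2-D Haar wavelet transform by hand in numpy (the paper does not specify which wavelet basis it uses; Haar is chosen here purely for illustration). The approximation band halves each spatial dimension, which is how wavelet filtering can shorten the sequence fed to the RNN.

```python
import numpy as np

def haar_dwt2(x):
    """One level of the 2-D Haar wavelet transform on an even-sized image.

    Returns an approximation band (local averages) and three detail bands
    (horizontal, vertical, diagonal differences), each half the input size.
    """
    a = (x[0::2] + x[1::2]) / 2.0          # average adjacent rows
    d = (x[0::2] - x[1::2]) / 2.0          # difference of adjacent rows
    cA = (a[:, 0::2] + a[:, 1::2]) / 2.0   # approximation
    cH = (d[:, 0::2] + d[:, 1::2]) / 2.0   # horizontal detail
    cV = (a[:, 0::2] - a[:, 1::2]) / 2.0   # vertical detail
    cD = (d[:, 0::2] - d[:, 1::2]) / 2.0   # diagonal detail
    return cA, cH, cV, cD

img = np.arange(16, dtype=float).reshape(4, 4)
cA, cH, cV, cD = haar_dwt2(img)
print(cA.shape)   # (2, 2): the filtered, half-resolution input
```

Keeping only cA (or cA plus selected detail bands) filters out high-frequency content while preserving the image's coarse structure.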
ResNet's skip connections allow its layers to be stacked very deep without the vanishing gradient problem [20]. Because of this skip-connection method, the features output by the middle layers can be passed almost directly to the final fully connected network, so ResNet effectively behaves like a structure composed of multiple shallow CNNs. The CNN-RNN method in this paper cannot achieve a network structure as deep as ResNet's, but it can likewise connect the different feature outputs of the middle layers to the final fully connected layer.
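The idea of routing middle-layer features to the final classifier can be sketched in a few lines of numpy. The stage outputs and sizes below are hypothetical placeholders, not the paper's actual architecture (which Figure 3 specifies).

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical per-image feature vectors: summaries of two intermediate
# CNN stages (e.g. produced by the RNN layers) plus the last stage.
mid1 = rng.standard_normal(64)
mid2 = rng.standard_normal(128)
last = rng.standard_normal(256)

# Concatenate everything and feed the joint vector to a softmax classifier.
features = np.concatenate([mid1, mid2, last])
W = rng.standard_normal((features.size, 10)) * 0.01   # 10 CIFAR-10 classes
logits = features @ W
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(features.shape, probs.shape)   # (448,) (10,)
```

The classifier thus sees low-level edge/texture features and high-level shape features side by side, rather than only the deepest representation.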

DataSets and Model Architectures
The CIFAR-10 dataset contains 60,000 images, each belonging to one of ten categories: airplane, automobile, bird, truck, ship, horse, frog, dog, deer, and cat. The shape of each image is (32, 32, 3). ConvLSTM (Convolutional LSTM Network) was proposed by [18] to solve the problem of precipitation nowcasting. ConvLSTM is an improvement over the fully connected LSTM (FC-LSTM): it is similar to an LSTM layer, but both the input transformations and the recurrent transformations are convolutional. ConvLSTM therefore has the temporal modeling capability of an LSTM while also capturing local features like a CNN. Its gates are computed as

i_t = \sigma(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \circ C_{t-1} + b_i)
f_t = \sigma(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \circ C_{t-1} + b_f)
C_t = f_t \circ C_{t-1} + i_t \circ \tanh(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c)
o_t = \sigma(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \circ C_t + b_o)
H_t = o_t \circ \tanh(C_t)

where '*' indicates the convolution operation and '\circ' indicates the Hadamard product; for more details on ConvLSTM see [18]. In this paper, we compare various types of RNN models and find that the type of RNN has little effect on the results. Figure 3 shows the structure of the CNN-RNN model trained on the CIFAR-10 dataset proposed in this paper.
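A minimal numpy sketch of one ConvLSTM time step, following the gate structure of [18]: '*' becomes an explicit 2-D convolution and the Hadamard product becomes elementwise multiplication. For brevity the peephole terms (W_ci, W_cf, W_co) and biases are omitted, and all kernels are arbitrary 3x3 arrays; this is an illustration, not the paper's implementation.

```python
import numpy as np

def conv2d_same(x, k):
    """'Same'-padded 2-D sliding-window product (cross-correlation, which is
    what deep-learning frameworks call convolution); x, k are 2-D arrays."""
    kh, kw = k.shape
    p = np.pad(x, ((kh // 2,) * 2, (kw // 2,) * 2))
    return np.array([[np.sum(p[i:i + kh, j:j + kw] * k)
                      for j in range(x.shape[1])]
                     for i in range(x.shape[0])])

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def convlstm_step(X, H, C, W):
    """One ConvLSTM step (peepholes and biases dropped for brevity)."""
    i = sigmoid(conv2d_same(X, W['xi']) + conv2d_same(H, W['hi']))  # input gate
    f = sigmoid(conv2d_same(X, W['xf']) + conv2d_same(H, W['hf']))  # forget gate
    g = np.tanh(conv2d_same(X, W['xc']) + conv2d_same(H, W['hc']))  # candidate
    C_new = f * C + i * g                                           # Hadamard mix
    o = sigmoid(conv2d_same(X, W['xo']) + conv2d_same(H, W['ho']))  # output gate
    H_new = o * np.tanh(C_new)
    return H_new, C_new

rng = np.random.default_rng(3)
W = {k: rng.standard_normal((3, 3)) * 0.1
     for k in ['xi', 'hi', 'xf', 'hf', 'xc', 'hc', 'xo', 'ho']}
H, C = convlstm_step(rng.standard_normal((5, 5)),
                     np.zeros((5, 5)), np.zeros((5, 5)), W)
print(H.shape, C.shape)   # (5, 5) (5, 5)
```

Because every transformation is a convolution, the hidden state H keeps the spatial layout of the input, which is what lets ConvLSTM mix temporal and local-spatial features.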

Training
The CNN-RNN model is also an end-to-end learning process that requires no additional preprocessing. To train the networks, we choose a widely used adaptive learning-rate algorithm called Adam [21]. Because of its additional network structure, the CNN-RNN model is more prone to overfitting. To reduce overfitting, we apply BatchNormalization [22] at the output of the RNN layer and impose a normalization (regularization) term on the RNN weights.
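For reference, the Adam [21] update rule used here can be written in a few lines of numpy: exponential moving averages of the gradient and its square, with bias correction. The toy objective below (minimizing ||theta||^2) is only to show the update converging; it is not the paper's training loop.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias-corrected first and second moments."""
    m = b1 * m + (1 - b1) * grad            # first-moment EMA
    v = b2 * v + (1 - b2) * grad ** 2       # second-moment EMA
    m_hat = m / (1 - b1 ** t)               # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(theta) = ||theta||^2, whose gradient is 2 * theta.
theta = np.array([1.0, -2.0])
m = v = np.zeros_like(theta)
for t in range(1, 5001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
print(np.abs(theta).max() < 0.05)   # True: parameters driven near zero
```

Note that Adam's per-parameter step is roughly lr in magnitude regardless of gradient scale, which is why a fixed lr = 0.001 (as used in the experiments below) works across layers.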

Results and Analysis
In model training, we found that the standard VGG model suffers from the vanishing gradient problem because its network hierarchy is too deep, and the model hardly converged. Therefore, in this paper we use a simplified VGG-9 model, obtained by removing two pooling layers. The learning rate during training is fixed at 0.001. ResNet [13] used a variable learning rate in the original paper, so the results in this paper will differ. The training results of the models are shown in Table 1, including the training accuracy and test accuracy of each model. Figure 4 and Figure 5 show the accuracy of the original CNN models and of the CNN-RNN models modified by an RNN (including ConvLSTM and IRNN), respectively. The accuracy of the CNN-RNN models is generally higher than that of the original CNN models. The CNN+ConvLSTM and CNN+IRNN combinations do not differ greatly in accuracy, which suggests that the features output by the CNN contain only simple, short-range dependencies rather than long-term ones, so an ordinary IRNN, or indeed any general RNN, meets the requirements.

Conclusion and Future Work
In this paper, we propose an improved model, the CNN-RNN model, which brings a definite improvement in accuracy on image recognition problems. However, it increases the complexity of the model, reduces training speed, and raises the risk of overfitting, so regularization terms must be added to limit model complexity. In this paper, we extract features from the CNN layers through an RNN network in only a simple way; many other extraction methods and combinations remain to be studied.