Optimizing Convolutional Neural Networks on the TI C6678 Multicore DSP

Abstract: Convolutional Neural Networks (CNNs) have become the leading algorithms in deep learning. They are widely used in image processing, object detection, and automatic translation. As the demand for CNNs continues to increase, the range of platforms on which they are deployed continues to expand. As an excellent low-power, high-performance embedded solution, the Digital Signal Processor (DSP) is frequently used in many key areas. This paper deploys a CNN on Texas Instruments (TI)'s TMS320C6678 multi-core DSP and optimizes its main operation (convolution) to fit the DSP architecture. The efficiency of the improved convolution operation is increased by tens of times.


Introduction
In recent years, deep learning has emerged alongside the rapid development of computer hardware and has become a popular technology. The Convolutional Neural Network (CNN) has achieved remarkable results in the field of computer vision. Because CNNs use a large number of convolutional layers, the computational complexity of the network is high, which brings a huge workload. At present, GPUs are often used to accelerate the training and inference of CNNs. But as technology develops and application demands grow, the area and power overhead of GPUs becomes unbearable, so CNNs are beginning to be deployed on mobile and embedded platforms such as FPGAs and ASICs. Digital signal processors (DSPs) are known for their low power consumption, high computing power, and ease of programming, and their large number of multiply-accumulate (MAC) units gives them a natural advantage for convolution operations. This paper uses Texas Instruments (TI)'s TMS320C6678 for the deployment and optimization of CNNs.
The TMS320C6678 is a multi-core DSP based on TI's KeyStone architecture. It has 8 C66x cores running at 1 GHz or 1.25 GHz, delivers up to 320 GMAC or 160 GFLOPS, and consumes only about 10 W. The TMS320C6678 combines high performance with low power consumption and is a first choice for High-Performance Embedded Computing (HPEC).
As shown in Fig. 1, the C6678 has eight C66x CorePacs based on a Very Long Instruction Word (VLIW) architecture. Each core has 32KB each of L1P and L1D memory and 512KB of L2 SRAM, and the device provides a 4MB Multicore Shared Memory (MSM) SRAM. The processor also has external resources shared among the cores, such as the Multicore Navigator, network coprocessor, packet accelerator, and hardware semaphores. Within each core, the .M units execute multiplication instructions, the .S and .L units execute arithmetic, logic, and branch instructions, and the .D units are mainly responsible for loading and storing data. A .M unit contains 16 16-bit multipliers, so each .M unit can perform four single-precision (32-bit) multiplications per cycle, making the C66x one of the highest-performance floating-point DSPs on the market. The tight integration of multiple C66x DSP cores creates a multi-core System-on-Chip (SoC) device with superior performance.

Convolution method
The convolution operation is the main operation in CNNs, and it is also the most time-consuming part. The convolution is calculated as Equation 1, where kw and kh are the width and height of the convolution kernel, sw and sh are the horizontal and vertical steps of the sliding window, and xn is the number of channels of the input image. o(n,i,j) is the output value at position (i, j) in the nth output channel, x(m,i,j) is the input value at position (i, j) in the mth input channel, and w(n,m,u,v) is a weight value, where n indexes the output channel, m indexes the input channel, and (u, v) are the coordinates within a single convolution kernel.
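Equation 1 itself did not survive in this copy; from the definitions above, the standard convolution sum can be reconstructed (a reconstruction, assuming zero-based indexing) as:

$$o(n,i,j) \;=\; \sum_{m=0}^{x_n-1}\;\sum_{u=0}^{k_h-1}\;\sum_{v=0}^{k_w-1} x\bigl(m,\; i\cdot s_h+u,\; j\cdot s_w+v\bigr)\cdot w(n,m,u,v) \qquad (1)$$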

Sliding windows
The sliding-windows method performs the operation step by step according to Equation 1, multiplying and accumulating the input data and the convolution kernel through nested loops and data indexing. The main implementation process is as follows:
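As a sketch (not the paper's exact code), the sliding-windows method for one output channel reduces to nested loops over output positions, input channels, and kernel coordinates; all names and the row-major layout are illustrative assumptions:

```c
/* Direct (sliding-window) convolution for a single output channel.
 * x: input,   laid out row-major as x[m][ih][iw] (xn channels of ih x iw)
 * w: weights, laid out row-major as w[m][kh][kw] for this output channel
 * o: output,  oh x ow, where oh = (ih - kh)/sh + 1, ow = (iw - kw)/sw + 1
 */
void conv2d_sliding(const float *x, const float *w, float *o,
                    int xn, int ih, int iw,
                    int kh, int kw, int sh, int sw)
{
    int oh = (ih - kh) / sh + 1;
    int ow = (iw - kw) / sw + 1;
    for (int i = 0; i < oh; i++) {
        for (int j = 0; j < ow; j++) {
            float acc = 0.0f;
            for (int m = 0; m < xn; m++)          /* input channels */
                for (int u = 0; u < kh; u++)      /* kernel rows    */
                    for (int v = 0; v < kw; v++)  /* kernel columns */
                        acc += x[(m * ih + i * sh + u) * iw + (j * sw + v)]
                             * w[(m * kh + u) * kw + v];
            o[i * ow + j] = acc;
        }
    }
}
```

The loop order shown here indexes memory non-contiguously, which is exactly the inefficiency that motivates the GEMM approach below.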

GEMM
GEMM is an effective convolution method used by mainstream deep learning frameworks such as Caffe and MXNet. This method uses im2col to transform the entire convolution into a matrix multiplication (GEMM in BLAS), and GEMM is highly optimized in the various BLAS libraries. As shown in Fig. 3, this convolution method rearranges the input data in the order of the convolution kernel data so that the two are in one-to-one correspondence and the addresses are consecutive. Although im2col takes time and space to tile the input data, it guarantees the continuity of addresses, which benefits the hardware implementation and shortens the time taken by the convolution operation.
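A minimal im2col sketch (illustrative, not Caffe's implementation): each kh×kw×xn input patch becomes one column of a matrix with xn·kh·kw rows and oh·ow columns, so the convolution becomes a single matrix product of the weights against this matrix.

```c
/* im2col: rearrange input patches into columns.
 * x:   input, laid out row-major as x[m][ih][iw]
 * col: output matrix with (xn*kh*kw) rows and (oh*ow) columns,
 *      stored row-major; column p corresponds to output pixel p.
 */
void im2col(const float *x, float *col,
            int xn, int ih, int iw,
            int kh, int kw, int sh, int sw)
{
    int oh = (ih - kh) / sh + 1;
    int ow = (iw - kw) / sw + 1;
    for (int m = 0; m < xn; m++)
        for (int u = 0; u < kh; u++)
            for (int v = 0; v < kw; v++) {
                int row = (m * kh + u) * kw + v;
                for (int i = 0; i < oh; i++)
                    for (int j = 0; j < ow; j++)
                        col[row * (oh * ow) + i * ow + j] =
                            x[(m * ih + i * sh + u) * iw + (j * sw + v)];
            }
}
```

After this rearrangement, each row of the weight matrix dotted with each column of `col` yields one output value, with all accesses contiguous.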

Winograd algorithm
Figure 3. The convolution steps of Caffe: im2col first, then matrix multiplication of weights and inputs.

The Winograd algorithm is a fast convolution algorithm derived from Winograd's minimal filtering algorithm. It reduces the number of multiplications through a series of transformations on the input data and the convolution kernel; simply put, more additions are used to reduce the number of multiplications. For example, for a one-dimensional convolution with output size m and kernel size r, a normal convolution requires m × r multiplications, while the Winograd algorithm requires only (m + r − 1). The Winograd algorithm gives good acceleration on platforms where a multiplication takes many more clock cycles than an addition.
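The smallest common case, F(2,3), illustrates the count above: m = 2 outputs with an r = 3 kernel takes 4 multiplications instead of 6. A sketch (standard Winograd transforms, not code from the paper):

```c
/* Winograd F(2,3): two outputs of a 1-D correlation with a 3-tap
 * kernel using 4 multiplications (m + r - 1 = 4) instead of m*r = 6. */
void winograd_f23(const float d[4], const float g[3], float y[2])
{
    /* kernel transform (can be precomputed once per kernel) */
    float g0 = g[0];
    float g1 = 0.5f * (g[0] + g[1] + g[2]);
    float g2 = 0.5f * (g[0] - g[1] + g[2]);
    float g3 = g[2];
    /* the four multiplications */
    float m1 = (d[0] - d[2]) * g0;
    float m2 = (d[1] + d[2]) * g1;
    float m3 = (d[2] - d[1]) * g2;
    float m4 = (d[1] - d[3]) * g3;
    /* output transform: only additions and subtractions */
    y[0] = m1 + m2 + m3;   /* = d0*g0 + d1*g1 + d2*g2 */
    y[1] = m2 - m3 - m4;   /* = d1*g0 + d2*g1 + d3*g2 */
}
```

The extra additions in the transforms are the price paid for the two saved multiplications, which is why the method pays off only where multiplication is relatively expensive.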

FFT algorithm
The Fourier transform and the Fast Fourier Transform (FFT) are calculation methods often used in classical image processing algorithms, but they are not usually used in CNNs, because the convolution kernels of CNNs are usually small, such as 1×1 or 3×3. In these cases, the time overhead of the FFT exceeds its benefit. Only when a CNN uses relatively large convolution kernels can the time overhead of the FFT be hidden.

CNN architecture
The CNN originated as a biophysical model for the recognition of two-dimensional shapes, inspired by the neural mechanisms of the visual system. A CNN can be regarded as a special multi-layer perceptron or feedforward neural network. It is highly invariant to translation and has a degree of invariance to scaling and tilting. In addition, CNNs have the properties of local connectivity and weight sharing, which reduce the amount of computation and the number of parameters. CNNs excel in many fields, especially image recognition.
A CNN mainly includes convolutional layers, pooling layers, and fully connected layers; its general structure is shown in Fig. 4. The convolutional layers are the main structure of the whole network: several filters are applied to extract different types of features from the input data. The role of a pooling layer is to downsample the input feature map and reduce the size of the data. The most common pooling method is max pooling, which outputs the maximum value of each region. After extracting features, a CNN typically ends with one or more fully connected layers for classification.
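As a sketch of the max-pooling step just described (illustrative, not the paper's code), a 2×2 window with stride 2 needs only comparisons:

```c
#include <float.h>

/* 2x2 max pooling with stride 2 over a single ih x iw feature map
 * (ih and iw assumed even).  Output is (ih/2) x (iw/2). */
void maxpool2x2(const float *in, float *out, int ih, int iw)
{
    int oh = ih / 2, ow = iw / 2;
    for (int i = 0; i < oh; i++)
        for (int j = 0; j < ow; j++) {
            float m = -FLT_MAX;
            for (int u = 0; u < 2; u++)
                for (int v = 0; v < 2; v++) {
                    float t = in[(2 * i + u) * iw + (2 * j + v)];
                    if (t > m) m = t;   /* comparison only, no arithmetic */
                }
            out[i * ow + j] = m;
        }
}
```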
Although the fully connected layers involve far less computation than the convolutional layers, they account for the largest share of the network's parameters.

Deployment
We aim to deploy a common CNN framework on the DSP. The mainstream deep learning frameworks, such as Caffe, TensorFlow, and MXNet, rely on a number of operating-system-specific libraries and are not suitable for DSP embedded systems.
We therefore write our own framework in C. Since only forward inference is required, we only need to implement the convolution layer, the max-pooling layer, the fully connected layer, and the data transfer. To maintain accuracy, all data uses the single-precision floating-point type. The input data and convolution kernel weights are stored in DDR, and the required data is preloaded into L1 and L2 during the calculation.
In the implementation of convolution, we measured several convolution methods, and we optimized matrix multiplication for the characteristics of the C66x DSP. As shown in Table Ⅰ, we measured three convolution methods with AlexNet on the TMS320C6678: sliding windows, GEMM, and our optimized matrix multiplication. It is not difficult to see that our optimized matrix multiplication is much more efficient than using Caffe's GEMM directly, improving efficiency by up to hundreds of times.
When implementing matrix multiplication, we use the intrinsic _ftod to pack adjacent data, partition the whole matrix into 2×2 blocks, and use the DMPYSP and DADDSP 2-way SIMD single-precision multiply and add instructions for the blocked multiply-accumulate operations. Through loop unrolling and scheduling, eight multiply-add operations are performed per loop iteration, making full use of the two functional units of the DSP to improve parallelism; the actual efficiency reaches about 2 GFLOP/s.

Figure 4. The CNN architecture is mainly composed of convolutional layers, pooling layers, and fully connected layers.

The pooling function replaces the output of the network at a location with a summary statistic of the adjacent outputs at that location. Max pooling takes the maximum output in an adjacent rectangular area, so it needs only logical comparisons rather than heavy computation. The fully connected layer can be thought of as a global convolution whose kernel size equals the input feature map size, and our implementation follows the convolution routine.
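The TI intrinsics (_ftod, DMPYSP, DADDSP) are C66x-specific, but the blocking scheme can be sketched in portable C; here plain scalar arithmetic stands in for the 2-way SIMD instructions, and the names are illustrative:

```c
/* 2x2 register-blocked matrix multiply C = A * B (row-major),
 * with n, k, p assumed even.  Each inner iteration performs
 * 4 multiplies and 4 adds (8 FLOPs), which on the C66x map
 * naturally onto paired 2-way SIMD multiply/add instructions. */
void matmul_2x2_blocked(const float *A, const float *B, float *C,
                        int n, int k, int p)
{
    for (int i = 0; i < n; i += 2)
        for (int j = 0; j < p; j += 2) {
            float c00 = 0, c01 = 0, c10 = 0, c11 = 0;
            for (int l = 0; l < k; l++) {
                float a0 = A[i * k + l],  a1 = A[(i + 1) * k + l];
                float b0 = B[l * p + j],  b1 = B[l * p + j + 1];
                c00 += a0 * b0;  c01 += a0 * b1;
                c10 += a1 * b0;  c11 += a1 * b1;
            }
            C[i * p + j]           = c00;
            C[i * p + j + 1]       = c01;
            C[(i + 1) * p + j]     = c10;
            C[(i + 1) * p + j + 1] = c11;
        }
}
```

Keeping four accumulators in registers per block is what lets the unrolled loop sustain eight FLOPs per iteration without reloading A and B elements.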

Parallel
After completing the deployment, we begin parallel optimization. The TMS320C6678 has 8 cores, but it is clearly unreasonable to run every task across all 8 cores; doing so may make the communication time exceed the computation time. Profiling shows that the runtime of the project is dominated by convolution calculations, so we parallelize only the convolution.
There are two main parallel processing models: the master-slave model and the data-flow model; their structures are shown in Fig. 5 and Fig. 6. The master-slave model centralizes control and distributes execution, while the data-flow model distributes both control and execution. Since the most time-consuming layer of a CNN is the convolutional layer, and the network as a whole is a pipeline in which each layer depends on the previous one rather than being independent, the master-slave model is the better fit. For multi-core communication, IPC interrupts are commonly used, but we found the interrupt overhead to be large, especially for small convolutional layers. Therefore, we communicate by polling semaphores. The specific implementation is as follows:
• The master core acquires semaphore A;
• When a slave core observes that semaphore A is held, it begins its share of the computation and acquires one semaphore of its own;
• When a slave core completes its computation, it releases its semaphore;
• The master core polls until all slave semaphores have been released, then continues execution.
The CSL library provides the semaphore functions used for this synchronization:

    CSL_semAcquireDirect( );      /* acquire the synchronization semaphore */
    /* ... convolution operation ... */
    CSL_semReleaseSemaphore( );   /* release the semaphore */

We test the speed of convolution after parallel acceleration, as shown in Table Ⅱ; it is several times faster than before, especially when the number of convolution operations is large.
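The polling handshake can be sketched on a POSIX host, with C11 atomic flags standing in for the C6678 hardware semaphores and a trivial per-slave computation standing in for a convolution slice; all names are illustrative:

```c
#include <pthread.h>
#include <stdatomic.h>

#define NUM_SLAVES 2

static atomic_int start_flag;          /* plays the role of semaphore A    */
static atomic_int done[NUM_SLAVES];    /* one "semaphore" per slave core   */
static float partial[NUM_SLAVES];

static void *slave(void *arg)
{
    int id = (int)(long)arg;
    while (!atomic_load(&start_flag))  /* poll until the master signals    */
        ;
    partial[id] = (float)(id + 1);     /* stand-in for a convolution slice */
    atomic_store(&done[id], 1);        /* "release" this slave's semaphore */
    return 0;
}

float run_master(void)
{
    pthread_t t[NUM_SLAVES];
    for (long i = 0; i < NUM_SLAVES; i++)
        pthread_create(&t[i], 0, slave, (void *)i);
    atomic_store(&start_flag, 1);      /* master "acquires semaphore A"    */
    float sum = 0;
    for (int i = 0; i < NUM_SLAVES; i++) {
        while (!atomic_load(&done[i])) /* master polls slave semaphores    */
            ;
        sum += partial[i];
    }
    for (int i = 0; i < NUM_SLAVES; i++)
        pthread_join(t[i], 0);
    return sum;
}
```

Busy-waiting is wasteful on a general-purpose OS, but on dedicated DSP cores it avoids the interrupt latency that made IPC too slow for small layers.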

Conclusion
This paper introduces the basic structure of CNNs and deploys one on TI's TMS320C6678 platform. The most time-consuming layer of a CNN on the DSP is the convolutional layer. Therefore, we analyze several commonly used convolution methods and optimize and parallelize the convolution calculation according to the characteristics of the TMS320C6678, which shortens the computation time by 30 to 60 times. The data storage in this paper uses single-precision floating-point numbers; the next step is fixed-point optimization, on which basis the pooling layer can be further optimized. At the same time, many CNNs have a large number of weights, and compressing and pruning the weights is worth studying when porting to embedded platforms.