Matrix-DSP back-end support based on TVM compilation structure

The emergence of deep learning frameworks has greatly facilitated the construction of network models, but it has not solved the problem of deploying network models on different hardware back-ends. TVM offers an effective solution by decoupling hardware-independent optimization from hardware-specific optimization. By analyzing the basic structure of TVM and the basic process of neural network deployment on hardware, this work implements basic TVM support for the independently developed chip Matrix-DSP, which provides a foundation for further exploring the performance of the chip and enriching its application scenarios.


Introduction
In recent years, with the rapid development of hardware chips and the rapid improvement of chip performance, research on deep learning has attracted worldwide enthusiasm. Its accumulated theories and achievements in machine vision and artificial intelligence have been widely used in automobile manufacturing, manned spaceflight, medical diagnosis and many other fields. To make deep learning quickly applicable to real scenarios, in addition to further exploring the principles of neural networks and pursuing theoretical innovation, it is also necessary to solve the problems of rapid network model construction, deployment on a variety of hardware back-ends, and network acceleration. The emergence of deep learning frameworks has greatly reduced the difficulty of building network models, but their support for the variety of back-end hardware is incomplete, and adding a new hardware back-end often requires a great deal of repetitive work. At the same time, because frameworks differ in basic structure and application scenarios, a network model defined in one framework cannot be easily migrated to another framework for inference.
To address support for different network frameworks, support for new back-end hardware, and hardware acceleration, TVM [1] applies compiler techniques to the analysis of neural networks and their deployment on the back-end. At the top level, TVM accepts networks from various frameworks and, following the idea of decoupling, introduces intermediate representations of the code. It performs target-independent optimization (graph optimization) in the front-end and target-specific optimization (operator optimization) in the back-end, and then links to the compiler LLVM to obtain object code that runs efficiently on the target hardware. This article mainly discusses the implementation of back-end support for Matrix-DSP in TVM.

TVM overall architecture
In deep learning compilers [10] [2] such as TVM, the high-level intermediate representation (relay IR) resides in the front-end, and the low-level intermediate representation (tensor IR) resides in the back-end. Compared with a traditional compiler, TVM can better obtain the overall information of the application and perform optimizations specific to deep learning (such as graph optimization). At the same time, TVM supports mature compiler tool chains (such as LLVM [3]), so that the generation of instruction code for a specific hardware back-end can be handed over to an existing compiler, and TVM itself can concentrate on network model optimization and hardware-specific operator optimization. In general, the TVM front-end receives networks from different frameworks and, after format conversion, generates a target-independent calculation graph (relay IR). After graph optimization algorithms suited to deep learning have been applied to the relay IR, the graph is divided into tensor expressions that each describe the function of an operator. The back-end then optimizes the operators, now in tensor IR form, according to the characteristics of the target hardware, either manually or automatically (with the assistance of AutoTVM [4]), and finally connects to the compiler LLVM to generate high-performance code specific to the target hardware. Figure 1 shows the basic compilation architecture and compilation process of TVM.

Computation Graph and Optimization
Computation Graph

TVM front-end
The front-end first implements support for networks from various frameworks such as PyTorch [5] and TensorFlow [6]. After the network model has been parsed into a unified form, the network is expressed as a calculation graph in relay IR, and graph optimization algorithms are then applied to achieve target-independent optimization of the calculation graph. The calculation graph views the entire calculation process from a global perspective and does not need to specify how each operator is implemented.
In the calculation graph, nodes represent tensor operations or input tensors, and edges represent dependencies between nodes. TVM performs a large number of graph optimizations in the front-end. These optimizations can be roughly divided into three categories [9]: node level, local level and global level. Table 1 shows the basic optimization methods.
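As an illustration of nodes, edges and a target-independent graph rewrite, the following minimal Python sketch folds constant subexpressions in a toy calculation graph. It does not use relay; the `Node` class, the `fold_constants` pass and the example graph are hypothetical, and constant folding is chosen only as a familiar graph optimization, not as a statement of what Table 1 contains.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """A graph node: an input tensor, a constant, or an operation (toy model)."""
    name: str
    op: str                                          # "input", "const", "add" or "mul"
    inputs: List[str] = field(default_factory=list)  # edges: names of producer nodes
    value: float = 0.0                               # only meaningful when op == "const"


OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}


def fold_constants(graph: Dict[str, Node]) -> Dict[str, Node]:
    """Replace every operation whose inputs are all constants by a constant node.

    The graph dict is assumed to be in topological order (producers first).
    """
    folded: Dict[str, Node] = {}
    for name, node in graph.items():
        if node.op in OPS and all(folded[i].op == "const" for i in node.inputs):
            args = [folded[i].value for i in node.inputs]
            folded[name] = Node(name, "const", value=OPS[node.op](*args))
        else:
            folded[name] = node
    return folded


def evaluate(graph: Dict[str, Node], feeds: Dict[str, float], output: str) -> float:
    """Evaluate the graph node by node, following the dependency edges."""
    values: Dict[str, float] = {}
    for name, node in graph.items():
        if node.op == "input":
            values[name] = feeds[name]
        elif node.op == "const":
            values[name] = node.value
        else:
            values[name] = OPS[node.op](*(values[i] for i in node.inputs))
    return values[output]


# y = x + (c1 * c2); the product depends only on constants.
graph = {
    "x": Node("x", "input"),
    "c1": Node("c1", "const", value=2.0),
    "c2": Node("c2", "const", value=3.0),
    "k": Node("k", "mul", ["c1", "c2"]),
    "y": Node("y", "add", ["x", "k"]),
}
folded = fold_constants(graph)
```

The rewrite needs only the node types and the edges, not any information about the hardware, which is why this kind of optimization can be done before a target is chosen.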

TVM back-end
After the front-end relay IR has undergone graph optimization, the operators in the calculation graph are further parsed into operator expressions described in the tensor language. The optimization primitives provided by TVM can then be used to implement operator optimizations specific to the target hardware. The construction of TVM's tensor language and the implementation of tensor IR borrow the separation of computation and scheduling from the Halide [7] language, so that the nodes in the calculation graph do not need to be bound to their specific back-end implementation; knowing the type of each node is enough to complete global optimization. Once the back-end target string target has been identified, each operator can be mapped to its back-end-specific implementation and optimization, for which TVM provides a large number of optimization primitives. The logic code of an operator after optimization can be printed with tvm.lower( ) for debugging and inspection. This shows that the functional body of an operator is essentially a multi-level nested for loop, and that optimizing the operator amounts to transforming the tensor IR syntax tree. Since the basic structure of the syntax tree corresponds to the structure of the nested for loops, the function of the operator can be represented without an overly complicated syntax tree. After the tensor IR syntax tree has been constructed and optimized, TVM connects to the back-end compiler LLVM to generate code for the target hardware.
To avoid the unnecessary work of first converting the tensor IR syntax tree into a program in some language (such as C) and then having a compiler parse that program, TVM traverses the tensor IR syntax tree itself and calls LLVM library functions on each node, directly generating a module containing LLVM IR during the traversal. This module then passes through the LLVM back-end for IR optimization, instruction selection and other steps, and finally yields object code that runs efficiently on the target platform. To enable this process, support for Matrix-DSP is added to LLVM, including the basic hardware description, the description of the instruction set, and the instruction matching rules.
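The correspondence between a lowered operator and a nest of for loops, and the fact that a schedule only rearranges that nest, can be sketched without TVM. The classes `For` and `Store` and the helpers `build_nest` and `render` below are hypothetical stand-ins for tensor IR nodes and a printer. The printed text only imitates the style of a lowered loop nest and is not actual tvm.lower( ) output.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Store:
    """Leaf statement: write one element of the output buffer."""
    text: str


@dataclass
class For:
    """Loop node: iterate `var` over [0, extent) and execute `body`."""
    var: str
    extent: int
    body: Union["For", Store]


def build_nest(loops: List[Tuple[str, int]], store: Store) -> Union[For, Store]:
    """Wrap `store` in loops listed outermost first as (var, extent) pairs."""
    stmt: Union[For, Store] = store
    for var, extent in reversed(loops):
        stmt = For(var, extent, stmt)
    return stmt


def render(stmt: Union[For, Store], indent: int = 0) -> str:
    """Print the syntax tree as an indented loop nest."""
    pad = "  " * indent
    if isinstance(stmt, Store):
        return pad + stmt.text
    head = f"{pad}for ({stmt.var}, 0, {stmt.extent}) {{"
    return "\n".join([head, render(stmt.body, indent + 1), pad + "}"])


# Computation (what to compute): C[i, j] = A[i, j] + B[i, j] on a 4 x 8 tensor.
store = Store("C[i*8 + j] = A[i*8 + j] + B[i*8 + j]")

# Two schedules (how to compute it): same computation, different loop order.
default_nest = build_nest([("i", 4), ("j", 8)], store)
reordered_nest = build_nest([("j", 8), ("i", 4)], store)

print(render(default_nest))
print(render(reordered_nest))
```

Both trees contain the same `Store` leaf; only the loop nodes around it differ, which is the sense in which scheduling is separated from computation.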

Matrix-DSP analysis
Matrix-DSP is a high-performance DSP with SIMD+VLIW characteristics independently developed by the National University of Defense Technology. Figure 2 shows the main structure of its core. The instruction dispatch component extracts the instructions to be executed from the received instruction packet and sends them to the scalar component and the vector component for execution. The scalar processing unit (SPU) in the scalar component is responsible not only for scalar data operations, instruction flow control and the execution of serial tasks, but also for the control of the vector component. The scalar storage unit (SM) is mainly used for scalar data access, and the first-level cache (L1D) caches scalar data. The vector processing unit (VPU) in the vector component is mainly responsible for computationally intensive tasks. The VPU is composed of 16 homogeneous vector computing engines (VPE), and each VPE contains 3 floating-point multiply-accumulate units (MAC). The on-chip array memory (AM) provides 16-lane SIMD-wide vector data access, supports two vector memory-access instructions in parallel with DMA access operations, and provides a higher memory access bandwidth for the VPU.

Matrix-DSP back-end support implementation
As the analysis of the TVM compilation architecture in Section 2 shows, the main functions of the TVM front-end are parsing the network model and performing target-independent optimization of the calculation graph. The TVM back-end is therefore where most of the support for Matrix-DSP is implemented. This support mainly includes the following tasks: writing the operators and optimizing them for the hardware structure, supplementing the hardware-related interfaces in the compilation process, and calling LLVM library functions to implement LLVM IR generation.

Operator optimization method
To optimize operators according to hardware characteristics, TVM provides a rich set of optimization primitives, including loop-oriented primitives such as tile, split, reorder, unroll and parallel, and intrinsic primitives for mapping computation onto hardware-internal instructions. When applying optimization primitives for a specific hardware target, the following hardware information deserves special attention: the data layout in memory (row-major or column-major), the register and cache sizes of the hardware, and the vector length supported by the hardware. Optimizing an operator with these primitives improves its performance by changing the order of data access, reducing the number of cache misses, and making use of the vector acceleration components. Following the theory in the article and the characteristics of the Matrix-DSP structure, the operators can be optimized accordingly. Due to space limitations, the implementation of specific operator optimizations is not given here. Table 2 gives the functional description of some primitives.
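The effect of loop primitives such as split and reorder can be shown on a plain Python matrix multiplication. The sketch below is not TVM code and is not the optimization applied to Matrix-DSP; the function names and the tile size of 4 are assumptions chosen for illustration. It only demonstrates that splitting the i and j loops and reordering them changes the order of data access while leaving the result unchanged.

```python
from typing import List

Matrix = List[List[float]]


def matmul_naive(A: Matrix, B: Matrix, n: int) -> Matrix:
    """Reference loop nest: i, j, k from outermost to innermost."""
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C


def matmul_tiled(A: Matrix, B: Matrix, n: int, tile: int = 4) -> Matrix:
    """Same computation after 'split' of i and j by `tile` and a 'reorder'.

    Loop order becomes: i_outer, j_outer, k, i_inner, j_inner, so each
    tile x tile block of C is reused while k advances.
    """
    C = [[0.0] * n for _ in range(n)]
    for io in range(0, n, tile):                       # i split: outer part
        for jo in range(0, n, tile):                   # j split: outer part
            for k in range(n):                         # reduction axis moved outward
                for ii in range(io, min(io + tile, n)):        # i inner (with tail guard)
                    a = A[ii][k]
                    for ji in range(jo, min(jo + tile, n)):    # j inner (with tail guard)
                        C[ii][ji] += a * B[k][ji]
    return C


n = 6  # deliberately not a multiple of the tile size, to exercise the tail guards
A = [[float(i + j) for j in range(n)] for i in range(n)]
B = [[float(i * j % 5) for j in range(n)] for i in range(n)]

assert matmul_tiled(A, B, n) == matmul_naive(A, B, n)
```

In practice the tile size would be derived from the hardware information listed above, such as cache size and vector length; here it is arbitrary.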

Add call interface
The function tvm.build( ) in TVM is the core function for generating target machine code, and understanding its basic workflow helps in deciding where code must be added. Its inputs include the parameter S, which represents the operator schedule (operator optimization), and the string target, which names the target hardware. Given S, the function build( ) in the file build_module.py applies the corresponding schedule through the function lower( ) and converts the result into a LoweredFunc in tensor IR form. After the LoweredFunc has been generated, control passes to the function build( ) in codegen.cc, which is the real entry point for generating target hardware code from tensor IR. It parses the target string for the first time and prepares the back-end to generate a module on LLVM. Control then passes to the function init( ) in the file llvmmodule.cc, which traverses all LoweredFunc objects and parses the target string again in order to call the codegen class of the specified hardware, which translates the syntax tree corresponding to each LoweredFunc. The following shows part of the code added by the interface.
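Independently of the actual C++ code added for Matrix-DSP, the two-stage handling of the target string described above can be pictured with a small Python sketch. Everything in it is hypothetical: the registries, the class names, and the target string "llvm -target=matrix-dsp" are assumptions made for illustration and are not taken from TVM or from the implementation in this work.

```python
from typing import Callable, Dict, List, Tuple


class CodeGenGeneric:
    """Fallback code generator used when no hardware-specific class is registered."""
    name = "generic"

    def emit(self, func_name: str) -> str:
        return f"; [{self.name}] code for {func_name}"


class CodeGenMatrixDSP(CodeGenGeneric):
    """Hypothetical hardware-specific code generator."""
    name = "matrix-dsp"


# Second-stage registry: hardware name -> codegen class.
CODEGEN_CLASSES: Dict[str, type] = {
    "generic": CodeGenGeneric,
    "matrix-dsp": CodeGenMatrixDSP,
}


def parse_target(target: str) -> Tuple[str, Dict[str, str]]:
    """Split 'llvm -target=matrix-dsp' into ('llvm', {'target': 'matrix-dsp'})."""
    parts = target.split()
    options: Dict[str, str] = {}
    for opt in parts[1:]:
        key, _, value = opt.lstrip("-").partition("=")
        options[key] = value
    return parts[0], options


def build_llvm(funcs: List[str], target: str) -> List[str]:
    """Second stage: parse the target again and pick the hardware codegen class."""
    _, options = parse_target(target)
    cls = CODEGEN_CLASSES.get(options.get("target", "generic"), CodeGenGeneric)
    codegen = cls()
    return [codegen.emit(f) for f in funcs]      # translate every lowered function


# First-stage registry: compiler kind -> build function.
BUILD_FUNCS: Dict[str, Callable[[List[str], str], List[str]]] = {"llvm": build_llvm}


def build(funcs: List[str], target: str) -> List[str]:
    """First stage: parse the target once to choose the back-end build function."""
    kind, _ = parse_target(target)
    if kind not in BUILD_FUNCS:
        raise ValueError(f"no build function registered for '{kind}'")
    return BUILD_FUNCS[kind](funcs, target)


module = build(["vector_add"], "llvm -target=matrix-dsp")
print(module[0])
```

With a registry of this kind, adding a new hardware back-end amounts to registering one more codegen class, while the path through the two build stages stays unchanged.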

LLVM IR generation
The LoweredFunc produced when the function lower( ) processes a tensor expression exists in TVM in the form of a syntax tree. The syntax tree contains basic operation nodes such as the Add, Sub and Max nodes, control-flow nodes that carry branch and loop information such as the IfThenElse and For nodes, nodes used for vector optimization such as the Ramp node, and other nodes. In TVM, the functions VisitExpr_( ) and VisitStmt_( ) translate these nodes. They directly call LLVM library functions to generate LLVM IR statement by statement. The following shows examples of functions used to translate nodes.
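The visitor pattern behind this translation can be sketched in a few lines of Python. The sketch below is a toy and not TVM's implementation: TVM performs this step in C++ through LLVM library calls, whereas the hypothetical `ToyCodeGen` class here only appends text in LLVM IR syntax to a list. The node classes are simplified stand-ins that cover only Var, Add, Sub and Max on scalar floats.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Var:
    """Leaf node: a named scalar float value."""
    name: str


@dataclass
class BinaryOp:
    """Base class for two-operand expression nodes."""
    a: object
    b: object


class Add(BinaryOp):
    pass


class Sub(BinaryOp):
    pass


class Max(BinaryOp):
    pass


class ToyCodeGen:
    """Walks an expression tree and emits one IR line per visited operation."""

    def __init__(self) -> None:
        self.lines: List[str] = []
        self.counter = 0

    def fresh(self) -> str:
        """Return a new temporary register name."""
        self.counter += 1
        return f"%t{self.counter}"

    def visit(self, node) -> str:
        """Dispatch on the node type, as the VisitExpr_ overloads do in C++."""
        return getattr(self, "visit_" + type(node).__name__)(node)

    def visit_Var(self, node: Var) -> str:
        return "%" + node.name

    def _binary(self, opcode: str, node: BinaryOp) -> str:
        lhs, rhs = self.visit(node.a), self.visit(node.b)   # children first
        dst = self.fresh()
        self.lines.append(f"{dst} = {opcode} float {lhs}, {rhs}")
        return dst

    def visit_Add(self, node: Add) -> str:
        return self._binary("fadd", node)

    def visit_Sub(self, node: Sub) -> str:
        return self._binary("fsub", node)

    def visit_Max(self, node: Max) -> str:
        # Max has no single instruction here: compare, then select.
        lhs, rhs = self.visit(node.a), self.visit(node.b)
        cond = self.fresh()
        self.lines.append(f"{cond} = fcmp ogt float {lhs}, {rhs}")
        dst = self.fresh()
        self.lines.append(f"{dst} = select i1 {cond}, float {lhs}, float {rhs}")
        return dst


# max(a + b, a - b)
expr = Max(Add(Var("a"), Var("b")), Sub(Var("a"), Var("b")))
codegen = ToyCodeGen()
result = codegen.visit(expr)
print("\n".join(codegen.lines))
```

Each visit method returns the name of the value it produced, so a parent node can use the results of its children as operands. This mirrors how a real code generator passes LLVM values up the tree.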

Overall system test
To verify the correctness of the support implemented in TVM, a vector addition written with tensor expressions is selected for testing, and the function get_source( ) provided by TVM is used to print the LLVM IR and the assembly code generated for Matrix-DSP. Examples and results are shown in the following.
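For checking the numerical result of such a test on the host side, a reference model of the vector addition can be useful. The Python sketch below is an assumption-laden illustration and not the test used in this work: it does not call TVM, the function names are hypothetical, and the lane count of 16 is chosen only because the VPU described earlier has 16 VPEs. It processes the input in 16-element chunks with a scalar tail and compares the result with a plain element-wise addition.

```python
from typing import List


def vector_add_scalar(a: List[float], b: List[float]) -> List[float]:
    """Element-by-element reference result."""
    return [x + y for x, y in zip(a, b)]


def vector_add_lanes(a: List[float], b: List[float], lanes: int = 16) -> List[float]:
    """Model of a SIMD-style addition: full chunks of `lanes` elements, then a tail."""
    if len(a) != len(b):
        raise ValueError("input vectors must have the same length")
    n = len(a)
    out = [0.0] * n
    main = n - n % lanes                       # number of elements in full chunks
    for base in range(0, main, lanes):
        # One conceptual vector instruction: all lanes of the chunk are added together.
        out[base:base + lanes] = [a[base + l] + b[base + l] for l in range(lanes)]
    for i in range(main, n):                   # remaining elements handled one by one
        out[i] = a[i] + b[i]
    return out


n = 40  # two full 16-lane chunks plus a tail of 8 elements
a = [float(i) for i in range(n)]
b = [2.0 * i for i in range(n)]

assert vector_add_lanes(a, b) == vector_add_scalar(a, b)
```

Comparing the output produced on the device against a reference of this kind checks functional correctness only; it says nothing about the quality of the generated LLVM IR or assembly, which still has to be inspected through get_source( ).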