CFD Problems Solving Parallel Approaches on Supercomputers

This work is devoted to developing and testing parallel algorithms and a suite of computer programs for the numerical solution of CFD problems on modern supercomputers. The paper summarizes our experience in solving various practical problems of gas dynamics. These problems include viscous gas flows around bodies of complex shape with radiation processes near the body surface, jet streams in open space, and flows in microchannels of technical systems. We have developed approaches that combine unstructured grids, the domain decomposition technique, and parallel implementation via MPI, OpenMP and CUDA. Calculations are performed on heterogeneous computer systems that contain Graphical Processing Units (GPUs) alongside classical microprocessors (Central Processing Units, CPUs).


Introduction
Computer and supercomputer hardware is currently growing rapidly in performance. For example, some mobile or desktop systems that use a GPU deliver on the order of 1 TFLOPS or more. The peak performance of the most powerful supercomputer system reaches about 50 PFLOPS (Guangzhou, China; 1 PFLOPS means 10^15 floating point operations per second, 1 TFLOPS = 0.001 PFLOPS, 1000 PFLOPS = 1 EFLOPS) [1]. Supercomputer technologies drive progress in both the defense and civilian sectors of science and technology. Ultra-high-performance systems allow us to simulate natural and technological processes in fine detail, to consider factors of different nature simultaneously, and to analyze numerous possible variants quickly. We can therefore obtain highly accurate information about the studied process, as close as possible to real-life situations. In fundamental science, supercomputer technologies make it possible to explore phenomena that were previously inaccessible to detailed modeling, such as problems in astrophysics, direct modeling of turbulence, genome deciphering and others.
Recently, a new methodology for combining supercomputer modeling and experiment has been taking shape. In this methodology, calculated data are compared in detail with experimental data. As a result of this comparison, reliable mathematical models and computer programs are developed, and calculations subsequently replace the experiment: further investigations are performed as computational experiments. This raises the problem of storing and analyzing huge amounts of data. The computational complexity of the problem then grows several-fold and non-linearly, depending on the number of factors taken into account. Examples of such problems include the analysis of the situation in near-Earth space (space debris) and the analysis of traffic in cities.
Heterogeneous computing systems with a performance of about 5-10 PFLOPS have recently become widespread. To reduce cost and power consumption, the main processing power of such systems is provided, together with conventional Central Processing Units (CPUs), by special accelerators: Vector Processing Units (VPUs) or Graphical Processing Units (GPUs). However, the combined use of CPUs with GPUs or VPUs is complicated by a number of problems. For example, a VPU or GPU has its own memory, which is separate from the CPU memory address space. Applications must therefore explicitly copy data into the VPU or GPU memory and back. This communication is slow and inefficient. Heterogeneous systems with coherent memory have been introduced but are not yet in widespread use. Thus, the development of special parallel approaches to calculations on heterogeneous systems with different topologies and architectures remains relevant [2,3].
This paper presents a parallel technology for solving Computational Fluid Dynamics (CFD) problems on supercomputers with hybrid computing nodes that include CPUs and VPUs or GPUs. The technology assumes that the discrete mathematical model of the CFD process uses an unstructured mesh and can be strongly spatially inhomogeneous. In this case, the computational load must be balanced across nodes and across the computing devices inside a node. To solve this problem, dynamic load balancing algorithms and block methods of data exchange among the memories of the different devices are proposed.
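The idea of dynamic load balancing between devices of different speed can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes that the cost of a time step is proportional to the number of mesh cells on a device, and the function names `proportional_split` and `rebalance` are hypothetical. Cells are first split in proportion to assumed device speeds and then re-split using the speeds measured on the previous time step.

```python
def proportional_split(n_cells, speeds):
    """Split n_cells among devices in proportion to their speeds (cells per second)."""
    total = sum(speeds)
    shares = [int(n_cells * s / total) for s in speeds]
    # Cells lost to rounding down are handed to the fastest devices first.
    leftover = n_cells - sum(shares)
    fastest_first = sorted(range(len(speeds)), key=lambda i: -speeds[i])
    for i in fastest_first[:leftover]:
        shares[i] += 1
    return shares


def rebalance(shares, step_times):
    """Re-split the same cells using the speeds measured on the last time step."""
    speeds = [n / t for n, t in zip(shares, step_times)]
    return proportional_split(sum(shares), speeds)


# Two devices start with equal shares; the second one turns out to be 4x faster.
initial = proportional_split(1000, [1.0, 1.0])
balanced = rebalance(initial, [1.0, 0.25])
print(initial, balanced)
```

In a real code the cell counts would feed a mesh partitioner rather than a simple list of integers; the sketch only shows how measured step times can drive the redistribution.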

Problem definition
To investigate different CFD problems we used the system of Quasi Gas Dynamics (QGD) equations [4] and its approximations on hybrid meshes, that is, adaptive, locally refined meshes that include both rectangular and triangular cells. The QGD system of equations is a generalization of the Navier-Stokes model that includes additional dissipative terms with a small parameter τ acting as an artificial regularizer.
The QGD system has the form of conservation laws and is written in general form as equations (1)-(6), together with an expression for the mass flux density vector. Here μ is the viscosity coefficient, M is the molar mass, R is the universal gas constant, I is the unit matrix, F is the vector of external forces, Q is the power density of external energy sources, ρ is the density, u is the velocity vector, and Z is the compressibility factor; the system also involves the inner product and the heat capacity at constant volume. System (1)-(6) is completed by the equations of state for the gas, suitable boundary conditions, and expressions for the coefficients of viscosity and heat conductivity and for the coefficient τ.
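For orientation, the structure of a QGD system can be sketched in the form commonly given in the QGD literature. This is an assumption-laden illustration, not the paper's own equations (1)-(6): external forces and energy sources are omitted, τ denotes the small regularization parameter, and the symbols j_m, w, Π, q, E and p (mass flux density, velocity correction, viscous stress tensor, heat flux, total energy density, pressure) are notation chosen here.

```latex
\begin{aligned}
&\frac{\partial \rho}{\partial t} + \nabla\cdot\mathbf{j}_m = 0,\\
&\frac{\partial (\rho\mathbf{u})}{\partial t}
  + \nabla\cdot(\mathbf{j}_m\otimes\mathbf{u}) + \nabla p = \nabla\cdot\Pi,\\
&\frac{\partial E}{\partial t}
  + \nabla\cdot\Big(\mathbf{j}_m\,\frac{E+p}{\rho}\Big) + \nabla\cdot\mathbf{q}
  = \nabla\cdot(\Pi\cdot\mathbf{u}),\\
&\mathbf{j}_m = \rho\,(\mathbf{u}-\mathbf{w}),\qquad
 \mathbf{w} = \frac{\tau}{\rho}\big[\nabla\cdot(\rho\,\mathbf{u}\otimes\mathbf{u}) + \nabla p\big].
\end{aligned}
```

In this form Π and q are the Navier-Stokes stress tensor and heat flux plus additional terms proportional to τ. Setting τ = 0 gives w = 0 and j_m = ρu, so the system reduces to the Navier-Stokes equations, which is the sense in which QGD generalizes that model.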
For special problems the QGD system is supplemented by additional terms. In this paper we consider two applications: radiation transfer and gas mixture flows.

Radiative Gas Dynamics
Many space programs are under way today. They include launches of interplanetary stations (ExoMars 2016) and flights to the International Space Station (MKS, Soyuz Flight VS14) [5]. Scientific research in this area is therefore highly relevant.
Returning to Earth after a space flight is a fundamental problem that requires taking into account gas-dynamic flow processes and radiation effects. Flows around a re-entry vehicle entering the atmosphere are a traditional subject of high-temperature gas dynamics because overheating has proved to be a serious problem.
Developing thermal insulation is an important aspect of spacecraft design. It requires accurate prediction of the parameters of the surrounding gas. An accurate description of the aerothermal environment is needed to minimize the weight and cost of the Thermal Protection System. Test facilities cannot reproduce the high-enthalpy flow and flight tests are prohibitively expensive, so a computational method, namely CFD, is used to describe the flow.
The Radiative Gas Dynamics (RGD) approach is used for the computations. The model includes the QGD system of equations and the radiation transfer equations. Radiation transfer is modeled in the diffusion approximation [6][7][8][9]. The diffusion model is coupled with the gas dynamic model by including the radiation energy in the energy equation (3', 7, 8), where ν is the photon frequency, Ω is the direction of photon propagation, I_ν is the radiation intensity at a specific frequency, I_νp is the intensity of the equilibrium radiation, and χ_ν is the absorption coefficient. The diffusion equation (8) can correspond either to individual spectral lines or to a number of frequency intervals [6,7]. We use the second approach and set 600 frequency groups for air. Tabulated data from [...]
For the QGD equations an explicit-in-time conservative numerical scheme was developed. Explicit schemes are very convenient for parallel computing. The main parallelization technique is domain decomposition. Communication among cluster nodes is performed with the MPI library, and OpenMP is used within a node.
For RGD calculations we implemented parallelization by groups: the linear systems of each spectral group are solved by their own CPU cores, and then the independent results are summed. The advantages of this approach are simple implementation, high computation speed, and free choice of the linear solver and preconditioner. The disadvantages are the inability to use more CPU cores than the number of spectral groups (600) and the absence of load balancing. Another option is to use parallel algorithms for solving the linear systems. This allows us to increase the number of CPU cores, but the efficiency of these methods decreases significantly as the number of cores grows.
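Parallelization by spectral groups, and its two stated limitations, can be sketched as follows. This is an illustration only: the round-robin assignment, the assumption that every group costs the same, and the stand-in `solve_group` (which returns a dummy number instead of solving a linear system) are assumptions made here, not details taken from the authors' code.

```python
import math


def assign_groups(n_groups, n_cores):
    """Round-robin assignment: core r gets groups r, r + n_cores, ..."""
    return [list(range(r, n_groups, n_cores)) for r in range(n_cores)]


def solve_group(g):
    """Hypothetical stand-in for solving the linear system of spectral group g."""
    return 1.0 / (g + 1)


def summed_result(n_groups, n_cores):
    """Each core sums its own groups; the partial sums are then reduced."""
    partial = [sum(solve_group(g) for g in groups)
               for groups in assign_groups(n_groups, n_cores)]
    return math.fsum(partial)


def ideal_efficiency(n_groups, n_cores):
    """Upper bound on efficiency when every group costs the same."""
    rounds = math.ceil(n_groups / n_cores)  # groups handled by the busiest core
    return n_groups / (n_cores * rounds)


for cores in (400, 600, 800):
    idle = sum(1 for groups in assign_groups(600, cores) if not groups)
    print(cores, "cores:", idle, "idle, efficiency bound", ideal_efficiency(600, cores))
```

With 600 groups, any core beyond the 600th receives no work, and a core count that does not divide 600 evenly leaves some cores waiting for the busiest one. In an MPI code the final summation would typically be a reduction over all ranks.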
The general parallel algorithm combines the advantages of the two previous methods. All available CPU cores are divided into groups, and each group processes its own set of linear systems. A parallel linear solver is used within each group, and all data are stored in distributed form.
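The two-level layout described above can be sketched as a mapping from core ranks to teams and from teams to spectral groups. The names `team_layout` and `team_of` and the round-robin distribution of groups are assumptions for illustration; in an MPI implementation the teams would naturally correspond to sub-communicators, for example ones created with `MPI_Comm_split` using the team index as the color.

```python
def team_of(rank, team_size):
    """Index of the team that a given core rank belongs to."""
    return rank // team_size


def team_layout(n_cores, team_size, n_groups):
    """For each team, list its core ranks and the spectral groups it owns."""
    n_teams = n_cores // team_size
    layout = []
    for t in range(n_teams):
        cores = list(range(t * team_size, (t + 1) * team_size))
        groups = list(range(t, n_groups, n_teams))  # round-robin over teams
        layout.append({"cores": cores, "groups": groups})
    return layout


# 600 spectral groups, teams of 8 cores, on 600 and on 1200 cores.
for n_cores in (600, 1200):
    layout = team_layout(n_cores, 8, 600)
    print(n_cores, "cores:", len(layout), "teams,",
          len(layout[0]["groups"]), "groups per team")
```

Each team solves its linear systems with a parallel solver over its own cores, so the core count is no longer capped by the number of spectral groups, while the solver itself only has to scale to the team size.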

Gas mixture flows
Another problem to which we applied the developed parallel approach is the modeling of gas mixture flows in microchannels of technical systems. Here we present a study of the evolution of a binary supersonic gas jet propagating into a vacuum. The computational complexity of such flows arises because the continuum hypothesis is violated in some parts of the computational domain.
A multiscale approach is one way to overcome the breakdown of the continuum hypothesis [15][16][17][18][19][20]. We propose a method that integrates a macroscopic approach and the method of molecular dynamics (MMD) [21-23]. The macroscopic approach is based on the QGD system [4]. The flow parameters are corrected with MMD, so the system of molecular dynamics equations is used as a subgrid algorithm. As the first step of MMD we assume that the molecules are distributed uniformly in space and that their velocities follow the Maxwell equilibrium distribution. The computational domain is shown in Figure 3.
The QGD equations of binary gas mixture dynamics are written in the form (1)-(3), where F and Q are given by formulas (9, 10). The mixture consists of two gases, a and b, each with its own number density, mass density, velocity vector and total energy density. The exchange terms (5)-(6) allow us to take into account the redistribution of momentum and energy between the components of the mixture. These parameters will be determined through MMD below.
[Table 2]
The parallel algorithm is based on the domain decomposition technique and on splitting into physical processes. Splitting into physical processes means that the QGD equations (gas dynamics) are processed by CPUs, while MD is processed by CPUs, VPUs or GPUs. For MMD parallelization an equal number of particles is sent to each computing unit. According to the algorithm, the molecular dynamics calculations do not require exchanges.
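The initial MMD state described above (uniform positions, Maxwell-distributed velocities) can be sketched in a few lines. The box size, the temperature of 300 K, the choice of nitrogen and the function name `init_molecules` are assumptions for illustration. The sketch relies on the standard fact that in a Maxwell equilibrium distribution each velocity component is normally distributed with zero mean and standard deviation sqrt(k_B T / m).

```python
import math
import random

K_B = 1.380649e-23                    # Boltzmann constant, J/K
N2_MASS = 28.0134 * 1.66053907e-27    # mass of an N2 molecule, kg


def init_molecules(n, box, temperature, mass, seed=0):
    """Uniform positions in a box and Maxwell-distributed velocities.

    box is (Lx, Ly, Lz) in meters; returns lists of 3-tuples.
    """
    rng = random.Random(seed)
    sigma = math.sqrt(K_B * temperature / mass)  # std of each velocity component
    positions = [tuple(rng.uniform(0.0, side) for side in box) for _ in range(n)]
    velocities = [tuple(rng.gauss(0.0, sigma) for _ in range(3)) for _ in range(n)]
    return positions, velocities


# Assumed example: nitrogen at 300 K in a 30 x 30 x 90 micron box.
pos, vel = init_molecules(1000, (30e-6, 30e-6, 90e-6), 300.0, N2_MASS)
speeds = [math.sqrt(vx * vx + vy * vy + vz * vz) for vx, vy, vz in vel]
print("mean speed, m/s:", sum(speeds) / len(speeds))
```

Because every molecule is initialized independently, this step needs no communication, which is consistent with sending an equal number of particles to each computing unit.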
Domain decomposition of the computational domain is used for the QGD equations. We take into account four levels of parallelization: nodes of the system → CPUs → CPU cores → CPU threads. A node and its GPU or VPU get the same subdomain. The GPU or VPU computes the exchange terms. If no GPU or VPU is available, these calculations are performed by CPU threads.
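Why an explicit scheme combines well with domain decomposition can be shown on a toy problem. The sketch below uses a 1D explicit diffusion step as a stand-in for the QGD scheme (an assumption made for brevity). Each subdomain only needs one ghost value from each neighbor, which is the data an MPI exchange would deliver, and the decomposed step reproduces the global one exactly.

```python
def explicit_step(u, alpha):
    """One explicit diffusion step; the first and last values are kept fixed."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = u[i] + alpha * (u[i - 1] - 2.0 * u[i] + u[i + 1])
    return new


def decomposed_step(u, alpha, n_parts):
    """The same step computed subdomain by subdomain with ghost cells."""
    n = len(u)
    bounds = [n * p // n_parts for p in range(n_parts + 1)]
    result = []
    for p in range(n_parts):
        lo, hi = bounds[p], bounds[p + 1]
        # Ghost cells: one value from each neighboring subdomain, if there is one.
        glo, ghi = max(lo - 1, 0), min(hi + 1, n)
        local = explicit_step(u[glo:ghi], alpha)
        # Keep only the cells this subdomain owns.
        result.extend(local[lo - glo: lo - glo + (hi - lo)])
    return result


u0 = [float(i * i % 7) for i in range(20)]
print(decomposed_step(u0, 0.25, 4) == explicit_step(u0, 0.25))
```

The same pattern applies at each of the four levels listed above: a node's subdomain can be split again among CPUs, cores and threads, with ghost data exchanged only across the boundaries of each piece.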
MD calculations are parallelized by splitting the set of molecules into groups of equal size. Each GPU or VPU processes several molecular groups.
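A minimal sketch of this splitting is given below. The group size, the number of devices and the round-robin assignment are assumptions for illustration, and the function names are hypothetical. Groups are represented by index ranges, so no molecule data are copied when the groups are formed.

```python
def split_into_groups(n_molecules, group_size):
    """Index ranges of equal-size molecular groups."""
    if n_molecules % group_size != 0:
        raise ValueError("group_size must divide n_molecules for equal groups")
    return [range(start, start + group_size)
            for start in range(0, n_molecules, group_size)]


def assign_to_devices(groups, n_devices):
    """Round-robin: device d processes groups d, d + n_devices, ..."""
    return [groups[d::n_devices] for d in range(n_devices)]


# Assumed example: 12000 molecules, groups of 1000, three accelerators per node.
groups = split_into_groups(12000, 1000)
per_device = assign_to_devices(groups, 3)
for d, device_groups in enumerate(per_device):
    print("device", d, "gets", len(device_groups), "groups,",
          sum(len(g) for g in device_groups), "molecules")
```

Giving each device several smaller groups rather than one large one keeps every batch the same size, which simplifies moving batches between devices if their speeds differ.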

Results
To solve the problems described above, we built a parallel software suite based on hybrid technology that combines MPI, OpenMP and CUDA. The parallel techniques we developed were tested on various computing systems. To demonstrate hybrid calculations, two systems were used: the K-100 hybrid cluster with CPUs and GPUs at the Keldysh Institute of Applied Mathematics RAS (Moscow), and the MVS-10P supercomputer, based on CPUs and VPUs, at the Joint Supercomputer Center of RAS. The results for each of these problems are presented below.

Radiative Gas Dynamic computing
Earlier, we used two parallel systems, SKIF MSU and MVS-100k, to solve the RGD problem. The efficiency of group parallelization on 600 CPU cores was 84%. Hybrid parallelization (groups consisting of 8 CPU cores) reached 79% efficiency on 600 CPU cores and 66% efficiency on 1200 cores. Now we use the petaflops-class system MVS-10P with CPU and VPU devices. The system has 207 nodes. Each node has two Intel Xeon E5-2690 CPUs (8 cores and 16 threads per CPU) and two Intel Xeon Phi 7110X VPUs (61 cores and 244 threads per VPU). The parallelization scheme is based on the domain decomposition technique, and the implementation uses MPI and OpenMP. For parallelization over spectral groups we used CPU or VPU threads. The results of testing our parallel software on a 3D spatial grid with 48·10^6 hybrid cells are shown in Figure 4. These results confirm that the CPU calculations are more efficient. However, using all devices on each node minimizes the computation time. In summary, the capabilities of modern hybrid computer systems make it possible to solve large RGD problems.
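The efficiency figures quoted above can be turned into speedups with a short calculation. It assumes the usual definition of parallel efficiency as speedup divided by the number of cores; the paper does not state its definition explicitly. The hardware thread count of an MVS-10P node follows directly from the per-device thread numbers given above.

```python
def speedup(efficiency, n_cores):
    """Speedup implied by an efficiency, assuming efficiency = speedup / n_cores."""
    return efficiency * n_cores


runs = [
    ("group parallelization", 600, 0.84),
    ("hybrid, teams of 8 cores", 600, 0.79),
    ("hybrid, teams of 8 cores", 1200, 0.66),
]
for name, cores, eff in runs:
    print(f"{name:26s} {cores:5d} cores  speedup ~ {speedup(eff, cores):.0f}")

# Hardware threads on MVS-10P: two CPUs (16 threads each) and
# two VPUs (244 threads each) per node, 207 nodes.
threads_per_node = 2 * 16 + 2 * 244
total_threads = 207 * threads_per_node
print("threads per node:", threads_per_node, " total:", total_threads)
```

Under this definition, doubling the core count from 600 to 1200 in hybrid mode still increases the speedup, even though the efficiency drops from 79% to 66%.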

Calculation of gas mixture flows
Calculations of gas mixture flows were carried out on the K-100 hybrid cluster with CPUs and GPUs. Each node of the cluster has two Intel Xeon X CPUs (6 cores and 12 threads per CPU) and three NVIDIA Tesla C2050 GPUs (448 CUDA cores per GPU). We tested our software on a 3D Cartesian grid of 240×240×720 cells. The calculations concerned the evolution of a nitrogen and hydrogen jet in a nickel microchannel of 30×30×90 microns. Depending on the physical conditions, each grid cell can contain 500 to 50,000 gas particles. The results of testing are shown in Figure 5. These results also confirm that using CPU systems is more efficient. However, using GPUs can reduce the calculation time by a factor of 6 to 10.
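The scale of this test can be estimated from the numbers above. The sketch computes the number of grid cells, the cell size, and lower and upper bounds on the total particle count. The bounds assume that every cell holds the minimum or the maximum number of particles, which is a simplification.

```python
# Grid and channel dimensions quoted in the text.
nx, ny, nz = 240, 240, 720            # cells
lx, ly, lz = 30.0, 30.0, 90.0         # microns

n_cells = nx * ny * nz
cell_size = (lx / nx, ly / ny, lz / nz)   # microns per cell in each direction

# Bounds on the total number of particles (500 to 50000 per cell).
min_particles = 500 * n_cells
max_particles = 50000 * n_cells

print("cells:", n_cells)
print("cell size, microns:", cell_size)
print(f"particles: {min_particles:.3e} to {max_particles:.3e}")
```

The grid is uniform, with cubic cells, and the molecular dynamics part of the computation involves several orders of magnitude more objects than the gas dynamic part, which is why it is the part offloaded to accelerators.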