# Benchmarks

The performance of the library on shared and distributed memory systems was tested on two example problems in three-dimensional space: a simple scalar Poisson problem and a non-scalar Navier-Stokes problem. The source code for the benchmarks is available at https://github.com/ddemidov/amgcl_benchmarks.

The first example we consider is the classical 3D Poisson problem. Namely, we look for the solution of the problem

in the unit cube \(\Omega = [0,1]^3\) with homogeneous Dirichlet boundary conditions. The problem is discretized with the finite difference method on a uniform mesh.
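The finite difference discretization described above can be sketched as follows. This is a minimal illustration, not the benchmark code: the source does not state which stencil, scaling, or storage format is used, so the standard second-order 7-point stencil, the \(1/h^2\) scaling, and the hypothetical helper `poisson3d` are all assumptions.

```python
def poisson3d(n):
    """Assemble the 7-point finite-difference Laplacian on an n*n*n grid of
    interior points of the unit cube, in CSR format (ptr, col, val).

    Assumption: homogeneous Dirichlet values are eliminated, so couplings
    to boundary points are dropped. The matrix is scaled by 1/h^2.
    """
    h = 1.0 / (n + 1)            # mesh spacing, boundary points excluded
    scale = 1.0 / (h * h)
    ptr, col, val = [0], [], []

    def idx(i, j, k):
        # Lexicographic numbering of grid point (i, j, k).
        return (k * n + j) * n + i

    # Stencil entries listed in ascending column order.
    stencil = (
        (0, 0, -1, -1.0), (0, -1, 0, -1.0), (-1, 0, 0, -1.0),
        (0, 0, 0, 6.0),
        (1, 0, 0, -1.0), (0, 1, 0, -1.0), (0, 0, 1, -1.0),
    )

    for k in range(n):
        for j in range(n):
            for i in range(n):
                for di, dj, dk, v in stencil:
                    ii, jj, kk = i + di, j + dj, k + dk
                    if 0 <= ii < n and 0 <= jj < n and 0 <= kk < n:
                        col.append(idx(ii, jj, kk))
                        val.append(v * scale)
                ptr.append(len(col))
    return ptr, col, val


ptr, col, val = poisson3d(3)
print("rows:", len(ptr) - 1, "nonzeros:", ptr[-1])
```

For an \(n^3\) grid this produces \(7n^3 - 6n^2\) nonzeros, since every interior grid line loses two couplings at its ends.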

The second test problem is an incompressible 3D Navier-Stokes problem discretized on a non-uniform 3D mesh with a finite element method:

The discretization uses equal-order tetrahedral finite elements stabilized with an ASGS-type (algebraic subgrid-scale) approach. This results in a linear system of equations with a block structure of the type

where each of the matrix subblocks is a large sparse matrix, and the blocks \(\mathbf G\) and \(\mathbf D\) are non-square. The overall system matrix for the problem was assembled in the Kratos multi-physics package developed in CIMNE, Barcelona.

## Distributed memory benchmarks

Support for distributed memory systems in AMGCL is implemented using the subdomain deflation method. Here we demonstrate the performance and scalability of the approach on the example of a Poisson problem and a Navier-Stokes problem in three-dimensional space. To provide a reference, we compare the performance of the AMGCL library with that of the well-known Trilinos ML package.

The benchmarks were run on the MareNostrum 4 and PizDaint clusters, which we gained access to via the PRACE program (project 2010PA4058). The MareNostrum 4 cluster has 3456 compute nodes, each equipped with two 24-core Intel Xeon Platinum 8160 CPUs and 96 GB of RAM. The peak performance of the cluster is 6.2 Petaflops. The PizDaint cluster has 5320 hybrid compute nodes, where each node has one 12-core Intel Xeon E5-2690 v3 CPU with 64 GB of RAM and one NVIDIA Tesla P100 GPU with 16 GB of RAM. The peak performance of the PizDaint cluster is 25.3 Petaflops.

### 3D Poisson problem

The AMGCL implementation uses a BiCGStab(2) iterative solver preconditioned with subdomain deflation. Smoothed aggregation AMG is used as the local preconditioner. The Trilinos implementation uses a CG solver preconditioned with smoothed aggregation AMG with default settings.

The figure below shows weak scaling of the solution on the MareNostrum 4 cluster. Here the problem size is chosen to be proportional to the number of CPU cores, with about \(100^3\) unknowns per core. The rows in the figure from top to bottom show total computation time, time spent on constructing the preconditioner, solution time, and the number of iterations. The AMGCL library results are labelled ‘OMP=n’, where n=1,4,12,24 is the number of OpenMP threads controlled by each MPI process. The Trilinos library uses single-threaded MPI processes. The Trilinos data is only available for up to 768 MPI processes, because the library runs out of memory for larger configurations. The AMGCL data points for 19200 cores with ‘OMP=1’ are missing for the same reason. AMGCL plots in the left and the right columns correspond to the linear deflation and the constant deflation, respectively.
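To make the weak scaling setup concrete, the sketch below relates the core count and the ‘OMP=n’ label to the number of MPI processes and the global problem size. It assumes one OpenMP thread per core and uses the rounded figure of \(100^3\) unknowns per core; `weak_scaling_layout` is a hypothetical helper, not part of the benchmark scripts.

```python
def weak_scaling_layout(cores, omp_threads, unknowns_per_core=100**3):
    """Return (MPI processes, total unknowns) for one weak scaling run.

    Assumes every core runs exactly one OpenMP thread, so the number of
    MPI processes is cores / omp_threads.
    """
    if cores % omp_threads:
        raise ValueError("cores must be divisible by the thread count")
    return cores // omp_threads, cores * unknowns_per_core


for omp in (1, 4, 12, 24):
    procs, size = weak_scaling_layout(19200, omp)
    print(f"OMP={omp:2d}: {procs:5d} MPI processes, {size:.2e} unknowns")
```

At a fixed core count, a larger thread count means fewer MPI processes, while the global problem size stays the same.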


In the case of ideal scaling, the timing plots in this figure would be strictly horizontal. This is not the case here: instead, we see that AMGCL loses about 6-8% efficiency whenever the number of cores doubles. This, however, is much better than what we managed to obtain for the Trilinos library, which loses about 36% at each step.
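A small worked example shows how such per-doubling losses add up. It assumes that the loss compounds multiplicatively from one doubling to the next and uses 7% as the midpoint of the 6-8% range; both are simplifications for illustration, not measured values.

```python
def efficiency_after(doublings, loss_per_doubling):
    """Remaining parallel efficiency after a number of core-count doublings,
    assuming the same relative loss at every doubling."""
    return (1.0 - loss_per_doubling) ** doublings


for k in range(1, 6):
    print(f"{k} doublings: "
          f"7% loss -> {efficiency_after(k, 0.07):.2f}, "
          f"36% loss -> {efficiency_after(k, 0.36):.2f}")
```

Under this model, five doublings leave about 70% efficiency with a 7% loss per step, but only about 11% with a 36% loss per step.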

If we look at the AMGCL results for the linear deflation alone, we can see that the ‘OMP=1’ line stops scaling properly at 1536 cores, and ‘OMP=4’ loses scalability at 6144 cores. We refer to the following table for the explanation.

Cores | Setup: Total | Setup: Factorize E | Solve: Total | Solve: RHS for E | Solve: Solve E | Iterations |
---|---|---|---|---|---|---|
Linear deflation, OMP=1 | | | | | | |
384 | 3.33 | 0.04 | 49.35 | 0.82 | 0.08 | 76 |
1536 | 5.12 | 1.09 | 52.13 | 1.83 | 0.80 | 76 |
6144 | 20.39 | 15.42 | 79.23 | 31.81 | 4.30 | 54 |
Constant deflation, OMP=1 | | | | | | |
384 | 2.88 | 0.00 | 58.52 | 0.73 | 0.01 | 98 |
1536 | 3.80 | 0.02 | 74.42 | 2.51 | 0.10 | 118 |
6144 | 5.31 | 0.24 | 130.76 | 63.52 | 0.52 | 90 |
Linear deflation, OMP=4 | | | | | | |
384 | 3.86 | 0.00 | 49.90 | 0.15 | 0.01 | 74 |
1536 | 6.68 | 0.05 | 64.91 | 0.66 | 0.13 | 96 |
6144 | 7.36 | 0.76 | 60.74 | 2.87 | 0.79 | 82 |
19200 | 59.72 | 51.11 | 105.96 | 30.86 | 9.54 | 84 |
Constant deflation, OMP=4 | | | | | | |
384 | 3.97 | 0.00 | 65.11 | 0.30 | 0.00 | 104 |
1536 | 6.73 | 0.00 | 76.44 | 1.01 | 0.01 | 122 |
6144 | 7.57 | 0.02 | 100.39 | 4.30 | 0.10 | 148 |
19200 | 10.08 | 0.74 | 125.41 | 48.67 | 0.83 | 106 |

The table presents profiling data for the solution of the Poisson problem on the MareNostrum 4 cluster. The ‘Setup’ and ‘Solve’ columns show the time spent on setting up the preconditioner and on solving the problem; the ‘Iterations’ column shows the number of iterations required for convergence. The ‘Setup’ and ‘Solve’ columns are further split into subcolumns detailing the time required for factorization and solution of the coarse system. It is apparent from the table that weak scalability is affected by two factors. First, factorization of the coarse (deflated) matrix E starts to dominate the setup phase as the number of subdomains (or MPI processes) grows, since we use a sparse direct solver for the coarse problem. The second factor is the solution of the coarse problem, which in our experiments is dominated by communication: most of the coarse solve time is spent on gathering the right-hand side of the deflated problem for solution on the master MPI process.
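Both effects can be read directly from the ‘Linear deflation, OMP=1’ rows of the table above. The sketch below computes the share of the setup time spent on factorizing E and the share of the solve time spent on the coarse problem (‘RHS for E’ plus ‘Solve E’). The numbers are copied from the table; `coarse_shares` is a hypothetical helper introduced only for this illustration.

```python
# (cores, setup total, factorize E, solve total, RHS for E, solve E)
linear_omp1 = [
    (384, 3.33, 0.04, 49.35, 0.82, 0.08),
    (1536, 5.12, 1.09, 52.13, 1.83, 0.80),
    (6144, 20.39, 15.42, 79.23, 31.81, 4.30),
]


def coarse_shares(row):
    """Return (factorization share of setup, coarse share of solve)."""
    _, setup, factorize, solve, rhs, solve_e = row
    return factorize / setup, (rhs + solve_e) / solve


for row in linear_omp1:
    setup_share, solve_share = coarse_shares(row)
    print(f"{row[0]:5d} cores: factorization {setup_share:6.1%} of setup, "
          f"coarse problem {solve_share:6.1%} of solve")
```

Between 384 and 6144 cores the factorization share of the setup grows from about 1% to about 76%, and the coarse problem share of the solve grows from about 2% to about 46%, almost all of it spent on gathering the right-hand side.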

The constant deflation scales better since the deflation matrix is four times smaller than in the corresponding linear deflation case. Hence, the setup time is affected much less by the factorization of the coarse problem. The communication bottleneck is still present, though, as is apparent from the table above.
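The factor of four can be illustrated with a short sketch. It assumes that constant deflation contributes one deflation vector per subdomain, while linear deflation in 3D contributes four (a constant plus one linear function per coordinate direction). This reading is consistent with the ratio quoted above, but it is an assumption here, and `coarse_size` is a hypothetical helper.

```python
def coarse_size(subdomains, linear, dim=3):
    """Number of unknowns in the deflated (coarse) system.

    Assumption: 1 vector per subdomain for constant deflation and
    1 + dim vectors per subdomain for linear deflation.
    """
    per_subdomain = 1 + dim if linear else 1
    return subdomains * per_subdomain


for subdomains in (384, 1536, 6144):
    print(f"{subdomains:5d} subdomains: "
          f"constant {coarse_size(subdomains, linear=False):6d}, "
          f"linear {coarse_size(subdomains, linear=True):6d}")
```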

The advantage of the linear deflation is that it gives a better approximation of the problem on the coarse scale. It therefore needs fewer iterations for convergence and performs slightly better within its scalability limits. As the scale grows, however, the constant deflation eventually outperforms the linear deflation.

The next figure shows weak scaling of the Poisson problem on the PizDaint cluster. The problem size here is chosen so that each node owns about \(200^3\) unknowns. We only show the results of the AMGCL library on this cluster, in order to compare the performance of the OpenMP and CUDA backends. On each compute node, the Intel Xeon E5-2690 v3 CPU is used with the OpenMP backend, and the NVIDIA Tesla P100 GPU is used with the CUDA backend. The scaling behavior is similar to that on the MareNostrum 4 cluster. We can see that the CUDA backend is about 9 times faster than OpenMP during the solution phase and 4 times faster overall. The discrepancy is explained by the fact that the setup phase in AMGCL is always performed on the CPU, and in the case of the CUDA backend it has the additional overhead of moving the generated hierarchy into the GPU memory.
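The gap between the two speedups follows from an Amdahl-type argument, sketched below. The model is deliberately simplified: it assumes the setup takes the same time with both backends and ignores the extra cost of moving the hierarchy to the GPU, so the resulting solve fraction is only a lower bound. `solve_fraction` is a hypothetical helper.

```python
def solve_fraction(overall_speedup, solve_speedup):
    """Fraction f of the CPU run time spent in the solve phase.

    Simplified model with an unchanged setup time:
        1 / overall_speedup = (1 - f) + f / solve_speedup
    """
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / solve_speedup)


f = solve_fraction(overall_speedup=4.0, solve_speedup=9.0)
print(f"solve phase share of CPU run time: {f:.1%}")
```

With a 9-fold solve speedup and a 4-fold overall speedup, the model puts at least 84% of the CPU run time in the solve phase; the remaining setup time is what limits the overall gain.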


The figure below shows strong scaling results for the MareNostrum 4 cluster. The problem size is fixed to \(512^3\) unknowns and ideally the compute time should decrease as we increase the number of CPU cores. The case of ideal scaling is depicted for reference on the plots with thin gray dotted lines.


Here AMGCL scales much better than Trilinos, and is close to ideal for both kinds of deflation. As in the weak scaling case, we see a drop in scalability at about 1536 cores for ‘OMP=1’, but unlike before, the drop is also observable for the constant deflation case. This is explained by the fact that the amount of work per subdomain becomes too small to cover both the setup and the communication costs.

The profiling data for the strong scaling case is shown in the following table. The same factorization and coarse solve communication bottlenecks as in the weak scaling scenario come into play. Unfortunately, we were not able to obtain detailed profiling data for the constant deflation. Communication is most likely the main limiting factor in this case, since the coarse problem factorization costs much less due to the reduced size of the deflated space.

Cores | Setup: Total | Setup: Factorize E | Solve: Total | Solve: RHS for E | Solve: Solve E | Iterations |
---|---|---|---|---|---|---|
Linear deflation, OMP=1 | | | | | | |
384 | 1.01 | 0.03 | 14.77 | 1.04 | 0.07 | 64 |
1536 | 1.16 | 0.76 | 5.15 | 0.71 | 0.48 | 50 |
6144 | 17.43 | 15.58 | 40.93 | 34.23 | 2.72 | 34 |
Constant deflation, OMP=1 | | | | | | |
384 | 1.22 | | 16.16 | | | 76 |
1536 | 0.55 | | 12.92 | | | 72 |
6144 | 3.20 | | 48.91 | | | 46 |
Linear deflation, OMP=4 | | | | | | |
384 | 1.34 | 0.00 | 14.38 | 0.13 | 0.01 | 62 |
1536 | 0.77 | 0.03 | 4.66 | 0.40 | 0.08 | 68 |
6144 | 0.98 | 0.76 | 3.24 | 0.78 | 0.48 | 50 |
Constant deflation, OMP=4 | | | | | | |
384 | 2.75 | | 18.05 | | | 80 |
1536 | 0.55 | | 4.63 | | | 76 |
6144 | 0.21 | | 3.83 | | | 66 |
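The strong scaling efficiency implied by the table can be computed as in the sketch below, which takes the ‘Linear deflation, OMP=4’ rows and measures efficiency relative to the 384-core run (the smallest configuration listed in the table, not necessarily the smallest one in the figure). Using setup plus solve as the total time is an assumption, and `strong_efficiency` is a hypothetical helper.

```python
# (cores, setup total, solve total) for linear deflation, OMP=4
linear_omp4 = [(384, 1.34, 14.38), (1536, 0.77, 4.66), (6144, 0.98, 3.24)]


def strong_efficiency(rows):
    """Return [(cores, efficiency)] relative to the first row.

    For a fixed problem size the ideal time scales as 1 / cores, so
    efficiency = ideal time / measured time.
    """
    base_cores, base_setup, base_solve = rows[0]
    base_time = base_setup + base_solve
    result = []
    for cores, setup, solve in rows:
        ideal = base_time * base_cores / cores
        result.append((cores, ideal / (setup + solve)))
    return result


for cores, eff in strong_efficiency(linear_omp4):
    print(f"{cores:5d} cores: efficiency {eff:.2f}")
```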

The next figure shows strong scaling AMGCL results for the OpenMP and CUDA backends on the PizDaint cluster. The problem size here is \(256^3\) unknowns. The scalability curves show trends similar to those on the MareNostrum 4 cluster, but the GPU scaling is a bit further from ideal due to the higher overhead of managing the GPU and transferring communication data between the GPU and CPU memories. As in the weak scaling case, the GPU backend is about 9 times faster than the CPU backend during the solution phase, and about 3 times faster overall.


An interesting observation is that convergence of the method improves with a growing number of MPI processes. In other words, the number of iterations required to reach the desired tolerance decreases as the number of subdomains grows, since the deflated system describes the main problem better and better. This is especially apparent from the strong scalability results, where the problem size remains fixed, but is also observable in the weak scaling case for ‘OMP=1’.

### 3D Navier-Stokes problem

The system matrix in these tests contains 4773588 unknowns and 281089456 nonzeros. The AMGCL library uses a field-split approach with the `mpi::schur_pressure_correction` preconditioner. Trilinos ML does not provide field-split type preconditioners, and uses the nonsymmetric smoothed aggregation variant (NSSA) applied to the monolithic problem. Default NSSA parameters were employed in the tests.

The next figure shows scalability results for the Navier-Stokes problem on the MareNostrum 4 cluster. Since we are solving a fixed-size problem, this is essentially a strong scalability test.


Both the AMGCL and ML preconditioners deliver a nearly constant number of iterations with a growing number of MPI processes. As expected, the field-split preconditioner pays off and performs better than the monolithic approach in the solution of the problem. Overall, the AMGCL implementation shows decent, although less than optimal, parallel scalability. This is not unexpected, since the problem size quickly becomes too small to justify the use of more parallel resources (note that at 192 processes, fewer than 25000 unknowns are assigned to each MPI subdomain). Unsurprisingly, in this context the use of OpenMP within each domain pays off and delivers better scalability.
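The per-subdomain figure quoted above follows directly from the matrix dimensions given for this test. The short sketch below reproduces the arithmetic, assuming the unknowns are spread evenly over the MPI processes; the actual partitioning may be less balanced.

```python
UNKNOWNS = 4773588     # rows of the system matrix (from the text)
NONZEROS = 281089456   # nonzero entries (from the text)


def unknowns_per_process(processes):
    """Average number of unknowns per MPI process for an even split."""
    return UNKNOWNS / processes


for processes in (48, 96, 192):
    print(f"{processes:3d} processes: "
          f"{unknowns_per_process(processes):9.0f} unknowns each")
print(f"average nonzeros per row: {NONZEROS / UNKNOWNS:.1f}")
```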