# Hybrid MPI

In this subsection we will learn how to exploit the new hardware, with thousands of cores, that keeps appearing, with the help of Hybrid MPI.

### Hybrid MPI + OpenMP Masteronly Style

We saw in the previous exercise that scaling efficiency may be limited by Amdahl’s law. Even when essentially all of the computation is parallelised, chunks of effectively serial code can remain. In the previous example, the reduction specified with #pragma omp for reduction is one such chunk: combining the per-thread partial results is effectively serial work, even though it is carried out by the parallel threads. It is also the last statement of the region, and the MPI_Reduce that follows is collective communication, as we have already learnt.

If we do something like this inside a loop, a fixed fraction of serial code accumulates, so Amdahl’s law will limit our scaling no matter what. This directly implies that we cannot fully utilize the thousands, or even millions, of cores appearing on new and recent hardware.

An efficient solution to these problems is overlap: a region in which MPI communication proceeds simultaneously with OpenMP computation, so that the communication cost is hidden. This can be achieved with the Hybrid MPI + OpenMP Masteronly Style. This hybrid approach has quite a few advantages; the most prominent are that:

• there is no message passing inside of the SMP nodes and
• there are no topology problems.

A good example of the need for this approach is ray tracing in a room. The difficulty with ray tracing is that the volume we are describing is complex: to trace the light and the reflections it produces, we must store the scene description and compute parameters for each ray. This alone can take several gigabytes of memory, and if we have only 60 GB of memory per node, the problem size is memory-limited. With pure MPI we cannot throw many cores at a large problem, because each MPI process holds its own private copy of the problem; nothing is shared between processes. This kind of problem, which takes a lot of memory because of the complex description of the environment, is usually solved fairly easily with MPI + OpenMP: the threads within a node share a single copy of the data.

### Calling MPI inside of OMP MASTER

If we would like to communicate, it is usually best to do so from the OMP master thread. This ensures that only one thread makes MPI calls. However, we will still need some synchronization. As we learnt in the previous weeks, synchronization in parallel programming means that, when multiple threads run in parallel, we sometimes want to pause their execution and let only one thread run at a time. Here it means that whenever we do MPI, all threads must stop at a barrier at some point.

In OpenMP, MPI is called inside a parallel region, within OMP MASTER. This requires MPI_THREAD_FUNNELED, which, as we saw in the previous subsection, means that only the master thread makes MPI calls. However, we need to be aware that there is no synchronization with OMP MASTER! The master construct has no implicit barrier. Therefore, OMP BARRIER is necessary to guarantee that data or buffer space from/for other threads is available before/after the MPI call! The barrier is necessary to prevent data races.

Fortran directives:

```fortran
!$OMP BARRIER
!$OMP MASTER
  call MPI_Xxx(...)
!$OMP END MASTER
!$OMP BARRIER
```

C directives:

```c
#pragma omp barrier
#pragma omp master
{
  MPI_Xxx(...);
}
#pragma omp barrier
```

We can see above that all other threads sleep during the MPI call, and the additional barrier also implies the necessary cache flush!

Through the following exercise we will see why the barrier is necessary.

### Example with MPI_Recv

In the example, the master thread will execute a single MPI call within the OMP MASTER construct, while all the other threads are idle. As illustrated, barriers may be required in two places:

• Before the MPI call, in case the MPI call needs to wait on the results of other threads.
• After the MPI call, in case other threads immediately need the results of the MPI call.

Code in Fortran:

```fortran
!$OMP parallel
!$OMP do
  do i=1, 1000
    a(i) = buf(i)
  end do
!$OMP end do nowait
!$OMP barrier
!$OMP master
  call MPI_Recv(buf, …)
!$OMP end master
!$OMP barrier
!$OMP do
  do i=1, 1000
    c(i) = buf(i)
  end do
!$OMP end do nowait
!$OMP end parallel
```

Code in C:

```c
#pragma omp parallel
{
  #pragma omp for nowait
  for (i = 0; i < 1000; i++)
    a[i] = buf[i];

  #pragma omp barrier
  #pragma omp master
  {
    MPI_Recv(buf, ....);
  }
  #pragma omp barrier

  #pragma omp for nowait
  for (i = 0; i < 1000; i++)
    c[i] = buf[i];
}
```