Rolf Rabenseifner

I'm head of Parallel Computing – Training and Application Services at HLRS and a member of the MPI Forum. In workshops and summer schools, I teach parallel programming models at many universities and labs.

Location: HLRS, University of Stuttgart, Germany

Activity

  • Rolf Rabenseifner made a comment

    My question may go beyond this course: Is there any step in this course where you explain the use of C-pointers in Fortran and the intrinsic Fortran function C_F_POINTER?

  • @LewinStein You may play with MPI/tasks/C/Ch11/solutions/ring-1sided-store-win-alloc-shared-signal.c, where the slower send/recv is substituted by a faster mechanism: setting a flag variable and polling on it.
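    A minimal sketch of that idea, assuming at least two processes on one node (the names and the exact signaling pattern here are illustrative, not the content of the course example):

      /* hedged sketch: rank 0 signals rank 1 through a flag in an MPI
         shared-memory window instead of sending an empty message */
      #include <mpi.h>
      #include <stdio.h>
      int main(int argc, char *argv[])
      {
        int rank;  MPI_Comm nodecomm;  MPI_Win win;
        volatile int *flag;     /* my own portion of the shared window */
        MPI_Init(&argc, &argv);
        /* communicator containing only processes that can share memory */
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &nodecomm);
        MPI_Comm_rank(nodecomm, &rank);
        /* each process contributes one int to the shared window */
        MPI_Win_allocate_shared(sizeof(int), sizeof(int), MPI_INFO_NULL,
                                nodecomm, (void *)&flag, &win);
        MPI_Win_lock_all(MPI_MODE_NOCHECK, win);
        *flag = 0;
        MPI_Win_sync(win);      /* make the initialization visible */
        MPI_Barrier(nodecomm);
        if (rank == 0) {
          MPI_Aint size;  int disp_unit;  volatile int *flag_1;
          /* locate rank 1's portion of the shared window */
          MPI_Win_shared_query(win, 1, &size, &disp_unit, (void *)&flag_1);
          *flag_1 = 1;          /* "send": set the flag in rank 1 */
          MPI_Win_sync(win);
        } else if (rank == 1) {
          do {                  /* "receive": poll on the local flag */
            MPI_Win_sync(win);
          } while (*flag == 0);
          printf("rank 1 saw the flag\n");
        }
        MPI_Win_unlock_all(win);
        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
      }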

  • @LewinStein In your answer, you mentioned MPI_Send/MPI_Irecv/MPI_Wait (which would definitely be semantically identical to MPI_Send/MPI_Recv), but the example uses MPI_Irecv/MPI_Send/MPI_Wait. It is a ring example, and it would deadlock if you used an MPI option that switches all MPI_Send calls to the synchronous protocol. In a non-cyclic example, MPI_Send/MPI_Recv with...
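    A minimal sketch of the deadlock-free ring pattern (left, right, and the buffers are illustrative names):

      MPI_Request rq;
      /* post the receive first, then send, then wait */
      MPI_Irecv(&rcv_buf, 1, MPI_INT, left,  17, MPI_COMM_WORLD, &rq);
      MPI_Send (&snd_buf, 1, MPI_INT, right, 17, MPI_COMM_WORLD);
      MPI_Wait (&rq, MPI_STATUS_IGNORE);
      /* with MPI_Send followed by MPI_Recv in every process of the ring,
         a forced synchronous send protocol would deadlock */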

  • In principle, MPI_Barrier should be preferred if and only if there is a huge number of neighboring processes, i.e., when O(log_2(#processes)) is less than the number of neighboring processes.
    For example, in a Cartesian 3-dimensional code, one typically has only 6 neighbors and 2**6 = 64. If you have more than 64 processes, then the barrier would be a bad idea for...

  • I want to add that, especially on a shared memory system, the Send/Recv of an empty message can be very fast (~0.2 µs), whereas the visibility of X:=1 in the second process may be delayed significantly longer, especially if there is other work around that must be executed by the hardware.
    If you substitute the Send/Recv by modifying a flag variable in rank 0...

  • Fully correct answer, because if in your Step 2) each process would provide some size, then in your Step 3) all processes would retrieve a pointer to the beginning of the total window (as asked in our question), but the size of the window portion of process 0 would not be the total size.

  • Dear Kristian, in the MPI Forum, they argued that with MPI_Win_allocate (in comparison to MPI_Alloc_mem + MPI_Win_create) all processes of the underlying communicator can locate their windows in all processes at the same virtual address, and this would minimize the loading of base addresses to only one load, although in one RMA epoch, RMAs to many other processes may...
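    For illustration, the two variants look roughly like this (a sketch inside an MPI program; the size and datatype are placeholders):

      MPI_Aint size = 1000 * sizeof(double);
      double *base;  MPI_Win win;
      /* variant 1: collective allocation; the library may place the windows
         of all processes at the same virtual address */
      MPI_Win_allocate(size, sizeof(double), MPI_INFO_NULL, MPI_COMM_WORLD,
                       &base, &win);
      /* variant 2: local allocation plus collective window creation;
         such an address coordination is not possible here */
      /* MPI_Alloc_mem(size, MPI_INFO_NULL, &base);
         MPI_Win_create(base, size, sizeof(double), MPI_INFO_NULL,
                        MPI_COMM_WORLD, &win);                       */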

  • I should add that this window creation is not a synchronization, i.e., depending on which synchronization method you choose, you must start with the related starting synchronization before you call the first RMA routine.

  • Yes, your alternative answer with the modified code in process 1 is also correct.
    Don't overlook that MPI_Send may internally use a synchronous protocol and may therefore be delayed until the MPI_Recv in process 2 is called.
    In both processes 1 and 2 you can choose the start of the pt-to-pt communication before or after MPI_Win_wait. Both calls to...

  • No, one cannot increase the memory size of a window. There is a very special feature in MPI: dynamic windows. But here, you attach fixed portions (i.e., there is also no way to "increase" such a portion) and everything is done through absolute addresses, i.e., a very different interface, designed mainly to support compilers, e.g., to implement a partitioned global address...

  • I apologize, but I have to correct myself:
    The lower part of the figure describes these RMA and local load/store epochs, which are always separated by synchronizations.
    It would be more correct to say that the RMA accesses must be separated from the local load/store epoch by such synchronizations, and that a sequence of RMA accesses (which can be in...

  • Yes, fully correct.

  • Yes, you are right. In the current version of the first, second and last answer to question 4, there is a pair of fences that surrounds only one local assignment:
    "MPI_Win_fence / Assignment inside process 1 / MPI_Win_fence".
    In none of the processes is this pair of fences used to define an exposure epoch or an RMA epoch with RMA calls. Therefore, the...

  • I hope that you can continue. This course is designed for anyone familiar with MPI who wants to learn to program using the new interface. This means that you are already familiar with blocking and nonblocking MPI point-to-point communication. Based on this, I hope that weeks 1+2 are not too condensed. They present all that is needed to really understand the MPI...

  • Yes, the quizzes should help to really get the content of the previous steps before continuing with the next step.

  • Yes, both.

  • And depending on your application's needs and usage, all processes may use PUT and GET in a more symmetric way, similar to discussions in a larger group of humans, while in other applications one-sided communication may be used only in one direction between two processes.

  • Hi all, nice to meet you here during the next 4 weeks.

  • Hi Hasan, do you already have experience with MPI point-to-point blocking and nonblocking communication?
    In this course, you'll first learn about one-sided communication (weeks 1+2) and then, in weeks 3+4, all about shared memory in MPI.

  • Yes, lock and unlock are logically executed on the target process.
    And an MPI library is allowed to buffer actions, i.e., a short MPI_Put may be buffered, the Unlock may already return, and the real locking on the target may be done later; see MPI-3.1 Chapter 11.5 or MPI-4.0 Chapter 12.5 for more details.

  • MPI_Get_accumulate gets the target values, i.e., reads the target data and stores it into the result_buf in the argument list. Then it takes the data from the origin_buf in the argument list and executes the operation op (also in the argument list) to combine the current target values with the values in the origin buffer. The result is then stored at the target...
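    A small sketch, e.g., a fetch-and-add of 1 on one int in the window of rank 1 (win is assumed to be an existing window handle; the other names are illustrative):

      int origin_buf = 1, result_buf;
      MPI_Win_lock(MPI_LOCK_SHARED, 1 /*target rank*/, 0, win);
      MPI_Get_accumulate(&origin_buf, 1, MPI_INT, /* data combined with target */
                         &result_buf, 1, MPI_INT, /* receives old target value */
                         1, 0 /*target_disp*/, 1, MPI_INT,
                         MPI_SUM, win);
      MPI_Win_unlock(1, win);
      /* result_buf now holds the target element as it was before the MPI_SUM */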

  • Shared memory is a very special operating system (OS) feature, as you saw in Step 1.5. This peephole feature may really occupy OS resources.

  • MPI_Win_create_dynamic together with MPI_Win_attach is a very special feature for people who want to implement other languages on top of MPI, e.g., partitioned global address space languages. It is not recommended for normal application programming.

  • If two processes write to the same window location (e.g., in a third process) without any synchronization in between, then you have a write-write race condition.
    Or if one process writes locally to its own window memory and another process concurrently reads the same window location with MPI_Get, then you have a write-read race condition.
    Does this answer...
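    As a sketch of what such erroneous programs look like (rank, buf, x, and win are illustrative; there is intentionally no synchronization between the conflicting accesses):

      /* write-write race: ranks 1 and 2 both put into the same location of rank 0 */
      if (rank == 1 || rank == 2)
        MPI_Put(&buf, 1, MPI_INT, 0 /*target*/, 0 /*same target_disp*/,
                1, MPI_INT, win);
      /* write-read race: rank 0 stores locally into its own window memory
         while rank 1 concurrently reads the same location with MPI_Get */
      if (rank == 0)  x = 42;     /* x is inside rank 0's window */
      if (rank == 1)
        MPI_Get(&buf, 1, MPI_INT, 0, 0, 1, MPI_INT, win);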

  • Not the same, only similar. Fortran allocation works within the Fortran language. MPI_Alloc_mem, MPI_Win_allocate, and MPI_Win_allocate_shared return C pointers, and you'll learn next week in Step 2.1 how to translate such a C pointer into a Fortran pointer.

  • If you are starting with MPI, I would recommend that you look at beginners' courses, because this one is an advanced course.
    Our whole MPI course material is visible at https://www.hlrs.de/training/par-prog-ws/MPI-course-material

  • MPI_Alltoall sends only one message to each process, but a different message to each process, i.e., with 10,000 processes, each process has to send 10,000 messages, or in total 100,000,000 messages must go through the network. All collective routines require that the receive count **exactly** matches the size of the message. Therefore, if you use zeros in the...

  • Welcome to this course. It is nice to hear from you all: where you come from, what you work on, and how your work is related to MPI. At the end of the course, I hope that I can give you further hints on where you can find more information on MPI.

  • > Does the window works like a dynamically allocatable "buffer" in each process?
    It depends on whether you create or allocate your window; see later for details.
    > Is it a function supported by specific hardware structures?
    Yes and no. It depends on your hardware and how the MPI library uses your hardware and whether you allocate the window through an MPI...

  • Rolf Rabenseifner made a comment

    And we also intended that you have (two) central places (i.e., PDFs) where you can add personal notes :-)

  • Please keep in mind that many applications have a solver loop and they never know "ahead of time" (or in advance) how many steps are needed before an abort criterion is reached. But this does not prevent doing a fence / RMA / fence epoch for the needed halo data exchange in each simulation step in the solver loop.
    Please also keep in mind that a sequence of...
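    A minimal sketch of such a solver loop with one fence / RMA / fence epoch per iteration (all names here are illustrative, not from the course material):

      int converged = 0;
      while (!converged) {                    /* number of iterations unknown */
        MPI_Win_fence(0, win);                /* start of the RMA epoch       */
        for (int n = 0; n < num_neighbors; n++)
          MPI_Put(sendbuf[n], count, MPI_DOUBLE, neighbor[n],
                  my_disp_in_neighbor[n], count, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);                /* halo data is now usable      */
        compute_next_iteration();             /* local load/store epoch       */
        converged = check_abort_criterion();  /* decided at run time          */
      }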

  • Before we give you an answer, I want to clarify your question, i.e., how I understand it.
    You try to solve the problem with
    MPI_Win_fence (each process in the role of both sender and receiver)
    In the role of a sender:
    Loop over all MPI_Put calls to send its data to its receivers
    MPI_Win_fence (each process in the role of both sender and...

  • Zheng already answered what a handle is, using the official text from the MPI standard.
    And now about a request handle: it refers to all the information in the argument list of a nonblocking call plus the current internal status of the operation that was initiated with this nonblocking call. It must be finished with a call to MPI_Wait or MPI_Test or their...

  • > Will disabling register optimization in Fortran impact the performance?
    Yes, this would be a dramatic slowdown.
    This means you should really use the proposed methods:
    - The declaration as ASYNCHRONOUS should not impact loops. It should only prohibit the movement of instructions across procedure calls.
    - The MPI_F_SYNC_REG call is doing exactly the same...

  • Thank you very much for your two bug reports:
    Question 6: we have now set the case-insensitive flag in FutureLearn.
    Question 7: we'll modify/correct the questions and answers as soon as possible and also allow multiple choices.

  • Yes, you are right. We can add such terms to our explanations. It is not helpful to have this already in the proposed answers, because in real life you need to find it yourself.
    For example, in Question 1, the current explanation text for the correct answer reads:
    "Process 2 is moving the contents of its variable C to the window of process 1"
    but it can be optimized...

  • Yes, but only question 10 showed such an example. For the PUT in process 1, you have full freedom to choose your send buffer, i.e., you can use normal (i.e., private) variables or also all your window variables (i.e., from this window handle, but also from other window handles). If you do so, this process 1 must carefully synchronize, but this comes later.

  • Yes, only a process itself can read and/or modify data in its private variables. MPI_Put and MPI_Get can access only variables/arrays in the windows of another process. As shown in Step 1.5, windows are like peepholes into the memory of a process, i.e., other processes can go only through this peephole and not to any other normal (i.e., private) variable.

  • This means that all processes of this communicator will have access to the windows created or allocated in all these processes, i.e., the window handle reflects the group of processes plus all the information about all the windows that were created or allocated in one collective call to MPI_Win_create or MPI_Win_allocate.

  • And it is allowed to switch from one synchronization method to another within the use of the same window. But for this, an epoch with one synchronization method must be completely finished before starting another epoch with another synchronization method. MPI-3.1 Sections 11.5 and 11.7 describe the rules in detail, but they are not easy to read :-(

  • I completely agree.

  • This is completely a question of the quality of your MPI library. I never benchmarked this.

  • This is a question of the quality of the MPI library. If your MPI library implements MPI_Alloc_mem and MPI_Win_allocate by allocating operating system shared memory pages, then the same restrictions may apply as for the shared memory procedure MPI_Win_allocate_shared:
    If MPI shared memory support is based on POSIX shared memory, then
    the memory may be ...

  • Yes, you are right. We should already mention this in step 1.5 although it will be then addressed again in step 1.9.

  • > Symmetric with respect what?

    For example, with MPI_Win_allocate, the MPI library may allocate the windows of all processes at the same virtual address to minimize table lookups internally in the MPI library, whereas MPI_Alloc_mem is a local function and therefore not able to perform any collective optimization.

  • My apologies, I gave the answers in the first run and not here in the 2nd run )-:

    MPI_Win_create_dynamic is a very special thing: this call is collective and does not provide any window memory. The memory is then attached later with local (!!!) calls to MPI_Win_attach (and removed with MPI_Win_detach).
    Addressing must be done with absolute addresses, see...
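    A small sketch of this pattern (the buffer and the way the address is published are only illustrative):

      MPI_Win win;  double buf[100];  MPI_Aint buf_addr;
      /* collective creation of a window without any memory */
      MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);
      /* local (!) attach of a memory portion; local (!) detach when done */
      MPI_Win_attach(win, buf, 100 * sizeof(double));
      /* addressing is absolute: the target publishes its address, e.g.,
         with a send/recv or a collective, and the origin then uses it
         directly as target_disp in MPI_Put / MPI_Get / ...              */
      MPI_Get_address(buf, &buf_addr);
      /* ... communicate buf_addr to the origin process(es) ... */
      MPI_Win_detach(win, buf);
      MPI_Win_free(&win);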

  • Please have a look to my answers to the questions from Georgios Giannakopoulos and Outi Vilhelmiina Kontkanen in step X.X (sorry, this was a reference to a comment in the 1st run of this course - see correct answer in next comment)

  • Good idea. Done.

  • Here is the cited text:
    "In the case of a window created with MPI_WIN_CREATE_DYNAMIC, the target_disp for all RMA functions is the address at the target; i.e., the effective window_base is MPI_BOTTOM and the disp_unit is one. For dynamic windows, the target_disp argument to RMA communication operations is not restricted to non-negative values. Users should use...

  • This "It was not really designed for end-users ;-)" is the reason, why we'll not go into details on this in the 2nd week step 2.1.

  • Please see my comment above to Georgios Giannakopoulos

  • MPI_Win_create_dynamic is a very special thing: this call is collective and does not provide any window memory. The memory is then attached later with local (!!!) calls to MPI_Win_attach (and removed with MPI_Win_detach).
    Addressing must be done with absolute addresses, see the official MPI-3.1, page 411, lines 8-13.
    This interface was designed to support thi...

  • moved to answer the question of
    Emiliano Ippoliti

  • Welcome to our course on the MPI communication method named "one-sided communication"!

  • HPE and HLRS have already started to look into this, i.e., to try to make progress on bringing one-sided communication with notifications into a future MPI standard.

  • -removed-

  • @GeorgGeiser
    Yes, you are right. I want to note: MPI_Win_test_lock and MPI_Win_test_unlock do not exist as MPI functions. But they could be proposed to the MPI Forum, which is the MPI standardization body.
    Such a proposal would typically require some research paper with a real application showing that there are real performance benefits, which is typically a...

  • We used this sentence because we wanted to say: if process A first wants to do an MPI_Put and then process B wants to MPI_Get this information, then you can achieve this with MPI_Put; MPI_Win_fence in A and MPI_Win_fence; MPI_Get in B, but there is no way to provide this guarantee of a sequence with MPI_Win_lock/MPI_Win_unlock.
    Nevertheless, all operations within a synchronization...
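    As a sketch (A, B, T, data, and win are illustrative; all processes call the fences):

      MPI_Win_fence(0, win);
      if (rank == A)                 /* A writes into the window of T       */
        MPI_Put(&data, 1, MPI_INT, T, 0, 1, MPI_INT, win);
      MPI_Win_fence(0, win);         /* this fence orders the Put before... */
      if (rank == B)                 /* ...B's Get of the same location     */
        MPI_Get(&data, 1, MPI_INT, T, 0, 1, MPI_INT, win);
      MPI_Win_fence(0, win);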

  • > (By the way, in many MPI3 libraries "mpi_f_sync_reg" does not exist.)
    If such an MPI library provides a Fortran interface, then this library is not a correct implementation of MPI-3.0 or later. You should ask for a bugfix.

    And now to your first question:
    Please read the text very carefully: if you use ASYNCHRONOUS, then you should also use
    IF (.NOT....

  • In principle, each MPI library implementation is allowed to be as non-optimized or as optimized as the vendor or customer would like to provide or buy.
    I would always say that MPI_ACCUMULATE is typically designed for "sparse" usage, e.g., 100,000 processes, but each process calls MPI_Accumulate for only 0-30 target processes.
    For the non-sparse use case, I typically...

  • Please have a look back at Activity 1.5. If a communicator has 4 processes, then all processes must collectively call MPI_Win_create, and therefore each process has the handle. And in the previous activity, please check again the meaning of the output argument win.

  • No, Put is called at the origin process with the argument target_disp__origin_process. This value is then multiplied by disp_unit__target_process as specified in MPI_Win_create at the target_process. The product is then added to the win_base_addr that was also specified in MPI_Win_create at the target_process.
    With our index, we want to show in which of...
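    As an illustration (all names are placeholders): the byte address accessed in the target process is win_base_addr(target) + target_disp * disp_unit(target). If the target created its window with disp_unit = sizeof(double), then

      /* target_disp = 3 selects the 4th double of the target's window */
      MPI_Put(&origin_buf, 1, MPI_DOUBLE,
              target_rank, 3 /* element index, not bytes */,
              1, MPI_DOUBLE, win);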

  • Very good question. The members of the MPI Forum never asked themselves whether win_base_addr in the Fortran interface of MPI_Win_create should have the CONTIGUOUS attribute (not CONTINUOUS).
    The MPI-3.1 standard writes on page 403:
    "In Fortran, one can pass the first element of a memory region or a whole array, which must be `simply contiguous'...

  • Please wait until the next activity (2.2)

  • First, yes, win_size may be chosen as 0 in some processes.
    Second, in that case (win_size==0), it is not really defined by the MPI standard 3.1 whether the win_base_addr is ignored or cached on the win handle, i.e., when you look into MPI-3.1, Section 11.2.6 on page 414, whether the attribute MPI_WIN_BASE will return win_base_addr or not. Therefore, my...

  • Yes, you can mix:
    - the mpi_f08 handles are Fortran derived types (i.e., structures) with exactly one INTEGER :: MPI_VAL inside.
    - This value is identical to the value used in the mpi module and mpif.h.
    - New-style handles can also be handed over to Fortran code using the mpi module, because all new handle types and the TYPE(MPI_STATUS) must be provided in both,...

  • Please be aware that put and get are never atomic. All put and get operations between two synchronizations are nonblocking and executed without a given sequence for any data item in an array. The only atomic functions are the accumulate functions (there is more than one), and the atomicity is per array element.
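    A small sketch of the difference (buf, n, target, and win are illustrative):

      /* NOT atomic: concurrent puts/gets on the same elements conflict */
      MPI_Put(buf, n, MPI_DOUBLE, target, 0, n, MPI_DOUBLE, win);
      /* atomic per array element: MPI_Accumulate with MPI_REPLACE behaves
         like a put, but concurrent accumulates to the same element are allowed */
      MPI_Accumulate(buf, n, MPI_DOUBLE, target, 0, n, MPI_DOUBLE,
                     MPI_REPLACE, win);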

  • Sounds correct. Often accumulate is used for sparse problems, e.g., 30 neighbors, but in total 10,000 processes. In this case, the solution with an array does not scale in size. In many cases, the MPI standard provides functionalities that can be better optimized than hand-written solutions. Thank you for providing such an example of this.

  • @StellaValentinaParonuzziTicco Both examples are read-write conflicts, one with an RMA read (i.e., an MPI_GET()), the other with a local read (i.e., a local load operation).