Zheng Meyer-Zhao

I work as a software engineer for HPC applications at ASTRON in The Netherlands. I spend most of my time developing software, giving training courses on HPC-related topics, and developing training materials.

Location: ASTRON, Dwingeloo, The Netherlands

Activity

  • MPI manages system memory that is used for buffering messages and for storing internal representations of various MPI objects such as groups, communicators, datatypes, etc. This memory is not directly accessible to the user, and the objects stored there are opaque: their size and shape are not visible to the user. Opaque objects are accessed via handles, which...
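
    A minimal sketch (program and variable names are illustrative) of how such opaque objects are reached only through handles:

      program handles_demo
        use mpi_f08
        implicit none
        type(MPI_Group) :: world_group   ! handle to an opaque group object
        integer :: group_size

        call MPI_Init()
        ! MPI_COMM_WORLD is itself a handle; the communicator it refers to
        ! lives inside the MPI library and cannot be inspected directly.
        call MPI_Comm_group(MPI_COMM_WORLD, world_group)
        call MPI_Group_size(world_group, group_size)
        print *, 'size of the group behind MPI_COMM_WORLD:', group_size
        ! Freeing the handle lets MPI release its internal memory for the object.
        call MPI_Group_free(world_group)
        call MPI_Finalize()
      end program handles_demo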

  • Thank you for reporting the bugs! The options of the question and the answers to these options have been updated and corrected.

  • An epoch in the sense of one-sided communication is the time between two consecutive synchronization calls. Such a period is usually used for RMA calls to a remote window (i.e., in the role of being an origin process) and/or local loads and stores to the local window (i.e., in the role of being a target process).
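
    A minimal sketch of such a fence-delimited epoch (variable names are illustrative): between the two MPI_Win_fence calls each process acts as an origin (MPI_Put to its right neighbour) and as a target (a local store into its own window):

      program fence_epoch
        use mpi_f08
        implicit none
        integer, parameter :: n = 10
        integer, asynchronous :: win_buf(n)   ! window memory
        type(MPI_Win) :: win
        integer :: my_rank, num_ranks, right
        integer(MPI_ADDRESS_KIND) :: lb, extent, win_size

        call MPI_Init()
        call MPI_Comm_rank(MPI_COMM_WORLD, my_rank)
        call MPI_Comm_size(MPI_COMM_WORLD, num_ranks)
        right = mod(my_rank + 1, num_ranks)

        call MPI_Type_get_extent(MPI_INTEGER, lb, extent)
        win_size = n * extent
        call MPI_Win_create(win_buf, win_size, int(extent), MPI_INFO_NULL, &
                            MPI_COMM_WORLD, win)

        call MPI_Win_fence(0, win)        ! synchronization call: the epoch starts
        win_buf(1) = my_rank              ! local store to the own window (target role)
        call MPI_Put(win_buf(1), 1, MPI_INTEGER, right, 1_MPI_ADDRESS_KIND, &
                     1, MPI_INTEGER, win) ! RMA call to a remote window (origin role)
        call MPI_Win_fence(0, win)        ! synchronization call: the epoch ends

        call MPI_Win_free(win)
        call MPI_Finalize()
      end program fence_epoch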

  • It is the same as used in other collective MPI routines. The communicator can be e.g. MPI_COMM_WORLD.

  • No, we don't have any benchmark results on this. And it depends on the quality of your MPI library.

  • @GeorgGeiser Could you elaborate what you mean by "the calling process"? Do you mean the process that calls MPI_Win_lock?

  • The problem may occur with all buffer arguments of nonblocking MPI routines, i.e., independent of whether the buffer is an array or a scalar variable, whether it is an MPI_Put/Get/Accumulate buffer or is accessed by direct loads and stores to the one-sided window by the target process within a local load/store epoch, or whether it is used in other nonblocking routines like MPI_Isend or MPI_Irecv.
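
    A minimal sketch (hypothetical names) of the rule for the MPI_Irecv case: the buffer handed to the nonblocking call should carry the ASYNCHRONOUS attribute and must not be accessed until the matching MPI_Wait has returned:

      program nonblocking_buffer
        use mpi_f08
        implicit none
        integer, asynchronous :: recv_val   ! a single variable, not an array
        type(MPI_Request) :: request
        integer :: my_rank, num_ranks

        call MPI_Init()
        call MPI_Comm_rank(MPI_COMM_WORLD, my_rank)
        call MPI_Comm_size(MPI_COMM_WORLD, num_ranks)

        if (my_rank == 0 .and. num_ranks > 1) then
          call MPI_Irecv(recv_val, 1, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, request)
          ! recv_val must not be accessed here: the transfer may still be in flight
          call MPI_Wait(request, MPI_STATUS_IGNORE)
          print *, 'received', recv_val     ! safe only after MPI_Wait
        else if (my_rank == 1) then
          call MPI_Send(my_rank, 1, MPI_INTEGER, 0, 0, MPI_COMM_WORLD)
        end if
        call MPI_Finalize()
      end program nonblocking_buffer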

  • Sorry, I don't have the answer. You may contact PRACE https://prace-ri.eu/contact-us/ for this question.

  • No. When a window is locked by process A, another process can only acquire a lock on this window after the lock of process A has been released. So the MPI_Win_lock of the other processes will automatically be granted once MPI_Win_unlock of process A has returned.
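
    A minimal sketch with exclusive locks (variable names are illustrative): every process increments a counter in the window on rank 0, and because the lock is exclusive, each access epoch is granted only after the previous holder has called MPI_Win_unlock:

      program lock_unlock
        use mpi_f08
        implicit none
        integer, asynchronous :: counter(1)
        type(MPI_Win) :: win
        integer :: my_rank, one
        integer(MPI_ADDRESS_KIND) :: lb, extent, win_size

        call MPI_Init()
        call MPI_Comm_rank(MPI_COMM_WORLD, my_rank)
        counter(1) = 0
        call MPI_Type_get_extent(MPI_INTEGER, lb, extent)
        win_size = extent
        call MPI_Win_create(counter, win_size, int(extent), MPI_INFO_NULL, &
                            MPI_COMM_WORLD, win)

        one = 1
        call MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win)   ! exclusive access to rank 0's window
        call MPI_Accumulate(one, 1, MPI_INTEGER, 0, 0_MPI_ADDRESS_KIND, &
                            1, MPI_INTEGER, MPI_SUM, win)
        call MPI_Win_unlock(0, win)                        ! releases the lock for the next process

        call MPI_Win_free(win)                             ! collective; all epochs are completed here
        if (my_rank == 0) print *, 'counter =', counter(1) ! equals the number of processes
        call MPI_Finalize()
      end program lock_unlock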

  • You can find the digital version on the website of MPI-forum.org at https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf

  • Yes, that's indeed the problem: the extra MPI_Win_fence is not needed. The question is to select the "most correct and accurate way".

  • On the origin and target side (as with message passing on the sender and receiver side), the combination of sendcount*sendtype and the actually used recvcount*recvtype must reflect the same sequence of basic datatypes (the recvcount in the argument list may be larger than the actually used count). This means, for example, you may (e.g., with MPI_Put) send 10 doubles located very...
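
    A minimal sketch (names are illustrative) of such a match: the origin describes 10 contiguous double precision values, while the target uses a strided vector type that also consists of 10 doubles, so both sides reflect the same sequence of basic datatypes:

      program put_datatype_match
        use mpi_f08
        implicit none
        integer, parameter :: n = 10
        double precision, asynchronous :: win_buf(2*n)   ! target window, used with a stride
        double precision :: origin_buf(n)                ! contiguous origin buffer
        type(MPI_Win) :: win
        type(MPI_Datatype) :: strided
        integer :: my_rank
        integer(MPI_ADDRESS_KIND) :: lb, extent, win_size

        call MPI_Init()
        call MPI_Comm_rank(MPI_COMM_WORLD, my_rank)
        origin_buf = my_rank

        call MPI_Type_get_extent(MPI_DOUBLE_PRECISION, lb, extent)
        win_size = 2 * n * extent
        call MPI_Win_create(win_buf, win_size, int(extent), MPI_INFO_NULL, &
                            MPI_COMM_WORLD, win)

        ! Target datatype: 10 doubles with stride 2, i.e., the same sequence of
        ! basic datatypes (10 x double) as the contiguous origin description.
        call MPI_Type_vector(n, 1, 2, MPI_DOUBLE_PRECISION, strided)
        call MPI_Type_commit(strided)

        call MPI_Win_fence(0, win)
        if (my_rank == 1) then
          call MPI_Put(origin_buf, n, MPI_DOUBLE_PRECISION, 0, &
                       0_MPI_ADDRESS_KIND, 1, strided, win)
        end if
        call MPI_Win_fence(0, win)

        call MPI_Type_free(strided)
        call MPI_Win_free(win)
        call MPI_Finalize()
      end program put_datatype_match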

  • Unless you are working with legacy code that uses mpif.h, please use mpi_f08 for new applications. Most mpif.h implementations do not include compile-time argument checking; therefore, many bugs in MPI applications remain undetected at compile time (e.g., a missing ierror as the last argument in most Fortran bindings).
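
    A minimal sketch of the recommended style (a hypothetical hello-world, not code from the course):

      program use_mpi_f08_demo
        use mpi_f08            ! instead of:  include 'mpif.h'
        implicit none
        type(MPI_Comm) :: comm ! communicators are typed handles, not plain integers
        integer :: my_rank

        call MPI_Init()                    ! ierror is optional with mpi_f08
        comm = MPI_COMM_WORLD
        call MPI_Comm_rank(comm, my_rank)  ! wrong argument types would not compile
        print *, 'Hello from rank', my_rank
        call MPI_Finalize()
      end program use_mpi_f08_demo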

  • C_F_POINTER(cptr_buf, buf, (/max_length/)) associates the Fortran pointer buf with the target of the C pointer cptr_buf and specifies its shape. 'buf' is then used in the rest of the application.
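
    A minimal sketch of the surrounding pattern, here assuming cptr_buf is returned by MPI_Win_allocate_shared (which requires that all processes of the communicator run on the same shared-memory node):

      program cfpointer_demo
        use mpi_f08
        use, intrinsic :: iso_c_binding, only : c_ptr, c_f_pointer
        implicit none
        integer, parameter :: max_length = 100
        type(c_ptr) :: cptr_buf
        double precision, pointer :: buf(:)   ! Fortran view of the same memory
        type(MPI_Win) :: win
        integer(MPI_ADDRESS_KIND) :: lb, extent, win_size

        call MPI_Init()
        call MPI_Type_get_extent(MPI_DOUBLE_PRECISION, lb, extent)
        win_size = max_length * extent

        ! MPI hands back a C pointer to the allocated window memory ...
        call MPI_Win_allocate_shared(win_size, int(extent), MPI_INFO_NULL, &
                                     MPI_COMM_WORLD, cptr_buf, win)
        ! ... and C_F_POINTER associates the Fortran pointer buf with that
        ! memory, giving it the shape (/max_length/).
        call c_f_pointer(cptr_buf, buf, (/max_length/))

        buf(1) = 3.14d0                       ! buf is used like a normal array from here on
        call MPI_Win_free(win)
        call MPI_Finalize()
      end program cfpointer_demo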

  • The example described here perfectly illustrates a race condition. A block of RMA operations is not atomic, which is why synchronisations are needed.

  • Parallel computing is done at the programming level: it is the software developer's responsibility to make sure that the program runs correctly in parallel. With modern processors, vectorization is also possible. To use it, you can compile your programs with compilation options that enable vectorization.
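
    A minimal sketch (the file name and the GCC flags are only an example; other compilers use different options): a loop with independent iterations, such as the one below, can be auto-vectorized when compiled with, e.g., gfortran -O3 -march=native -fopt-info-vec saxpy.f90:

      program saxpy
        implicit none
        integer, parameter :: n = 100000
        real :: x(n), y(n), a
        integer :: i

        a = 2.0
        x = 1.0
        y = 0.0
        do i = 1, n    ! independent iterations: a candidate for SIMD vectorization
          y(i) = y(i) + a * x(i)
        end do
        print *, y(1), y(n)
      end program saxpy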

  • There are more and more libraries available nowadays, which allow you to write a few lines of code to offload the computation to GPUs without having to write CUDA code yourself. Therefore, there are more and more GPU users, but not that many CUDA developers.

  • Hi Michael, I agree with you. However, this is a FutureLearn policy that we cannot do much about :(.

  • Hi John, the concept of parallel programming will be explained in Week 3. However, there are no coding exercises in this course.

  • In the case of hyper-threading, two threads are running on one CPU core. The instructions that need to be carried out by the two threads are placed in the pipeline of that CPU core for execution.

  • When multiple cores try to access the memory intensively at the same time, the memory becomes the bottleneck, so everything slows down. A single program can be assigned to run on more cores if it is programmed to do so.

  • When using a distributed-memory architecture, each machine has its own memory, but there are multiple machines; you need to write the software yourself so that it knows how to split the tasks across the different machines. More about this will be explained in Week 3.

  • The difference between the two architectures is explained later in the course.

  • Hi everyone, I am Zheng, an HPC consultant. I am interested in the learning process of robots.