Jussi Enkovaara

Senior application scientist at the Finnish national supercomputing center CSC - IT Center for Science.

Location Finland


  • Hi @LisaLanduyt how did you measure the timings? Generally, the larger the array, the more benefit there should be from vectorization.

  • As pointed out by @IharSuvorau optimization often makes readability worse, so remember to document any optimizations in the code!

  • Hi @CamilleClouard note that in the example above `data` is a ten-element array, while `rank` is a scalar value. When you gather, the root receives the values from all the ranks, and the non-root ranks do not get anything, i.e. `n` is `None`. If you gather the `data` variable, the `n` in root should contain a four-element list (with 4 MPI tasks) where each...

  • Hi @CamilleClouard lower case and upper case routines work differently: lower case routines always return data, while for upper case routines one needs to provide the "return" array as an argument. In the above example with `n = comm.gather(rank, root=0)`, in rank 0 `n` will be a list containing all the ranks, whereas in other ranks `n` will be `None`.

  • Creating a communicator is a collective operation with some cost (at very large parallel scale the cost may be surprisingly large).

  • Hi @DominikLoroch the semantics of isend are that the send buffer can be reused only after the communication has completed (i.e. wait or test has been called). Whether internal buffering is used depends on the MPI implementation and the message size. We should probably emphasize that in the article as well.

    You can try a simple test:
    from mpi4py import...

  • Isendrecv would be just the same as Isend followed by Irecv

  • Hi @VikasKushwaha , FutureLearn policy is to provide certificates only for a fee (that's not for us to decide), but please provide feedback directly to FutureLearn via the Contact link at the bottom of the page.

  • Hi @RonaldCohen , yes, it is possible to have OpenMP threading in C or Fortran code that is called from Python. One can also have MPI calls both in C/Fortran and in Python within the same program, one just needs to pass the communicator to C/Fortran function.

  • Hi @MauriceKarrenbrock you are correct, OpenMP cannot be used in Python code (you can still have a Python module written in C utilizing OpenMP). Generally, the global interpreter lock in CPython makes it difficult to efficiently parallelize pure Python code with threads.

  • If you need multidimensional arrays in a "simple" calculator, NumPy might also be convenient (even if you do not care about the performance)

  • The equation shows the maximum speedup with an "infinite" number of CPU cores. If the whole problem (100 %) is parallelizable, then the maximum speedup is infinite :-)
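
The limit described above can be checked numerically. The following is a minimal sketch, assuming the equation in question is Amdahl's law, speedup = 1 / ((1 - p) + p / n), where `p` is the parallelizable fraction and `n` the number of cores. The function name `amdahl_speedup` is hypothetical and chosen only for this illustration.

```python
import math


def amdahl_speedup(p, n):
    """Speedup predicted by Amdahl's law (assumed form of the equation).

    p: fraction of the work that can be parallelized (0.0 to 1.0)
    n: number of CPU cores (math.inf for the limiting case)
    """
    denominator = (1.0 - p) + p / n
    # With p == 1.0 and n == math.inf the denominator is exactly zero,
    # which corresponds to an unbounded speedup.
    return math.inf if denominator == 0.0 else 1.0 / denominator


for p in (0.5, 0.9, 1.0):
    print(f"p = {p}: 8 cores -> {amdahl_speedup(p, 8):.2f}, "
          f"infinite cores -> {amdahl_speedup(p, math.inf):.2f}")
```

With any serial part left (p < 1), the speedup stays bounded by 1 / (1 - p) no matter how many cores are added; only a fully parallelizable problem has no upper bound.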

  • If you use local single core as reference, then yes.

  • The semantics of send and recv are that their completion *might* depend on other processes, i.e. send might return only if the corresponding recv has been called.

    In practice, for small messages MPI libraries typically perform some internal buffering, so that send returns "immediately" while for large messages corresponding recv has to be called.

  • No, in this case in rank 0 you want to both send to 1 and receive from 1 (and similarly for rank 1)

  • Actually, as cimport and import do completely different things (importing Cython definitions vs. importing Python modules), one can cimport and import under the same name (you can also just 'import numpy' and 'cimport numpy'). This is the practice that most Cython tutorials use, but personally I prefer to use different names.

  • Even though `cdef` functions can be called only from within the same Cython module (the .pyx file), they can return either Python values or pure C values. Thus, `cdef int add(...)` returns a pure C integer, while with `cdef add` the return value is converted from a pure C value to a Python object, which adds some overhead.

    Generally, when the type of...

  • Yes, it should be mandel.pyx, I have no idea why I used .cyt in the video... :-) (the `cython` command does not actually care about file extensions, but the `cythonize` function in setup.py requires a '.py' or '.pyx' extension)

  • Yes, in this case the function call overhead is much larger than the actual computation so you do not see much difference.

  • Hi @GianniProcida as the first rule of optimization you should first make sure that the pure Python module works correctly. The Python debugger can be useful in this: `python3 -m pdb myfile.py`.

    If it seems that cythonization introduces bugs, you can try to follow the steps in the above link pointed by @IainS

  • Hi @NooraH and @PatrickKürschner I realized that the Fortran compiler is missing from the virtual image, sorry for that!

    You can install it: sudo apt install gfortran

    After that also f2py3 should work.

  • Thanks @YannickGansemans this is fixed now. Dangers of not doing edits in real code...

  • Yes, that's right, you cannot call `cdef` functions from Python (i.e. from heat_main.py) but only within the Cython module. Thus, evolve can be made `cdef` but other functions would need to be `cpdef` (if one wants to cythonize them).

  • @LauraVuorinen I still have rather limited experience with Numba, but it looks really promising. At least in simple cases one seems to get the same performance as with Cython (or C/C++ or Fortran) with tiny effort, similar to what @giordanozanoli wrote.

    We would definitely like to include also Numba in this course, unfortunately we haven't yet had enough...

  • @IharSuvorau the lesson here is that a good non-optimized algorithm beats an optimized bad algorithm 10000000 - 0 :-) Premature optimization is ...

  • @FraserKennedy this is expected behaviour, Cython is more strict about implicit type conversions.

  • Hi Rafael, just to point out that you're also measuring the implicit creation of a NumPy array from lst; of course, with the math example there is also the overhead from the list comprehension. Generally you are still right that math is typically more efficient for scalars even if you neglect the array creation:

    In [2]: a = 0.37 ...

  • Hi Christof,
    sys.getsizeof(a) returns the size of the whole object 'a', which in the case of a NumPy array includes all the metadata (array shape etc.) in addition to the actual data. In this case the metadata takes up 96 bytes:

    In [2]: a = np.zeros(1, dtype='S1')
    In [3]: sys.getsizeof(a) ...

  • Hi, as Germain pointed out, a NumPy array has a reference only to a single "data buffer" in memory (there cannot be a hierarchy of references), and thus there is really no concept of a shallow copy with NumPy arrays.

    If one uses the "shallow" `copy.copy` from the Python stdlib (even though the array's own copy method is the recommended one) with NumPy arrays, one still gets...

  • One can also assign multiple values at the same time:
    a[[4, 6]] = [-1, -2]

    Generally, one can index NumPy arrays with integer lists / arrays and boolean mask arrays:
    In [1]: a = np.arange(10)
    In [2]: m = a < 5
    In [3]: a[m] = -1 ...

  • The link describes the array API, which is in principle a slightly different thing from the array constructor "array". But I admit it is a bit confusing. Furthermore, as one can also provide 'c8' and 'c16' as dtype, meaning single or double precision complex numbers... NumPy does not fully comply with the Zen of Python :-) ("There should be one-- and preferably...

  • Hi Lassi, floating point numbers can be a bit peculiar, as many simple decimal numbers cannot be represented exactly; double precision numbers are accurate only up to ~16 digits. Here, the culprit is 0.2:
    In [1]: format(0.2, '.18f')
    Out[1]: '0.200000000000000011'
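
The rounding described above can be inspected with the standard library alone. This is a small sketch; the sum `0.1 + 0.2` is an illustrative assumption and not necessarily the calculation discussed in the original thread.

```python
import math
from decimal import Decimal

# The value actually stored for the literal 0.2, shown to 18 decimals.
shown = format(0.2, '.18f')
print(shown)

# Decimal(float) converts the stored binary value exactly, digit by digit.
print(Decimal(0.2))

# Because the operands are already rounded, an exact comparison can fail,
# whereas a comparison with a tolerance behaves as expected.
exact_sum_equal = (0.1 + 0.2 == 0.3)
close_enough = math.isclose(0.1 + 0.2, 0.3)
print(exact_sum_equal, close_enough)
```

In numerical code it is usually safer to compare floating point results with a tolerance (`math.isclose`, or `numpy.isclose` for arrays) than with `==`.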

  • Hi Tom, could you point out where NumPy documentation says that 'c' is a complex number?

    Frankly, I think that 'c' for a character array exists for historical backward compatibility, and according to the NumPy documentation it is not recommended. However, it is still the easiest way to create such an array from a string.

    In addition to 'fromiter', one can also use...

  • You can create NumPy arrays both from tuples and lists, i.e.
    numpy.array([1, 2, 3, 4]) and numpy.array((1, 2, 3, 4)) create exactly the same NumPy array. Also, the input data (either from list or tuple) is always copied, if you have something like
    myarray = numpy.array(mylist)
    modifying myarray does not change mylist.

  • Hi Giancarlo, that is a bug indeed, thanks for spotting it! We will fix it shortly.

  • Hi Ihar, you can investigate it yourself by running the heat equation with and without cProfile :-) . Generally speaking, there is some overhead from cProfile, but I think it depends also on how many time() calls you need. Internally, cProfile is implemented in C so its timing routines are in principle more efficient, but on the other hand it is performing a...
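
The suggested experiment can be sketched on a toy workload as follows. The functions `work` and `square` are hypothetical stand-ins for the heat equation code, chosen so that many small function calls are made; the measured overhead depends on the machine and the workload.

```python
import cProfile
import time


def square(x):
    """Tiny function, called many times to make profiling overhead visible."""
    return x * x


def work(n):
    """Hypothetical workload: sum of squares computed via n function calls."""
    total = 0
    for i in range(n):
        total += square(i)
    return total


n = 200_000

# Plain run, timed by hand.
t0 = time.perf_counter()
plain_result = work(n)
plain_time = time.perf_counter() - t0

# Same call executed under cProfile, timed the same way.
profiler = cProfile.Profile()
t0 = time.perf_counter()
profiled_result = profiler.runcall(work, n)
profiled_time = time.perf_counter() - t0

print(f"without cProfile: {plain_time:.4f} s")
print(f"with cProfile:    {profiled_time:.4f} s")
```

Calling `profiler.print_stats()` afterwards shows the per-function statistics collected during the profiled run.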

  • Hi, I assume you are using Windows? This same input file is used in quite a few places in the exercises, and in Linux/Mac this is conveniently dealt with using symbolic links (when modifying the file the changes are seen everywhere without any need to copy). Unfortunately symbolic links do not work in Windows so you need to manually copy it.

  • Hi Giancarlo, the nice thing with an older machine is that you really feel the difference when optimising the code later on :-) (not that more powerful machines would not see any benefits.)

  • There was indeed a typo in the formula, thanks for spotting it!

  • Hi Alexander, you are right that boundary condition is also needed.

  • Hi Marko, see demos/performance/matmul/test_matmul.py (or directly in github at: https://github.com/csc-training/hpc-python/blob/master/demos/performance/matmul/test_matmul.py)

  • Measuring performance always incurs some overhead, but just "import time" should be negligible in most real situations. Note that Python caches imports, i.e. when a module is imported multiple times from multiple places in the program, the disk is read only once and in subsequent imports everything is readily available in memory.

    The process_time() function has...
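
The import caching mentioned above can be observed through `sys.modules`, the dictionary in which the interpreter keeps already imported modules. A minimal sketch:

```python
import importlib
import sys
import time

# After the first import the module object is stored in sys.modules.
cached = sys.modules["time"]

# A repeated import does not load the module again; it returns the
# cached module object.
again = importlib.import_module("time")

same_object = cached is again and again is time
print(same_object)
```

The repeated import is therefore essentially a dictionary lookup, which is why its cost is negligible compared with typical measured code.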

  • The choice between compiled vs. interpreted is generally speaking a compromise between programmer efficiency and computer efficiency. Nowadays an active area of development is just-in-time compilation, which tries to get the best of both worlds.

  • Hi, the dynamic typing and data structure are quite generic areas and the issues Dominik mentioned are directly related to both of them.

  • Hi Cristina, just out of curiosity, which browser/operating system are you using?

  • The pythonuser has admin rights, so you can authenticate with the same 'hpc1python' password.

  • Hi Eva, can you provide a few more details about your problem? What operating system are you using and what is the actual error message?

  • Hi Eike, if the tests pass, you should be fine for the rest of the course.

  • Hi Sameer, installing MPI can sometimes be a bit tricky in Mac and Windows environments, which is one of the reasons we are supporting only Linux. You can try the brew install as suggested by Qing above or just use the virtual machine.

  • Hi Ashwin, thanks for spotting this bug! We will fix it shortly

  • Hi Outi, you are right that for-loops can be quite inefficient in Python (as said in the assignment, this is an inefficient implementation :-) ). During the course we will look for some better ways.

  • Hi, you are right the correct module is `heat_main.py`, sorry for the typo!

  • In numerical calculations a Fortran compiler can usually optimize the code more easily. However, in most cases C code can be made equally fast but you might need to add extra guidance for the compiler (pragmas, restrict qualifiers for pointers etc.)

  • Hi @DariaS , using the provided image should not be too much work if you already have VirtualBox running, as you can skip step 2 in the instructions (in principle the .ova image should also work with other virtualization systems such as VMware but we have not tested them).

    Downloading the image takes of course some time (depending on your bandwidth),...

  • Hi @BillRoberts , the file path in bottle.dat is due to use of symbolic links in the Linux side, which work a bit differently in Windows. You will probably encounter the same issue also later in the course.

  • Hi @MicahWood , I guess you are using Windows? We use symbolic links for bottle.dat which work a bit differently in Windows, so you will probably encounter the same issue also later in the course.

  • Hi @PauliinaMäkinen in principle, the virtual machine should have only a small effect on the execution time, but if your system is otherwise busy (e.g. some system update is running in the background) it can make the simulation within the virtual machine run slower. Also, large variation in the execution time suggests that the system is busy also with something...

  • Sorry for the deprecated `plt.hold` in this exercise. The provided virtual machine works (although with a warning); if you are using your own matplotlib installation and it gives an error, you can try replacing `plt.hold(False)` with `plt.gca().clear()` (or issue git pull, the material has been fixed in github)

  • Hi Alice, can you be a bit more specific, what kind of trouble you are having with the password?

  • Hi Bill, in principle everything we discuss can be done also in Windows provided you have the necessary packages installed.