David Henty

I have been working with supercomputers for over 25 years, and teaching people how to use them for almost as long. I joined EPCC after doing research in computational theoretical physics.

Location: EPCC, The University of Edinburgh, Scotland, UK.

Activity

  • Yes - the OS is constantly juggling dozens of different processes and threads, trying to ensure they all get their fair share of CPU time. The threads that OpenMP creates are just thrown into the mix with all the others. For HPC applications we usually make sure that a minimum of other tasks are running so the OpenMP threads will run almost continuously on the...

  • It is possible to reuse the heat but, until recently, the outlet water was not hot enough to be of much use. However, modern machines run much hotter, which makes the heat carried away by the water much easier to use, e.g. in heating other buildings - see "Energy Efficiency by Warm Water cooling" at...

  • @IstvanF The layout of all the cabinets is typically fixed to minimise cable lengths. Connecting all the cables is a huge job and normally done by dedicated experts.

  • That is a very good point - for large simulations on supercomputers, the limiting factor (the slowest part) is usually reading and writing memory and not the clock speed of the CPUs.

  • Yes - in a typical cellular automaton model you need to know the state of all the neighbouring cells. In 1D this is 2 neighbours (left and right), 2D is 4 neighbours (up and down as well), 3D is 6 neighbours ... In general, it's 2xD neighbours for D dimensions. If you include diagonals then the numbers of neighbours for 1D, 2D and 3D are 2, 8 and 26. In...
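
    To make the counting concrete: with diagonals included, each of the D coordinates of a neighbour can differ by -1, 0 or +1, which gives 3^D - 1 cells once you exclude the cell itself; without diagonals only one coordinate changes, giving 2xD. A tiny C illustration:

        #include <stdio.h>

        int main(void) {
            for (int d = 1; d <= 3; d++) {
                int face = 2 * d;            /* left/right, up/down, front/back */
                int moore = 1;
                for (int k = 0; k < d; k++)
                    moore *= 3;
                moore -= 1;                  /* 3^d - 1 when diagonals are included */
                printf("D = %d: %d neighbours, %d with diagonals\n", d, face, moore);
            }
            return 0;
        }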

  • There are a number of parallel packages that can do Molecular Dynamics on parallel supercomputers, e.g. NAMD, GROMACS, LAMMPS, AMBER, ... EPCC recently ran an online LAMMPS tutorial - see https://www.epcc.ed.ac.uk/blog/2019/online-lammps-training-archer

  • @AndrewMatthew Up until the early 2000's, each manufacturer had their own version of Unix, e.g. Unicos (Cray), Tru64 (DEC/Compaq), Irix (SGI), Solaris (Sun), AIX (IBM), ... The advantages were that each OS was tailored for a particular architecture, but the development cost of maintaining their own OS was too much for most companies so they gradually moved to...

  • You can argue that more powerful CPUs enable software to be written more easily as you can concentrate on functionality and elegance rather than having to worry about performance (since a fast CPU can still run less efficient software at an acceptable speed). Another view is that fast CPUs just encourage poorly written, bloated software!

  • Power consumption and heat are real issues for mobile devices - you want to maximise battery life and, as you point out, they are not well designed for getting rid of heat. This is why multicore technology is so attractive even if it makes the software more complicated - two cores each running at 1GHz use less power than one core running at 2GHz.

  • In practice, different cores will all be running at different speeds. Modern CPUs vary clock frequency dynamically based on load (e.g. turn it down if the processor is getting hot, crank it up if there aren't that many cores running and there is spare power). Even if they operated at the same clock speed, they would run at very different speeds in practice as...

  • The Game of Life is a very good example in terms of parallelising a real program. In practice, the strategy is identical to the traffic model - at each step, you update each cell based on the state of its nearest neighbours. In the 1D traffic model that just comprised the cells up and down the road. For the 2D Game of Life, it's the eight nearest neighbours...
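
    As a rough serial sketch in C (assuming the grid is stored with a one-cell halo of boundary copies so the interior loop needs no special cases, and that grid and next are the old and new generations):

        for (int i = 1; i <= N; i++) {
            for (int j = 1; j <= N; j++) {
                /* count the eight nearest neighbours */
                int live = grid[i-1][j-1] + grid[i-1][j] + grid[i-1][j+1]
                         + grid[i  ][j-1]                + grid[i  ][j+1]
                         + grid[i+1][j-1] + grid[i+1][j] + grid[i+1][j+1];

                if (grid[i][j] == 1)
                    next[i][j] = (live == 2 || live == 3);   /* survives */
                else
                    next[i][j] = (live == 3);                /* birth */
            }
        }

    Parallelising it then comes down to splitting the grid across processes and exchanging the halo cells at every step, exactly as in the traffic model.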

  • That's exactly correct - Message Passing is harder to implement, but less prone to subtle bugs. Most importantly for supercomputing, it is the only way to run on multiple nodes as Shared Memory is limited to a single node. Although this is a fine way to use all the cores on your laptop, on ARCHER this would limit you to running on only 24 cores of the total...

  • Virtualisation / containerisation is becoming more common in supercomputing as it allows you to develop on a local system (e.g. your laptop) and deploy on a larger machine (e.g. ARCHER). However, this can cause significant slowdowns for parallel programs. The whole point of virtualisation is to insulate operating systems from each other and from the hardware. In a...

  • That's correct - fans blow air over the blades, so it's cool air in and hot air out. The air is then cooled by large chillers which transfer that heat from the air to water, and at the ACF we can normally cool the water back down again using "ambient cooling" since the weather in Scotland is not normally very hot! See...

  • I commented on a similar point someone made in a different step and I think it's relevant here too:

    "This was tried in the early days of parallel computing and was called "metacomputing" - a single program running across separate computers distributed all over the globe. The problems are reliability (one of the machines could crash) and speed (it takes a...

  • This was tried in the early days of parallel computing and was called "metacomputing" - a single program running across separate computers distributed all over the globe. The problems are reliability (one of the machines could crash) and speed (it takes a long time for a computer in Europe to communicate with one in Japan). However, the model is used in...

  • My understanding is that it is using the appropriate precision for storing floating-point numbers rather than always using the highest precision available. For example, at the start of a calculation (where you may be a long way from the correct answer) there may be no need to use double-precision numbers - maybe single precision is enough. Later on, as you're...

  • @DavidFischak I first started working in HPC back in 1990 and you're right that there was a lot more diversity in the market: lots of competing processors and different flavours of Unix from numerous manufacturers. This changed and for quite some time we've had an almost complete monopoly of Intel x86 CPUs and Linux. However, things are changing again and, as...

  • @BernatMolero Monte Carlo simulation typically refers to any computation where random numbers are used. For example, if I wanted to simulate people evacuating from a building then I might use lots of random numbers to decide if someone turns left or right at the end of a corridor on their way out. This leads to lots of different simulations where people take...
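
    A toy sketch of the idea in C - the 3- and 5-minute exit times are invented purely for illustration, but it shows how averaging over many randomised trials gives a statistical answer:

        #include <stdio.h>
        #include <stdlib.h>

        int main(void) {
            int ntrials = 1000000;
            double total = 0.0;

            srand(12345);

            for (int t = 0; t < ntrials; t++) {
                /* 50/50 random choice at the end of the corridor */
                if (rand() % 2 == 0)
                    total += 3.0;   /* turned left: reaches the exit in 3 minutes */
                else
                    total += 5.0;   /* turned right: takes 5 minutes */
            }

            printf("Average evacuation time: %.2f minutes\n", total / ntrials);
            return 0;
        }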

  • Production-line manufacturing is a very good analogy. As you point out, there is parallelism within a single production line (e.g. different workers build different sections of a car as it passes down the line). The amount of parallelism might be limited, e.g. if there are 20 steps then you can't make use of more than 20 workers. The solution, as you've...

  • David Henty made a comment

    Hi - I'm David Henty and I work at EPCC at the University of Edinburgh, Scotland, UK. I co-developed the MOOC with Weronika and colleagues from SURFsara in the Netherlands.

  • As Jane points out, the number one machine has a performance profile that isn't necessarily representative of the majority of the world's supercomputers. However, another factor is that Moore's law is relevant for the performance of a single CPU. A supercomputer has many thousands of CPUs, so the total performance can outstrip Moore's law if we also increase...

  • Exactly - even a "null" message actually contains data such as the headers so they do clog up the network.

  • We could, but I think the issue has always been that processor speeds have increased more rapidly than memory systems so we're fighting a losing battle.

  • A very good point! Over the years, computing has swung between "thin client" models like your "dumb terminal" example (processing done remotely) and "thick client" models like powerful desktops (processing done locally). We seem to be in a "thin client" phase where many of our devices are just used as access points for remote processing systems such as...

  • These cycles are observed in real predator / prey data, see e.g. https://theglyptodon.wordpress.com/2011/05/02/the-fur-trades-records/

  • @TonyMcCafferty I don't know if it's exactly what you were thinking of, but people do something called "autotuning" to optimise performance. If there are lots of possible parameters to adjust for a computation, you can simply run thousands of copies with different settings and find out experimentally what the best settings are. This takes huge amounts of...
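
    In outline, autotuning is just a brute-force search over the settings - in this C sketch, time_kernel() is a placeholder for running the real computation with a given parameter:

        double best_time  = 1.0e30;
        int    best_block = 0;

        /* try each candidate block size and keep the fastest */
        for (int block = 8; block <= 1024; block *= 2) {
            double t = time_kernel(block);   /* placeholder: time the real code */
            if (t < best_time) {
                best_time  = t;
                best_block = block;
            }
        }

    In practice the different settings would be run as many independent jobs at the same time, which is why it needs so much compute power.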

  • If the batch system is doing a good job then the system should be reasonably full up all the time. People do build machines specifically to mine bitcoins, but it wouldn't be a cost-effective use of a supercomputer as you would not be using the capabilities of the high performance network.

  • Thanks for putting that link in!

  • @FrancescoMaroso You're correct that we could have had a GPU portion. However, we might have effectively ended up with two smaller systems - one with GPUs and one with CPUs - rather than one large system. The main focus of ARCHER was to enable very large simulations that could not be done on any other academic system in the UK so the decision was to have the...

  • F1 designers definitely use supercomputers to model their cars. However, to ensure a level playing field between teams, the amount of computer time they can use is severely limited, e.g. I found this discussion on an F1 fan site: https://www.f1technical.net/forum/viewtopic.php?t=13311

  • @HarryTerkanian We have a few simple parallel programs written in MPI plus C or Fortran that we use on training courses - see for example the exercise material at http://www.archer.ac.uk/training/course-material/2017/12/intro-ati/index.php - which cover image processing, fluid dynamics and fractals. These should be relatively easy to port to a Raspberry Pi...

  • @SimonHennessey We have a few simple parallel programs written in MPI plus C or Fortran that we use on training courses - see for example the exercise material at http://www.archer.ac.uk/training/course-material/2017/12/intro-ati/index.php - which cover image processing, fluid dynamics and fractals. These should be relatively easy to port to a Raspberry Pi...

  • People have been looking at using FPGAs for HPC for several years. Despite the potential for very good performance compared to power consumption, the problem has generally been programming them. It is very difficult to get good performance on them from large, numerically intensive programs written in C, C++ or Fortran.

  • @GillianC That's a good point - if a problem has a very complicated geometry such as if you wanted to simulate the air flow round an entire car then it is not easy to split the calculation up into equal-sized chunks. In situations like this then the approach is exactly as you describe - an important part of the pre-processing stage is "mesh partitioning" where...

  • @HarryTerkanian As ever, problems in computing have very good analogies in everyday life and "The Mythical Man Month" is an excellent analogy to the problem of just throwing more CPU-cores at a calculation. The real killer is that as you add more CPU-cores, each core is working on a smaller piece of the problem and the overhead of communication becomes greater.

  • The problem is to do with power consumption and heat production. Although we could produce a CPU with twice the speed, it would be so power hungry that it would be too expensive to run. It would also not be suitable for consumer devices as you would need expensive additional cooling to stop it overheating - your laptop can only really accommodate a small fan....

  • That's a very good point - on ARCHER the nodes are packaged so that there are four on a physical "blade". This means that these four nodes can actually communicate with each other much more quickly than with nodes on a different blade.

  • I'm glad you found them useful - we significantly expanded the "Towards the Future" section after the first run last year as it was clearly an area that people were interested in.

  • That's correct, but it's important to note that this comes from the use of accelerators (in the case of Piz Daint, NVIDIA GPUs) rather than traditional multicore CPUs. Since GPUs have a very different architecture to CPUs, it's not immediately clear how many "cores" a GPU has, but the top500 list appears to count the number of "Streaming Multiprocessors". The...

  • That's an interesting observation, but in supercomputer networks it turns out that the major overhead is getting the data onto and off of the network infrastructure. Once data is on the network it travels very fast, so the cable length doesn't have such a big effect on the end-to-end transfer time.

  • I was always sceptical about whether driverless cars would take off as, even if they reduce risks at a statistical level (i.e. fewer accidents across thousands of drivers), an individual driver will always think that they would have done better than the robot in each particular accident. However, I read an article that made the point that for driving there is a...

  • My understanding is that the complexity comes from simulating two materials of very different viscosities at the same time - oil is very thick and gas is very "runny" in the sense that it flows very easily. I'll see if I can find a more definitive answer ...

  • I don't think hard-wiring the OS would be a good idea as any errors could never be fixed, e.g. you could not patch the system when yet another security hole was discovered! I have talked about caches in terms of data, but in fact instructions are also cached so the performance of the operating system is usually very good as all commonly executed pieces will...

  • Although individual packets of data may be retransmitted, if there is a serious network failure then it will typically bring the whole system down. We spend lots of money on supercomputer networking for both speed and reliability. If you are doing calculations across widely distributed computers, such as is done by Amazon and Google, you build resilience into the...

  • @SandraPasschier It depends. On ARCHER, you do your visualisation on a separate (smaller) system called the Data Analytic Cluster, although it is connected to the same disk storage as ARCHER so you don't have to copy your data around. If the visualisation is very computationally expensive, or needs such huge amounts of data that you can't afford to write it...

  • My understanding is that TPUs are designed for very fast calculation but at low precision. This is OK for many artificial intelligence applications but probably not OK for traditional computer simulations - I touched on this a bit in a previous answer https://www.futurelearn.com/courses/supercomputing/3/comments/25575992

  • I didn't notice you'd already answered Anton's question before I posted my own answer in https://www.futurelearn.com/courses/supercomputing/3/comments/25988184

  • @TonyMcCafferty A very good point - log graphs can be deceptive and hide the enormous increase in the data values by collapsing them together. We touch briefly on quantum computing here, which some believe is the next step.

  • Having periodic boundary conditions in our one-dimensional traffic model is the same thing as using a circle (i.e. a roundabout) rather than a line (a straight road). As you point out, in two dimensions, periodic boundaries in both dimensions gives you the topology of a torus (i.e. a doughnut) where you come back to where you start if you head off either the...
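
    In code, periodic boundaries usually just mean wrapping the neighbour indices with a modulo so that the last cell's right-hand neighbour is cell 0 again - a minimal sketch for the 1D road, where update() is a placeholder for the traffic rule:

        for (int i = 0; i < N; i++) {
            int left  = (i - 1 + N) % N;   /* cell 0 wraps round to cell N-1 */
            int right = (i + 1) % N;       /* cell N-1 wraps round to cell 0 */
            next[i] = update(road[left], road[i], road[right]);
        }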

  • This is an interesting point. However, too many CPU cores has two downsides for supercomputing. First, it can overload the memory bandwidth as we saw in Week 2. Second, all CPU-cores on a node typically share a single network connection which means the communications can slow down. This is why in typical business and banking applications you see machines with...

  • I guess that modern satellites have improved the situation for some measurements. Interestingly, however, you can use simulation to cope with sparse experimental data. Imagine we know the weather today but only on a sparse grid (say 20-mile spacing). Let's start with yesterday's sparse experimental data and *guess* what the real data was on a 100-yard grid...

  • That's exactly the point - at some point the limiting factor is not the power of the CPU cores but their ability to access the memory. The table at the bottom of https://www.futurelearn.com/courses/supercomputing/3/steps/260548 illustrates this to some extent. Rather than adding more physical CPU-cores to a processor (which I'm not remotely qualified to do!) I...

  • @FrancescoMaroso The peak performance would have increased significantly. Each ARCHER node is around 0.5 Tflop (2 x 250 GFlop CPUs). At the time of installation, a GPU might have been around 1 TFlop peak so a CPU+GPU node would have had well over twice the pure CPU peak. However, for a national system like ARCHER, you need to look at the spread of applications...

  • Fortran is still commonly used in computational science because it is a language designed specifically for scientific and technical computing. However, the stats on ARCHER are slightly misleading. Many people run centrally installed packages - although a larger fraction of the CPU cycles are used running programs that are written in Fortran, this is because...

  • A very good question! From a user point of view, all the cores are basically the same. I have heard it said that certain core Linux operating system services are designed to run on the first core (which would be number zero) but I can't find any references to this ...

  • In case we didn't explain it clearly enough, it's not a question of the memory being "big-enough" - it's a question of whether it is possible for all CPU-cores to access the memory with sufficient speed. The answer is no - if all CPU-cores try to access main memory at the same time then they slow each other down. It's a bit like the road network - you would...

  • EPCC is involved in a project looking at the application of cross-point memory for supercomputing - see http://www.nextgenio.eu/. One focus is using it as a faster alternative to disk - the project is looking at the "new 3D XPoint non-volatile memory, which will sit between conventional memory and disk storage".

  • This illustrates one of the challenges of writing an efficient parallel program. By making some parts of the calculation very fast, other parts start to be the limiting factor and you have to start addressing them as well. So, a 2-hour check-in may not be particularly significant for a normal airplane but for Concorde it has a significant effect on the...

  • @GillianC There is a list of projects like this on Wikipedia - see https://en.wikipedia.org/wiki/List_of_distributed_computing_projects

  • That's correct - to run a program on multiple cores requires that the program is capable of being parallelised, and that a programmer has implemented the parallelism. However, remember that you can still use multiple cores by running several different serial programs at the same time. In this situation the operating system can automatically take advantage of...

  • That's the key point - to run a single program on multiple cores requires a parallel algorithm. The operating system can automatically keep all the cores busy if there is a large number of serial applications to run at the same time, but parallelising an application currently needs human expertise.

  • You're right that High Bandwidth Memory is a very important development for supercomputing - since we're typically limited by memory bandwidth, anything that increases it is going to have a dramatic impact on performance.

  • That's exactly the point I was trying to make - having two laptops is often just inconvenient. As a family grows, the parents would probably buy a single larger car rather than an additional small one.

  • The nice thing about learning OpenMP is that it is supported by almost all modern C, C++ and Fortran compilers and you can learn it by just using a standard multicore laptop. For technical reasons you can't use OpenMP from Python (it's an interpreted language and can't really cope with threads) although you can do message-passing with MPI via mpi4py.
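
    For anyone who wants to experiment, a minimal OpenMP program in C is just an ordinary loop with one directive added - compile with something like gcc -fopenmp:

        #include <stdio.h>
        #include <omp.h>

        int main(void) {
            int n = 1000000;
            double sum = 0.0;

            /* the reduction clause gives every thread its own partial sum */
            #pragma omp parallel for reduction(+:sum)
            for (int i = 0; i < n; i++) {
                sum += 1.0 / (i + 1);
            }

            printf("Up to %d threads available, sum = %f\n",
                   omp_get_max_threads(), sum);
            return 0;
        }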

  • That's a very good point. The way to mitigate this is for the scheduler to try and allocate nodes that are close together in the network, e.g. for ARCHER always try and give a job nodes that are in the same cabinet, as inter-cabinet communication is more costly.

  • Scheduling is normally done automatically and not by a human operator. However, visualisation can be very useful to understand what the scheduling algorithm is doing. If you think things are going wrong, e.g. you suspect the compute nodes are not being efficiently used, then visualisation can make what is going on much clearer. We had a summer student look at...

  • Point 1) is a very interesting one. The traffic model is a useful example here. No matter how large the problem is, i.e. regardless of the length of road on each process, you have to communicate information for a single cell to your nearest neighbours. In this simple case the overhead is therefore independent of problem size. In real calculations it is not...
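
    To put rough numbers on the traffic-model case: with N cells per process, each step does N cell updates but only 2 boundary-cell communications, so the communication overhead is of order 2/N of the work and shrinks as the local problem gets bigger.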

  • I have rather simplified things to make the point clear! You're right that Concorde didn't have the range to get to Australia in a single hop.

  • That's right - using multiple nodes means you really have no choice but to use message-passing. I addressed using both methods in a previous comment - https://www.futurelearn.com/courses/supercomputing/3/comments/25462595/

  • You're right that cache coherency is difficult at this scale. The naive approach I outlined in the article - everyone broadcasts all changes to everyone else - doesn't work here. You need to do more sophisticated book-keeping, for example keep a directory of which core holds which data in its cache so you can look it up and go directly to the right place.
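
    Very schematically, the directory is just a table indexed by cache line recording who owns it, so an update only needs to contact that one core (the names here are purely illustrative):

        /* a (much simplified) directory: which core, if any, holds each line */
        int directory[NLINES];                /* -1 means not cached anywhere */

        int owner = directory[line];
        if (owner >= 0)
            notify_owner(owner, line);        /* contact only the core that has it */
        directory[line] = my_core;            /* record the new owner */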

  • In reality you would use much more sophisticated simulations. You can use them to do short-term predictions about where congestion is likely to occur, which might enable you to prevent it by altering traffic light sequences or changing road priorities. You can also use them for longer term planning, for example trying to find out whether adding an extra lane...

  • These are very interesting questions - could you please repost them in the "Ask an Expert" section of Week 3 so they're in the list of potential topics?

  • This is a very interesting point - I answered a similar question here: https://www.futurelearn.com/courses/supercomputing/3/comments/25664655/

  • @JacqR It's really a question of economics - it would be expensive to construct a building using any kind of non-standard parts so bespoke doors would add to the cost. Plus there are the issues of transport, fitting the cabinets on forklift trucks, etc. If a manufacturer produced an oversized cabinet then it could severely restrict their market.

  • @DougBoniface It's always good to discuss things, especially when the answer is not clear! The bottom line is that, in supercomputing, it *is* faster to send one large message compared to many small ones but exactly why this is may not be 100% obvious in terms of the low-level networking.

  • @DougBoniface I'm not an expert in network hardware but my understanding is that there is an initial setup cost when the processor says "I want to send a message over the network". Once all the setup is complete, the processor can push data onto the network very quickly where it is subsequently broken up into packets. An analogy might be boarding a plane -...

  • @GillianC Modern networks are actually quite resilient against single packet failures, so the packet should be resent via a different pathway.

  • That's correct - the overhead of making a connection on a supercomputer network is quite high.

  • You are right that, at the lowest level, a message is split into independent packets which can in principle take different paths between source and destination to avoid congested areas of the network; if there are errors then packets are resent. It turns out that the major overhead in supercomputers is getting the data from the processor onto / off of the...

  • I've slightly skipped over the details here. For a cellular automaton the updates are supposed to be done as if they were instantaneous, i.e. independent of the ordering. What you are supposed to do is look at all the cars and say "this one can move, this one can't, ..." but without actually moving the cars. You then do a second pass and move all the cars that...
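
    For the 1D traffic model the two passes look roughly like this in C (1 is a car, 0 an empty cell, on a periodic road):

        #include <stdio.h>

        #define N 10

        int main(void) {
            int road[N] = {1,1,0,1,0,0,1,0,0,0};
            int moves[N], next[N];

            /* pass 1: decide which cars can move, without moving anything yet */
            for (int i = 0; i < N; i++) {
                int ahead = (i + 1) % N;
                moves[i] = (road[i] == 1 && road[ahead] == 0);
            }

            /* pass 2: apply all the moves "simultaneously" */
            for (int i = 0; i < N; i++) {
                int behind = (i - 1 + N) % N;
                if (moves[i])
                    next[i] = 0;            /* this car drives forward */
                else if (moves[behind])
                    next[i] = 1;            /* the car behind moves into this cell */
                else
                    next[i] = road[i];      /* nothing changes here */
            }

            for (int i = 0; i < N; i++)
                printf("%d", next[i]);
            printf("\n");
            return 0;
        }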

  • @PeteTaylor In our full outreach presentations we explain this a bit further. Although the dinosaur racing game is really designed for fun, it is based on real research - see http://www.animalsimulation.org/page2/styled-4/

  • If there is a node failure then the batch system, which allocates user jobs to compute nodes, will notice this and make sure that no further jobs are sent to the node until it is fixed. However, any programs running on that node will simply crash and fail. Even if your job uses hundreds of nodes (thousands of CPU-cores) a single node failure will cause the...

  • Thanks for spotting this error - I've corrected it. The reason I say "around" is that the distribution of compute nodes is not even across cabinets - some cabinets have fewer compute nodes as they include a few system nodes for IO etc. However, as you spotted, 9000 was completely wrong!

  • @StephenC This is exactly the approach taken in real parallel applications - issue all your sends and receives asynchronously then get on with your own work until the messages are all complete. Your comment about id-ing the messages is an interesting one - MPI (the standard message-passing library) is designed so that you normally don't have to worry about...
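
    Very roughly, that pattern in MPI looks like the sketch below (the buffer and function names are just placeholders): post the non-blocking calls, get on with the interior work, then wait for everything to complete.

        MPI_Request reqs[4];

        /* post the halo exchange with the left and right neighbours */
        MPI_Irecv(&halo_left,  1, MPI_DOUBLE, left,  0, comm, &reqs[0]);
        MPI_Irecv(&halo_right, 1, MPI_DOUBLE, right, 0, comm, &reqs[1]);
        MPI_Isend(&edge_left,  1, MPI_DOUBLE, left,  0, comm, &reqs[2]);
        MPI_Isend(&edge_right, 1, MPI_DOUBLE, right, 0, comm, &reqs[3]);

        update_interior();                          /* useful work needing no halo data */

        MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);  /* halos guaranteed to have arrived */

        update_boundaries();                        /* now safe to use the halo cells */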

  • My understanding of the history is that the US didn't want to sell China large numbers of top-end Intel chips so the Chinese produced their own CPU for their latest supercomputer. You are correct that not all systems will appear on the list, for example the specs of those used by intelligence agencies are unlikely to be made public.

  • @CharbelSolís This may not be true right now but it could be an issue in the future - new processors may have many tens of cores which might not be of any use to a home user. This isn't unique to supercomputing - many people spend tens of thousands of euros buying incredibly fast cars when, in the UK at least, the legal speed limit is under 115 km/h. Some...

  • @DougBoniface The fact that modern processors do not operate directly on memory means that you have to use some additional technique to ensure that two CPU-cores do not try and alter the same data at the same time. One approach is to lock the data - in the solution video I use the analogy of there being a single pen that you have to own to be allowed to write...
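
    In OpenMP, for example, the "single pen" corresponds to a critical region (or an explicit lock) - only one thread at a time is allowed inside it. A sketch, where compute() stands in for some independent work:

        double total = 0.0;

        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            double value = compute(i);   /* independent work, done in parallel */

            #pragma omp critical
            {
                total += value;          /* the "pen": one thread at a time */
            }
        }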

  • You are correct that cache coherency is enforced in almost all modern multicore processors. However, it is really done at the hardware level and not by the operating system. It was this mismatch that is at the root of the recent Meltdown bug - the hardware is caching sensitive data even though the operating system is trying to prevent a user from accessing it.

  • A very interesting question! My guess is that you'd need to do some kind of hard reset so that the nodes didn't just reboot themselves but also talked to each other to find out where they were on the network and work out the various network paths between them. However, I would expect the cabinets are physically the same so you could swap them about provided...

  • You've identified the key issue here. The standard cloud computing model envisages that data is only transferred between the user and the server. Although there may be many servers, they do not communicate with each other. This is because it all came about from internet searching and, as you've spotted, the searches are independent of each other. For typical...

  • @JasonPolyik A useful figure to remember is that, with a 3GHz processor, light only travels around 10cm every clock cycle.
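
    That figure is just the speed of light divided by the clock rate: (3 x 10^8 m/s) / (3 x 10^9 cycles/s) = 0.1 m per cycle, i.e. about 10cm.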

  • Following up on what @WeronikaFilinger has said, all simulations have their limitations. Bloodhound SSC is an extreme case, but a more standard example in car design would be crash testing where every new design has to pass safety tests based on crashing a real car. Although safety is the ultimate goal, it's not an issue in the tests themselves as the car will...

  • I'm afraid that I've been a bit loose with terminology here - I've corrected it to cycles. Thanks for pointing this out.

  • Hello Bayu - before I started working at EPCC I did research in lattice gauge theory for several years, which is actually how I started doing parallel computing.

  • Although GPUs were originally designed for doing graphics, they are capable of general-purpose calculations and can run computer programs that have nothing to do with visualisation, for example using NVIDIA's CUDA language. The GPUs are part of the supercomputer in the same way that the CPUs are. If you look at Step 2.14 you will see that ARCHER has two...

  • That's absolutely correct - on a supercomputer node we try and make sure that there is almost nothing running except for the user's processes, e.g. on ARCHER that would be no more than 24 processes (one per CPU-core). The operating system will still need to run a few essential tasks but again we try and keep them to a minimum.

  • @BartWauters You're right that things have become a little more complicated since 2016 when two existing systems (Piz Daint and Piz Dora) were combined into an upgraded machine. However, the point I was trying to make was that the Swiss system still has a similar number of nodes to ARCHER (around 6,500 vs 5,000) but a much higher performance. This must mean...