Just another technical day

Advanced MPI

By K4Paul, November 17, 2013 | Supercomputing | HPC, MPI, Parallelism, Supercomputing

One of the Intel tutorials I attended this year at SC13 (Supercomputing 2013) was on advanced MPI concepts and the improvements made in the new MPI 3 standard. Their notes can be found at http://www.mcs.anl.gov/~thakur/sc13-mpi-tutorial/. This post covers a few of the topics mentioned in the lecture, along with a brief review of basic MPI concepts. I will be adding models and figures in a post update.

Refresher

MPI stands for Message-Passing Interface. It is a standard for passing data between processes, whether they run on the same node or on different nodes. That makes it appealing in a supercomputing environment: when a cluster is processing large amounts of data, the load can be divided among nodes, with MPI used to communicate data between them when necessary.

Timeline

The following timeline highlights a few of the features each MPI version introduced.

1994, MPI 1
- Point-to-point communication
- Custom datatypes for message passing
- Collectives: operations in which every process in a selected group (a communicator) takes part
1997, MPI 2
- Parallel I/O and thread support
- One-sided operations
2012, MPI 3
- Nonblocking and neighborhood collectives
- Tools interface
- Improved one-sided communications

Datatypes

MPI requires developers to implement parallel algorithms explicitly. Nothing is parallelized automatically: a developer has to describe the data each MPI process works on and sends. In other words, the programmer needs to know and plan out how to divide the dataset workload among processes using MPI constructs and datatypes.
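As a rough illustration of that planning step, here is a minimal pure-Python sketch of a block decomposition. It does not call MPI at all. The function name `block_range` and the 10-item dataset are made up for this example, and `rank` simply stands in for the process number MPI would hand to each process.

```python
def block_range(n_items, n_procs, rank):
    """Return the half-open [start, stop) range of items owned by `rank`."""
    base, extra = divmod(n_items, n_procs)
    # The first `extra` ranks each take one additional item,
    # so the sizes never differ by more than one.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


# Hypothetical dataset of 10 items divided among 4 processes.
data = list(range(10))
for rank in range(4):
    start, stop = block_range(len(data), 4, rank)
    print(rank, data[start:stop])
```

In a real MPI program each process would compute only its own range from its rank and the communicator size, rather than looping over all ranks.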

Datatypes let users specify the type, content, and size of the data in a message. MPI has a library of familiar types such as int, double, float, etc., and these types can be combined to create customized datatypes. Process data is not always laid out sequentially in memory; it may instead be striped across memory at intervals. A customized datatype can describe that pattern, so that MPI reads exactly the right pieces of a striped segment for a process to use. The more general the datatype, the harder it is for MPI to optimize.
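To make the non-contiguous pattern concrete, here is a small pure-Python sketch of the layout that a vector-style datatype describes (in MPI, `MPI_Type_vector` takes a count, a block length, and a stride). The helper `vector_indices` is hypothetical and only computes element offsets; it is not an MPI call.

```python
def vector_indices(count, blocklength, stride):
    """Element offsets picked out by `count` blocks of `blocklength`
    elements, where consecutive blocks start `stride` elements apart."""
    return [b * stride + i for b in range(count) for i in range(blocklength)]


# A 3x4 matrix stored row by row in one flat list.
rows, cols = 3, 4
matrix = list(range(rows * cols))

# Column 1 is not contiguous: one element per row, `cols` elements apart.
column_1 = [matrix[1 + off] for off in vector_indices(rows, 1, cols)]
print(column_1)
```

The point of describing the pattern once is that the sender does not have to copy the column into a temporary buffer first.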

MPI 3 Collective Improvements

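MPI collectives are actually a pretty cool concept. A collective is an operation, such as a broadcast or a reduction, that every process in a communicator takes part in. A programmer can of course use MPI to talk to all processes at once (MPI_COMM_WORLD), but there are times when only certain processes need to be involved. For instance, maybe only odd-numbered processes should deal with a specific data chunk. In that case the processes can be split into a subgroup with its own communicator, and collectives called on that communicator involve only its members.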
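The odd-numbered-processes example can be sketched in plain Python. This is a simulation only: `comm_split` mimics the grouping that `MPI_Comm_split` performs by color, and `allreduce_sum` mimics a sum reduction whose result every member receives. Both helper names, and the values each rank contributes, are assumptions made up for this sketch.

```python
def comm_split(world_size, color_of):
    """Group world ranks by color, the way a communicator split would."""
    groups = {}
    for rank in range(world_size):
        groups.setdefault(color_of(rank), []).append(rank)
    return groups


def allreduce_sum(members, value_of):
    """Sum one value per member and give every member the total."""
    total = sum(value_of(rank) for rank in members)
    return {rank: total for rank in members}


# Six processes; color 1 collects the odd ranks, color 0 the even ones.
groups = comm_split(6, lambda rank: rank % 2)
odd_ranks = groups[1]

# Only the odd ranks take part in this reduction.
result = allreduce_sum(odd_ranks, lambda rank: rank * 10)
print(odd_ranks, result)
```

The even ranks never appear in `result`, which mirrors how a collective on a sub-communicator leaves non-members alone.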

What gets even better is that the new MPI 3 standard includes additional collectives for exchanging data with neighboring processes. In the past it has been a problem to calculate border data for a mesh segment on a single node without knowing the data in the neighboring segments. An example was presented in which MPI is used to implement a halo of neighbor data around a process's dataset. This eliminates the border problem because the neighborhood collective gives each process a copy of the adjacent cells from its neighbors.
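Here is a minimal pure-Python sketch of the halo idea in one dimension. It only simulates the result of the exchange; in MPI 3 the transfer itself could be done with a neighborhood collective such as `MPI_Neighbor_alltoall` on a process topology. The function `halo_exchange` and the non-periodic boundary choice (edge processes get `None`) are assumptions for illustration.

```python
def halo_exchange(segments):
    """Pad each process's segment with one ghost cell from each neighbor.

    segments[rank] is the list of cells owned by that rank. Edge ranks
    have no neighbor on one side, so that ghost cell is None.
    """
    last = len(segments) - 1
    padded = []
    for rank, seg in enumerate(segments):
        left = segments[rank - 1][-1] if rank > 0 else None
        right = segments[rank + 1][0] if rank < last else None
        padded.append([left] + seg + [right])
    return padded


# Three processes, each owning two cells of a 1D mesh.
segments = [[1, 2], [3, 4], [5, 6]]
for rank, cells in enumerate(halo_exchange(segments)):
    print(rank, cells)
```

With the ghost cells in place, each process can update its border cells using only local data.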

MPI 3 Windows

Windows may be used to specify which data within a process is public. A window exposes a region of a process's memory to other processes, while everything outside it stays private, and this sets up the environment for one-sided communication. These communication calls include MPI_Put and MPI_Get, which transfer data across processes. They are useful because they let a developer write code that accesses another process's data without requiring synchronization across all processes, so the access is not held up by delays in individual processes.
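The window idea can be modeled in a few lines of plain Python. This is a toy simulation, not MPI: the `Window` class and its `put` and `get` methods are hypothetical stand-ins for one-sided transfers in the style of `MPI_Put` and `MPI_Get`, and the sketch leaves out the synchronization epochs (for example `MPI_Win_fence`) that a real program needs around those calls.

```python
class Window:
    """One exposed buffer per rank; anything outside it stays private."""

    def __init__(self, n_ranks, size):
        self.exposed = [[0] * size for _ in range(n_ranks)]

    def put(self, values, target_rank, offset):
        """Write values into the target's exposed buffer."""
        buf = self.exposed[target_rank]
        if offset < 0 or offset + len(values) > len(buf):
            raise IndexError("access outside the exposed window")
        buf[offset:offset + len(values)] = values

    def get(self, target_rank, offset, count):
        """Read `count` values from the target's exposed buffer."""
        buf = self.exposed[target_rank]
        if offset < 0 or offset + count > len(buf):
            raise IndexError("access outside the exposed window")
        return buf[offset:offset + count]


win = Window(n_ranks=2, size=4)
win.put([7, 8], target_rank=1, offset=1)        # rank 0 writes into rank 1
print(win.get(target_rank=1, offset=0, count=4))
```

Note that rank 1 does nothing in this exchange, which is the one-sided part: the origin process names both the source and the destination of the transfer.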

MPI 3 Multi-threading

In a single-threaded environment, each process consists of one thread. With multi-threading, a process can execute multiple threads at the same time. However, I am not quite sure this is ideal for all environments. A true multi-threading model, where every thread can make MPI calls (MPI_THREAD_MULTIPLE), can introduce a lot of problems into an environment. As noted by the conference presenters, it can be buggy. One such problem developers should be aware of is lock granularity, and such projects require more resources and planning. Thread levels that tend to be less resource-heavy include MPI_THREAD_FUNNELED and MPI_THREAD_SERIALIZED. With MPI_THREAD_FUNNELED, only the main thread within a process makes MPI calls. With MPI_THREAD_SERIALIZED, any thread may make MPI calls, but only one thread at a time.
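The "only one thread at a time" rule is something the application has to enforce itself. Here is a minimal Python sketch of that discipline using a lock. No MPI is involved: `SerializedComm` and its `send` method are hypothetical placeholders for wherever the real MPI calls would go, and the counters exist only to show that calls never overlap.

```python
import threading


class SerializedComm:
    """Wraps communication calls so only one thread is inside at a time."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = 0
        self.max_in_flight = 0
        self.calls = 0

    def send(self, payload):
        with self._lock:  # one thread at a time past this point
            self._in_flight += 1
            self.max_in_flight = max(self.max_in_flight, self._in_flight)
            self.calls += 1  # a real program would make its MPI call here
            self._in_flight -= 1


def run_workers(comm, n_threads, sends_per_thread):
    def worker(tid):
        for i in range(sends_per_thread):
            comm.send((tid, i))

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


comm = SerializedComm()
run_workers(comm, n_threads=4, sends_per_thread=50)
print(comm.calls, comm.max_in_flight)
```

A single coarse lock like this is simple but makes every thread wait on every call, which is one way the lock granularity question shows up.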

Conclusion

The new MPI 3 standard introduces a lot of cool features that can be used to improve communication in a high-performance computing environment. Because processes can analyze data in parallel, it promises a possible reduction in the time needed to process large datasets. Future studies will include more research into MPI 3 and file I/O, where processes can read and write to a file at the same time.