The type signature associated with sendcounts[j], sendtypes[j] at process i must be equal to the type signature associated with recvcounts[i], recvtypes[i] at process j. This implies that the amount of data sent must be equal to the amount of data received, pairwise between every pair of processes. Distinct type maps between sender and receiver are still allowed.
The outcome is as if each process sent a message to every other process with

MPI_Send(sendbuf + sdispls[i], sendcounts[i], sendtypes[i], i, ...),

and received a message from every other process with a call to

MPI_Recv(recvbuf + rdispls[i], recvcounts[i], recvtypes[i], i, ...).
All arguments on all processes are significant. The argument comm must describe the same communicator on all processes.
As for MPI_ALLTOALLV, the "in place" option for intracommunicators is specified by passing MPI_IN_PLACE to the argument sendbuf at all processes. In such a case, sendcounts, sdispls and sendtypes are ignored. The data to be sent is taken from the recvbuf and replaced by the received data. Data sent and received must have the same type map as specified by the recvcounts and recvtypes arrays, and is taken from the locations of the receive buffer specified by rdispls.
If comm is an intercommunicator, then the outcome is as if each process in group A sends a message to each process in group B, and vice versa. The j-th send buffer of process i in group A should be consistent with the i-th receive buffer of process j in group B, and vice versa.
Rationale. The MPI_ALLTOALLW function generalizes several MPI functions by carefully selecting the input arguments. For example, by making all but one process have sendcounts[i] = 0, this achieves an MPI_SCATTERW function. (End of rationale.)
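To make the calling convention concrete, the following is a minimal C sketch (not part of the standard text; all variable names are illustrative) in which every process sends one MPI_INT to every other process. Note that for MPI_ALLTOALLW the displacements sdispls and rdispls are given in bytes, not in element counts, because each block may have a different datatype.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int rank, size, i;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *sendbuf = malloc(size * sizeof(int));
    int *recvbuf = malloc(size * sizeof(int));
    int *counts  = malloc(size * sizeof(int));
    int *displs  = malloc(size * sizeof(int));
    MPI_Datatype *types = malloc(size * sizeof(MPI_Datatype));

    for (i = 0; i < size; i++) {
        sendbuf[i] = rank * size + i;     /* one element destined for process i */
        counts[i]  = 1;
        displs[i]  = i * (int)sizeof(int);  /* byte displacements */
        types[i]   = MPI_INT;               /* per-process datatype; all equal here */
    }

    /* In this symmetric case the same counts, displacements, and types
       describe both the send and the receive sides. */
    MPI_Alltoallw(sendbuf, counts, displs, types,
                  recvbuf, counts, displs, types, MPI_COMM_WORLD);

    free(sendbuf); free(recvbuf); free(counts); free(displs); free(types);
    MPI_Finalize();
    return 0;
}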
5.9 Global Reduction Operations
The functions in this section perform a global reduce operation (for example sum, maximum, and logical and) across all members of a group. The reduction operation can be either one of a predefined list of operations, or a user-defined operation. The global reduction functions come in several flavors: a reduce that returns the result of the reduction to one member of a group, an all-reduce that returns this result to all members of a group, and two scan (parallel prefix) operations. In addition, a reduce-scatter operation combines the functionality of a reduce and of a scatter operation.
5.9.1 Reduce
MPI_REDUCE(sendbuf, recvbuf, count, datatype, op, root, comm)

  IN   sendbuf    address of send buffer (choice)
  OUT  recvbuf    address of receive buffer (choice, significant only at root)
  IN   count      number of elements in send buffer (non-negative integer)
  IN   datatype   data type of elements of send buffer (handle)
  IN   op         reduce operation (handle)
  IN   root       rank of root process (integer)
  IN   comm       communicator (handle)
int MPI_Reduce(void* sendbuf, void* recvbuf, int count,
MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
MPI_REDUCE(SENDBUF, RECVBUF, COUNT, DATATYPE, OP, ROOT, COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER COUNT, DATATYPE, OP, ROOT, COMM, IERROR

{void MPI::Comm::Reduce(const void* sendbuf, void* recvbuf, int count,
    const MPI::Datatype& datatype, const MPI::Op& op, int root) const = 0
    (binding deprecated, see Section 15.2)}
If comm is an intracommunicator, MPI_REDUCE combines the elements provided in the input buffer of each process in the group, using the operation op, and returns the combined value in the output buffer of the process with rank root. The input buffer is defined by the arguments sendbuf, count and datatype; the output buffer is defined by the arguments recvbuf, count and datatype; both have the same number of elements, with the same type. The routine is called by all group members using the same arguments for count, datatype, op, root and comm. Thus, all processes provide input buffers and output buffers of the same length, with elements of the same type. Each process can provide one element, or a sequence of elements, in which case the combine operation is executed element-wise on each entry of the sequence. For example, if the operation is MPI_MAX and the send buffer contains two elements that are floating point numbers (count = 2 and datatype = MPI_FLOAT), then recvbuf(1) = global max(sendbuf(1)) and recvbuf(2) = global max(sendbuf(2)).
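As a concrete, non-normative C sketch of the element-wise behavior just described, the following program reduces two floats with MPI_MAX to rank 0; the buffer contents are illustrative only.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    float sendbuf[2], recvbuf[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each process contributes two floats; the reduction is element-wise. */
    sendbuf[0] = (float)rank;
    sendbuf[1] = (float)(-rank);

    MPI_Reduce(sendbuf, recvbuf, 2, MPI_FLOAT, MPI_MAX, 0, MPI_COMM_WORLD);

    /* recvbuf[0] is the global maximum over all sendbuf[0] values and
       recvbuf[1] the global maximum over all sendbuf[1] values; the
       receive buffer is significant only at the root. */
    if (rank == 0)
        printf("max of first elements = %f, max of second = %f\n",
               recvbuf[0], recvbuf[1]);

    MPI_Finalize();
    return 0;
}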
Section 5.9.2 lists the set of predefined operations provided by MPI. That section also enumerates the datatypes to which each operation can be applied. In addition, users may define their own operations that can be overloaded to operate on several datatypes, either basic or derived. This is further explained in Section 5.9.5.
The operation op is always assumed to be associative. All predefined operations are also assumed to be commutative. Users may define operations that are assumed to be associative, but not commutative. The "canonical" evaluation order of a reduction is determined by the ranks of the processes in the group. However, the implementation can take advantage of associativity, or associativity and commutativity, in order to change the order of evaluation.
This may change the result of the reduction for operations that are not strictly associative and commutative, such as floating point addition.
Advice to implementors. It is strongly recommended that MPI_REDUCE be implemented so that the same result be obtained whenever the function is applied on the same arguments, appearing in the same order. Note that this may prevent optimizations that take advantage of the physical location of processors. (End of advice to implementors.)
Advice to users. Some applications may not be able to ignore the non-associative nature of floating-point operations or may use user-defined operations (see Section 5.9.5) that require a special reduction order and cannot be treated as associative. Such applications should enforce the order of evaluation explicitly. For example, in the case of operations that require a strict left-to-right (or right-to-left) evaluation order, this could be done by gathering all operands at a single process (e.g., with MPI_GATHER), applying the reduction operation in the desired order (e.g., with MPI_REDUCE_LOCAL), and, if needed, broadcasting or scattering the result to the other processes (e.g., with MPI_BCAST). (End of advice to users.)
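A minimal C sketch of this pattern follows (non-normative; it assumes one double per process and a deterministic left-to-right sum, and the function name is illustrative).

#include <mpi.h>
#include <stdlib.h>

/* Deterministic left-to-right sum: gather all operands at rank 0,
   combine them in rank order with MPI_Reduce_local, then broadcast. */
double ordered_sum(double x, MPI_Comm comm)
{
    int rank, size, i;
    double result = 0.0, *all = NULL;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    if (rank == 0)
        all = malloc(size * sizeof(double));

    /* The receive buffer is significant only at the root. */
    MPI_Gather(&x, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE, 0, comm);

    if (rank == 0) {
        result = all[0];
        for (i = 1; i < size; i++)
            /* Accumulate in rank order, fixing the grouping
               ((a0 + a1) + a2) + ... independent of the implementation. */
            MPI_Reduce_local(&all[i], &result, 1, MPI_DOUBLE, MPI_SUM);
        free(all);
    }
    MPI_Bcast(&result, 1, MPI_DOUBLE, 0, comm);
    return result;
}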
The datatype argument of MPI_REDUCE must be compatible with op. Predefined operators work only with the MPI types listed in Section 5.9.2 and Section 5.9.4. Furthermore, the datatype and op given for predefined operators must be the same on all processes.
Note that it is possible for users to supply different user-defined operations to MPI_REDUCE in each process. MPI does not define which operations are used on which operands in this case. User-defined operators may operate on general, derived datatypes. In this case, each argument that the reduce operation is applied to is one element described by such a datatype, which may contain several basic values. This is further explained in Section 5.9.5.
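As a non-normative illustration of a user-defined operator over a derived datatype, the C sketch below reduces pairs of doubles with a component-wise sum; the names pair_sum and example are hypothetical, and MPI_OP_CREATE and the datatype constructors are described elsewhere in this standard.

#include <mpi.h>

/* User function: combine len elements, where each element is a pair
   of doubles described by the derived datatype. */
void pair_sum(void *in, void *inout, int *len, MPI_Datatype *type)
{
    double *a = (double *)in;
    double *b = (double *)inout;
    int i;
    for (i = 0; i < 2 * (*len); i++)
        b[i] += a[i];          /* element-wise on each basic value */
}

void example(MPI_Comm comm)
{
    MPI_Datatype pairtype;
    MPI_Op op;
    double send[2] = {1.0, 2.0}, recv[2];

    MPI_Type_contiguous(2, MPI_DOUBLE, &pairtype);
    MPI_Type_commit(&pairtype);
    MPI_Op_create(pair_sum, 1 /* commutative */, &op);

    /* Each "element" seen by the operation is one pair of doubles. */
    MPI_Reduce(send, recv, 1, pairtype, op, 0, comm);

    MPI_Op_free(&op);
    MPI_Type_free(&pairtype);
}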
Advice to users. Users should make no assumptions about how MPI_REDUCE is implemented. It is safest to ensure that the same function is passed to MPI_REDUCE by each process. (End of advice to users.)
Overlapping datatypes are permitted in "send" buffers. Overlapping datatypes in "receive" buffers are erroneous and may give unpredictable results.
The "in place" option for intracommunicators is specified by passing the value MPI_IN_PLACE to the argument sendbuf at the root. In such a case, the input data is taken at the root from the receive buffer, where it will be replaced by the output data.
If comm is an intercommunicator, then the call involves all processes in the intercommunicator, but with one group (group A) defining the root process. All processes in the other group (group B) pass the same value in argument root, which is the rank of the root in group A. The root passes the value MPI_ROOT in root. All other processes in group A pass the value MPI_PROC_NULL in root. Only send buffer arguments are significant in group B and only receive buffer arguments are significant at the root.
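The following non-normative C fragment sketches the "in place" usage on an intracommunicator; only the root's sendbuf argument differs, and the function name is illustrative.

#include <mpi.h>

void in_place_reduce(double *buf, int count, int root, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    if (rank == root)
        /* The root contributes the contents of its receive buffer and
           receives the result in that same buffer. */
        MPI_Reduce(MPI_IN_PLACE, buf, count, MPI_DOUBLE, MPI_SUM, root, comm);
    else
        /* The receive buffer is significant only at the root. */
        MPI_Reduce(buf, NULL, count, MPI_DOUBLE, MPI_SUM, root, comm);
}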
5.9.2 Predefined Reduction Operations

The following predefined operations are supplied for MPI_REDUCE and related functions MPI_ALLREDUCE, MPI_REDUCE_SCATTER, MPI_SCAN, and MPI_EXSCAN. These operations are invoked by placing the following in op.
  Name          Meaning

  MPI_MAX       maximum
  MPI_MIN       minimum
  MPI_SUM       sum
  MPI_PROD      product
  MPI_LAND      logical and
  MPI_BAND      bit-wise and
  MPI_LOR       logical or
  MPI_BOR       bit-wise or
  MPI_LXOR      logical exclusive or (xor)
  MPI_BXOR      bit-wise exclusive or (xor)
  MPI_MAXLOC    max value and location
  MPI_MINLOC    min value and location
The two operations MPI_MINLOC and MPI_MAXLOC are discussed separately in Section 5.9.4. For the other predefined operations, we enumerate below the allowed combinations of op and datatype arguments. First, define groups of MPI basic datatypes in the following way.
  C integer:        MPI_INT, MPI_LONG, MPI_SHORT,
                    MPI_UNSIGNED_SHORT, MPI_UNSIGNED,
                    MPI_UNSIGNED_LONG,
                    MPI_LONG_LONG_INT,
                    MPI_LONG_LONG (as synonym),
                    MPI_UNSIGNED_LONG_LONG,
                    MPI_SIGNED_CHAR, MPI_UNSIGNED_CHAR,
                    MPI_INT8_T, MPI_INT16_T, MPI_INT32_T, MPI_INT64_T,
                    MPI_UINT8_T, MPI_UINT16_T, MPI_UINT32_T, MPI_UINT64_T
  Fortran integer:  MPI_INTEGER, MPI_AINT, MPI_OFFSET,
                    and handles returned from MPI_TYPE_CREATE_F90_INTEGER,
                    and if available: MPI_INTEGER1, MPI_INTEGER2,
                    MPI_INTEGER4, MPI_INTEGER8, MPI_INTEGER16
  Floating point:   MPI_FLOAT, MPI_DOUBLE, MPI_REAL,
                    MPI_DOUBLE_PRECISION, MPI_LONG_DOUBLE,
                    and handles returned from MPI_TYPE_CREATE_F90_REAL,
                    and if available: MPI_REAL2, MPI_REAL4,
                    MPI_REAL8, MPI_REAL16
  Logical:          MPI_LOGICAL, MPI_C_BOOL
  Complex:          MPI_COMPLEX, MPI_C_FLOAT_COMPLEX,
                    MPI_C_DOUBLE_COMPLEX, MPI_C_LONG_DOUBLE_COMPLEX,
                    and handles returned from MPI_TYPE_CREATE_F90_COMPLEX,
                    and if available: MPI_DOUBLE_COMPLEX,
                    MPI_COMPLEX4, MPI_COMPLEX8,
                    MPI_COMPLEX16, MPI_COMPLEX32
  Byte:             MPI_BYTE
Now, the valid datatypes for each op are specified below.
  Op                             Allowed Types

  MPI_MAX, MPI_MIN               C integer, Fortran integer, Floating point
  MPI_SUM, MPI_PROD              C integer, Fortran integer, Floating point, Complex
  MPI_LAND, MPI_LOR, MPI_LXOR    C integer, Logical
  MPI_BAND, MPI_BOR, MPI_BXOR    C integer, Fortran integer, Byte
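As a brief non-normative illustration of these combinations, the C fragment below ORs a local error flag across all processes with MPI_LOR (over a C integer type) and merges bit masks with MPI_BOR over MPI_UINT32_T; both use MPI_ALLREDUCE, which accepts the same op/datatype combinations. The function name is illustrative.

#include <mpi.h>
#include <stdint.h>

void combine_flags(MPI_Comm comm)
{
    int local_err = 0, any_err;
    uint32_t local_mask = 0x1u, merged_mask;

    /* Logical OR over a C integer type:
       any_err is nonzero iff any process set local_err. */
    MPI_Allreduce(&local_err, &any_err, 1, MPI_INT, MPI_LOR, comm);

    /* Bit-wise OR over MPI_UINT32_T (a C integer type):
       merged_mask is the union of all processes' bit masks. */
    MPI_Allreduce(&local_mask, &merged_mask, 1, MPI_UINT32_T, MPI_BOR, comm);
}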
The following examples use intracommunicators.
Example 5.15 A routine that computes the dot product of two vectors that are distributed across a group of processes and returns the answer at node zero.

SUBROUTINE PAR_BLAS1(m, a, b, c, comm)
REAL a(m), b(m)       ! local slice of array
REAL c                ! result (at node zero)
REAL sum
INTEGER m, comm, i, ierr

! local sum
sum = 0.0
DO i = 1, m
   sum = sum + a(i)*b(i)
END DO

! global sum
CALL MPI_REDUCE(sum, c, 1, MPI_REAL, MPI_SUM, 0, comm, ierr)
RETURN
END
Example 5.16 A routine that computes the product of a vector and an array that are distributed across a group of processes and returns the answer at node zero.

SUBROUTINE PAR_BLAS2(m, n, a, b, c, comm)
REAL a(m), b(m,n)     ! local slice of array
REAL c(n)             ! result
REAL sum(n)
INTEGER n, comm, i, j, ierr

! local sum