6.4.3. Group Construction

For purposes of constructing PMIx groups, PMIx defines two classes of group members:

leaders have some global view of the group at time of construction. This might consist of knowing the number of leaders in the group, or knowing the process IDs of all group leaders. All leaders must, of course, know the group ID they are attempting to construct.

members know only that they are to participate in a given group ID, but have no other knowledge of the group. For example, a member may not know how many processes will be in the group or any of their process IDs. The only requirement for membership is that the process knows the group ID to which it is to belong.

Within that context, PMIx supports three methods for constructing PMIx groups. The collective method is considered the more traditional form of the operation but requires all group leaders to know the process ID of all other leaders prior to calling the API.

In contrast, the bootstrap method is a somewhat more dynamic form of the operation that assumes each leader only knows the number of group leaders, but does not know their process IDs. This is commonly the case when two or more collections of processes wish to join together (e.g., in an MPI connect/accept operation), but only the “root” processes know of each other. In such cases, each root process typically knows the process ID of all processes in its collection, but only the root process ID in the other collection(s).

In either of these two methods, additional group members can be specified by any leader via the PMIX_GROUP_ADD_MEMBERS attribute. The PMIx server library and host are jointly responsible for aggregating the additional group members specified across leaders. Processes that are on the “additional member” list must call PMIx_Group_construct with NULL in the procs argument - this indicates that the process is to be added to the group when the group construct operation has completed.

Note that the group construct operation cannot complete until all “leaders” and all “add members” have called PMIx_Group_construct. This is required so that any group and/or endpoint information provided by the added members can be included in the returned pmix_info_t array.
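
The aggregation of additional members across leaders can be pictured with a small, self-contained C sketch. It is a toy model and not PMIx library code: toy_proc_t, toy_member_list_t, and toy_merge_add_members are hypothetical names, and dropping duplicate entries when two leaders name the same process is an assumption rather than documented behavior.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a process ID (namespace plus rank).
 * This is NOT the real pmix_proc_t; it only mimics its shape. */
typedef struct {
    char nspace[32];
    unsigned rank;
} toy_proc_t;

#define TOY_MAX_MEMBERS 64

/* Aggregate list of "add members" collected from all leaders. */
typedef struct {
    toy_proc_t procs[TOY_MAX_MEMBERS];
    size_t count;
} toy_member_list_t;

static bool toy_same_proc(const toy_proc_t *a, const toy_proc_t *b)
{
    return a->rank == b->rank && strcmp(a->nspace, b->nspace) == 0;
}

/* Merge one leader's add-member request into the aggregate list.
 * Entries already present are skipped (assumed behavior).
 * Returns the number of processes newly added. */
size_t toy_merge_add_members(toy_member_list_t *agg,
                             const toy_proc_t *req, size_t nreq)
{
    size_t added = 0;
    for (size_t i = 0; i < nreq; i++) {
        bool known = false;
        for (size_t j = 0; j < agg->count; j++) {
            if (toy_same_proc(&agg->procs[j], &req[i])) {
                known = true;
                break;
            }
        }
        if (!known) {
            assert(agg->count < TOY_MAX_MEMBERS);
            agg->procs[agg->count++] = req[i];
            added++;
        }
    }
    return added;
}
```

Calling toy_merge_add_members once per leader request yields a single list that names each additional member once, regardless of how many leaders requested it.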

Finally, the invite method represents the most dynamic form of group construction as it is executed in an ad hoc manner that revolves around a single leader that asynchronously invites processes to join a group.

Each of these methods is explained further below. The PMIx library currently conducts Continuous Integration (CI) tests for all three methods on each pull request (PR) submitted to the library.

6.4.3.1. Collective Method

All leaders know the ID of all other leaders, and thus call PMIx_Group_construct with the array of all leader process IDs. Note that in this method, all leaders must call the API with the array of process IDs. However, the array does not need to be ordered - i.e., the order of IDs in the array does not need to be the same across leaders.
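
To make the ordering rule concrete, the following self-contained C sketch checks whether two leader arrays name the same set of processes regardless of order. It is purely illustrative and does not use the PMIx headers: toy_proc_t and same_leader_set are hypothetical names, and the sketch assumes neither array contains duplicate entries.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a process ID (namespace plus rank).
 * This is NOT the real pmix_proc_t; it only mimics its shape. */
typedef struct {
    char nspace[32];
    unsigned rank;
} toy_proc_t;

/* Order processes by namespace first, then by rank. */
static int toy_proc_cmp(const void *a, const void *b)
{
    const toy_proc_t *pa = a;
    const toy_proc_t *pb = b;
    int rc = strncmp(pa->nspace, pb->nspace, sizeof(pa->nspace));
    if (rc != 0) {
        return rc;
    }
    return (pa->rank > pb->rank) - (pa->rank < pb->rank);
}

/* Return true when both arrays name the same processes, in any order.
 * The caller's arrays are left untouched; sorted copies are compared. */
bool same_leader_set(const toy_proc_t *a, size_t na,
                     const toy_proc_t *b, size_t nb)
{
    if (na != nb) {
        return false;
    }
    if (na == 0) {
        return true;
    }
    toy_proc_t *sa = malloc(na * sizeof(*sa));
    toy_proc_t *sb = malloc(nb * sizeof(*sb));
    assert(sa != NULL && sb != NULL);
    memcpy(sa, a, na * sizeof(*sa));
    memcpy(sb, b, nb * sizeof(*sb));
    qsort(sa, na, sizeof(*sa), toy_proc_cmp);
    qsort(sb, nb, sizeof(*sb), toy_proc_cmp);

    bool same = true;
    for (size_t i = 0; i < na; i++) {
        if (toy_proc_cmp(&sa[i], &sb[i]) != 0) {
            same = false;
            break;
        }
    }
    free(sa);
    free(sb);
    return same;
}
```

Two leaders that pass the same IDs in different orders would compare as equal under this check, which is the property the collective method relies on.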

An example of the collective method can be seen in the group.c example taken from the PMIx library.

6.4.3.1.1. Library responsibilities

The PMIx client library will collect all information posted by the client on a PMIX_REMOTE scope and include it in the participation message sent to its local server. Only “remote” data is included as all local procs already have access to any “local” data that has been posted. The message also includes any pmix_info_t values that were provided by the caller, including any PMIX_GROUP_INFO contributions.

The local server will aggregate participation from clients operating as leaders before passing the request up to the host. If a timeout value was provided by one or more clients, then the server must monitor the request for timeout until it is passed up to the host. This is to protect against the case where multiple local clients are participating, and one of those clients fails to call the construct API (thus preventing the server from passing the request to the host) within the specified timeout. The server no longer has responsibility for detecting a timeout condition once it determines the operation is locally complete and has passed it to the host.
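
The local aggregation and timeout rule can be sketched as a small state machine. This is a toy model in plain C, not the PMIx server implementation: the type and function names are hypothetical, time is passed in as plain seconds, and letting the earliest requested deadline win when several clients supply a timeout is an assumption.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    TOY_WAITING,         /* still collecting local leaders */
    TOY_PASSED_TO_HOST,  /* locally complete, request handed to the host */
    TOY_TIMED_OUT        /* deadline expired before local completion */
} toy_state_t;

typedef struct {
    size_t expected_local_leaders; /* local leaders that must call construct */
    size_t arrived;                /* local leaders that have called so far */
    long deadline;                 /* absolute seconds; negative means none */
    toy_state_t state;
} toy_local_tracker_t;

void toy_tracker_init(toy_local_tracker_t *t, size_t expected)
{
    assert(expected > 0);
    t->expected_local_leaders = expected;
    t->arrived = 0;
    t->deadline = -1;
    t->state = TOY_WAITING;
}

/* Record one local leader's call made at time `now`.
 * `timeout` is the number of seconds that client requested (0 = none). */
toy_state_t toy_leader_arrived(toy_local_tracker_t *t, long now, long timeout)
{
    if (t->state != TOY_WAITING) {
        return t->state;
    }
    if (timeout > 0) {
        long d = now + timeout;
        if (t->deadline < 0 || d < t->deadline) {
            t->deadline = d; /* assumption: earliest deadline wins */
        }
    }
    t->arrived++;
    if (t->arrived == t->expected_local_leaders) {
        /* Locally complete: from here on, timeouts are the host's concern. */
        t->state = TOY_PASSED_TO_HOST;
    }
    return t->state;
}

/* Periodic check; only fires while the request is still held locally. */
toy_state_t toy_check_timeout(toy_local_tracker_t *t, long now)
{
    if (t->state == TOY_WAITING && t->deadline >= 0 && now >= t->deadline) {
        t->state = TOY_TIMED_OUT;
    }
    return t->state;
}
```

Once the tracker reaches TOY_PASSED_TO_HOST, later timeout checks have no effect, mirroring the point that the server stops monitoring the timeout after handing the request to the host.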

The local server will also immediately pass to the host a participation request from a client operating as a “member” - i.e., a process that is being included in the construct operation via an “add member” specification. These participants are not counted against local completion of the operation.

Upon notification of operation completion, the server will locally store all information collected from the participants and then notify each participating client (leaders as well as members). Notification will include an array of the resulting membership IDs, plus any assigned context ID and contributed group info.

Upon concluding the operation, each client shall have access to all job-level information for namespaces that had a process participating in the group, plus all information provided by the individual participants (e.g., posted “remote” data).

6.4.3.1.2. Host responsibilities

The host is responsible for performing a collective allgather operation across participants, returning all provided information to the involved PMIx servers. This includes a complete membership list, including the process ID of all leaders and added members. The host must ensure that all “members” have participated prior to declaring the operation to be complete.

Note that the group construct operation requires that each participant have access to the job-level data of each namespace with a participating process in the group. Some hosts (e.g., PRRTE) automatically register each namespace with all PMIx servers, thus ensuring the job-level information is always available. Other hosts may need to assemble and register the participating namespaces separately.

6.4.3.2. Bootstrap Method

Bootstrap is used when the processes leading a group construct operation do not know the identity of all other processes that will be participating, but at least know how many leaders will be involved. Leaders provide only their own process ID in the procs parameter to the PMIx_Group_construct API, and are required to include the PMIX_GROUP_BOOTSTRAP attribute in their array of pmix_info_t directives, with the value of that attribute set to the number of leaders in the group construct operation. They may also provide the PMIX_GROUP_ADD_MEMBERS attribute with an array of process IDs that are to belong to the final group - each of those processes will also call the group construct API, but with a NULL procs argument to indicate they are joining as “add members” and not leaders.

An example of the bootstrap method can be seen in the group_bootstrap.c example taken from the PMIx library.

6.4.3.2.1. Library responsibilities

The PMIx client library will collect all information posted by the client on a PMIX_REMOTE scope and include it in the participation message sent to its local server. Only “remote” data is included as all local procs have access to any “local” data that has been posted. The message also includes any pmix_info_t values that were provided by the caller, including any PMIX_GROUP_INFO contributions.

Upon receiving the participation from a local client (whether “leader” or “member”), the PMIx server will immediately pass the request up to the host. The server bears no responsibility for aggregating the local participants or for monitoring timeout conditions.

Upon notification of operation completion, the server will locally store all information collected from the participants and then notify each participating client (leaders as well as members). Notification will include an array of the resulting membership IDs, plus any assigned context ID and contributed group info.

Upon concluding the operation, each client shall have access to all job-level information for namespaces that had a process participating in the group, plus all information provided by the individual participants (e.g., posted “remote” data).

6.4.3.2.2. Host responsibilities

The host is responsible for performing a collective allgather operation across participants, returning all provided information to the involved PMIx servers. This includes a complete membership list, including the process ID of all leaders and added members. The host in this scenario must look for and count participation by the specified number of leaders, plus each individual member, before declaring the operation to be complete.
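
The completion test a host applies in the bootstrap case can be sketched as follows. This is a toy model in plain C rather than host or PMIx code: all names are hypothetical, and the sketch assumes that an added member may arrive before the leader that names it, so arrivals and requests are tracked in separate lists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a process ID; NOT the real pmix_proc_t. */
typedef struct {
    char nspace[32];
    unsigned rank;
} toy_proc_t;

#define TOY_MAX_PROCS 64

typedef struct {
    size_t nleaders_expected;         /* bootstrap count; 0 until a leader arrives */
    size_t nleaders_seen;             /* leaders that have participated */
    toy_proc_t wanted[TOY_MAX_PROCS]; /* add members named by leaders */
    size_t nwanted;
    toy_proc_t joined[TOY_MAX_PROCS]; /* add members that have participated */
    size_t njoined;
} toy_bootstrap_t;

static bool toy_contains(const toy_proc_t *list, size_t n, const toy_proc_t *p)
{
    for (size_t i = 0; i < n; i++) {
        if (list[i].rank == p->rank && strcmp(list[i].nspace, p->nspace) == 0) {
            return true;
        }
    }
    return false;
}

void toy_bootstrap_init(toy_bootstrap_t *b)
{
    memset(b, 0, sizeof(*b));
}

/* A leader participates, stating the leader count and any add members. */
void toy_leader_joined(toy_bootstrap_t *b, size_t nleaders,
                       const toy_proc_t *add, size_t nadd)
{
    assert(nleaders > 0);
    assert(b->nleaders_expected == 0 || b->nleaders_expected == nleaders);
    b->nleaders_expected = nleaders;
    b->nleaders_seen++;
    for (size_t i = 0; i < nadd; i++) {
        if (!toy_contains(b->wanted, b->nwanted, &add[i])) {
            assert(b->nwanted < TOY_MAX_PROCS);
            b->wanted[b->nwanted++] = add[i];
        }
    }
}

/* An added member participates (it called construct with NULL procs). */
void toy_member_joined(toy_bootstrap_t *b, const toy_proc_t *self)
{
    if (!toy_contains(b->joined, b->njoined, self)) {
        assert(b->njoined < TOY_MAX_PROCS);
        b->joined[b->njoined++] = *self;
    }
}

/* Complete once every leader and every named add member has participated. */
bool toy_bootstrap_complete(const toy_bootstrap_t *b)
{
    if (b->nleaders_expected == 0 || b->nleaders_seen < b->nleaders_expected) {
        return false;
    }
    for (size_t i = 0; i < b->nwanted; i++) {
        if (!toy_contains(b->joined, b->njoined, &b->wanted[i])) {
            return false;
        }
    }
    return true;
}
```

The leader count alone is not sufficient: the final leader may name a member that has not yet participated, in which case the operation stays open until that member arrives.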

Note that the group construct operation requires that each participant have access to the job-level data of each namespace with a participating process in the group. Some hosts (e.g., PRRTE) automatically register each namespace with all PMIx servers, thus ensuring the job-level information is always available. Other hosts may need to assemble and register the participating namespaces separately.

6.4.3.3. Invite Method

In contrast to the prior methods, the “invite” method does not involve any collective operation. Instead, it relies solely on the event notification subsystem (e.g., PMIx_Notify_event) for its underlying execution.

Applications wishing to use this method must first register handlers for two events (see the asyncgroup.c example from the PMIx library for details):

  • the PMIX_GROUP_INVITED event. This will be triggered in a process when someone calls PMIx_Group_invite and includes that process in the array of desired members. The handler must retrieve the specified PMIX_GROUP_ID from the provided array of attributes, and then call PMIx_Group_join_nb to indicate the desired response (e.g., a value of PMIX_GROUP_ACCEPT to join the group).

    Note

    The handler cannot call the blocking form of the “group join” API as it is invoked from inside the PMIx library’s progress thread. Doing so will cause a thread deadlock condition.

  • the PMIX_GROUP_COMPLETE event, which will be triggered once the construct operation has completed. This can be used to receive the final group membership, along with any provided group info or other data.

The construct procedure is initiated by a single “leader” that calls the PMIx_Group_invite API, providing (among other optional things) an array of process IDs that it wishes to have join the group. Prior to doing so, the leader may choose to register an event handler for the PMIX_GROUP_INVITE_FAILED event. This will allow the library to notify the process should any of the specified prospective members reject the invitation, or terminate prior to responding to it. Upon receiving a “failed” event, the leader can optionally replace the rejecting process with another, can terminate the group construct operation, or can ignore the failure (thereby accepting a reduced final group membership).

The leader will return from the PMIx_Group_invite function once all specified members have responded to the invitation. In addition, the leader will (since it is a member of the group) receive the PMIX_GROUP_COMPLETE event specifying the status return of the operation (PMIX_SUCCESS to indicate that the group successfully constructed, or else an appropriate error value) and, if successful, containing the resulting information.
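
The leader-side bookkeeping described above can be modeled with a short C sketch. It is a toy model only and does not call any PMIx API: the names are hypothetical, invitees are identified by rank alone (a single namespace is assumed), and only the "replace" and "ignore" responses to a failure are shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    TOY_PENDING,    /* invitation sent, no answer yet */
    TOY_ACCEPTED,   /* invitee joined the group */
    TOY_DECLINED,   /* invitee rejected the invitation */
    TOY_TERMINATED  /* invitee exited before answering */
} toy_response_t;

typedef struct {
    unsigned rank;           /* invitee (single namespace assumed) */
    toy_response_t response;
} toy_invitee_t;

/* The invite call can return once no invitation is still pending. */
bool toy_all_responded(const toy_invitee_t *inv, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (inv[i].response == TOY_PENDING) {
            return false;
        }
    }
    return true;
}

/* "Replace" policy: reuse the slot of a failed invitee for a substitute,
 * which then counts as pending again. Returns false if `failed` is not
 * an invitee that declined or terminated. */
bool toy_replace_failed(toy_invitee_t *inv, size_t n,
                        unsigned failed, unsigned substitute)
{
    for (size_t i = 0; i < n; i++) {
        if (inv[i].rank == failed &&
            (inv[i].response == TOY_DECLINED ||
             inv[i].response == TOY_TERMINATED)) {
            inv[i].rank = substitute;
            inv[i].response = TOY_PENDING;
            return true;
        }
    }
    return false;
}

/* "Ignore" policy: the final membership is the leader plus every invitee
 * that accepted. Writes ranks into `out` and returns how many were written. */
size_t toy_final_members(unsigned leader, const toy_invitee_t *inv, size_t n,
                         unsigned *out, size_t nout)
{
    size_t count = 0;
    assert(nout >= 1);
    out[count++] = leader;
    for (size_t i = 0; i < n; i++) {
        if (inv[i].response == TOY_ACCEPTED) {
            assert(count < nout);
            out[count++] = inv[i].rank;
        }
    }
    return count;
}
```

Replacing a failed invitee puts the operation back into a pending state, while ignoring the failure simply yields a smaller final membership.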

6.4.3.3.1. Library responsibilities

The PMIx client library is solely responsible for executing the underlying handshakes to support this method of group construction. The library will collect all information posted by each participant on a PMIX_REMOTE scope and include it in the invitation event (if the process is the leader) or in the “join” event (if a participating member). Only “remote” data is included as all local procs have access to any “local” data that has been posted. The event also includes any pmix_info_t values that were provided by the caller, including any PMIX_GROUP_INFO contributions.

The client library tracks all contributions and assembles the final “complete” event payload, which includes a complete membership array plus all provided information. In this case, the provided information (e.g., that posted by each participant) will not be stored on the server, but instead delivered to each individual participant’s data store for subsequent retrieval via PMIx_Get.

The PMIx server library’s sole responsibility is transferring the events generated by its local clients to/from the host.

6.4.3.3.2. Host responsibilities

The host is responsible solely for propagating event notifications across participating processes.