NAME
mpiexec - Execute an MPI process

SYNOPSIS
Single Program, Multiple Data (SPMD) mode:
    mpiexec [ options ] <program> [ <args> ]

Multiple Program, Multiple Data (MPMD) mode:
    mpiexec [ global_options ] [ local_options1 ] <program1> [ <args1> ] : [ local_options2 ] <program2> [ <args2> ] : ... : [ local_optionsN ] <programN> [ <argsN> ]

QUICK SUMMARY
If you just want to run an MPI program, you probably want to use a command line like:
mpiexec [ -n # ] [ -file <hostfile> ] <program> <program args>
This will start the program with # ranks, using the hostfile to work out which machines to run on. The remote processes will be launched via ssh. If more ranks are launched than are described in the hostfile, the extra ranks will be allocated in a round-robin pattern amongst the available machines.

GLOBAL OPTIONS
<program> - MPI application to launch
<args> - The command-line arguments to be passed to the MPI application
-?, -help - Prints a short usage message
-P, -port <N> - Use a nonstandard port for MPI communication. Sets the MPI_PORT environment variable in the context of the MPI application.
-x, -xterm - Create an xterm for every machine. Run the MPI application inside each xterm.
-d, -debug - Create an xterm for every machine, and run the MPI application inside each xterm using gdb.
-v, -verbose - Be more verbose when running an MPI application. Sets the MPI_VERBOSE environment variable in the context of the MPI application.
-V, -version - Print the Lockless MPI version number, and MPI specification version.
-q, -quiet - Don't output connection messages.
-t, -timeout <N> - Changes the default timeout for startup to occur (in seconds). Sets the MPI_TIMEOUT environment variable in the context of the MPI application.
-S, -sshport <N> - Changes the default port to log on to ssh on remote machines.
-c, -configfile <filename> - Use a config file to launch an MPI application.
-notunnel - Do not tunnel stdout from each machine over the ssh link. Sets the MPI_NOTUNNEL environment variable in the context of the MPI application.
-noinfiniband - Disables Infiniband usage. Sets the MPI_NOINFINIBAND environment variable in the context of the MPI application.
LOCAL OPTIONS
-n, -np <N> - Execute this many ranks
-s, -soft {<A>, <A:B>, <A:B:C>, <>} - Range of numbers of ranks to execute. A processes, A to B processes, A to B in increments of C processes. Multiple triplets can be used, separated by commas.
-h, -host <machine> - Use this host
-a, -arch <architecture> - The only values accepted are: "x86_64", "X86_64", "amd64" or "AMD64".
-w, -wdir <directory> - The working directory of the executing process on the remote machine
-p, -path <path> - The path to use to find the remote executable
-f, -file <filename> - Hostfile to find extra information about the host.
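
As an illustration of the -soft syntax listed above, the following C sketch computes the largest rank count that a specification allows, which is the value the DESCRIPTION says this implementation uses in place of -n. It is not taken from the mpiexec source: the function names are hypothetical, and it assumes positive values with A <= B and C > 0. Anything else, including the empty form shown in the option summary, is rejected with -1.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Largest rank count described by one triplet: "A", "A:B" or "A:B:C".
 * Assumption (not from the man page): A >= 1, A <= B and C > 0.
 * Returns -1 for anything that does not fit that shape. */
long soft_triplet_max(const char *triplet)
{
    long a = 0, b = 0, c = 1;
    char tail;

    assert(triplet != NULL);
    /* Try the most specific form first; %c catches trailing junk. */
    if (sscanf(triplet, "%ld:%ld:%ld%c", &a, &b, &c, &tail) == 3) {
        /* A:B:C, all three values read */
    } else if (sscanf(triplet, "%ld:%ld%c", &a, &b, &tail) == 2) {
        c = 1;                      /* A:B steps by one */
    } else if (sscanf(triplet, "%ld%c", &a, &tail) == 1) {
        b = a;                      /* a single value A */
        c = 1;
    } else {
        return -1;
    }
    if (a < 1 || b < a || c <= 0)
        return -1;
    /* Last value of A, A+C, A+2C, ... that does not exceed B. */
    return a + ((b - a) / c) * c;
}

/* Largest rank count over a comma-separated list of triplets.
 * Returns -1 if the list is empty or any triplet is rejected. */
long soft_max_ranks(const char *spec)
{
    long best = -1;
    const char *p = spec;

    assert(spec != NULL);
    while (*p != '\0') {
        char buf[64];
        size_t len = strcspn(p, ",");
        long m;

        if (len == 0 || len >= sizeof buf)
            return -1;
        memcpy(buf, p, len);
        buf[len] = '\0';
        m = soft_triplet_max(buf);
        if (m < 0)
            return -1;
        if (m > best)
            best = m;
        p += len;
        if (*p == ',')
            p++;                    /* skip the separator */
    }
    return best;
}
```

For example, soft_max_ranks("1,4:8:3") considers the candidates 1 and {4, 7} and returns 7.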
DESCRIPTION
Specifying Host Nodes
A host is specified by using the -host option. The number of ranks to run on that host must be specified with a -n or -np option. So to run 3 ranks on c1 we would use:

    mpiexec -n 3 -host c1 <prog> <args>

If multiple hosts are needed, separate them by using colons. Each should specify how many ranks to use. Note that you should not use a hostname that is not globally constant, like "localhost", as DNS lookups will give differing answers on each machine. To launch a single rank on each of c1, c2 and c3:

    mpiexec -n 1 -host c1 <prog> : -n 1 -host c2 <prog> : -n 1 -host c3 <prog>

The above form can be tedious for large numbers of machines, so using hostfiles is much simpler. The format of a hostfile is:

    # Comments have a leading hash. Blank lines are ignored.
    # A single rank:
    user1@machine1
    # Multiple ranks:
    user2@machine2 slots=4
    # Use default login:
    machine3 slots=2

A hostfile is used by including the -file option on the command line. The number of slots is the default number of ranks to run on that machine before moving to the next in the list. If the total number of ranks requested is less than the number described in the hostfile, mpiexec will use the first machines listed in the hostfile when selecting ranks. If the total number of ranks is greater, then mpiexec will assign them in a round-robin fashion. For example:

    mpiexec -n 1 -file hostfile <prog> <args>

will execute <prog> with one rank on machine1, using user1 as a login name.

    mpiexec -n 3 -file hostfile <prog> <args>

will execute one rank on machine1, and two ranks on machine2.

    mpiexec -n 9 -file hostfile <prog> <args>

will launch <prog> with 2 ranks on machine1, 5 on machine2, and 2 on machine3.

Multiple hostfiles may be used, separated by colons:

    mpiexec -n 12 -file hostfile1 <prog1> <args> : -n 10 -file hostfile2 <prog2> <args>

will execute 12 processes described by hostfile1, and 10 by hostfile2.

The number of processes can also be selected by using the -soft option.
However, this implementation of MPI will simply calculate the maximal number of ranks allowed by the syntax, and use that as if it were the value entered in a -n or -np option.

Note that any ranks launched on the same machine should involve the same program, and the same colon-delimited section of the command line. This allows the ranks to use ultra-fast same address-space communication. This requirement means that MPMD applications need to have their programs run on separate machines. If possible, recompile MPMD applications into SPMD ones, where this limitation is avoided.

Current Working Directory
The current working directory of a rank on a remote machine will be the default directory for ssh logins. This will most likely be the home directory. A locally launched rank will not change the current working directory by default. To change this, use the -wdir command line option.

Standard I/O
The stdout file descriptor of remote ranks will by default be tunneled over the MPI protocol. To alter this, use the -notunnel command line option. If -notunnel is given, stdout will not be redirected. This may be useful when debugging using an xterm for each remote machine.

The stderr file descriptor is not redirected. This allows an MPI process to choose where its output goes. By using the dup2() system call, MPI applications can choose at runtime to output to a local terminal window, or have all output tunneled to the root rank.

The stdin file descriptor is not redirected. This allows input at remote machines by using their xterms. Note that multiple ranks may share a common stdin, because multiple ranks run as threads on the same machine. A programmer can use the same techniques as are used in threaded programs to make sure input goes to the required rank.

Signals
This implementation of MPI uses a real-time signal to accelerate blocked message sends. It uses the undocumented __libc_allocate_rtsig() function to allocate an unused signal on startup.
The lowest priority unallocated real-time signal is used for this purpose. If you are debugging MPI applications in gdb you may want to use the command

    (gdb) handle SIG64 noprint

to turn off halting when this signal is raised (assuming that SIG64 is the lowest unallocated one).

If an MPI process running on a machine is killed via a signal, other processes in the same job will notice. They will exit immediately when their socket connection to the dead process is shut down. No signals are propagated by MPI, so local ranks will need to inform remote ranks programmatically if, for example, SIGUSR1 is received.

Since this MPI implementation uses threads to implement MPI ranks on the same machine, users of signals will need to use the pthreads signal handling functions instead of the single-threaded versions. Use pthread_kill() instead of raise() and pthread_sigmask() instead of sigprocmask(). The IO thread, which handles communication to other machines, runs with signals blocked. The other threads, which correspond to MPI ranks, can have their signal properties adjusted at will (provided that the low priority real-time signal is left unblocked).

Process Termination
If during the execution of an MPI application any rank dies unexpectedly, the corresponding process on that machine will also die. Remote machines will exit immediately when their socket connection to the dead process is shut down.

User signal handlers should avoid calling MPI functions. This MPI implementation is not async-signal safe. For example, if MPI_Send() is called with a bad buffer and a segmentation fault occurs, the internal state of MPI is undefined. If the SIGSEGV signal handler attempts to call MPI_Finalize() it is unlikely the necessary communication can take place. If a critical error like this occurs, the safest thing to do is clean up non-MPI state, and then exit.

Config Files
The -configfile <filename> option allows most of the command line given to mpiexec to be stored in a file.
Each line of <filename> has the form of one of the colon-separated sections of the mpiexec command line. Lines beginning with '#' are comments, and are ignored. Lines ending with a backslash '\' continue onto the next line. The advantage of this invocation method is that it allows arguments to programs that contain colons, which is impossible with the normal execution method.

Infiniband
By default, if an Infiniband network interface is detected, then it will be used by MPI. However, if the -noinfiniband command line argument is given then MPI will fall back to purely using ethernet for inter-machine communication. This can be useful for debugging Infiniband issues.

Note that either all machines in a cluster should have Infiniband available, or none should. Mixed-use clusters are not supported. It is assumed that all machines are routable to each other over Infiniband, i.e. the Infiniband network isn't split. It is also assumed that there is a TCP/IP link to each machine in the cluster. The MPI implementation will start each node via ssh. Once an MPI process is running on each node after startup, communication will switch to using Infiniband.

This MPI implementation will only use one Infiniband port per machine. If multiple ports are available, it will detect the first connected one, and use that for communication.

If an MPI application is misbehaving whilst using Infiniband, try the following:

Test to see if your application is using the fork() system call. If so, memory corruption could result if the child process alters memory mapped by the Infiniband hardware. To test for this, use the pthread_atfork() function to register handlers that will notify you of this condition.

This MPI implementation can stress your machine in ways that other applications do not. This may expose latent hardware issues:

Try updating your Infiniband HCA firmware. Some older firmware may crash when subjected to the high loads this MPI implementation can sustain.
Try testing your RAM. The DMA from Infiniband to marginally working RAM may trigger issues that are rarely detectable in normal use.

Debugging
Although it is possible to manually start up an MPI job by setting environment variables (see below), this is very tedious. Instead, mpiexec can do this for you, and launch a separate xterm window for each machine with the environment set up. Use the -xterm flag to do this. By launching a shell instead of an MPI program, it is possible to use your favourite debugger.

Another option is to include the debugger on the command line passed to mpiexec. For example, to run an MPI job inside valgrind with 8 ranks, using hostfile to determine which machines to use:

    mpiexec -x11 -notunnel -n 8 -file hostfile valgrind --tool=memcheck <prog> <args>

However, for simplicity, there is also an option to do the above for gdb. Use the -debug option to imply -xterm and then start up a gdb session inside each xterm.

When your application is executing it will have n+1 threads. The first n threads will correspond to the ranks running on this machine. The last thread is an internally used thread that handles IO for MPI.

Startup
Once the command line parameters are parsed, mpiexec will attempt to launch the requested programs on the remote machines. It does this by invoking the ssh executable. In order for this to work, ssh needs to be accessible somewhere in your PATH. The port of the remote ssh server can be altered by using the -sshport option.

By default, ssh will use the name of the current user to log in to the remote machines. Other user names can be used by using the "name@host" syntax instead of "host" to describe the remote machines.

Since entering passwords every time an MPI job is started is very tedious, it may be useful to set up ssh-agent. Once the public keys of all the remote machines are known, it is possible to have password-less logins. Use ssh-agent bash to start a bash shell with an associated agent.
Then use ssh-add to add your keys. Finally, run mpiexec and no passwords will be needed if the agent is working correctly. If the machine connection messages are no longer needed, use the -quiet flag to disable them.

In order to discover the location of the remote executables, an altered PATH may be needed. To set one, use the -path option. This will replace the default PATH.

The login and connection process may fail. To prevent MPI jobs waiting forever for a process to start up, there is a timeout. By default, this is 60 seconds. To increase this timeout (e.g. for a large cluster) use the -timeout option. This flag sets the MPI_TIMEOUT environment variable.

Finally, the MPI jobs will attempt to connect to each other over ethernet. They do this by creating a server IO thread at each machine. These threads then attempt to open sockets connecting to the root machine. This implementation of MPI uses 7077 as the default port for MPI communication. If this needs to be altered, use the -port option. This flag sets the MPI_PORT environment variable.

Default Startup
By default, if an MPI application in this implementation is run without using mpiexec to launch it, it will execute as an MPI process that will autodetect a number of ranks to use. On a multicore system, this will use the number of cores available as the number of ranks. A user application may then detect the number of launched ranks at run time. If this is too many, then a new communicator can be constructed that contains only the wanted ranks. By using it in lieu of MPI_COMM_WORLD a smaller job can effectively be run. Note that the unwanted ranks can call MPI_Finalize() and will then sleep, taking up very little machine resources, until the MPI application exits.

If the Platform LSF batch system is detected (through the existence of the LSB_HOSTS environment variable), this MPI implementation will auto-detect how many ranks to execute and on which machines.
It will then launch the requested threads using the Platform API. No complex MPI launch system is needed; just describe the number of ranks required:

    bsub -n # -x -q <queue_name> <command> <args>

This will start # ranks in exclusive mode on batch queue <queue_name> for the MPI application <command>.

ENVIRONMENT VARIABLES
This MPI implementation uses environment variables to determine how to execute an MPI process. No command line arguments are used, so a program may inspect them before calling MPI_Init() if required. mpiexec sets the environment based on its command line. It is possible to avoid using mpiexec and set these manually if needed. This may be useful for debugging purposes.
MPI_NOINFINIBAND - This disables the use of Infiniband for MPI communication
MPI_IB_MR_MAX - This sets the maximum number of Infiniband Memory Regions to use. The default is to use all available. Using this allows other Infiniband applications to share use of the HCA.
MPI_COMMAND_LINE# - This sets the command line for rank number #. Normally, the command line of a rank will be set to the command line passed to MPI_Init() by the root process on a machine. This environment variable can be used in the case where the caller of MPI_Init() may not know argc and argv, such as in a library. Note that mpiexec does not set this variable, but it is exposed due to its usefulness.
MPI_NUMRANKS - The number of ranks to use on this machine. This defaults to an auto-detected amount if not set.
MPI_NUMPROCS - The total number of ranks in the MPI job. This defaults to MPI_NUMRANKS if not set.
MPI_NUMMACHS - The total number of machines used in the MPI job. This defaults to one.
MPI_KEYWORD - A secret keyword used in initial connection. This must be the same on all machines in an MPI job. A connecting socket that advertises an incorrect keyword will be rejected.
LSB_HOSTS - This is set by the Platform Batch System. If set, MPI will try to start automatically, setting MPI_NUMRANKS, MPI_NUMPROCS and MPI_NUMMACHS based on information provided by the batch system.
MPI_SLAVE - This is the name of the machine executing the root rank. It must be set on all remote machines, and not set on the root machine.
MPI_BUFCOPY_SIZE - Sets the maximal size where messages will be automatically buffered. See MPI_bufcopy_set()
MPI_VERBOSE - Show extra information when starting an MPI job.
MPI_TIMEOUT - Timeout in seconds allowed before startup fails. Defaults to 60s.
MPI_NOTUNNEL - Do not tunnel stdout if set.
MPI_PORT - MPI communication port. Defaults to tcp 7077.
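
Because these settings arrive through the environment rather than argv, a program can read them before MPI_Init(). The C sketch below shows one way to do that for a few of the variables listed above, applying the documented defaults (port 7077, timeout 60 seconds). It uses only the standard library; the struct and function names are hypothetical, and treating any value of MPI_NOTUNNEL or MPI_NOINFINIBAND, even an empty one, as "set" is an assumption rather than documented behaviour.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Return text parsed as a positive decimal integer, or fallback if text
 * is NULL, empty, malformed or not positive. */
long parse_positive_or(const char *text, long fallback)
{
    char *end = NULL;
    long value;

    if (text == NULL || *text == '\0')
        return fallback;
    errno = 0;
    value = strtol(text, &end, 10);
    if (errno != 0 || *end != '\0' || value <= 0)
        return fallback;
    return value;
}

/* Look up an environment variable and parse it with a default. */
long env_positive_or(const char *name, long fallback)
{
    assert(name != NULL);
    return parse_positive_or(getenv(name), fallback);
}

/* Hypothetical container for the settings inspected here. */
struct launch_settings {
    long port;            /* MPI_PORT, documented default 7077 */
    long timeout_s;       /* MPI_TIMEOUT, documented default 60 seconds */
    int  tunnel_stdout;   /* 0 if MPI_NOTUNNEL is set (assumed: any value) */
    int  use_infiniband;  /* 0 if MPI_NOINFINIBAND is set (assumed: any value) */
};

/* Read the settings; intended to be called before MPI_Init(). */
struct launch_settings read_launch_settings(void)
{
    struct launch_settings s;

    s.port = env_positive_or("MPI_PORT", 7077);
    s.timeout_s = env_positive_or("MPI_TIMEOUT", 60);
    s.tunnel_stdout = (getenv("MPI_NOTUNNEL") == NULL);
    s.use_infiniband = (getenv("MPI_NOINFINIBAND") == NULL);
    return s;
}
```

A program could call read_launch_settings() at the top of main() and, for instance, log the effective port before handing control to MPI_Init().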

RETURN VALUE
The return value of mpiexec will be whatever the return value of the root rank of the MPI application is. This will be 0 if all ranks started exit by calling MPI_Finalize() and then the root process exits with EXIT_SUCCESS.

If a remote process shuts down prematurely, then its error code will not be propagated to the output of mpiexec. Instead, the root process will notice the failing machine by the disconnection of its TCP socket, and will then exit with a status of 1.

If a remote machine dies using the MPI_Abort() function, the error code passed to it will be propagated to the root rank. The root rank will then exit with the requested error code, and so will mpiexec.

COPYRIGHT
Copyright (C) 2011 Lockless Inc.

SEE ALSO
mpicpp (1) mpicc (1) MPI_Init (3)