			        ParGCL/MPI
	       Comments to:  Gene Cooperman (gene@ccs.neu.edu)

ParGCL provides three parallel programming interfaces, each described
individually, below.  The three interfaces are:

  slave-listener - master sends LISP commands to a slave, and then receives
  		 replies.
  MPI -          Simplified LISP interface to MPI.  At this time, it
  		 primarily implements the point-to-point layer of MPI.
		 See the MPI standard on the web for a full manual of the
		 MPI functions.
  master-slave - simple, easy-to-use master-worker model of parallelism
  		 based on TOP-C (http://www.ccs.neu.edu/home/gene/topc.html).
		 It also supports non-trivial parallelism, in which any
		 slave process can cause the "shared data" to be updated,
		 and so made visible to other slaves.

To begin, make sure that there is a procgroup file in the directory where
you start pargcl, or else call:  pargcl -p4pg FULL_PATH_OF_PROCGROUP_FILE
The procgroup file specifies the remote "slave" machines and the path
to pargcl on those machines.  pargcl/bin/procgroup is a sample file.

Most people will want to begin with the slave-listener interface.
Note that (help 'COMMAND) works for most commands.
There are example parallel programs in pargcl/examples/.
Here is a sample usage of some simple commands:
  (send-receive-message '(1+ 5)) ; send message, recv answer (default: slave 1)
  (send-message '(+ 3 4) 2)  ; send to slave 2 for evaluation
  (send-message "(+ 5 6)" 1) ; send to slave 1 for evaluation
  (send-message "(+ 7 8)" 1) ; send to slave 1 for evaluation
  (receive-message 2)
  (receive-message 1)
  (probe-message 1) ; Slave 1 has message pending
  (flush-all-messages)
  (probe-message 1) ; Slave 1 now has no messages
  (broadcast-message '(setq myrank (MPI-Comm-rank))) ; rank (id) of slave
  (par-eval '(defun myfnc (x) (sqrt x)))
  (send-message '(dotimes (i 1000000000) (myfnc i)) 1)
  (par-reset) ; useful if slave not responding.
  (par-load "/home/joe/myfile.lsp") ; Load same file in all processes
  (defun par-mapcar (fnc list)
    (par-funcall #'par-mapcar-helper fnc list))
  (defun par-mapcar-helper (fnc list &aux (i -1) result)
    (if (master-p)
        (setq result (make-array (length list))))
    (master-slave
      :generate-task-input #'(lambda () (incf i) (if (< i (length list)) i nil))
      :do-task             #'(lambda (j) (funcall fnc (nth j list)))
      :check-task-result   #'(lambda (task-in task-out)
                               (setf (aref result task-in) task-out)
                               'NO-ACTION)
      :trace t)
    (if (master-p) (coerce result 'list) nil))
  (par-mapcar #'sqrt '(1 2 3 4 5 6 7 8 9 10))
  (help '*master-slave-time*)
  *master-slave-time*
  (send-receive-message '*master-slave-time*) ; time spent on slave 1

;; The definition of par-mapcar above demonstrates an easy way to
;;  generate master-worker parallelism.
      ;; In :generate-task-input, the variable i is the next task input;
      ;;   this is generated on the master.
      ;; In :do-task, the task is to call fnc on (nth j list), where j is
      ;;   the task input.  Note that the argument "list" is available on
      ;;   all processes, since par-funcall passes it to each of them.
      ;;   The task is then executed on the next available slave.
      ;; In :check-task-result, the master gets a pair: (task-in, task-out),
      ;;   where task-out is (funcall fnc (nth task-in list)).
      ;;   Here, we return 'NO-ACTION, but more general patterns are also
      ;;   possible.  See "Overview" in
      ;;   http://www.ccs.neu.edu/home/gene/topc.html  for other patterns.

;;;;USAGE:
;;;; A remote process is set up for each one specified in your procgroup file.
;;;; Available commands:
;;;; (send-message <lisp-expr> <optional dest = 1>)
;;;; (broadcast-message <lisp-expr>) ; no reply message
;;;; (receive-message <optional source = (MPI-Any-source)>)
;;;; (send-receive-message <lisp-expr> <optional dest = 1>)
;;;; (free-msg-buffer <buf>) ; optimization for greater efficiency, only
;;;;                         ; frees message buffer for re-use in next MPI-recv
;;;; (probe-message <optional source = (MPI-Any-source)>)
;;;; (flush-all-messages)  ; flush all commands pending by slaves
;;;; (bye) modified to also delete remote processes
;;;; (get-last-source) ; not needed, but can be useful in special cases
;;;; (get-last-tag)    ; not needed, but can be useful in special cases
;;;; ---
;;;; It works to call (send-message ... 1) n times, and then call
;;;;   (receive-message 1) n times, although this may be less efficient.
;;;; Commands sent to the same processor are evaluated in sequence.
;;;; ---
;;;; Send/receive/probe commands also accept an optional tag, with an
;;;; obvious default.
;;;; CAUTION:  If you add the optional tag parameter to send-message,
;;;; note that the slave-listener does not reply if tag = broadcast-tag.
;;;; Also, tags larger than broadcast-tag are interpreted as (vector fixnum)
;;;; or (vector float), and use the corresponding MPI data types.
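
The send-then-receive pattern described above can be sketched as follows.
This is only an illustration: it assumes a running ParGCL master with at
least N-SLAVES slaves in the procgroup file, and gather-squares is a
hypothetical name, not part of ParGCL.

```lisp
;; Hypothetical helper (not part of ParGCL).  Assumes it is called on the
;; master, with slaves 1..N-SLAVES available.
(defun gather-squares (n-slaves)
  "Ask slave k to compute k*k, for k = 1..N-SLAVES; return replies in order."
  ;; Send every request first, so that the slaves can work concurrently.
  (loop for dest from 1 to n-slaves
        do (send-message (list '* dest dest) dest))
  ;; Then collect one reply per slave.  Naming the source keeps the
  ;; results in slave order, whatever order the replies arrive in.
  (loop for src from 1 to n-slaves
        collect (receive-message src)))
```

Calling (receive-message) with no argument instead would take replies in
whatever order they arrive, which may suit tasks of uneven length.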

====================

The current implementation has a built-in MPI subset (MPINU).  Note,
however, that one can invoke:  ./configure --with-mpicc=XXX
where XXX is an mpicc script that compiles an application to use MPI.
This allows you to use other dialects of MPI.

Note that messages are converted to (vector fixnum), (vector float),
or (by default) strings (general print representations) before being sent.
Conversion from a general object to its print representation as a string
implies significant overhead for very large messages.

Also, a timer will kill any process that has not received a message
for some time (60 minutes by default).

Finally, there is an obscure interaction among MPINU, ANSI GCL, and
the GNU libpthread library.  (Interestingly, the debugging-enabled version,
/usr/lib/debug/libpthread.so.0, seems to work fine under Debian.)  For this
reason, if you configure GCL with --enable-ansi or ParGCL
with --with-mpi="...ansi_gcl", then MPINU is built with "-DHAVE_PTHREAD=0".
The result is that under ANSI GCL, if two processes simultaneously
send very large messages to each other, they can deadlock.  This should not
affect typical programs.

Following is a summary of the master-slave layer, slave-listener layer,
and MPI layer.

=========================================================================
[ The master-slave layer uses the TOP-C model.  For a detailed description
  of the TOP-C model, see http://www.ccs.neu.edu/home/gene/topc.html ]

MASTER SLAVE LAYER 

(master-slave             - see related article for full details.
			    The user is responsible for writing the FNC's:
  :generate-task-input FNC - (generate-task-input) returns TASK-IN, arb. user
			     data structure
  :do-task FNC		 - (do-task TASK-IN) returns TASK-OUT, arb. user
                             data struct
  :check-task-result FNC    - (check-task-result TASK-IN TASK-OUT)
				 must return T, NIL, ?, or (CONTINUATION . *)
			   Equivalently, can return symbols:
				UPDATE, NO-ACTION, REDO, (CONTINUATION . *)
			   T means call (update-shared-data TASK-IN TASK-OUT)
			   NIL means do nothing more with this task
			   ? means re-send TASK for additional comp. by do-task
			   CONTINUATION is like ?, but it allows a user
                              parameter for *.  (CONTINUATION . *) becomes
                              the next TASK for the slave.
  :update-shared-data FNC - (update-shared-data TASK-IN TASK-OUT) executed on
			     master and slaves, used to modify global var's
  :trace T-OR-NIL)	  - obvious
*master-slave-trace*	  - global variable, default value for :trace keyword
        EXAMPLE:  see file, myfactor.lsp
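
The action protocol above can be illustrated with a small sequential
emulation in plain Common Lisp.  This is a sketch only, not the ParGCL
implementation (and not master-slave-seq.lsp): master-slave-sketch is a
hypothetical name, everything runs in one process, and a continuation is
assumed to replace the task input seen by both do-task and
check-task-result.

```lisp
;; Hypothetical sequential emulation; not part of ParGCL.
(defun master-slave-sketch (&key generate-task-input do-task
                                 check-task-result update-shared-data)
  "Run tasks one at a time, interpreting the result of CHECK-TASK-RESULT."
  (loop for task-in = (funcall generate-task-input)
        while task-in                    ; NIL task input means no more tasks
        do (let ((input task-in))
             (loop
               (let* ((task-out (funcall do-task input))
                      (action (funcall check-task-result input task-out)))
                 (cond ((member action '(t update))
                        ;; UPDATE: modify the shared data, then move on.
                        (funcall update-shared-data input task-out)
                        (return))
                       ((member action '(nil no-action))
                        (return))        ; nothing more to do for this task
                       ((eq action 'redo))  ; repeat do-task on the same input
                       ((and (consp action)
                             (eq (car action) 'continuation))
                        ;; Assumption: the continuation becomes the new input.
                        (setq input (cdr action)))
                       (t (error "Unknown action: ~S" action))))))))
```

In the real master-slave, do-task runs on the slaves and
update-shared-data runs on the master and all slaves; here both are
ordinary local calls, which is enough to test the logic of the four
user functions.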

*master-slave-time* - variable set after each call to master-slave;
        see (help '*master-slave-time*).
        On each processor, it shows the time spent by that processor.

(up-to-date-p)		  - utility to test whether the shared data is still
			    up to date, i.e., that update-shared-data was
			    not called between the time when the most recent
			    task was generated, using generate-task-input,
			    and the time when the corresponding result was
			    obtained, using check-task-result
	EXAMPLE: (defun check-task-result-fnc (task-in task-out)
		   (declare (ignore task-in task-out))
		   (if (up-to-date-p)
		       t    ;data is consistent, update all processors
		       'REDO)) ;data structures changed, re-do computation

(result-task)             - Utility returning task corresponding to most recent
			    result returned; Used only in check-task-result
        EXAMPLE: (defun check-task-result-fnc (task-in task-out)
	           (case (first (result-task))
		       (TASK-A ...)
                       (TASK-B ...)
                       ...))

(par-eval 'EXP)	          - EXP evaluated on all processors
         EXAMPLE: (par-eval '(defun foo (x) (1+ x)))
(par-funcall FNC ARG1 ...) - like (funcall FNC ARG1 ...) with value of ARG's
			    taken from lexical environment on master
         EXAMPLE: (par-funcall #'load "file.lsp")
(par-load FILE)		  - like (load FILE), loads FILE on all processors
(par-reset)		  - resets slave processors if wedged (deadlocked)
	 EXAMPLE: (par-reset)

(master-p) - returns t when executed on master, and nil when on slave.
        EXAMPLE:  (par-eval '(if (master-p)
	            (set-up-large-master-data-structure)
		    (set-up-special-slave-stuff)))

=========================================================================

SLAVE LISTENER LAYER - master distributes commands to slaves.  These
     commands should be executed on the master ONLY.  The slave-listener
     layer sets the initial slave directory to the same as the master's.
     The master can arbitrarily intermix sending commands and getting results.

(send-message LISP-EXPR OPT-MPI-DEST-ID OPT-TAG)  - send arb. expression for
				 evaluation; send to OPT-MPI-DEST-ID with TAG,
				 default is OPT-MPI-DEST-ID = 1, TAG = 0
	EXAMPLE: (send-message '(+ 3 4))
	NOTE: if LISP-EXPR is a string, it assumes this is a print
	   representation of a LISP object, and not a raw LISP string

(receive-message OPTIONAL-MPI-ID OPT-TAG)  - return result from next msg from MPI-ID
					that was sent with TAG, default is next
				        available slave, any tag
        EXAMPLE:  (receive-message)  -> returns 7 for above case

(broadcast-message LISP-EXPR) - send arb. expression to all slaves.  No result
                                is returned.

(free-msg-buffer BUF) - optimization for greater efficiency;  Allows
                        end-user to free message buffer for re-use by
                        slave-listener, saving the slave-listener from
		        generating a new buffer.

(mpi-comm-rank)		  - returns MPI process ID, 0 = master, > 0 for slaves
        EXAMPLE: (par-eval '(if (= (mpi-comm-rank) 2) (print x)))

(get-last-source), (get-last-tag), (get-last-count), (get-last-datatype)
  all exist.
These last should be avoided, but get-last-source can be useful for matching
initial response of slave with later continuation by same slave.
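
The matching of a reply to its slave might be sketched as follows.  This
assumes a running ParGCL master and that (get-last-source) reports the
slave of the most recent receive-message; receive-with-source is a
hypothetical name, not part of ParGCL.

```lisp
;; Hypothetical helper (not part of ParGCL).
(defun receive-with-source ()
  "Return (SLAVE-ID . RESULT) for the next reply from any slave."
  ;; Receive first: get-last-source refers to the message just received.
  (let ((result (receive-message)))     ; default source: any slave
    (cons (get-last-source) result)))
```

The master can then direct a follow-up command to the same slave, for
example with (send-message NEXT-EXPR (car reply)), where reply is the
pair returned above.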

=========================================================================

MPI LAYER - not all commands are implemented.  The MPI manual is also
  a good source for what the commands do.  The calling sequences can
  be found via (help 'mpi-command) or (describe 'mpi-command)

(mpi-iprobe) - non-blocking probe for remaining messages. Returns t or nil.
        EXAMPLE: (if (null (mpi-iprobe)) (print "READY"))

(mpi-probe) - blocking probe for remaining msg's. Always returns t, eventually.
        EXAMPLE: (progn (mpi-probe) (print "READY"))
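
As an illustration of the non-blocking probe, the sketch below does local
work until a message is pending.  It assumes only that (mpi-iprobe) can be
called with no arguments and returns t or nil, as in the example above;
wait-for-message and its IDLE-FN argument are hypothetical.

```lisp
;; Hypothetical helper (not part of ParGCL).
(defun wait-for-message (idle-fn)
  "Call IDLE-FN repeatedly until (mpi-iprobe) reports a pending message."
  (loop until (mpi-iprobe)
        do (funcall idle-fn))
  t)
```

When there is no useful local work to do, (mpi-probe) is simpler, since it
blocks until a message is available.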

(autologout X) - if no msg rec'd in X minutes, process aborted.  Returns old
	value of X.  Can be called with 0 arg's, leaving old value unchanged.

(nice X) - nice val between -19 and 20 (higher is lower prio.); returns old val

(limit-rss X) - limit Resident Set Size, when others also want RAM;
	returns old val

(chdir "string") - change current directory

(getcwd) - get current working directory as string

There are many other MPI commands.  They are not recommended, except for
special cases.  One could load the MPI layer alone, for special requirements.

DEBUGGING -  It is recommended, when debugging, to first run the master and
all slaves on a single processor.  (The procgroup file can easily be set up
this way.)  The :trace keyword in the function, master-slave, is also useful.

In more difficult debugging situations, one may run the user code
with master-slave-seq.lsp in a single process (e.g.: in gcl, not gcl-mpi).
In this case, one can use (step ...) to debug routines that would reside
on the slave.  However, the user code should first be modified to
remove any uses of "(master-p)" or other constructs explicitly referring
to a distinct master and slave.

(par-eval '(si::use-fast-links nil)) is a good precaution during development,
to find at what function any crashes occur.

If communication time is too high, then you should break up any very large
messages into elementary MPI datatypes.  So, consider breaking up messages of
over several thousand bytes into several messages, where each large message
is of LISP type: (vector fixnum) or (vector float).
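
For instance, a do-task function that produces a long list of integers
might return it as a (vector fixnum) instead.  The sketch below is plain
Common Lisp; it assumes, per the notes above, that ParGCL sends such a
vector using the corresponding MPI datatype rather than its print
representation.  The name pack-fixnums is hypothetical.

```lisp
;; Hypothetical helper (not part of ParGCL).
(defun pack-fixnums (list)
  "Copy a list of fixnums into a freshly allocated (vector fixnum)."
  (make-array (length list)
              :element-type 'fixnum
              :initial-contents list))
```

A do-task function could then end with (pack-fixnums results), and
check-task-result would read the elements back with aref.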

The variable *master-slave-time* may also be useful for seeing where
all the time is going.

PERFORMANCE - Try:
1.  Bundling several tasks into one to reduce communication overhead
2.  Setting up multiple slave processes on the same processor to gain
    greater CPU utilization on the processor
3.  Using free-msg-buffer in check-task-result to save on garbage collections:
    Important if do-task is returning long vectors of fixnum or of float.
4.  The file, master-slave.lsp, currently has a throttling mechanism in case
    a slave is very slow, perhaps due to paging against other users.
    Any slave running CYCLES-BEFORE-DEAD times slower (default = 20)
    will stop being used.  After CYCLES-BEFORE-UNDEAD (default = 200)
    task times, it will be resuscitated and tried again.  These
    constants in master-slave.lsp can be changed, after which ParGCL
    should be re-made.
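
Item 1 above (bundling) can be sketched in plain Common Lisp as follows.
The helper bundle-tasks is a hypothetical name, not part of ParGCL; it
only groups the inputs, so that each task sent to a slave covers several
items.

```lisp
;; Hypothetical helper (not part of ParGCL).
(defun bundle-tasks (items bundle-size)
  "Split ITEMS into a list of sublists of at most BUNDLE-SIZE elements."
  (loop while items
        collect (loop repeat bundle-size
                      while items
                      collect (pop items))))
```

With bundles as task inputs, :generate-task-input would hand out one
sublist at a time, and :do-task would become something like
#'(lambda (bundle) (mapcar fnc bundle)), trading one message per item for
one message per bundle.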
