Dynamic Connection Establishment for Distributed Mozart
A design document
By Erik Klintskog
   Anna Neiderud

-----------------------------------------------------------------------

Introduction

During the development of Distributed Mozart several issues have
arisen that do not have one single solution. While some applications
require a high throughput achievable only by using a new transport
medium, others must be reachable through a restrictive firewall. Yet
others may need encrypted communication or the ability to reconnect
after a laptop has moved.

Since the application programmer is the expert on the environment
where his or her application will run, the different aspects of how
communication is done should be dynamically changeable by him or
her. This paper describes a design that keeps the simplicity of
completely transparent distributed Mozart programming, while offering
control on several different levels of abstraction to the advanced
Mozart programmer.

-----------------------------------------------------------------------

Goals of the design

A distributed Mozart process may use shared memory, a local area
network or the Internet for its communication. This implies that
Mozart processes exist in very dynamic environments. They need to be
able to adapt to different circumstances continuously.

On the Internet, machines connect and disconnect continuously, and
they can even change their address at runtime. Connection
establishment can take an arbitrary amount of time. A decision to use
a different transport protocol may dramatically change the
performance of the communication between two peers, while it
barely works between two other peers.
The Internet is also divided into different domains separated by
firewalls. Establishing an outbound connection through a firewall is
often easy, but doing so in the opposite direction may be impossible.

Mozart processes must be able to function in this harsh
environment. This design tries to achieve that by meeting the
following goals:


1. Add new connection protocols.
   The application programmer must be able to implement new connection
   protocols that fulfill the demands that the application puts on
   its distributed environment. Implementing new connection protocols
   should not require extensive knowledge of the Mozart system and its
   implementation. New connection protocols should be distributed
   automatically, since a Mozart process that runs a custom connection
   protocol must be reachable by any Mozart process that
   receives a reference to it.
   
2. Handle Mozart processes that change their addresses during runtime.
   A Mozart process can change its address during runtime. The address can 
   therefore not be statically allocated. A remote process must be able to 
   find a remote peer even if it has changed its address.

3. Use different transport media.
   Mozart processes have a number of different media to transport their
   data over. The different transport media have different
   properties and are suitable for different tasks. Two Mozart
   processes must be able to dynamically decide what medium to use
   depending on the demands and current locations of the two peers.

4. Mozart processes with multiple addresses must be reachable.
   Mozart processes that run on machines with multiple addresses must
   be able to make contact and to be contacted on all their
   addresses. Multiple addresses should be available for
   communication in parallel.

5. Execute foreign connection protocols in a secure way.
   Executing foreign-defined connection protocols must be done in a
   controlled environment. Connecting two Mozart processes is done
   with the use of native functionality and resources. The use of
   native functionality and resources must be restricted and monitored
   to prevent damage to the running Mozart process.
   
-----------------------------------------------------------------------

The design model

Definition

When running distributed Mozart, every process is called a site. Oz entities
on one site can be referenced at another site transparently, as if they were
in the same global memory. The distributed subsystem ensures consistency of
distributed entities by means of different entity protocols.

*Picture* 
    Showing centralized Mozart vs. distributed subsystem entities
    present in both parts
*p*

Internally in the distributed subsystem, references to remote sites are
represented by DSite objects.

*Picture*
    Showing sites with DSites referencing other sites.
*p*

To let entity protocols communicate with other sites, every DSite can
have a ComObject to handle connection requests, queue messages and
monitor faults. ComObjects are located in the communication layer.

When a physical channel is needed the ComObject uses a dedicated
connection functor to acquire a TransObject to carry out the actual
transportation of data. TransObjects are located on the transport
layer and represent physical channels of some arbitrary type. Every
ComObject has exactly one TransObject whenever it is connected to a
remote peer.

The transport layer has different parts that can be extended and
adapted in different ways by the Mozart application programmer. It
consists of one or more transport modules (written in C++)
and a ConnectAcceptModule (written in Oz). Each transport module
implements communication over a certain transport medium, provided
through a common interface, and is expected to consist of a TransObject
and a TransController. The ConnectAcceptModule runs code that is able to
establish some type of connection to a remote site and uses this
connection to negotiate what transport medium to use.

*Picture*
    The distributed subsystem with layering and color coded areas
    representing Oz or C++ code.
*p*

The ConnectAcceptModule

The ConnectAcceptModule is the interface to all distribution for
centralized Mozart and the application programmer. It ensures that all
necessary initialization has been done before any communication will be
attempted.

The distributed subsystem of each site has two functors, the connection
functor and the accept functor, known through the ConnectAcceptModule. The
accept functor is run in an Oz thread locally at the site to listen
for any incoming connection requests. A handle to the connection
functor is distributed together with any reference to a site. The
connection functor is to be called by the remote site whenever it
wishes to establish a connection. The connection and accept functors
must be carefully designed to work together, since they provide the
only way of establishing a connection between sites.

Distributing a connection functor

There are currently three different types of handles to the connection
functor: default, replicated and dynamic.

default: Use the default connection functor known by every site
running the current Mozart version.

replicated: The handle contains a URL to a connection functor that
should be loaded once and then stored together with the DSite.

dynamic: The handle contains a URL to a connection functor that
should be reloaded at every connection establishment.

Replicated enables the Mozart application programmer to alter the
behaviour of connection establishment in the distributed subsystem by
implementing custom accept-connection functor pairs.
Once the connection functor is fetched, it may be cached and will then
be recognized later by a unique id.

Dynamic works as replicated, except that connection functors may not be
stored locally. This can be used to change addresses or behaviour at
runtime.
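
The difference between the three handle types comes down to where the
connection functor is taken from and whether it may be kept. The
following is a minimal sketch of that policy, written in Python purely
for illustration (the real subsystem is Oz and C++). The names Handle,
resolve_functor and load_url, and the use of a plain dictionary as
cache, are assumptions made for the example and not part of Mozart.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Stands in for the functor known by every site of the same Mozart version.
DEFAULT_FUNCTOR = "default-connection-functor"


@dataclass
class Handle:
    kind: str                         # "default", "replicated" or "dynamic"
    url: Optional[str] = None         # functor location (replicated, dynamic)
    functor_id: Optional[str] = None  # unique id (replicated only)


def resolve_functor(handle: Handle,
                    cache: Dict[str, str],
                    load_url: Callable[[str], str]) -> str:
    """Return the connection functor to run for this handle."""
    if handle.kind == "default":
        return DEFAULT_FUNCTOR
    if handle.kind == "replicated":
        # Loaded once, then recognized later by its unique id.
        if handle.functor_id not in cache:
            cache[handle.functor_id] = load_url(handle.url)
        return cache[handle.functor_id]
    if handle.kind == "dynamic":
        # Never stored locally: reloaded at every connection establishment.
        return load_url(handle.url)
    raise ValueError(f"unknown handle type: {handle.kind}")
```

In this sketch a replicated handle triggers one load per functor id,
while a dynamic handle triggers one load per connection attempt.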

By using these types of handles, a great variety of different
behaviours can be defined. For example, replicated could be used to
implement a cheap way of connecting to sites with dynamically changing
addresses: if connection to some previously known address fails, a
known nameserver could be queried for the correct address. The correct
address could then be used and stored for later use.

These types of handles are sufficient to meet the goals of this
design, but the design is general and more types could be added later.
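
The nameserver fallback described above could look roughly as follows.
This is a sketch in Python for illustration only; try_connect and
query_nameserver are hypothetical callbacks, and the local state is
modelled as a dictionary with an "address" key, which is an assumption
of the example.

```python
from typing import Callable, Dict, Optional


def connect_with_fallback(site_id: str,
                          local_state: Dict[str, str],
                          try_connect: Callable[[str], bool],
                          query_nameserver: Callable[[str], Optional[str]]
                          ) -> Optional[str]:
    """Try the last known address first; on failure ask the nameserver."""
    known = local_state.get("address")
    if known is not None and try_connect(known):
        return known
    fresh = query_nameserver(site_id)    # hypothetical lookup by site id
    if fresh is not None and try_connect(fresh):
        local_state["address"] = fresh   # remember it for later attempts
        return fresh
    return None                          # connection attempt failed
```

The nameserver is only contacted when the cached address fails, which
is what makes the scheme cheap in the common case.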

Transport modules

A transport medium for Mozart could be anything that is able to
transport data from one ComObject to another. The basic requirements
are that all messages must reach their destination, that they do so in
order, and that the operation must not block the Oz process.

Use of a transport medium is implemented in a transport module, which
could use TCP, reliable UDP, shared memory, encrypted TCP, compressed
TCP and so on.

The ComObject does not know what transport medium is used. It just
requests a connection from the ConnectAcceptModule and, if the
connection establishment succeeds, it will receive a working
TransObject. The connection functor has to do the work of finding
out what transport media are available at the local site, allocating
resources of that type, i.e. requesting a transport object of a certain
type, and suggesting this to the accept functor. Once a transport
mechanism is agreed upon, the corresponding TransObjects should be
handed over to the requesting ComObject at the connecting site, and to
a new ComObject at the accepting site. Any TransObject that will not
be used should be returned to the corresponding transport module.
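
The negotiation described above can be sketched as a loop over the
locally available transport modules. The sketch is in Python for
illustration only; TransportModule, get_grant, free_grant and the
remote_accepts callback are hypothetical stand-ins, not the actual
transport module interface.

```python
from typing import Callable, List, Optional, Tuple


class TransportModule:
    """Hypothetical transport module holding a limited number of resources."""

    def __init__(self, name: str, capacity: int) -> None:
        self.name = name
        self.free = capacity

    def get_grant(self) -> Optional[Tuple[str, int]]:
        if self.free == 0:
            return None               # nothing available of this type
        self.free -= 1
        return (self.name, self.free)

    def free_grant(self, grant: Tuple[str, int]) -> None:
        self.free += 1


def negotiate(modules: List[TransportModule],
              remote_accepts: Callable[[str], bool]
              ) -> Optional[Tuple[str, int]]:
    """Suggest each locally available medium until the peer agrees."""
    for module in modules:
        grant = module.get_grant()
        if grant is None:
            continue
        if remote_accepts(module.name):
            return grant              # to be handed over to the ComObject
        module.free_grant(grant)      # unused resources go back
    return None
```

The order of the module list expresses the preference of the
connecting site; a refused suggestion never leaves a resource
allocated.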

Security

To maintain security the ConnectAcceptModule implements a sandbox
where connection functors are allowed to run a limited subset of
Mozart. This sandbox is called a connection wrapper. Running
connection functors within a connection wrapper enables control of what
functionality is exposed to the connection functor. This is used to
prevent attacks on the host system and to control the amount of
allocated resources.

Meeting the goals with the proposed design

Referring back to the goals in section "Goals of the design" this
section describes how those goals are met.

1. Implement new connection protocols.
   With the possibility to install arbitrary accept-connect pairs and
   then automatically distribute them to remote sites, the application
   programmer has full freedom to use whatever connection protocol
   suits the application. Given a few parameters to the site,
   any remote site that ever gets a reference to the site will
   be able to make contact.

2. Handle Mozart processes that change their addresses during runtime.
   With help from a well-known third party, a site that changes its
   address can always be found. The moving site informs the third
   party about its new address, and remote peers that try to make
   contact ask the third party for the new address.
   This scheme can be realized in at least two ways. The third party could be
   a name server, in which case the connection functor asks the name server
   for the address at every connection attempt. The third party could
   also act as a repository for connection functors. With uncacheable
   connection functors, connecting peers are forced to fetch the
   functor at every connection attempt.

3. Use different transport media.
   During the negotiation step the two peers can agree on what transport
   medium to use out of the available media. If they choose to use
   another medium than the one that the contact was made on, they
   close that medium, start a new channel with the chosen one and
   return the needed parameters to the distributed subsystem.

4. Sites with multiple addresses must be contactable.
   With a dedicated connection functor, a site that wants to make
   contact can deduce what address to choose for the connection
   attempt. The suitable address can often be calculated from the
   contacting site's domain. In the cases where this is not possible,
   the contacting site can simply try all available addresses. With
   help from an authentication protocol implemented in the
   connect-accept pair, the identity of the answering machine can be
   discovered.

5. Execute foreign connection functors in a secure environment.
   A connection functor will execute in the controlled environment of
   the connection wrapper. The only interface available for the
   connection functor to the Mozart environment is through a dedicated
   interface provided by the connection wrapper. All interface procedures
   that work with native functionality or resources are monitored by
   the connection wrapper. The connection wrapper can be replaced if
   an application needs another level of security.

-----------------------------------------------------------------------
-----------------------------------------------------------------------

Detailed description
(still a draft)

The ConnectAcceptModule

The ConnectAcceptModule is the single entry-point to run the
distributed subsystem from Oz. Before it can be used it has to be
initialized. Initializing the subsystem will cause an accept functor
to be loaded and started, information on this site to be created and
stored ready to be shared with others, and a thread to be ready to
start connection functors in fresh connection wrappers.

The connection wrapper contains the environment exposed to connection
functors. In the future the contents of this environment should be
adaptable by the application programmer. It is implemented by running
the connection functor in a Mozart module manager and by
restricting the imports of the connection functor to one single module
called ConnectionWrapper.
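
The effect of restricting imports to a single module can be modelled
as follows. This is only an analogy written in Python, not the Mozart
module manager API: run_in_wrapper, ImportDenied and the idea of
passing an import function to the functor are assumptions made for the
sketch.

```python
from typing import Any, Callable


class ImportDenied(Exception):
    """Raised when a functor asks for anything but ConnectionWrapper."""


def run_in_wrapper(functor: Callable[[Callable[[str], Any], Any], Any],
                   wrapper_module: Any,
                   parameter: Any) -> Any:
    """Run `functor` with ConnectionWrapper as its only importable module."""

    def restricted_import(name: str) -> Any:
        if name != "ConnectionWrapper":
            raise ImportDenied(name)
        return wrapper_module

    # The functor sees nothing of the host except what the wrapper exposes.
    return functor(restricted_import, parameter)
```

Because every capability reaches the functor through the wrapper
module, replacing that module is enough to change the security level.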

Some of the functionality exposed by the ConnectionWrapper, and some
available to be used by any accept functor, is implemented in
C++, for example getting a TransObject from some suitable transport
module. A thin interface between the Oz and C++ code is therefore
implemented by a C++ module named ConnectionStub. This module is also
used by the ComObject to initially request a connection, and to hand
back or cancel connections that it no longer needs.

DSite representation

All sites in Mozart have a globally unique identifier. This is used to
differentiate DSites from each other and is also used to make sure that
a remote site that is contacted is the intended one.  

A DSite is represented as a C++ object in the distributed subsystem
(DS). The DSite has three fields of information: 
* Id:         The unique identifier of the remote site
* DistState:  Globally replicated information about the remote site.
* LocalState: Locally stored information about the remote site.

The Id is currently implemented as the sequence of a timestamp, the
process id and the IP address of the site. This will have to change to
fulfil goal 2, concerning changing IP addresses. It may be sufficient
to declare the Id to contain a timestamp, a process id and the
original IP address of the site.

The DistState contains information on how to fetch the connection
functor and a parameter to be passed as an argument to the connection
functor. The parameter is intended to contain some kind of address to
the site: in the default implementation an IP address and a port, in
some other implementation perhaps an address to a nameserver or proxy.

The LocalState contains information stored locally only. If the
connection functor is stored, it is stored here. There is also a field
for any local state information that the connection functor wishes to store.

The DistState and LocalState are stored as Oz terms. Since they
contain Oz terms and will mostly be handled by Oz code, this
simplifies the implementation. (The only time they need to be handled
by C++ code is at marshaling and unmarshaling.) The tradeoff is that
garbage collection must be taken care of.

References to sites can be marshaled into a serialized format and
delivered to other sites. The serialized representation of a site is
called a SiteRep. A SiteRep must contain enough information to enable
the creation of a DSite from it. To create a SiteRep we use a
marshaling functor, and to create a DSite from a SiteRep we use the
marshaling functor's inverse, an unmarshal procedure.
/* marshaling functor??? + inverse = procedure */

The SiteRep is built as an extendable structure. An aim is to keep
the format as small as possible. All SiteReps must have at least two
fields to enable unmarshaling and creation of a proper DSite: the type
of the SiteRep and the identity.

TYPE: The type defines how the connection functor should be fetched
      and implicitly what fields to expect in the SiteRep.

ID:   The unique identity of the site. 

Depending on the type of the SiteRep there may be some additional
fields in the SiteRep. These will define the connection functor and
the parameter to be sent to the connection functor. The types used
correspond to the types defined in section "Distributing a connection
functor".

* Default:
  PARAMETER:   The address where the site listens for incoming TCP 
	       connections.
   
* Replicated:
  PARAMETER:   Some parameter to the connection functor, possibly the
               address where the site may listen for incoming TCP
               connections, or the address of a nameserver. What it is
               should be defined by the accept functor and known by
               the connection functor.

  CONNECT_ID:  A globally unique Id of the connection functor to use.
  CONNECT_URL: The location of the connection functor.
  
* Dynamic:
  PARAMETER:   Some parameter to the connection functor. Since this
               parameter cannot change, it should not contain the
               address of the site if the site has a dynamic
               address. In that case the address could be hardwired
               into the connection procedure.
  CONNECT_URL: The location of the connection functor.
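
The per-type field layout above can be captured in a small marshaling
sketch. It is written in Python for illustration, and JSON is used
only as a stand-in wire format; the actual SiteRep encoding is not
specified in this document. The function names and the lower-case
field names are assumptions of the example.

```python
import json
from typing import Any, Dict

# Additional fields expected for each SiteRep type, besides type and id.
FIELDS_BY_TYPE = {
    "default":    {"parameter"},
    "replicated": {"parameter", "connect_id", "connect_url"},
    "dynamic":    {"parameter", "connect_url"},
}


def marshal_site_rep(rep_type: str, site_id: Any, **fields: str) -> str:
    """Serialize a site reference; reject fields that do not fit the type."""
    if set(fields) != FIELDS_BY_TYPE[rep_type]:
        raise ValueError(f"wrong fields for type {rep_type}: {sorted(fields)}")
    return json.dumps({"type": rep_type, "id": site_id, **fields},
                      sort_keys=True)


def unmarshal_site_rep(data: str) -> Dict[str, Any]:
    """Create the three DSite fields from a SiteRep."""
    rep = json.loads(data)
    if set(rep) != FIELDS_BY_TYPE[rep["type"]] | {"type", "id"}:
        raise ValueError("SiteRep fields do not match its type")
    dist_state = {key: value for key, value in rep.items() if key != "id"}
    # LocalState starts out with a default value and is never marshaled.
    return {"id": rep["id"], "dist_state": dist_state, "local_state": None}
```

The type field is read first because it determines which other fields
the unmarshaler should expect.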
	       
Fetching connection functors

After unmarshaling a SiteRep a DSite is created with the identity of
the remote site. After this the DistState and the LocalState fields
must be created. The LocalState is initialized to some default
value. The DistState should be initialized in accordance with the
SiteRep.

With all three supported types the parameter is unmarshaled and filled
in. Then the connection functor needs to be taken care of.
In the default case a reference is just created to the globally known
default connection functor.
In the replicated case the SiteRep comes with a GName and a URL. A
check is first done in the GName table to see if the functor is
already present. If it is, it is filled in; if it is not, it is left
blank until fetched. The DSite can then be used immediately, but use of
it will suspend until a connection procedure is present.

A connection functor can be fetched either lazily or eagerly. A
Dynamic DSite is restricted to lazy fetching, but a Replicated DSite
can do eager fetching. It is possible to initiate a connection without
a present connection functor, independent of fetching strategy. When
the connection wrapper is to execute the connection functor, it will
either execute it if it is present or load it from the specified URL. A
connection functor could have been imported over some other channel
than by a fetch initiated from the DSite. A name lookup should
therefore always be done before a fetch operation is initiated if a
GName is present. A site might also be in a waiting state, where it
waits for a fetch operation to finish. The connection wrapper should
then not initiate another fetch operation, to avoid double work. This
can happen if we use eager fetching or if we abort some connection
attempt due to some timeout.
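
The rules above for the replicated case (use the functor if present,
look up the GName before fetching, never start a second fetch while
one is pending) can be sketched as follows. The sketch is in Python
for illustration; FunctorSlot, ensure_functor, fetch_done and the
dictionary used as GName table are hypothetical names.

```python
from typing import Callable, Dict, Optional


class FunctorSlot:
    """Functor-related part of a replicated DSite (illustrative only)."""

    def __init__(self, gname: Optional[str], url: str) -> None:
        self.gname = gname
        self.url = url
        self.functor: Optional[str] = None
        self.fetching = False


def ensure_functor(slot: FunctorSlot,
                   gname_table: Dict[str, str],
                   start_fetch: Callable[[str], None]) -> Optional[str]:
    """Return the functor if available, else make sure one fetch is running."""
    if slot.functor is not None:
        return slot.functor
    if slot.gname is not None and slot.gname in gname_table:
        # The functor arrived over some other channel: no fetch needed.
        slot.functor = gname_table[slot.gname]
        return slot.functor
    if not slot.fetching:             # avoid double work
        slot.fetching = True
        start_fetch(slot.url)
    return None                       # the caller suspends until fetch_done


def fetch_done(slot: FunctorSlot,
               gname_table: Dict[str, str],
               functor: str) -> None:
    """Record a finished fetch and make the functor known by its GName."""
    slot.fetching = False
    slot.functor = functor
    if slot.gname is not None:
        gname_table[slot.gname] = functor
```

Eager fetching corresponds to calling ensure_functor right after
unmarshaling, lazy fetching to calling it at the first connection
attempt.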

Connection establishment

This section describes the actual flow of control when establishing a
connection.

At the connecting site:
1. The ComObject gets a need for a channel since someone is trying to
send a message. This causes a request for a connection (to the
ComObject equivalent to requesting a working TransObject).
2. The ConnectionStub fetches the state fields out of the DSite and
forwards the request to the ConnectAcceptModule. It also creates
a requester id that will be used in communication with the
ConnectAcceptModule to identify which ComObject this request belongs
to.
3. The ConnectAcceptModule retrieves information from the state
fields and sets up a ConnectionWrapper. The connection functor is then
invoked with the parameter retrieved from the DistState field.
4. The connection functor gets to run. It is allowed to establish a
connection however it wants, given the limited environment. Once
finished it hands over the connection to the ConnectAcceptModule. The
default connection functor runs a-d: 
  a) A TCP socket is set up and connected to the remote site.
  b) A grant to use a resource of the TCP transport module is
  requested. This request is forwarded to the ConnectionStub by the
  ConnectionWrapper and may cause the request to be queued if no
  resources are available.
  c) Using the TCP protocol is suggested to the remote site.
  d) The response is read and will hopefully result in a handover. The
  used file descriptor is then specified as a setup parameter.
  Any of these steps may fail, and may cause the connection attempt to
  be retried, or resources to be handed back and a failed connection to
  be reported to the ConnectionWrapper.
  Steps b-d should be repeated to be able to settle on some other
  transport medium, once others are available.
5. The ConnectAcceptModule cleans up any missed resources and
forwards the hand over with the setup parameter to the ConnectionStub.
6. The ConnectionStub sets up the TransObject used with the setup
parameter and hands it to the requesting ComObject.

At the accepting site:
1. The running accept functor accepts some incoming channel and runs
its protocol. This could result in a failure that will just be
dropped, or a handover to the ConnectionStub, specifying accept as the
requester. The default accept functor runs a-c:
  a) A TCP channel is accepted.
  b) An incoming suggestion to use TCP is read.
  c) A grant to use a resource of the TCP transport module is
  requested. Here it will be specified not to queue the request if
  resources are out. In that case we will simply answer no, allowing
  the remote site to try using some other transport medium or retry later.
  If a resource was granted, we answer yes and do a handover as
  specified above.
2. A handover to the ConnectionStub causes the granted TransObject to be
set up using any setup parameter, and a fresh anonymous ComObject to
be created.
3. The anonymous ComObject is handed the TransObject and ordered to
accept.

After these steps the open protocol of the ComObjects is run and then
messages will be transferred.
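
The suggest/answer exchange of the default functors (steps c-d at the
connecting site and b-c at the accepting site) can be simulated
locally. The sketch below is in Python for illustration and uses a
socket pair in place of a real TCP connection. The wire format (a
medium name and a "yes"/"no" answer, each terminated by a newline) is
an assumption of the example, not the protocol used by Mozart.

```python
import socket
from typing import Callable, Tuple


def read_line(sock: socket.socket) -> str:
    """Read one newline-terminated message from the channel."""
    data = b""
    while not data.endswith(b"\n"):
        chunk = sock.recv(1)
        if not chunk:
            raise ConnectionError("peer closed the channel")
        data += chunk
    return data[:-1].decode()


def accept_side(sock: socket.socket,
                grant_available: Callable[[], bool]) -> bool:
    """Read the suggestion and answer yes only if a grant can be had now."""
    medium = read_line(sock)
    ok = medium == "tcp" and grant_available()   # never queue on accept
    sock.sendall(b"yes\n" if ok else b"no\n")
    return ok


def demo(grant_available: Callable[[], bool]) -> Tuple[bool, str]:
    """Run one suggest/answer exchange over a local socket pair."""
    connector, acceptor = socket.socketpair()
    try:
        connector.sendall(b"tcp\n")               # suggest using TCP
        accepted = accept_side(acceptor, grant_available)
        answer = read_line(connector)             # read the response
        # On "yes" both sides would now hand over their file descriptor.
        return accepted, answer
    finally:
        connector.close()
        acceptor.close()
```

A "no" answer leaves the channel open, so the connecting side can
suggest another medium or retry later.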

Security

The only interface to the Mozart host where the connection functor is
executing goes through the module exposed by the connection wrapper. The
connection wrapper exposes only a limited subset of the functionality
available in Mozart. All interface procedures are monitored by
the connection wrapper, and all allocated resources are accounted to the
executing connection functor. The connection functor has no direct
interface to the Mozart engine.
A connection functor can finish its execution in three ways: it
succeeds in establishing the connection, it fails to establish the
connection, or it is aborted by the ComObject's timer. In all three cases
the connection wrapper ensures that all resources are freed. The
connection functor can hand over resources to the DS. Resources that
have been handed over will not be freed by the connection wrapper; that
is the responsibility of the DS.
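
The accounting rule above (free everything at exit except what was
handed over) can be sketched as follows. The sketch is in Python for
illustration; WrapperAccounting and its method names are hypothetical
and the release callback stands in for whatever returns a resource to
its owner.

```python
from typing import Callable, List


class WrapperAccounting:
    """Tracks the resources allocated by one connection functor."""

    OUTCOMES = ("success", "failure", "aborted")

    def __init__(self, release: Callable[[str], None]) -> None:
        self._release = release
        self._held: List[str] = []

    def grant(self, resource: str) -> str:
        self._held.append(resource)   # accounted to the running functor
        return resource

    def handover(self, resource: str) -> None:
        # From now on the DS is responsible for this resource.
        self._held.remove(resource)

    def finish(self, outcome: str) -> None:
        """Called for all three ways a connection functor can end."""
        if outcome not in self.OUTCOMES:
            raise ValueError(outcome)
        while self._held:
            self._release(self._held.pop())
```

The same finish call is used for success, failure and abort, so a
functor cannot leak resources by ending in an unexpected way.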

Fault detection and propagation

There are two types of possible faults that the DS can discover,
Permanently Lost Site (perm) and Temporarily Lost Site (temp).
Temp is discovered by measuring the round-trip time. Measuring the
round-trip time is done by the ComObject, since it contains the
knowledge of the history of the channel. Discovering perm is done by the
connection functor. The connection functor has explicit knowledge
of the behavior of the remote peer and can make better assumptions on
when to define sites as perm. It is important to understand that perm
should only be declared when it is completely certain, since
perm is a non-transient state and affects all the protocols of the
DS. Defining sites as perm is possible through the connection wrapper
module.
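
The division of labour between temp and perm can be sketched as a
small state holder. The sketch is in Python for illustration only.
The class name, the method names and the use of a fixed round-trip
threshold are assumptions of the example; the document does not say
how the measured round-trip is turned into a temp decision.

```python
class FaultMonitor:
    """Fault state of one remote site as seen from the local site."""

    def __init__(self, temp_threshold_s: float) -> None:
        self.temp_threshold_s = temp_threshold_s   # assumed fixed threshold
        self.state = "ok"

    def on_round_trip(self, rtt_s: float) -> str:
        """Called by the ComObject with a measured round-trip time."""
        if self.state != "perm":                   # perm is never left
            slow = rtt_s > self.temp_threshold_s
            self.state = "temp" if slow else "ok"
        return self.state

    def declare_perm(self) -> str:
        """Called on behalf of the connection functor, only when certain."""
        self.state = "perm"
        return self.state
```

Temp can be entered and left any number of times, while perm is a
final state.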

Resource handling

Connection functors will use OS resources for their communication and
cause data structures that use memory and other OS resources to be
allocated in the transport layer. Since both memory and other OS
resources are limited, the resource usage of the connection functors
must be controlled in some way. This is done by the ConnectAcceptModule,
which keeps track of all resources used. That is possible since no
modules other than the ConnectionWrapper may be imported and used.
When a connection functor is done, all resources allocated but not
handed over must be freed. This is also true when a connection attempt
is cancelled from the ComObject or aborted due to a timeout.

A resource in the transport layer is represented by a grant to use a
TransObject of a certain type. When such a grant is requested by the
connection functor, it may specify if it is willing to wait for
resources to be freed, or if it wants an immediate return to be able
to try some other transport module if no resources are available.
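
The two ways of requesting a grant can be sketched as follows. The
sketch is in Python for illustration; GrantPool and its methods are
hypothetical and only loosely mirror getConnGrant with its CanWait
argument and freeConnGrant.

```python
from collections import deque
from typing import Deque, Optional


class GrantPool:
    """Grants for TransObjects of one transport module."""

    def __init__(self, capacity: int) -> None:
        self.free = capacity
        self.waiting: Deque[str] = deque()

    def get_grant(self, requester: str, can_wait: bool) -> str:
        if self.free > 0:
            self.free -= 1
            return "granted"
        if can_wait:
            self.waiting.append(requester)   # served when a grant is freed
            return "queued"
        return "denied"                      # caller may try another module

    def free_grant(self) -> Optional[str]:
        """Return a grant; pass it on to the oldest waiter if there is one."""
        if self.waiting:
            return self.waiting.popleft()
        self.free += 1
        return None
```

An accept functor would typically pass can_wait=False, so that it can
answer the remote site immediately.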

File Descriptors

File descriptors (FDs) are the resource used by TCP. They must be
treated with extra care since many parts of the site need FDs, e.g. the
graphics engine, loading of pickles, contacting other sites, receiving
connections, transportation of data in some transport modules, user
modules and so on. FDs are scarce. In some implementations of Unix
there is a maximum of only 256 FDs available per process.
Since connection attempts and accepts use FDs, a number of FDs must
be allocated for these jobs. If a site runs out of FDs, it will
be unable to respond to incoming connection attempts and to make
connection attempts of its own.

To avoid monopolization of FDs by the different parts of the DS,
three thresholds are defined: one for connection attempts, one for
accepts and one for the different transport modules. These thresholds
define the maximum number of FDs that can be allocated by each activity. If
one activity runs out it must do some kind of caching. The Mozart
system comes with default values for these thresholds, typically a
fraction of the available amount. The values are adjustable through
the ConnectAcceptModule.
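
The three thresholds can be modelled as independent budgets. The
sketch below is in Python for illustration; FdBudget, the activity
names and the example limits are assumptions, not the defaults shipped
with Mozart.

```python
from typing import Dict


class FdBudget:
    """Per-activity limits on the number of file descriptors in use."""

    def __init__(self, limits: Dict[str, int]) -> None:
        self.limits = dict(limits)
        self.used = {activity: 0 for activity in limits}

    def acquire(self, activity: str) -> bool:
        """Take one FD for `activity`; refuse if its threshold is reached."""
        if self.used[activity] >= self.limits[activity]:
            return False
        self.used[activity] += 1
        return True

    def release(self, activity: str) -> None:
        if self.used[activity] == 0:
            raise ValueError(f"{activity} holds no file descriptors")
        self.used[activity] -= 1


# Example limits only: a fraction of a 256 FD process limit per activity.
budget = FdBudget({"connect": 16, "accept": 16, "transport": 64})
```

Since each activity is counted separately, exhausting the transport
budget never prevents the site from accepting or initiating
connections.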

Since not only the DS needs file descriptors, a centralized control
mechanism for file descriptors should be implemented in the Mozart
system in the future.

-----------------------------------------------------------------------

A default implementation

-----------------------------------------------------------------------

Current status of the Mozart Distributed Subsystem

-----------------------------------------------------------------------

Appendix: Interfaces

The new module DPinit:

A new module that takes responsibility for setting up the DS
when it is needed is added to the Mozart environment. The module is
named DPinit.oz and is loaded on need. DPinit takes over the
responsibility for loading the dynamically linked C++ binaries and
setting up all parameters for the communication.
DPinit will be the interface to all changeable parameters in the DS
that today go through procedures exposed by DPMisc. Having a
separate Mozart module that takes care of installing the DS will
simplify parameter passing to the DS.
It is important to note that DPinit must be loaded before any of
the builtins from the DS can be used.

init() starts the DS.
 
The connection wrapper module:

A connection procedure will be run within a module manager where a
connection wrapper is inserted. The connection wrapper provides the
following interface:

getConnGrant(+Type +CanWait ?Grant)
handover(+Grant +SetupParameter)
connFailed(+Reason)
freeConnGrant(+Grant)
getLocalState(?State)
putLocalState(State)

getAppState(+Key ?Val)
condGetAppState(+Key AltVal ?Val)
putAppState(+Key Val)

The first six are provided via the ConnectionStub builtin. The
corresponding procedures there take an additional argument consisting
of the Requester.

ConnectAcceptModule
initConnect(Stream)
initAccept()
getAppState(+Key ?Val)
condGetAppState(+Key AltVal ?Val)
putAppState(+Key Val)

The accept procedure is run in a less strictly guarded accept wrapper
that has the six procedures of the connection wrapper above and uses
accept as the requester.

DSite contains:
      id
      DistOzState:  diststate(location:URL
			      gname:GName
			      parameter:PString
			      type:Atom)
      LocalOzState: localstate(connectionFunctor:Funct
			       localState:OzTerm)

A connection functor must export a procedure Connect:
{Connect DistOzState.parameter}

To call a connect procedure, a special record must be sent on the Port
corresponding to the stream in initConnect:
connect(Requester LocalOzState DistOzState)
This in turn will invoke Connect of LocalOzState.connectionFunctor in
a suitable environment.

	* Transport media










