Affiliation:
1. The MITRE Corporation, Bedford, MA
Abstract
MITRE, through its Future Generation Computer Architectures program, has conducted research in parallel computing since 1983 [2-5, 7]. Our research is currently directed toward operating systems for massive distributed-memory MIMDs running general-purpose, object-oriented programs. Scalability and reliability are central concerns in our research. To us, scalability means that a system can be expanded incrementally, and the addition of processors always increases the processing power of the system. Reliability means that application programs continue to run, and run correctly, in spite of isolated hardware failures.
For our research, we assume a message-passing system with no shared memory and no broadcast facility. We also assume the network changes with time because processors fail and because processors are added to the running system. The system must be able to recognize failed processors and avoid them. It must also be able to recognize new processors and make them useful members of the working community while the system is running. The only kind of failure we consider is catastrophic processor failure. We assume that conventional error-detecting and correcting techniques are used to ensure that processors that function do so in a fault-free manner and that communication between processors is reliable.
Our research has taught us that the constraints imposed by a massive message-passing architecture, with even the narrow fault tolerance goals we have described, demand a view that emphasizes control at the local level, coordination at the global level, and the ability to tolerate inexact information. The foremost lesson we have learned about such systems is that they are nonintuitive: algorithms that work well with few processors may not scale well to systems with many processors — in general, you must do things differently to keep overhead from overwhelming the system. We have observed algorithms perform well in simulated systems of 256 processors and break down in systems of 1024. (We look forward to seeing what happens when we can simulate systems of substantially greater size.) Another lesson we have learned is that the cost of implementing a fault-tolerant algorithm can be very high and depends a great deal on the nature of the algorithm — the degree to which it employs local control and can tolerate inexact information.
Our model of computation is object-oriented programming; for us, objects represent the independent computational units that can enable parallelism within a program. We have not yet committed to a particular object-oriented language; instead, we have concentrated on operating system issues we feel are common to supporting many distributed object-oriented programming systems. Our research specifically concerns distributed techniques for resource management, object addressing, garbage collection, and computation management. We constrain ourselves to techniques that do not require centralized control or global information, and can be made tolerant of hardware failures.
We expect that objects are dynamically created and discarded as a program executes. We think of objects as representing units of work assigned to processors. Memory management must provide a means of finding processors with sufficient free memory to store new objects. Processor management must provide a means of dynamically balancing the load on processors by distributing objects in a relatively equitable manner.
We have developed a resource management strategy that meets these requirements. One feature of the scheme is that objects are distributed among processors in an equitable manner when they are created; another feature is that objects are redistributed among processors dynamically to maintain a relatively equitable distribution.
The scheme is based on the use of resource agents: operating system servers distributed throughout the system. Each agent manages memory allocation for the processors in its local communication neighborhood. A processor sends allocation requests to its local agent; that agent assigns each request to whichever processor in its neighborhood it deems most appropriate. If there is none, the agent forwards the request to a superagent, another operating system server that oversees activity in several agents' neighborhoods. If the superagent has another
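A minimal Python sketch of the request path described above, offered as an illustration rather than as the authors' implementation. The class and method names are hypothetical. Two policies are assumptions not stated in the text: "most appropriate" is taken to mean the processor with the most free memory that can still hold the request, and the superagent is assumed to try the other agents it oversees in turn. Direct method calls stand in for the messages a real distributed system would exchange.

```python
class Agent:
    """Resource agent for one communication neighborhood (hypothetical names)."""

    def __init__(self, free_memory, superagent=None):
        self.free = dict(free_memory)   # processor id -> free memory units
        self.superagent = superagent

    def place_locally(self, size):
        """Pick a processor in this neighborhood, or return None if none fits."""
        fits = [p for p, free in self.free.items() if free >= size]
        if not fits:
            return None
        # Assumed policy: prefer the processor with the most free memory.
        best = max(fits, key=lambda p: self.free[p])
        self.free[best] -= size
        return best

    def allocate(self, size):
        """Handle a request from a local processor; forward it if nothing fits."""
        chosen = self.place_locally(size)
        if chosen is None and self.superagent is not None:
            return self.superagent.allocate(size, exclude=self)
        return chosen


class SuperAgent:
    """Oversees several agents' neighborhoods."""

    def __init__(self):
        self.agents = []

    def allocate(self, size, exclude=None):
        # Assumption: offer the request to each other agent in turn.
        for agent in self.agents:
            if agent is not exclude:
                chosen = agent.place_locally(size)
                if chosen is not None:
                    return chosen
        return None
```

Each agent consults only the state of its own neighborhood, which matches the stated constraint of avoiding centralized control and global information.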
Time Warp [6] to ensure that messages are processed in the correct order by each object. Time Warp was proposed to synchronize the execution of discrete-event simulations on multiprocessors. While our approach is based on Time Warp, it extends the mechanism to facilitate general-purpose programming.
In Time Warp, an object processes messages as they arrive, but before processing each message, it saves its state. If a message arrives with a simulation time earlier than a message already processed, the object rolls back to a state at or before the time of the new message, processes that message at the correct simulation time, and reprocesses messages over which it rolled back.
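The save-and-roll-back cycle described above can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the mechanism of [6] in full: the class name `TimeWarpObject` and its fields are hypothetical, the object's "state" is simply the list of payloads in the order they were applied, and the sketch omits cancelling any messages the object itself sent during the rolled-back work.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class TimeWarpObject:
    """Toy object whose state records payloads in the order they were applied."""
    state: list = field(default_factory=list)
    # Messages processed so far as (time, payload), kept in timestamp order.
    processed: list = field(default_factory=list)
    # snapshots[i] is the state saved just before processed[i] was handled.
    snapshots: list = field(default_factory=list)
    rollbacks: int = 0

    def _process(self, time, payload):
        self.snapshots.append(copy.deepcopy(self.state))  # save state first
        self.state.append(payload)                        # stand-in for a method body
        self.processed.append((time, payload))

    def receive(self, time, payload):
        # Messages already processed with a later simulation time must be redone.
        redo = [m for m in self.processed if m[0] > time]
        if redo:
            # Straggler: restore the state saved before the first later message.
            keep = len(self.processed) - len(redo)
            self.state = self.snapshots[keep]
            del self.processed[keep:]
            del self.snapshots[keep:]
            self.rollbacks += 1
        self._process(time, payload)
        for t, p in redo:  # reprocess the messages that were rolled back
            self._process(t, p)
```

For example, delivering messages with times 10, 30, and then 20 triggers one rollback, after which the state reflects timestamp order rather than arrival order.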
There are two kinds of messages that can be sent in Time Warp: event messages and query messages. A query message cannot cause side effects, and always returns a reply message to the sender. An event message can cause side effects, and never returns a reply message. In general-purpose computation, it is common for a method to send a message and use the result that is returned. If that message is an event message, Time Warp forces the programmer to actually write two methods: the event message is sent in the first and the result is used in the second. A second event message must be introduced to signal the availability of the result and trigger the execution of the second method. It is up to the programmer to make sure that the second event message is processed correctly if other messages can be received before it.
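To make the two-method structure concrete, here is a small Python sketch. All names (`send_event`, `Counter`, `Client`, `increment_done`) are hypothetical, and a single in-process queue stands in for message delivery; simulation times and rollback are left out so that only the shape of the program is shown.

```python
from collections import deque

queue = deque()  # pending event messages: (target, method name, args)


def send_event(target, method, *args):
    """Event messages may cause side effects but never return a reply."""
    queue.append((target, method, args))


class Counter:
    def __init__(self):
        self.value = 0

    def increment(self, requester):
        self.value += 1  # the side effect
        # No reply is possible, so the result travels in a second event message.
        send_event(requester, "increment_done", self.value)


class Client:
    def __init__(self, counter):
        self.counter = counter
        self.seen = []

    def start(self):
        # First method: send the event message; the result cannot be used here.
        send_event(self.counter, "increment", self)

    def increment_done(self, result):
        # Second method: triggered by the second event message.
        self.seen.append(result)


def run():
    """Deliver queued event messages until none remain."""
    while queue:
        target, method, args = queue.popleft()
        getattr(target, method)(*args)
```

What a programmer would naturally write as one method, "increment the counter, then use the new value", is split across `start` and `increment_done`, and any state needed by the second half has to be kept on the object between the two.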
These restrictions may be natural in the context of simulations of real-world situations; however, they force a programmer to structure general-purpose programs in an unnatural way that is by no means trivial.
Time Warp places another restriction on the programmer. A cycle of recursive query messages can be processed at the same simulation time; however, a cycle of recursive event messages cannot. The purpose of this restriction is to avoid Time Warp's equivalent of deadlock: infinite rollback. However, it makes side-effecting recursion difficult to accomplish. It requires the programmer to manage the timing of events so that no intervening messages are processed while the recursion is in progress. This is not an easy task, since it may be hard to predict (until execution time) the depth of a recursion and, hence, the number of messages involved.
In our model, a sending object can send a side-effecting message and use its result later in the same method, and side-effecting recursion is fully supported. There are two reasons for this. First, our model of execution controls time in a manner that is completely transparent to the programmer. Our computation time dynamically and automatically attains as fine a granularity as is necessary to support replies from side-effecting messages and side-effecting recursion. Second, our model allows method execution to be rolled back so that side-effecting recursion can be handled correctly.
We currently have code for a computation manager that embodies our model of execution and runs compiled programs on a multiprocessor simulator. The compiler takes object-oriented programs written in a subset of Common Lisp using Flavors and produces programs in which each object class is combined with an executive class that implements the operations of the model of execution. The compiler generates compiled methods in which each pseudo-instruction is a piece of Lisp code; method execution consists of stepping through these instructions, saving state information as appropriate.
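The idea of a compiled method as a sequence of pseudo-instructions that can be stepped through and rolled back might look roughly like the following. The original system is written in Common Lisp with Flavors; this Python sketch is only an analogy, the name `MethodExecution` is hypothetical, and saving the environment before every instruction is an assumption made for simplicity (the text says state is saved "as appropriate").

```python
import copy


class MethodExecution:
    """Runs a compiled method: a list of callables that each update an environment."""

    def __init__(self, instructions, env=None):
        self.instructions = instructions
        self.env = dict(env or {})
        self.pc = 0        # index of the next instruction to execute
        self.saved = []    # saved[i] is the environment before instruction i

    def step(self):
        self.saved.append(copy.deepcopy(self.env))  # save state before the step
        self.instructions[self.pc](self.env)
        self.pc += 1

    def run(self):
        while self.pc < len(self.instructions):
            self.step()

    def rollback_to(self, pc):
        """Undo execution back to the point just before instruction `pc`."""
        self.env = self.saved[pc]
        del self.saved[pc:]
        self.pc = pc
```

Because execution is broken into small steps with saved state, a method can be resumed from the middle after a rollback instead of being restarted from its first instruction.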
Our next goal is to implement a fault-tolerant version of our computation manager on a multiprocessor. Our objective this year was to design a model of execution for distributed, general-purpose, object-oriented computation to support concurrency in a manner transparent to the programmer. The mechanisms we have described involve considerable overhead; a major goal for the future is to develop techniques for reducing overhead sufficiently to make this approach practical.
We also plan to add computation management to an existing simulation that includes fault-tolerant resource management as described above. We will also implement one of the object-addressing schemes and one of the garbage-collection schemes we have described. The integrated simulation will be used to observe the functioning of the operating system as a whole and will itself be implemented on a multiprocessor system.
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Graphics and Computer-Aided Design, Software