Affiliation:
1. Department of Math and Computer Science, Emory University, Atlanta, GA 30322, USA
Abstract
We propose lightweight middleware solutions that facilitate and simplify the execution of failure-resilient Message Passing Interface (MPI) programs across multidomain clusters. The system described in this paper leverages H2O, a distributed metacomputing framework, to route MPI message passing across heterogeneous aggregates located in different administrative or network domains. MPI communication is aided by a specially written H2O pluglet; messages that are destined for remote sites are intercepted and transparently forwarded to their final destinations. We demonstrate that the proposed technique is indeed effective in enabling communication by MPI programs across distinct clusters and across firewalls. Only marginally lowered performance was observed in our tests, and we believe the substantially increased functionality would compensate for this overhead in most situations. In addition to enabling multicluster communications, we note that with the increasing size and distribution of metacomputing environments, fault tolerance aspects become critically important. We argue that the fault tolerance model proposed by FT-MPI fits well in geographically distributed environments, even though its current implementation is confined to a single administrative domain. We describe extensions to overcome these limitations by combining FT-MPI with the H2O framework. Our holistic approach allows users to run fault-tolerant MPI programs on heterogeneous, geographically distributed shared machines, without sacrificing performance and with minimal involvement of resource providers.
Subject
Hardware and Architecture, Theoretical Computer Science, Software
Cited by
6 articles.
1. Supporting data management on cluster grids;Future Generation Computer Systems;2008-02
2. A Web Services Gateway for the H2O Lightweight Grid Computing Framework;Towards a Service-Based Internet;2008
3. Parallel computation on multilayer cluster grids;Concurrency and Computation: Practice and Experience;2007
4. Running PVM Applications on Multidomain Clusters;Recent Advances in Parallel Virtual Machine and Message Passing Interface;2006
5. Exploiting Multidomain Non Routable Networks;Parallel and Distributed Processing and Applications;2006