Affiliation:
1. Dept. of Computer Science, University of Maryland, College Park, MD
2. Dept. of Computer Science, University of California, Santa Barbara, CA
3. Dept. of Computer Science, University of Maryland, College Park, MD and Dept. of Pathology, Johns Hopkins Medical Institutions, Baltimore, MD
Abstract
As computational power and storage capacity increase, processing
and analyzing large volumes of data play an increasingly important
part in many domains of scientific research. Typical examples of
large scientific datasets include the output of long-running
simulations of time-dependent phenomena that periodically generate
snapshots of their state (e.g., hydrodynamics and chemical transport
simulation for estimating pollution impact on water bodies [4, 6, 20],
magnetohydrodynamics simulation of planetary magnetospheres [32],
simulation of a flame sweeping through a volume [28], airplane wake
simulations [21]), archives of raw and processed remote sensing
data (e.g., AVHRR [25], Thematic Mapper [17], MODIS [22]), and
archives of medical images (e.g., confocal light microscopy, CT
imaging, MRI, sonography).
These datasets are usually multi-dimensional. The data
dimensions can be spatial coordinates, time, or experimental
conditions such as temperature, velocity or magnetic field. The
importance of such datasets has been recognized by several database
research groups and vendors, and several systems have been
developed for managing and/or visualizing them [2, 7, 14, 19, 26,
27, 29, 31].
These systems, however, focus on lineage management, retrieval,
and visualization of multi-dimensional datasets. They provide
little or no support for analyzing or processing these datasets,
on the assumption that such processing is too application-specific
to warrant common support. As a result, applications that process
these datasets are usually decoupled from data storage and
management, which causes inefficiency due to copying and loss of
locality. Furthermore, every application developer has to implement
complex support for managing and scheduling the processing.
Over the past three years, we have been working with several
scientific research groups to understand the processing
requirements for such applications [1, 5, 6, 10, 18, 23, 24, 28].
Our study of a large set of applications indicates that the
processing for such datasets is often highly stylized and shares
several important characteristics. Usually, both the input dataset
and the result being computed have underlying
multi-dimensional grids, and queries into the dataset are in the
form of ranges within each dimension of the grid. The basic
processing step usually consists of transforming individual input
items, mapping the transformed items to the output grid and
computing output items by aggregating, in some way, all the
transformed input items mapped to the corresponding grid point. For
example, remote-sensing earth images are often generated by
performing atmospheric correction on several days' worth of raw
telemetry data, mapping all the data to a latitude-longitude grid,
and selecting those measurements that provide the clearest
view.
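The processing step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not code from T2: the function names (`process`, `in_range`, `transform`, `map_to_cell`, `aggregate`), the one-degree grid, the scaling used as a stand-in for atmospheric correction, and the use of `max` to pick the "clearest" measurement are all assumptions made for the example.

```python
import math
from typing import Callable, Dict, Iterable, Tuple

Coords = Tuple[float, float]   # input position, e.g. (longitude, latitude)
Cell = Tuple[int, int]         # index of a point on the output grid


def process(
    items: Iterable[Tuple[Coords, float]],
    in_range: Callable[[Coords], bool],
    transform: Callable[[float], float],
    map_to_cell: Callable[[Coords], Cell],
    aggregate: Callable[[float, float], float],
) -> Dict[Cell, float]:
    """Transform each selected input item, map it to an output cell,
    and aggregate all items that land on the same cell."""
    out: Dict[Cell, float] = {}
    for coords, value in items:
        if not in_range(coords):          # range query on each dimension
            continue
        v = transform(value)              # per-item transformation
        cell = map_to_cell(coords)        # mapping to the output grid
        out[cell] = v if cell not in out else aggregate(out[cell], v)
    return out


# Hypothetical data: ((lon, lat), raw measurement).
items = [((0.2, 0.7), 10.0), ((0.9, 0.1), 14.0),
         ((1.5, 0.5), 6.0), ((5.0, 5.0), 100.0)]

composite = process(
    items,
    in_range=lambda c: 0.0 <= c[0] < 2.0 and 0.0 <= c[1] < 1.0,
    transform=lambda v: v * 0.5,          # stand-in for a real correction
    map_to_cell=lambda c: (math.floor(c[0]), math.floor(c[1])),
    aggregate=max,                        # stand-in for "clearest view"
)
```

With these inputs, the two measurements falling in the first cell are reduced to a single value and the item outside the query range is ignored.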
In this paper, we present T2, a customizable parallel
database that integrates storage, retrieval and processing of
multi-dimensional datasets. T2 provides support for many operations
including index generation, data retrieval, memory management,
scheduling of processing across a parallel machine and user
interaction. It achieves its primary advantage from the ability to
seamlessly integrate data retrieval and processing for a wide
variety of applications and from the ability to maintain and
process multiple datasets with different underlying grids. Most
other systems for multi-dimensional data have focused on uniformly
distributed datasets, such as images, maps, and dense
multi-dimensional arrays. Many real datasets, however, are
non-uniform or unstructured. For example, satellite data is a
two-dimensional strip that is embedded in a three-dimensional space,
and water contamination studies use unstructured meshes to
selectively simulate regions. T2 can handle both uniform and
non-uniform datasets.
T2 has been developed as a set of modular services. Since its
structure mirrors that of a wide variety of applications, T2 is
easy to customize for different types of processing. To build a
version of T2 customized for a particular application, a user has
to provide functions to pre-process the input data, map input data
to elements in the output data, and aggregate multiple input data
items that map to the same output element.
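One way to picture this customization is as a small plug-in interface with three user-supplied functions. The sketch below is hypothetical: the class and method names (`Customization`, `preprocess`, `map_to_output`, `aggregate`, `run`) are invented for illustration and are not T2's actual interface, and the example averages values on a one-dimensional grid purely to keep the code short.

```python
from abc import ABC, abstractmethod


class Customization(ABC):
    """Hypothetical plug-in interface; the names are assumptions."""

    @abstractmethod
    def preprocess(self, raw):
        """Pre-process one raw input item."""

    @abstractmethod
    def map_to_output(self, item):
        """Return the output element that this item maps to."""

    @abstractmethod
    def aggregate(self, acc, item):
        """Fold an item into the accumulator of its output element."""


class MeanPerCell(Customization):
    """Example customization: average values per fixed-width cell."""

    def __init__(self, cell_size):
        self.cell_size = cell_size

    def preprocess(self, raw):
        x, value = raw
        return (x, float(value))

    def map_to_output(self, item):
        x, _ = item
        return int(x // self.cell_size)

    def aggregate(self, acc, item):
        # The accumulator is a (sum, count) pair; None means "empty".
        total, count = acc if acc is not None else (0.0, 0)
        return (total + item[1], count + 1)


def run(custom, items):
    """Generic driver that only calls the three user functions."""
    acc = {}
    for raw in items:
        item = custom.preprocess(raw)
        key = custom.map_to_output(item)
        acc[key] = custom.aggregate(acc.get(key), item)
    return acc


result = run(MeanPerCell(cell_size=10.0), [(1.0, 4), (9.0, 6), (12.0, 5)])
means = {k: total / count for k, (total, count) in result.items()}
```

Keeping the driver independent of the three functions is the point of the sketch: a different application would swap in another `Customization` subclass without touching `run`.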
T2 presents a uniform interface to the end users (the clients of
the database system). Users specify the dataset(s) of interest, a
region of interest within the dataset(s), and the desired format
and resolution of the output. In addition, they select the mapping
and aggregation functions to be used. T2 analyzes the user request,
builds a suitable plan to retrieve and process the datasets,
executes the plan and presents the results in the desired
format.
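A user request of this kind could be modeled as a small record, with a planning step that decides which stored data to retrieve. The sketch below is an assumption-laden illustration rather than T2's design: the `Request` fields, the idea that data is stored in chunks indexed by bounding box, and the `plan` function are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Range = Tuple[float, float]        # inclusive (low, high) in one dimension
Box = Tuple[Range, ...]            # one range per dimension


@dataclass(frozen=True)
class Request:
    """Hypothetical request record; field names are assumptions."""
    dataset: str
    region: Box                    # region of interest
    resolution: Tuple[int, ...]    # output grid size per dimension
    map_fn: str                    # name of the selected mapping function
    aggregate_fn: str              # name of the selected aggregation function


def intersects(a: Box, b: Box) -> bool:
    """True if two boxes overlap in every dimension."""
    return all(lo1 <= hi2 and lo2 <= hi1
               for (lo1, hi1), (lo2, hi2) in zip(a, b))


def plan(request: Request, chunk_index: Dict[str, Box]) -> List[str]:
    """Select the (assumed) data chunks whose bounding boxes overlap
    the requested region; only these need to be retrieved."""
    return sorted(cid for cid, box in chunk_index.items()
                  if intersects(box, request.region))


# Hypothetical index: chunk id -> bounding box.
chunk_index = {
    "c0": ((0.0, 10.0), (0.0, 10.0)),
    "c1": ((10.0, 20.0), (0.0, 10.0)),
    "c2": ((30.0, 40.0), (0.0, 10.0)),
}
request = Request(
    dataset="example",
    region=((5.0, 12.0), (2.0, 3.0)),
    resolution=(64, 64),
    map_fn="nearest_cell",
    aggregate_fn="max",
)
chunks = plan(request, chunk_index)
```

Here only the two chunks that overlap the region of interest are selected; a real planner would also have to decide how to schedule the processing across processors, which this sketch does not attempt.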
In Section 2 we first present several motivating applications
and illustrate their common structure. Section 3 then presents an
overview of T2, including its distinguishing features and a running
example. Section 4 describes each database service in some detail.
An example of how to customize several of the database services for
a particular application is given in Section 5. T2 is a system in
evolution. We conclude in Section 6 with a description of the
current status of both the T2 design and the implementation of
various applications with T2.
Publisher
Association for Computing Machinery (ACM)
Subject
Information Systems, Software
Cited by
12 articles.