ABSTRACTS

INTERFACE 2000
CONTRIBUTED PAPER ABSTRACTS

Strategies for Investigating Geophysical and Other Complex Data
Chair: Samantha Bates
University of Washington
 
Compressing Massive Geophysical Datasets Using Quantization
Amy Braverman
Jet Propulsion Laboratory, California Institute of Technology
In this work we set forth a method for compressing massive geophysical datasets like those that will be obtained from NASA's Earth Observing System Terra satellite. We develop a statistical model for studying relationships between compressed and uncompressed data, and use it to evaluate compressors found by an iterative clustering method based on the ECVQ algorithm of Chou, Lookabaugh, and Gray (1989). The method arbitrates between error induced by compression and level of data reduction. Error explicitly includes a component that accounts for uncertainty due to multiple local minima of the ECVQ loss function. Dataset compressibility is identified as an important characteristic for setting parameters that determine the balance between error and data reduction. We demonstrate this procedure using a well-known dataset from our motivating field of application, Earth science.
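The trade-off between compression error and data reduction in an entropy-constrained quantizer can be sketched as follows. This is a minimal one-dimensional illustration of the general ECVQ idea (assign each point to the codeword minimizing squared error plus a penalty on code length), not the authors' procedure; the function name `ecvq`, the parameter `lam`, and the toy data are assumptions.

```python
import math


def ecvq(data, codebook, lam, iters=20):
    """1-D entropy-constrained quantizer sketch.

    Each point goes to the codeword minimizing
    squared error + lam * (code length in bits, -log2 of usage probability).
    """
    k = len(codebook)
    probs = [1.0 / k] * k  # start with equal usage probabilities
    for _ in range(iters):
        # Assignment step: rarely used codewords pay a larger length penalty.
        groups = [[] for _ in codebook]
        for x in data:
            costs = [
                (x - c) ** 2 - lam * math.log2(p)
                for c, p in zip(codebook, probs)
            ]
            groups[costs.index(min(costs))].append(x)
        # Update step: drop empty cells, recompute centroids and probabilities.
        groups = [g for g in groups if g]
        codebook = [sum(g) / len(g) for g in groups]
        probs = [len(g) / len(data) for g in groups]
    return codebook, probs


# Toy data: eight points at 0 and one at 1 (made-up values).
data = [0.0] * 8 + [1.0]
codebook_lossless, _ = ecvq(data, [0.0, 1.0], lam=0.0)  # behaves like k-means
codebook_reduced, _ = ecvq(data, [0.0, 1.0], lam=2.0)   # rare codeword is dropped
```

With `lam = 0` the sketch reduces to ordinary clustering and keeps both codewords; with a larger `lam` the rarely used codeword is absorbed, giving more data reduction at the cost of more error.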
 
 
Finding Bent-Double Radio Galaxies: A Case Study in Data Mining
I. K. Fodor, C. Baldwin, E. Cantú-Paz, C. Kamath, and N. Tang
Lawrence Livermore National Laboratory

Sapphire (http://www.llnl.gov/casc/sapphire/) is a project on large scale data mining and pattern recognition at the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory. We are using state-of-the-art computational methods in order to help scientists extract useful information hidden in massive datasets. In one of our applications, we are collaborating with astronomers on the FIRST (Faint Images of the Radio Sky at Twenty-cm) survey to find radio galaxies with bent-double morphology.

We present a brief overview of Sapphire, then illustrate the challenges particular to finding bent-double galaxies. We derive features from the FIRST catalog and from raw images, then apply decision trees to classify the radio sources based on those features. Defining meaningful features poses a real difficulty, as bent-doubles vary considerably. The features we use must not only be scale, translation, and rotation invariant, but should also be robust to small changes in the data. We describe the features we use to discriminate bent-doubles from non-bent-doubles, and report on the sensitivity of our decision tree results to changes in the features.

 
 
Characterizing the Complexity of a High-Dimensional Classification Problem
Carey E. Priebe
Johns Hopkins University
David J. Marchette
Naval Surface Warfare Center
Classification of high-dimensional data is inherently difficult. We present an exploratory data analysis methodology for obtaining information about the high-dimensional decision boundary, and provide a nonlinear projection in which to perform classification. We focus on the two-class problem, although the methodology can be extended to the multiclass case. The idea is to characterize the support of one class as a collection of spheres covering the support, with each sphere centered at an observation in that class such that the radius is maximal without containing too many observations from the other class. A greedy algorithm for fitting the spheres is proposed. The spheres then provide a description of the support of the class, with information about the decision boundary implicit in the position, radii and adjacency of the spheres. Clustering the spheres by radius and projecting the data based on distances to the clusters yields a nonlinear projection to a lower-dimensional space in which classification can be performed. We illustrate the algorithm with pedagogical simulations and a chemical sensor data analysis application.
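The sphere-covering idea described above can be sketched as follows. This is a simplified illustration, not the authors' algorithm: it allows no observations from the other class inside a sphere (the abstract allows "not too many"), assumes no target point coincides with an other-class point, and the function name `greedy_sphere_cover` and the toy points are assumptions.

```python
import math


def greedy_sphere_cover(target, other):
    """Cover the target-class points with spheres centered at target points.

    Each radius is the distance to the nearest other-class point, so the
    open sphere contains no other-class observations (simplifying assumption).
    """
    radii = [min(math.dist(x, y) for y in other) for x in target]
    # covers[i] = indices of target points strictly inside sphere i
    covers = [
        {j for j, z in enumerate(target) if math.dist(x, z) < r}
        for x, r in zip(target, radii)
    ]
    uncovered = set(range(len(target)))
    chosen = []
    while uncovered:
        # Greedy step: take the sphere covering the most uncovered points.
        best = max(range(len(target)), key=lambda i: len(covers[i] & uncovered))
        if not covers[best] & uncovered:
            break  # guard against zero-radius spheres that cover nothing
        chosen.append((target[best], radii[best]))
        uncovered -= covers[best]
    return chosen


# Toy two-class data in the plane (made-up coordinates).
target = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
other = [(3.0, 3.0)]
chosen = greedy_sphere_cover(target, other)
```

The returned centers and radii could then be clustered by radius, with distances to the clusters serving as the coordinates of a lower-dimensional projection.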
 
 
Multiresolution Stochastic Models for Object Recognition in Self-Similar Texture Images
Richard J. Barton, Jennifer Davidson, Lili Chen, and Fei Wan
Iowa State University
We consider the application of multiresolution stochastic modeling techniques to the analysis and synthesis of texture images. We adopt the approach of Crouse, et al., in which the wavelet coefficients of a texture image are modeled using a hidden Markov tree model (HMTM). The assumed tree structure arises naturally from a decomposition of the original image in terms of an orthonormal wavelet basis, and the Markov structure is imposed under the assumption that the wavelet coefficients decorrelate rapidly across both space and scale. One of the most common characteristics of image data that results in wavelet coefficients with this property is self-similarity. The existence of self-similarity in the image data allows us to reduce dramatically the number of parameters that must be estimated in order to accurately model the wavelet decomposition of the image using an HMTM. In addition, we extend the HMTM structure by modeling the relationship of the wavelet coefficients within a particular scale using a partially ordered Markov model (POMM). We show that POMMs fit naturally within the multiresolution HMTM paradigm, and that the multiresolution POMM structure leads to accurate and parsimonious models for texture data that are useful for texture segmentation, texture discrimination, and object recognition.
 
 
Predicting the Phase Transition in 3-Colorable Graphs
Hao Zhang
Washington State University
The phase transition in 3-colorability refers to the phenomenon that a large graph abruptly becomes non-3-colorable as its connectivity increases. Percentages of 3-colorable graphs were obtained by generating random graphs for each combination of size and connectivity. We use these data to build a model and predict where the phase transition will occur. We will also address model validation and the distributions of the estimators in the model.
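The simulation step described above, estimating the percentage of 3-colorable random graphs at a given size and connectivity, can be sketched as follows. This is a minimal illustration for small graphs only; the use of an edge probability `p` as the measure of connectivity, the backtracking search, and the function names are assumptions rather than the author's setup.

```python
import random


def is_3_colorable(n, edges):
    """Backtracking search for a proper 3-coloring of vertices 0..n-1."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    color = [-1] * n  # -1 means "not yet colored"

    def assign(v):
        if v == n:
            return True
        for c in range(3):
            if all(color[u] != c for u in adj[v]):
                color[v] = c
                if assign(v + 1):
                    return True
        color[v] = -1  # undo before backtracking
        return False

    return assign(0)


def fraction_3_colorable(n, p, trials, seed=0):
    """Share of random graphs (each edge present with probability p) that are 3-colorable."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        edges = [
            (u, v)
            for u in range(n)
            for v in range(u + 1, n)
            if rng.random() < p
        ]
        hits += is_3_colorable(n, edges)
    return hits / trials
```

Calling `fraction_3_colorable` over a grid of `n` and `p` values would produce the kind of table to which a model of the transition could be fitted.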
 
 
On the Decomposition of Spatial Processes
Reinhard Furrer
Swiss Federal Institute of Technology

 


 
Computational and Estimation Issues in Modeling
Chair: Barbara Bailey
University of Illinois at Urbana-Champaign
 
Bayesian Computations for Random Environment Models
Dhaifalla K. Al-Mutairi
Kuwait University
This paper deals with reliability data analysis from a Bayesian perspective using Random Environment (RE) models. We review the current literature on RE models and study the statistical computational problems that arise for these models in posterior and predictive analysis, hypothesis testing, and model selection. Computational methods to solve such problems are presented, and we also give illustrative examples.
 
 
Borrowing Strength without Explicit Data Pooling: Estimating with External Constraints
Jerome Reiter
Williams College

When using regression models where units can be classified into distinct groups, similar parameters in each group can be estimated via explicit data pooling, such as in hierarchical models. Sometimes, however, external constraints prohibit explicit data pooling. For example, in the plan for the 2000 census that includes the Integrated Coverage Measurement survey, the Census Bureau avoids pooling data across states because the law may not allow data from one state to affect the population estimates in another state. Similar constraints may exist when auditing or comparing several groups' performances.

I present techniques that may be acceptable under such external constraints and yield more accurate estimates than those obtained by regressing separately in each group. These techniques utilize the information in multiple groups' parameter estimates to specify the model in each group, but ultimately estimate the parameters selected for each group's model using only that group's data. The techniques can be conceptualized as existing on a continuum ordered by how directly each relies on data pooling; those techniques that look more like explicit data pooling are typically more accurate yet less likely to be acceptable. I investigate the techniques in a variety of simulation studies.

 
 
Do Blocks Make a Neighborhood?
Approaches to Estimating Neighborhood Parameters Based on Localized Observations

Carolyn A. Carroll
Stat Tech, Inc.
The talk will compare some approaches to parameter estimation based on "local" data readings. Data in some fields (e.g., environmental) can be viewed as "samples". But the samples may be poorly designed and contain some inherent but unknown bias. Finding methods of combining readings and making defensible, general statements across a larger area (e.g., the neighborhood or community) is difficult.
 
 
Semi-Parametric Nonlinear Mixed Effects Models
Yuedong Wang and Chunlei Ke
University of California, Santa Barbara
We present a class of semi-parametric nonlinear mixed effects models (SNMM) for repeated measures data. An SNMM assumes that the mean function depends on some parameters and nonparametric functions. The parameters provide an interpretable data summary, and the nonparametric functions provide the flexibility to let the data decide some unknown or uncertain components. A second stage model with fixed and random effects is used to model the parameters. Smoothing splines are used to model the nonparametric functions. Covariate effects on parameters can be built into the second stage model and covariate effects on nonparametric functions can be constructed using smoothing spline ANOVA decompositions. SNMMs contain many existing models such as nonlinear mixed effects models and self-modeling nonlinear regression models as special cases. Therefore they can be used as diagnostic tools for many parametric and nonparametric models. Applications will be illustrated using real data sets.
 
 
Computational Aspects of Fitting Generalized Nonparametric Mixed Effects Models
Peter Karcher and Yuedong Wang
University of California, Santa Barbara

Generalized Linear Mixed Effects Models (GLMM) provide useful tools for correlated and overdispersed non-Gaussian data. In this paper we consider Generalized Nonparametric Mixed Effects Models (GNMM) which relax the rigid linear assumption on the conditional predictor in a GLMM.

We use smoothing splines to model fixed effects. The random effects are general and may also contain stochastic processes corresponding to smoothing splines. We show how to construct smoothing spline ANOVA (SS ANOVA) decompositions for the predictor function. Components in an SS ANOVA decomposition have nice interpretations as main effects and interactions. We estimate all parameters and spline functions using stochastic approximation with Markov Chain Monte Carlo. We illustrate our method using simulated data.

 
 
Likelihood Based Tests for Over and Underdispersion Against General Alternatives
Gordon K. Smyth and Heather Podlich
University of Queensland
Variations from Poisson and binomial variation are a common concern when modelling count data. Tests for overdispersion are usually based on unrealistically specific alternatives, such as the negative binomial or beta-binomial distributions, or are not model based and therefore lack power. Convincing methods for detecting and modelling underdispersion are not generally available. We use extended Poisson process models, in which an arbitrary count distribution can be represented as the realization of a pure birth process. Under and overdispersion relative to the Poisson or binomial distributions can be represented in terms of the slope and curvature of the unobserved birth rate sequence. We give a new saddlepoint approximation for birth processes which is exact in the neighborhood of Poisson, negative binomial and binomial models. This allows us to compute score tests for the goodness of fit of standard models against very general alternatives.
 


 
Wavelets, Splines, State-Space and Adaptive Models
Chair: Hyunjoong Kim
Worcester Polytechnic Institute
 
NORM Thresholding Method in Wavelet Regression
Dongfeng Wu
University of Texas
We present a new method called the NORM method for finding threshold values in wavelet regression. We use Wavethresh software in S-Plus to implement this method, and compare it with existing methods, such as Donoho & Johnstone's SureShrink, AdaptShrink and Nason's Cross-Validation, and with optimal thresholding. The goal is to minimize the average mean squared error. We use three different kinds of noise (iid normal, iid t, and correlated noise) on 8 different test applications, including BLOCK, BUMP, HEAVISINE and DOPPLER. For iid normal noise, any method could be the best. The Cross-Validation method works best for independent long-tailed t noise. In the case of correlated noise, the NORM method behaves best when the lag-one correlation is large and positive. The SureShrink or AdaptShrink method behaves best when the lag-one correlation is large and negative. We give heuristic explanations of these behaviors. We also evaluate the accuracy of our method for estimating the noise level sigma.
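For readers unfamiliar with what a threshold value does in wavelet regression, the following minimal sketch applies soft thresholding to a list of wavelet coefficients. It uses the Donoho-Johnstone universal threshold sigma * sqrt(2 log n) purely as a stand-in; it is not the NORM method, whose rule is not given in the abstract, and the coefficient values and function names are assumptions.

```python
import math


def soft_threshold(coeffs, t):
    """Shrink each coefficient toward zero by t; coefficients within [-t, t] become 0."""
    return [math.copysign(max(abs(c) - t, 0.0), c) for c in coeffs]


def universal_threshold(sigma, n):
    """Universal threshold sigma * sqrt(2 log n) for n coefficients and noise level sigma."""
    return sigma * math.sqrt(2.0 * math.log(n))


# Made-up detail coefficients: two large (signal) and two small (mostly noise).
coeffs = [5.0, -0.3, 0.2, -4.0]
t = universal_threshold(sigma=0.5, n=len(coeffs))
denoised = soft_threshold(coeffs, t)
```

Any competing rule for choosing `t`, including a data-driven one, could be substituted for `universal_threshold` and compared by mean squared error on the reconstructed signal.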
 
 
Data-Driven Optimal Denoising and Recovery of Derivatives of Noisy Signals Using Multiwavelets
Nathaniel Tymes and Sam Efromovich
University of New Mexico
Multiwavelets are relative newcomers to the world of wavelets. Thus it is no surprise that the denoising methods in use are modified universal threshold procedures developed for uniwavelets. On the other hand, a distinctive feature of the multiwavelet discrete transform is that typical errors are correlated and not identically distributed, whereas the theory of universal thresholding is based on the assumption of independent and identically distributed normal errors. Thus we suggest an alternative denoising procedure based on the Efromovich-Pinsker algorithm. We show that this procedure is asymptotically optimal over a wide class of spatially inhomogeneous functions. Moreover, together with a new "cristina" class of biorthogonal multiwavelets the procedure implies an optimal method for recovering the derivative of a noisy signal. The asymptotic results are supported by intensive Monte Carlo experiments.
 
 
Adaptive Splines and Genetic Algorithms for Optimal Low-Dimensional Statistical Modeling
Jennifer I. Pittman
Pennsylvania State University

Due in part to the increased availability of computational power, spatially adaptive smoothing methods involving regression splines have become a popular and rapidly developing class of nonparametric modeling techniques. Most existing algorithms for fitting adaptive splines are based on non-linear optimization and/or stepwise selection. Although computationally fast and spatially adaptive, stepwise knot selection is necessarily suboptimal while determining the best model over the space of adaptive knot splines is a very poorly behaved non-linear optimization problem. A possible alternative is to use more intensive numerical optimization techniques such as genetic algorithms to perform knot selection.

A spatially adaptive modeling technique referred to as adaptive genetic splines (AGS) is introduced which combines the optimization power of a genetic algorithm with the flexibility of polynomial splines. Preliminary simulation results comparing the performance of the genetic algorithm method to other current methods, such as HAS (Luo and Wahba 1997) and SUREshrink (Donoho and Johnstone 1995), will be discussed, as well as a current application of AGS in the engineering sciences. Topics for future research will also be mentioned.

 
 
Partially Adaptive Bandwidth Used in Prediction and Local Regression
Janis Grabis
Riga Technical University
A bandwidth parameter of the local regression model can be set either globally or locally. This paper considers partially adaptive bandwidth selection. The partially adaptive bandwidth is used to adjust the global bandwidth for particular data points. The adjustment takes place if a specified quality criterion of the given local model fails. The localized Akaike's Information Criterion acts as this quality measure. The global bandwidth is set either by cross-validation or arbitrarily. In the first case the partially adaptive bandwidth is intended to improve the results of cross-validation. The second approach allows cross-validation to be skipped, which provides substantial computational savings with small accuracy losses. The partially adaptive bandwidth attempts to encompass both the robustness of the global bandwidth and the flexibility of the local bandwidth. Performance of the partially adaptive bandwidth is evaluated by prediction of several empirical time series.
 
 
Self-Modeling Regression with Random Effects Using Penalized Regression Splines
Naomi S. Altman and Julio C. Villarreal
Cornell University

Self-modeling regression is a semi-parametric method for describing a family of similar curves. The overall shape of the curve is estimated nonparametrically, but differences among the curves are described through a parametric model. In designed experiments in which the response is a curve, modeling the parameters through a random effects model is often desirable.

In this paper, we describe a self-modeling regression model in which the nonparametric curve is defined by a penalized regression spline. Since the spline can be estimated as a linear random effects model, this allows us considerable simplification in both computation and inference. This simplicity can then be exploited to extend the model to generalized regressions.

 
 
Estimation of Nonlinear State-Space Models in the Presence of Censored Observations
Craig Johns
Colorado University and NCAR
Robert H. Shumway
University of California, Davis
State-space models involving time varying parameters are often used for describing broad classes of biological and physical phenomena. In some cases, measurement devices that produce data suitable for such models are hampered by an inability to measure beyond certain specified upper or lower detection limits. Traditional approaches to estimation for nonlinear state-space models use maximum likelihood procedures. These procedures depend on being able to compute conditional expectations via Kalman filtering and smoothing and are intractable under censoring or when using nonlinear models.

Carlin, Polson and Stoffer (1992) develop a Markov Chain Monte Carlo (MCMC) estimation procedure for nonlinear state-space models. This MCMC method is extended to fit linear and nonlinear state-space models when observations have been censored due to detection limits.

These MCMC estimation procedures are applied to filtering and parameter estimation for nonlinear state-space models of spatio-temporal data collected by a laser detector (lidar) measuring airborne particulate matter created by a moving point source.

 


 
Applications of Exploring, Modeling and Presenting Large Datasets
Chair: Derek Stanford
MathSoft, Inc.
 
Predictive Statistical Models for Detecting Anomalies and Congestion in IP Based Networks
Elisa M. Santos
Telcordia Technologies

Network performance monitoring has become a necessity to ensure quality of service. Currently, the tools available for network monitoring are mainly geared towards monitor-to-destination performance and cannot identify midway (node/link level) congestion or anomalies.

A methodology is presented to evaluate the performance of an IP based network with respect to the delay metric at the individual node/link level using available methods of data collection. Delay data are collected with respect to each link and a statistical model is developed, taking into consideration potential trends and periodicities. The model is automatically updated to reflect normal changes in delay level; because it is meant to represent the normal state of the network, it is based on in-control delay data. Congestion detection is based on model predictions and user-defined thresholds. The use of the model concentrates the detection effort on real signal, instead of chasing peaks/bursts that are mostly natural random variation. A formal statistical test is proposed for identifying congestion. To detect anomalies in a link it is necessary to model the delay under normal network conditions and to check the latest data regularly for abnormal behavior, as suggested by significant patterns in the residuals using appropriate statistics.

 
 
A Hierarchical Mixture Model for WWW-Usage
Dee Denteneer
Philips Research

Access networks (e.g. cable networks) are currently being standardised (e.g. DOCSIS, IEEE, DVB) and are the focus of extensive commercial activity.

Characteristic for such networks is a sequential procedure for data transfer from a station at the customer premises to a central node, which consists of two stages. First, a contention stage is carried out, in which a station requests a number of data slots (in contention with other stations). Second, a data transfer stage is carried out, in which the data is transferred in the data slots that have been reserved for this station.

This procedure implies that the performance of access networks is sensitive to both long-range dependencies and short-range dependencies in the traffic carried over the network. Hence it is mandatory that the traffic models used in performance analysis accurately reflect both types of dependencies.

We propose hierarchical mixture models for this purpose and develop an EM-algorithm to fit them. Relevance and use are demonstrated by applying the model to data on WWW-usage.

 
 
Tracking Timing Patterns for Millions of Customers in Real-Time
Jose C. Pinheiro, Diane Lambert, and Don X. Sun
Bell Labs-Lucent Technologies
Business applications such as credit card fraud detection and e-commerce require tracking the behavior of millions of customers who are making transactions. The behavior of each customer must be summarized separately and updated whenever the customer makes a transaction. Because storage space may be limited to a few hundred bytes per customer and computing time may be limited to milliseconds, the updating can depend only on the new transaction and the current summary for a customer. If a characteristic, such as the amount of a purchase, is observed at random, then its distribution can be updated by exponentially weighted moving averaging. Timing variables, like day-of-week, however, are not observed at random, and standard sequential estimates of their distribution can be badly biased. To develop good estimates, we model timing variables by a Poisson process that has piecewise constant rates that evolve over time. This leads to a variant of exponentially weighted moving averaging that requires little storage space, is simple to compute, and is appropriate for timing variables. The new sequential estimator approximates the mean under a dynamic timing model and has good asymptotic properties. Simulations show that it also has good finite sample properties.
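The standard exponentially weighted moving average update mentioned above, for a characteristic observed at random, can be sketched as follows. It depends only on the current summary and the new transaction, as the storage constraint requires. This is not the authors' variant for timing variables; the function name `ewma_update`, the weight `w`, and the category labels are assumptions.

```python
def ewma_update(summary, observed, w=0.1):
    """Update a customer's categorical distribution estimate with one transaction.

    summary  : dict mapping category -> current probability estimate
    observed : category of the new transaction
    w        : weight given to the new observation (hypothetical default)
    """
    return {
        cat: (1.0 - w) * p + (w if cat == observed else 0.0)
        for cat, p in summary.items()
    }


# Start from a uniform guess over three purchase-size bins (illustrative labels).
summary = {"small": 1 / 3, "medium": 1 / 3, "large": 1 / 3}
for purchase in ["small", "small", "large", "small"]:
    summary = ewma_update(summary, purchase)
```

The estimates always sum to one, and older transactions are discounted geometrically, so the summary adapts as a customer's behavior changes.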
 
 
Data Mining on Time Series: An Illustration Using Fast Food Restaurant Franchise Data
Lon-Mu Liu, S. Bhattacharyya, S. L. Sclove, R. Chen, and W. J. Lattyak
University of Illinois, Chicago, Scientific Computing Associates Corp.
With the prevalent use of modern information technology, a large number of time series may be collected during normal business operations. We use a fast-food restaurant franchise as an example to illustrate how data mining can be applied to such time series, and help the franchise reap the benefits of such an effort. Time series data mining at both the store level and corporate level are discussed. Related data warehousing issues are also addressed. Box-Jenkins seasonal ARIMA models are employed to analyze and forecast the time series. Instead of a traditional manual approach of Box-Jenkins modeling, an automatic time series modeling procedure is employed to analyze a large number of highly periodic time series. In addition, an automatic outlier detection and adjustment procedure is used for both model estimation and forecasting. The improvement in forecast performance due to outlier adjustment is demonstrated. Adjustment of forecasts based on stored historical estimates of like-events is also discussed. To illustrate the feasibility and simplicity of the above automatic procedures for time series data mining, the SCA Statistical System is employed to perform the related analysis.
 
 
Redesigning Tables and Graphics for Federal Statistical Agencies
Daniel B. Carr
George Mason University
This talk describes the redesign of tables and graphs for communicating federal statistics summaries. The goal is to use perceptual and cognitive principles to make improvements while staying within a given set of constraints. A set of constraints might be that the result must be a table, half-toning is limited to one or two colors, the table body elements may not highlight atypical values, and the table must not take up much more space than in the previous publications. The talk focuses on a small set of examples from federal applications that are chosen to provide diversity. One example uses perceptual grouping and layering to better convey table header hierarchies and improve the readability of table body elements. Logical considerations even lead to improved wording for footnotes and suggest further modifications. Another example shows conversion of a table to graphics and inclusion of metadata. Some changes simply make the result more attractive and provide a sense of value added. The examples illustrate the use of Splus™ and published statistical summaries from various agencies such as the Bureau of Labor Statistics and the Bureau of Transportation Statistics.
 
 
Remote Medical Evaluation and Diagnostics: A Testbed for Hypertensive Patient Monitoring
John C. Dumer, Timothy P. Hanratty, and Barry A. Bodt
U.S. Army Research Laboratory
H. Mitchell Perry and Sharon E. Carmody
Veterans Administration
Health care costs have surpassed $2 billion per day in the United States. For government agencies faced with shrinking budgets, reduced medical staffs, and increased patient commitments this poses an especially challenging problem. One approach touted to allay this situation is to use remote medical diagnostics, involving the collection, monitoring, and analysis of patient data from remote locations (home) via a communication device. Toward this end, the U.S. Army Research Laboratory is developing a system for remote medical evaluation and diagnostics (RMED) that combines remote monitoring capabilities with intelligent decision support technology.

The first area identified for use with the RMED system is hypertension. A current effort with the Veterans Administration through the St. Louis Hypertension Program employs prototype blood pressure cuffs in the field. This paper will demonstrate the prototype system for data collection, storage, and retrieval, and summarize the results of a small pilot study to assess the reliability of the measurements relative to standard measurement procedures.

 


 
Exploration and Visualization in High Dimensions
Sponsored by The Caucus for Women in Statistics
Chair: Matthias Schonlau
RAND
 
Exploration and Estimation of North American Climatological Data
James A. Shine and Paul F. Krause
U.S. Army Topographic Engineering Center
The availability of spatial and temporal earth data is increasing; for example, NASA's Earth Observing System (EOS) will soon be producing a terabyte of earth data per day. This data will permit more detailed exploration and analysis of earth systems than was possible in the past. One application of particular interest to the authors is the estimation of contour surfaces from point values and the visualization of these surfaces in map form. The authors explored a multivariate data set of approximately 6,000 points in North America and other global locations. Each point contains climatological and other information such as temperature, elevation and precipitation. Spatial correlation was modeled using a semivariogram, and several estimation approaches were used to create estimated surfaces and resulting contours. Maps of these results and comparisons between different approaches will be presented.
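The first step in modeling spatial correlation with a semivariogram is usually an empirical estimate from the point values. The sketch below computes the classical estimator (half the mean squared difference between pairs of values, grouped by separation distance). It is a generic illustration, not the authors' code; the function name, the binning scheme, and the toy data are assumptions.

```python
import math
from itertools import combinations


def empirical_semivariogram(points, values, bin_width, n_bins):
    """Half the mean squared difference of values, per distance bin.

    Bin k collects pairs whose separation h satisfies
    k * bin_width <= h < (k + 1) * bin_width. Empty bins give None.
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for i, j in combinations(range(len(points)), 2):
        h = math.dist(points[i], points[j])
        k = int(h // bin_width)
        if k < n_bins:
            sums[k] += (values[i] - values[j]) ** 2
            counts[k] += 1
    return [s / (2 * c) if c else None for s, c in zip(sums, counts)]


# Four stations on a line with made-up temperature readings.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
values = [1.0, 2.0, 4.0, 7.0]
gamma = empirical_semivariogram(points, values, bin_width=1.0, n_bins=4)
```

A parametric semivariogram model fitted to `gamma` would then supply the weights for an estimation approach such as kriging.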
 
 
Visualizing Abandoned Hazardous Waste Sites in the United States
Carolyn K. Offutt
U.S. Environmental Protection Agency

The U.S. Environmental Protection Agency has identified over 35,000 abandoned hazardous waste sites across the country for investigation and possible remediation. Most of the sites do not require Federal action, but 1,400 sites are currently on the National Priorities List (NPL) for cleanup under the Superfund program.

The location and characteristics of the NPL sites form a large data set that has the capacity for analysis and subsequent visualization of the analyses. Characteristics of the sites include industrial activities, type of contamination, environmental media being contaminated (air, soil, surface water, ground water, sediment, structures, etc.), remediation planned or undertaken (excavation, in-situ bioremediation, soil vapor extraction, incineration, solidification, etc.), and many more characteristics. Spatial representation of this information allows further analysis of soil type, aquifers, endangered species habitats, and census data.

Public interest compels EPA to provide access to the data, as well as to the analytical tools. Much of the information is accessible on the Superfund Web site at: http://www.epa.gov/superfund.

This paper will demonstrate how this large data set is managed, updated, queried, and made accessible. In addition, the paper will discuss future plans for data accessibility and manipulation, as well as some of the data issues.

 
 
High-Dimensional Visualization Using Continuous Conditioning
William C. Wojciechowski and David W. Scott
Rice University
Many methods for visualizing hypervariate data have been employed. Among these are slicing, coplots (Cleveland, 1993), and dynamic coplots (Wojciechowski and Harner, 1995). Slicing and coplots display subsets of the data in static plots. The subsets are determined by the value of one or more conditioning variables. Dynamic coplots add animation and the displayed subset is updated in real-time as the conditioning values are modified. For all of these methods, the displayed points are determined by an indicator function. This produces a discontinuous transition as the value of the conditioning variable changes. The method we introduce uses a continuous weighting function based on a distance measure defined on the conditioning variable sub-space. The distance measure determines the color and transparency of a data point and produces a smooth visual effect during animation. The abstract notion of distance provides for many conditioning techniques. We present examples developed on Rice University's Immersadesk.
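The contrast between an indicator function and a continuous weighting function can be sketched as follows. The Gaussian form of the weight, the one-dimensional conditioning variable, and the function names are illustrative assumptions; the abstract does not specify the authors' weighting function.

```python
import math


def indicator_weight(z, center, half_width):
    """Slicing or coplot style: a point is either shown (1.0) or hidden (0.0)."""
    return 1.0 if abs(z - center) <= half_width else 0.0


def continuous_weight(z, center, scale):
    """Smooth alternative: the weight decays with distance from the conditioning value."""
    d = abs(z - center)  # distance in the conditioning-variable subspace
    return math.exp(-0.5 * (d / scale) ** 2)


# Conditioning-variable values for five data points (made-up numbers).
z_values = [0.0, 0.4, 0.5, 0.6, 2.0]
hard = [indicator_weight(z, center=0.0, half_width=0.5) for z in z_values]
soft = [continuous_weight(z, center=0.0, scale=0.5) for z in z_values]
```

Mapping `soft` to the opacity of each plotted point means that points fade in and out as the conditioning value moves, whereas `hard` makes the point at 0.6 vanish abruptly.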
 
 
Graphical Techniques for the Exploration of Functional Data
E. Neely Atkinson
MD Anderson Cancer Center
This talk will demonstrate some tools for the graphical exploration of functional data, that is, data in which some of the variables of interest may be considered as observations of an underlying smooth process. The functional data may be displayed or transformed in several ways and may be linked to a number of univariate and multivariate plots of covariates of interest. The methods are coded in LISP-STAT. The techniques will be illustrated on data drawn from an ongoing study of the use of fluorescence spectroscopy to diagnose cervical abnormalities.
 
 
Authenticating Vulnerability Measurements
Edward J. Wegman
George Mason University
The DoD is required by law to conduct "live-fire" tests on developing weapons systems in order to test their vulnerability. Because weapons systems are expensive, few actual live-fire tests are conducted. These are supplemented by simulations. Because weapons are complicated systems, the failure modes of the simulations usually do not correspond exactly to the live-fire tests. We use multidimensional clustering and visualization techniques to authenticate vulnerability measurements.
 
 
2D Classification Trees
Hyunjoong Kim (Worcester Polytechnic Institute) and Wei-Yin Loh (University of Wisconsin, Madison)
Many classification tree algorithms summarize the data in each terminal node with univariate statistics such as the proportion misclassified. We propose a new classification tree algorithm which yields at each node a 2-dimensional plot of the data with superimposed linear discriminant boundaries. Our algorithm is distinct from "linear combination tree" algorithms that partition the data space with hyper-planes. Intermediate nodes are split using the same techniques as other univariate split algorithms. The difference lies in the model fitted to each terminal node: we fit a statistical model and summarize it using a 2-dimensional plot. The new algorithm thus employs visualization to further enhance our understanding of the data structure. It is shown that the new algorithm has better prediction power than many other classification tree algorithms. Examples using real data will be given for illustration.
 
 


 
Multivariate Modeling: Issues and Applications
Chair: Jerome Reiter
Williams College
 
Statistical Analysis of Rhizosphere Microbial Communities
Jayson D. Wilbur, Cindy H. Nakatsu, Sylvie M. Brouder, and R. W. Doerge
Purdue University
Rhizosphere soils of corn are used to determine if distinct microbial communities are associated with different root types, early plant growth stages, and history of soil used for planting. The exudates from the roots can promote microbial growth in the rhizosphere. However, very little is known about the diversity, composition and dynamics of this component of the terrestrial ecosystem. Corn plants were grown in disturbed and undisturbed soils with a 24-year history of growth as a monoculture crop or two crops grown in annual rotation. Both greenhouse and field experiments are presented. Characteristic profiles of the microbial communities were obtained by denaturing gradient gel electrophoresis (DGGE) of PCR-amplified 16S rDNA from soil-extracted DNA. The dominant rhizosphere bacterial populations differed during plant growth and with soil treatment. Using various clustering algorithms, the microbial community DGGE fingerprints grouped according to agronomic treatment and, within each agronomic treatment, according to plant growth stage. Analysis of the community grown in the greenhouse showed less differentiation between growth stages and agronomic treatment but a distinct difference from the field community. The analysis of these data identified possible factors influencing the microbial ecology of the rhizosphere and aided in preliminary statistical modelling.
 
 
A Study of Faculty Equity Salary Using Derived Data
Trong Wu
Southern Illinois University, Edwardsville
Faculty salaries at a public institution often depend on state revenue and the governor's budget. Faculty can become underpaid after several years of lower revenue and budgets. Therefore, each university or college needs to study its faculty salaries periodically to determine whether its faculty are underpaid or paid sufficiently. This paper reports a study of a faculty equity salary plan at a public institution in the state of Illinois. The plan selects a number of comparable institutions as its salary peer institutions and then processes data derived from the salary information of these peer institutions. These data are combined with the salary information for each discipline from a nationwide survey by Oklahoma State University to set the target salary for the faculty of each discipline for the following year.
 
 
Use of Latent Variable Models in Air Quality Monitoring
William F. Christensen and Stephan R. Sain
Southern Methodist University
Latent variable analysis is a statistical approach for modeling the underlying structure in multivariate data in terms of a smaller number of latent variables or factors. In the environmental sciences, factor analytic techniques have been used to assess the number of pollution sources affecting the air quality at a monitoring site. Because air quality data often exhibit temporal and/or spatial dependence, we consider the importance of accounting for such correlation in estimating model parameters and making statistical inferences. Potential approaches for accounting for dependencies in the data are discussed, and the use of the block bootstrap as a tool for constructing appropriate inferential procedures is evaluated. An example using a set of air quality measurements is presented.
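For readers unfamiliar with the block bootstrap, a minimal sketch of one common variant (the moving block bootstrap) is given here. Which variant the authors evaluate is not stated in the abstract; the function name, the fixed block length, and the toy series are assumptions made for illustration.

```python
import random

def moving_block_bootstrap(series, block_length, rng=None):
    """Resample a series by concatenating randomly chosen contiguous blocks.

    Keeping blocks intact preserves short-range temporal dependence that
    resampling individual observations would destroy.
    """
    rng = rng or random.Random()
    n = len(series)
    if not 1 <= block_length <= n:
        raise ValueError("block_length must be between 1 and len(series)")
    n_starts = n - block_length + 1  # number of admissible block start indices
    resample = []
    while len(resample) < n:
        start = rng.randrange(n_starts)
        resample.extend(series[start:start + block_length])
    return resample[:n]  # trim the last block so the length matches the input

# Hypothetical series: the integers 0..19 stand in for ordered measurements.
toy_series = list(range(20))
one_resample = moving_block_bootstrap(toy_series, block_length=5,
                                      rng=random.Random(0))
```

Repeating the resampling many times and recomputing the statistic of interest on each resample gives an approximation to its sampling distribution under dependence.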
 
 
CCA: Canonical Correlation or Correspondence Analysis?
Which is Better for Analysis and Interpretation of Multivariate Data?
A. Dale Magoun (University of Louisiana at Monroe) and Linda Peyman (U.S.A.E./WES)
Various researchers throughout the United States have recently studied patterns of bird distributions within riparian habitats. These studies focused on avian habitat use as observed by field biologists and physical data as measured from field sampling designs based on 0.25 ha plots. Multivariate studies, such as these, are extremely difficult to interpret due to the underlying interdependencies of the avian and the physical structures describing these habitats. Many researchers have used canonical correspondence analysis, factor analysis, and multivariate regression techniques as ways to explain the interdependencies found in data sets such as these. This paper presents some recent findings that relate satellite imagery data with avian habitat usage data and focuses on the use of multivariate techniques such as canonical correlation analysis and canonical correspondence analysis as tools for analysis and interpretation of such data. The paper further shows how each of these analysis techniques can be used to best interpret multivariate, environmental data.
 
 
Penalized Score Equations and Penalized GEE
Wenjiang J. Fu
Michigan State University
In a longitudinal study of environmental pollution, the levels of various pollutants are usually highly correlated. If these levels are predictors of a regression model for certain environmental concerns, the parameter estimator will have a large variance due to the collinearity. A penalty approach to deal with the collinearity is considered in the GEE model. The difficulty of this penalty approach due to the lack of joint likelihood in GEE models is overcome by a new approach to the penalty: the penalized score equations. A GEE model with the Lasso penalty will be demonstrated through an environmental study of air pollutants on asthma patients.
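As a point of reference, one common way to write penalized score equations with a Lasso penalty is sketched below. The notation ($U$, $D_i$, $V_i$, $\mu_i$, $\lambda$) is assumed here and is not necessarily the formulation used in the talk; scaling conventions for $\lambda$ vary.

$$
U(\beta) \;=\; \sum_{i=1}^{n} D_i^{\top} V_i^{-1} \{\, y_i - \mu_i(\beta) \,\} = 0,
\qquad D_i = \frac{\partial \mu_i(\beta)}{\partial \beta^{\top}},
$$

are the usual GEE estimating equations, with $V_i$ the working covariance of subject $i$. Since no joint likelihood is available to penalize, the penalty is attached to the score equations directly:

$$
U_j(\beta) - \lambda \,\mathrm{sign}(\beta_j) = 0 \quad \text{for } \beta_j \neq 0,
\qquad |U_j(\beta)| \le \lambda \quad \text{for } \beta_j = 0,
\qquad j = 1, \dots, p .
$$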
 
 
Assessing Deformation in Glaciers
S. Huzurbazar
University of Wyoming
A common method for obtaining data in glaciology is via borehole inclinometry. The data are then processed to study various aspects of the mechanics of glaciers, including construction of a three-dimensional deformation field. We consider a spatial data set collected over four time periods from the Worthington Glacier in Alaska. Problems with the data include measurement errors, censored observations as well as small sample sizes. As a first step, we construct confidence regions for the borehole trajectories and also discuss methods for dealing with the censored data and the modelling of the deformation field.
 
 
Some Statistical Measures on the National, Distribution Center
and Dealer Demands Along the Supply Chain
Nick T. Thomopoulos (Illinois Institute of Technology), Wayne E. Bancroft (Motorola Corporation), and Nick Z. Malham (Forecasting & Inventory Consultants, Inc.)
The monthly demands in three levels of the supply chain are measured using the coefficient of variation. The supply chain here includes the national, the distribution centers and the dealers. Two tables are presented. One table compares the national demands with distribution center demands, and another compares the national demands with the dealer demands.
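The coefficient of variation is the standard deviation divided by the mean. A minimal sketch follows; the demand figures are hypothetical and the use of the sample (n-1) standard deviation is an assumption.

```python
from statistics import mean, stdev

def coefficient_of_variation(demands):
    """Sample standard deviation divided by the mean of the demands."""
    m = mean(demands)
    if m == 0:
        raise ValueError("coefficient of variation is undefined for zero mean")
    return stdev(demands) / m

# Hypothetical monthly demands at two levels of a supply chain.
national = [980, 1010, 1005, 995, 1020, 990]
dealer = [3, 0, 7, 2, 9, 1]

cv_national = coefficient_of_variation(national)
cv_dealer = coefficient_of_variation(dealer)
```

Because the measure is scale-free, it allows demand variability to be compared across levels whose volumes differ by orders of magnitude.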
 


 
Innovations in Model Diagnostics and Fitting Algorithms
Chair: David van Dyk
Harvard University
 
Multiple Outlier Detection
David W. Scott
Rice University
The detection of more than one or two simultaneous outliers in regression and density analysis remains a challenging practical problem. A priori knowledge of the fraction of bad data would greatly facilitate a solution. In this paper, we describe an iterative algorithm that attempts to simultaneously estimate the fraction of outliers and the scale of the good data, and to obtain robust parameter estimates.
 
 
Parameter Selection for Constrained Solutions to Ill-Posed Problems
Bert W. Rust
National Institute of Standards and Technology
Many physical measurements are modelled by linear integral equations expressing each measurement as the sum of an instrumental smearing of the desired function and a random measuring error. Discretizing the integrals gives an ill-conditioned linear regression model with a matrix whose columns are discrete response functions of the instrument. Linear least squares solutions give wildly oscillating, physically impossible estimates of the function being measured. Such estimates are often stabilized either by truncating the singular value decomposition of the response matrix or by introducing a regularization constraint on the solution vector. In the former case it is necessary to choose a "numerical rank" for the matrix, and in the latter case to choose the value of the Lagrange multiplier in the constrained minimization. This paper suggests methods for using the statistical properties of the measuring errors and the residuals in making those choices.
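The two stabilization strategies can be written out in standard notation. The symbols below ($K$, $y$, $x$, $\sigma_i$, $u_i$, $v_i$, $k$, $\lambda$) are assumed for illustration, and the constraint is taken to be a bound on $\|x\|^2$, which is only one possible choice of regularization constraint.

$$
y = Kx + \varepsilon, \qquad K = \sum_{i=1}^{r} \sigma_i\, u_i v_i^{\top}
\quad (\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0).
$$

Truncating the singular value decomposition at numerical rank $k$ gives

$$
\hat{x}_k = \sum_{i=1}^{k} \frac{u_i^{\top} y}{\sigma_i}\, v_i ,
$$

while minimizing $\|y - Kx\|^2 + \lambda \|x\|^2$ with Lagrange multiplier $\lambda > 0$ gives

$$
\hat{x}_{\lambda} = (K^{\top} K + \lambda I)^{-1} K^{\top} y
= \sum_{i=1}^{r} \frac{\sigma_i}{\sigma_i^2 + \lambda}\,(u_i^{\top} y)\, v_i .
$$

In both cases the terms with small $\sigma_i$, which amplify the measuring errors in the unconstrained least squares solution, are removed or damped; the choice of $k$ or $\lambda$ controls how strongly.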
 
 
Optimal Algorithms for Unimodal Regression
Quentin F. Stout and Janis Hardwick
University of Michigan
 
 
Using the Response Variable in Principal Components Regression
Roy E. Welsch
MIT

A recent study by Frank and Friedman (1994) indicated that ridge regression (RR) performed well when compared to partial least-squares regression (PLS) and traditional principal components regression (PCR). Both RR and PLS make use of the regression response variable in determining which latent variables or principal components to use while PCR does not. This study did not examine modified principal components regression procedures (MPCR) which take explicit account of the response variable. Rawlings (1988), Krzanowski (1992), and Sharma and Welsch (1997) present a number of MPCR procedures. In this paper, we discuss these ideas and some new ones and compare the results to those obtained by Frank and Friedman.

Frank, I. E. and Friedman, J. H. (1994). A Statistical View of Some Chemometrics Regression Tools. Technometrics, 35:109-148.

Krzanowski, W. J. (1992). Ranking Principal Components to Reflect Group Structure. Journal of Chemometrics, 6:97-102.

Rawlings, J. O. (1988). Applied Regression Analysis. Wadsworth, Belmont, CA.

Sharma, V. and Welsch, R. E. (1997). Research on Variation Reduction Using Massive Data Streams. Bulletin of the International Statistical Institute, 51st Session, Istanbul, 289-290.

 
 
Case Studies of Normal Diagnostics in Regression Using Recovered Errors
Donald E. Ramírez (University of Virginia) and Donald R. Jensen (Virginia Polytechnic Institute)
Diagnostics for normal errors in regression currently utilize ordinary residuals, despite the failure of assumptions validating their use. Case studies here show that such misuse may be critical even in samples of size exceeding currently accepted guidelines. A remedy is to employ recovered errors having the required properties.
 
 
Tree-Based Models for Fitting Stratified Linear Regression Models
William Shannon (Washington University School of Medicine, St. Louis), Maciej Faifer and Cezary Janikow (University of Missouri, St. Louis)
This work generalizes the methods developed in Shannon, Province and Rao (2000) to use recursive partitioning to identify subsets within which simple linear regression models are fit. The method is proposed as an alternative to multivariate regression modelling when the analyst is primarily concerned with the regression of an outcome onto a single predictor and needs to control for other covariates. Splitting rules and pruning methods are programmed in C and linked to 'RPART' to allow a full implementation of this method. Examples using data from the biomedical literature are presented.
 


 
Time Series and Proportional Hazards
Chair: Jennifer I. Pittman
Pennsylvania State University
 
Inference About the Change-Points in a Sequence of Random Vectors
A. K. Gupta (Bowling Green State University) and J. Chen (University of Missouri, Kansas City)
In this talk, I will first review change-point problems. Then the testing and estimation of multiple covariance change-points for a sequence of m-dimensional (m>1) Gaussian random vectors are studied using the Schwarz information criterion (SIC). The unbiased SIC is also obtained, and the asymptotic null distribution of the test statistic is derived. The result is applied to the weekly prices of two stocks (m=2), Exxon and General Dynamics, from 1990 to 1991, and changes are successfully detected.
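The idea of locating a change-point by comparing SIC values can be illustrated in a much simpler setting than the one in the talk. The sketch below assumes a univariate sequence with known mean zero and at most one change in variance; the function names and the parameter counts used in the penalty terms are assumptions, not the authors' multivariate procedure.

```python
import math

def sic_no_change(x):
    """SIC for a single zero-mean Gaussian variance (one free parameter)."""
    n = len(x)
    var = sum(v * v for v in x) / n
    return n * math.log(2 * math.pi) + n * math.log(var) + n + math.log(n)

def sic_change_at(x, k):
    """SIC when the variance changes after the first k observations."""
    n = len(x)
    var1 = sum(v * v for v in x[:k]) / k
    var2 = sum(v * v for v in x[k:]) / (n - k)
    fit = k * math.log(var1) + (n - k) * math.log(var2)
    return n * math.log(2 * math.pi) + fit + n + 2 * math.log(n)

def detect_variance_change(x, min_seg=2):
    """Return the SIC-minimizing change location, or None if no change wins."""
    n = len(x)
    best_k = min(range(min_seg, n - min_seg + 1),
                 key=lambda k: sic_change_at(x, k))
    if sic_change_at(x, best_k) < sic_no_change(x):
        return best_k
    return None

# Hypothetical data: the amplitude jumps from 1 to 5 after 100 observations.
toy = [1.0, -1.0] * 50 + [5.0, -5.0] * 50
change_point = detect_variance_change(toy)
```

Multiple change-points can be handled by applying the same comparison recursively to the segments on either side of a detected change, which is one common strategy rather than a detail given in the abstract.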
 
 
Detecting Change in Variance for Unequally Spaced Time Series
Tze-San Lee and N. Hou
Western Illinois University
To detect a change in the variance of unequally spaced time series, unobservable errors are modeled by the Ornstein-Uhlenbeck process. By applying the principle of likelihood ratio, a statistical test is proposed for detecting the change-point. The asbestos exposure data collected at Lackland Air Force Base from a team of five workers is used to illustrate the proposed test.
 
 
Dynamic Modelling of Spectral Density Power Rhythms in
All-Night Electroencephalograph (EEG) Recordings

Matthew R. Marler, J. Christian Gillin, Arlene Schlosser, and Hanspeter Landolt
University of California, San Diego

When the digitized recordings of short EEG epochs (e.g. 4 seconds) are analysed by fast Fourier transforms, the graphs of the spectral power in some frequency bands show a characteristic shape at night in a subset of people: peaks in power occur right after sleep onset and at approximately 90 minutes thereafter; the heights of the peaks decline throughout the night; and the nadirs occur during episodes of rapid-eye-movement (REM) sleep.

We investigate models in which: the REM sleep is the nonlinear output of a dynamic oscillator (Duffing, van der Pol); the power in some spectral bands (but not others) is a nonlinear function of a compartment that accumulates an abstract "sleep propensity"; a nonlinear output of the REM oscillator stifles discharge from the compartment that accumulates sleep propensity; and specific experimental manipulations (the "tryptophan-free" cocktail) affect some parameters of the model but not others. We display some data that are well described by the model, as well as some that are not well modelled. We assess the degree to which sets of parameter estimates differ between diagnosed groups (normal men versus age-matched depressed men).

 
 
Importance Bootstrap Resampling for Proportional Hazards Regression
Kim-Anh Do (MD Anderson Cancer Center), Bradley M. Broom (Rice University), and Xuemei Wang (MD Anderson Cancer Center)
The use of importance resampling methods to reduce the amount of resampling necessary for the construction of nonparametric bootstrap confidence intervals in the context of survival data with censored observations is investigated. Simulation results showed that relative mean-squared-error (MSE) efficiency gains, when compared to uniform resampling, increased significantly with sample size, were mildly associated with the amount of censoring, but decreased slightly as the number of bootstrap resamples increased. The extra CPU time required for calculating importance resamples was negligible when compared to the large MSE efficiency gains. The method is applied to a real data set of chronic lymphocytic leukemia, and an S-PLUS program is presented for general use. Importance resampling should be used whenever bootstrap methodology is implemented in a survival framework.
 
 
The Introduction of Local Spread as a Measure of Non-Stationarity
Robert A. Hedges and Bruce W. Suter
Air Force Research Laboratory

Establishing measures for local stationarity is an open problem in the field of time-frequency analysis. One promising theoretical measure, known as the spread, provides a means for quantifying potential correlation between signal elements. However, it has been noted that the spread is not robust: the finiteness of the measure is dependent on the smoothness of the signal covariance.

In this paper we take up the issue of implementing such a measure for discrete signals and investigate its robustness. A more robust measure, the local spread, is introduced theoretically and implemented. Issues arising from the finite and discrete nature of the data are discussed. The technique is then applied to several examples and the robustness of the method is investigated.

 
 
Multivariate Time Series Analysis in Principal Component Space
Joseph N. Ladalla
University of Illinois, Springfield
Box and Jenkins (1976) have provided perhaps the most practically convenient methodology to analyze univariate time series data. Multivariate time series analysis can be made just as simple by transforming the given data to its principal components. These principal components are independent under the assumption of normality, though they may be cross-correlated. We fit univariate ARIMA models to each principal component, obtain forecasts and combine the individual models into a multivariate model for the vector of principal components. This model may be converted into a multivariate ARIMA model for the original time series. The residual series of the original time series is subjected to all necessary diagnostic checks including the cross correlation function. Strong theory is developed, and interesting examples are presented.
 
 
Sequential Testing of Proportional Hazards Models
Victor D. Zurkowski
University of Toronto
We consider the distributions of failure times in the context of processes in which instantaneous failure hazards depend on "risk status" covariate processes. In order to assess consistency of observed data with a proportional hazards model (Cox model), we look at the log-likelihood ratio process comparing the Cox model to a non-parametric hazard model. We state properties of the log-likelihood ratio process, and obtain its large sample distribution by means of martingale analysis. We combine various residuals to produce a test that converges to a power one test as the sample size increases. We present examples of implementation of the test on real data.
 


 
Uncertainty Quantification in Complex Models
Chair: Todd L. Graves
Los Alamos National Laboratory
 

Quantifying the Effects of Noise on Biogeochemical Models
Barbara Bailey (University of Illinois, Urbana-Champaign) and Scott Doney (NCAR)
The need to understand the effects of anthropogenic perturbations on the ocean carbon cycle has sparked a new interest in biogeochemical models in recent years. These models are now being coupled with physical ocean circulation models. For this coupling to generate realistic fluxes of nutrients and carbon, the biogeochemical equations need to exhibit realistic dynamics.

We propose an experimental design approach to quantifying the effects of different types of noise on biogeochemical system dynamics. The biogeochemical model we use in our investigation is a model of plankton dynamics and nitrogen cycling. It is a compartmental model (NPZ) consisting of a compartment for nitrogen (N), phytoplankton (P), and zooplankton (Z). The flows or intercompartmental exchanges are modeled as a nonlinear system of three first order differential equations.

The types of noise of interest are both independent and correlated over time. We propose to study the effects of noise on the state variables of the system and parameters. Because the noise is an integral part of the system's dynamics, a nonlinear time series approach is used to quantify the dynamics and predictability of the system. This involves fitting nonlinear models and estimating dynamical systems quantities of interest such as global and local Lyapunov exponents, along with measures of uncertainty for these estimates.
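To make the compartmental structure concrete, the following sketch integrates a generic, noise-free NPZ system of three first-order differential equations. The functional forms (saturating uptake and grazing, linear losses), all parameter values, and the names `npz_rhs`, `rk4_step`, and `PARAMS` are assumptions chosen for illustration; the abstract does not specify the equations used.

```python
def npz_rhs(state, params):
    """Right-hand side of a generic NPZ system: (dN/dt, dP/dt, dZ/dt)."""
    n, p, z = state
    uptake = params["vmax"] * n / (params["k_n"] + n) * p   # N taken up by P
    grazing = params["g"] * p / (params["k_p"] + p) * z     # P grazed by Z
    p_loss = params["m_p"] * p                               # P mortality to N
    z_loss = params["m_z"] * z                               # Z mortality to N
    dp = uptake - grazing - p_loss
    dz = params["gamma"] * grazing - z_loss
    # Unassimilated grazing and all losses are recycled into the N pool,
    # so N + P + Z is conserved by construction.
    dn = -uptake + (1.0 - params["gamma"]) * grazing + p_loss + z_loss
    return (dn, dp, dz)

def rk4_step(state, dt, params):
    """Advance the state by one classical fourth-order Runge-Kutta step."""
    def shifted(base, slope, h):
        return tuple(b + h * s for b, s in zip(base, slope))
    k1 = npz_rhs(state, params)
    k2 = npz_rhs(shifted(state, k1, dt / 2), params)
    k3 = npz_rhs(shifted(state, k2, dt / 2), params)
    k4 = npz_rhs(shifted(state, k3, dt), params)
    return tuple(s + dt / 6 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

def simulate(state, dt, steps, params):
    """Return the list of states visited, including the initial state."""
    path = [state]
    for _ in range(steps):
        state = rk4_step(state, dt, params)
        path.append(state)
    return path

# Hypothetical parameter values and initial state (N, P, Z).
PARAMS = {"vmax": 1.0, "k_n": 0.5, "g": 0.8, "k_p": 0.3,
          "m_p": 0.05, "m_z": 0.1, "gamma": 0.7}
path = simulate((1.0, 0.1, 0.05), dt=0.01, steps=1000, params=PARAMS)
```

Noise of the kinds discussed above could then be introduced by perturbing the state variables or the parameters at each step; this deterministic skeleton only shows the intercompartmental flows.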

 
 
The Selection of the Optimal Structure of the Earth's Model for Forecasting the Main Physical Parameters
Alexander Dmitrievich Gorobets
Sevastopol State Technical University

The Earth's model to be considered is a system of m nonlinear simultaneous equations F(Y,X,Ai) = U, where F is the true but unknown vector of models, Y is an m-vector of dependent (endogenous) variables, X is a k-vector of independent or controlled input (exogenous) variables, Ai is a vector of unknown parameters in the i-th structural equation (i=1..m), and U is a vector of independent random variables with zero mean and variance-covariance matrix S. There is usually some prior information about the regions of possible values for the variables: Y ∈ W1 and X ∈ W2, where W1 and W2 are sets of possible values of the vectors Y and X.

The problem is to select the optimal structure of the simultaneous equation system, that is, the structure that minimizes the expected prediction error of the endogenous variables in some region of interest for making predictions. The loss function for each system of models depends on the prediction error of the vector of endogenous variables Y. The objective of the research is to present a method for selecting the optimal structure of the functions F(Y,X,Ai) and to investigate two criteria that assure quality in the selection strategy, such as the average of the mean square error of prediction.

 
 
Cost-Effective Uncertainty Analysis
Daniela Stoevska-Kojouharov
Monmouth University
Uncertainty analysis is a tool for determining the relative influence of different inputs to a simulation model on the output(s) of interest. In many simulation models, at least some of the inputs are unknown parameters whose values must be estimated, thus inducing unwanted variation in the simulation output(s). A researcher who wishes to improve the simulation model by reducing the variation is not well served by current methods for uncertainty analysis, which rank the inputs based on their relative influence on the output(s), but do not account for any costs associated with improving the estimates. In our work we introduce an algorithm for obtaining the resource allocation that results, for a given level of expenses, in the greatest improvement of a simulation model. The method is called cost-effective uncertainty analysis. Cost-effective uncertainty analysis extends the concept of current methods for uncertainty analysis to include the fiscal cost of improving model precision. The method uses regression meta-modeling to study and make conclusions about the complicated simulation model. The information about the influence of the inputs is combined with information about the cost of obtaining additional observations on each input (as provided by the researcher) and the total amount of money available for improving the simulation model. An algorithm for best money-allocation is developed.
 
 
Statistical Quantification of Prediction Error Associated with Computational Predictions
Robert G. Easterling and Marcey Abate
Sandia National Laboratories
Confidence in computational predictions is enhanced if the potential 'error' in these predictions (the difference between the prediction and nature's outcome in the situation being modeled) can be credibly bounded. Determining such error-limits is a problem that has been solved for relatively simple mathematical models, not the complex, multi-physics codes used to predict, say, system responses in various environments. We develop a conceptual framework for solving the prediction-error-quantification problem, discuss several issues involved, and illustrate proposed methods through a damped spring-mass example and a contact-resistance experimental program. In general, this framework requires designing and conducting a suite of physical experiments and calculations (both ranging from phenomenological to integral levels), then analyzing the results to provide a basis for inferring the uncertainty of a model-prediction of system performance in a particular application, which may be in an environment or configuration that cannot be tested. Problems discussed include: the design and analysis of physical experiments for the purpose of quantifying uncertainty in computational predictions; analysis methods for estimating prediction error at untested points in the parameter space; merging prediction uncertainty results at single phenomenon or component levels to obtain system-level prediction uncertainty; and dimension reduction in order to make the problem tractable.
 
 
The Role of Statistical Methods in Atmospheric Model Intercomparison Projects
Christiane Jablonowski
University of Michigan

Since the early nineties, atmospheric model intercomparison projects have been initiated that reveal interesting agreements and disagreements among weather prediction and climate models. In particular, the role of the dynamics (the so-called dynamical core of a general circulation model) has recently been discussed. This talk gives an overview of current results and presents ideas on how to compare dynamical core experiments.

Special attention is given to easy-to-use statistical methods that help understand and explain model phenomena such as the influence of a varying resolution or diffusion parameter on the model's climate. The statistical framework provides insight into the significance of model variations. In particular, methods like the univariate Student-t and Fisher-F hypothesis tests as well as recurrence analysis and empirical orthogonal function (EOF) analysis are shown. In addition, frequency and wavenumber analyses reveal characteristic model features.

The talk addresses the importance and use of these statistical analysis techniques using dynamical core examples of three different general circulation models. The models involved are two weather prediction models of the German Weather Service and the forecast model IFS of the European Center for Medium-Range Weather Forecasts. All models are different in design and numerics and therefore provide a suitable test bed.

 

Confidence Bands for Nonparametric Curve Estimates
David J. Cummins (Eli Lilly & Company) and Douglas Nychka (NCAR)

Curve estimates with no measure of their accuracy are not very useful. We address the problem of obtaining valid pointwise and simultaneous confidence bands for nonparametric curve estimates. Although recent work on confidence intervals shows promising results, those intervals have coverage probability at the nominal level only when averaged across the design points and not uniformly at all design points. We present methods for obtaining bands which have more uniform coverage and at the same time are thinner than those produced by established methods.

We also present a new approach which avoids the usual practice of inflating pointwise intervals. We proceed by estimating a confidence set that contains the true function with probability $1-\alpha$ using an intersection of two balls in a Sobolev space. The upper and lower bands for the function are the boundary elements of this confidence set. Finding these boundary elements is an optimization problem which is solved using nonlinear programming. We provide an accurate approximation to this computer-intensive method that reduces the computation to be linear in the number of observations. This confidence set approach gains more uniform coverage and, unlike other methods, gives asymmetric bands that are not of a fixed width, adapting to the smoothness and shape of the estimated curve.

 


 
Statistical Tests, Estimation and Stability
Chair: Imola K. Fodor
Lawrence Livermore National Laboratory
 
An Adjusted, Asymmetric Two Sample t-Test
Sandy D. Balkin (Ernst & Young LLP) and Colin Mallows (AT&T Labs-Research)
The Telecommunications Act of 1996 requires that Incumbent Local Exchange Carriers (ILECs) provide, for a fair price, interconnection services to the customers of a Competitive Local Exchange Carrier (CLEC), these services being "...at least equal in quality to [those] provided by the local exchange carrier to itself...". To monitor the ILEC's performance, we need formal statistical tests of compliance. Inspection of data on several performance measures reveals severe positive skewness, violating the assumptions of the standard t-test. Also, since we want to detect not only shifts in mean but also increases in variance, use of the modified t-statistic of Brownie et al. (Biometrics, 1990) is indicated. Permutation testing would be preferable, but is unwieldy. In response to this need, we present adjustments (following the method of Johnson (JASA, 1978)) to the standard and modified two-sample t-tests. We compare the resulting tests with permutation tests.
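For orientation, an exact one-sided permutation test on the difference in means is sketched below. It is not the adjusted t-test of the paper; the choice of test statistic, the assumption that larger values mean worse service, the function name, and the toy data are all illustrative assumptions.

```python
from itertools import combinations

def permutation_pvalue(ilec, clec):
    """Exact one-sided p-value for mean(clec) - mean(ilec) being large.

    Enumerates every way of relabelling the pooled observations, so the
    cost grows combinatorially with the sample sizes.
    """
    pooled = list(ilec) + list(clec)
    n_clec = len(clec)
    n_ilec = len(pooled) - n_clec
    total = sum(pooled)
    observed = sum(clec) / n_clec - sum(ilec) / n_ilec
    at_least_as_extreme = 0
    n_splits = 0
    for idx in combinations(range(len(pooled)), n_clec):
        s = sum(pooled[i] for i in idx)
        diff = s / n_clec - (total - s) / n_ilec
        n_splits += 1
        if diff >= observed - 1e-12:  # tolerance for floating-point ties
            at_least_as_extreme += 1
    return at_least_as_extreme / n_splits

# Hypothetical repair times (hours) for ILEC and CLEC customers.
p_value = permutation_pvalue([1, 2, 3], [10, 11, 12])
```

With realistic sample sizes the number of relabellings is far too large to enumerate, which illustrates why a closed-form adjusted t-test is attractive in practice.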
 
 
The Wilson-Hilferty Transform is Locally Saddlepoint
George R. Terrell
Virginia Polytechnic Institute

In 1931 Wilson and Hilferty discovered a quick, rough method for obtaining p-values for chi-squared statistics. Its usefulness declined with the advent of computers. Recently there has been interest in "saddlepoint" methods for approximate probability calculations. These are fairly general, and can therefore often be adapted to the ever more complicated test statistics that modern statisticians use. However, they do not as readily provide confidence intervals and simulated values as does a Wilson-Hilferty transform.

We will propose a generalized Wilson-Hilferty transform, and establish that it is locally a saddlepoint method. The method therefore combines traditional and modern virtues, and shows promise for difficult inference problems.
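The classical Wilson-Hilferty approximation treats the cube root of a chi-squared variable divided by its degrees of freedom as approximately normal, with mean 1 - 2/(9k) and variance 2/(9k) for k degrees of freedom. A minimal sketch of the resulting p-value calculation follows; the function names are illustrative, and the generalized transform proposed in the talk is not reproduced here.

```python
import math

def normal_upper_tail(z):
    """P(Z > z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def wilson_hilferty_pvalue(x, df):
    """Approximate upper-tail p-value for a chi-squared statistic x."""
    c = 2.0 / (9.0 * df)
    z = ((x / df) ** (1.0 / 3.0) - (1.0 - c)) / math.sqrt(c)
    return normal_upper_tail(z)

# With 2 degrees of freedom the exact upper tail is exp(-x / 2),
# which gives a convenient check on the approximation.
approx = wilson_hilferty_pvalue(5.991, df=2)
exact = math.exp(-5.991 / 2.0)
```

Because the transform is an explicit monotone function, it can be inverted to simulate approximate chi-squared values from normal draws or to obtain interval endpoints, which is the convenience the abstract contrasts with saddlepoint methods.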

 
 
On Numerical Stability of MGF and CF
Jinhyo Kim
Cheju National University, Korea
A series of publications on the numerical computation of the moment generating function (MGF) and the characteristic function (CF) appeared in the 1990s (cf. McCullagh (1994), Waller (1995), Luceño (1997)). The system matrix in a discretized linear system generated by the MGF exhibits a serious backward error due to ill-conditioning. The inherently perturbed error in the MGF makes it hard to implement on a digital computer, whereas the CF does not suffer from this difficulty. These phenomena arise because the real Vandermonde matrix associated with the MGF is extremely susceptible to numerical error whereas the complex Vandermonde matrix associated with the CF is not. This article explains these phenomena using algorithm stability, specifically backward stability. We discuss the origin and properties of the inherently perturbed error in the MGF and the CF. More general arguments are given to show that the CF is superior to the MGF in terms of numerically stable behavior.
 
 
A Test for Symmetry about a Known Median Based on a Runs Statistic
Alex Leonardo Rojas Peña (University of Puerto Rico) and Jimmy A. Corzo Salamanca (National University of Colombia)