ABSTRACTS

INTERFACE 2000
INVITED SESSION ABSTRACTS

Models for the Earth's Atmosphere and Ocean
Organizer: Doug Nychka
(nychka@cgd.ucar.edu)
National Center for Atmospheric Research
 
Air-Sea Interaction in the Labrador Sea: Deep Water Formation and Climate
Ralph Milliff
National Center for Atmospheric Research
The large-scale mean north-to-south overturning circulation in the Atlantic Ocean transports warm water poleward at the surface (e.g., the Gulf Stream and North Atlantic currents), and cold water equatorward at depth (e.g., the Deep Western Boundary Current system). This is an important branch of the so-called ocean conveyor-belt circulation that provides a conceptual model of the global ocean as a sink for atmospheric concentrations of greenhouse gases, e.g., carbon dioxide. The ocean deep convection process at high-latitudes is the energetic downward component of the conveyor-belt model. Ocean deep water is said to be formed in this process wherein oceanic parcels recently in contact with the atmosphere are sequestered at depth, and subsequently isolated from the surface over climate timescales, e.g., O(1000 yrs). The Labrador Sea is one of a few locations in the world ocean where deep convection occurs. Ocean deep convection is driven there by vigorous air-sea exchanges of heat, momentum, moisture, and mechanical energy associated with the passages of intense storms in winter. An extensive observational program in oceanography and meteorology was implemented in the Labrador Sea region with a focus on the ocean deep convection process for the 1996-97 winter season. We will review the large observational dataset that is emerging from that campaign. In addition we will introduce the growing satellite observational datasets of sea-surface winds and sea-surface heights in the Labrador Sea region. This talk sets the stage for a description in the following presentation of Bayesian Hierarchical Model (BHM) approaches to geophysical problems, including an example of a BHM approach to ocean dynamics leading to ocean deep convection and deep water formation in the Labrador Sea.
 
 
Hierarchical, Space-Time Models: Physically Based Models for Combining Geophysical Data
L. Mark Berliner
Ohio State University
Phenomena studied in the geophysical sciences are high-dimensional, interrelated processes distributed in space and time. Modeling and prediction of such processes typically require the combination of both scientific understanding, usually reflected in physical models, and observations. However, the physical models are often very complex and subject to a variety of uncertainties. Further, though very large to massive datasets are often available, they are typically composed of disparate types of observations and, almost paradoxically, cover a small portion of the processes of interest. The hierarchical Bayesian viewpoint is suggested to provide a framework for combining scientific reasoning and observational data in a fashion that quantitatively accounts for our uncertainty. I review a basic strategy for developing such hierarchical models, indicate the relation between the Bayesian models and their analysis via Markov chain Monte Carlo, and review some preliminary examples motivated by Ralph Milliff's talk, which precedes this one.
 
 
Computational Advances in Gibbs Sampling of Massive Spatio-Temporal Models
Timothy Hoar
National Center for Atmospheric Research
Christopher K. Wikle
University of Missouri, Columbia
The analysis of the prodigious amounts of data generated by orbiting platforms requires massive spatio-temporal models. Even for limited-domain cases, traditional covariance-based space-time statistical methods are generally not tractable. We explore programming methodologies for making hierarchical Bayesian simulations for such models (typically requiring a large number of iterations) a reality for very large datasets and spatio-temporal domains. The hierarchical model is fully described by Wikle et al. (JASA, in review). Our specific interest is the spatio-temporal prediction of the surface winds over the equatorial Pacific using remotely sensed satellite wind observations and the output of a deterministic weather model.
 
 
Stochastic Parameterizations in General Circulation Models for the Atmosphere: Cloud Motion
Rachel Buchberger
National Center for Atmospheric Research and Colorado State University
Current models of the earth's atmosphere (General Circulation Models) rely on deterministic rules to decide the cloud amounts in a grid box. Clouds are an important feature of these models because they influence how incoming and reflected radiation interacts with the atmosphere. Unfortunately, such models do not capture the motion of the cloud field well. An alternative is to consider cloud amounts as a spatial and temporal process that evolves in a nonlinear and stochastic fashion over time. In this project, the form of this process is estimated from observational data and is found to be a good description of the measured cloud fields. The statistical models are non-Gaussian and use a neural network form to represent the autoregressive relationship between current cloud amounts and those in the next time period.
 
 


 
Critical Infrastructure Modeling
Organizer: Sallie Keller-McNulty
(sallie@lanl.gov)
Los Alamos National Laboratory
 
Critical Infrastructure System of Systems Simulation
J. Darrell Morgeson

Los Alamos National Laboratory
For the past 10 years, Los Alamos has pursued an aggressive simulation R&D program focused on the nation's critical infrastructure. The products of the program are described below together with their respective sponsors and funding. With internal discretionary funding, the Lab began to integrate these efforts into an overall "system-of-systems" simulation environment last year with a view toward addressing near-, mid-, and long-term policy, investment, and operational issues for use by both public and private sectors. The scale and scope of these simulations extend the state of the art in very large-scale complex systems simulations and computation (i.e., human-based interactions on the order of 10^7 to 10^12 interactions per second). The combined simulation and analysis system will effectively address such complex issues as convergence, electrical grid restructuring, energy security and reliability, interdependencies among critical infrastructure systems, environmental impacts, response to natural and manmade disasters, and others. The entire simulation system is designed to run on multiple hardware platforms including ASCI architectures. An S&T roadmap has been developed to show the evolution of this program over the next 10 years for both applied products and advances in the science that forms the foundation for these tools and their use. To date, over $50M has been spent on these projects, with another $42M planned through 2004, the majority provided by other federal agencies.
 
 
Generation and Measurement of Large Dynamical Systems
Chris Barrett
Los Alamos National Laboratory
We introduce a new mathematical object, a Sequential Dynamical System (SDS), and explore the insights we obtain in relation to computer simulation of large, composed, dynamical systems, their measurement and interpretation, and a range of issues surrounding validity of a simulation as used in decision making and analysis of such systems. In large infrastructure design, analysis and policy questions, representation using computers of present and/or future systems is desirable in many respects. However, firm theoretical foundations for simulation and its appropriate use are lacking, which limits its credible use. In particular, the following issues impose a need for rigorous foundations: (1) the short- and long-term dynamics of these systems are complicated and characterized by massive interaction among large numbers of subsystems; (2) a potentially endless data collection requirement warrants care in choice of approaches to represent the system; (3) knowledge of piece-parts usually exceeds understanding of the composed system; and (4) a broad range of issues surrounds any meaningful use of the term "validity" as it might apply to a computer simulation used as a model to which decision-making processes refer. We will examine these issues from the perspective of our emerging theoretical foundations for simulations as well as applications in transport and communications infrastructure analysis.
 
 
Statistics for Complex Computer Models: Beyond Input-Output Analysis
Katherine Campbell and Richard Beckman
Los Alamos National Laboratory
At first glance, the simulation models described by the other speakers in this session may appear to be radical departures from more familiar types of numerical models of dynamical systems, such as models of oil reservoirs or accelerators. A closer look at the structure of such models, however, reveals more similarities than differences. Exploring these common features, we find a great variety of statistical opportunities, from the stochastic simulation of incompletely known input and boundary conditions to the design of experiments to optimize computational algorithms, from model calibration to model assessment. This talk will include examples from transportation simulation (an infrastructure model), vadose zone flow simulation at Yucca Mountain (the proposed site of a geological repository for high level nuclear waste), and climate modeling.
 
 
Cyber Crimes and Counter Measures
Matthias Schonlau
RAND
I will give an overview of several aspects related to Cyber Crime. Topics include private company concerns and security measures, academic efforts in intrusion detection, and the position of the government. I will also discuss the work of the Cyber Assurance unit at RAND.
 
 
Statistical Visualization Methods in Intrusion Detection
Jeffrey L. Solka and David J. Marchette
Naval Surface Warfare Center
Bradley Wallet
Chroma, Inc.
This paper will describe some of our recent efforts in the application of modern statistical visualization methodologies to the problem of the detection of intruders on computer systems. We will illustrate the application of data imaging to various problems within the intrusion detection arena. We will also cover some new visualization frameworks which we have developed to aid the human analyst in their interpretation of network-based intrusion detection information. This work has been performed in support of the Secondary Heuristic Analysis for Defensive On-line Warfare (SHADOW) intrusion detection system. This is an operational intrusion detection system that has been deployed at numerous facilities worldwide. Some rudimentary background material on the SHADOW system will also be provided.
 
 
Hidden Markov Modeling of Freeway Traffic Status
Jaimyoung Kwon, Peter Bickel, and John Rice
UC Berkeley
We will discuss hidden Markov modeling of the velocity field observed from freeway stretches. Two slightly different models are proposed, each with a different flavor and computational difficulty. The models can be used for simulation and prediction. Due to the dimensionality of the problem, the traditional algorithm that calculates the likelihood exactly is not applicable. As an alternative, Monte Carlo EM (MCEM), possibly using sequential importance sampling (SIS), is used. The method of iterative conditional modes (ICM) will also be mentioned, though it is not consistent in general. Performance of each method and various computational issues will be discussed. Empirical work includes analysis of data from the I-880 and I-5 interstate highways.
 


 
Information Technology and Federal Statistics
Organizer: Cathryn Dippo
(dippoc@ore.psb.bls.gov)
Bureau of Labor Statistics
 
Wrapping and Mediating Survey Data
Amarnath Gupta, Chaitanya Baru, Richard Marciano, and Ilya Zaslavsky
University of California, San Diego
The task of information mediation is to enable a user to query across a number of information sources as if they were a single integrated source. To accomplish this integration, the data from the sources are transformed to a common representation through a process called wrapping. We show how such information integration can be done on survey data when the data are presented in the DDI format and information integration is achieved in the MIX framework developed at UCSD. We point out how statistical expert knowledge must be exploited to make the mediation meaningful for users. We also demonstrate how a survey analysis tool, called Sociology Workbench, uses integrated survey information to perform its analysis.
 
 
Statistical Information Seeking and System Design
Carol Hert
Syracuse University
Statistical websites have made it possible for a huge variety of users to access statistical data. Understanding how these users locate and use data, what expectations they have for use, and what they understand about data will enable website designers to provide information and tools that enable better access. This paper reports on a series of user investigations related to United States Federal Statistical Agency websites, highlighting findings to date and providing a perspective on how to understand and support users via system design.
 
 
The Role of Ontologies in Statistical Information Seeking
Ed Hovy
USC Information Sciences Institute
Judith Klavans
Columbia University
The old saying "there’s nothing like more data" is only true if you can successfully access the data you need, not get lost in it. The proliferation of statistical databases in U.S. Government Agencies has not been accompanied by a single widely used access system. To try to facilitate terminology standardization and cross-database access and linkup, we are connecting databases to a large 100,000-node taxonomy of ‘concepts’ called SENSUS. SENSUS will be used as the basis of a multi-database query planning and access system. This paper describes SENSUS, methods of linking other terminology systems to it, and the work now being done with some statistical databases.
 
 
Statistical Information Seeking and System Design
Carol Hert
Syracuse University
This poster complements the presentation of the same title. It will graphically depict the connections among several streams of research all concerned with improving system design for statistical information seeking. The streams included are research about users (and non-users), users' interactions with statistical information seeking tools (including metadata and search engines), interface design, metric and tool development, and organizational impact research. The poster will indicate how these projects contribute to our understanding of user behavior and how to support it on web-based systems.
 
 
The Role of Ontologies in Statistical Information Seeking
Ed Hovy
USC Information Sciences Institute
A poster accompanying the paper by Hovy and Klavans provides details about the SENSUS ontology and about the process of analysis and taxonomization required to integrate a database or collection of text with an ontology.
 
 

The Data Documentation Initiative: Current Status of an Attempt to Specify an XML DTD for Empirical Social Science Documentation

Peter Joftis
University of Michigan

In early 1995, aided by a grant from the National Science Foundation, the Inter-university Consortium for Political and Social Research formed a committee which now numbers approximately 20 stakeholders in the social science research and archiving process. The goal of the committee was to develop a specification for the documentation of empirical social science data collections. The project is known as the Data Documentation Initiative (DDI).

The group decided that the best format for the specification was an XML (Extensible Markup Language) based Document Type Definition (DTD). A 13-site beta-test of an initial DDI DTD was completed in mid-1999. Comments resulting from the beta-test were reviewed and many of the suggestions have been implemented. Version 1 of the DTD will be released in early 2000.

This poster session will present the goals of the DDI project. The structure of the Version 1 DTD will be presented and explained. The types of data collections that may be marked up using the DTD will be discussed. Issues held for Version 2 will be presented. Finally, some thoughts on the role that other parts of the XML suite (XML Schema, XML Data, XLink, XPointer) might play will be presented.

 


 
CSNA Sponsored Session:
Applications of Clustering and Classification to Large Datasets

Organizer: William Shannon
(shannon@osler.wustl.edu)
Washington University School of Medicine
 
Inner-Loop Statistics in Automated Scientific Discovery from Massive Datasets
Andrew Moore
Carnegie Mellon University and Schenley Park Research, Inc.

Intensive statistical analysis of massive data sources ("data mining") has been embraced as one of the final areas with a need for massive computation beyond that available on a $2000 computer or $200 videogame. We begin this talk with two examples of software, instead of hardware, giving 1000-fold speedups over traditional implementations of statistical algorithms for prediction, density estimation, and clustering.

We then pause to examine directions in which these software solutions seemed blocked when faced with Physics, Biology and commercial scientific data discovery problems. The primary blocks were a curse of dimensionality and limitations on machine main memories. This is followed by four examples of new pieces of research that circumvent these barriers: lazy cached sufficient statistics, exact accelerated k-means, multiresolution ball-trees for very high dimensional real-valued data, and filament identifiers.

We then reveal the reason for our new-found respect for super-computation: when an algorithm you previously ran overnight executes in seconds, you find yourself wanting to run it ten thousand times. We show the impact that being able to run intensive statistics as an inner loop has had on our analysis of cosmology data (preliminary data from the Sloan Digital Sky Survey) and biotoxin identification, where desirable but hopelessly extravagant operations such as model selection, bootstrapping, backfitting, randomization and graphical model design now become somewhat non-hopeless.

Joint work with Andy Connolly (U Pitt Physics), Artur Dubrawski (Schenley Park Research), Geoff Gordon (Auton Lab), Paul Komarek (Auton Lab), Bob Nichol (CMU Physics), Dan Pelleg (Auton Lab) and Larry Wasserman (CMU Statistics).

 

 
Current Approaches to Gene Chip Data Analysis
Daniel Weaver
Genomica Corporation
Information from the Human Genome Project is allowing scientists to perform systematic experiments and gather data in unprecedented amounts. This talk will review the mathematical classification techniques being applied to gene expression data and will frame the scientific questions that such data can address. Current techniques being applied to such data range from simple average-linkage clustering to self-organizing maps. While these techniques are sufficient for the existing data volumes, they are unlikely to work efficiently on the large data sets that will be generated in the coming years. Some key scientific questions that will be raised include: What constitutes a statistically significant, diagnostic gene expression pattern? How are gene expression data used to functionally classify genes? How can large volumes of gene expression data be used to predict the underlying gene expression control mechanisms? No biological background is needed; the relevant biological concepts will be described.
 
 
Preliminary Studies on Combining Wavelet and Cluster Analysis for Gene Chip Data

William Shannon
Washington University School of Medicine
It is now recognized that examining patterns of gene expression (i.e., a gene's state of activity) in patients can assist health professionals in detecting, diagnosing, and treating human disease. Recently, a new technology named the nucleic acid array or 'Gene Chip' allows clinicians to measure these patterns and begin relating them to clinical diagnosis and outcomes.

In this poster we present an overview of the effort underway at Washington University School of Medicine in St. Louis to develop and apply microarray technology to basic and clinical research in surgery, molecular microbiology, genetics, and cancer. Attention is focused on the bioinformatic and biostatistical challenges we face and on early efforts to bring these under control. In addition, we will discuss a novel application of wavelet transformations to gene chip data that we are currently exploring.
 
 
The UC Irvine Knowledge Discovery in Databases Archive
Stephen D. Bay, Dennis Kibler, Michael J. Pazzani, and Padhraic Smyth

University of California, Irvine
Advances in data collection and storage have allowed organizations to obtain massive, complex and heterogeneous databases, which have stymied traditional methods of analysis. This has led to the development of new analytical tools that often combine techniques from a variety of fields such as statistics, computer science and mathematics to extract meaningful knowledge from the data. To support research in this area, UC Irvine has created the UCI Knowledge Discovery in Databases (KDD) Archive (http://kdd.ics.uci.edu/). This is a new online repository of large and complex databases which encompass a wide variety of data types, analysis tasks and application areas. Our goal is to foster research in knowledge discovery by making these databases publicly available. The archive is supported by the Information and Data Management Program at the National Science Foundation, and is intended to expand the current UCI Machine Learning Database Repository to databases that are orders of magnitude larger and more complex.
 


 
Best of the Journal of Computational and Graphical Statistics:
New Developments in EM

Organizer: Andreas Buja
(andreas@research.att.com)
AT&T
 
An Interval Analysis Approach to the EM Algorithm
Kevin Wright
Pioneer Hi-Bred International
William J. Kennedy
Iowa State University
The EM algorithm is widely used in incomplete-data problems (and some complete-data problems) for parameter estimation. One limitation of the EM algorithm is that upon termination, it is not always near a global optimum. As reported by Wu, when several stationary points exist, convergence to a particular stationary point depends on the choice of starting point. Furthermore, convergence to a saddle point or local minimum is also possible. In the EM algorithm, although the loglikelihood is unknown, an interval containing the gradient of the EM q function can be computed at individual points using interval analysis methods. By using interval analysis to enclose the gradient of the EM q function (and, consequently, the loglikelihood), an algorithm is developed which is able to locate all stationary points of the loglikelihood within any designated region of the parameter space. The algorithm is applied to several examples. In one example involving the t distribution, the algorithm successfully locates (all) seven stationary points of the loglikelihood.
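The exclusion idea behind such a search can be illustrated on a one-dimensional toy problem. The sketch below is not the authors' algorithm: it assumes a hypothetical objective f(x) = x^4/4 - x^2/2 with gradient g(x) = x^3 - x, uses a minimal hand-written `Interval` class, and ignores the outward rounding that a rigorous interval library would apply. It evaluates the gradient over an interval, discards any interval whose enclosure cannot contain zero, and bisects the rest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed interval [lo, hi] with naive (non-outward-rounded) arithmetic."""
    lo: float
    hi: float

    def __sub__(self, other):
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))

    def contains_zero(self):
        return self.lo <= 0.0 <= self.hi


def gradient_enclosure(x):
    """Enclosure of the toy gradient g(x) = x^3 - x over the interval x."""
    return x * x * x - x


def stationary_boxes(lo, hi, tol=1e-6):
    """Bisect [lo, hi], keeping only small boxes that may hold a zero of g."""
    stack = [Interval(lo, hi)]
    kept = []
    while stack:
        box = stack.pop()
        if not gradient_enclosure(box).contains_zero():
            continue  # g cannot vanish anywhere in this box: discard it
        if box.hi - box.lo < tol:
            kept.append(box)
            continue
        mid = 0.5 * (box.lo + box.hi)
        stack.append(Interval(box.lo, mid))
        stack.append(Interval(mid, box.hi))
    return kept


def merge_boxes(boxes, gap=1e-5):
    """Merge neighbouring boxes that surround the same stationary point."""
    merged = []
    for box in sorted(boxes, key=lambda b: b.lo):
        if merged and box.lo - merged[-1].hi <= gap:
            last = merged[-1]
            merged[-1] = Interval(last.lo, max(last.hi, box.hi))
        else:
            merged.append(box)
    return merged


def find_stationary_points(lo, hi):
    """Midpoints of the merged boxes, one per candidate stationary point."""
    merged = merge_boxes(stationary_boxes(lo, hi))
    return [0.5 * (b.lo + b.hi) for b in merged]
```

Calling `find_stationary_points(-2.0, 2.0)` on this toy gradient returns three candidates close to -1, 0, and 1. Because a box is discarded only when the enclosure proves the gradient is nonzero on it, no stationary point in the search region is lost, which is the property the abstract relies on.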
 
Fitting Mixed-Effects Models Using Efficient EM-Type Algorithms
David A. van Dyk
Harvard University
In recent years numerous advances in EM methodology have led to algorithms which can be very efficient when compared with both their EM predecessors and other numerical methods (e.g., algorithms based on Newton-Raphson). In this paper we focus on mixed-effects models and combine several of these new methods to develop a set of mode-finding algorithms which are both fast and more reliable than standard algorithms such as proc mixed in SAS. We present efficient algorithms for Maximum Likelihood, Restricted Maximum Likelihood, and computing posterior modes. These algorithms are not only useful in their own right, but also illustrate how parameter expansion, conditional data augmentation, and ECME can be used in conjunction to form efficient algorithms. In particular, we illustrate a difficulty in using the typically very efficient parameter-expanded EM algorithm for posterior calculations, but show how algorithms based on conditional data augmentation can be used. We also present a result that extends Hobert and Casella's (JASA, 1996) result on the propriety of the posterior for the mixed-effects model under an improper prior, an important concern in Bayesian analysis. Finally, we show how similar methods applied to the Data Augmentation algorithm can lead to very efficient stochastic algorithms for posterior sampling.
 


 
Characterizing Large Complex Natural Systems and Beyond
Organizer: Lorraine Denby
(ld@research.bell-labs.com)
Bell Labs--Lucent Technologies
 
Statistics and Models for Complex Systems in Engineering and Biology
John Doyle
Caltech
A great deal of attention has been given recently to describing features of complex systems in terms such as self-similarity, power laws, entropy, phase transitions, criticality, fractals, chaos, and so on. This talk will focus on the fascinating statistical properties of web and internet traffic, and relate these to power law statistics in other domains, such as forest fires, power outages, natural and man-made disasters, and species extinction. While it is now widely accepted that the commonly assumed Poisson traffic models poorly describe Internet traffic, it remains to be seen if these insights will lead to new approaches to network protocol design, which remains largely ad hoc. We critique the popular explanations from statistical physics and offer some novel explanations for the origins of power laws in terms of generalizations of source coding for data compression. The implications for future convergent, ubiquitous networking will also be discussed briefly, as well as general issues of more rigorous approaches to analysis and robust design of complex multiscale systems in engineering and biology.
 


 
A Tutorial on Inverse Theory
Organizer: Vicki Lancaster
(vlancast@neptuneandco.com)
Neptune and Company, Inc.
 
A Tutorial on Statistical Inverse Theory
Luis Tenorio
Colorado School of Mines

Ill-posed inverse problems arise when we try to recover information about a process from partial, indirect, noisy observations. These problems are common in physical sciences like geophysics, astronomy and astrophysics, where we have to rely on indirect measurements of processes in space or underneath the Earth's surface.

Through examples we will illustrate the questions that arise in ill-posed inverse problems and present some basic methods to determine a meaningful, stable solution. These methods include the Backus-Gilbert method, the method of regularization, singular value decomposition and wavelet estimation. We will also discuss methods to estimate the variability, and the bias of the inversion estimates, as well as minimax estimates of the mean square error.

 
 
Velocity Estimation in Exploration Geophysics, a Bootstrap Approach
Alberto Villarreal
Colorado School of Mines

The Seismic Reflection Experiment is one of the most important tools in geophysical exploration. This method consists of sending artificially generated seismic waves into the earth and recording them once they are reflected back to the surface by structural irregularities in the subsurface.

In a Seismic Reflection Experiment, the most important parameter of interest is the velocity at which waves travel through different media in the earth. The physical velocities in the subsurface covered by the seismic experiment determine what the recorded data will look like, because seismic data consist of travel-time measurements (the time it takes for a wave to traverse the path source-reflector-receiver), and travel time depends on the wave velocity in the earth. Therefore, we are interested in the inverse problem of estimating the medium velocities from data. These velocity estimates give important information about the structure and composition of the subsoil.

We use bootstrap resampling methods to improve and automate velocity estimation from seismic data. In the bootstrap approach, data samples are created by resampling the original seismic data. Next, an optimization procedure is used to obtain velocity estimates for each sample, and the variability of different velocity estimates is used to compute standard errors. This procedure is repeated iteratively with different trial velocities. The velocity estimate with the smallest error is selected. This is a computationally intensive method but can be efficiently implemented in parallel. Besides automating the velocity analysis, this method may be used to estimate errors of seismic velocities, which are essential for subsequent steps in the data processing sequence.
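The resampling step can be sketched on a deliberately simplified problem. The code below is an illustration under stated assumptions, not the authors' procedure: it assumes straight-path travel times t = d / v plus Gaussian noise, a least-squares slowness estimate through the origin in place of the abstract's optimization procedure, and synthetic data. All function names and parameter values are hypothetical, and the outer loop over trial velocities is omitted.

```python
import random
import statistics


def estimate_velocity(distances, times):
    """Least-squares fit of t = s * d through the origin; returns v = 1 / s."""
    slowness = (sum(d * t for d, t in zip(distances, times))
                / sum(d * d for d in distances))
    return 1.0 / slowness


def bootstrap_velocity(distances, times, n_boot=500, seed=0):
    """Return (velocity estimate, bootstrap standard error)."""
    rng = random.Random(seed)
    n = len(distances)
    estimates = []
    for _ in range(n_boot):
        # Resample (distance, time) pairs with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(estimate_velocity([distances[i] for i in idx],
                                           [times[i] for i in idx]))
    # Spread of the resampled estimates approximates the standard error.
    return estimate_velocity(distances, times), statistics.stdev(estimates)


def synthetic_survey(true_velocity=2000.0, n=200, noise_sd=0.01, seed=1):
    """Synthetic path lengths (m) and noisy travel times (s); values assumed."""
    rng = random.Random(seed)
    distances = [rng.uniform(500.0, 3000.0) for _ in range(n)]
    times = [d / true_velocity + rng.gauss(0.0, noise_sd) for d in distances]
    return distances, times
```

With `distances, times = synthetic_survey()`, calling `bootstrap_velocity(distances, times)` returns an estimate near the assumed 2000 m/s together with a standard error. Each bootstrap replicate is independent of the others, which is why the method parallelizes well, as the abstract notes.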

 
 
Generalizing Wiener-Levinson Deconvolution
Luis Tenorio and Alberto Villarreal
Colorado School of Mines

In reflection seismology a trace is modeled as a convolution of a seismic pulse with a reflectivity sequence that encodes information about the layering in the subsurface. To deconvolve the trace means to remove the blurring effect of the pulse to obtain an estimate of the reflectivity.

A usual assumption in deconvolution is that the reflectivity is a white random process. But, most of the time this assumption is not appropriate. We use Gaussian mixtures to generalize Wiener-Levinson deconvolution and obtain a procedure that is more robust to nonstationarities in the reflectivity and to correlation structure in noise.

 
 
Estimating the Influence of Random Noise on Measured Travel Times
Albena Mateeva
Colorado School of Mines
Tomography is used in seismic exploration for velocity-model building. Tomographic data consist of travel times of waves excited near the earth's surface which, after penetrating to a certain depth, have been reflected back by some geological inhomogeneity. Like any inversion procedure, tomography requires good knowledge of data uncertainties. Errors in measured travel times are introduced by a variety of factors, but one of them is always present: random noise. This paper discusses its contribution to travel time uncertainty. Two objectives were set: first, to understand the interaction between signal and random noise that leads to uncertainty in travel time; second, to estimate that uncertainty from seismic data. The latter is a hard statistical problem that the seismic industry tends to ignore rather than tackle. A practical solution to it should be of great interest not only to geophysicists but also to anybody dealing with travel times of band-limited signals.
 


 
Defining, Measuring, and Analyzing Quality of Care: Statistical and Computational Challenges

Organizer: Sally C. Morton
(Sally_Morton@rand.org)
RAND

The HIV Performance Measurement Perspective from NYC, Framing the Clinical Context
Bruce Agins
New York State Department of Health

The NYSDOH AIDS Institute is responsible for systematic monitoring of the quality of medical care provided to HIV-infected individuals in NYS. Measurement of quality is based on indicators that are linked to optimal clinical outcomes, such as performance of PAP smears, PPD screening and use of antiretroviral therapy. Implemented in 1992, this initiative incorporates continuous quality improvement (CQI) techniques to stimulate health care providers to build and sustain quality within their organizations. Record abstraction at over 100 facilities is conducted annually, using standardized data collection forms. Through a new initiative, HIVQUAL, providers submit data directly using a software program developed in collaboration with HRSA. A subset of these records is reviewed to validate information. Results are presented as aggregated facility-specific data, comparatively and longitudinally to display historical trends. HIV performance data are also reported to the public. Accuracy and precision of data are paramount to the integrity and success of performance improvement efforts led by public health agencies. Impact at the state level ranges from individual agency actions to support for activities designed to improve care. This program represents an example of collaboration between a public health agency, clinicians and statisticians to incorporate sophisticated statistical techniques into routine CQI initiatives.
 
 
Measuring and Improving Quality in Managed Care:
Some Statistical and Computing Issues

Randall K. Spoeri
HIP Health Plans, New York
With the rapid expansion of managed care, questions have been raised about the quality of care being delivered. In response to these concerns, quality management efforts have relied heavily on the measurement of performance. Associated with these measurement and improvement activities are various statistical and computing considerations. This talk will provide an overview of a number of these considerations, including measure definitions, data availability and quality, risk adjustment, analytical/interventional use of results, and predictive modeling. Thoughts about the future will conclude the talk.
 
 
Building Aggregate Health Care Quality Scales
John Adams
RAND
This talk will consider the emerging need to summarize information as multiple measures of health care quality proliferate, e.g., HEDIS, CAHPS and others. I will discuss various ways of building aggregate scales of health care quality and examine some of the technical issues that must be overcome to produce reliable aggregate information. These issues include case-mix adjustment, appropriate standard error calculations, developing weights for measures, and the communication of statistical uncertainty to the lay audience. The focus is on aggregate scales to profile HMO performance. Examples will be drawn from the development of the Combined Autos/UAW Reporting System (CARS), an HMO report card that is mailed to the employees of the big three auto makers during open enrollment.
 
 
Simulation in Models of Health Care Quality
Karl Heiner
SUNY at New Paltz
The AIDS Institute of the New York State Department of Health monitors the quality of care delivered by hospitals, community health centers and drug treatment centers to individuals infected with HIV. A medical peer review organization visits these facilities each year and applies a number of protocols reflecting the standard of care to random samples of medical records. Bayesian techniques are used to model aspects of the quality of care. For standards and indicators that have been employed for a number of years, conjugate analysis and simple dynamic models are useful when measuring facility-specific performance. As standards and measures of quality of care evolve, or when comparisons among groups are required, more complicated models are indicated. In order to make inferences from the models, simulation methods are applied and trellis graphics are used to display overall and among-facility trends.
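
The conjugate analysis mentioned above can be illustrated with a beta-binomial update for a single facility and a single indicator. This is only a minimal sketch: the class name, the uniform prior, and the counts are hypothetical and are not taken from the AIDS Institute data or the author's models.

```python
from dataclasses import dataclass


@dataclass
class BetaPosterior:
    """Beta(a, b) distribution for a facility's compliance rate."""
    a: float
    b: float

    def update(self, passed: int, reviewed: int) -> "BetaPosterior":
        # Conjugacy: a Beta prior with binomial data gives a Beta posterior.
        return BetaPosterior(self.a + passed, self.b + reviewed - passed)

    @property
    def mean(self) -> float:
        return self.a / (self.a + self.b)


# Hypothetical numbers, for illustration only.
prior = BetaPosterior(1.0, 1.0)              # uniform prior on the rate
post = prior.update(passed=42, reviewed=50)  # 42 of 50 sampled records meet the standard
```

Feeding each year's posterior back in as the next year's prior gives a very simple form of sequential updating; the dynamic models in the talk would presumably also let the underlying rate drift over time.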
 
 

Casemix Adjustment of the National CAHPS Benchmarking Data 1.0

Marc N. Elliott
RAND
Richard Swartz
Rice University
John Adams
RAND
Ron D. Hays
UCLA
Casemix adjustment of consumer ratings can provide more valid plan comparisons than unadjusted ratings by controlling for factors related to systematic response biases to questions about health care. Adjusted data are therefore potentially more appropriate for comparing the quality of care delivered. If members of a particular demographic group are less inclined than others to assign poor ratings to bad care, and members of this group are disproportionately enrolled in some plans, casemix adjustment for this systematic bias is useful when comparing assessments of different plans.

The CAHPS Implementation Handbook recommends adjusting for age and health status when comparing consumer assessments of health plans. Younger people and those in poorer health tend to report more problems and less positive evaluations of health care than do older people and those in better health. The current CAHPS approach uses a "health plan fixed effect" model to estimate the effects of casemix adjusters, which does not assume that all plans have equal true mean ratings (unlike simple casemix models). We also consider models that test whether casemix adjusters have different effects within different plans. We present graphical displays that compare the various casemix models considered.
 
 
Imputing Treatment Differences in Meta-Analyses with Missing Data
I. Elaine Allen
Babson College
Ingram Olkin
Stanford University
The problem of how to handle missing data occurs frequently in meta-analyses of clinical studies. Few cancer studies are randomized controlled trials and many include only one treatment arm. Often no comparative drug trials exist between competitive treatments in other therapeutic areas. Several techniques for imputing these differences will be described with examples from published meta-analyses and software imputation comparisons. These methods will include Bayesian hierarchical modeling to estimate missing random effects, multilevel mixed models to estimate missing treatment differences, and a proposed method to test the difference between active treatments when only placebo-controlled studies exist.
 
 


 

The Use of Modeling and Statistics in Defense Analysis
Organizer: Nancy Spruill
(spruilnl@acq.osd.mil)
Office of the Under Secretary of Defense (Acquisition and Technology)

 
Using Advanced Modeling and Simulation Technologies and Techniques in the Analysis of Defense
Colonel William Crain
Defense Modeling and Simulation Office
Modeling and Simulation (M&S) has become a very powerful and cost effective tool for analyzing the technical performance and warfighting utility of technologies being developed in Defense Science and Technology (S&T) programs as part of the DoD M&S strategy. An increased use of M&S-based experimentation is being seen in Defense laboratories, engineering centers, operational warfighting experiments (to address S&T impacts on force structure, doctrine, tactics, etc.), Advanced Concept Technology Demonstrations, and other aspects of the S&T community. This increase has built on many "success stories" associated with M&S-based analysis. To maximize the value of using advanced modeling and simulation technologies and techniques in the S&T community, it is important to understand the differences between models and simulations, as well as their appropriate use in experimentation. Additionally, it is important to become aware of the new technologies and applications being developed in the M&S community and their benefits to the acquisition, training and analysis communities.
 
 
The Joint Warfare System (JWARS):
A Tool to Improve Combat Simulation for the Information Age Military

Lieutenant Colonel Daniel Maxwell
JWARS, Office of the Secretary of Defense
The Joint Warfare System (JWARS) is a campaign-level simulation of military operations that is being developed under contract by the Office of the Secretary of Defense (OSD) for use by OSD, Joint Staff, Services, and Warfighting Commands. The motivation for JWARS is to provide insight into cause and effect relationships that influence the success or failure of military forces, and ultimately to support critical operational planning and multi-billion dollar resource allocation decisions. JWARS is a closed form simulation. It is mixed-mode, with both stochastic and deterministic components; key uncertainties are reflected as stochastic events that potentially cause significantly different outcomes. JWARS also represents explicitly the effects that differences in information availability may have on operational success, as well as explicitly representing most critical combat and logistical systems. This highly resolved view of warfare provides a holistic view of combat operations that has been previously unachievable. This paper describes the JWARS design, modeling concepts, and sample model results as they might be presented to a senior leader.
 
 
Improving Inventory Performance by "Rightsizing" Inventory Reorder Points
Ron Fricker
RAND
This paper develops and applies a technique for "rightsizing" inventory levels. We demonstrate its power on a Marine Corps inventory consisting of $24 million of stock for 13,000 different types of items. We show that our "rightsized" inventory works significantly better than the existing inventory: we achieve equivalent performance at one-half to one-third the cost. Conversely, we demonstrate significant improvement in fill rates and other inventory performance measures for an inventory of the same cost. The computationally intensive method, based on the bootstrap, is only now becoming possible to apply with the advent of today's powerful desktop computers. It is an alternative to the standard approaches, which are often inappropriate and only applied out of computational convenience.
 
 
Use of Decision Support Simulation Tool in Army Resource Decisions
Craig E. College
Program Analysis and Evaluation Office
Senior Army decision-makers constantly struggle to optimize the allocation of funds for readiness, modernization, and soldiers' quality of life. The analysis supporting resource decisions in the Army Program Objective Memorandum (POM) process comprises a large and complex series of tasks. There are time-delays between causes and effects, extensive "feedback" loops, and numerous critical qualitative and quantitative factors which must be incorporated in the analysis. Many quick-reaction "what-if" scenarios must be investigated to detail the impacts of alternative decisions. In the face of reduced personnel and dollar resources, essentially manual execution of these activities is increasingly difficult. This situation generates a requirement for a suite of analytical models to assist in (1) developing the POM, (2) articulating the impact of resource decisions, and (3) analyzing resource trade-offs. Army Program, Analysis, and Evaluation Directorate's (PAED) Decision Support System (DSS) is being developed and tested to enhance the Army POM process. DSS techniques and methodologies include rank-based hierarchical assessment, Quality Function Deployment (QFD), and System Dynamics Simulation. Among several functions, this DSS tool prioritizes and optimizes over 500 Management Decision Packages (MDEPs), which are the key programming structures supporting the Army's mission and Title X responsibilities. Risks associated with resultant funding levels in various sectors, e.g., readiness, modernization, and soldiers' quality of life, are then made evident. Finally, this tool provides the capability to support a future executive-level display of resource options for real-time decision making.
 
 
Modeling Support Infrastructure for the Expeditionary Aerospace Force
Lionel Galway
RAND
In response to a "new world" of frequent deployments, the U.S. Air Force has developed a new operational concept, the Expeditionary Aerospace Force, which replaces large overseas forces with units that can be deployed quickly from the U.S. Meeting demanding timelines for deployment will require a rethinking and potential restructuring of all support functions, such as munitions, fuel, maintenance, and supply. Because of the large uncertainties regarding proposed support alternatives in future scenarios, we argue that strategic support planning (i.e., decisions on overall structure, technology and process improvements, etc.) should be done with relatively simple and transparent models that have modest data requirements and run quickly to allow runs over many different scenarios. These models allow quick exploration of wide ranges of alternatives and help direct detailed analysis to promising options.
 
 
Information and Knowledge Organization to Support Modeling and Statistics in Defense
Yvonne M. Martinez
Los Alamos National Laboratory
The creation of knowledge bases to support decision making has become an integral part of our Statistical Sciences Group's weapons reliability efforts. Designed properly, these knowledge bases become the critical infrastructure of expertise from which information (quantitative and qualitative) is extracted and modeled. An example of the Prototype Slapper Detonator Knowledge Base will be given. This is an electronic repository for capturing and preserving knowledge on a type of detonator, the Slapper. The Knowledge Base is a resource for the Department of Defense's (DoD's) decision making on the design, modeling, manufacturing, and reliability of the Slapper. Methods for developing the prototype originated with meetings of advisory experts and analysts on the contents, organization, and needed capabilities of the Knowledge Base. We refine the Knowledge Base through usability tests: observations of users as they search through the Knowledge Base and perform their decision-making tasks. The Statistical Sciences Group is further exploring the use of knowledge bases to integrate information, to perform statistical analysis and modeling, and to support collaborative updating of knowledge. This work will also be displayed through examples of PREDICT (Performance and Reliability Evaluation and Design by Information Combination and Tracking) and RETAIN (Repository for Expertise and Tools for Analyzing Integrating KNowledge).
 


 
Enterprise Modeling: Supply Chain Design to Statistical Performance Analysis
Organizers: Bonnie Ray
(borayx@m.njit.edu)
New Jersey Institute of Technology
and
Leslie M. Moore
(lmoore@lanl.gov)
Los Alamos National Laboratory
 
Systems Thinking in Supply Chain Management
Charu Chandra
University of Michigan
The concept of the supply chain is about managing coordinated information and material flows, plant operations, and logistics. It provides flexibility and agility in responding to consumer demand shifts with minimum cost overlays in resource utilization. The fundamental premise of this philosophy is synchronization among the multiple autonomous entities represented in it, that is, improved coordination within and between various supply chain members. Coordination is achieved within the framework of commitments made by Members to each other. Members negotiate and compromise in a spirit of cooperation in order to meet these commitments. Increased coordination can lead to reduction in lead times and costs, alignment of interdependent decision-making processes, and improvement in the overall performance of each Member, as well as the supply chain network (Group). Such an arrangement offers opportunities to design, model, and analyze problems with the local perspective of a Member and the global view of a Group. It also holds the potential for the emergence of divergent supply chain network topologies in order to satisfy dynamic market conditions. These unique configurations and associated problems require formulations in relation to a systems framework, recognizing their domain dependence within the domain-independent environment of the supply chain. In this talk, we present a systems framework that utilizes an interdisciplinary approach to supply chain management, with methods and techniques incorporated from production operations management, management science, industrial engineering/operations research, systems sciences, and the artificial intelligence and computer science fields.

One of the important supply chain management problems is to transform incomplete information about the market and available production resources into coordinated plans for production and replenishment of goods and services in the network formed by cooperating entities. We illustrate how the proposed framework addresses this problem for a textile supply chain.

 
 
Planning Experiments with Computer Models of Complex Phenomena
Leslie M. Moore
Los Alamos National Laboratory
Bonnie Ray
New Jersey Institute of Technology
Dennis R. Powell
Los Alamos National Laboratory
Computer codes are being developed to model many complex phenomena including a manufacturing supply chain. Use of computer simulation models leads to the consideration of statistical methods for gaining understanding from these models of underlying processes, perhaps to stimulate the development of science or to support decision making. We describe statistical methods for sensitivity and performance analysis of complex computer simulation experiments. Analysis of variance-based methods or regression tree analysis are useful for determining variables having substantive influence on the experimental results and to investigate the structure of underlying relationships between inputs and outputs. An approach to analysis leads to the need to design computer experiments from which estimates of the quantities of interest can be obtained with reasonable efficiency. Inputs to simulation codes may number in the tens to hundreds and information that allows focus on subsets of important inputs is invaluable. Some experiment design approaches based on fractional factorial design, or orthogonal arrays, are described.
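
As an illustration of the fractional factorial idea, the sketch below builds a two-level half-fraction in which one extra factor is set to the product of the others (the generator D = ABC for four factors). The function name and the choice of generator are assumptions made for this example; the abstract does not say which designs the authors use.

```python
from itertools import product


def half_fraction(n_base: int) -> list[tuple[int, ...]]:
    """Two-level design with n_base + 1 factors in 2**n_base runs.

    The last factor is aliased with the product of the base factors
    (generator D = ABC when n_base == 3).
    """
    runs = []
    for levels in product((-1, 1), repeat=n_base):
        extra = 1
        for x in levels:
            extra *= x
        runs.append(levels + (extra,))
    return runs


design = half_fraction(3)  # 8 runs for 4 factors instead of the full 16
```

Each row is one setting of the simulation inputs, coded as -1 (low) or +1 (high). The columns are mutually orthogonal, so main effects can be estimated from half the runs, at the price of confounding some interactions.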
 
 
Forecasting Methods for Supply Chain Management
Bonnie Ray
New Jersey Institute of Technology
Forecasting item demand across different segments of the manufacturing processes and across different time horizons is an integral part of supply-chain management. In this talk, we review time series methods that are commonly used for demand forecasting in inventory management applications and discuss some issues that arise when the methods are extended to the supply chain framework. An example of forecasting items in a textile supply chain is used to illustrate the discussion.
 


 
Using Statistical Modeling to Identify Perturbations in Earth System Processes:
Examples from Landscape Evolution and Tectonics

Organizer: Dorothy Merritts
(D_Merritts@acad.FandM.edu)
Franklin and Marshall College
 
River Network Scaling Laws: Deviations and Fluctuations
Peter Dodds
MIT
The statistics and structure of river networks are commonly described by power laws. In practice, deviations in scaling are present, making exact measurements difficult. The choice of parameter ranges used for regression analysis can markedly affect estimates of exponents. We show that many relationships possess several distinct scaling regimes linked by crossover regions. These scaling regimes, which may be present to varying degrees, pertain to the scale of linear, pre-network basins; large length scales at which correlations in landscapes become negligible; and outer length scales dictated by geology. We present evidence from real data for large-scale networks including the Mississippi, Amazon, Nile and Kansas river basins. We observe that improvements in topographic resolution would be unlikely to result in cleaner statistics; that variations in measurements for small-scale basins are real and unavoidable; and that strong deviations are indicative of geology being at work.
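
The sensitivity of an estimated exponent to the regression range can be seen in a small synthetic example. The data below are made up: two power-law regimes with exponents 0.6 and 0.5 joined at x = 100. Nothing here comes from the river-basin measurements, and the function name is hypothetical.

```python
import math


def loglog_slope(xs, ys, lo, hi):
    """Least-squares slope of log(y) on log(x) for lo <= x <= hi."""
    pts = [(math.log(x), math.log(y)) for x, y in zip(xs, ys) if lo <= x <= hi]
    n = len(pts)
    mx = sum(u for u, _ in pts) / n
    my = sum(v for _, v in pts) / n
    sxy = sum((u - mx) * (v - my) for u, v in pts)
    sxx = sum((u - mx) ** 2 for u, _ in pts)
    return sxy / sxx


# Synthetic data with two scaling regimes joined at x = 100.
xs = [10 ** (k / 10) for k in range(0, 41)]   # 1 to 10^4, log-spaced
ys = [x ** 0.6 if x <= 100 else 100 ** 0.1 * x ** 0.5 for x in xs]

small = loglog_slope(xs, ys, 1, 100)      # fit within the first regime
large = loglog_slope(xs, ys, 100, 1e4)    # fit within the second regime
mixed = loglog_slope(xs, ys, 1, 1e4)      # fit straddling the crossover
```

A fit that straddles the crossover returns an exponent between the two true values, which is the kind of range dependence the abstract warns about.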
 
 
Quantitative Testing of Landform Evolution Models
Garry Willgoose and Greg Hancock
University of Newcastle, Australia

Over the last 15 years a range of landform evolution models have been developed. These models are highly nonlinear and sensitive to small perturbations. It is not possible to carry out repeated combinatorial experiments to collect the data to test these models. Finally, a key aspect of landforms is their spatial organisation in the form of drainage networks. This paper outlines a novel methodology the authors have developed to address these limitations using key indicator statistics (width function, cumulative area function, area-slope relationship) and the GLUE hypothesis testing methodology (Beven and Binley, 1992). Results from studies will be shown. Some unresolved statistical challenges will be outlined, including:

(1) accounting explicitly for spatial organisation and patterns of drainage for "random" networks. We cannot do repeated experiments in the field, but we can do repeated experiments in the computer and see if the field data are likely to have come from the computer population of generated landscapes.

(2) developing likelihood functions for simultaneous use of several test statistics. We must understand the correlation between test statistics. For instance, the hypsometric curve and area-slope relationship both characterise elevation properties but what model discrimination power does hypsometry provide over and above area-slope alone?

 
 
Deviations in Slope-Area Relations that Indicate Geologically Recent Crustal Deformation
Tim C. Hesterberg
MathSoft, Inc.
Dorothy Merritts
Franklin and Marshall College
In 1811-1812 three great (8.0+) earthquakes occurred near New Madrid, Missouri. We estimate coseismic deformation in this area using stream elevation data from topographic maps. Streams have a natural profile, the gradient of which depends on the resistance of underlying sediment and the volume of stream flow. If tectonic processes elevate the upstream end of a segment a different amount than the downstream end, the stream will attempt to return to its natural gradient by incising, aggrading, or altering its sinuosity. This adjustment takes time, so deviations from the natural gradient may indicate geologically recent deformation. We use penalized regression splines to estimate the natural stream profile and the deformation of the ground surface. Estimation of the natural profile and deformation is based on nonparametric regression of the form y2 - y1 = f(x2) - f(x1), where the x's are univariate or bivariate.
 
 
Conjugate Gradient Methods for Large-Scale Sparse Regression,
with Applications to Seismic Deformation Estimation

Derek Stanford
MathSoft, Inc.
To estimate seismic deformation, we need to fit a smooth surface to marked spatial data. Estimation of this surface can be characterized as a regression problem with smoothness constraints. We use a conjugate gradient method because the large size of the design matrix in this regression does not permit computation of an exact least squares solution. For example, a relatively coarse surface grid with 10^3 cells per side would lead to a design matrix with 10^6 columns; an exact linear solution would require inversion of a matrix with dimensions 10^6 by 10^6, which is not currently feasible. Our conjugate gradient approach allows us to avoid this difficulty by taking advantage of the sparse structure of the design matrix, and we have implemented this in software written in C and Splus.
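
The idea of solving a penalized regression without ever forming or inverting the large matrix can be sketched in one dimension. The example assumes a much simpler setup than the abstract: each observation falls in exactly one grid cell, smoothness is a first-difference penalty with a hypothetical weight `lam`, and the normal equations (A'A + lam D'D) beta = A'y are solved by conjugate gradients. It is written in Python for illustration, not in the C and Splus of the author's implementation.

```python
def apply_normal(beta, cells, lam):
    """Return (A^T A + lam * D^T D) beta without forming any matrix.

    A has a single 1 per row (the grid cell of each observation) and D
    takes first differences between neighbouring cells.
    """
    out = [0.0] * len(beta)
    for c in cells:                       # A^T A: one count per observation
        out[c] += beta[c]
    for i in range(len(beta) - 1):        # lam * D^T D: roughness penalty
        d = lam * (beta[i + 1] - beta[i])
        out[i] -= d
        out[i + 1] += d
    return out


def conjugate_gradient(cells, y, n_cells, lam=1.0, tol=1e-10, max_iter=1000):
    """Solve the penalized normal equations by conjugate gradients."""
    rhs = [0.0] * n_cells                 # A^T y
    for c, v in zip(cells, y):
        rhs[c] += v
    beta = [0.0] * n_cells
    r = rhs[:]                            # residual for the start beta = 0
    p = r[:]
    rs = sum(v * v for v in r)
    for _ in range(max_iter):
        if rs ** 0.5 < tol:
            break
        ap = apply_normal(p, cells, lam)
        alpha = rs / sum(a * b for a, b in zip(p, ap))
        beta = [b + alpha * q for b, q in zip(beta, p)]
        r = [v - alpha * a for v, a in zip(r, ap)]
        rs_new = sum(v * v for v in r)
        p = [v + (rs_new / rs) * q for v, q in zip(r, p)]
        rs = rs_new
    return beta
```

Each iteration costs time proportional to the number of observations plus the number of cells, because only the nonzero entries are ever touched. That property is what makes the approach usable when the grid has on the order of 10^6 cells.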
 


 
Statistics in Precision Agriculture
Organizer: Barry Moser
(bmoser@lsu.edu)
Louisiana State University
 
Hypothesis Tests in the Presence of Spatial Correlation
Robert G. Downer
Louisiana State University
Traditional experimental designs do not directly account for spatially dependent observations. However, this aspect of the data cannot be ignored in either design or analysis. The basic issues associated with spatially correlated observations in standard designs will be presented and some of the methods which have attempted to address the problem will be reviewed. The impact of these issues on hypothesis testing will be discussed. For fixed effects one-way analysis of variance, a permutation test is introduced.
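
The basic mechanism of a permutation test for one-way analysis of variance is sketched below. This is the plain version that shuffles observations freely among treatments, which assumes exchangeability; it is not the test introduced in the talk, which presumably restricts or adjusts the permutations to respect spatial correlation. Function names and defaults are hypothetical.

```python
import random


def f_statistic(groups):
    """One-way ANOVA F statistic for a list of groups of numbers."""
    all_vals = [v for g in groups for v in g]
    n, k = len(all_vals), len(groups)
    grand = sum(all_vals) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def permutation_p_value(groups, n_perm=2000, seed=0):
    """Share of label shuffles whose F is at least the observed F."""
    rng = random.Random(seed)
    observed = f_statistic(groups)
    pooled = [v for g in groups for v in g]
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for s in sizes:                   # re-split into the original group sizes
            shuffled.append(pooled[start:start + s])
            start += s
        if f_statistic(shuffled) >= observed:
            hits += 1
    # Counting the observed arrangement keeps the p-value above zero.
    return (hits + 1) / (n_perm + 1)
```

With spatially correlated plots, neighbouring observations are not exchangeable, so the reference distribution produced by free shuffling can be too narrow. That gap is the motivation for a modified test.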
 
 
Statistical Issues in the Analysis of Remotely Sensed Data as Pertains to Precision Agriculture
Patrick D. Gerard, David Evans, and Michael Cox
Mississippi State University
The potential environmental benefits of precision agriculture have been well documented. However, before these benefits can be fully realized, the economic feasibility of these methods must be addressed. One of the primary costs associated with precision farming involves the acquisition of information on quantities of interest, such as soil fertility and weed abundance, across a large area. Normally, obtaining the requisite information is both time and labor intensive. Agricultural scientists have turned to remote sensing to non-invasively obtain information, in the form of intensity of energy radiated from the earth, which may be related to the quantities of interest. In this talk, we will briefly discuss some of the objectives of precision agriculture and how they relate to remote sensing. We will also discuss, in general terms, some statistical and data management issues that arise from the use of remote sensing in precision agriculture, with emphasis on new challenges as remotely sensed data move from being multispectral in nature to hyperspectral.
 
 
Use of Spatial Statistics to Design Sampling Plans for Monitoring Rust in Coffee Trees
Raúl E. Macchiavelli and Rocío del P. Rodríguez
University of Puerto Rico
Rust is an important disease of coffee that decreases yield of coffee beans. For monitoring purposes, sampling is done in a two-step systematic plan: trees are sampled systematically (in a W pattern covering the field), and then leaves are randomly sampled from each selected tree. Since coffee in Puerto Rico is grown in areas with pronounced slopes, these plans require walking diagonally along slopes, which is not feasible for regular monitoring by farmers. In this work we compare by simulation different sampling plans in order to find one to be used for monitoring this disease. The disease incidence was evaluated for all trees in a coffee lot (N=1269) and two systematic patterns were studied: a W pattern every c trees (c=2,…,6) and a pattern with parallel rows, where trees were selected along 4 equally spaced parallel rows, every c trees (c=2,…,7). Simulations were carried out using a SAS macro, sampling 2, 5, 10, 30, and 40 leaves per tree every c trees. In order to generalize these results to other fields, the spatial distribution pattern was modeled using spatial statistics, new data sets were generated by changing the disease incidence and the spatial correlation, and simulations were run using the approach described above. The results suggest that both systematic sampling patterns gave approximately the same standard errors (except in cases with large spatial correlation).
 
 
Effect of Number of Seed Bulked when Using the Multiple-Seed Procedure for Self-Pollinated Crops
James Beaver and Raúl E. Macchiavelli
University of Puerto Rico
Single-seed descent has been used by grain legume breeders to maintain genetic variability in populations of advanced generation lines. In this method, a single seed from each plant is chosen and planted, and the process is repeated several times (typically four). In order to reduce labor costs, breeders use a multiple seed procedure in which a single pod rather than a single seed is harvested from each plant and bulked. This paper studies the distribution of the proportion of the original plants represented after applying this method four times (advancing from the second to the sixth generations). Since the analytical solution to this problem is intractable (it involves a 4-fold convolution of a multivariate hypergeometric distribution), a simulation was run in SAS to estimate this probability distribution. Results show that the proportion of plants represented at least once is between 0.25 and 0.33; that an increase in the size of the original population does not significantly influence the mean proportion but decreases its variability; and that the number of seeds per pod affects both the mean and the standard deviation of the distribution.
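
A simulation of this kind can be sketched as follows, in Python rather than SAS. The sketch rests on assumptions the abstract does not state: the population size stays fixed from one generation to the next, every pod holds the same number of seeds, and the seeds planted are a simple random sample without replacement from the bulk. All names and default values are hypothetical.

```python
import random


def proportion_represented(n_plants, seeds_per_pod, cycles=4, rng=None):
    """Share of original plants with at least one descendant after `cycles`."""
    rng = rng or random.Random()
    ancestors = list(range(n_plants))     # each plant is tagged by its founder
    for _ in range(cycles):
        # One pod per plant goes into the bulk; every seed inherits the tag.
        bulk = [a for a in ancestors for _ in range(seeds_per_pod)]
        # Plant n_plants seeds drawn from the bulk without replacement.
        ancestors = rng.sample(bulk, n_plants)
    return len(set(ancestors)) / n_plants


def simulate(n_plants=200, seeds_per_pod=4, reps=500, seed=1):
    """Mean and standard deviation of the proportion over repeated runs."""
    rng = random.Random(seed)
    draws = [proportion_represented(n_plants, seeds_per_pod, rng=rng)
             for _ in range(reps)]
    mean = sum(draws) / reps
    sd = (sum((d - mean) ** 2 for d in draws) / (reps - 1)) ** 0.5
    return mean, sd
```

Calling `simulate` with different values of `n_plants` and `seeds_per_pod` gives a rough way to explore how population size and pod size affect the mean and spread. The numbers will not necessarily match the paper's, since the exact sampling scheme is assumed.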
 
 
Geostatistical Analysis of Spatial Nutrient Data in a Precision Agriculture Experiment
Bradley Tiffee and Robert G. Downer
Louisiana State University
Precision agriculture uses geographic information systems, computer technology and a global positioning system to map site-specific factors that affect production. The Dean Lee Research Station of the Louisiana Agricultural Experiment Station is using this technology to map soil nutrients and soybean yield. This may be achieved through spatial modeling using variograms and kriging. A sample variogram is estimated from sampled magnesium concentrations and kriging is used to predict values for the entire field. A comparison is made to prediction from a trend surface model and recommendations are discussed.
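
Estimating a sample variogram can be sketched with the classical (Matheron) estimator, which averages half the squared differences over pairs of points grouped by separation distance. The function name, the binning scheme, and the treatment of empty bins are choices made for this example, not details from the study.

```python
import math


def empirical_variogram(points, values, bin_width, n_bins):
    """Matheron estimator: half the mean squared difference per distance bin."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            h = math.dist(points[i], points[j])
            b = int(h // bin_width)
            if b < n_bins:
                sums[b] += (values[i] - values[j]) ** 2
                counts[b] += 1
    # Bins with no pairs are reported as None rather than zero.
    return [s / (2 * c) if c else None for s, c in zip(sums, counts)]
```

A parametric variogram model would then be fitted to these binned values before kriging. That fitting step is not shown here.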
 


 
The Utility of Bayesian Decision Analysis for Environmental Problems
Organizer: Paul Black
(pblack@neptuneandco.com)
Neptune and Company, Inc.
 
Scenario and Parametric Uncertainty in GESAMAC:
A Methodological Study in Nuclear Waste Disposal Risk Assessment

David Draper
University of Bath in England
In this talk I will examine a Bayesian conceptual and computational framework for accounting for all sources of uncertainty in complex prediction problems, involving six ingredients: past data, future observables, and scenario, structural, parametric, and predictive uncertainty. I will apply this framework to nuclear waste disposal using a computer simulation environment, GTMCHEM, which "deterministically" models the one-dimensional migration of radionuclides through the geosphere up to the biosphere. Focusing on scenario and parametric uncertainty, I will show that the mean predicted maximum dose for humans on the earth's surface due to key radionuclides, and the uncertainty bands around those predictions, are noticeably larger when scenario uncertainty is properly assessed and propagated. I will conclude by describing how Bayesian decision theory can take predictions such as these and turn them into recommendations for environmental action.
 
 
A Probability Network for Water Quality Modeling and Decision Support
Kenneth H. Reckhow and Mark E. Borsuk
Duke University
A probability network model is being developed and applied to the problem of eutrophication in the Neuse Estuary, USA. Also called a "Bayes net," this model consists of the variables of interest in the system and a set of assertions concerning the probabilistic relationships among the variables. The objective of the model is to provide a scientific assessment of the impact of nitrogen loading on estuarine algal blooms and fishkills. Model expressions are quantified using data analysis, mechanistic relationships, and/or expert judgment. Probabilistic predictions of model endpoints may then be made based on the entire set of conditional probabilities. Not only does this network structure provide an integrated approach to uncertainty analysis, but it also allows easy updating of prediction and inference when observations of model variables are made. This capability is particularly important when applied to a natural system in which additional monitoring is likely to occur concurrent with the modeling effort. The method is probabilistic in its approach, which facilitates a meaningful communication of uncertainty, and is consistent with the risk assessment paradigm. Model endpoints are chosen so that they are of vital interest to stakeholders and can be easily expressed for use in formal decision analysis.
 
 
Bayesian Assessment of Uncertainty and Variability in Deterministic Environmental Exposure Models
Samantha Bates and Adrian Raftery
University of Washington
In this paper we discuss Bayesian methods of analysis that incorporate both prior knowledge of the distributions of the inputs to a deterministic model and any available data on the model inputs and outputs. These methods yield posterior distributions for the model output from which to find distributions for quantities of interest. The first method uses Monte Carlo simulation from the prior distributions for the inputs and resampling of these simulations with weights determined by the observed data under the sampling importance resampling (SIR) scheme of Rubin. The second involves sampling from the posterior using MCMC methods. We will present an application of the methods to modeling poly-chlorinated biphenyl (PCB) concentrations in various media at a Superfund site in New Bedford Harbor, MA. A deterministic model for PCB concentration in soil was developed by Cullen (1992). Expert opinion is reflected in the prior distributions for model inputs. Interest lies in developing a distribution for the PCB concentration in soil at this site which accounts for uncertainty and variability in the model inputs and can be used in policy decisions.
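
The first method can be sketched on a toy problem. Everything specific below is a stand-in: the one-input linear `model`, the Uniform(0, 5) prior, the Gaussian measurement error, and the sample sizes are assumptions for illustration and have nothing to do with the PCB model of Cullen (1992).

```python
import math
import random


def model(theta):
    """Stand-in deterministic model mapping an input to an output."""
    return 2.0 * theta


def sir(observed, noise_sd, n_prior=5000, n_post=1000, seed=0):
    """Sampling importance resampling for the toy model above."""
    rng = random.Random(seed)
    # 1. Sample inputs from the prior (here a hypothetical Uniform(0, 5)).
    thetas = [rng.uniform(0.0, 5.0) for _ in range(n_prior)]
    # 2. Weight each draw by the likelihood of the observed output.
    weights = [math.exp(-0.5 * ((observed - model(t)) / noise_sd) ** 2)
               for t in thetas]
    # 3. Resample with probability proportional to the weights.
    return rng.choices(thetas, weights=weights, k=n_post)


posterior = sir(observed=4.0, noise_sd=0.5)
posterior_outputs = [model(t) for t in posterior]
```

The resampled inputs approximate the posterior, and pushing them back through the model gives an approximate posterior for the output. The approximation degrades when few prior draws receive appreciable weight, which is one reason to consider MCMC as the alternative.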
 
 
Environmental Modeling and Bayesian Analysis for Assessing
Human Health Impacts from Radioactive Contamination

Tom Stockton and Paul Black
Neptune and Company, Inc.
EPA regulations and DOE orders require assessing the impact on human health of radioactive waste contamination over periods of up to ten thousand years. Toward this end, complex environmental simulation models are used to assess "risk" to human health from migration of radioactive contamination. Typically there is very little data underlying these models, and the data that is available is incorporated in input parameter distributions for Monte Carlo simulation. Expert judgment typically drives the level of model complexity chosen, but model complexity choices are often not made within the context of the decision to be made. The utility and regulatory acceptability of a Bayesian approach to decision making regarding radioactive contamination is discussed within the context of radioactive contamination examples from Los Alamos National Laboratory, Hanford, and the Nevada Test Site. These examples highlight the desirability and difficulties of merging the cost of monitoring, the cost of the decision analysis, the cost and viability of clean up, and the probability of human health impacts within a rigorous decision framework.
 

 
IASC Sponsored Session: Applications to Earth Systems
Organizer: Edward J. Wegman
(ewegman@gmu.edu)
George Mason University
 
Using Smoothing to Reconstruct the Holocene Temperature in Lapland
Lasse Holmström
Rolf Nevanlinna Institute
Small arctic and subarctic lakes are known to be sensitive to climatic variation. Changes in external conditions are continuously recorded in their sediments in the form of aquatic organisms. The abundance of such organisms can therefore be used to reconstruct past environmental conditions. We use data collected in Finnish Lapland to reconstruct post-ice-age temperatures. Nonparametric smoothing has not often been used in this context. We find smoothing a viable tool both in the actual reconstruction phase and in the subsequent time series smoothing, where the SiZer method is used.
 
 
A Computational Geometry Approach for Peeling and Outlier Detection
Giancarlo Ragozini
Universita di Napoli Federico II
From a geometrical point of view, outliers are those observations lying isolated on the periphery of the data cloud. A large literature exists on the detection of multiple outliers in multivariate data sets. Most recent proposals are based on some robust distance of each data point from a center. However, they are truly effective only when the data scatter has a regular shape. The proposed method is based on the direct exploration of the data periphery, without considering any center or fixed shape, exploiting the geometrical properties of the sample convex hull. The first step of the proposed detection procedure consists of a new "weak" convex hull peeling, which reduces the computational effort of the classical peeling procedures. In this step the set of candidate outliers is identified by evaluating gaps in the data scatter and proximities to its boundary region. In the second step, a block omission approach is performed, considering only some specific subsets among the candidate outliers in order to reduce the combinatorial computational cost. The outlyingness of each subset is measured through a new index based on the variation of the convex hull volume when a subset is omitted.
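To make the geometric idea concrete, the sketch below implements the classical convex hull peeling in two dimensions, together with a simple relative change in hull area when a subset is omitted. It is not the "weak" peeling or the index proposed in the abstract, whose details are not given here; the function names and the area-based measure are assumptions chosen for illustration.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def hull_area(points):
    """Area of the convex hull, computed with the shoelace formula."""
    hull = convex_hull(points)
    if len(hull) < 3:
        return 0.0
    twice_area = sum(
        hull[i][0] * hull[(i + 1) % len(hull)][1]
        - hull[(i + 1) % len(hull)][0] * hull[i][1]
        for i in range(len(hull))
    )
    return abs(twice_area) / 2.0


def peel_layers(points):
    """Classical peeling: repeatedly strip off the current hull vertices."""
    remaining = sorted(set(points))
    layers = []
    while remaining:
        hull = convex_hull(remaining)
        layers.append(hull)
        on_hull = set(hull)
        remaining = [p for p in remaining if p not in on_hull]
    return layers


def area_reduction(points, subset):
    """Relative drop in hull area when `subset` is omitted (assumed index)."""
    full = hull_area(points)
    omitted = set(subset)
    reduced = hull_area([p for p in points if p not in omitted])
    return (full - reduced) / full


# A 3x3 grid of points plus one isolated observation.
grid = [(float(x), float(y)) for x in range(3) for y in range(3)]
cloud = grid + [(10.0, 10.0)]
layers = peel_layers(cloud)
outlier_score = area_reduction(cloud, [(10.0, 10.0)])
interior_score = area_reduction(cloud, [(1.0, 1.0)])
```

Omitting the isolated point shrinks the hull area sharply, while omitting an interior point leaves it unchanged, which is the contrast an omission-based index relies on.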
 
 
Applications of Deepest Regression
Mia Hubert, Peter J. Rousseeuw, and Stefan Van Aelst
Universitaire Instelling Antwerpen
The deepest regression is a method for linear regression introduced by Rousseeuw and Hubert (1999). It is the fit with maximal regression depth. We prove that this estimator is highly robust against outliers. We propose an approximate algorithm for fast computation of the deepest regression in higher dimensions, and apply it to several real data sets. From the distribution of the regression depth function we construct tests for the true unknown parameters in the linear regression model. We also propose a bootstrap method to construct regression confidence regions. For bivariate data sets we use the maximal depth to construct a test for linearity versus convexity/concavity. Finally, the deepest regression is applied to polynomial regression and to the Michaelis-Menten model.
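For readers unfamiliar with regression depth, the sketch below computes it for simple (one-predictor) regression and finds a deepest line by brute force. The depth of a fit is the smallest number of observations that must be removed before the residuals on one side of some split point all have one sign and those on the other side all have the opposite sign. This is a naive illustration that assumes distinct x values and restricts the search to lines through pairs of observations; it is not the approximate algorithm proposed in the abstract, and the function names are hypothetical.

```python
from itertools import combinations


def regression_depth(slope, intercept, data, eps=1e-9):
    """Regression depth of the line y = slope * x + intercept (simple regression)."""
    residuals = [(x, y - (slope * x + intercept)) for x, y in data]
    depth = len(data)
    for v in sorted({x for x, _ in data}):
        # Zero residuals (within eps) count as both non-negative and non-positive.
        left_pos = sum(1 for x, r in residuals if x <= v and r >= -eps)
        left_neg = sum(1 for x, r in residuals if x <= v and r <= eps)
        right_pos = sum(1 for x, r in residuals if x > v and r >= -eps)
        right_neg = sum(1 for x, r in residuals if x > v and r <= eps)
        depth = min(depth, left_pos + right_neg, right_pos + left_neg)
    return depth


def deepest_line_bruteforce(data):
    """Return (depth, slope, intercept) of a deepest line through two observations."""
    best = None
    for (x1, y1), (x2, y2) in combinations(data, 2):
        if x1 == x2:
            continue  # vertical lines are not regression fits
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        depth = regression_depth(slope, intercept, data)
        if best is None or depth > best[0]:
            best = (depth, slope, intercept)
    return best


# Six points on y = 2x + 1 and one gross outlier at x = 6.
data = [(float(x), 2.0 * x + 1.0) for x in range(6)] + [(6.0, -20.0)]
best_depth, best_slope, best_intercept = deepest_line_bruteforce(data)
```

In this example the deepest line recovers the trend of the six well-behaved points and ignores the outlier, which illustrates the robustness claimed for the estimator.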
 
 

 
Statistics and Information Technology
Organizer: Alan Karr
(karr@niss.org)
National Institute of Statistical Sciences
 
How Should We Publish Data Analyses in the Web Age?
Todd L. Graves
Los Alamos National Laboratory
Papers and journals about data analyses that are published online should be different from, and more powerful than, those in paper journals. Web papers can include interactive visualization and data analysis applets that allow readers to perform exploratory analyses. Readers could also replay the authors' analyses, trying out minor modifications, or even import their own software or data to see whether different analyses or additional data change the authors' conclusions. Online journals could be reorganized so that all analyses of a particular data set, or all analyses that bear on a particular real-world problem, are reachable through the same web page. Readers could perform and submit their own analyses of these problems, all within the web journal environment. This talk will include demonstrations of these concepts written in Java.
 
 
Geographic Aggregation Procedures for Data Disclosure Limitation
Ashish Sanil
National Institute of Statistical Sciences
Government agencies often report their data (gathered through sample surveys and censuses) in the form of statistical summaries by geographic unit (e.g., by state or county). In many cases the public release of data on particular geographic units is considered too risky for preserving the confidentiality of the respondents. A possible strategy for such cases is to aggregate neighboring regions into larger units that satisfy the confidentiality requirements. Often, as in the case of an on-line query system, the computation of the aggregations needs to be automated, should be computationally efficient, and should produce meaningful aggregates. Procedures for carrying out such confidentiality-preserving geographic aggregation will be described, and illustrative examples will be presented.
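One simple way to picture such an aggregation is a greedy merge of neighboring regions until every unit reaches a minimum count. The sketch below is a hypothetical heuristic written for illustration only; the abstract does not specify the procedures used, and the threshold rule, the function name, and the tie-breaking are all assumptions.

```python
def aggregate_regions(counts, neighbors, threshold):
    """Greedily merge adjacent regions until each group has count >= threshold.

    counts:    dict mapping region name to respondent count
    neighbors: dict mapping region name to a set of adjacent region names
               (assumed symmetric)
    Groups with no neighbors are left as they are, even if below the threshold.
    """
    groups = {name: {name} for name in counts}  # group id -> member regions
    owner = {name: name for name in counts}     # region -> group id

    def total(gid):
        return sum(counts[m] for m in groups[gid])

    def adjacent(gid):
        found = {owner[n] for m in groups[gid] for n in neighbors.get(m, ())}
        return found - {gid}

    while True:
        small = [g for g in groups if total(g) < threshold and adjacent(g)]
        if not small:
            break
        # Merge the smallest risky group into its smallest neighboring group.
        gid = min(small, key=lambda g: (total(g), g))
        target = min(adjacent(gid), key=lambda g: (total(g), g))
        for member in groups.pop(gid):
            groups[target].add(member)
            owner[member] = target
    return sorted(sorted(members) for members in groups.values())


# Four regions arranged in a chain A - B - C - D.
counts = {"A": 3, "B": 4, "C": 20, "D": 2}
neighbors = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
aggregated = aggregate_regions(counts, neighbors, threshold=5)
```

Each merge reduces the number of groups by one, so the loop always terminates. A production procedure would also need to weigh how meaningful the resulting aggregates are, as the abstract notes.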
 
 
Detecting Defection: Mining Massive Online Data to Model ISP Customer Churn
Nandini Raghavan
AT&T Labs-Research
From a statistician's point of view, the most striking (and daunting) aspect of the web phenomenon is the challenge of extracting meaningful information from the tremendous volume of data available. In this talk I describe our efforts to develop evolving statistical profiles (signatures) of users of an internet service provider (ISP). ISPs are characterized by large, fluid user populations and massive, dynamic data streams that record information at different granularities. I describe an application in which we use these signatures to build formal statistical models of customer migration.
 

 
Statistical and Computational Methods for Survival and Reliability Data
Organizer: Luis A. Escobar
(luis@stat.lsu.edu)
Louisiana State University
 
A Case Study in Competing Risk Reliability Analysis Using JMP Software
Bradley Jones
SAS Institute, Inc.

This presentation illustrates the use of JMP software to analyze a manufacturing-oriented reliability problem with competing risks. The application is non-trivial and the data are real.

The demonstration shows the ability of fast modern computers to fit, diagnose, and refine such complex models interactively.

 
 
Reliability Data Analysis Using S-Plus
William Q. Meeker
Iowa State University
Censored and truncated data arise frequently in product reliability studies involving laboratory accelerated life tests, field tracking studies, and the analysis of warranty data. S-PLUS has powerful tools for analyzing such data. This talk will describe SLIDA, a collection of S-PLUS functions for Life Data Analysis that has been designed to extend and enhance the S-PLUS capabilities in this area. Some of these extensions include:
  • Functions that link maximum likelihood estimation with probability plots to facilitate analysis steps ranging from model identification and diagnostics to sensitivity analysis and presentation of final results.
  • Simulation tools for inference and test planning.
  • Functions that allow the user to specify a likelihood function and easily do appropriate likelihood-based analyses for nonstandard models.
  • Methods for recurrence data.
  • A comprehensive set of example data sets and command scripts illustrating the methods.
  • A graphical user interface for the core SLIDA functionality.
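
To show what a likelihood-based analysis of censored data involves, the sketch below fits an exponential lifetime model to right-censored data, where the maximum likelihood estimate has a closed form: the number of failures divided by the total time on test. It is written in Python for illustration and does not use or reproduce any SLIDA or S-PLUS function; the exponential model, the function names, and the data values are assumptions.

```python
import math


def exponential_loglik(rate, times, failed):
    """Log-likelihood of an exponential model with right censoring.

    A failure at time t contributes log(rate) - rate * t;
    a unit still running at time t (censored) contributes -rate * t.
    """
    return sum(
        (math.log(rate) if is_failure else 0.0) - rate * t
        for t, is_failure in zip(times, failed)
    )


def exponential_mle(times, failed):
    """Closed-form MLE of the failure rate: failures / total time on test."""
    n_failures = sum(1 for is_failure in failed if is_failure)
    if n_failures == 0:
        raise ValueError("at least one failure is needed to estimate the rate")
    return n_failures / sum(times)


# Hypothetical test data: hours on test, and whether each unit failed (True)
# or was still running when the test ended (False, right-censored).
times = [100.0, 200.0, 300.0, 400.0]
failed = [True, True, False, False]

rate_hat = exponential_mle(times, failed)
mean_life_hat = 1.0 / rate_hat
```

For nonstandard models without a closed form, the same log-likelihood function would be maximized numerically.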

Random Effects Survival Models for Familial Data