ABSTRACTS
INTERFACE 2000
INVITED SESSION ABSTRACTS
|
Organizer: Doug Nychka (nychka@cgd.ucar.edu) National Center for Atmospheric Research |
||||||||
|
Ralph Milliff National Center for Atmospheric Research |
||||||||
| The large-scale mean north-to-south overturning circulation in the Atlantic Ocean transports warm water poleward at the surface (e.g., the Gulf Stream and North Atlantic currents), and cold water equatorward at depth (e.g., the Deep Western Boundary Current system). This is an important branch of the so-called ocean conveyor-belt circulation that provides a conceptual model of the global ocean as a sink for atmospheric concentrations of greenhouse gases, e.g., carbon dioxide. The ocean deep convection process at high latitudes is the energetic downward component of the conveyor-belt model. Ocean deep water is said to be formed in this process, wherein oceanic parcels recently in contact with the atmosphere are sequestered at depth and subsequently isolated from the surface over climate timescales, e.g., O(1000 yrs). The Labrador Sea is one of a few locations in the world ocean where deep convection occurs. Ocean deep convection is driven there by vigorous air-sea exchanges of heat, momentum, moisture, and mechanical energy associated with the passages of intense storms in winter. An extensive observational program in oceanography and meteorology was implemented in the Labrador Sea region with a focus on the ocean deep convection process for the 1996-97 winter season. We will review the large observational dataset that is emerging from that campaign. In addition, we will introduce the growing satellite observational datasets of sea-surface winds and sea-surface heights in the Labrador Sea region. This talk sets the stage for a description in the following presentation of Bayesian Hierarchical Model (BHM) approaches to geophysical problems, including an example of a BHM approach to ocean dynamics leading to ocean deep convection and deep water formation in the Labrador Sea. | ||||||||
|
L. Mark Berliner Ohio State University |
||||||||
| Phenomena studied in the geophysical sciences are high-dimensional, interrelated processes distributed in space and time. Modeling and prediction of such processes typically require the combination of both scientific understanding, usually reflected in physical models, and observations. However, the physical models are often very complex and subject to a variety of uncertainties. Further, though very large to massive datasets are often available, they are typically composed of disparate types of observations and, almost paradoxically, cover a small portion of the processes of interest. The hierarchical Bayesian viewpoint is suggested to provide a framework for combining scientific reasoning and observational data in a fashion that quantitatively accounts for our uncertainty. I review a basic strategy for developing such hierarchical models, indicate the relation between the Bayesian models and their analysis via Markov chain Monte Carlo, and review some preliminary examples, motivated by Ralph Milliff's talk that precedes this one. | ||||||||
The
analysis of the prodigious amounts of data generated by orbiting platforms
requires massive spatio-temporal models. Even for limited-domain cases,
traditional covariance-based space-time statistical methods are generally
not tractable. We explore programming methodologies for making hierarchical
Bayesian simulations for such models (typically requiring a large number
of iterations) a reality for very large datasets and spatio-temporal domains.
The hierarchical model is fully described by Wikle et al., JASA, in review.
Our specific interest is the spatio-temporal prediction of the surface
winds over the equatorial Pacific using remotely-sensed satellite wind
observations and the output of a deterministic weather model.
|
||||||||
|
Rachel Buchberger National Center for Atmospheric Research and Colorado State University |
||||||||
| Current models of the earth's atmosphere (General Circulation Models) rely on deterministic rules to decide the cloud amounts in a grid box. Clouds are an important feature of these models because they influence how incoming and reflected radiation interacts with the atmosphere. Unfortunately, such models do not capture the motion of the cloud field well. An alternative is to consider cloud amounts as a spatial and temporal process that evolves in a nonlinear and stochastic fashion over time. In this project, the form of this process is estimated from observational data and is found to be a good description of the measured cloud fields. The statistical models are non-Gaussian and use a neural network form to represent the autoregressive relationship between current cloud amounts and those in the next time period. | ||||||||
|
|
||||||||
|
Organizer: Sallie Keller-McNulty (sallie@lanl.gov) Los Alamos National Laboratory |
||||||||
|
J. Darrell Morgeson Los Alamos National Laboratory |
||||||||
| For the past 10 years, Los Alamos has pursued an aggressive simulation R&D program focused on the nation's critical infrastructure. The products of the program are described below together with their respective sponsors and funding. With internal discretionary funding, the Lab began to integrate these efforts into an overall "system-of-systems" simulation environment last year with a view toward addressing near-, mid-, and long-term policy, investment, and operational issues for use by both public and private sectors. The scale and scope of these simulations extend the state-of-the-art in very large-scale complex systems simulations and computation (i.e., human-based interactions on the order of 10^7 to 10^12 interactions per second). The combined simulation and analysis system will effectively address such complex issues as: convergence, electrical grid restructuring, energy security and reliability, interdependencies among critical infrastructure systems, environmental impacts, response to natural and manmade disasters, and others. The entire simulation system is designed to run on multiple hardware platforms including ASCI architectures. An S&T roadmap has been developed to show the evolution of this program over the next 10 years for both applied products and advances in the science that forms the foundation for these tools and their use. To date, over $50M has been spent on these projects with another $42M planned through 2004, the majority provided by other federal agencies. | ||||||||
|
Chris Barrett Los Alamos National Laboratory |
||||||||
| We introduce a new mathematical object, a Sequential Dynamical System (SDS), and explore the insights we obtain in relation to computer simulation of large, composed, dynamical systems, their measurement and interpretation, and a range of issues surrounding validity of a simulation as used in decision making and analysis of such systems. In large infrastructure design, analysis and policy questions, representation using computers of present and/or future systems is desirable in many respects. However, firm theoretical foundations for simulation and its appropriate use are lacking, which limits its credible use. In particular, the following issues impose a need for rigorous foundations: 1. the short- and long-term dynamics of these systems are complicated and characterized by massive interaction among large numbers of subsystems, 2. a potentially endless data collection requirement warrants care in choice of approaches to represent the system, 3. knowledge of piece-parts usually exceeds understanding of the composed system, and 4. a broad range of issues surrounds any meaningful use of the term "validity" as it might apply to a computer simulation used as a model to which decision making processes refer. We will examine these issues from the perspective of our emerging theoretical foundations for simulations as well as applications in transport and communications infrastructure analysis. | ||||||||
|
Katherine Campbell and Richard Beckman Los Alamos National Laboratory |
||||||||
| At first glance, the simulation models described by the other speakers in this session may appear to be radical departures from more familiar types of numerical models of dynamical systems, such as models of oil reservoirs or accelerators. A closer look at the structure of such models, however, reveals more similarities than differences. Exploring these common features, we find a great variety of statistical opportunities, from the stochastic simulation of incompletely known input and boundary conditions to the design of experiments to optimize computational algorithms, from model calibration to model assessment. This talk will include examples from transportation simulation (an infrastructure model), vadose zone flow simulation at Yucca Mountain (the proposed site of a geological repository for high level nuclear waste), and climate modeling. | ||||||||
|
Matthias Schonlau RAND |
||||||||
| I will give an overview of several aspects of Cyber Crime. Topics include private company concerns and security measures, academic efforts in intrusion detection, and the position of the government. I will also discuss the work of the Cyber Assurance unit at RAND. | ||||||||
This
paper will describe some of our recent efforts in the application of modern
statistical visualization methodologies to the problem of the detection
of intruders on computer systems. We will illustrate the application of
data imaging to various problems within the intrusion detection arena.
We will also cover some new visualization frameworks which we have developed
to aid the human analyst in their interpretation of network-based intrusion
detection information. This work has been performed in support of the
Secondary Heuristic Analysis for Defensive On-line Warfare (SHADOW) intrusion
detection system. This is an operational intrusion detection system that
has been deployed at numerous facilities worldwide. Some rudimentary background
material on the SHADOW system will also be provided.
|
||||||||
|
Jaimyoung Kwon, Peter Bickel, and John Rice UC Berkeley |
||||||||
| We will discuss hidden Markov modeling of the velocity field observed from freeway stretches. Two slightly different models are proposed, each with a different flavor and computational difficulty. The models can be used for simulation and prediction. Due to the dimensionality of the problem, the traditional algorithm that calculates the likelihood exactly is not applicable. As an alternative, Monte Carlo EM (MCEM), possibly using sequential importance sampling (SIS), is used. The method of iterative conditional modes (ICM) will also be mentioned, though it is not consistent in general. Performance of each method and various computational issues will be discussed. Empirical work includes analysis of data from the I-880 and I-5 interstate highways. | ||||||||
|
|
||||||||
|
Organizer: Cathryn Dippo (dippoc@ore.psb.bls.gov) Bureau of Labor Statistics |
||||||||
|
University of California, San Diego |
||||||||
| The task of information mediation is to enable a user to query across a number of information sources as if they were a single integrated source. To accomplish this integration, the data from the sources are transformed to a common representation, through a process called wrapping. We show how such information integration can be done on survey data, when the data are presented in the DDI format and information integration is achieved in the MIX framework developed at UCSD. We point out how statistical expert knowledge must be exploited to make the mediation meaningful for users. We also demonstrate how a survey analysis tool, called Sociology Workbench, uses integrated survey information to perform its analysis. | ||||||||
|
Carol Hert Syracuse University |
||||||||
| Statistical Websites have made it possible for a huge variety of users to access statistical data. Understanding how these users locate and use data, what expectations they have for use, and what they understand about data will enable website designers to provide information and tools that enable better access. This paper reports on a series of user investigations related to United States Federal Statistical Agency websites highlighting findings to date as well as providing a perspective on how to understand and support users via system design. | ||||||||
The
old saying "there’s nothing like more data" is only true if you can
successfully access the data you need, not get lost in it. The proliferation
of statistical databases in U.S. Government Agencies has not been accompanied
by a single widely used access system. To try to facilitate terminology
standardization and cross-database access and linkup, we are connecting
databases to a large 100,000-node taxonomy of ‘concepts’ called SENSUS.
SENSUS will be used as the basis of a multi-database query planning
and access system. This paper describes SENSUS, methods of linking other
terminology systems to it, and the work now being done with some statistical
databases.
|
||||||||
|
Carol Hert Syracuse University |
||||||||
| This poster complements the presentation of the same title. It will graphically depict the connections among several streams of research all concerned with improving system design for statistical information seeking. The streams included are research about users (and non-users), users' interactions with statistical information seeking tools (including metadata and search engines), interface design, metric and tool development, and organizational impact research. The poster will indicate how these projects contribute to our understanding of user behavior and how to support it on web-based systems. | ||||||||
|
Ed Hovy USC Information Sciences Institute |
||||||||
| A poster accompanying the paper by Hovy and Klavans provides details about the SENSUS ontology and about the process of analysis and taxonomization required to integrate a database or collection of text with an ontology. | ||||||||
|
The
Data Documentation Initiative: Current Status of |
||||||||
|
In early 1995, aided by a grant from the National Science Foundation, the Inter-university Consortium for Political and Social Research formed a committee which now numbers approximately 20 stakeholders in the social science research and archiving process. The goal of the committee was to develop a specification for the documentation of empirical social science data collections. The project is known as the Data Documentation Initiative (DDI). The group decided that the best format for the specification was an XML (Extensible Markup Language) based Document Type Definition (DTD). A 13-site beta-test of an initial DDI DTD was completed in mid-1999. Comments resulting from the beta-test were reviewed and many of the suggestions have been implemented. Version 1 of the DTD will be released in early 2000. This poster session will present the goals of the DDI project. The structure of the Version 1 DTD will be presented and explained. The types of data collections that may be marked up using the DTD will be discussed. Issues held for Version 2 will be presented. Finally, some thoughts on the role of other parts of the XML suite (XML Schema, XML Data, XLink, XPointer) will be presented. |
||||||||
|
|
||||||||
|
Applications of Clustering and Classification to Large Datasets Organizer: William Shannon (shannon@osler.wustl.edu) Washington University School of Medicine |
||||||||
|
Andrew Moore Carnegie Mellon University and Schenley Park Research, Inc. |
||||||||
|
Intensive statistical analysis of massive data sources ("data mining") has been embraced as one of the final areas with a need for massive computation beyond that available on a $2000 computer or $200 videogame. We begin this talk with two examples of software, instead of hardware, giving 1000-fold speedups over traditional implementations of statistical algorithms for prediction, density estimation, and clustering. We then pause to examine directions in which these software solutions seemed blocked when faced with Physics, Biology and commercial scientific data discovery problems. The primary blocks were a curse of dimensionality and limitations on machine main memories. This is followed by four examples of new pieces of research that circumvent these barriers: lazy cached sufficient statistics, exact accelerated k-means, multiresolution ball-trees for very high dimensional real-valued data, and filament identifiers. We then reveal the reason for our new-found respect for super-computation: when an algorithm you previously ran overnight executes in seconds, you find yourself wanting to run it ten thousand times. We show the impact that being able to run intensive statistics as an inner loop has had on our analysis of cosmology data (preliminary data from the Sloan Digital Sky Survey) and biotoxin identification, where desirable but hopelessly extravagant operations such as model selection, bootstrapping, backfitting, randomization and graphical model design now become somewhat non-hopeless. Joint work with Andy Connolly (U Pitt Physics), Artur Dubrawski (Schenley Park Research), Geoff Gordon (Auton Lab), Paul Komarek (Auton Lab), Bob Nichol (CMU Physics), Dan Pelleg (Auton Lab) and Larry Wasserman (CMU Statistics). |
||||||||
|
|
||||||||
|
Daniel Weaver Genomica Corporation |
||||||||
| Information from the Human Genome Project is allowing scientists to perform systematic experiments and gather data in unprecedented amounts. This talk will review the mathematical classification techniques being applied to gene expression data and will frame the scientific questions that such data can address. Current techniques being applied to such data range from simple average-linkage clustering to self-organizing maps. While these techniques are sufficient for the existing data volumes, they are unlikely to work efficiently on the large data sets that will be generated in the coming years. Some key scientific questions that will be raised include: What constitutes a statistically significant, diagnostic gene expression pattern? How are gene expression data used to functionally classify genes? How can large volumes of gene expression data be used to predict the underlying gene expression control mechanisms? No biological background is needed; the relevant biological concepts will be described. | ||||||||
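As a rough illustration of the simplest technique named above, the following sketch performs average-linkage clustering on a handful of made-up expression profiles. The function name `average_linkage`, the toy profiles, and the choice of Euclidean distance are assumptions for illustration and are not taken from the talk; the brute-force pair search is also exactly the kind of implementation that would not scale to large data sets.

```python
import math

def average_linkage(profiles, n_clusters):
    """Agglomerative clustering; returns clusters as lists of profile indices."""
    clusters = [[i] for i in range(len(profiles))]

    def avg_dist(a, b):
        # Average pairwise Euclidean distance between two clusters.
        total = sum(math.dist(profiles[i], profiles[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        # Merge the two clusters with the smallest average distance.
        i, j = min(pairs, key=lambda p: avg_dist(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical expression profiles (two conditions per gene).
profiles = [(0.1, 0.2), (0.2, 0.1), (5.0, 5.1), (5.2, 4.9), (0.0, 0.3)]
groups = average_linkage(profiles, 2)
```

Each merge step examines every pair of clusters, so the cost grows quickly with the number of genes.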
|
for Gene Chip Data William Shannon Washington University School of Medicine |
||||||||
| It is
now recognized that examining patterns of gene expression (i.e., a gene's
state of activity) in patients can assist health professionals in detecting,
diagnosing, and treating human disease. Recently, a new technology named
the nucleic acid array or `Gene Chip' allows clinicians to measure these
patterns and begin relating them to clinical diagnosis and outcomes. In this poster we present an overview of the effort underway at Washington University School of Medicine in St. Louis to develop and apply microarray technology to basic and clinical research in surgery, molecular microbiology, genetics, and cancer. Attention is focused on the bioinformatic and biostatistical challenges we face and on early efforts to bring them under control. In addition, we will discuss a novel application of wavelet transformations to gene chip data that we are currently exploring. |
||||||||
|
Stephen D. Bay, Dennis Kibler, Michael J. Pazzani, and Padhraic Smyth University of California, Irvine |
||||||||
| Advances in data collection and storage have allowed organizations to obtain massive, complex and heterogeneous databases, which have stymied traditional methods of analysis. This has led to the development of new analytical tools that often combine techniques from a variety of fields such as statistics, computer science and mathematics to extract meaningful knowledge from the data. To support research in this area, UC Irvine has created the UCI Knowledge Discovery in Databases (KDD) Archive (http://kdd.ics.uci.edu/). This is a new online repository of large and complex databases which encompass a wide variety of data types, analysis tasks and application areas. Our goal is to foster research in knowledge discovery by making these databases publicly available. The archive is supported by the Information and Data Management Program at the National Science Foundation, and is intended to expand the current UCI Machine Learning Database Repository to databases that are orders of magnitude larger and more complex. | ||||||||
|
|
||||||||
|
New Developments in EM Organizer: Andreas Buja (andreas@research.att.com) AT&T |
||||||||
The
EM algorithm is widely used in incomplete-data problems (and some complete-data
problems) for parameter estimation. One limitation of the EM algorithm
is that upon termination, it is not always near a global optimum. As
reported by Wu, when several stationary points exist, convergence to
a particular stationary point depends on the choice of starting point.
Furthermore, convergence to a saddle point or local minimum is also
possible. In the EM algorithm, although the loglikelihood is unknown,
an interval containing the gradient of the EM q function can be computed
at individual points using interval analysis methods. By using interval
analysis to enclose the gradient of the EM q function (and, consequently,
the loglikelihood), an algorithm is developed which is able to locate
all stationary points of the loglikelihood within any designated region
of the parameter space. The algorithm is applied to several examples.
In one example involving the t distribution, the algorithm successfully
locates (all) seven stationary points of the loglikelihood.
|
||||||||
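The dependence on the starting point can be seen in a small sketch. It runs the standard EM update for the location of a t distribution with known degrees of freedom and unit scale on a made-up, well-separated sample. This is plain EM, not the interval-analysis algorithm of the abstract; the function name `em_t_location` and the data are assumptions for illustration.

```python
def em_t_location(data, start, df=1.0, n_iter=200):
    """EM for the location of a t distribution (known df, scale fixed at 1)."""
    mu = start
    for _ in range(n_iter):
        # E-step: expected precision weights given the current location.
        weights = [(df + 1.0) / (df + (x - mu) ** 2) for x in data]
        # M-step: weighted mean of the observations.
        mu = sum(w * x for w, x in zip(weights, data)) / sum(weights)
    return mu

# Hypothetical sample with two well-separated groups.
data = [-10.0, -9.0, 9.0, 10.0]
left = em_t_location(data, start=-9.0)   # settles near the left group
right = em_t_location(data, start=9.0)   # settles near the right group
```

The two runs end at different stationary points of the same loglikelihood, which is why a method that encloses all stationary points in a region is of interest.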
|
David A. van Dyk Harvard University |
||||||||
| In recent years numerous advances in EM methodology have led to algorithms which can be very efficient when compared with both their EM predecessors and other numerical methods (e.g., algorithms based on Newton-Raphson). In this paper we focus on mixed-effects models and combine several of these new methods to develop a set of mode-finding algorithms which are both fast and more reliable than standard algorithms such as proc mixed in SAS. We present efficient algorithms for Maximum Likelihood, Restricted Maximum Likelihood, and computing posterior modes. These algorithms are not only useful in their own right, but also illustrate how parameter expansion, conditional data augmentation, and ECME can be used in conjunction to form efficient algorithms. In particular, we illustrate a difficulty in using the typically very efficient parameter-expanded EM algorithm for posterior calculations, but show how algorithms based on conditional data augmentation can be used. We also present a result that extends Hobert and Casella's (JASA, 1996) result on the propriety of the posterior for the mixed-effects model under an improper prior, an important concern in Bayesian analysis. Finally, we show how similar methods applied to the Data Augmentation algorithm can lead to very efficient stochastic algorithms for posterior sampling. | ||||||||
|
|
||||||||
|
Organizer: Lorraine Denby (ld@research.bell-labs.com) Bell Labs--Lucent Technologies |
||||||||
|
John Doyle Caltech University |
||||||||
| A great deal of attention has been given recently to describing features of complex systems in terms such as self-similarity, power laws, entropy, phase transitions, criticality, fractals, chaos, and so on. This talk will focus on the fascinating statistical properties of web and internet traffic, and relate these to power law statistics in other domains, such as forest fires, power outages, natural and man-made disasters, and species extinction. While it is now widely accepted that the commonly assumed Poisson traffic models poorly describe Internet traffic, it remains to be seen if these insights will lead to new approaches to network protocol design, which remains largely ad hoc. We critique the popular explanations from statistical physics and offer some novel explanations for the origins of power laws in terms of generalizations of source coding for data compression. The implications for future convergent, ubiquitous networking will also be discussed briefly, as well as general issues of more rigorous approaches to analysis and robust design of complex multiscale systems in engineering and biology. | ||||||||
|
|
||||||||
|
Organizer: Vicki Lancaster (vlancast@neptuneandco.com) Neptune and Company, Inc. |
||||||||
|
Luis Tenorio Colorado School of Mines |
||||||||
|
Ill-posed inverse problems arise when we try to recover information about a process from partial, indirect, noisy observations. These problems are common in physical sciences like geophysics, astronomy and astrophysics, where we have to rely on indirect measurements of processes in space or underneath the Earth's surface. Through examples we will illustrate the questions that arise in ill-posed inverse problems and present some basic methods to determine a meaningful, stable solution. These methods include the Backus-Gilbert method, the method of regularization, singular value decomposition and wavelet estimation. We will also discuss methods to estimate the variability and the bias of the inversion estimates, as well as minimax estimates of the mean square error. |
||||||||
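To make the instability concrete, here is a minimal sketch of the method of regularization in its Tikhonov (ridge) form, applied to a made-up, nearly singular 2x2 system. The matrix, the noise level, the penalty `lam = 0.01`, and the helper names `solve` and `tikhonov` are all assumptions chosen for illustration, not examples from the talk.

```python
def solve(M, v):
    """Solve M x = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def tikhonov(A, b, lam):
    """Minimise ||A x - b||^2 + lam * ||x||^2 via the normal equations."""
    m, n = len(A), len(A[0])
    AtA = [[sum(A[k][i] * A[k][j] for k in range(m)) + (lam if i == j else 0.0)
            for j in range(n)] for i in range(n)]
    Atb = [sum(A[k][i] * b[k] for k in range(m)) for i in range(n)]
    return solve(AtA, Atb)

# Nearly collinear forward operator; the noise-free answer is x = [1, 1].
A = [[1.0, 1.0], [1.0, 1.0001]]
b_noisy = [2.0, 2.0011]          # second observation perturbed by 0.001
naive = tikhonov(A, b_noisy, 0.0)        # unregularized least squares
stable = tikhonov(A, b_noisy, 0.01)      # regularized solution
```

In this toy case a perturbation of 0.001 in one observation throws the unregularized solution far from [1, 1], while the penalized solution stays close to it at the cost of a small bias.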
|
Alberto Villarreal Colorado School of Mines |
||||||||
|
The Seismic Reflection Experiment is one of the most important tools in geophysical exploration. This method consists of sending artificially generated seismic waves into the earth and recording them once they are reflected back to the surface by structural irregularities in the subsurface. In a Seismic Reflection Experiment, the most important parameter of interest is the velocity at which waves travel through different media in the earth. The physical velocities in the subsurface covered by the seismic experiment determine what the recorded data will look like, because seismic data consist of travel-time measurements (the time it takes for a wave to traverse the path source-reflector-receiver), and travel time depends on the wave velocity in the earth. Therefore, we are interested in the inverse problem of estimating the medium velocities from data. These velocity estimates give important information about the structure and composition of the subsoil. We use bootstrap resampling methods to improve and automate velocity estimation from seismic data. In the bootstrap approach, data samples are created by resampling the original seismic data. Next, an optimization procedure is used to obtain velocity estimates for each sample, and the variability of the different velocity estimates is used to compute standard errors. This procedure is repeated iteratively with different trial velocities. The velocity estimate with the smallest error is selected. This is a computationally intensive method but can be efficiently implemented in parallel. Besides automating the velocity analysis, this method may be used to estimate errors of seismic velocities, which are essential for subsequent steps in the data processing sequence. |
||||||||
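The resampling step can be sketched on a deliberately simplified problem. The sketch assumes straight-line travel (time = offset / velocity plus noise), replaces the optimization procedure with a closed-form least-squares fit, and uses synthetic data; the names `fit_velocity` and `bootstrap_se`, the 2000 m/s velocity, and the noise level are hypothetical. It covers only the bootstrap standard error, not the iteration over trial velocities.

```python
import random
import statistics

def fit_velocity(offsets, times):
    """Least-squares slowness through the origin, returned as a velocity."""
    slowness = (sum(x * t for x, t in zip(offsets, times))
                / sum(x * x for x in offsets))
    return 1.0 / slowness

def bootstrap_se(offsets, times, n_boot=500, seed=0):
    """Standard error of the velocity estimate from resampled (offset, time) pairs."""
    rng = random.Random(seed)
    n = len(offsets)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        estimates.append(fit_velocity([offsets[i] for i in idx],
                                      [times[i] for i in idx]))
    return statistics.stdev(estimates)

# Synthetic travel times: true velocity 2000 m/s, Gaussian timing noise.
data_rng = random.Random(1)
offsets = [100.0 * k for k in range(1, 11)]
times = [x / 2000.0 + data_rng.gauss(0.0, 0.005) for x in offsets]

velocity = fit_velocity(offsets, times)
se = bootstrap_se(offsets, times)
```

Because each bootstrap replicate is fitted independently, the loop over replicates is the part that parallelizes naturally.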
|
Luis Tenorio and Alberto Villarreal Colorado School of Mines |
||||||||
|
In reflection seismology a trace is modeled as a convolution of a seismic pulse with a reflectivity sequence that encodes information about the layering in the subsurface. To deconvolve the trace means to remove the blurring effect of the pulse to obtain an estimate of the reflectivity. A usual assumption in deconvolution is that the reflectivity is a white random process. Most of the time, however, this assumption is not appropriate. We use Gaussian mixtures to generalize Wiener-Levinson deconvolution and obtain a procedure that is more robust to nonstationarities in the reflectivity and to correlation structure in the noise. |
||||||||
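The forward model in the first sentence can be written out directly. The sketch below builds a noise-free trace by discrete convolution of a short pulse with a sparse reflectivity sequence; both sequences and the function name `convolve` are made up for illustration, and the deconvolution step itself is not shown.

```python
def convolve(pulse, reflectivity):
    """Full discrete convolution: trace[n] = sum_k pulse[k] * reflectivity[n - k]."""
    trace = [0.0] * (len(pulse) + len(reflectivity) - 1)
    for k, w in enumerate(pulse):
        for m, r in enumerate(reflectivity):
            trace[k + m] += w * r
    return trace

pulse = [1.0, -0.5]                       # hypothetical two-sample seismic pulse
reflectivity = [0.0, 1.0, 0.0, 0.0, -0.5]  # two reflectors of opposite polarity
trace = convolve(pulse, reflectivity)
# trace == [0.0, 1.0, -0.5, 0.0, -0.5, 0.25]
```

Each reflector appears in the trace smeared by the shape of the pulse, which is the blurring that deconvolution tries to undo.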
|
Albena Mateeva Colorado School of Mines |
||||||||
| Tomography is used in seismic exploration for velocity-model building. Tomographic data consist of travel times of waves excited near the earth's surface which, after penetrating to a certain depth, have been reflected back by some geological inhomogeneity. Like any inversion procedure, tomography requires good knowledge of data uncertainties. Errors in measured travel times are introduced by a variety of factors, but one of them is always present: random noise. This paper discusses its contribution to travel time uncertainty. Two objectives were set. First, to understand the interaction between signal and random noise that leads to uncertainty in travel time. Second, to estimate that uncertainty from seismic data. The latter is a hard statistical problem that the seismic industry tends to ignore rather than tackle. A practical solution should be of great interest not only to geophysicists but also to anybody dealing with travel times of band-limited signals. | ||||||||
|
|
||||||||
|
and Computational Challenges Organizer: Sally C. Morton (Sally_Morton@rand.org) RAND |
||||||||
|
|
||||||||
|
|
||||||||
| The NYSDOH AIDS Institute is responsible for systematic monitoring of the quality of medical care provided to HIV-infected individuals in NYS. Measurement of quality is based on indicators that are linked to optimal clinical outcomes, such as performance of PAP smears, PPD screening and use of antiretroviral therapy. Implemented in 1992, this initiative incorporates continuous quality improvement (CQI) techniques to stimulate health care providers to build and sustain quality within their organizations. Record abstraction at over 100 facilities is conducted annually, using standardized data collection forms. Through a new initiative, HIVQUAL, providers submit data directly using a software program developed in collaboration with HRSA. A subset of these records is reviewed to validate information. Results are presented as aggregated facility-specific data, comparatively and longitudinally to display historical trends. HIV performance data are also reported to the public. Accuracy and precision of data are paramount to the integrity and success of performance improvement efforts led by public health agencies. Impact at the state level ranges from individual agency actions to support for activities designed to improve care. This program represents an example of collaboration between a public health agency, clinicians and statisticians to incorporate sophisticated statistical techniques into routine CQI initiatives. | ||||||||
|
Some Statistical and Computing Issues Randall K. Spoeri HIP Health Plans, New York |
||||||||
| With the rapid expansion of managed care, questions have been raised about the quality of care being delivered. In response to these concerns, quality management efforts have relied heavily on the measurement of performance. Associated with these measurement and improvement activities are various statistical and computing considerations. This talk will provide an overview of a number of these considerations, to include: measure definitions, data availability and quality, risk adjustment, analytical/interventional use of results, and predictive modeling. Thoughts about the future will conclude the talk. | ||||||||
|
John Adams RAND |
||||||||
| This talk will consider the emerging need to summarize information as multiple measures of health care quality proliferate, e.g., HEDIS, CAHPS and others. I will discuss various ways of building aggregate scales of health care quality and examine some of the technical issues that must be overcome to produce reliable aggregate information. These issues include case-mix adjustment, appropriate standard error calculations, developing weights for measures, and the communication of statistical uncertainty to the lay audience. The focus is on aggregate scales to profile HMO performance. Examples will be drawn from the development of the Combined Autos/UAW Reporting System (CARS), an HMO report card that is mailed to the employees of the big three auto makers during open enrollment. | ||||||||
|
Karl Heiner SUNY at New Paltz |
||||||||
| The AIDS Institute of the New York State Department of Health monitors the quality of care delivered by hospitals, community health centers and drug treatment centers to individuals infected with HIV. A medical peer review organization visits these facilities each year and applies a number of protocols reflecting the standard of care to random samples of medical records. Bayesian techniques are used to model aspects of the quality of care. For standards and indicators that have been employed for a number of years, conjugate analysis and simple dynamic models are useful when measuring facility-specific performance. As standards and measures of quality of care evolve, or when comparisons among groups are required, more complicated models are indicated. In order to make inferences from the models, simulation methods are applied and trellis graphics are used to display overall and among-facility trends. | ||||||||
Casemix
adjustment of consumer ratings can provide more valid plan comparisons
than unadjusted ratings by controlling for factors related to systematic
response biases to questions about health care. Adjusted data are therefore
potentially more appropriate for comparing the quality of care delivered.
If members of a particular demographic group are less inclined than others
to assign poor ratings to bad care, and members of this group are disproportionately
enrolled in some plans, casemix adjustment for this systematic bias is
useful when comparing assessments of different plans.
The CAHPS Implementation Handbook recommends adjusting for age and health status when comparing consumer assessments of health plans. Younger people and those in poorer health tend to report more problems and less positive evaluations of health care than do older people and those in better health. The current CAHPS approach uses a "health plan fixed effect" model to estimate the effects of casemix adjusters, which does not assume that all plans have equal true mean ratings (unlike simple casemix models). We also consider models that test whether casemix adjusters have different effects within different plans. We present graphical displays that compare the various casemix models considered. |
||||||||
The problem
of how to handle missing data occurs frequently in meta-analyses of
clinical studies. Few cancer studies are randomized controlled trials
and many include only one treatment arm. Often no comparative drug trials
exist between competitive treatments in other therapeutic areas. Several
techniques for imputing these differences will be described with examples
from published meta-analyses and software imputation comparisons. These
methods will include Bayesian hierarchical modeling to estimate missing
random effects, multilevel mixed models to estimate missing treatment
differences, and a proposed method to test the difference between active
treatments when only placebo-controlled studies exist.
|
||||||||
|
|
||||||||
|
The
Use of Modeling and Statistics in Defense Analysis |
||||||||
|
Colonel William Crain Defense Modeling and Simulation Office |
||||||||
| Modeling and Simulation (M&S) has become a very powerful and cost effective tool for analyzing the technical performance and warfighting utility of technologies being developed in Defense Science and Technology (S&T) programs as part of the DoD M&S strategy. An increased use of M&S-based experimentation is being seen in Defense laboratories, engineering centers, operational warfighting experiments (to address S&T impacts on force structure, doctrine, tactics, etc.), Advanced Concept Technology Demonstrations, and other aspects of the S&T community. This increase has built on many "success stories" associated with M&S-based analysis. To maximize the value of using advanced modeling and simulation technologies and techniques in the S&T community, it is important to understand the differences between models and simulations, as well as their appropriate use in experimentation. Additionally, it is important to become aware of the new technologies and applications being developed in the M&S community and their benefits to the acquisition, training and analysis communities. | ||||||||
|
A Tool to Improve Combat Simulation for the Information Age Military Lieutenant Colonel Daniel Maxwell JWARS, Office of the Secretary of Defense |
||||||||
| The Joint Warfare System (JWARS) is a campaign-level simulation of military operations that is being developed under contract by the Office of the Secretary of Defense (OSD) for use by OSD, Joint Staff, Services, and Warfighting Commands. The motivation for JWARS is to provide insight into cause and effect relationships that influence the success or failure of military forces, and ultimately to support critical operational planning and multi-billion dollar resource allocation decisions. JWARS is a closed form simulation. It is mixed-mode, with both stochastic and deterministic components; key uncertainties are reflected as stochastic events that potentially cause significantly different outcomes. JWARS also represents explicitly the effects that differences in information availability may have on operational success, as well as explicitly representing most critical combat and logistical systems. This highly resolved view of warfare provides a holistic view of combat operations that has been previously unachievable. This paper describes the JWARS design, modeling concepts, and sample model results, as they might be presented to a senior leader. | ||||||||
|
Ron Fricker RAND |
||||||||
| This paper develops and applies a technique for "rightsizing" inventory levels. We demonstrate its power on a Marine Corps' inventory consisting of $24 million of stock for 13,000 different types of items. We show that our "rightsized" inventory works significantly better than the existing inventory: We achieve equivalent performance at one-half to one-third the cost. Conversely, we demonstrate significant improvement in fill rates and other inventory performance measures for an inventory of the same cost. The computationally intensive method, based on the bootstrap, is only now becoming possible to apply with the advent of today's powerful desktop computers. It is an alternative to the standard approaches, which are often inappropriate and only applied out of computational convenience. | ||||||||
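The abstract does not spell out the algorithm. As a minimal sketch of the bootstrap idea only, assuming per-period demand counts for a single item are available, one can resample the demand history with replacement to estimate the fill rate that a candidate stock level would achieve. The function name, the fill-rate definition (units filled divided by units demanded), and the demand figures below are illustrative assumptions, not taken from the paper.

```python
import random

def bootstrap_fill_rate(demand_history, stock_level, n_boot=2000, seed=0):
    """Estimate the fill rate of one item at a given stock level by
    resampling observed per-period demands with replacement."""
    rng = random.Random(seed)
    n = len(demand_history)
    rates = []
    for _ in range(n_boot):
        sample = [rng.choice(demand_history) for _ in range(n)]
        demanded = sum(sample)
        # In each period at most `stock_level` units can be filled.
        filled = sum(min(d, stock_level) for d in sample)
        # A resample with no demand at all counts as fully filled.
        rates.append(filled / demanded if demanded else 1.0)
    rates.sort()
    mean = sum(rates) / n_boot
    lower = rates[int(0.025 * n_boot)]
    upper = rates[int(0.975 * n_boot) - 1]
    return mean, (lower, upper)

# Made-up demand counts per period for a single, slow-moving item.
history = [0, 0, 1, 0, 3, 0, 0, 2, 0, 5, 0, 1]
for level in (1, 2, 4):
    mean, (lower, upper) = bootstrap_fill_rate(history, level)
    print(level, round(mean, 3), round(lower, 3), round(upper, 3))
```

Repeating this over every item and comparing cost against estimated fill rate is one way such a resampling estimate could feed an inventory comparison; the paper's actual procedure may differ.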
|
Craig E. College Program Analysis and Evaluation Office |
||||||||
| Senior Army decision-makers constantly struggle to optimize the allocation of funds for readiness, modernization, and soldiers' quality of life. The analysis supporting resource decisions in the Army Program Objective Memorandum (POM) process comprises a large and complex series of tasks. There are time-delays between causes and effects, extensive "feedback" loops, and numerous critical qualitative and quantitative factors which must be incorporated in the analysis. Many quick-reaction "what-if" scenarios must be investigated to detail the impacts of alternative decisions. In the face of reduced personnel and dollar resources, essentially manual execution of these activities is increasingly difficult. This situation generates a requirement for a suite of analytical models to assist in: 1) developing the POM; 2) articulating the impact of resource decisions; and 3) analyzing resource trade-offs. Army Program, Analysis, and Evaluation Directorate's (PAED) Decision Support System (DSS) is being developed and tested to enhance the Army POM process. DSS techniques and methodologies include rank-based hierarchical assessment, Quality Function Deployment (QFD), and System Dynamics Simulation. Among several functions, this DSS tool prioritizes and optimizes over 500 Management Decision Packages (MDEPs) which are the key programming structures supporting the Army's mission and Title X responsibilities. Risks associated with resultant funding levels in various sectors, e.g., readiness, modernization, and soldiers' quality of life, are then made evident. Finally, this tool provides the capability to support a future executive-level display of resource options for real-time decision making. | ||||||||
|
Lionel Galway RAND |
||||||||
| In response to a "new world" of frequent deployments, the U.S. Air Force has developed a new operational concept, the Expeditionary Aerospace Force, which replaces large overseas forces with units that can be deployed quickly from the U.S. Meeting demanding timelines for deployment will require a rethinking and potential restructuring of all support functions, such as munitions, fuel, maintenance, and supply. Because of the large uncertainties regarding proposed support alternatives in future scenarios, we argue that strategic support planning (i.e., decisions on overall structure, technology and process improvements, etc.) should be done with relatively simple and transparent models that have modest data requirements and run quickly to allow runs over many different scenarios. These models allow quick exploration of wide ranges of alternatives and help direct detailed analysis to promising options. | ||||||||
|
Yvonne M. Martinez Los Alamos National Laboratory |
||||||||
| The creation of knowledge bases to support decision making has become an integral part of our Statistical Sciences Group's weapons reliability efforts. Designed properly, these knowledge bases become the critical infrastructure of expertise from which information (quantitative and qualitative) is extracted and modeled. An example of the Prototype Slapper Detonator Knowledge Base will be given. This is an electronic repository for capturing and preserving knowledge on a type of detonator, the Slapper. The Knowledge Base is a resource for the Department of Defense's (DoD's) decision making on the design, modeling, manufacturing, and reliability of the Slapper. Methods for developing the prototype originated with meetings of advisory experts and analysts on the contents, organization, and needed capabilities of the Knowledge Base. We refine the Knowledge Base through usability tests: observations of users as they search through the Knowledge Base and perform their decision-making tasks. The Statistical Sciences Group is further exploring the use of knowledge bases to integrate information, to perform statistical analysis and modeling, and to support collaborative updating of knowledge. This work will also be displayed through examples of PREDICT (Performance and Reliability Evaluation and Design by Information Combination and Tracking) and RETAIN (Repository for Expertise and Tools for Analyzing Integrating KNowledge). | ||||||||
|
|
||||||||
|
Organizers: Bonnie Ray (borayx@m.njit.edu) New Jersey Institute of Technology and Leslie M. Moore (lmoore@lanl.gov) Los Alamos National Laboratory |
||||||||
|
Charu Chandra University of Michigan |
||||||||
| The concept of supply
chain is about managing coordinated information and material flows, plant
operations, and logistics. It provides flexibility and agility in responding
to consumer demand shifts with minimum cost overlays in resource utilization.
The fundamental premise of this philosophy is synchronization among multiple
autonomous entities represented in it. That is, improved coordination within
and between various supply chain members. Coordination is achieved within
the framework of commitments made by Members to each other. Members negotiate
and compromise in a spirit of cooperation, in order to meet these commitments.
Increased coordination can lead to reduction in lead times and costs, alignment
of interdependent decision-making processes, and improvement in the overall
performance of each Member, as well as the supply chain network (Group).
Such an arrangement offers opportunities to design, model, and analyze problems
with local perspective of a Member and global view of a Group. It also holds
the potential of emergence of divergent supply chain network topologies,
in order to satisfy dynamic market conditions. These unique configurations
and associated problems require formulations in relation to a systems framework,
recognizing their domain dependence within the domain independent environment
of the supply chain. In this talk, we present a systems framework that utilizes
an interdisciplinary approach to supply chain management with methods and
techniques incorporated from production operations management, management
science, industrial engineering/operations research, systems sciences,
and artificial intelligence and computer science fields.
One of the important supply chain management problems is to transform incomplete information about the market and available production resources into coordinated plans for production and replenishment of goods and services in the network formed by cooperating entities. We illustrate how the proposed framework addresses this problem for a textile supply chain. |
||||||||
Computer
codes are being developed to model many complex phenomena including
a manufacturing supply chain. Use of computer simulation models leads
to the consideration of statistical methods for gaining understanding
from these models of underlying processes, perhaps to stimulate the
development of science or to support decision making. We describe statistical
methods for sensitivity and performance analysis of complex computer
simulation experiments. Analysis of variance-based methods or regression
tree analysis are useful for determining which variables have substantive
influence on the experimental results and for investigating the structure
of underlying relationships between inputs and outputs. An approach
to analysis leads to the need to design computer experiments from which
estimates of the quantities of interest can be obtained with reasonable
efficiency. Inputs to simulation codes may number in the tens to hundreds
and information that allows focus on subsets of important inputs is
invaluable. Some experiment design approaches based on fractional factorial
design, or orthogonal arrays, are described.
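As a minimal sketch of the fractional factorial idea mentioned above, the code below builds a two-level design in which a few "base" inputs are run as a full factorial and each additional input is set to the product of selected base columns. The function names and the particular generator (D = A*B*C, giving a 2^(4-1) design) are illustrative assumptions; the talk does not specify which designs were used.

```python
from itertools import product

def fractional_factorial(n_base, generators):
    """Build a two-level fractional factorial design.

    n_base: number of base factors run as a full factorial.
    generators: list of tuples of base-factor indices; each extra
        factor is the product of the listed base columns.
    Returns a list of runs, each a tuple of -1/+1 levels.
    """
    runs = []
    for base in product((-1, 1), repeat=n_base):
        extra = []
        for gen in generators:
            level = 1
            for idx in gen:
                level *= base[idx]
            extra.append(level)
        runs.append(base + tuple(extra))
    return runs

def columns_orthogonal(runs):
    """Check that every pair of factor columns has zero inner product."""
    k = len(runs[0])
    return all(
        sum(r[i] * r[j] for r in runs) == 0
        for i in range(k)
        for j in range(i + 1, k)
    )

# 2^(4-1) design: A, B, C form a full factorial and D = A*B*C.
design = fractional_factorial(3, [(0, 1, 2)])
for run in design:
    print(run)
print("orthogonal main-effect columns:", columns_orthogonal(design))
```

Eight runs cover four inputs here instead of sixteen; the price is that some interaction effects are aliased with one another, which is the usual trade-off when screening many simulation inputs.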
|
||||||||
|
Bonnie Ray New Jersey Institute of Technology |
||||||||
| Forecasting item demand across different segments of the manufacturing processes and across different time horizons is an integral part of supply-chain management. In this talk, we review time series methods that are commonly used for demand forecasting in inventory management applications and discuss some issues that arise when the methods are extended to the supply chain framework. An example of forecasting items in a textile supply chain is used to illustrate the discussion. | ||||||||
|
|
||||||||
|
Examples from Landscape Evolution and Tectonics Organizer: Dorothy Merritts (D_Merritts@acad.FandM.edu) Franklin and Marshall College |
||||||||
|
Peter Dodds MIT |
||||||||
| The statistics and structure of river networks are commonly described by power laws. In practice, deviations in scaling are present, making exact measurements difficult. The choice of parameter ranges used for regression analysis can markedly affect estimates of exponents. We show that many relationships possess several distinct scaling regimes linked by crossover regions. These scaling regimes, which may be present to varying degrees, pertain to the scale of linear, pre-network basins; large length scales at which correlations in landscapes become negligible; and outer length scales dictated by geology. We present evidence from real data for large-scale networks including the Mississippi, Amazon, Nile and Kansas river basins. We observe that improvements in topographic resolution would be unlikely to result in cleaner statistics; that variations in measurements for small-scale basins are real and unavoidable; and that strong deviations are indicative of geology being at work. | ||||||||
|
Garry Willgoose and Greg Hancock University of Newcastle, Australia |
||||||||
|
Over the last 15 years
a range of landform evolution models have been developed. These models
are highly nonlinear and sensitive to small perturbations. It is not possible
to carry out repeated combinatorial experiments to collect the data to
test these models. Finally, a key aspect of landforms is their spatial
organisation in the form of drainage networks. This paper outlines a novel
methodology the authors have developed to address these limitations using
key indicator statistics (width function, cumulative area function, area-slope
relationship) and the GLUE hypothesis testing methodology (Beven and Binley,
1992). Results from studies will be shown. Some unresolved statistical
challenges will be outlined including |
||||||||
In
1811-1812 three great (8.0+) earthquakes occurred near New Madrid, Missouri.
We estimate coseismic deformation in this area using stream elevation
data from topographic maps. Streams have a natural profile, the gradient
of which depends on the resistance of underlying sediment and the volume
of stream flow. If tectonic processes elevate the upstream end of a
segment a different amount than the downstream end, the stream will
attempt to return to its natural gradient by incising, aggrading, or
altering its sinuosity. This adjustment takes time, so deviations from
the natural gradient may indicate geologically recent deformation. We
use penalized regression splines to estimate the natural stream profile
and the deformation of the ground surface. Estimation of the natural
profile and deformation is based on nonparametric regression of the
form y2 - y1 = f(x2) - f(x1), where the x's are univariate or bivariate.
|
||||||||
|
with Applications to Seismic Deformation Estimation Derek Stanford MathSoft, Inc. |
||||||||
| To estimate seismic deformation, we need to fit a smooth surface to marked spatial data. Estimation of this surface can be characterized as a regression problem with smoothness constraints. We use a conjugate gradient method because the large size of the design matrix in this regression does not permit computation of an exact least squares solution. For example, a relatively coarse surface grid with 10^3 cells per side would lead to a design matrix with 10^6 columns; an exact linear solution would require inversion of a matrix with dimensions 10^6 by 10^6, which is not currently feasible. Our conjugate gradient approach allows us to avoid this difficulty by taking advantage of the sparse structure of the design matrix, and we have implemented this in software written in C and Splus. | ||||||||
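A minimal sketch of why conjugate gradients suit this setting: the normal equations (A'A + lam I) x = A'b can be solved using only products with A and its transpose, so the large matrix A'A is never formed or inverted. The sparse row storage, the function names, and the simple ridge penalty `lam` (standing in for the authors' smoothness constraints, which the abstract does not detail) are all assumptions made for illustration.

```python
def matvec(rows, x):
    """Compute A @ x for A stored as a list of sparse rows [(col, val), ...]."""
    return [sum(v * x[j] for j, v in row) for row in rows]

def rmatvec(rows, y, n_cols):
    """Compute A.T @ y without forming the transpose."""
    out = [0.0] * n_cols
    for yi, row in zip(y, rows):
        for j, v in row:
            out[j] += v * yi
    return out

def cg_least_squares(rows, b, n_cols, lam=0.0, tol=1e-10, max_iter=1000):
    """Solve (A.T A + lam I) x = A.T b by conjugate gradients."""
    x = [0.0] * n_cols
    r = rmatvec(rows, b, n_cols)      # residual of the normal equations at x = 0
    p = r[:]
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if rs_old <= tol * tol:       # residual norm is below tol
            break
        Np = rmatvec(rows, matvec(rows, p), n_cols)
        Np = [a + lam * pi for a, pi in zip(Np, p)]
        alpha = rs_old / sum(pi * a for pi, a in zip(p, Np))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * a for ri, a in zip(r, Np)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

# Tiny illustrative system: 4 observations, 3 unknowns, mostly zero entries.
rows = [
    [(0, 1.0), (1, 1.0)],
    [(1, 1.0), (2, 1.0)],
    [(0, 1.0), (2, 1.0)],
    [(0, 2.0)],
]
truth = [1.0, 2.0, 3.0]
b = matvec(rows, truth)
print(cg_least_squares(rows, b, n_cols=3))
```

Each iteration costs time proportional to the number of nonzero entries of A, which is what makes a design matrix with on the order of 10^6 columns tractable when it is sparse.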
|
|
||||||||
|
Organizer: Barry Moser (bmoser@lsu.edu) Louisiana State University |
||||||||
|
Robert G. Downer Louisiana State University |
||||||||
| Traditional experimental designs do not directly account for spatially dependent observations. However, this aspect of the data cannot be ignored in either design or analysis. The basic issues associated with spatially correlated observations in standard designs will be presented and some of the methods which have attempted to address the problem will be reviewed. The impact of these issues on hypothesis testing will be discussed. For fixed effects one-way analysis of variance, a permutation test is introduced. | ||||||||
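The abstract does not describe how its permutation test handles spatial dependence. For orientation only, the sketch below shows the ordinary permutation F test for a one-way layout, which assumes observations are exchangeable under the null hypothesis; spatial correlation violates that assumption, which is presumably what the authors' version addresses. The function names and data are illustrative assumptions.

```python
import random

def f_statistic(groups):
    """One-way ANOVA F statistic for a list of groups of numbers."""
    all_vals = [v for g in groups for v in g]
    n, k = len(all_vals), len(groups)
    grand = sum(all_vals) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def permutation_anova(groups, n_perm=5000, seed=0):
    """Return the observed F and a permutation p-value obtained by
    randomly reassigning observations to groups of the original sizes."""
    rng = random.Random(seed)
    observed = f_statistic(groups)
    pooled = [v for g in groups for v in g]
    sizes = [len(g) for g in groups]
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm, start = [], 0
        for s in sizes:
            perm.append(pooled[start:start + s])
            start += s
        if f_statistic(perm) >= observed:
            count += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (count + 1) / (n_perm + 1)

# Made-up yields for three treatments, four plots each.
data = [
    [4.1, 5.0, 4.6, 5.2],
    [6.8, 7.1, 6.5, 7.4],
    [5.0, 5.5, 4.9, 5.8],
]
f_obs, p_value = permutation_anova(data)
print(round(f_obs, 2), p_value)
```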
|
Patrick D. Gerard, David Evans, and Michael Cox Mississippi State University |
||||||||
| The potential environmental benefits of precision agriculture have been well documented. However, before these benefits can be fully realized, the economic feasibility of these methods must be addressed. One of the primary costs associated with precision farming involves the acquisition of information on quantities of interest, such as soil fertility and weed abundance, across a large area. Normally, obtaining the requisite information is both time and labor intensive. Agricultural scientists have turned to remote sensing to non-invasively obtain information, in the form of intensity of energy radiated from the earth, which may be related to the quantities of interest. In this talk, we will briefly discuss some of the objectives of precision agriculture and how they relate to remote sensing. We will also discuss, in general terms, some statistical and data management issues that arise from the use of remote sensing in precision agriculture, with emphasis on new challenges as remotely sensed data moves from being multispectral in nature to hyperspectral. | ||||||||
|
Raúl E. Macchiavelli and Rocío del P. Rodríguez University of Puerto Rico |
||||||||
| Rust is an important disease of coffee that decreases yield of coffee beans. For monitoring purposes, sampling is done in a two-step systematic plan: trees are sampled systematically (in a W pattern covering the field), and then leaves are randomly sampled from each selected tree. Since coffee in Puerto Rico is grown in areas with pronounced slopes, these plans require walking diagonally along slopes, which is not feasible for regular monitoring by farmers. In this work we compare by simulation different sampling plans in order to find one to be used for monitoring this disease. The disease incidence was evaluated for all trees in a coffee lot (N=1269) and two systematic patterns were studied: a W pattern every c trees (c=2,…,6) and a pattern with parallel rows, where trees were selected along 4 equally spaced parallel rows, every c trees (c=2,…,7). Simulations were carried out using a SAS macro, sampling 2, 5, 10, 30, and 40 leaves per tree every c trees. In order to generalize these results to other fields, the spatial distribution pattern was modeled using spatial statistics; new data sets were generated by changing the disease incidence and the spatial correlation; and simulations were run using the approach described before. The results suggest that both systematic sampling patterns gave approximately the same standard errors (except in cases with large spatial correlation). | ||||||||
|
James Beaver and Raúl E. Macchiavelli University of Puerto Rico |
||||||||
| Single-seed descent has been used by grain legume breeders to maintain genetic variability in populations of advanced generation lines. In this method, a single seed from each plant is chosen and planted, and the process is repeated several times (typically four). In order to reduce labor costs, breeders use a multiple seed procedure in which a single pod rather than a single seed is harvested from each plant and bulked. This paper studies the distribution of the proportion of the original plants represented after applying this method four times (advancing from the second to the sixth generations). Since the analytical solution to this problem is intractable (it involves a 4-fold convolution of a multivariate hypergeometric distribution), a simulation was run in SAS to estimate this probability distribution. Results show that the proportion of plants represented at least once is between .25 and .33; that an increase in size of the original population does not significantly influence the mean proportion but decreases its variability; and that the number of seeds per pod affects both the mean and the standard deviation of the distribution. | ||||||||
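The authors ran their simulation in SAS and the abstract does not give its details. The Python sketch below is one plausible reading of the procedure, under stated assumptions: every plant contributes one pod with the same number of seeds, all seeds are bulked, and the next generation is a simple random sample (without replacement) of as many seeds as there were plants. All names and parameter values are hypothetical.

```python
import random

def simulate_pod_descent(n_plants, seeds_per_pod, n_cycles=4, n_sims=500, seed=0):
    """Simulate the single-pod bulk procedure and return, for each run,
    the proportion of original plants still represented after n_cycles."""
    rng = random.Random(seed)
    proportions = []
    for _sim in range(n_sims):
        # lineage[i] is the original (founder) plant that plant i descends from.
        lineage = list(range(n_plants))
        for _cycle in range(n_cycles):
            # One pod per plant is bulked; each seed carries its plant's founder.
            bulk = [founder for founder in lineage for _s in range(seeds_per_pod)]
            # The next generation is grown from n_plants seeds drawn from the bulk.
            lineage = rng.sample(bulk, n_plants)
        proportions.append(len(set(lineage)) / n_plants)
    return proportions

props = simulate_pod_descent(n_plants=100, seeds_per_pod=4)
mean = sum(props) / len(props)
sd = (sum((p - mean) ** 2 for p in props) / (len(props) - 1)) ** 0.5
print(round(mean, 3), round(sd, 3))
```

With `seeds_per_pod=1` the sketch reduces to single-seed descent, where every original plant remains represented; larger pods let some lineages crowd out others, which is the effect the paper quantifies.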
|
Bradley Tiffee and Robert G. Downer Louisiana State University |
||||||||
| Precision agriculture uses geographic information systems, computer technology and a global positioning system to map site-specific factors that affect production. The Dean Lee Research Station of the Louisiana Agricultural Experiment Station is using this technology to map soil nutrients and soybean yield. This may be achieved through spatial modeling using variograms and kriging. A sample variogram is estimated from sampled magnesium concentrations and kriging is used to predict values for the entire field. A comparison is made to prediction from a trend surface model and recommendations are discussed. | ||||||||
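As a minimal sketch of the first step only, the code below computes a classical (method-of-moments) sample variogram: for each distance bin, half the average squared difference between values at pairs of locations separated by a distance in that bin. The function name, binning scheme, and the magnesium values are illustrative assumptions; fitting a variogram model and kriging are not shown.

```python
import math

def sample_variogram(coords, values, bin_width, n_bins):
    """Return a list of (bin_center, semivariance) pairs.

    Semivariance is None for bins that contain no pairs of points.
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            h = math.dist(coords[i], coords[j])
            b = int(h // bin_width)
            if b < n_bins:
                sums[b] += (values[i] - values[j]) ** 2
                counts[b] += 1
    return [
        ((b + 0.5) * bin_width, sums[b] / (2 * counts[b]) if counts[b] else None)
        for b in range(n_bins)
    ]

# Made-up magnesium readings on a 3 x 3 sampling grid (arbitrary units).
grid = [(x, y) for y in range(3) for x in range(3)]
magnesium = [12.1, 12.8, 13.9, 11.7, 12.5, 13.4, 11.2, 12.0, 13.1]
for center, gamma in sample_variogram(grid, magnesium, bin_width=1.0, n_bins=3):
    print(center, gamma)
```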
|
|
||||||||
|
Organizer: Paul Black (pblack@neptuneandco.com) Neptune and Company, Inc. |
||||||||
|
A Methodological Study in Nuclear Waste Disposal Risk Assessment David Draper University of Bath in England |
||||||||
| In this talk I will examine a Bayesian conceptual and computational framework for accounting for all sources of uncertainty in complex prediction problems, involving six ingredients: past data, future observables, and scenario, structural, parametric, and predictive uncertainty. I will apply this framework to nuclear waste disposal using a computer simulation environment, GTMCHEM, which "deterministically" models the one-dimensional migration of radionuclides through the geosphere up to the biosphere. Focusing on scenario and parametric uncertainty, I will show that mean predicted maximum dose for humans on the earth's surface due to key radionuclides, and uncertainty bands around those predictions, are noticeably larger when scenario uncertainty is properly assessed and propagated. I will conclude by describing how Bayesian decision theory can take predictions such as these and turn them into recommendations for environmental action. | ||||||||
|
Kenneth H. Reckhow and Mark E. Borsuk Duke University |
||||||||
| A probability network model is being developed and applied to the problem of eutrophication in the Neuse Estuary, USA. Also called a "Bayes net," this model consists of the variables of interest in the system and a set of assertions concerning the probabilistic relationships among the variables. The objective of the model is to provide a scientific assessment of the impact of nitrogen loading on estuarine algal blooms and fishkills; model expressions are quantified using data analysis, mechanistic relationships, and/or expert judgment. Probabilistic predictions of model endpoints may then be made based on the entire set of conditional probabilities. Not only does this network structure provide an integrated approach to uncertainty analysis, but it also allows easy updating of prediction and inference when observations of model variables are made. This capability is particularly important when applied to a natural system in which additional monitoring is likely to occur concurrent with the modeling effort. The method is probabilistic in its approach, which facilitates a meaningful communication of uncertainty, and is consistent with the risk assessment paradigm. Model endpoints are chosen so that they are of vital interest to stakeholders and can be easily expressed for use in formal decision analysis. | ||||||||
|
Samantha Bates and Adrian Raftery University of Washington |
||||||||
| In this paper we discuss Bayesian methods of analysis that incorporate both prior knowledge of the distributions of the inputs to a deterministic model and any available data on the model inputs and outputs. These methods yield posterior distributions for the model output from which to find distributions for quantities of interest. The first method uses Monte Carlo simulation from the prior distributions for the inputs and resampling of these simulations with weights determined by the observed data under the sampling importance resampling scheme of Rubin. The second involves sampling from the posterior using MCMC methods. We will present an application of the methods to modeling polychlorinated biphenyl (PCB) concentrations in various media at a Superfund site in New Bedford Harbor, MA. A deterministic model for PCB concentration in soil was developed by Cullen (1992). Expert opinion is reflected in the prior distributions for model inputs. Interest lies in developing a distribution for the PCB concentration in soil at this site which accounts for uncertainty and variability in the model inputs and can be used in policy decisions. | ||||||||
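A minimal sketch of the first method, under simplifying assumptions: inputs are drawn from the prior, pushed through the deterministic model, weighted by the likelihood of an observed output, and then resampled in proportion to those weights. The toy model (output = 2 * input), the uniform prior, the Gaussian measurement error, and all function names are hypothetical stand-ins; the PCB model of Cullen (1992) is not reproduced here.

```python
import math
import random

def sir(model, prior_sampler, log_likelihood, n_prior=20000, n_post=2000, seed=0):
    """Sampling importance resampling for a deterministic model.

    Returns resampled inputs and their corresponding model outputs.
    """
    rng = random.Random(seed)
    inputs = [prior_sampler(rng) for _ in range(n_prior)]
    outputs = [model(t) for t in inputs]
    logw = [log_likelihood(o) for o in outputs]
    top = max(logw)                      # subtract the max for numerical stability
    weights = [math.exp(lw - top) for lw in logw]
    idx = rng.choices(range(n_prior), weights=weights, k=n_post)
    return [inputs[i] for i in idx], [outputs[i] for i in idx]

def model(theta):
    """Toy deterministic model (an assumption for illustration)."""
    return 2.0 * theta

def prior_sampler(rng):
    """Uniform prior on the single model input."""
    return rng.uniform(0.0, 10.0)

def log_likelihood(output, observed=8.0, sd=0.5):
    """Gaussian log-likelihood (up to a constant) of one observed output."""
    return -0.5 * ((output - observed) / sd) ** 2

post_inputs, post_outputs = sir(model, prior_sampler, log_likelihood)
print(round(sum(post_inputs) / len(post_inputs), 2))
print(round(sum(post_outputs) / len(post_outputs), 2))
```

The resampled outputs approximate the posterior distribution of the model output, from which summaries for quantities of interest can be read off directly.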
|
Human Health Impacts from Radioactive Contamination Tom Stockton and Paul Black Neptune and Company, Inc. |
||||||||
| EPA regulations and DOE orders require assessing the impact on human health of radioactive waste contamination over periods of up to ten thousand years. Towards this end, complex environmental simulation models are used to assess "risk" to human health from migration of radioactive contamination. Typically there is very little data underlying these models, and the data that is available is incorporated in input parameter distributions for Monte Carlo simulation. Expert judgment typically drives the level of model complexity chosen, but model complexity choices are often not made within the context of the decision to be made. The utility and regulatory acceptability of a Bayesian approach to decision making regarding radioactive contamination is discussed within the context of radioactive contamination examples from Los Alamos National Laboratory, Hanford, and the Nevada Test Site. These examples highlight the desirability and difficulties of merging the cost of monitoring, the cost of the decision analysis, the cost and viability of clean up, and the probability of human health impacts within a rigorous decision framework. | ||||||||
|
|
||||||||
|
Organizer: Edward J. Wegman (ewegman@gmu.edu) George Mason University |
||||||||
|
Lasse Holmström Rolf Nevanlinna Institute |
||||||||
| Small arctic and subarctic lakes are known to be sensitive to climatic variation. Changes in external conditions are continuously recorded in their sediments in the form of aquatic organisms. The abundance of such organisms can therefore be used to reconstruct past environmental conditions. We use data collected from the Finnish Lapland to reconstruct post ice-age temperatures. Nonparametric smoothing has not been used often in this context. We find smoothing a viable tool both in the actual reconstruction phase and in the subsequent time series smoothing, where the SiZer method is used. | ||||||||
|
Giancarlo Ragozini Universita di Napoli Federico II |
||||||||
| From a geometrical point of view, outliers are those observations lying isolated on the periphery of the data cloud. A large literature exists on the detection of multiple outliers in multivariate data sets. Most recent proposals are based on some robust distance of each data point from a center. However, they are really effective only when the data scatter has a regular shape. The proposed method is based on the direct exploration of the data periphery, without considering any center or fixed shape, exploiting the geometrical properties of the sample convex hull. The first step of the proposed detection procedure consists of a new "weak" convex hull peeling, reducing the computational effort of the classical peeling procedures. In this step the set of candidate outliers is identified, evaluating gaps in the data scatter and proximities to its boundary region. In the second step, a block omission approach is performed, considering only some specific subsets among the candidate outliers, in order to reduce the combinatorial computational cost. The outlyingness of each subset is measured through a new index based on the variation of the convex hull volume when a subset is omitted. | ||||||||
|
Mia Hubert, Peter J. Rousseeuw, and Stefan Van Aelst Universitaire Instelling Antwerpen |
||||||||
| The deepest regression is a method for linear regression introduced by Rousseeuw and Hubert (1999). It is the fit with maximal regression depth. We prove that this estimator is highly robust against outliers. We propose an approximate algorithm for fast computation of the deepest regression in higher dimensions, and apply it to several real data sets. From the distribution of the regression depth function we construct tests for the true unknown parameters in the linear regression model. We also propose a bootstrap method to construct regression confidence regions. For bivariate datasets we use the maximal depth to construct a test for linearity versus convexity/concavity. Finally, the deepest regression is applied to polynomial regression and to the Michaelis-Menten model. | ||||||||
|
|
||||||||
|
Organizer: Alan Karr (karr@niss.org) National Institute of Statistical Sciences |
||||||||
|
Todd L. Graves Los Alamos National Laboratory |
||||||||
| Papers and journals about data analyses that are published online should be different from and more powerful than those in paper journals. Web papers can include interactive visualization and data analysis applets to allow the readers to perform exploratory analyses. Readers could also replay the authors' analyses, trying out minor modifications, or even importing their own software or data to see if different analyses or additional data change the authors' conclusions. Online journals could be reorganized so that all analyses of a particular data set or that bear on a particular real world problem could be reachable through the same web page. Readers could perform and submit their own analyses of these problems all within the web journal environment. This talk will include demonstrations of these concepts written in Java. | ||||||||
|
Ashish Sanil National Institute of Statistical Sciences |
||||||||
| Government agencies often report their data (gathered through sample surveys and censuses) in the form of statistical summaries by geographic unit (e.g., by state or county). In many cases the public release of data on particular geographic units is considered too risky for preserving the confidentiality of the respondents. A possible strategy for such cases is to aggregate neighboring regions into larger units that satisfy the confidentiality requirements. Often, as in the case of an on-line query system, the computation of the aggregations needs to be automated, should be computationally efficient, and should produce meaningful aggregates. Procedures for carrying out such confidentiality-preserving geographic aggregation will be described, and illustrative examples will be presented. | ||||||||
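One simple way to automate such aggregation is a greedy merge: repeatedly combine the smallest region that falls below a disclosure threshold with its smallest neighboring unit. The sketch below illustrates that idea only. The abstract does not describe the actual procedures, so the greedy rule, the count-based threshold, and the name `aggregate_regions` are all assumptions.

```python
def aggregate_regions(counts, neighbors, threshold):
    """Greedily merge adjacent regions until each unit has count >= threshold.

    counts:    dict mapping region name -> respondent count
    neighbors: dict mapping region name -> set of adjacent region names
    Returns a dict mapping frozenset of region names -> combined count.
    Units with no neighbors are left as-is even if below the threshold.
    """
    group_of = {r: frozenset([r]) for r in counts}
    total = {group_of[r]: counts[r] for r in counts}

    def adjacent_groups(group):
        found = {group_of[n] for r in group for n in neighbors.get(r, ())}
        return found - {group}

    while True:
        small = [g for g in total
                 if total[g] < threshold and adjacent_groups(g)]
        if not small:
            break
        # Pick the smallest risky unit and its smallest neighbor; ties are
        # broken by name so that the result is deterministic.
        group = min(small, key=lambda s: (total[s], sorted(s)))
        partner = min(adjacent_groups(group),
                      key=lambda s: (total[s], sorted(s)))
        merged = group | partner
        total[merged] = total.pop(group) + total.pop(partner)
        for r in merged:
            group_of[r] = merged
    return total


counts = {"A": 3, "B": 4, "C": 20, "D": 2}
neighbors = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
for unit, count in aggregate_regions(counts, neighbors, threshold=5).items():
    print(sorted(unit), count)
```

Each merge reduces the number of units by one, so the loop always terminates. A production procedure would also need a criterion for "meaningful" aggregates, which this sketch ignores.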
|
Nandini Raghavan AT&T Labs-Research |
||||||||
|
From a statistician's point of view, the most striking (and daunting) aspect of the web phenomenon is the challenge of extracting meaningful information from the tremendous volume of data available. In this talk I describe our efforts to develop evolving statistical profiles (signatures) of users of an internet service provider (ISP). ISPs are characterized by large, fluid user populations and massive, dynamic data streams that record information at different granularities. I describe an application where we use these signatures to build formal statistical models of customer migration.
|
||||||||
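An evolving profile can be maintained incrementally, without storing the full data stream, by blending each new batch of activity into the current signature. The sketch below uses an exponentially weighted update for this. That update rule, the usage-share representation, and the name `update_signature` are assumptions made for illustration; the abstract does not say how the signatures are computed.

```python
def update_signature(signature, new_activity, weight=0.1):
    """Blend a new activity summary into an existing signature.

    signature, new_activity: dicts mapping a feature (e.g., a service type)
    to a usage share. `weight` controls how quickly old behavior is forgotten.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    features = set(signature) | set(new_activity)
    return {
        f: (1 - weight) * signature.get(f, 0.0)
           + weight * new_activity.get(f, 0.0)
        for f in features
    }


profile = {"email": 1.0}
profile = update_signature(profile, {"web": 1.0}, weight=0.5)
print(profile)
```

A larger `weight` tracks changes in behavior faster at the cost of a noisier profile.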
|
Organizer: Luis A. Escobar (luis@stat.lsu.edu) Louisiana State University |
||||||||
|
Bradley Jones SAS Institute, Inc. |
||||||||
|
This presentation illustrates the use of JMP software to analyze a manufacturing-oriented reliability problem with competing risks. The application is non-trivial and the data are real. The demonstration shows the ability of fast modern computers to fit, diagnose, and refine such complex models interactively. |
||||||||
|
William Q. Meeker Iowa State University |
||||||||
Censored and truncated data arise frequently in product reliability studies involving laboratory accelerated life tests, field tracking studies, and the analysis of warranty data. S-PLUS has powerful tools for analyzing such data. This talk will describe SLIDA, a collection of S-PLUS functions for Life Data Analysis that has been designed to extend and enhance the S-PLUS capabilities in this area. Some of these extensions include:
|
||||||||
|
| ||||||||
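To make the role of censoring concrete, the simplest case is an exponential lifetime model with right-censored units, where the maximum likelihood estimate of the failure rate is the number of observed failures divided by the total time on test. The sketch below is plain Python written for illustration. It is not SLIDA or S-PLUS code, and the function name `exponential_rate_mle` is hypothetical.

```python
def exponential_rate_mle(times, failed):
    """MLE of the exponential failure rate from right-censored data.

    times:  time on test for each unit (failure time or censoring time)
    failed: True if the unit failed at that time, False if it was censored
    """
    if len(times) != len(failed):
        raise ValueError("times and failed must have the same length")
    failures = sum(bool(f) for f in failed)
    if failures == 0:
        raise ValueError("at least one observed failure is required")
    total_time_on_test = sum(times)
    return failures / total_time_on_test


# Two failures and two units still running when the test ended.
rate = exponential_rate_mle([10, 20, 30, 40], [True, True, False, False])
print(rate, 1 / rate)  # estimated failure rate and estimated mean life
```

Censored units contribute their running time to the denominator but no failure to the numerator. Discarding them or treating them as failures would bias the estimate.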