Published in Agron. J. 96:631-645 (2004).
© American Society of Agronomy
677 S. Segoe Rd., Madison, WI 53711 USA
STATISTICS
New Indices to Quantify Patterns of Residuals Produced by Model Estimates
Marcello Donatelli*,a,
Marco Acutisb,
Gianni Bellocchia and
Gianni Filaa
a Research Institute for Industrial Crops, Via di Corticella, 133, 40128 Bologna, Italy
b DiProVe, Università di Milano, Via Celoria 2, 20133 Milano, Italy
* Corresponding author (m.donatelli{at}isci.it).
Received for publication June 18, 2002.
 |
ABSTRACT
|
|---|
The evaluation of patterns in the residuals of model estimates vs. other variables can be useful in both model evaluation and parameter calibration. New indices that allow quantifying such patterns (pattern indices) are presented. Groups of residuals are created by dividing the range of the variable under evaluation into two, three, four, or five subranges. Two types of indices are proposed. The first type (PI-type) is based on the absolute value of the maximum difference between pairwise comparisons among average residuals of each group of residuals. A variant of this index is computed by using variance ratios (PI-F type). The subranges of the variable that determines the grouping of residuals may be of equal length (PI) or variable length (PI
). In the second case, they are generated by an algorithm that optimizes subranges to maximize patterns. The power of the diverse pattern indices at identifying patterns was investigated, and their effectiveness was compared against the runs test. Critical values for pattern indices were generated by Monte Carlo simulations. Monte Carlo probability tables, the results of power analysis, and the results of using pattern indices at two case studies (i.e., daily radiation and soil water content estimates) were presented. The analysis based on pattern indices provided insight in model structure and parameter calibration. Pattern indices also allowed evaluating model performance and discriminating among alternative models. Higher power in identifying patterns was given by range-based pattern indices than by those based on variance ratios.
Abbreviations: BC, BristowCampbell (model) DB, DonatelliBellocchi (model) doy, day of year MSE, mean of square error PI, range-based fixed pattern index PI-F, F-based fixed pattern index PI
, range-based variable pattern index PI
F, F-based variable pattern index RMSE, root mean square error Tmin, minimum air temperature
 |
INTRODUCTION
|
|---|
DURING THE PAST SEVERAL decades, considerable advances have been made in the mathematical description of physical, chemical, and biological processes underlying soilplantatmosphere interactions (Penning de Vries et al., 1989; Plentiger and Penning de Vries, 1997; Donatelli et al., 1999, 2002; Bindi et al., 2001). The evaluation of simulation models is an important step in the complex task of modeling activity (e.g., Penning de Vries, 1977). The process of comparing model outputs against an independent data set is commonly known as validation (Schlesinger et al., 1979; Sage, 1987; Rykiel, 1996) although the terminology is not standardized and rather controversial (Konikow and Bredehoeft, 1992; Bredehoeft and Konikow, 1993; Bair, 1994). Validation results may lead to a decision whether or not to use a model for a specific application. Although not all models can be easily validated, it is clear that a model that does not fit measured data in a proper way must not be used to simulate the system under investigation. Also, effective application of models suffers from the lack of knowledge of model parameters and their uncertainty. For both process-based and statistically based models, proper simulation techniques require the use of parameter values that, in the former case, properly represent the system under evaluation and, in the latter case, minimize the difference between estimates and measured values. The process of adjusting parameter values is known as calibration (Beck, 1986; Klepper and Rouse, 1991; Klepper et al., 1991), and it is based on different procedures. In general, the goal of calibration is to minimize the difference between measured and estimated values. The mathematical representation of this goal is called the cost (or loss, or objective) function or assessment criterion. Defining an appropriate cost function can be difficult as it may require considerations of different type such as the relative dominance of one or more parameters, the autocorrelation among different parameters, and the drift in time series (Janssen and Heuberger, 1995; Wallach, 2003).
Several statistical indices are available for quantifying how well model outputs fit the measurements. Goodness-of-fit measures are classified in two main categories, which are difference-based and correlation-based indices. Such indices are essentially based on the proposals of Willmott and Wicks (1980), Fox (1981), Addiscott and Whitmore (1987), Loague and Green (1991), Zacharias et al. (1996), and Smith et al. (1997). A broad review was given by Martorana and Bellocchi (1999) while new developments and insights may be found in Metselaar (1999), Kobayashi and Salam (2000), Yang et al. (2000), Bellocchi et al. (2002), and Donatelli et al. (2002).
Difference-based measures, such as the mean bias, the sum of square errors, mean of square error (MSE), and root mean square error (RMSE), provide quantitative estimates of the deviation of model estimates from measurements. Measures of correlation, such as the Pearson's coefficient (r) and the coefficient of determination (r2), provide quantitative estimates of the covariation between estimated and measured values. In many cases, scientists use various combinations of indices to test the accuracy of their models. Difference-based indices, such as MSE or its squared root RMSE, have been also extensively used as cost functions in model parameter estimation (i.e., Sorooshian et al., 1983, 1993; Wallach, 1999), and in optimization algorithms (Wallach et al., 2001).
Although widely used, MSE, RMSE, and other common indices are not exhaustive in analyzing model capability to fit measured data. Diagnostic methods in linear regression are used to examine the appropriateness of assumptions underlying the modeling process and to locate unusual characteristics of the data that may influence conclusions. Since the residuals estimate the underlying error, the arrangement exhibited by the residuals should be consistent with such assumptions. The following methods are ordinarily used for detecting the presence of anomalous behavior in the residuals (e.g., Cook and Weisberg, 1982; Draper and Smith, 1998): tests for randomness (autocorrelation and runs test), tests for normality (chi-square, KolmogorovSmirnov, AndersonDarling, and ShapiroWilks), tests for outlier detection (Grubbs, leverage statistic, Cook's distance, and studentized residuals), and tests for equality of error variances (Anscombe, Bickel, and CookWeisberg). With respect to error variances (homoscedasticity), the graphical procedure for studying this assumption consists of plotting the ordinary least squares residuals against fitted values or an explanatory variable. For both plots, the residuals should be confined within a horizontal band. It may happen that one or more of the underlying assumptions for linear regression are violated and corrected via specific procedures. Homoscedasticity and other usual statistical requirements are rarely met with dynamic process-based models. As plots of residuals against various relevant quantities are recommended for virtually any regression analysis (e.g., Miklós et al., 1995; Cook and Weisberg, 1999; Wisniak and Polishuk, 1999), graphical residual analysis is a primary tool for most modeling applications also (e.g., Dent and Blackie, 1979; Mayer and Butler, 1993). It is important to note that the American Society for Testing and Materials (ASTM, 2000) sees this as part of a scientific review: "A key part of the scientific peer review will include the review of residual plots where modeled and observed evaluation objectives are compared over a range of model inputs." Scatterplots of simulation model residuals vs. independent variables often show shapes, indicative of systematic errors in the estimate and currently named patterns. This might be due either to the lack of an explanatory variable in the model or to poor calibration of model parameters. Identifying whether or not there is pattern in residuals is not always an easy procedure. See discussions by McCullagh and Nelder (1989) for generalized linear modeling and Box et al. (1994) for time series modeling. Dynamic simulation models often raise issues that occur infrequently in other application models (e.g., Reckhow et al., 1990; Danuso, 1996). For instance, crop growth models may underestimate dry matter after the very early development stages, whereas they overestimate it at later stages. In such cases, model residuals tend to arrange in groups showing noticeably similar over- or underestimation when sorted according to an external variable. Refer to the plot of Fig. 1 as an example of such a behavior.

View larger version (15K):
[in this window]
[in a new window]
|
Fig. 1. Example of three macropatterns in model residuals (solid line: average value; dotted lines: upper and lower bounds).
|
|
When the plot of residuals shows a structure in terms of two to five groups with somewhat different mean values, we call such a structure macropatterns.
Although a nonrandom structure in the residuals will not necessarily generate macropatterns, when macropatterns are observed, they can be regarded as indicators of a structure in the residuals. The methods classically used for either model validation or calibration are not helpful in the process of quantifying macropatterns as they leave complex aspects associated with the relationship between the residuals and other variables (e.g., model inputs) unexplained.
To provide an option to address the above issues, the objectives of this paper are (i) to define indices quantifying patterns in the residuals and evaluate their statistical properties, (ii) to apply them in model evaluation and selection, and (iii) to apply them as cost functions for optimizing model calibration according to given assumptions on the parameters.
 |
MATERIALS AND METHODS
|
|---|
The following steps have been applied to assess the presence of patterns in residuals:- Definition of suitable indices for macropattern quantification.
- Monte Carlo simulations aimed at obtaining critical values for the indices.
- Evaluation of the power of the indices, based on selected Monte Carlosimulated data.
In addition, two examples are shown to illustrate the use of such indices.
Definition of Pattern Indices
The PI-Type Pattern Indices
In the first pattern index proposal (Donatelli et al., 2000), the range of the independent variable under evaluation was divided into four equal-length subranges, thus arranging residuals in four groups. The index is based on the pairwise differences between group means. In particular, the absolute value of the maximum difference in the pairwise comparisons between average residuals in each group is the pattern index PI, hereafter named range-based fixed pattern index. The choice of grouping by four subranges of the independent variable was made to capture macropatterns that may be associated with the seasonal variation of model residuals instead of any possible pattern. In this paper, in addition to the division of the data set into four groups, we also consider two, three, and five groups. The index is formulated as follows:
 | [1] |
where R = model residual (estimated measured); l, m = group index; p = number of groups (ranging from two to five); ql, qm = group size; and il, im = residual value inside group.
A variant of PI is the F ratio of the between-group variance (varp) over the within-group variance (vare), hereafter named F-based fixed pattern index (PI-F). If p is the number of groups, PI-F is defined as:
 | [2] |
The PI
Type Pattern Indices
A variant of the PI and PI-F indices can be computed by allowing subranges of different length. The constraint imposed is the minimum number of residuals in each group (5, 10, 15, and 20% of the total number of values to prevent possible effects on pattern computation due to few outliers). An exhaustive algorithm allowed identifying the subrange corresponding to the maximum value of PI
type. Again, the choice of a number of groups ranging from two to five targets macropatterns. The maximum PI or the maximum PI-F are the desired indices, respectively PI
(range-based variable pattern index) and PI
F (F-based variable pattern index). In the process of grouping, PI
works by maximizing the differences between mean values while PI
F works by minimizing the variance within groups and, consequently, maximizing the variance between groups. If the minimum number allowed in a group is k, and p is the number of groups (l and m as defined in Eq. [1]), then PI
and PI
F are formulated as follows:
 | [3] |
 | [4] |
where the maximum pattern index is taken over q's for which the sum of the q's is the overall sample size.
Critical Values
Critical values at 0.10, 0.05, and 0.01 significance level for pattern indices under two, three, four, and five groups of equal size have been computed by Monte Carlo simulations under the null hypothesis of no pattern, generating uniform deviates by three congruential generators and random deviates with normal distribution by the BoxMuller method (Sprott, 1991; Press et al., 1992). No link to constant terms or independent variables is attached to generated residuals, thus preventing them from showing autocorrelation or heterogeneous variances. Residual values were obtained, at first, generating random data sets with a standard normal distribution (mean = 0, standard deviation = 1). Therefore, the z-score transformation of the residuals before the computation of PI is required to compare the computed values against generated critical values.
Transformations are not required when computing PI-F. If the number of values in each group is defined, under the null hypothesis of no pattern in the residuals, and given data matching the statistical assumptions of normality, randomness, and independence, then critical values for PI-F are the same as for the usual F distribution.
Critical values at 0.10, 0.05, and 0.01 significance level for PI
and PI
F were generated following the same procedure of fixed pattern indices and with the constraint of a minimum number of data in each group (5, 10, 15, and 20% of the data set).
In all cases, 5000 random draws were considered sufficient to give reliable confidence limits (Acutis and Lotito, 1997; Manly, 1997). Simulations were performed for each number of groups (two to five), with data sets of 20, 30, 40, 50, 60, 75, and 100 values and for the minimum size allowed (5, 10, 15, and 20). The total number of draws was 700000, obtaining PI, PI
, and PI
F (the last two simultaneously). Critical values for data sets of infinite size were obtained by interpolation via power functions. Studies on the robustness of critical values against either heterogeneous variances or nonnormal distribution were also performed, limited to data sets of 60 values. Heterogeneous variances were randomly generated, according to the rules summarized in Table 1 (e.g., Scheffé, 1959). Robustness to nonnormality was tested generating data sets with uniform distribution in the range of ±0.5·
to have variance equal to 1.
View this table:
[in this window]
[in a new window]
|
Table 1. Relative variance ratios for testing the robustness of critical values of pattern indices to heterogeneous variances.
|
|
Power Evaluation
All pattern indices were investigated for their power to recognize macropatterns. On the same data, runs test (Bradley, 1968) was performed and interpreted as an indicator of patterns. Runs test is a nonparametric procedure that analyzes if the number of changes of sign in an ordered sequence of residuals is compatible or not with randomness. The presence of macropatterns (as defined here) creates a number of changes of sign lower than a random sequence, and thus the choice of runs test appears as a possible choice to test the presence of patterns in absence of specific procedures. Pattern indices and runs test were then compared in regard to their relative power ability to put in evidence patterns in the residuals. Synthetic populations of residuals, with mean = 0, were generated by Monte Carlo runs and divided into two, three, four, and five balanced groups of different mean and constant variance (i.e., error variance = 1). This was done with data sets of different size (40, 60, and 100), given the constraint of a minimum number of values in each group for PI
type (520%). Group mean values were chosen to obtain a mean variance over the error variance ratio equal to specified F values. The procedure (Carmer and Swanson, 1971) was reiterated for F = 1, F = 2, and F = 3. The set of means used are reported in Table 2.
For both pattern indices and runs test, the percentage power ratio (
) was figured as:
 | [5] |
where n is the number of cases either giving significant pattern indices or, alternatively, significant runs test (at p = 0.05 for both statistics) for the selected F value and m is the number of Monte Carlo simulations. A sample of m = 500 Monte Carlo experiments was considered large enough for the purpose (Willavize et al., 1980).
Groups of residuals with either heterogeneous variances or uniform distribution were also generated to assess the robustness of pattern indices under conditions of heteroscedasticity and nonnormality. Likewise, this evaluation was reiterated for F = 1, F = 2, and F = 3. Uniform distribution was generated in the range ±0.5·
around the means. Heterogeneous variances were generated according to the variance ratios of Table 1.
Applications
To illustrate the functioning of the pattern indices, two examples are presented.
Radiation Model Evaluation
The first example shows diverse patterns may result as a consequence of using models with different levels of accuracy in accounting for relevant processes when estimating daily global solar radiation.
Daily solar radiation data collected at Tel Hadya, Syria (36°12' N, 36°56' E; elevation 12 m above sea level) in 1988 were used. Daily estimates were made by both a simplified (one parameter) model, BristowCampbell (BC; Bristow and Campbell, 1984), and a more complex (three parameters) model, DonatelliBellocchi (DB; Donatelli and Bellocchi, 2000; Bellocchi et al., 2002). Both models estimate a transmissivity coefficient of radiation through the atmosphere, using air temperature data. Solar radiation at ground level is calculated by the product, for a given latitude and at each day, of potential radiation (that is, the calculated solar radiation outside the earth's atmosphere, e.g., Campbell and Norman, 1998) times estimated transmissivity coefficient.
The model BC accounts for seasonal patterns dividing daily temperature range by the fixed average of monthly temperature range. On the other hand, the model DB makes use of the mobile average of weekly temperature range in place of the monthly based fixed one and incorporates a function specifically devoted to account for seasonality. Details on model features can be found in the documentation accompanying the solar radiation estimate software RadEst3.00 (Donatelli et al., 2003), freely downloadable from the site: http://www.sipeaa.it/tools (verified 30 Jan. 2004).
Both BC and DB were evaluated with respect to the production of patterns in the residuals, either versus the day of year (doy) or minimum air temperature (Tmin, °C), using a yearly data set of daily data registered at Tel Hadya. Model parameters were calibrated over multiyear data sets.
Calibrating a Hydrological Model
The second example investigates the patterns generated from the estimation of soil water content (m3 m3) in the 0.40-m plowed layer over 3 yr of corn cultivation by means of the widely used hydrological model MACRO (Jarvis, 1994), a physically based model of water and solute transport in macroporous soil. Model inputs were set to represent a field experiment (Acutis, 1996). Simulation results were compared against a data set of 35 measured values. The simulation was reiterated 48 times, each time using a different set of parameter values. Three parameters, i.e., conductivity at the boundary between macro- and microporosity, effective diffusion pathlength, and boundary pressure, were varied within the known range. Both the RMSE and pattern indices (i.e., PI and PI
with min. 20%), with the observed soil water content as external variable (partitioned into three and four groups), were used to evaluate model performance. A technique based on the concept of Pareto domination (Erickson et al., 2002) and involving two objective functions (pattern index and RMSE) was developed to improve the prospects for finding preferred solutions, i.e., select a satisfactory and reliable set of parameters through a systematic exploration of the pattern index/RMSE scatterplot. The dominant solutions were interpolated by the following:
 | [6] |
where yi is the ith value of the pattern index under investigation, y0 is the minimum value of the pattern index, and RMSEi is the ith RMSE value. The estimate of k allows the RMSE value corresponding to the critical value of the pattern index (at a given significance level) to be derived:
 | [7] |
 |
RESULTS
|
|---|
Critical Values
Under standard conditions, critical values at different significance levels and number of values were generated for PI (Table 3), PI
(Table 4), and PI
F (Table 5) via Monte Carlo runs. The user may refer to critical values of PI when groups include the same number of values.
View this table:
[in this window]
[in a new window]
|
Table 3. Critical values (p = 0.10, p = 0.05, and p = 0.01) for range-based fixed pattern index (PI) at a variety of groups and number of values (N).
|
|
View this table:
[in this window]
[in a new window]
|
Table 4. Critical values (p = 0.10, p = 0.05, and p = 0.01) for range-based variable pattern index (PI ) at a variety of groups, minimum percentage of data in each group, and number of values (N).
|
|
View this table:
[in this window]
[in a new window]
|
Table 5. Critical values (p = 0.10, p = 0.05, and p = 0.01) for F-based variable pattern index (PI F) at a variety of groups, minimum percentage of data in each group, and number of values (N).
|
|
Tables 6 and 7 report critical values for pattern indices, limited to N = 60, generated under nonstandard conditions (heteroscedasticity and nonnormality, respectively).
Power Evaluation
The power ratios (at p = 0.05 significance level) given by the pattern indices and the runs test, as computed over generated data sets of 60 values under standard conditions (normal distribution, homogeneous variances), are presented (Fig. 2)
. Similar findings were obtained with data sets of 40 and 100 values (results not shown).
Both range- and F-based pattern indices showed higher power than runs test over the variety of conditions investigated (i.e., prescribed F value, number of groups, minimum percentage of values in a group). In general, the power showed an increase with increasing the number of groups. Under F = 3, for instance, passing from two to five groups,
(Eq. [2]) values of PI ranged from 54.4 to 90.4 (+66%), and
values of PI-F ranged from 22.8 to 66.8 (+193%). The range-based pattern indices generally were more powerful than those based on the variance ratios (68 and 23% on average, respectively). The variable range-based indices were more powerful (+22% on average) than PI, whereas the fixed variance ratios gave superior power (+27% on average) compared with those computed by varying subgrouping. Moreover, major power tended to be associated with PI
F when a large amount of data in one group was allowed (20%) while the contrary emerged with PI
(major power with 5% minimum, with the exception of two groups).
Large loss of power (in general more than 50%) was observed with PI
under nonstandard conditions while PI
F had small or zero loss (Fig. 3 and 4)
. No loss was observed with runs test. Decreasing loss of power was registered with increasing F values and number of groups. For instance, PI
computed over five groups, with 20% of minimum amount of data in each group, and had a loss of power of about 11% under F = 3.
Applications
Evaluating Radiation Models
When the models BC and DB were used to estimate daily radiation values at Tel Hadya for the year 1988, a different performance was observed in terms of macropattern generation, either versus doy or Tmin (Fig. 5)
.

View larger version (42K):
[in this window]
[in a new window]
|
Fig. 5. Daily radiation estimates performed with the models BC and DB at Tel Hadya (Syria) in 1988: residuals vs. day of year (doy) and vs. minimum air temperature (Tmin).
|
|
Figures 6 and 7
show the pattern indices against both predictors computed for the two models. The model DB had consistently superior results, except in a few cases when computing PI
vs. doy with 20% of minimum data in each group (Fig. 6) or PI
vs. Tmin with two groups only (Fig. 7).
Calibrating a Hydrological Model
Figure 8
shows the relation between fixed and variable range-based pattern indices with three and four groups (computed after transforming the residuals in z scores) and RMSE, as figured out of 48 simulations of soil water content (m3 m3) executed with the model MACRO under alternative hypotheses on input parameters. Two points of interest were identified on the tradeoff curve that interpolates across nondominated solutions. In particular, possible critical RMSE values are identified corresponding to significant values of the pattern indices at p = 0.05 and p = 0.10. The grouping by fixed subranges did not allow groups of equal number of residuals; thus, the significant values for fixed pattern indices are those obtained via Monte Carlo values generated ad hoc to reproduce the specific characteristics of the groups investigated (Table 8).

View larger version (36K):
[in this window]
[in a new window]
|
Fig. 8. Relation of z-scored range-based fixed pattern index (PI) and range-based variable pattern index (PI ) (min. 20%) with three and four groups to root mean square error (RMSE), as computed from 48 simulations of soil water content (m3 m3) performed with the model MACRO. The exponential curve interpolates across dominant solutions. Critical values of the pattern indices are reported for p = 0.05 (solid line) and p = 0.01 (dotted line).
|
|
View this table:
[in this window]
[in a new window]
|
Table 8. Characteristics of the population of residuals generated with MACRO (total number of residuals, number of values in each group) and critical values (p = 0.10, p = 0.05, and p = 0.01) for range-based fixed pattern index with three and four groups.
|
|
 |
DISCUSSION
|
|---|
Statistical indices of new development for use in model evaluation and as cost functions (i.e., in the parameter calibration process) have been presented. They are meant as being sensitive to the presence of macropatterns, likely to be recognized in the distribution of model departures from experimental measurements (model residuals). Macropatterns may be graphically detected by inspection of the plots of residuals versus the range of values of a third variable, the last sorted in ascending order. We identify such indices as pattern indices.
Index computations require splitting the range of the third variable into a number (two to five) of subranges. Subranges may be of equal size (fixed pattern indices) or different size (variable pattern indices). In the first case, the user utilizes his/her own criterion for partitioning. For instance, the four seasons in a year may guide the user through dividing the 365/366 d of the year (Julian days being the external variable) in groups of about 90 values each. In the second case, the partition of the data set of residuals in consecutive groups is made with the help of an exhaustive algorithm. Both fixed and variable pattern indices are computed based on either pairwise differences between group means (range-based pattern indices) or variance ratios (F-based pattern indices). Computed values may be compared with critical values generated ad hoc, according to different significance levels (Monte Carlo tables) or, for PI-F, with the critical values of the theoretical F distribution. However, model residuals rarely conform to the statistical conditions; thus, the computed F may not follow the theoretical F distribution. In this case, critical values cannot be defined. The varying number of elements in each group makes the generation of tables of critical values a cumbersome process. The possibility to override this difficulty is provided by randomization tests (Acutis and Lotito, 1997; Manly, 1997), which allow computing critical values in any situation in which permutations are possible (Bell and Sen, 1984). From our investigations (limited to sample cases), it appeared that the critical values of pattern indices are not sensitive to both heterogeneous variances and nonnormal (i.e., uniform) distribution (Tables 37). Provided that the model's performance is generally discussed as relative comparisons (Oreskes et al., 1994), and major interest is the use of indices of this type as cost functions, lack of suitable critical values may be inconsequential.
Theoretical and applied investigations showed that pattern indices may be of great help for quantifying the presence of patterns in the model residuals. The capabilities of the exhaustive algorithm assure large flexibility to the variable pattern indices and extend the field of residual diagnostics toward the quantification of macropatterns of different type. However, this process may be relatively time-demanding to compute for large data sets. A clue to future progress in computation efficiency is the application to this context of modern algorithmic techniques that avoid the requirement of generating a complete listing of data (e.g., StatXact references through the site http://www.cytel.com; verified 30 Jan. 2004).
Power analysis has proven to be a viable solution to the investigation of pattern indices. At first, the better ability of the pattern indices to detect macropatterns, as opposed to a classical test (runs test), was shown under a variety of conditions. Moreover, the power of range-based pattern indices has been proven to be constantly higher than the one of those based on variance ratios under conditions of normality and homogeneous variances. This can be explained because in one case (pairwise mean comparisons), the quantification of patterns is made by the extreme means only, whereas in the other case, all of the group means enter the variance ratio computation. On the contrary, large sensitivity of range-based pattern indices emerged under conditions of heteroscedasticity or nonnormality at selected, illustrative, instances. Under such conditions, distinct patterns based on the mean differences are not likely to appear. Power ratios did not vary consistently with pattern indices based on variance ratio under the same conditions of heteroscedasticity or nonnormality. However, whether power ratios of both range- and F-based pattern indices were comparable under nonstandard conditions, they were always higher than the ones registered with runs test. Moreover, loss of power tended to decrease with increasing F values and number of groups. In almost all cases, the recognition of macropatterns increases with increasing the number of groups; thus, an arrangement of the residuals into the largest number of groups (i.e., five) appears desirable. However, this option could not be the best one when working with small data sets. Few outliers may indeed affect the result in such cases; thus, a partitioning into three or four groups may be more conservative. The user may also prefer minimum amounts of data in each group larger than 5% of the data set when computing range-based pattern indices over small data sets, considering the high sensitivity of PI
to a small number of data in one group (see line of 5% minimum in Fig. 2).
The practical interest of pattern indices has been explored by applying them at two experimental cases. As shown in the first example, the use of pattern indices is straightforward for discriminating between alternative solar radiation models. In particular, such indices determined whether a model has been set to avoid seasonal biases. The use of pattern indices may help in the investigation of features in a model that may be excessively simplified for a specific process. In our example, several forms of pattern indices allowed assessing seasonal functions included in the model DB that were intended to compensate for the simplified assumptions related to seasonal factors in the model BC. The few exceptions that emerged with PI
further testify to the poor ability of range-based pattern indices at detecting patterns either when the external variable is partitioned into two groups or when a large amount of values in the groups (i.e., at least 20% of the data set) is allowed. In the second example, fixed and variable range-based pattern indices were used as cost functions for calibrating a complex hydrological model, in association with the RMSE. Although affected by error, the relation between PI (or PI
) and RMSE values manifested as high values of pattern index may be associated with low values of RMSE. Hence, a calibration process where the goal is to minimize the mean squared error (to prevent estimates from showing large residuals) may at times reduce pattern control and vice versa. The model users are therefore required to balance between contrary requirements. It was illustrated how the information obtained by two criteria (i.e., pattern index and RMSE) for parameter characterization can help to frame model calibration. An investigation was performed to identify the quasidominant model parameters set using multiple criteria (i.e., relatively small RMSE values provided an acceptable value of pattern index). This was achieved by using graphical visualization of the tradeoff curve to facilitate the subjective procedure inherent in choosing a reduced model parameters set. Possible thresholds of RMSE, as identified by the critical values of PI at p = 0.10 or p = 0.05, provide an effective means of selecting the best-suited parameter set, i.e., the one allowing the least RMSE in the area of nonsignificant pattern index (at the desired significance level). In the example illustrated, variable range-based pattern indices showed higher ability to identify patterns than fixed pattern indices. The partition into four groups, as achieved by the exhaustive algorithm, allowed a more consistent pattern identification. In this case, some PI
values stay above the critical value at 0.05 significance level. The same four-group partition with PI did not allow any pattern detection. However, there are not enough data in the first group (only three residuals, see Table 8) to consider as many as four groups. Small groups may suggest outliers that may act to mask or enhance pattern detection, whereas large groups would not. Residuals can be outliers in many different ways: They may be outliers on every third variable, on a single variable, or multivariated outliers yet not extreme on any one variable. All of these are data-dependent properties of the outliers, which say nothing about why residuals are outliers.
From these advices, combined with some theoretical issues mentioned earlier, questions arise concerning adequate group sizes to downweight the effect of outliers, if they are kept in the data set. We argue that meaningful groups can be formed by a critical mass of residuals to reveal possible macropatterns.
The cases investigated in this work are illustrative examples only, and the relative results are not meant as conclusive findings about the overall possibilities and limitations associated to the use of pattern indices. Additional experience with extended simulations over a large variety of data sets will substantiate the effectiveness of the general performance of these indices. In particular, an extended evaluation over small-size data sets will provide a better definition of both the minimum number of data per group and the optimal grouping. Nonetheless, there is evidence supporting the effectiveness of pattern indices for use under operational conditions: Users may select solutions accepting a slightly higher model error, compensated by generating residuals more uniformly distributed over the range of values of the relevant variable. The issue of making a choice between different criteria leads to the widest question of how to judge models when different evaluation statistics are not concordant. Many authors believe that there is no robust statistic that can be used to draw conclusions in model evaluations, and therefore several methods need to be used together to give a comprehensive check (e.g., Yang et al., 2000). On purpose, the aggregation of different statistics (including pattern indices) into integrated indices has been proposed for use in model evaluation (Bellocchi et al., 2002) and could be proficiently extended to create integrated cost functions for parameter calibration.
The pattern indices we present are useful, when computed on different types of plots of the residuals, to provide meaningful information on the adequacy of different aspects of the model and to locate possible causes of model failure. In our view, they should always integrate difference-based and correlation-based indices in both model calibration and validation. The weak power disclosed when using the variance ratio in pattern detection suggests that there are advantages in using the pairwise means comparison.
Provisions for computation of pattern indices are supplied with the software IRENE (Integrated Resources for Evaluating Numerical Estimates; Fila et al., 2003a), also available as a dynamic link library (IRENE_DLL; Fila et al., 2003b) to be used in custom-developed applications. Both software tools are freely downloadable from the site http://www.sipeaa.it/tools.
 |
CONCLUSIONS
|
|---|
We have not found in the statistical literature specific methods to quantify the presence of macropatterns in the residuals from modeling studies. Pattern indices were specifically developed for such a purpose, and they can integrate conventional methods used in model validation (i.e., difference- and correlation-based methods). Between the two formulations of pattern indices evaluated, i.e., range- and F-based pattern indices, the former showed high sensitivity to the presence of macropatterns. Both indices outperformed runs test, the latter found in the literature as the only and not specific option to be used to detect macropatterns.
The new indices proved to be effective in both validation and calibration studies. The pattern indices allowed the selection of the best parameter set according to multiobjective Pareto's concept (tradeoff curves), with the help of statistical tables associated with pattern index values and generated via Monte Carlo simulations. Hence, pattern indices can be effective methods for producing tradeoff curves for model calibration. The theory of pattern indices is still in an initial stage of development, and we are aware of being far from an extended, rigorous comprehension of the statistical properties of such indices. Nonetheless, the theoretical investigations performed so far, although partial, appear to us as offering to the reader an adequate sight on the applicability of pattern indices to model evaluation in the area of agroecological modeling, which was the scope of the paper. Statisticians are invited to collaborate with suggestions on how the theoretical nature of pattern indices can be further investigated and better defined. Additional studies on the application of pattern indices are also required to better investigate the behavior of pattern indices with data sets of limited size. Further studies should also consider models of various nature and complexity to confirm the possible generalization of both the indices use and the multiobjective calibration procedure.
 |
ACKNOWLEDGMENTS
|
|---|
Research under the auspices of the Project SIPEAA (http://www.sipeaa.it), Paper no. 4, with the support of the Italian Ministry of Agricultural and Forestry Policies. The authors gratefully acknowledge Professor Luigi Cavazza, Professor Francesco Danuso, and Professor Alberto Piazza for their comments on the initial proposal of pattern analysis. Thanks to the anonymous referees for their valuable comments and suggestions for improving the presentation of this work.
 |
REFERENCES
|
|---|
- Acutis, M. 1996. Modelling water movement in the soil: Comparison of different approaches and parameterizations [Online]. Available at http://www.sipeaa.it/tools/MACRO/Modelling_Soil_Water_Movement.pdf (verified 30 Jan. 2003). ISCI-Agronomy, Bologna, Italy.
- Acutis, M., and S. Lotito. 1997. Possibilità applicative dei «resampling methods» nell'analisi statistica di dati agronomici. Riv. di Agron. 31:810816.
- Addiscott, T.M., and A.P. Whitmore. 1987. Computer simulation of changes in soil mineral nitrogen and crop nitrogen during autumn, winter and spring. J. Agric. Sci. 109:141157.
- ASTM. 2000. Standard guide for statistical evaluation of atmospheric dispersion model performance [Online]. D6589. Available at www.astm.org (verified 22 Jan. 2004). ASTM, West Conshohocken, PA.
- Bair, E.S. 1994. Model (in)validationa view from courtroom. Ground Water 32:530531.
- Beck, M.B. 1986. The selection of structure in models of environmental systems. The Statistician 35:151161.
- Bell, C.B., and P.K. Sen. 1984. Randomization procedures. p. 129. In P.R. Krishnaiah and P.K. Sen (ed.) Nonparametric methods. Handb. of Stat. IV. Elsevier Sci. Publ., Amsterdam.
- Bellocchi, G., M. Acutis, G. Fila, and M. Donatelli. 2002. An indicator of solar radiation model performance based on a fuzzy expert system. Agron. J. 94:12221233.[Abstract/Free Full Text]
- Bindi, M., M. Donatelli, J. Porter, and M.K. Van Ittersum. 2001. Proc. Int. Symp. Modelling Cropping Syst., 2nd, Florence, Italy. 1618 July 2001. CNR Inst. for Biometeorol., Florence, Italy.
- Box, G.E.P., G.M. Jenkins, and G.C. Reinsel. 1994. Time series analysis: Forecasting and control. 3rd ed. Prentice Hall, Englewood Cliffs, NJ.
- Bradley, J. 1968. Distribution-free statistical tests. Prentice Hall, Englewood Cliffs, NJ.
- Bredehoeft, J.D., and L.F. Konikow. 1993. Ground water models: Validate or invalidate. Ground Water 3:178179.
- Bristow, K.L., and G.S. Campbell. 1984. On the relationship between incoming solar radiation and daily maximum and minimum temperature. Agric. For. Meteorol. 31:159166.
- Campbell, G.S., and J.M. Norman. 1998. An introduction to environmental biophysics. 2nd ed. Springer, New York.
- Carmer, S.G., and M.R. Swanson. 1971. Detection of differences between means: A Monte Carlo study of five pairwise multiple comparison procedures. Agron. J. 63:940945.[Abstract/Free Full Text]
- Cook, R.D., and S. Weisberg. 1982. Residuals and influence in regression. Chapman & Hall, London.
- Cook, R.D., and S. Weisberg. 1999. Graphs in statistical analysis: Is the medium the message? Am. Stat. 53:2937.
- Danuso, F. 1996. Analisi dell'incertezza nelle applicazioni agronomiche dei modelli dinamici e di regressione. Statistica Applicata 8:103125.
- Dent, J.B., and M.J. Blackie. 1979. Systems simulation in agriculture. Appl. Sci. Publ., London.
- Donatelli, M., M. Acutis, and G. Bellocchi. 2000. Two statistical indices to quantify patterns of errors produced by models. p. 186. In Proc. Int. Crop Sci. Conf., 3rd, Hamburg, Germany. 1722 Aug. 2000. Eur. Soc. for Agron., Hamburg.
- Donatelli, M., and G. Bellocchi. 2000. New methods to estimate global solar radiation. p. 186. In Proc. Int. Crop Sci. Conf., 3rd, Hamburg, Germany. 1722 Aug. 2000. Eur. Soc. for Agron., Hamburg.
- Donatelli, M., G. Bellocchi, and F. Fontana 2003. RadEst3.00: Software to estimate daily radiation data from commonly available meteorological variables. Eur. J. Agron. 18:369372.
- Donatelli, M., C.O. Stöckle, F. Villalobos, and J.M. Villar Mir. 1999. Proc. Int. Symp. Modelling Cropping Syst., 1st, Lleida, Spain. 2123 June 1999. University of Lleida, Spain.
- Donatelli, M., M.K. van Ittersum, M. Bindi, and J.R. Porter. 2002. Modelling cropping systemshighlights of the symposium and preface to the special issues. Eur. J. Agron. 18:111.
- Draper, N.R., and H. Smith. 1998. Applied regression analysis. 3rd ed. John Wiley & Sons, New York.
- Erickson, M., A. Mayer, and J. Horns. 2002. Multi-objective optimal design of groundwater remediation systems: Application of the niched Pareto genetic algorithm (NPGA). Adv. Water Resour. 25:5165.
- Fila, G., G. Bellocchi, M. Acutis, and M. Donatelli. 2003a. IRENE: A software to evaluate model performance. Eur. J. Agron. 18:369372.
- Fila, G., G. Bellocchi, M. Donatelli, and M. Acutis. 2003b. IRENE_DLL: A class library for evaluating numerical estimates. Agron. J. 95:13301333.[Abstract/Free Full Text]
- Fox, D.G. 1981. Judging air quality model performance: A summary of the AMS workshop on dispersion models performance. Bull. Am. Meteorol. Soc. 62:599609.
- Janssen, P.H.M., and P.S.C. Heuberger. 1995. Calibration of process-oriented models. Ecol. Modell. 83:5566.
- Jarvis, N.J. 1994. The MACRO model (Version 3.1). Technical description and sample simulations. Rep. and Diss. 19. Swedish Univ. of Agric. Sci., Dep. of Soil Sci., Uppsala.
- Klepper, O., and D.I. Rouse. 1991. A procedure to reduce parameter uncertainty for complex models by comparison with real system output illustrated on a potato growth model. Agric. Syst. 36:375395.
- Klepper, O., H. Scholten, and J.P.G. Van de Kamer. 1991. Prediction uncertainty in an ecological model of the Oosterschelde estuary, S.W. Netherlands. J. Forecasting 10:191209.
- Kobayashi, K., and M.U. Salam. 2000. Comparing simulated and measured values using mean squared deviation and its components. Agron. J. 92:345352.[Abstract/Free Full Text]
- Konikow, L.F., and J.D. Bredehoeft. 1992. Ground water models cannot be validated. Adv. Water Resour. 15:7583.
- Loague, K., and R.W. Green. 1991. Statistical and graphical methods for evaluating solute transport models: Overview and application. J. Contam. Hydrol. 7:5173.
- Manly, B.F.J. 1997. Randomization, bootstrap, and Monte Carlo methods in biology. Chapman and Hall/CRC, New York.
- Martorana, F., and G. Bellocchi. 1999. A review of methodologies to evaluate agroecosystem simulation models. Ital. J. Agron. 3:1939.
- Mayer, D.G., and D.G. Butler. 1993. Statistical validation. Ecol. Modell. 68:2132.
- McCullagh, P., and J.A. Nelder. 1989. Generalized linear models. Stat. and Appl. Probability Monogr. 37. 2nd ed. Chapman & Hall, London.
- Metselaar, K. 1999. Auditing predictive models: A case study in crop growth. Ph.D. thesis. Wageningen Agric. Univ., Wageningen, the Netherlands.
- Miklós, D., S. Kemény, G. Almásy, and K. Kollár-Hunek. 1995. Thermodynamic consistency of data banks. Fluid Phase Equilib. 110:89113.
- Oreskes, N., K. Shrader-Frechette, and K. Belitz. 1994. Verification, validation, and confirmation of numerical models in the earth sciences. Science (Washington, DC) 263:641646.[Abstract/Free Full Text]
- Penning de Vries, F.W.T. 1977. Evaluation of simulation models in agriculture and biology: Conclusions of a workshop. Agric. Syst. 2:99107.
- Penning de Vries, F.W.T., D.M. Jansen, H.F.M. ten Berge, and A. Bakema. 1989. Simulation of ecophysiological processes of growth of several annual crops. Simulation Monogr. Pudoc, Wageningen, the Netherlands.
- Plentiger, M., and F.W.T. Penning de Vries. 1997. Rotation models for ecological farming. CAMASE/PE Workshop Rep. AB-DLO and C.T. de Wit Graduate School for Prod. Ecol., Wageningen, the Netherlands.
- Press, W.H., B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling. 1992. Numerical recipes in Fortran. Cambridge Univ. Press, Cambridge, UK.
- Reckhow, K., J. Trevor Clements, and R.C. Dodd. 1990. Statistical evaluation of mechanistic water quality models. J. Environ. Eng. 116:250268.
- Rykiel, E.J., Jr. 1996. Testing ecological models: The meaning of validation. Ecol. Modell. 90:239244.
- Sage, A.P. 1987. Validation. p. 4996. In M.G. Sage (ed.) Systems analysis and control encyclopedia: Theory, technology, applications. Pergamon, Oxford, UK.
- Scheffé, H. 1959. The analysis of variance. Wiley, New York.
- Schlesinger, S., R.E. Crosbie, R.E. Gagne, G.S. Innis, C.S. Lalwani, J. Loch et al. 1979. Terminology for model credibility. Simulation 32:103104.[Abstract/Free Full Text]
- Smith, P., J.U. Smith, D.S. Powlson, W.B. McGill, J.R.M. Arah, O.G. Chertov et al. 1997. A comparison of the performance of nine soil organic matter models using datasets from seven long-term experiments. Geoderma 81:153225.[ISI]
- Sorooshian, S., Q. Duan, and V.K. Gupta. 1993. Calibration of rainfall-runoff models: Application of global optimization to the Sacramento Soil Moisture Model. Water Resour. Res. 29:11851194.
- Sorooshian, S., V.K. Gupta, and J.L. Fulton. 1983. Evaluation of maximum likelihood parameter estimation techniques for conceptual runoff models: Influence of calibration data variability and length on model credibility. Water Resour. Res. 1:251259.
- Sprott, J.G. 1991. Numerical recipes: Routines and examples in BASIC. Cambridge Univ. Press, Cambridge, UK.
- Wallach, D. 1999. Linking model validation, uncertainty analysis and parameter adjustment. p. 271272. In Proc. Int. Symp. Modelling Cropping Syst., 1st, Lleida, Spain. 2123 June 1999. University of Lleida, Spain.
- Wallach, D. 2003. Editorial. Agronomie (Paris) 22:115117.
- Wallach, D., B. Goffinet, J.-E. Bergez, P. Debaeke, D. Leenhardt, and J.-N. Aubertot. 2001. Parameter estimation of crop models: A new approach and application to a corn model. Agron. J. 93:757766.[Abstract/Free Full Text]
- Willavize, S.E., S.G. Carmer, and W.M. Walker. 1980. Evaluation of cluster analysis for comparing treatment means. Agron. J. 72:317320.[Abstract/Free Full Text]
- Willmott, C.J., and D.E. Wicks. 1980. An empirical method for the spatial interpolation of monthly precipitation within California. Phys. Geogr. 1:5973.
- Wisniak, J., and A. Polishuk. 1999. Analysis of residualsa useful tool for phase equilibrium data analysis. Fluid Phase Equilib. 164:6182.
- Yang, J., D.J. Greenwood, D.L. Rowell, G.A. Wadsworth, and I.G. Burns. 2000. Statistical methods for evaluating a crop nitrogen simulation model, N_ABLE. Agric. Syst. 64:3753.
- Zacharias, S., C.D. Heatwole, and C.W. Coakley. 1996. Robust quantitative techniques for validating pesticide transport models. Trans. ASAE 39:4754.