|
|
||||||||
a Dep. of Agron., Kansas State Univ., Manhattan, KS 66506
b Div. of Biol., Kansas State Univ., Manhattan, KS 66506
* Corresponding author (welchsm{at}ksu.edu)
Received for publication May 1, 2001.
| ABSTRACT |
|---|
|
|
|---|
Abbreviations: ES, Excel Solver (software) G x E, genotype x environment (interaction) LS, least squares MQL, MarquardtLevenberg (algorithm) RMSE, root mean square error SCE, simulated complex evolution SSE, sum of squared errors
| INTRODUCTION |
|---|
|
|
|---|
Varietal differences in these processes have been modeled by the introduction of genetic coefficients, so named because they quantify effects assumed to be of genetic origin. Recently, there has been a great deal of research on developing efficient methods for estimating these parameters (Irmak et al., 2000), including attempts to relate them to specific plant genotypes (White and Hoogenboom, 1996). A valuable result of these studies has been a reduction in the number of quantities needing estimation and the work required to do so.
However, these efforts have had limited impact on the model algorithms used to control individual processes. Currently, modeled control mechanisms tend to be empirical and inferred indirectly, albeit with great ingenuity, from observations. Such empiricism may limit the accuracy of augmented models that attempt to combine physiological and physical models of the canopy and soil environments. It is not uncommon in model verification studies to find overestimation of low yields and underestimation of high yields (Heiniger et al., 1997; Roman-Paoli, 1997). While it is usually possible to find case-by-case adjustments to remove such effects, it is likely that current approaches to process-control modeling are simply less plastic than real plants. As a result, it has been suggested that improved representation of process controls should have modeling priority (Hammer, 1998).
Crop simulation models have much to gain from the exploding body of genomic science, which is unraveling the networks of interacting genes that actually control important plant processes. Unfortunately, experimental practicalities, such as long generation times, physical size, and other constraints, greatly complicate the elucidation of such networks in agronomic and horticultural crops. Also, the methods used to construct genetic networks, such as mutant screenings, epistasis detection experiments, and phenotype rescue treatments, yield qualitative relationships (e.g., promotion and inhibition) rather than the quantitative mathematical formulas needed in simulation models.
Our response to these issues is based on two premises. First, neural networks (Kasabov, 1996) can be used to supply the quantification that genetic networks currently lack (Mendoza and Alvarez-Buylla, 1998, 2000). Second, networks developed using model plants will prove directly adaptable to crops because genetic mechanisms of central importance are often widely conserved taxonomically.
This paper presents preliminary results supporting the first premise based on efforts to model the control of inflorescence transition in A. thaliana. The first section reviews current information on the genetics of this process. The next introduces neural networks, including one specifically structured to approximate inflorescence control. Succeeding sections detail a data collection experiment currently underway and successful efforts to mimic data from two temperature treatments. Included with the latter are some additional results of possible relevance to bioinformatics as well as to crop simulation. The final section includes a general discussion and conclusions to date.
| FLOWERING CONTROL IN ARABIDOPSIS THALIANA |
|---|
|
|
|---|
The interplay between environmental and genetic factors has been assembled into qualitative network models for the control of flowering in A. thaliana (Koornneef et al., 1998a, 1998b; Levy and Dean, 1998; Simpson et al., 1999; Theissen and Saedler, 1999; Blazquez, 2000; Blazquez and Weigel, 2000; Devlin and Kay, 2000). It is quite important to understand that the genes in these models control phenology directly. That is, observed temporal changes are not the byproducts of alterations in plant growth or vigor (Reeves and Coupland, 2000). The work reported here is based on the Simpson et al. (1999) model because the Blazquez (2000) version is too recent to have affected our research planning. Both of these models can be viewed on the web at the addresses given in the reference list.
According to Simpson et al. (1999), there are four pathways that operate in parallel to control the inflorescence transition. The two major ones are the autonomous pathway and the photoperiod pathway. The prevailing view is that the latter exerts the promotive effect of long days, whereas the former is an endogenous photoperiod-independent, flower-inhibiting pathway. Autonomous path repression of flowering during vegetative growth must be overcome to convert the meristem to a reproductive state. Repression can be accomplished by vernalization, which injects repressive signals into the autonomous path at the FLC (Flowering Locus C) gene.1 Simpson et al. (1999) accords vernalization the status of a separate pathway although Blazquez (2000) considers it a part of the autonomous path. Additionally, gibberellin is required for flowering in short days and also participates in flowering promotion under long days. The genes mediating gibberellin's influence constitute the fourth independent path. In all models, signal integration of the two main paths feeds into a transduction cascade ultimately upregulating flowering genes and stimulating conversion to the reproductive phase.
| A GENETIC NEURAL NETWORK |
|---|
|
|
|---|
If functioning genes can, in fact, be identified with nodes, then removing a node should equate to a loss-of-function mutation. To test this concept, a set of genes and mutant alleles was chosen according to several criteria. First, the set contained representatives from both the photoperiod and autonomous pathways. Second, all genes possessed readily available mutant alleles with strong phenotypic effects that achieved maximal loss of function. Third, the alleles were obtainable in a common genetic background, providing proper experimental control. Finally, the selection balanced the number of genes, the number of plants per genotype, and the available growth chamber space.
Figure 1 shows the selected genes and their connection into a neural network abstracted from the Simpson et al. (1999) model. Both the photoperiod and autonomous pathways are present. These feed into a single integration node (Int) whose 0 to 1 output is interpreted as the fraction of the transition completed. Inputs to the system are photoperiod and days after planting. The latter input was used as a surrogate for growth, which is thought to be of relevance to the autonomous pathway (Blazquez, 2000). Initial concerns existed that daylength might somehow influence the autonomous pathway or, conversely, that growth-related information might leak onto the photoperiod path by some yet undiscovered genetic mechanism. For this reason, extra inputs were added to each pathway (underlined symbols in Fig. 1) with little justification beyond the authors' desire not to be victimized by limited thinking. Later, cause arose to remove them (see below).
|
|
| DATA COLLECTION METHODS |
|---|
|
|
|---|
The growth chamber used in the study was a Conviron Model E8VH (Controlled Environ., Winnipeg, MB, Canada). Lighting was provided by 20 fluorescent tubes (F48T12/CW/VHO, 110 W, Philips Lighting Co., Somerset, NJ) and two incandescent bulbs (Philips Standard Frost Incandescent Lamps, 60 W). Light intensity at plant level was 216.1 µmol m-2 s-1 as measured with a Quantum sensor (LI-COR, Lincoln, NE). Chamber relative humidity was maintained between 70 and 80%.
The 72 pots were randomly assigned to one of four trays and placed in the chamber. To limit systematic biases resulting from any residual within-chamber environmental gradients, trays were moved daily in an orderly fashion so that pots spent nearly equal times in every chamber location. In addition, thermistors monitored the canopy-level temperature within each tray at hourly intervals, and data were recorded by a CR-21 data logger from Campbell Scientific (Logan, UT). To provide the seedlings with an initial high humidity, transparent domes (Catalog no. 14-2568-1, Hummert Int., Earth City, MO) were placed over the trays from planting until 6 to 7 d after emergence. Although standard practice in work with A. thaliana, the hoods created a transient average temperature increase of 0.38 to 0.74°C. This increase is included in all temperature accumulations, which yielded treatment-long average temperatures of 24.16 and 16.37°C. Plants were watered at about 3-d intervals, which is adequate to avoid stress.
The pots were observed daily and records made of emergence date, bolting date, flowering date, leaf number (including rosette and inflorescence leaves), and plant height. Emergence date was taken as the first visible green, but this is not particularly reliable because of the small seed size. Bolting date, when flower buds are first visible on the main meristem, marks the transition to inflorescence, which is the focus of the current report. Flowering date is the opening of the first bud and denotes the second transition discussed above.
| GENETIC NEURAL NETWORK WEIGHTS |
|---|
|
|
|---|
Training a neural network is an example of a nonlinear optimization problem where the objective function is a goodness-of-fit criterion. A common difficulty in such work is that initial estimates of the solution may be so distant from the best answer that the optimizer may converge prematurely to some inferior result. Because of network complexity and perhaps because of the opacity of brain function, designers neither demand nor expect any clear relationship between the values of the weights and the knowledge represented. Good initial estimates are therefore considered unobtainable, and zeros are commonly used. As a result, subsequent success with a larger network may have nothing to do with storage capacity. It may simply be that the goodness-of-fit landscape has been altered so as to provide the optimizer with a more congenial path from arbitrary initial estimates to an acceptable endpoint.
In the current study, the strategy of adding nodes disappeared once the choice was made to work with a particular gene subnet. Therefore, when the first training attempts failed, several tactics were employed to search more efficiently and to modify the goodness-of-fit landscape directly. The modifications included (i) alternative goodness-of-fit functions and search algorithms, (ii) reducing problem dimensionality (i.e., the number of weights) whenever possible, (iii) increasing dimensionality in a controlled fashion whenever unavoidable, (iv) reparameterizing, and (v) expanding model scope incrementally. Candidly, exploitation of these strategies was informed by some science, some intuition, and a great deal of error.
To date, two different goodness-of-fit measures have been used in training: ordinary least squares (LS) and a maximum likelihood (ML) method that parallels the KaplanMeier product-limit estimator well known in survival analysis (Cox and Oakes, 1984). Three search procedures were used: simulated complex evolution (SCE) (Duan et al., 1992, 1994; Thyer et al., 1999), the MarquardtLevenberg (MQL) algorithm (Press et al., 1992), and the variant of Newton's method (Press et al., 1992) embodied in the Excel Solver software (ES). The SCE procedure is readily adaptable to a parallel computing environment and therefore potentially useful for large-scale problems. The MQL algorithm is the current method of choice for nonlinear LS (Press et al., 1992). In terms of execution speed, robustness, flexibility, and numerical charisma, ES has limited long-term potential compared with MQL and SCE. However, the ease of Excel programming has permitted rapid testing of new ideas, the determinative success factor to date. This report therefore contains results obtained by combining LS and ES using an adaptation of the methods of Wraith and Or (1998).
| MODEL SPECIFICS AND RESULTS |
|---|
|
|
|---|
![]() |
![]() | [1] |
indicates the final network output, and the a's and b's are weights to be found. The b values control the time until the onset of nodal switching while the a's affect the time it takes to complete a transition, once started. Our initial training efforts focused on this model and failed utterly.
|
The 24°C Model
To reduce dimensionality, all extraneous D and L inputs were removed, eliminating a2, a3, a9, a10, and a14. Removal is easily justified because it actually makes the neural network more similar to the underlying Simpson et al. (1999) model. Also, given data on only one photoperiod, there is no way to unambiguously determine the parameters a1 and bCRY2 (for CRY2) or a8 and bPHYB (for PHYB). Therefore, the CRY2 output was fixed at 1 for long days and 0 for short days (although no use has yet been made of the latter). The a8 parameter was absorbed into bPHYB and renamed b'PHYB. According to the data, once the transition starts, all mutant populations can progress from 10 to 90% completion in roughly similar intervals (3 to 6 d). Thus, as an approximation, a4, a5, a6, a7, a11, a12, a13, and a19 were replaced by a single parameter, a. However, this simplification did not mimic co data very well, so a13 was released to vary independently. A second parameter, a', was used for a15, a16, a17, and a18 in the Int node.
Another change was quite ad hoc but helpful. It seemed reasonable that if no information was propagating along a path, then the input to Int should be zero. However, with the given transfer function, a zero input to an upstream node propagates a signal of 0.5 to the next downstream node. This effect was removed by making bInt proportional to a' and to -0.5 times the number of upstream nodes. In addition to addressing the issue just mentioned, this change further reduced dimensionality by one parameter. It has another advantage as well. Mathematically, bInt would be expected to influence transition onset time as previously mentioned. However, the Int node is a synthetic model construct that replaces several actual genes with roles in pathway integration. By fixing bInt, the model is forced to more closely relate timing to the b values associated with real genes that actually affect temporal shifts.
Together, these modifications yielded the nine-parameter model
![]() | [2] |
This model was fit using ES. Figure 3 plots predicted vs. observed values for those observations falling in the actual transition interval. Except for fve, predictions are close to the actual data. The overall R2 of all predicted and observed values used in training was 0.9693.
|
|
Applied completely, this idea would add 16 parameters to the model (eight nodes times two temperatures). To limit this increase in dimensionality while maximizing consistency with Eq. [2], the temperature factors for 24°C were assigned to be unity. Under this scheme, the 16°C factors are interpreted as relative changes. Furthermore, because the Int output must range from 0 to 1, no temperature multiplier was used there. It was also found desirable to relax the assumption of common a and a' values. The weights, ai, were estimated by parameters of the form avi or a'vi where a and a' are as above and vi (i = 11, 12, 15, 16, 17, 18, and 19) values are new with initial estimates of 1.0. This parameterization allowed the search engine to find the approximate solution via a, which was then particularized by the offset vi parameters. The weights a4, a5, a6, and a7 were varied freely. Again, it should be emphasized that these reparameterizations were tactical and intended to facilitate training. No scientific significance is attached to their specific form.
In what follows, [r,s] is a vector of temperature multipliers for 24 and 16°C, respectively. Taken together, these changes yield the combined temperature model
![]() | [3] |
Column I of Table 2 gives the resulting parameter values and the root mean square error (RMSE). There is an immediately noteworthy pattern among the temperature factors (qi's). Four of the seven nodes (all of the autonomous pathway plus the closely linked GI) have temperature factors between 0 and 1, plausibly suggesting a general plant slowdown with cooling. Perhaps not surprisingly, the remaining photoperiod sensory pathway nodes (CRY2, PHYB, and CO) depart from these values. A final analysis of the sensory path should await the addition of short-day data to the training set. Even so, there are some comments about the light receptors that can be made now.
|
The blue-light receptor encoded by CRY2 is known to mediate the red lightdependent flowering-control actions of phytochrome B (Guo et al., 1998). The network was able to reproduce this to the extent that, neglecting the temperature constant, the PHYB node exactly tracked the CRY2 output. However, in the absence of specific training data under red and blue lighting, the network was unable to resolve the precise antagonistic relationship that exists between these receptors in reality. Under the white light in this experiment, the phyB inflorescence transition preceded that of the Ler wild type by 1 to 2 d regardless of temperature. The temperature-stabilized CRY2 was apparently a much more reliable indicator of training set behavior than was PHYB. As a result, v12, the weight applied to PHYB outputs, diminished by almost four orders of magnitude (Table 2, Column II). For this reason, training was repeated (Table 2, Column III) after deleting PHYB from the network and excluding the corresponding mutant data. Not only is the aggregate RMSE value better, but the bulk of the improvement is associated with the wild type, whose prediction errors were reduced by 33% (from RMSE = 0.152 to 0.102).
Predicted transition curves and observed data for both temperatures using the values from Table 2, Column III, are plotted in Fig. 5 . The ability of the model to track multiple genotypes accurately as temperature changes is evident. However, the figure also reveals an interesting feature, namely that the order of inflorescence transition for fve and co differs between the two temperatures (arrows). Moreover, the scale of the exchange is large. Comparing median transition times (estimated by linear interpolation), co switches from 7.7 d ahead of fve at 24°C to 9.3 d behind at 16°C for a switching magnitude (i.e., net change) of 17 d. To the authors' knowledge, this observation regarding co and fve is novel.
|
| DISCUSSION AND CONCLUSIONS |
|---|
|
|
|---|
A natural desideratum for applied models is that they be no more complex than necessary to achieve specific goals. In this regard, estimates such as approximately 26 000 genes for A. thaliana (Riechmann et al., 2000) can be quite intimidating. However, fears of the impracticality of genomic-level modeling are probably groundless because small numbers of genes often control broad developmental and physiological processes. The genes used in this study top a floral development hierarchy (Theissen and Saedler, 1999) that ultimately generates all reproductive organs. More generally, mathematical analysis suggests that the accrual of disproportionate influence to tiny gene minorities may be an inevitable consequence of network self-organization over evolutionary time (Barabasi and Albert, 1999). If so, significant crop simulation improvements may only require the modeling of a relatively few genes.
Furthermore, applied models, including genetic neural networks, will continue to benefit from sensitivity analysis and other traditional simplification techniques. The ability to delete PHYB under white-light conditions is one illustration. In addition, it may not be necessary to model the complete dynamics of all genes. White and Hoogenboom (1996) have had success melding genomics with existing simulation models using simple regression equations to predict traditional genetic coefficients based on the presence or absence of selected dominant alleles.
Beyond applied issues, one exciting aspect of this work is that it immediately suggests further researchable questions. Like all good models, neural networks seem capable of providing guidance in data interpretation. An apparent sensitivity to assumed structure was demonstrated in the reversed photoperiod pathway and randomized network examples above. It may also have played a part in the inability to fit the original model with its extraneous inputs of daylength and plant age. If such sensitivity is widespread, automated tools may, with appropriate human guidance, be able to screen large numbers of network structures for compatibility with existing data.
To be realized, such tools will require powerful numerical estimation methods. One of the most useful features of neural networks is the extensive body of research results on training techniques (Fine, 1999; Grierson and Hajela, 1996). Still, it remains to be seen what will prove to be the most efficacious ways to apply these methods to genomic data. It is obvious, however, that training will need to utilize data from multiple experiments conducted in different laboratories. This need is illustrated by the PHYB results, which from the perspective of absolute realism, would have been improved by data from red- and blue-light environments. Clearly, no single research group can perform all the treatments necessary to obtain complete data.
Complicating factors are (i) the propensity of various A. thaliana investigators to use different sets of alleles in separate genetic backgrounds and (ii) the possible impacts of subtle interlab variations in rearing environments. While these issues will no doubt cause initial problems, quantitative models using neural networks or any other successful approach can provide common reference points against which effect magnitudes can be judged. This may be an area in which the experience of crop modelers can be of direct benefit to genomic scientists.
Because genetic network construction involves a consideration of phenotypes, there are multiple possible mechanisms underlying the resulting links. The relationships may be quite immediate, as when one gene codes for a transcription factor directly controlling the expression of another. CO, whose sequence includes the DNA-binding, Zn-finger motif (Lewin, 1997, p. 850), is a probable example. Alternatively, the effects may be indirect and mediated by significant physiological processes. Even so, the construction method still guarantees that the genes responsible will appear at their proper place in the network diagram. The cryptochrome and phytochrome receptors illustrate this situation. In spite of this diversity, the neural network approach appears able to span these differences in mechanism, at least to some degree.
Current crop simulation models, in contrast, emphasize physiological mechanisms to the complete exclusion of genomics. As noted earlier, this has led to a very empirical representation of process-control mechanisms that may, in the final analysis, lack plasticity. Even so, the extensive utilization of crop simulation models in areas as diverse as policy analysis (Rosenzweig et al., 1996; Tubiello et al., 1999); research (Hanks and Ritchie, 1991); and, increasingly, production management (Welch et al., 2002) shows that the physiological approach is not without its strengths.
However, if taken to extremes, genetic neural networks could run the risk of committing the reverse error, that of subsuming too much under the heading of genomics. Future plant models will need to craft a careful balance between traditional methods and new genomic approaches like, or derived from, the one presented here. Thus, the most productive way forward will undoubtedly entail a synthesis of crop modeling and genomic methods rather than the mere adoption of a new approach by one existing research community. Among all the exciting aspects of this work, this is the most exciting of all.
| APPENDIX |
|---|
|
|
|---|
A case was defined as two specific varieties compared in two particular environments. Among the 37 395 cases in the data were 3470 (9.3%) with apparent switches. However, the effect cannot be declared real without a statistical evaluation. The following reasoning was developed by Dr. T. Loughin (Dep. of Stat., Kansas State Univ., personal communication, 2000). Consider the distribution of differences in mean genotype effects between pairs of varieties. If no attention is paid to the order in which the subtractions are performed, then this distribution will span 0. In this situation, there are only two ways in which switches can fail to occur. The first is if there are no genotype x environment (G x E) interactions, a premise that is undeniably false. The second is if the magnitudes of G x E interactions become small when genotype differences are small. Only in this way can pairs of varieties have genotype differences that do not fall on opposite sides of the origin in diverse environments.
There were 5201 pairs of varieties found in two or more environments. The means and variances of the intervarietal differences in treatment means were calculated across environments. The variances estimate G x E interaction effects and were converted to standard deviations so as to have the same units as the means. Figure 6 is a plot of all 5201 (mean, standard deviation) pairs. As is evident, there is some reduction in the standard deviation as the means approach 0, consistent with the relative rarity of switching. However, there are points quite near 0 with standard deviations as large as 4 d or greater, making it very difficult to assert that all treatment differences actually have the same sign. These deviations also exceed the 1-d sampling interval (whose impact is clear along both axes), thus eliminating quantizing effects as a source of the observed switching.
|
| NOTES |
|---|
|
|
|---|
A common situation in genomics is to know that a group of genes may form a path but not be able to tell the order in which they fall. For example, one might be uncertain as to whether a portion of the photoperiod pathway runs PHYB
GI
CO, as in the Simpson et al. (1999) model, or the reverse. Typically, a set of single and double mutant experiments will be performed to resolve such issues. Although no such experiments were performed here, there was a clear distinction between these two alternativesthe reversed order yielded an SSE of 6.68.
1 In this paper, GENE denotes a particular locus, GENE is the corresponding neural network node, and gene is a plant with loss-of-function mutant alleles at that locus. ![]()
2 Look for more detail genetic information at http://www.arabidopsis.org/chromosomes/ (verified 16 Sept. 2002). ![]()
3 Because the model output is the fraction of plants completing the transition, all reported SSE and root mean square error (RMSE) values are unitless. ![]()
4 A reviewer noted the Sirius model (Jamieson et al., 1998) can switch in some situations because it alters thermal time to flowering by making main-stem final leaf number responsive to daylength and vernalization. However, in the experiment reported here, both photoperiod and the short, initial chilling, were common across treatments. ![]()
| REFERENCES |
|---|
|
|
|---|
This article has been cited by other articles:
![]() |
J. Khazaei, F. Shahbazi, J. Massah, M. Nikravesh, and M. H. Kianmehr Evaluation and Modeling of Physical and Physiological Damage to Wheat Seeds under Successive Impact Loadings: Mathematical and Neural Networks Modeling Crop Sci., July 1, 2008; 48(4): 1532 - 1544. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Khazaei, M.R. Naghavi, M.R. Jahansouz, and G. Salimi-Khorshidi Yield Estimation and Clustering of Chickpea Genotyp |