Agronomy Journal 95:52-64 (2003)
© 2003 American Society of Agronomy

SYMPOSIUM PAPERS

Gene-Based Approaches to Crop Simulation

Past Experiences and Future Opportunities

Jeffrey W. White (a,*) and Gerrit Hoogenboom (b)

a Centro Internacional de Mejoramiento de Maiz y Trigo (CIMMYT, Int.), Apt. Postal 6-641, 06600 Mexico, D.F., Mexico
b Dep. of Biological and Agricultural Engineering, Univ. of Georgia, Griffin, GA 30223-1797

* Corresponding author (j.white{at}cgiar.org)

Received for publication May 1, 2001.

    ABSTRACT
Use of process-based models of plant growth and development is increasing in both basic and applied research. Advances in genomics suggest the possibility of using information on gene action to improve simulation models, particularly where differences among genotypes are of interest. This paper reviews issues related to incorporating gene action in crop models, starting with an introduction to basic concepts of functional genomics. We recognize six levels of genetic detail in modeling approaches. Modeling gene action through linear estimates of effects on model parameters (Level 4) has shown promise in the common bean (Phaseolus vulgaris L.) model GeneGro. However, this approach requires extensive data on the genetic makeup of cultivars, and such data are still not routinely available. Software for simulating complex biochemical pathways offers the prospect of simulating processes such as photosynthesis or photoperiod control of flowering by considering interactions of regulators, gene-products, and other metabolites (Level 6), but such software applications may require an understanding of the reaction kinetics of large biomolecules existing at concentrations as low as one or two molecules per cell. Over the next decade, genetic information probably has the most to contribute in understanding temporal and tissue-level variation in the genetic control of specific processes and, for more applied modeling, in improving the representation of cultivar differences. Strategic decisions are needed on prioritization among species and traits to be modeled, as well as on how to improve collaboration with molecular biologists to better access and harness the data resulting from their research.

Abbreviations: ESTs, expressed sequence tags • ICASA, International Consortium for Agricultural Systems Applications • ICIS, International Crop Information System • QTL, quantitative trait locus • RAPD, randomly amplified polymorphic DNA


    INTRODUCTION
CROP SIMULATION MODELS are used increasingly in a range of basic and applied research in the plant sciences and on natural resource management (Whisler et al., 1986; Boote et al., 1996; Tsuji et al., 1998). They provide one of the best approaches for integrating our understanding of complex plant processes as influenced by weather, soil, and management conditions. Such models often prove as valuable in guiding research as in providing quantitative predictions (Loomis et al., 1979). Nonetheless, process-based crop simulation models have limitations as quantitative representations of plant growth and development. Descriptions of physiological processes such as photosynthesis, respiration, and partitioning are subject to large uncertainties (Monteith, 1996), and the data used as inputs to models are often difficult to obtain and can contain significant measurement error (Boote et al., 1996).

It is particularly problematic to represent cultivar differences in response to environment and management. Simulation models often employ cultivar-specific genetic coefficients for traits such as photothermal time from emergence to flowering or potential rate of seed growth. However, values of these parameters are seldom determined through direct measurement or genetic analyses. Various numerical optimization approaches have been proposed for calculating their values (Hunt et al., 1993; Welch et al., 2001), but the methods require substantial sets of field data.

Rapid advances in plant genetics and genomics (e.g., Bouchez and Höfte, 1998; Arabidopsis Genome Initiative, 2000), especially for Arabidopsis thaliana (L.) Heynh., suggest the possibility of improving model descriptions of physiological processes and cultivar differences. For research-focused modeling, benefits might include clarification of key metabolic pathways, including their control in different tissues and during different phases of plant development. For applied research, perhaps the most rapid benefits will come from genetic data on cultivar differences. Such information would greatly facilitate determination of model parameters for new lines or cultivars, the development of ideotypes for specific regions, and analysis of sources of error in simulations.

Several researchers are already examining strategies for dynamic simulation of large biochemical and genetic networks (Anonymous, 1999; Collins and Jegalian, 1999), with an eye to simulating an entire organism based on a description of its genome. Challenges include predicting the structures of all proteins in a genome (Gaasterland, 1998; Skolnick and Fetrow, 2000), modeling reactions in complex biochemical networks (Arkin et al., 1997; Mendes and Kell, 1998; von Dassow et al., 2000), integrating such information into simulations of entire organisms (Tomita et al., 1999), and accounting for effects of the external environment on gene action.

Use of genetic information in simulation models of plants is still rare. In the GeneGro model for common bean, White and Hoogenboom (1996) used simple linear effects of seven genes to replace empirically determined coefficients of BEANGRO, the model from which it was derived. In comparisons of observed and simulated data for 20 cultivars, GeneGro performed as well as BEANGRO (Hoogenboom et al., 1997). While promising, the approach fell short of representing gene action at the process level. To simulate cold acclimation in cereals, Fowler et al. (1999) described a routine whose development was partially guided by information from molecular studies. In molecular biology, qualitative models for processes such as floral development are often described (e.g., Koornneef et al., 1998; Sheldon et al., 1999). Hay and Ellis (1998) reviewed these approaches with a focus on predicting time of flowering in wheat (Triticum aestivum L.) and barley (Hordeum vulgare L.), and recent efforts to use genetic information to guide the modeling of phenology in soybean [Glycine max (L.) Merr.; Stewart et al., 2003] and Arabidopsis thaliana (Welch et al., 2003) further demonstrated the benefits possible from incorporating genetic information in process-based models.

This paper reviews strategies and opportunities for representing how specific genes affect growth and development processes simulated in crop models. Since the developers and users of crop simulation models are often unfamiliar with the terminology and concepts of plant genetics and genomics, these topics are first reviewed as background for subsequent discussions.


    PLANT GENETICS AND GENOMICS FROM A SYSTEM MODELING PERSPECTIVE
In modeling any agricultural system, central issues are to identify and characterize the relevant system parameters, how they interact, and how they are affected by external variables, such as weather, soil, and management conditions. To use genetic information in plant growth and development models, it is useful to consider the nature of a gene, what its products are, how production of those products is regulated, and what impact the gene products have on crop processes. Furthermore, to assess the feasibility of gene-based modeling, one must decide on the number of genes to consider and how genes are identified.

In the next section, we offer a short primer of plant genetics and genomics focusing on topics relevant to crop simulation modeling. Given the rapid progress in genomics and the newness of research that applies genomics to crop modeling, we recognize that this selection may appear idiosyncratic. For detailed information on plant genetics and genomics, recent texts such as Brown (1999) and Lewin (2000) should be consulted. The appendix provides a glossary of selected terms in genomics. (Readers familiar with genomics should skip directly to the section "Gene-Based Approaches to Modeling.")

What is a Gene?
In classical Mendelian genetics, a gene is defined as the fundamental unit of inheritance. Advances in the understanding of gene action and structure first led to the concept of a gene as a contiguous sequence of base pairs in a strand of DNA that codes for the synthesis of an enzyme—the one gene, one enzyme model (Watson et al., 1988). The understanding that a single enzyme may contain multiple polypeptide chains coded for at different positions on the chromosomes prompted further revision to one-gene, one polypeptide (Lewin, 2000). Small RNA molecules also are gene products, can have catalytic functions, and can regulate the transcription of RNA from a DNA template. Thus, a further refinement might be one-gene, one diffusible product. Fortunately, for initial efforts in gene-based modeling, Mendel's classical definition appears adequate.

A locus is a physical position on a chromosome. In classical genetics, a locus is the position of a single gene. However, molecular tools can identify multiple positions within a single gene. Thus, the term locus is also applied to a position within a gene.

Alleles are alternative forms of the same gene at the same locus. They arise from mutations. For simplicity, this paper emphasizes cases where there are two alleles, a dominant and recessive form, but multiple alleles can exist for a single gene and have quantitatively different effects.

Genes are named by their discoverers and are usually given both full and abbreviated designations (e.g., Finatus and Fin, for the gene controlling indeterminate vs. determinate stem in bean). If multiple genes within a class or family are found, these are distinguished by appending numbers (e.g., Vrn1 and Vrn2 for vernalization in wheat). Dominance is indicated using uppercase for the first letter of the name (e.g., Fin is the dominant allele, while fin is the recessive) or by placing a + or - following the name (e.g., xxx+ vs. xxx-).

The starting point of a plant gene is taken as the onset of the section of base pairs in a DNA molecule that codes for the messenger RNA (mRNA) that will carry the genetic information for subsequent protein synthesis. The first base pair is assigned a positional value of 1. Upstream positions are assigned negative values corresponding to the distance measured in numbers of base pairs and typically include sites for various elements involved in regulation of transcription. The downstream positions contain the DNA that is copied in the mRNA. This may be a single continuous sequence, or functional segments, called exons, may alternate with sections not represented in the mRNA, termed introns. A termination segment may also be present. DNA is read starting from the end of the first nucleotide having a 5'-hydroxyl and ending at the 3'-hydroxyl of the last nucleotide. Thus, descriptions of genes frequently refer to the 5' or 3' positions.
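The numbering convention and the exon and intron structure described above can be illustrated with a short Python sketch. The toy sequence, the exon coordinates, and the function names are hypothetical, and skipping position 0 is an assumption (a common numbering convention, but one not stated in the text).

```python
def to_gene_position(index, start_index):
    """Convert a 0-based string index to a gene-relative position.

    start_index is the 0-based index of the first transcribed base,
    which is assigned position 1. Upstream bases receive negative
    values; position 0 is skipped (assumption, see lead-in).
    """
    offset = index - start_index
    return offset + 1 if offset >= 0 else offset


def splice(transcribed, exons):
    """Join exons given as (first, last) gene positions, 1-based inclusive."""
    return "".join(transcribed[first - 1:last] for first, last in exons)


# Toy example: 4 upstream bases followed by a 12-base transcribed region.
dna = "TATA" + "ATGGTAAAGTGA"
start = 4
transcribed = dna[start:]

# Positions 4 to 9 play the role of a hypothetical intron.
exons = [(1, 3), (10, 12)]
mature = splice(transcribed, exons)  # exon sequence only, still in DNA letters
```

Here `to_gene_position(3, start)` gives -1 for the last upstream base and `to_gene_position(4, start)` gives 1 for the first transcribed base.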

The structure of the Opaque-2 (O2) gene in maize (Zea mays L.) is typical of a plant gene having six exons (Fig. 1). O2 codes for a transcription factor (Schmidt et al., 1990), a protein that binds to the upstream region of other genes and is required for initiation of transcription of those genes. The recessive genotype (o2 o2) has endosperm with an opaque (chalky) appearance, reduced levels of 22 kDa zeins, and enhanced levels of lysine and tryptophan. The transcription factor also affects transcription of nonstorage protein genes. The protein has a special sequence of amino acids that promotes binding to DNA, termed a leucine zipper motif due to occurrence of leucine in every seventh position (Schmidt et al., 1990).



Fig. 1. Schematic of the structure of the gene Opaque 2 based on descriptions by Maddaloni et al. (1989) and Schmidt et al. (1990). Numbers in italics represent relative positions as base pairs along the DNA molecule. Due to space constraints, labels for some exons and introns are abbreviated (e.g., "I 2") or not shown (e.g., Intron 3).

 
Schmidt et al. (1990) described the core of the gene, termed the structural gene, as having 1311 nucleotides that specify the composition of a transcription factor containing 437 amino acids. In the sequence published by Maddaloni et al. (1989), the nucleotide series is broken into six exons and five introns (Fig. 1). The two published descriptions of the O2 sequence differ in several details. In the upstream region, Maddaloni et al. (1989) characterized a sequence of 1548 nucleotides, whereas the upstream region of Schmidt et al. (1990) had only 258 nucleotides. There also is a discrepancy in the length of the resulting protein sequence. Schmidt et al. (1990) described a polypeptide with 437 amino acids, and Maddaloni et al. (1989) reported 453 amino acids, reflecting a slightly longer nucleotide sequence for the structural gene. Although genetic data may carry an aura of absolute accuracy, differences in reported sequences may occur due to the methods used to isolate genes, the sequencing procedure, interpretation of the sequences and, of course, genetic variation in the germplasm characterized.
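The relation between the nucleotide and amino acid counts quoted above follows from the triplet code and can be checked with a few lines of Python. This is only an arithmetic sketch: it assumes that the reported lengths exclude the stop codon, and the 1359-nucleotide figure is implied by the 453 amino acids rather than reported in the text.

```python
CODON_LENGTH = 3  # nucleotides specifying one amino acid


def coding_nucleotides(n_amino_acids):
    """Nucleotides needed to encode a polypeptide (stop codon excluded)."""
    return n_amino_acids * CODON_LENGTH


schmidt_nt = coding_nucleotides(437)      # matches the 1311 nucleotides reported
maddaloni_nt = coding_nucleotides(453)    # implied length, not a reported value
difference_nt = maddaloni_nt - schmidt_nt
```

Under these assumptions, the 16 additional amino acids correspond to 48 additional nucleotides in the structural gene.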

Gene Transcription and Expression
The process of reading a sequence of nucleotides from a DNA molecule and using this information to create a gene product has several stages (Fig. 2). In genes specifying the sequence of amino acids of a protein or a shorter polypeptide chain, contiguous triplets of nucleotides in the DNA specify the amino acid sequence. This "genetic code" is transferred from the DNA to mRNA in a polymerization reaction catalyzed by RNA polymerase. Since the code is ultimately read from the mRNA, it is usually expressed in terms of the four bases found in RNA, adenine (A), guanine (G), cytosine (C), and uracil (U). The amino acid leucine can be coded for by the base sequences UUA, UUG, CUU, CUC, CUA, or CUG. The mRNA template is used by organelles known as ribosomes to translate the code in the process of polymerizing the amino acids to create the specified polypeptide. This polypeptide may immediately be functional, but chaperone proteins often mediate correct assembly, particularly with regard to folding into the tertiary protein structure. So-called heat shock proteins are among important chaperones (Lewin, 2000).
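As an illustration of how contiguous triplets in the mRNA specify amino acids, the following Python sketch builds the standard genetic code and translates a short, made-up mRNA. The function and variable names are hypothetical, one-letter amino acid symbols are used (L for leucine, * for a stop codon), and the sketch ignores every stage of Fig. 2 other than translation.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, with codons ordered U, C, A, G at each position.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate an mRNA string codon by codon until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# The six codons for leucine listed in the text can be recovered from the table.
leucine_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "L")

# Made-up mRNA: start codon, two different leucine codons, stop codon.
peptide = translate("AUGCUGUUAUAA")
```

Because several codons map to the same amino acid, different nucleotide sequences can specify the same polypeptide.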



Fig. 2. Schematic of the multi-stage process of transcription of mRNA from DNA, processing, transport and translation that result in synthesis of a polypeptide in the cell cytoplasm. Processes are shown in italics and key products in bold face.

 
Models of gene expression usually focus on transcription and translation (Fig. 2). The level of activity of a given gene is often regulated at the initiation of transcription (Lewin, 2000), but other potential control points or factors can operate (Brown, 1999). These include the processing of pre-mRNA, the transport of mRNA from the nucleus to the cytoplasm, translation efficiency, the stability of the mRNA, and protein modification and turnover (Gallie, 1993).

Transcription is regulated by one or more sequences in the upstream region that are recognized by proteins known as transcription factors. Promoter sequences are those involved in the binding of RNA polymerase, while enhancer sequences serve as binding sites for proteins that increase the activity of promoters. Less often, repressor proteins will bind to upstream sites, preventing initiation.

A single transcription factor may affect the transcription of a range of genes whose expression needs to be synchronized. Similarly, transcription of a single gene may be influenced by many factors. A gene that codes for a phosphatase that triggers mitosis in Drosophila illustrates the potential complexity of such systems (Lehman et al., 1999). In the string gene, an over 30 kilobase upstream region contains many promoter and enhancer sequences. Different transcription factors control string transcription in different types of cells and tissues and can be specific to subsets of epidermal cells, mesoderm, trachea, and nurse cells (Lehman et al., 1999).

How the environment influences gene transcription and expression is, of course, a central area of research in plant molecular biology. In cold tolerance in Arabidopsis, one family of regulatory genes (CBF/DREB1) appears to serve as master switches that are induced under cold conditions (Thomashow, 1999 and 2001). In processes such as reproductive development, families of phytochrome and cryptochrome (blue-light) receptors are involved, but how these affect gene activity is still unclear (Koornneef et al., 1998).

Genome Size and Functional Complexity
Estimates of total gene numbers for most plant species are on the order of 40 000 (e.g., Stuber et al., 1999). Arabidopsis, which is believed to have the smallest genome of any flowering plant, has about 25 500 genes (Table 1 and Arabidopsis Genome Initiative, 2000). DNA content per nucleus shows large variation among species (Table 2) and might be expected to indicate gene numbers. However, plant DNA contains multiple and sometimes large segments of noncoding bases, so this is not the case (Dean and Schmidt, 1995). Although it is not useful for estimating gene number, DNA content has important implications for the detection of individual genes using molecular markers: the more noncoding DNA segments there are, the harder it is to locate individual genes.


Table 1. Estimated number of genes in various organisms or plant organelles.

 

Table 2. Estimates of genome size and chromosome number for various plant species as listed in database of Bennett et al., 1998.

 
The potentially large number of genes affecting plant growth and development raises questions about the feasibility of gene-based approaches for modeling these processes. To further complicate the situation, different alleles of a single gene may interact. In inbreeding crops such as wheat or rice (Oryza sativa L.), all loci can be assumed to be homozygous (having the same allele on each chromosome), but in outcrossing and clonally propagated crops, heterozygous loci are common. Polyploid species (where more than two sets of chromosomes, and hence genes, occur) such as wheat, alfalfa (Medicago sativa L.), and potato (Solanum tuberosum L.) present a further level of complexity. Duplicated genes may retain their original functions, evolve to divergent functions, or one copy may cease to function (Wendel, 2000).

Environmental effects on gene expression also support a view of complexity. Large numbers of genes are activated or turned off when plants are exposed to stresses such as heat or water deficit (e.g., Bonham-Smith et al., 1988; Bray, 1990). While many of these genes have been characterized, we are far from a functional picture of how they interact (e.g., Thomashow, 1999). This complexity due to potentially large numbers of genes and interactions is a major concern for attempts to model networks of genes coding for a series of interacting enzymes and transcription factors. One major simplification may be possible through treating such networks as relatively independent modules. Thus, in simulating photosynthesis and respiration, one may be able to ignore genetic control of meiosis and mitosis, specific disease resistance mechanisms, and uptake and metabolism of nonlimiting nutrients.

Contrasting with this view of overall genetic complexity, evidence for genetic differences among cultivars suggests comparative simplicity. Quantitative trait loci (QTLs) are molecular markers whose presence or absence explains a significant amount of variation in a quantitative trait such as yield or time to flower. Effects of QTLs are usually measured in progeny of a single cross that are segregating for many traits and, hence, genes. In an extensive review, Tanksley (1993) found that for a given cross, 5 to 10 QTLs would often explain 40 to 70% of phenotypic variation in traits such as plant height, days to maturity, grain size, and even grain yield (see Table 3 for examples for maize). When data from multiple environments are considered, the genotypic variation explained can also be estimated, and such values are often higher than estimates for phenotypic variation (e.g., Austin and Lee, 1998).


Table 3. Number of quantitative trait loci (QTLs) detected and their effects (as portion of phenotypic variation explained) for various maize populations and traits. Only the effects of the first three loci and the total effect are given.

 
The success of classical genetics in explaining many quantitative traits, particularly for phenology, is also encouraging. For common bean, five genes explain much of the commercially important variation in time to flowering and maturity (White and Hoogenboom, 1996; Hoogenboom and White, 2003).

The proposed simplicity of physiologically important genetic differences in cultivars would appear to conflict with the estimates of 40 000 genes for maize. Among explanations for simplicity are that observed diversity in molecular markers reflects either variation in noncoding sections of chromosomes or in base-pair sequences within a gene that have no effect on the phenotype. Thus, while numerous allozymes (a class of isozymes or protein variants) exist, these variants usually reflect amino acid substitutions that have no effect on enzyme function (Tanksley, 1993).

The simplicity of cultivar differences that is suggested by QTLs may reflect individual markers identifying clusters of closely linked genes, giving a false impression of genetic simplicity. Analyses of regions around individual QTLs (fine mapping) for grain yield in maize (Graham et al., 1997) and heading date in rice (Yamamoto et al., 1998) revealed multiple genetic factors closely linked to the individual QTLs. Another concern is that QTL studies often use wide crosses to ensure detecting a large number of loci showing genetic differences. Commercially important germplasm may be much more uniform for genes detected by QTLs. Detection of large effects of a few genes thus would be an artifact. Clearly, this possible QTL vs. gene number paradox merits further attention.

Identifying Genes and Alleles
Approaches for identifying genes fall into two groups: one either identifies phenotypic differences and relates these to genes, or identifies a gene and then associates it with a phenotype. Traditional genetic analyses based on segregation ratios of expressed traits (phenotypes) in crosses are still invaluable for qualitative traits controlled by one or two genes. Unfortunately, for most physiological traits, the inheritance pattern is quantitative, and the phenotype reflects effects both of multiple genes and of the environment, thus making the task of detecting segregation patterns very difficult.

Partial solutions include using large populations, minimizing effects of environmental variation, and using molecular markers to suggest or verify modes of inheritance (Tanksley, 1993; Patterson, 1998). Gu et al. (1998) used markers to confirm the inheritance of photoperiod response in common bean initially described by Kornegay et al. (1993). Commonly used markers seldom fall precisely on the actual gene position (locus) of phenotypic interest, and markers can lie tens of thousands of base pairs away from the target gene (Lee, 1995). Thus, while markers can help locate a gene within a region of a chromosome, they are of limited value for characterizing gene sequences.

Due to the empirical nature of traditional genetics and molecular markers, data contain errors from measurement and interpretation. Sources of errors include misidentification of lines, use of lines of uncertain genetic purity, misreading of bands on electrophoresis gels, and data transcription errors. In a study of scoring error and reproducibility of bands from randomly amplified polymorphic DNA (RAPD) markers, Skroch and Nienhuis (1995) found a scoring error of 2%, and initial reproducibility was only 76%.

Gene sequencing can be viewed as working the gene identification problem from the other end—from genotype to phenotype. The most productive sequencing methods are based on expressed sequence tags (ESTs), where 300 to 500 base pair sequences from cDNA (DNA complementary to an mRNA) are determined (Schena et al., 1995; Bouchez and Höfte, 1998; Yamamoto and Sasaki, 1997). Knowing the sequence, the amino acid chain being specified is readily determined from the genetic code.

Understanding Gene Function
Knowing the sequence of a gene permits determining the amino acid sequence of the polypeptide, but this information alone is little better than a parts list (Skolnick and Fetrow, 2000). Combining information from polypeptides having similar sequences, from crystallography, and from other sources can reveal more about the possible function of the polypeptide. Obtaining this added value is a major research area of genomics (Baxevanis and Landsman, 1998).

Additional information is also available from the pattern of gene expression and may be of more immediate use in modeling development. Various methods allow tracking the timing and tissue specificity of gene transcription, which may have important implications for studying developmental processes (Bouchez and Höfte, 1998). Sheldon et al. (1999) found that the FLF gene of Arabidopsis affects vernalization-dependent flowering by encoding a protein that represses the transition to flowering. Gene activity was greater in vegetative rosette leaves of Arabidopsis than in reproductive tissue and was sustained at a uniform level at least until time of bolting. Such information appears directly applicable to crop simulation models such as CROPGRO and CERES that assume phase-specific differences in sensitivity to temperature and photoperiod.


    GENE-BASED APPROACHES TO MODELING
To incorporate the vast amount of information coming from genomics, modelers need to focus on specific objectives. Detailed models of gene transcription and biochemical pathways will improve the understanding of specific processes, but may prove difficult to apply directly to prediction of overall growth and development. Simpler models may draw on general information from genomics to suggest representations of physiological processes, and there is the special opportunity of more rigorous accounting of genetic differences among cultivars.

For simulations to elucidate differences in plant growth and development among cultivars, six levels of genetic detail can be identified:

  1. Generic model with no reference to species.
  2. Species-specific model with no reference to genotypes.
  3. Genetic differences represented by cultivar-specific parameters.
  4. Genetic differences represented by specific alleles, with gene action represented through linear effects on model parameters.
  5. Genetic differences represented by genotypes, with gene action explicitly simulated based on knowledge of regulation of gene expression and effects of gene products.
  6. Genetic differences represented by genotypes, with gene action simulated at the level of interactions of regulators, gene-products, and other metabolites.

The first two levels are found in early models of crops and are still used for models where only generic representations of species are required (e.g., Parton et al., 1992; Mitchell et al., 1998). Most current crop models are at Level 3. Examples include CROPGRO (Hoogenboom et al., 1992; Boote et al., 1998), which typically uses 15 cultivar-specific parameters within a species, and CERES (Godwin et al., 1989; Hoogenboom et al., 1994; Ritchie et al., 1998), which uses up to eight such parameters, depending on the cereal species simulated. Level 4 corresponds to the approach used in GeneGro Version 1 (White and Hoogenboom, 1996), and Level 5 is partially represented in the phenology routines of GeneGro Version 2 (Hoogenboom and White, 2003). The feasibility of Level 6 is implicitly considered for unicellular organisms in models such as E-CELL (Tomita et al., 1999). The last three levels represent a continuum of approaches involving greater levels of genetic and biochemical detail.

Models without Explicit Gene Effects (Levels 1 to 3)
While perhaps less glamorous than the approaches described below, and arguably not truly "gene-based," simply using genomic information to guide hypothesis formulation and model testing is possible even in the absence of genetically characterized cultivars. Indeed, many discussions of modifications to models consider whether clear cultivar differences are known and whether the physiological mechanisms are at least partially understood. The work of Fowler et al. (1999) to model cold tolerance in wheat exemplifies this approach.

Modelers should be aware that functional genomics offers many powerful tools for crop physiology. As mentioned earlier, procedures exist for monitoring changes in gene expression in different tissues as affected by developmental phase and environmental conditions. Questions of timing of key developmental events can be examined in detail as well as tissue specificity of the responses. Papers by Bouchez and Höfte (1998) and Somerville and Somerville (1999) provide overviews of potential applications of plant functional genomics in physiology.

Linear Models of Gene Effects (Level 4)
The examples below emphasize self-pollinating species such as wheat, rice, soybean, and common bean. Continued self-pollination eliminates heterozygosity, so effects of interactions between alleles of a single gene are excluded.

For a single gene with two alleles, effects can be estimated in a linear model if the alleles are assigned values of 1 and 0 for the dominant and recessive forms, respectively:

    P = a + bG

where P is the effect to be estimated as a cultivar-specific model parameter, a and b are coefficients of the linear regression, and G is the variable indicating which allele of the gene is present. Thus in GeneGro Version 1, the representative leaf photosynthetic rate (LFMAX) was estimated by

where Fin indicates the genotype for the Finatus gene by a value of 1 (dominant) or 0 (recessive).

For a gene that affects more than one trait (pleiotropy), additional equations are used. For example,

\[ P_1 = a + bG \]

and

\[ P_2 = c + dG \]

where P1 and P2 are cultivar parameters for separate traits, and a, b, c, and d again are coefficients estimated through linear regression. Pleiotropy is common in physiological traits expressed at the whole plant level. In common bean, the most easily identifiable effect of the Fin gene is on the fate of the main stem, whether it remains vegetative or terminates in an inflorescence. If the stem remains vegetative, the plant flowers and matures later, usually resulting in increased leaf area and seed yield but reduced individual grain size (White et al., 1992). Fin, therefore, has pleiotropic effects on growth habit, phenology, and yield, and in GeneGro Version 1 the gene appeared in equations for parameters affecting phenology, threshing percentage, specific leaf area, and leaf photosynthesis.

Although a single gene may determine a given trait, most traits are affected by multiple genes. The simplest case is when the effects of the genes do not interact, and the genes have additive effects,

\[ C = a + bF + cG \]

where C is the cultivar parameter to be estimated; F and G represent effects of two genes; and a, b, and c are regression coefficients. In GeneGro Version 1, the coefficient for photothermal time to flowering (PHTHRS[5]) was estimated as:

where Fd is a gene for early flowering and maturity (Coyne, 1970).

If one gene influences effects of other genes, then an interaction is said to occur (epistasis). This can be represented as the product of the interacting genes:

Studies on the inheritance of photoperiod response in common bean showed an interaction between the genes Ppd and Hr (Kornegay et al., 1993), so in GeneGro Version 1, this effect was incorporated in estimating the cultivar parameter for length of the critical long day:

The foremost requirement for this modeling approach is to have access to a dataset of cultivars that are both calibrated for the cultivar-specific parameters and characterized for relevant genes. As indicated with CLDVAR, knowledge of the underlying processes and reports of gene action should influence the initial decision concerning which gene effects to include. Stepwise regression procedures can guide the final selection of genes used to model a single parameter. In developing GeneGro Version 1 (White and Hoogenboom, 1996), data for 32 cultivars and seven genes were used to estimate 19 parameters. Four parameters showed nonsignificant correlations, so they were set to constant values.

Representing gene action through linear effects is an effective way to reduce ambiguities in models of plant growth and development and to begin to introduce explicit gene action. Undoubtedly, the biggest constraint to use of linear effects is lack of appropriate data on cultivar-specific model parameters and on cultivar genotypes.
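As an illustration only, the Level 4 procedure can be sketched as an ordinary least-squares fit of a cultivar parameter on binary gene codes, with an optional product term for an interacting gene pair. The function name, the gene labels F and G, and all numerical values below are hypothetical; they are not the GeneGro coefficients, and the small normal-equations solver merely stands in for whatever regression software is actually used.

```python
def fit_linear_gene_effects(genotypes, values, interactions=()):
    """Least-squares fit of P = b0 + sum(b_i * G_i) + sum(b_jk * G_j * G_k).

    genotypes    : list of tuples of allele codes (1 = dominant, 0 = recessive)
    values       : calibrated cultivar parameter for each cultivar
    interactions : pairs of gene indices whose product enters the model
    """
    def design_row(g):
        main = [float(x) for x in g]
        products = [float(g[j] * g[k]) for j, k in interactions]
        return [1.0] + main + products

    X = [design_row(g) for g in genotypes]
    n = len(X[0])
    # Augmented normal equations: (X'X | X'y)
    A = [
        [sum(r[i] * r[j] for r in X) for j in range(n)]
        + [sum(r[i] * y for r, y in zip(X, values))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("singular design: gene effects cannot be separated")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


# Hypothetical, noise-free data: eight cultivars characterized for genes F and G.
genotypes = [(0, 0), (1, 0), (0, 1), (1, 1)] * 2
true_coefficients = (10.0, 2.0, 3.0, 1.5)  # intercept, F, G, F x G (invented)
values = [
    true_coefficients[0]
    + true_coefficients[1] * f
    + true_coefficients[2] * g
    + true_coefficients[3] * f * g
    for f, g in genotypes
]

coefficients = fit_linear_gene_effects(genotypes, values, interactions=[(0, 1)])
```

With real data the fit would not be exact, and the choice of which genes and interactions to retain would follow from prior knowledge of gene action and from the stepwise procedures mentioned above.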

Simulations Based on Knowledge of Gene Action (Level 5)
Assumption of linear effects of gene action is justified when more complete information is lacking. To move beyond Level 4, however, model equations should specify gene action using more physiologically realistic representations. In common bean, the inheritance of photoperiod sensitivity suggested that the increased photoperiod sensitivity associated with the genotype PpdPpd HrHr resulted in an inhibition of flowering, but the original model equations in GeneGro described the effect of photoperiod as a rate-enhancing (promoting) effect. Thus, the genetic information was used to restructure the equation for photoperiod response. This information further suggested that a third gene, Tip, should show a temperature effect such that lower temperatures reduced the level of photoperiod sensitivity (Hoogenboom and White, 2003).
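The distinction between a promoting and an inhibiting photoperiod effect can be made concrete with a minimal sketch for a short-day plant, in which daylengths beyond a critical value reduce the rate of progress toward flowering. The functional form, the genotype labels, and every numerical value are assumptions chosen for illustration; they are not the equations or coefficients of GeneGro Version 2.

```python
def relative_development_rate(daylength_h, critical_daylength_h, sensitivity_per_h):
    """Relative rate of progress toward flowering (1.0 = no photoperiod delay).

    Daylengths longer than the critical value inhibit development in
    proportion to a genotype-specific sensitivity; the rate never drops below 0.
    """
    excess = max(0.0, daylength_h - critical_daylength_h)
    return max(0.0, 1.0 - sensitivity_per_h * excess)


# Hypothetical sensitivities (per hour of excess daylength) by genotype class.
SENSITIVITY = {"insensitive": 0.0, "intermediate": 0.1, "sensitive": 0.3}


def days_to_flowering(daylength_h, genotype, days_at_optimum=35.0, critical_h=12.5):
    """Days to flowering under a constant daylength; infinite if fully inhibited."""
    rate = relative_development_rate(daylength_h, critical_h, SENSITIVITY[genotype])
    return float("inf") if rate == 0.0 else days_at_optimum / rate


for genotype in SENSITIVITY:
    print(genotype, days_to_flowering(14.5, genotype))
```

Writing the response as an inhibition means that an insensitive genotype is simply one whose sensitivity is zero, which is the kind of restructuring that genetic information can suggest.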

We can anticipate many ways to represent genotypic differences through gene action. The simplest would be the presence or absence of a discrete response, such as cold acclimation. The next level would involve quantitative variation in a response (e.g., photoperiod sensitivity). For models where enzyme kinetics is considered, such as the Farquhar photosynthesis model (Farquhar et al., 1980), which is incorporated in several crop models, genetic variants might affect coefficients for enzyme kinetics.

Much more complex situations certainly exist. Genes such as Fin affect multiple traits through mechanisms that are not immediately obvious. In bean, the genotype fin fin is associated with determinate stems, increased leaf size and thickness (and hence leaf photosynthetic rate), earlier flowering and maturity, reduced plant height, and increased seed size but, through compensation in other yield components, reduced numbers of pods and seeds per pod (White et al., 1992). While each effect could be modeled separately, a more robust approach might be to represent the underlying action of Fin. This conceivably might involve multiple effects of a single transcription factor or growth regulator, but there is essentially no information on the underlying mechanisms of Fin available at this time. We conclude that, in this early phase of gene-based models, it is wiser to deal with gene action on a case-by-case basis until there are more examples from which to generalize.

In terms of practical handling of genetic information, cultivar lists should include the genetic profile for each gene considered in the model. A binary code (i.e., 0, 1) is only adequate for inbred crops with diploid inheritance and where only two alleles are known. As genetic information increases in detail, multiple alleles with quantitatively different responses will undoubtedly be encountered, so additional code values will be required. The same approach would be required to handle out-crossing crops and some types of polyploid crops. To facilitate model development and testing, parameters for gene action might be externalized in a separate file that allows different types of functions to be defined in relation to a specific genotype. [This approach is implemented in files of species coefficients in the CROPGRO and CERES models (Hoogenboom et al., 1994).] Efficient handling of gene interactions and pleiotropy, however, may require more sophisticated approaches.
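One way to picture such an externalized description is a small structured file that lists allele codes, their assumed quantitative effect levels, and the terms that make up each cultivar parameter. The sketch below uses JSON purely for convenience; the file layout, the names GeneA, GeneB, and PARAM_X, and all values are hypothetical and do not correspond to any existing CROPGRO or CERES file format.

```python
import json
from math import prod

# Hypothetical contents of an external gene-action file.
GENE_ACTION_JSON = """
{
  "allele_effects": {
    "GeneA": {"0": 0.0, "1": 1.0},
    "GeneB": {"0": 0.0, "1": 1.0, "2": 0.5}
  },
  "parameters": {
    "PARAM_X": {
      "intercept": 20.0,
      "terms": [
        {"genes": ["GeneA"], "coef": 4.0},
        {"genes": ["GeneA", "GeneB"], "coef": -2.0}
      ]
    }
  }
}
"""


def cultivar_parameter(spec, name, genotype):
    """Evaluate one cultivar parameter from a genotype (gene -> allele code).

    Each term contributes coef * product of the effect levels of its genes,
    so single-gene terms are additive and multi-gene terms are interactions.
    """
    effects = spec["allele_effects"]
    definition = spec["parameters"][name]
    value = definition["intercept"]
    for term in definition["terms"]:
        levels = [effects[gene][str(genotype[gene])] for gene in term["genes"]]
        value += term["coef"] * prod(levels)
    return value


spec = json.loads(GENE_ACTION_JSON)

# Hypothetical cultivar list with a genetic profile for each gene in the model.
cultivars = {
    "CultivarOne": {"GeneA": 1, "GeneB": 0},
    "CultivarTwo": {"GeneA": 1, "GeneB": 2},  # code 2 = a third allele of GeneB
}
parameters = {
    name: cultivar_parameter(spec, "PARAM_X", genotype)
    for name, genotype in cultivars.items()
}
```

Because allele codes are looked up rather than used directly as 0 or 1, a third allele with an intermediate effect can be added to the file without changing the model code.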

Simulations Based on Interactions of Regulators, Gene-Products, and Other Metabolites (Level 6)
A process-based approach could potentially include gene transcription, synthesis of polypeptides, their conversion into proteins, the multiple reactions within each cell type, and ultimately interactions among cells and effects of the external environment. Various software systems can simulate complex biochemical reactions. One system, E-CELL, models a single cell (Tomita et al., 1999), and proposals exist for modeling organisms on supercomputers (Anonymous, 1999; Howard, 2000). We briefly describe three software systems for modeling biochemical systems. All three assume that information on pathways and reaction kinetics is known, and thus represent a level of simplification above protein sequence and structure. None of the packages allows for variation in the external environment.

The Gepasi software allows users to simulate the dynamics of complex metabolic pathways and has the specific goal of providing biochemists a user-friendly tool that requires a minimum of programming skills (Mendes and Kell, 1998). Reactions are entered using standard chemical notation, mass conservation relations are automatically accounted for, and commonly used rate laws are predefined. Compartments with different volumes are allowed. Output is displayed graphically as the reactions progress. The model size is limited only by available memory of the computer on which the software is installed.

E-CELL (Tomita et al., 1999) simulates a whole cell. As in Gepasi, the user defines the proteins, protein–protein interactions, protein–DNA interactions, gene regulation, compartmentation, and other characteristics of the system. Most enzyme reactions are modeled with Michaelis-Menten equations, but other equations can be used. Types of reactions allowed include binding of macromolecules to and transport of a substance from the external environment into the cell. The default time interval for integration is 1 ms. In a simulation of a 127-gene system from Mycoplasma genitalium on a 200 MHz Pentium II processor, the system ran about 20 times slower than real time (Tomita et al., 1999). E-CELL has also been used to simulate a system of 44 reactions and intermediates in a human red blood cell (Matsushima et al., 1998). The current model does not simulate cell division, but various enhancements are planned (Tomita et al., 1999).
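To give a sense of what such a simulation involves at its simplest, the following sketch integrates a single Michaelis-Menten reaction (substrate converted to product) with a fixed 1 ms time step. It uses simple forward Euler integration, which is an assumption made for brevity and not a description of the integrator in E-CELL; the function names and kinetic constants are likewise hypothetical.

```python
def michaelis_menten_rate(substrate, vmax, km):
    """Reaction velocity v = Vmax * S / (Km + S)."""
    return vmax * substrate / (km + substrate)


def simulate_reaction(substrate0, vmax, km, duration_s, dt_s=0.001):
    """Forward-Euler integration of S -> P over duration_s seconds.

    Returns the final substrate and product amounts (same units as substrate0).
    """
    substrate, product = substrate0, 0.0
    steps = int(round(duration_s / dt_s))
    for _ in range(steps):
        converted = michaelis_menten_rate(substrate, vmax, km) * dt_s
        converted = min(converted, substrate)  # never consume more than is present
        substrate -= converted
        product += converted
    return substrate, product


# Hypothetical constants: 10 units of substrate, Vmax = 2 units/s, Km = 1 unit.
final_substrate, final_product = simulate_reaction(10.0, vmax=2.0, km=1.0, duration_s=1.0)
```

A whole-cell model repeats this kind of update for every reaction at every time step, which is why a system with only 127 genes could already run slower than real time on the hardware of the period.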

Ingeneue (von Dassow et al., 2000) is a general-purpose program for constructing and analyzing models of genetic networks operating across a group of cells. Thus, it focuses on pathways involved in gene expression using parameters such as half-lives of messenger RNAs and proteins, binding rates, and enzyme cooperativity coefficients. To represent differences among cells, processes are monitored for a two-dimensional array of hexagonal cells.

These systems can advance our understanding of cell biochemistry and gene regulation, but current applications are far from providing the capacity to simulate growth of a plant, even if simplified to a few key cell types and maintained in a constant environment. Superficially, the main constraints to reaching this goal concern data availability. The software packages handle multiple compartments, which presumably could be extended to represent diverse cell and tissue types. In practice, it seems likely that biochemical and computational constraints will be limiting (Anonymous, 1999; Howard, 2000). Many reactions of interest in cell biology involve molecule concentrations as low as one or two copies of a promoter per cell. At such low copy numbers, standard representations of reaction kinetics will likely prove unsuitable. This does not mean, however, that such models cannot provide valuable insights. Poolman et al. (2000), using an enzyme-level model of the Calvin cycle, found that assimilation flux was unexpectedly sensitive to sedoheptulose bisphosphatase activity. In a massive simulation experiment of segmentation in Drosophila development, the Ingeneue system showed that the gene network was surprisingly robust despite variations in parameters, once a stable network configuration was identified (von Dassow et al., 2000).
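The difficulty with very low molecule numbers can be illustrated with a stochastic simulation, in which individual reaction events are drawn at random instead of being described by continuous concentrations. The sketch below applies the Gillespie stochastic simulation algorithm, a technique not discussed in the sources cited here, to a hypothetical molecule produced at a constant rate and degraded in proportion to its copy number. All names and rate constants are assumptions.

```python
import random


def gillespie_birth_death(k_prod, k_deg, n0, t_end, seed=0):
    """Simulate production (rate k_prod) and degradation (rate k_deg * n).

    Returns event times and the integer copy number after each event.
    """
    rng = random.Random(seed)
    t, n = 0.0, n0
    times, counts = [0.0], [n0]
    while True:
        a_prod = k_prod
        a_deg = k_deg * n
        a_total = a_prod + a_deg
        if a_total == 0.0:
            break  # nothing can happen any more
        t += rng.expovariate(a_total)  # waiting time to the next event
        if t >= t_end:
            break
        if rng.random() * a_total < a_prod:
            n += 1  # one molecule produced
        else:
            n -= 1  # one molecule degraded
        times.append(t)
        counts.append(n)
    return times, counts


def time_average(times, counts, t_end):
    """Time-weighted mean copy number over the interval [0, t_end]."""
    total = 0.0
    for i, n in enumerate(counts):
        t_next = times[i + 1] if i + 1 < len(times) else t_end
        total += n * (t_next - times[i])
    return total / t_end


# Hypothetical rates giving a deterministic steady state of k_prod / k_deg = 2 copies.
times, counts = gillespie_birth_death(k_prod=2.0, k_deg=1.0, n0=0, t_end=2000.0, seed=1)
mean_copies = time_average(times, counts, 2000.0)
```

A deterministic rate equation would report a steady level of two copies per cell, whereas the stochastic trajectory moves in whole-molecule steps and repeatedly visits zero, behavior that a concentration-based model cannot represent.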


    PRACTICAL ISSUES IN GENE-BASED MODELING
In examining gene-based approaches for crop modeling, there is the underlying question of general feasibility. Although much depends on scientific issues, there are also practical concerns such as how to access genomic data, which crop species to focus on initially, which traits might be easiest to model, and what scale of processes to model. We examine such issues below.

How to Access Genetic and Molecular Data?
Geneticists and molecular biologists maintain public databases on genes, alleles, gene sequences, and related information, which are readily accessible through the Internet (Ouellette, 1998; Walsh et al., 1998; Stuber et al., 1999). Physiologists and modelers, however, may find these databases less useful than expected. The user interfaces assume familiarity with bioinformatics. Databases of gene sequences and protein structure lack information on actual gene function (although they often list relevant publications). The number of lines or cultivars characterized for a given gene is usually limited to the parents used in describing the gene, and few data on field performance are found. The Arabidopsis Information Resource (2000) purports to provide more phenotypic data for Arabidopsis, but still falls short of meeting the requirements of whole-plant models.

The International Crop Information System (ICIS), which is being developed by a network of international agricultural centers, universities, and other institutions, represents one effort to fill this information gap (ICIS Network, 2000). The ICIS can store detailed field data as well as results from molecular marker work. There is a prototype utility for exporting model-ready files (Lieshout et al., 2001), which use the formats promoted by the International Consortium for Agricultural Systems Applications (ICASA; Hunt et al., 2001), and the pedigree analysis tools might be used to infer genetic relations where genetic characterizations are incomplete. Genetic resource and crop genomic databases such as GRIN (USDA-ARS National Plant Germplasm System, 2001) and GrainGenes (USDA-ARS Plant Genome Research Program, 2002) as well as breeding and crop registration records are also potential sources of phenotypic, genetic, and pedigree data.

Which Species?
Geneticists emphasize the benefits of working with model organisms such as Escherichia coli and Drosophila melanogaster. Arabidopsis thaliana was chosen as a model plant species because of its small plant size, rapid growth and development, large number of mutants, and very small genome (Tables 1 and 2). In December 2000, the 125 megabase genome was fully sequenced and found to contain about 25 500 genes (Arabidopsis Genome Initiative, 2000). For traits such as time of flowering, there is considerable information available on gene function (e.g., Coupland, 1995; Koornneef et al., 1998; Sheldon et al., 1999) that modelers may find useful. Thus, one obvious target for gene-based modeling is Arabidopsis.

Among economically important species, rice has attracted attention because of its role as a food crop, its having a genome that is small relative to those of wheat and maize (Table 2), and the fact that an active biotechnology network exists for the crop (Fischer et al., 2000). For legumes, common bean and soybean are possibilities, based on the availability of genetic and phenotypic data, but barrel medic (Medicago truncatula Gaertner) is being promoted as a model legume species for genomic studies (Medicago truncatula Consortium, 2001). Comparisons among genomes of different crop species reveal high levels of similarity (e.g., Devos and Gale, 1997; Weeden et al., 1992), and it appears likely that models of gene action in one crop can be extrapolated to other crops in the same botanical family (e.g., among legumes or among cereals). However, Helentjaris and Briggs (1998) noted that efforts to identify maize homologs for genes described in other species have proven more difficult than originally anticipated. One problem is that a single species may have multiple genes with similar sequences but different functions. In barrel medic roots, eight different chitinases are variously involved in mycorrhizal associations, nodulation, and plant–pathogen interactions (Salzer et al., 2000).

Which Traits?
Practical applications often oblige crop modelers to emphasize simulation of economic yield. Ideally, gene-based approaches should first focus on traits that have a relatively simple inheritance and can be measured accurately, yet are of economic importance. Phenology probably offers the best compromise among these criteria. GeneGro Version 1 predicted phenology much more accurately than it predicted grain size, harvest index, or seed yield (Hoogenboom et al., 1997). For many crops, a sufficient number of major genes for phenology are known (e.g., Table 4 for bread wheat), but an effort is needed to increase the number of breeding lines and cultivars characterized.


Table 4. Examples of genes affecting phenology of bread wheat.

 
Morphological traits such as determinacy, leaf size and thickness, plant height, and grain size offer other alternatives. They are readily measured, often have important effects on plant growth, and sometimes show simple genetic control. A challenge, however, is to identify an appropriate level of analysis. Yin et al. (1999) showed that accounting for growth stage effects resulted in more meaningful QTL analyses for specific leaf area in barley. Arguably, however, this leaf trait should be dissected into additional components such as physical thickness and the relative partitioning between cell wall and cell contents, given that values for specific leaf area reflect the integrated effects of many processes related to leaf area expansion, thickening, and accumulation of cell contents.

A third set of traits worth considering is that involved in stress responses that show well-defined molecular responses. These include responses to cold (Thomashow, 1999), heat (Hall, 1992; Nakamoto and Hiyama, 1999), hypoxia and anoxia (Drew, 1997), and dehydration (Ingram and Bartels, 1996). While allowing precise control of plant response and gene expression, specific stress responses may largely be survival mechanisms. Thus, whereas their study could improve the simulation of plant survival, the results might prove harder to relate to the simulation of basic processes of growth and partitioning. Highly heritable quality traits (e.g., fatty acid or storage protein composition) also merit consideration.

What Scale and Level of Detail?
A crop can be analyzed for processes at various scales: community, population, plant, organ, tissue, cell, and downward to molecular levels. Furthermore, at any of these scales, the level of process detail may vary. Thornley (1976) and Monteith (1996) have argued that models should not describe too many levels of process scale. Simulating gene action in a model that predicts crop growth might present an unacceptable breadth of scales. Note, however, that this call for simplicity appears to stem mainly from concerns about ease of comprehension and computational requirements.

In constructing simulation models, researchers early noted that growth regulator systems are essentially communication mechanisms and there was no need to model the signal transmission mechanism. De Wit and Penning de Vries (1983) noted that "these systems may be simulated themselves, but there is no reason to do so in crop models." Thus, one also might argue that, for many gene systems, a detailed representation of signaling mechanisms is not required for effective modeling.

There is no a priori reason to expect that a model that treats certain processes at a molecular scale and others at the organ or plant level will be inherently less accurate than a model covering a narrower range of scales. If a molecular-scale approach represents a process more accurately, then simulation of processes at higher levels of integration may also benefit. Reinforcing the argument for using multiple scales as dictated by necessity is the expectation that gene-based approaches will reduce uncertainty over genetic differences in plant growth and development. Current use of cultivar-specific parameters allows uncertainties in models to be obscured by over-calibration of these parameters.

We conclude that while an increase in model complexity per se does not guarantee improved accuracy, modeling processes at the genetic level is justified if it improves representation of essential processes. If excessive complexity or execution time affects model use and maintenance, then the model structure should be revised based on criteria for software engineering as well as biology. However, breadth of model scale should not be limited merely as a matter of principle.

How Relevant are Results from Animal Systems to Plant Biology?
Given the vast investments in functional genomics for biomedical use, one might query how applicable research on animals is to crop species. At a molecular level, it appears that the control of development in plants and animals differs substantially. Meyerowitz (1999) noted that since plants and animals diverged evolutionarily at the unicellular stage, one expects more commonality for intracellular processes than in processes requiring intercellular communication. Kaplan and Cooke (1997) concluded that long-assumed parallels between animal and plant embryogenesis are superficial and indicate little about underlying genetic control. Recent analyses of the genome of Arabidopsis support these views: mechanisms of development and signal transduction may bear superficial similarities to animal systems but differ in many fundamental aspects (Arabidopsis Genome Initiative, 2000).

There also appears to be a qualitative difference between biomedical and agricultural research. Biomedical research often focuses on diseases that show qualitative genetic effects. Distinctive mutants are well known in plants, such as for chlorosis and dwarfing (e.g., Sheridan, 1988), and their effects may partially be the basis of heterosis (Crow, 1999). However, information on specific mutants may prove of limited use in modeling crop processes, where the focus usually is on integrating effects of multiple quantitative traits.

How to Ensure Effective Collaboration among Crop Modelers, Geneticists, and Molecular Biologists?
Creating a new generation of simulation models for plant growth and development will require developers to harness a much broader range of expertise than in the past. To form effective teams, modelers will first have to convince geneticists and molecular biologists that quantitative, dynamic models can help guide research and more rapidly improve scientific understanding and lead to practical applications of genomics. Two initiatives that merit attention are the International Functional Genomics Working Group (Fischer et al., 2000) and the NSF-funded 2010 Project (Chory et al., 2000).

More effective data management will also be paramount. The genomics community is accustomed to open access to vast datasets through the Internet (Baxevanis and Ouellette, 1998). Data exchange standards are well established in genomics research (Ouellette, 1998) and play a major role in stimulating collaboration. In crop modeling, the ICASA standards (Hunt et al., 2001) provide parallels, but their use is far from universal.


    CONCLUSIONS
Vast amounts of information on gene sequences and functions are becoming available, and we can expect substantial advances in our ability to model plant growth and development in coming years. Software programs now exist that can model systems with hundreds of molecular components, representing compartmentalization as found at the cellular level. However, scaling these systems up to multicellular systems, where thousands of genes and gene products are involved, will require computing abilities well beyond those available now. This approach may also break down when attempting to represent chemical reactions where certain components are present at concentrations of only one or two molecules per cell.

Although work at a detailed molecular level is justifiable to improve our overall understanding of biological systems, other approaches may bring more immediate benefits to agriculture. In models where representation of cultivar differences is of interest (such as for phenology and grain size), genotypic differences can be represented through linear effects on model parameters or simplified representations of gene action. Our growing understanding of the gene networks involved in processes such as flowering and leaf development can further strengthen such modeling approaches. Evidence of temporal and tissue variation in gene expression may prove especially useful. For such research, modelers will require better access to quality data on field performance and genotypes. This seems an opportune time for the modeling community to create a central data management system, perhaps focusing first on two or three crops where genetic information is most readily available.

The question remains, however, of how to proceed in filling the vast middle ground between models built up from descriptions of gene sequences and the comparatively simple crop simulation models now in use. Modeling subprocesses such as photosynthesis, respiration, floral development, or responses to specific stresses offers opportunities for exploring this middle region. Various simulation environments are available for modeling complex biochemical networks, but improvements are needed to allow for a varying external environment and to represent cell growth and division. Increasingly, such research will require larger teams with a broader disciplinary representation than is found in traditional crop modeling, presenting challenges for research funding and management.


    APPENDIX
Glossary of Selected Terms in Plant Genetics and Genomics
base pair. The pairings of the bases adenine with thymine and cytosine with guanine that occur in a DNA molecule.

chaperone. A protein that assists the folding or assembly of another protein.

chromosome. The discrete bodies that carry genes coded in DNA.

codon. A triplet of nucleotides that codes for an amino acid or represents a termination point.

DNA. "Deoxyribonucleic acid." The molecule that encodes genetic information in cells as base pairs and is the principal component of chromosomes.

enhancer. A region on a DNA molecule located upstream of a gene and that increases the functioning of a promoter.

EST. "Expressed sequence tag." Partial sequences (e.g., 300–500 base pairs) of DNA derived from an mRNA and thus representing an expressed gene.

gene. The fundamental unit of inheritance. Alternatively, the segment of DNA that codes for a diffusible product.

genomics. The study of the entire gene complement of an organism.

heterozygous. Having different alleles for a given gene.

homozygous. Having the same alleles for a given gene.

isozymes. Enzymes that have the same function but are chemically distinct.

locus. A specific location on a chromosome. Its usage is often restricted to positions of genes.

pleiotropic. Applied to a gene that affects more than one trait (i.e., has multiple effects).

polypeptide. A chain of amino acids joined by peptide bonds.

polyploid. Having more than two sets of chromosomes.

promoter. A region on a DNA molecule located upstream of a gene and involved in binding of an RNA polymerase to initiate transcription.

proteomics. The study of the entire protein complement of an organism.

QTL. "Quantitative trait loci." Loci linked to a quantitative trait. The loci are usually identified with molecular markers.

RAPD. "Randomly amplified polymorphic DNA." A molecular marker system based on using short, random sequences of nucleotides as primers for amplifying selected parts of a genome.

regulatory gene. A gene that codes for an RNA or polypeptide that has a regulatory function in gene transcription.

repressor. A protein that binds to DNA or RNA to prevent transcription or translation.

RFLP. "Restriction fragment length polymorphism." A genetic polymorphism where differences are based on lengths of DNA fragments obtained by cutting DNA with enzymes that only act at specific base pair sequences.

ribosome. Complex organelle that translates information from an mRNA and guides the polymerization of amino acids to form a polypeptide.

RNA. "Ribonucleic acid." A ribose-containing nucleic acid that has the bases adenine, guanine, uracil, and cytosine. In plants, it exists as three main types. Messenger RNA (mRNA) carries the coded information for a polypeptide from the DNA to the site of synthesis. Transfer RNA (tRNA) binds single amino acids and transfers them to the ribosome for synthesis. Ribosomal RNA (rRNA) is a major structural component of ribosomes.

structural gene. A gene that codes for an RNA or component of a protein that is not a regulator.

transcript. An RNA copy of a segment of DNA.

transcription. Synthesis of RNA based on the base sequence of a DNA strand.

transcription factor. A protein that is required for initiation of transcription but that is not part of the RNA polymerase per se.


    ACKNOWLEDGMENTS
 
The authors thank Jim Jones for early encouragement to pursue gene-based approaches, Tony Hunt for helpful discussions in the elaboration of the manuscript, and Mike Listman for editorial assistance.


    REFERENCES