Maximum likelihood analysis of quantitative trait loci under selective genotyping

Maximum likelihood analysis of quantitative trait loci under selective genotyping


Play all audios:


ABSTRACT Selective genotyping is a cost-saving strategy in mapping quantitative trait loci (QTLs). When the proportion of individuals selected for genotyping is low, the majority of the


individuals are not genotyped, but their phenotypic values, if available, are still included in the data analysis to correct the bias in parameter estimation. These ungenotyped individuals


do not contribute much information about linkage analysis and their inclusion can substantially increase the computational burden. For multiple trait analysis, ungenotyped individuals may


not have a full array of phenotypic measurements. In this case, unbiased estimation of QTL effects using current methods seems to be impossible. In this study, we develop a maximum


likelihood method of QTL mapping under selective genotyping using only the phenotypic values of genotyped individuals. Compared with the full data analysis (using all phenotypic values), the


proposed method performs well. We derive an expectation–maximization (EM) algorithm that appears to be a simple modification of the existing EM algorithm for standard interval mapping. The


new method can be readily incorporated into a standard QTL mapping software, e.g. MAPMAKER. A general recommendation is that whenever full data analysis is possible, the full maximum


likelihood analysis should be performed. If it is impossible to analyse the full data, e.g. sample sizes are too large, phenotypic values of ungenotyped individuals are missing or composite


interval mapping is to be performed, the proposed method can be applied. SIMILAR CONTENT BEING VIEWED BY OTHERS POPULATION SIZE IN QTL DETECTION USING QUANTILE REGRESSION IN GENOME-WIDE


ASSOCIATION STUDIES Article Open access 13 June 2023 LEVERAGING INFORMATION BETWEEN MULTIPLE POPULATION GROUPS AND TRAITS IMPROVES FINE-MAPPING RESOLUTION Article Open access 10 November


2023 THE FLASHFM APPROACH FOR FINE-MAPPING MULTIPLE QUANTITATIVE TRAITS Article Open access 22 October 2021 INTRODUCTION Statistical analysis of quantitative trait loci requires both the


phenotypic data and marker genotypes of individuals sampled from a reference population. It is generally believed that a large sample size is required to map QTLs with small effects.


However, obtaining a large sample can be very costly or even impossible. Usually, the cost of genotyping is higher than that of the phenotypic measurement. Lander & Botstein (1989)


showed that one can selectively genotype individuals from the extremes of the phenotypic distribution, yet receive almost identical power as when the whole sample is genotyped. If the cost


of the phenotypic measurement is low, selective genotyping can significantly reduce the cost. This selective genotyping technique has been widely utilized in QTL mapping experiments (e.g.


Groover et al., 1994). Under selective genotyping, phenotypic values of ungenotyped individuals still have to be included in the analysis, with their marker genotypes treated as missing


values, otherwise estimates of QTL effects will be biased (Lander & Botstein, 1989). A full likelihood function is given by Muranty & Goffinet (1997a). Exact maximum likelihood


estimates (MLEs) can be achieved via an iterative approach. However, Muranty & Goffinet (1997a) derive approximate MLEs under the assumption that QTL effects are small relative to the


residual standard deviation. Recently, Johnson et al. (1999) proposed an expectation–maximization (EM) algorithm implemented via Monte Carlo sampling for handling missing marker genotypes.


With phenotypic values of ungenotyped individuals excluded from the data analysis, Darvasi & Soller (1992) investigated an analysis of variance (ANOVA) approach to estimate QTL effects


with a bias correction. Without providing a detailed implementation, they also proposed a maximum likelihood approach for such a truncated data analysis. Muranty & Goffinet (1997b)


extended their selective genotyping to multiple trait QTL mapping, showing that selection on one trait can increase the power of QTL detection for a correlated trait. They also proposed a


selection index method for multiple trait selective genotyping. Instead of selecting the two tails of a single trait, they first established a selection index combining phenotypic values of


all traits, and then selected the two extremes on the scale of the index. Again, phenotypic values of ungenotyped individuals must be included in the data analysis to remove the bias in the


estimated QTL effect. In reality, different traits may be expressed in different stages. If selection is performed on an earlier displayed trait, individuals that fail to reach the criterion


of selection in this stage may be removed, and thus do not have the opportunity to express their phenotype for a later trait. In this case, unbiased estimates of QTL effects for the later


trait seem to be impossible based on the method of Muranty & Goffinet (1997a,b). Therefore, a new method is needed to handle missing values for both the genotypes and the phenotypes.


Such a method is now available as a result of work by Henshall & Goddard (1999). They adopted an entirely different approach by altering the roles of genotypes and phenotypes in the


likelihood function. They treated phenotypes as independent variables and genotypes as dependent variables. Because genotypes are binary in a population with only two genotypes, they


utilized a standard logistic regression approach. The advantages of this method are: a standard statistical package, such as SAS, is readily applied and estimates are not affected by


selection of the phenotype. The second advantage is important in handling the problem of missing phenotypes. As recognized by the authors, the logistic regression method, however, has not


been sufficiently generalized to handle populations with more than two segregating genotypes, e.g. the F2 family. Furthermore, it is not clear how to implement the composite interval mapping


(Jansen & Stam, 1994; Zeng, 1994) in the logistic regression framework. The objectives of this study are to develop a maximum likelihood approach to QTL mapping using samples containing


only the genotyped individuals and to compare its efficiency relative to that using samples of all individuals. The maximum likelihood solution will be achieved via an EM algorithm that is


simple enough to be incorporated into any standard interval mapping software. THEORY AND METHODS SINGLE-TRAIT ANALYSIS Consider an F2 population of _N_ individuals with phenotypic values


measured for trait _y_. Among the _N_ individuals, only _n_ (_n_ ≤ _N_) of them are selectively genotyped, with _n_/2 being selected from the upper extreme and _n_/2 being selected from the


lower extreme in the scale of _y_. This selection regime can be viewed as ‘disruptive selection’ with two known artificial truncation points. Individuals are genotyped only if _y__j_ ≤ _t_1


or _y__j_ ≥ _t_2, where _t_1 and _t_2 are, respectively, the (_n_/2)th and (_N_ − _n_/2)th ascendingly ordered phenotypic values of _y_ among the _N_ individuals. In real data analyses, the


two tails selected for genotyping may not be symmetrical. The two truncation points, _t_1 and _t_2, are not calculated from the distribution; rather, they take the largest phenotypic value


in the lower tail and the smallest phenotypic value in the upper tail. The phenotypic value of the _j_th individual is described by the following linear model: where _a_ and _d_ are the


additive and dominance effects of a QTL, respectively, _z__j_ and _w__j_ are indicator variables for the genotype of the QTL, which are defined as: where _Q__k__Q__l_ for _k_ ≤ _l_=1,2


denotes the QTL genotype, and _e__j_ is the residual effect distributed as N(0, σ2). For notational simplicity, let us define B=[μ, _a_, _d_]T and X_j_=[1, _z__j_, _w__j_], and rewrite model


(1) by: where X_j_=U_kl_ for genotype _Q__k_Q_l_ and This is a typical regression model. Because the QTL genotype is not observable but inferred from marker information, only the


conditional distribution of X_j_ given the marker genotype is available. Define the conditional probabilities of the QTL genotype, and thus X_j_, by _p_(X_j_) and the probability density of


_y__j_ given X_j_ by: The likelihood function is rewritten as: where _S_={U11, U12, U22}. Note that the first _n_ individuals are genotyped and the last _N_ − _n_ individuals are


ungenotyped, and _p_(X, _j_) takes its prior value for _j_=_n_ + 1, …, _N_. The phenotypic values of ungenotyped individuals contribute very little information to linkage analysis. Their


inclusion in the likelihood serves solely as a way to correct the bias in estimation of the QTL effects caused by selective genotyping. When the number of ungenotyed individuals becomes very


large, the maximum likelihood method implemented this way is computationally inefficient. An alternative way to construct the likelihood function is to include only the phenotypic values of


the genotyped individuals while still taking into account the selection process. The likelihood function then becomes: where: The denominator of _g_(_y__j_|X_j_) is Pr[(_y__j_ ≤ _t_1) ⋃


(_y__j_ ≥ _t_2)| X_j_], the conditional probability that the _j_th individual is selected for genotyping given X_j_, where Φ(τ1|X_j_)=Pr(_y__j_ ≤ _t_1|X_j_) and Φ(τ2|X_j_)=Pr(_y__j_ ≤


_t_2|X_j_) are standardized normal functions, and τ1|X_j_=(_t_1 − X_j_B)/σ and τ2|X_j_= (_t_2 − X_j_B)/σ are the standardized truncation points. The likelihood function can be searched via


an EM algorithm that is described below. In the E step, the conditional posterior distribution of X_j_ is obtained using initial values of B and σ2, The posterior distribution is then used


to calculate the expectations of various quantities that involve X_j_. In the M step, we estimate the parameters based on the following equations: and where φ(τ1|X_j_) and φ(τ2|X_j_) denote


the standardized normal densities, different from Φ(τ1|X_j_) and Φ(τ2|X_j_). The notation E_x_ stands for expectation with respect to X_j_, the missing genotype. The initial values of


parameters are then replaced by B^ and _σ^_2, forming a new cycle of iteration. After convergence, B^ and _σ^_2 will be the MLEs of B and σ2. Note that the terms involving σ and σ2 in the


right hand sides of eqns (7) and (8) are because of selective genotyping. Without selection, these terms will vanish and the EM equations will reduce to the standard ones (Zeng, 1994). In


the simple model described in this study, only one non-QTL effect, μ, is included in the model. If the model includes many covariates, as seen in composite interval mapping, the ECM approach


should be adopted (Jiang & Zeng, 1995). The derivations of B^ and _σ^_2 are given in the Appendix. MULTIPLE-TRAIT ANALYSIS Let us define a 1 × _m_ vector for _m_ traits measured in the


_j_th individual by Y_j_=[_y__j_1, _y__j_2, …, _y__jm_]. The multivariate linear model is expressed by: where X_j_=[1, _z__j_, _w__j_] remains the same as in the single-trait model, and is a


1 × _m_ vector for the residuals with a multivariate normal distribution, i.e. E_j_ ≈ N_m_(0, V), where: Assume that the criterion of selection is a linear combination of all traits, called


the selection index and denoted by _I__j_=∑_m__k_=1_c__k__y__kj_=Y_j_C. The selection index is a generalized criterion of selection. If _c_1=1 and _c__j_≠1=0, then the index becomes the


phenotypic value of the first trait. The score of the selection index can be similarly partitioned into a genetic and residual component: where B_I_=∑_m__k_=1_c__k_B_k_=BC and


_e__jI_=∑_m__k_=1_c__k__e__jk_=E_j_C. The expectation and variance of _I__j_ are E(_I__j_|X_j_)=X_j_BC and Var(_I__j_|X_j_)=σ2_I_=CTVar(E_j_)C=CTVC, respectively. The two truncation points


of selection in the scale of the index are defined, again, by _t_1 and _t_2, respectively. The probability density of Y_j_ without selection is: After truncation selection on index _I__j_,


the joint density becomes _g_(Y_j_|X_j_) = [_f_(Y_j_|X_j_)]/[1 + Φ(τ2|X_j_) − Φ(τ2|X_j_), where the denominator is the probability that the _j_th individual is selected for genotyping, i.e.


1 + Φ ( τ 1 | x j ) - Φ ( τ 2 | x j ) = 1 - ∫ ∫ t 1 < y j c < t 2 f ( y j | x j ) d y j = 1 - ∫ t 1 t 2 f ( I j | x j ) d I j , where τ1|X_j_ = (_t_1 − X_j_BC)/σ_I_ and τ2|X_j_ = (_t_1


− X_j_BC)/σ_I_ are the standardized truncation points in the scale of the index. The likelihood function appears the same as eqn (4). Again, the MLEs can be obtained by using an EM


algorithm, which requires first calculating the posterior distribution of X_j_ and then maximizing the expectation of the log likelihood. The EM equations are given as follows: and where


Β=(1/σ2_I_)[σ1_I_, σ2_I_, …, σ_mI_], is a 1 × _m_ vector for the simple regression coefficients of the traits on the index. Note that the multivariate EM equations are simple extensions of


the univariate EM by multiplying Β and ΒTΒ by the appropriate terms in eqns (7) and (8). Again, eqns (12) and (13) will reduce to the standard ones (Jiang & Zeng, 1995) under random


selection. STATISTICAL POWER UNDER SELECTIVE GENOTYPING It is difficult to evaluate the power of QTL mapping when a genome-wide chromosomal scanning is performed because the distribution of


the test statistic under either hypothesis (null or alternative) is unknown. The usual practice is to evaluate the power under the assumption that the position of the QTL is known so that


only point-wise test statistics are considered (Muranty, 1996). The distribution of a point-wise test statistic is usually known, at least asymptotically. Although the power calculated this


way cannot be applied to a whole genome-wide analysis, it may be used to compare relative efficiencies of different methods. It is certainly appropriate to use this power to evaluate mapping


procedures under the candidate gene approach. Theoretical work has been conducted for systems with two contrasting genotypes in the segregating population, e.g. backcrosses or half-sibs


(Darvasi & Soller, 1992). In this study, we evaluate the statistical power of QTL detection for a single-trait model in systems with three possible genotypes, e.g. F2 families, under the


assumption that the trait is controlled by a single QTL whose genotype is observable. Throughout the discussion, we will emphasize the difference in power between QTL detection with and


without selective genotyping. Power calculation without selective genotyping has been extensively investigated by researchers (e.g. Soller & Brody, 1976 and Muranty, 1996). Denote the


general linear model in matrix notation by: The null hypothesis is H0 : _a_=_d_=0, which is expressed in matrix notation by H0 : KB=0, where: The generalized likelihood ratio test statistic


(Graybill, 1976) for testing this hypothesis is: where _q_=2 is the rank of K, _p_=3 is the number of parameters in the full model, σ ^ Ω 1 2 and σ ^ Ω 0 2 are the residual variances


estimated from the full model and the reduced model (KB=0), respectively. Graybill (1976) showed that λ follows a noncentral _F_ distribution denoted by _F_(λ : _q_, _n_ − _p_, δ), where δ


is the noncentrality parameter given by: Muranty (1996) called λ the _F_-test statistic because of the nature of _F_ distribution. In genetic studies, a different sample will involve a


different X because a completely different segregation process will occur for a different experiment. When the sample size is not too small, however, XTX will be fairly constant from sample


to sample. Therefore, we can substitute XTX by its expectation. Defining and we have Substituting XTX by E(XTX) and after some algebraic manipulation, we get: where σ2G=_a_2/2 + _d_2 is the


total genetic variance. The statistical power is then calculated as: where 1 − Ψ is the Type II error and _F_−1(1 − α : _q_, _n_ − _p_, 0) is the critical value for testing H0 at a Type I


error rate of α. Under selective genotyping, the exact form of the distribution of the likelihood-ratio test statistic is unknown. To derive the power under selective genotyping, we must


assume that the test statistic still follows a noncentral _F_ distribution but with a different noncentrality parameter. This approximation is valid when the selection intensity is weak or


the QTL in question has a small effect. In fact, Darvasi & Soller (1992) have already made this approximation when calculating the number of genotyped individuals required to achieve a


given power under an additive effect model in a backcross design. Selective genotyping will change the conditional distribution of _y__j_ given its genotype and the frequencies of the three


genotypes in the mapping population. These changes will eventually modify _a_, _d_, σ2 and E(XTX), leading to an increase in the noncentrality parameter and thus an increase in the power.


Let us denote the phenotypic value in the selected population by _y_*. Using the theory of truncated selection (Cohen, 1991), we found that the conditional expectation and variance of _y_*


given genotype _Q__k__Q__l_ are: and respectively, where τ1|U_kl_=(_t_1 − U_kl_B)/σ for _k_ ≤ _l_=1,2. Let us now define the probability that an individual with genotype _Q__k__Q__l_ is


selected for genotyping by _q__kl_=1 + Φ(τ1|U_kl_) − Φ(τ2|U_kl_). According to Bayes’ theorem, the frequency of genotype _Q__k__Q__l_ in the selected population can be defined by _p__kl_=[(1


+ |_k_ − _l_|)_q__kl_]/(_q_11 + 2_q_12 + _q_22) for _k_ ≤ _l_=1, 2. The modified additive and dominance effects after selective genotyping become: and respectively. The altered residual


variance takes the weighted average of the within-genotype residual variances, i.e. The modified E(XTX) resulting from selective genotyping is: where and Denote B*=[μ* _a_* _d_*]T as the


vector of parameters after selection, then the noncentrality parameter under selective genotyping is Subsequently, the statistical power under selective genotyping is calculated using eqn


(18) but with the noncentrality parameter replaced by δ*. ILLUSTRATION In this section we demonstrate the application of the method using simulated data and show the general behaviour of the


method that one expects to observe in QTL mapping experiments. SINGLE-TRAIT QTL MAPPING In the first simulation study, we assumed that a single QTL is located at position 25 cM of a 100-cM


chromosome segment covered by 11 evenly spaced codominant markers. The size of the QTL (measured by the percentage of phenotypic variance explained by the QTL) is 0.05. The actual genetic


effects that generate such a QTL are _a_=0.229 and _d_=0.162. In an F2 population, these genetic effects will make up a genetic variance of σ2G=_a_2/2 + _d_2=0.0525. The residual variance


was set at σ2=1.0, leading to _h_2=0.0525/(0.0525 + 1.0)=0.05. The number of individuals genotyped was fixed at 100. We then varied the total number of individuals measured for the phenotype


to control different levels of selection pressure. We set up four levels of proportion genotyped: 100%, 50%, 10% and 5%. The total numbers of phenotypically measured individuals


corresponding to the four proportions were 100, 200, 1000 and 2000, respectively. Under selective genotyping, three methods of QTL mapping were compared: (i) full data analysis (FULL) where


all phenotypic values, including ungenotyped individuals, were included in the data analysis with the marker genotypes of ungenotyped individuals treated as missing values; (ii) biased


analysis (BIAS) where only phenotypic values of genotyped individuals were included in the analysis with the likelihood function constructed as if there were no selection; and (iii) the true


method of selective genotyping (SELECT) proposed in this study where only genotyped individuals were included and the likelihood was constructed with correction for the bias. The QTL


location was estimated as the mean chromosomal position that shows the highest value of the test statistic. Each simulation was repeated 100 times. The mean and standard deviations of the


estimates are given in Table 1. Under random selection with the low variance explained by the QTL and the small sample size (_n_=100), estimation of the QTL position is not only severely


biased towards the centre of the chromosome but is also subject to a large estimation error. Estimates of the QTL effects and the residual variance are quite close to the expected values,


although with relatively large errors. With selective genotyping (SELECT), although the same numbers of individuals are included in the analysis, the bias in QTL position estimate has been


progressively corrected as the selection intensity increases; for instance, when the proportion selected is 5%, the estimation is almost unbiased with the estimation error reduced to


one-third of what is observed under random selection. Compared with the FULL method, the SELECT method has a slightly increased estimation error in the QTL position estimate. This indicates


that inclusion of the large number of ungenotyped individuals does provide a little information about linkage for a reason to be explained later. The BIAS method, using the same amount of


phenotypic information as the SELECT method, has almost identical estimation error of the QTL position as the SELECT method. Both the FULL and the SELECT methods provide estimates of the QTL


effects close to the expectations with similar estimation errors. The BIAS method, however, gives severely biased estimates of the QTL effects, because of the use of an incorrect likelihood


function. The residual variance is estimated very closely to the expectation by both the FULL and SELECT methods. However, estimate of the FULL method has decreased the already small


estimation error. This explains the slightly decreased estimation error of the QTL position by the FULL method. The BIAS method, again, gives a very biased estimate of the residual variance.


Finally, selective genotyping has increased the score of the test statistic up to threefold (see the last column of Table 2). MULTIPLE TRAIT QTL MAPPING In the second simulation study, we


investigated QTL mapping for two correlated traits under selective genotyping. The marker map remains the same as previously described. The first trait is controlled by one QTL at the same


location (25 cM) with the same effect as described in the previous experiment, i.e. _a_1=0.229 and _d_1= 0.162. The second trait is controlled by a QTL located at 85 cM with identical


effects, i.e. _a_2=0.229 and _d_2=0.162. The residual variances were set at σ21=σ22=1 and the residual covariance set at σ12=0.5. The selection criterion was _I__j_=_c_1_y__j_1 + _c_2_y__j_2


where _c_1=1 and _c_2=0, i.e. only the first trait was selected. A total of 2500 individuals were measured for phenotype, but 250 (10%) were selectively genotyped. The simulation was


replicated 50 times. Figure 1 gives the average likelihood ratio test statistic profiles under selective genotyping (10%). The solid (_T_1), dotted (_T_2) and dashed (_T_12) lines represent


the likelihood-ratio test statistic profiles for the first trait, the second trait and both traits (joint test), respectively. Note that the likelihood-ratio test statistic profiles


(functions of the _F_-test statistics) are used here. They are defined as: and where _σ^_2Ω is the estimated residual variance under model Ω which defines the linear model by the set of


parameters included in the model: We used _T_ instead of λ to depict the test statistic profiles because _T_ approximates a χ2_q_ distribution and thus bears the additive property, i.e.


_T_12=_T_1 + _T_2. Although the two traits have an identical genetic variance, the first trait has a substantially higher test statistic profile than the second one because the first trait


is directly selected for genotyping. As a comparison, we repeated the simulation under random selection, i.e. we generated 250 individuals and genotyped all of them (100%) for mapping. The


corresponding test statistic profiles are given in Fig. 1(b). Compared with random selection (Fig. 1b), the increase in the test statistic profile for the first trait (Fig. 1a) is obvious. A


slight increase in the test statistic profile for the second trait is also observed because of its correlation to the first trait. POWER UNDER SELECTIVE GENOTYPING As reported in this


section, we first calculated the predicted powers under various proportions of genotyped individuals using the theoretical formula given in eqn (18). We then conducted simulation experiments


to verify our theoretical prediction. The effects of the QTL were again set at _a_=0.229 and _d_=0.162. The number of individuals genotyped remained at _n_=100. We varied the total number


of individuals measured (_N_) to control the proportions selected for genotyping (see column 2 of Table 3). Because the population mean was set at μ = 0, the truncation points are


symmetrical, and thus _t_2=−_t_1, where the values of _t_1 were found by trial and error so that the theoretical proportions selected equal the predetermined proportions (see column 3 of


Table 3). The values of the noncentrality parameter are listed in column 4 of Table 3. The critical value for testing the hypothesis at a Type I error rate of α=0.05 is _F_−1(0.95 : 2, 97,


0)=3.09, which was used to calculate the theoretical powers (listed in column 5 of Table 3). We then simulated 1000 samples under each level of proportion and conducted QTL analysis for each


sampled data set. The empirical power under each setting was calculated as the proportion of the samples that have the _F-_like test statistic greater than 3.09. These empirical powers,


given in the last column of the table, are fairly close to the corresponding theoretical predictions. DISCUSSION When the phenotypic values of ungenotyped individuals are included in the


data analysis, standard methods with proper handling of missing markers are used (Lander & Botstein, 1989; Muranty & Goffinet, 1997a,b; Henshall & Goddard, 1999; Johnson et al.,


1999). A problem occurs if the number of ungenotyped individuals is large because of the increased computational burden; for example, if 10% of the test population is genotyped, to genotype


250 individuals, one needs to measure an additional 2250 individuals for their phenotypes. The total sample size will be 2500. Because the 2250 ungenotyped individuals contribute very little


to linkage analysis but serve as bias correctors, their phenotypic values do not have to be included in the analysis. These individuals, however, do contribute to the estimation of the


residual variance. The estimate of the residual variance usually has very small estimation error. When the number of individuals genotyped is small, however, the residual variance estimate


from only the genotyped individuals may not be sufficiently accurate. In this case, it is important to include the ungenotyped individuals. The methods described above (e.g. Muranty &


Goffinet, 1997a,bJohnson et al., 1999) are not the only ways to include the ungenotyped individuals. An alternative way is to ignore completely the genetic effects for the ungenotyped


individuals, partition the residual variance of an ungenotyped individual into a genetic and a pure environmental component, and use a mixed-model approach. This can be accomplished via the


following maximum likelihood analysis. Define the model for an ungenotyped individual by _y__j_=μ + _r__j_ for _j_=_n_ + 1, …, _N_, where _r__j_ is the residual effect with an N(0, σ2G + σ2)


distribution and σ2G=_a_2/2 + _d_2, as defined previously. The probability density of _y__j_ for the ungenotyped individual will be: The likelihood function including all individuals will


be: Note that ungenotyped individuals do not contribute to the estimation of B except μ, but they are used to estimate σ2G + σ2. The MLE may be directly searched or obtained via an EM


algorithm. In either way, the speed of convergence may be faster than the methods that treat X_j_ as missing values because there is no need to update _p_(X_j_) for an ungenotyped


individual. Further investigation is required to explore the properties of this alternative approach. It is not hard to imagine that ungenotyped individuals may not have a full measurement


of phenotypic values. This may occur, for example, in QTL mapping for the trait of flowering time. An investigator may choose to visit the field for the first few days when the population of


plants begins to flower and the last few days when the population approaches the end of the flowering season. In this case, plants that flower in the middle of the season may not have a


record of phenotype. Another example comes from multiple trait analysis in forest trees. One may decide to select early growth rate for QTL mapping, but later the investigator may want to


map QTLs for later growth rate as well. If selection is on the early trait, because of limited space, the investigator may not keep the culled individuals in the field. Then the final


population would be a selected population with regard to the later growth rate. The maximum likelihood analysis proposed in this study is the proper tool for handling such centrally


truncated data. It is undesirable to use only one tail of the trait distribution to carry out QTL mapping because the total variance of the trait is artificially deflated. However, if the


data happen to be single-tail truncated for some technical reasons, the proposed method can readily be applied for correcting the bias. A typical example of single-tail truncation can be


seen in artificial selection of plant and animal breeding. Another example may come from longitudinal data analysis where the phenotypic value of an individual depends on it longevity. Only


surviving individuals have a complete measurement of phenotype, whereas individuals not surviving only have partial information; for example, the yearly egg production of a chicken strongly


depends on the viability of the chicken. If a chicken dies in the middle of the year, we do not know the phenotypic record of her yearly production, but we do know that her yearly egg


production is greater than the current production in her record at the time when she dies. An unbiased analysis must be performed by taking into account these partial records. For


multiple-trait analysis, selective genotyping has been a problem because if all traits are deemed to be important to the researcher, which traits should be selected? The selection index


approach of Muranty & Goffinet (1997b) is a compromise between the traits. Because the selection criterion now becomes a single ‘trait’, it is easy to apply in practice. Lin &


Ritland (1997) suggested that an individual should be genotyped if at least one of _m_ traits exhibits the extreme value. Under this selection regime, different individuals seem to have


different criteria of selection; for example, if individual _j_ is selected because its _k_th phenotypic value is first observed as being extreme, then the criterion for _j_ is (_y__jk_ ≤


_t_1_k_) ⋃ (_y__jk_ ≥ _t_2_k_). On the other hand, if individual _i_ is selected because its _l_th phenotypic value is first observed as being extreme, then the criterion for _i_ is (_y__il_


≤ _t_1_l_) ⋃ (_y__il_ ≥ _t_2_l_). In both the index selection and the method of Lin & Ritland (1997), the selection criterion of each individual is a single trait (one-dimensional


selection), and thus the proposed method will apply. Another selection regime may be the so-called independent culling level selection where an individual will not be genotyped if any one of


the _m_ phenotypic values fails to reach the extreme. This selection regime is perhaps more rigorous than the previous two methods, but it is hard to programme because it is a multiple


dimensional selection (requiring multiple integration). Further study may be necessary to compare different selection regimes. Nonetheless, when phenotypic values of ungenotyped individuals


are included, methods of selection will be irrelevant to the statistical issue. Once selection is carried out on the phenotypic value of one trait (primary trait), QTL mapping for a highly


correlated trait (secondary trait) will also benefit. However, if the two traits are not correlated, the effective sample size in terms of the secondary trait will be comparable to a


_random_ sample of _n_, where _n_ is the number of genotyped individuals. Therefore, one should be cautious about the power of QTL mapping for traits less correlated to a highly selected


primary trait. An advantage of the logistic regression of Henshall & Goddard (1999) is that selection does not bias estimates of QTL effects, irrespective of whether phenotypic values of


ungenotyped individuals are included in the data analysis. This is because the roles of marker genotypes and the phenotypes in the likelihood function have been altered, just like the


discordant sib-pair mapping of Risch & Zhang (1995). Further investigation on the logistic regression, however, shows that selective genotyping can alter the estimation of the QTL


effect. The equivalence between logistic regression and the maximum likelihood holds only approximately when the effect of a QTL is small. This can be shown by looking at the posterior


probability of a QTL genotype given the phenotypic value of individual _j_: where _p_(X_j_) is the prior probability of the QTL genotype, independent of marker information, and


_g_(_y__j_|X_j_)=[_f_(_y__j_|X_j_)]/{Φ(τ1|X_j_)+[1 − Φ(τ2|X_j_)]}. However, the logistic regression model uses r ( x j ) = [ p ( x j ) f ( y j | x j ) ] / [ Σ x j ∈ S p ( x j ) f ( y j | x j


) ] , i.e. the term Φ(τ1|X_j_)+[1 − Φ(τ2|X_j_)] in the denominator of _g_(_y__j_|X_j_) has vanished. The exact maximum likelihood function should be built using _p_*(X_j_) instead of


_r_(X_j_). However, using _r_(X_j_) may still be justifiable because: (i) when the size of the QTL is small, Φ(τ1|X_j_) + [1 − Φ(τ2|X_j_)] can be considered as a constant across different


genotypes so that the corresponding terms in the numerator and denominator cancel each other out, leading to _p_*(X_j_) ≈ _r_(X_j_); and (ii) _r_(X_j_) is much easier to handle than


_p_*(X_j_) in the maximum likelihood analysis. In addition to the approximate nature of the logistic regression, there are two unsolved problems: (a) modification is required to map a QTL in


an F2 population; and (b) an exact interval mapping has not been available. An approximate interval mapping was accomplished via interpolation (Henshall & Goddard, 1999). Solving for


the first problem requires a multicategorical response model, e.g. models for nominal or ordinal responses (Fahrmeir & Tutz, 1994). The second problem involves missing QTL genotypes and


may be solved using the EM or MCMC algorithm of C. Vogl & S. Xu (unpublished results) for mapping viability loci. Quantitative trait loci mapping is usually performed after a marker map


is fully developed. If the trait under selection has a strong genetic component, selective genotyping can also cause distortion of the inferred marker map from the true one. The distortion


is reflected by the change in both the marker order and the distances between markers. Because the severity of marker map distortion depends on the sizes and locations of QTLs, marker


mapping and QTL mapping should be carried out concurrently under selective genotyping. One can apply the general idea of EM to concurrent mapping. To do this, one first maps QTLs under the


assumption that the marker map is known without error, and then corrects the marker map by taking into account the distortion caused by selective genotyping under the assumption that the


sizes and locations of QTLs are known. This completes one cycle of iteration, and the iteration should continue until a criterion of convergence is reached. The problem can be very


complicated, especially when the marker order is allowed to change. Many theoretical and practical problems may exist in concurrent mapping, and further investigation is deemed necessary.


Maximization of the likelihood function is not an easy task. Special algorithms and computer programs are required. We developed an EM algorithm that appears to be a simple modification of


the existing EM algorithm for standard interval mapping. As a result, it can be readily incorporated into a standard QTL mapping software, e.g. MAPMAKER (Lander & Botstein, 1989). One


caveat about the EM algorithm is that when the selection intensity is too high, the EM algorithm may take a very large number of iterations to converge and sometime may not converge at all.


This is not a problem of the EM algorithm itself, rather, it is caused by numerical overflow when the bias adjustment of the residual variance is conducted. Recall that we added an


additional term in eqn (8), σ2[(τ1|X_j_)φ(τ1|X_j_) − (τ2|X_j_)φ(τ2|X_j_)]/[1 + Φ(τ1|X_j_) − Φ(τ2|X_j_)], to the standard EM estimation of the residual variance. When the selection intensity


is high, this term can be numerically unstable. Proper handling of this numerical overflow is required which is, unfortunately, beyond our technical ability. We found that when the


proportion selected is 40% or more, the problem rarely happens. In our simulation studies, when numerical overflow occurred, the EM algorithm was replaced by the simplex algorithm (Nelder


& Mead, 1965) for direct search of MLEs. The simplex method is usually slower than the EM algorithm, but it can handle highly selected data. Another caveat is the sensitivity of the


proposed method to departure from normality. Because the likelihood function involves Φ(τ), in addition to φ(τ), we anticipate that the method is more sensitive to deviation from normality


than the methods using also the ungenotyped individuals. In conclusion, we developed an exact maximum likelihood approach to map QTLs under selective genotyping using phenotypic values of


genotyped individuals only. Compared with the full data analysis (using all phenotypic values), the proposed method performs well: the average test statistic is slightly lower; estimates of


QTL parameters are almost identical; and the estimate of residual variance is subject to a relatively large error. The slightly lower test statistic value may be caused by the relatively


large increase in the estimation error of the residual variance. A general recommendation is that whenever full data analysis is possible, the full maximum likelihood analysis should be


performed. If it is impossible or difficult to analyse the full data, e.g. the sample size is too large, the phenotypic values of ungenotyped individuals are missing or composite interval


mapping is to be performed, then the proposed method should be applied with the understanding that there is little to lose. REFERENCES * Cohen, A. C. (1991). _Truncated and Censored Samples


— Theory and Applications_. Marcel Dekker, New York. Book  Google Scholar  * Darvasi, A. and Soller, M. (1992). Selective genotyping for determination of linkage between a marker locus and a


quantitative trait locus. _Theor Appl Genet_, 85: 353–359. Article  CAS  PubMed  Google Scholar  * Fahrmeir, L. and Tutz, G. (1994). _Multivariate Statistical Modelling Based on Generalized


Linear Models_. Springer-Verlag, New York, NY. Book  Google Scholar  * Graybill, F. A. (1976). _Theory and Application of the Linear Model_. Duxbury Press, North Scituate, MA. Google


Scholar  * Groover, A., Devey, M., Fiddler, T., Lee, J., Megraw, R., Mitchell-Olds, T. _et al_ (1994). Identification of quantitative trait loci influencing wood specific gravity in an


outbred pedigree of loblolly pine. _Genetics_ 138: 1293–1300. CAS  PubMed  PubMed Central  Google Scholar  * Henshall, J. M. and Goddard, M. E. (1999). Multiple-trait mapping of quantitative


trait loci after selective genotyping using logistic regression. _Genetics_ 151: 885–894. CAS  PubMed  PubMed Central  Google Scholar  * Jansen, R. C. and Stam, P. (1994). High resolution


of quantitative traits into multiple loci via interval mapping. _Genetics_ 136: 1447–1455. CAS  PubMed  PubMed Central  Google Scholar  * Jiang, C. and Zeng, Z. B. (1995). Multiple trait


analysis of genetic mapping for quantitative trait loci. _Genetics_ 140: 1111–1127. CAS  PubMed  PubMed Central  Google Scholar  * Johnson, D. L., van Jansen, R. C. and Arendonk, A. M.


(1999). Mapping quantitative trait loci in a selectively genotyped outbred population using a mixture model approach. _Genet Res_, 73: 75–83. Article  Google Scholar  * Lander, E. S. and


Botstein, D. (1989). Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. _Genetics_ 121: 185–199. CAS  PubMed  PubMed Central  Google Scholar  * Lin, J. Z. and


Ritland, K. (1997). Quantitative trait loci differentiating the outbreeding _Mimulus guttatus_ from the inbreeding _M. platycalyx_. _Genetics_ 146: 1115–1121. CAS  PubMed  PubMed Central 


Google Scholar  * Muranty, H. (1996). Power of tests for quantitative trait loci detection using full-sib families in different schemes. _Heredity_ 76: 156–165. Article  Google Scholar  *


Muranty, H. and Goffinet, B. (1997a). Selective genotyping for location and estimation of the effect of a quantitative trait locus. _Biometrics_ 53: 629–643. Article  Google Scholar  *


Muranty, H. and Goffinet, B. (1997b). Multitrait and multipopulation QTL search using selective genotyping. _Genet Res_, 70: 259–265. Article  Google Scholar  * Nelder, J. A. and Mead, R.


(1965). A simplex method for function minimization. _Comput J_, 7: 308–313. Article  Google Scholar  * Risch, N. and Zhang, H. (1995). Extreme discordant sib pairs for mapping quantitative


trait loci in humans. _Science_ 268: 1584–1589. Article  CAS  PubMed  Google Scholar  * Soller, M. and Brody, T. (1976). On the power of experimental designs for the detection of linkage


between marker loci and quantitative trait loci in crosses between inbred lines. _Theor Appl Genet_, 47: 35–39. Article  CAS  PubMed  Google Scholar  * Zeng, Z. B. (1994). Precision mapping


of quantitative trait loci. _Genetics_ 136: 1457–1468. CAS  PubMed  PubMed Central  Google Scholar  Download references ACKNOWLEDGEMENTS Many thanks are due to Drs W. M. Muir and O.


Savolainen for their encouragement and support on the project. We thank Drs D. Gessler, J. Lin, R. Whitkus and M. Sillanpaa for their helpful comments on an earlier version of the


manuscript. We would also like to thank two anonymous reviewers for their criticisms and comments which have greatly improved the presentation of the manuscript. This research was supported


by the National Institutes of Health Grant GM55321 and the USDA National Research Initiative Competitive Grants Program 97–35205–5075 to S.X. AUTHOR INFORMATION Author notes * Claus Vogl


Present address: Department of Botany and Plant Sciences, University of California, Riverside, CA 92521, U.S.A. AUTHORS AND AFFILIATIONS * Department of Botany and Plant Sciences, University


of California, Riverside, 92521, CA, USA Shizhong Xu * Department of Biology, University of Oulu, PO Box 3000, Oulu, FIN-90401, Finland Claus Vogl Authors * Shizhong Xu View author


publications You can also search for this author inPubMed Google Scholar * Claus Vogl View author publications You can also search for this author inPubMed Google Scholar CORRESPONDING


AUTHOR Correspondence to Shizhong Xu. APPENDIX DERIVATION OF THE EM ALGORITHM FOR SINGLE-TRAIT ANALYSIS APPENDIX DERIVATION OF THE EM ALGORITHM FOR SINGLE-TRAIT ANALYSIS First, let us define


the log likelihood function by: The MLEs of B and σ2 are obtained by solving ∂_l_/∂B=0 and ∂_l_/∂σ2=0 simultaneously. Derivation of ∂l/∂B where In the above equation, the truncation points


have been standardized, i.e. τ1=(_t_1 − X_j_B)/σ and τ2=(_t_2 − X_j_B)/σ, where _t_1 and _t_2 are the truncation points in the original scale. Note that: Hence, Define


_p_*(X_j_)=[_p_(X_j_)_g_(_y__j_|X_j_)]/[ ∑ x j ε_S__ p_(X_j_)_g_(_y__j_|X_j_)] as the posterior distribution of X_j_. We now have: Solving for ∂_l_/∂B=0, we have: Derivation of ∂l/∂σ2 where


Note that Therefore, Finally, we have Solving for ∂_l_/∂σ2=0, we obtain RIGHTS AND PERMISSIONS Reprints and permissions ABOUT THIS ARTICLE CITE THIS ARTICLE Xu, S., Vogl, C. Maximum


likelihood analysis of quantitative trait loci under selective genotyping. _Heredity_ 84, 525–537 (2000). https://doi.org/10.1046/j.1365-2540.2000.00653.x Download citation * Received: 22


June 1999 * Accepted: 13 October 1999 * Published: 01 May 2000 * Issue Date: 01 May 2000 * DOI: https://doi.org/10.1046/j.1365-2540.2000.00653.x SHARE THIS ARTICLE Anyone you share the


following link with will be able to read this content: Get shareable link Sorry, a shareable link is not currently available for this article. Copy to clipboard Provided by the Springer


Nature SharedIt content-sharing initiative KEYWORDS * EM algorithm * QTL mapping * simplex algorithm * truncated selection