Maximum likelihood analysis of quantitative trait loci under selective genotyping

Nature

ABSTRACT Selective genotyping is a cost-saving strategy in mapping quantitative trait loci (QTLs). When the proportion of individuals selected for genotyping is low, the majority of the

individuals are not genotyped, but their phenotypic values, if available, are still included in the data analysis to correct the bias in parameter estimation. These ungenotyped individuals

do not contribute much information about linkage analysis and their inclusion can substantially increase the computational burden. For multiple trait analysis, ungenotyped individuals may

not have a full array of phenotypic measurements. In this case, unbiased estimation of QTL effects using current methods seems to be impossible. In this study, we develop a maximum

likelihood method of QTL mapping under selective genotyping using only the phenotypic values of genotyped individuals. Compared with the full data analysis (using all phenotypic values), the

proposed method performs well. We derive an expectation–maximization (EM) algorithm that appears to be a simple modification of the existing EM algorithm for standard interval mapping. The

new method can be readily incorporated into standard QTL mapping software, e.g. MAPMAKER. A general recommendation is that whenever full data analysis is possible, the full maximum

likelihood analysis should be performed. If it is impossible to analyse the full data, e.g. sample sizes are too large, phenotypic values of ungenotyped individuals are missing or composite

interval mapping is to be performed, the proposed method can be applied.

INTRODUCTION Statistical analysis of quantitative trait loci requires both the

phenotypic data and marker genotypes of individuals sampled from a reference population. It is generally believed that a large sample size is required to map QTLs with small effects.

However, obtaining a large sample can be very costly or even impossible. Usually, the cost of genotyping is higher than that of the phenotypic measurement. Lander & Botstein (1989)

showed that one can selectively genotype individuals from the extremes of the phenotypic distribution, yet retain almost the same power as when the whole sample is genotyped. If the cost

of the phenotypic measurement is low, selective genotyping can significantly reduce the cost. This selective genotyping technique has been widely utilized in QTL mapping experiments (e.g.

Groover et al., 1994). Under selective genotyping, phenotypic values of ungenotyped individuals still have to be included in the analysis, with their marker genotypes treated as missing

values, otherwise estimates of QTL effects will be biased (Lander & Botstein, 1989). A full likelihood function is given by Muranty & Goffinet (1997a). Exact maximum likelihood

estimates (MLEs) can be achieved via an iterative approach. However, Muranty & Goffinet (1997a) derive approximate MLEs under the assumption that QTL effects are small relative to the

residual standard deviation. Recently, Johnson et al. (1999) proposed an expectation–maximization (EM) algorithm implemented via Monte Carlo sampling for handling missing marker genotypes.

With phenotypic values of ungenotyped individuals excluded from the data analysis, Darvasi & Soller (1992) investigated an analysis of variance (ANOVA) approach to estimate QTL effects

with a bias correction. Without providing a detailed implementation, they also proposed a maximum likelihood approach for such a truncated data analysis. Muranty & Goffinet (1997b)

extended their selective genotyping to multiple trait QTL mapping, showing that selection on one trait can increase the power of QTL detection for a correlated trait. They also proposed a

selection index method for multiple trait selective genotyping. Instead of selecting the two tails of a single trait, they first established a selection index combining phenotypic values of

all traits, and then selected the two extremes on the scale of the index. Again, phenotypic values of ungenotyped individuals must be included in the data analysis to remove the bias in the

estimated QTL effect. In reality, different traits may be expressed in different stages. If selection is performed on an earlier displayed trait, individuals that fail to reach the criterion

of selection in this stage may be removed, and thus do not have the opportunity to express their phenotype for a later trait. In this case, unbiased estimates of QTL effects for the later

trait seem to be impossible based on the method of Muranty & Goffinet (1997a,b). Therefore, a new method is needed to handle missing values for both the genotypes and the phenotypes.

Such a method is now available as a result of work by Henshall & Goddard (1999). They adopted an entirely different approach by altering the roles of genotypes and phenotypes in the

likelihood function. They treated phenotypes as independent variables and genotypes as dependent variables. Because genotypes are binary in a population with only two genotypes, they

utilized a standard logistic regression approach. The advantages of this method are: a standard statistical package, such as SAS, is readily applied and estimates are not affected by

selection of the phenotype. The second advantage is important in handling the problem of missing phenotypes. As recognized by the authors, the logistic regression method, however, has not

been sufficiently generalized to handle populations with more than two segregating genotypes, e.g. the F2 family. Furthermore, it is not clear how to implement the composite interval mapping

(Jansen & Stam, 1994; Zeng, 1994) in the logistic regression framework. The objectives of this study are to develop a maximum likelihood approach to QTL mapping using samples containing

only the genotyped individuals and to compare its efficiency relative to that using samples of all individuals. The maximum likelihood solution will be achieved via an EM algorithm that is

simple enough to be incorporated into any standard interval mapping software. THEORY AND METHODS SINGLE-TRAIT ANALYSIS Consider an F2 population of _N_ individuals with phenotypic values

measured for trait _y_. Among the _N_ individuals, only _n_ (_n_ ≤ _N_) of them are selectively genotyped, with _n_/2 being selected from the upper extreme and _n_/2 being selected from the

lower extreme in the scale of _y_. This selection regime can be viewed as ‘disruptive selection’ with two known artificial truncation points. Individuals are genotyped only if _y__j_ ≤ _t_1

or _y__j_ ≥ _t_2, where _t_1 and _t_2 are, respectively, the (_n_/2)th and (_N_ − _n_/2)th ascendingly ordered phenotypic values of _y_ among the _N_ individuals. In real data analyses, the

two tails selected for genotyping may not be symmetrical. The two truncation points, _t_1 and _t_2, are not calculated from the distribution; rather, they take the largest phenotypic value

in the lower tail and the smallest phenotypic value in the upper tail. The phenotypic value of the _j_th individual is described by the following linear model: _y__j_=μ + _a__z__j_ + _d__w__j_ + _e__j_ (1), where _a_ and _d_ are the

additive and dominance effects of a QTL, respectively, _z__j_ and _w__j_ are indicator variables for the genotype of the QTL, which are defined as: where _Q__k__Q__l_ for _k_ ≤ _l_=1,2

denotes the QTL genotype, and _e__j_ is the residual effect distributed as N(0, σ2). For notational simplicity, let us define B=[μ, _a_, _d_]T and X_j_=[1, _z__j_, _w__j_], and rewrite model

(1) by: where X_j_=U_kl_ for genotype _Q__k_Q_l_ and This is a typical regression model. Because the QTL genotype is not observable but inferred from marker information, only the

conditional distribution of X_j_ given the marker genotype is available. Define the conditional probabilities of the QTL genotype, and thus X_j_, by _p_(X_j_) and the probability density of

_y__j_ given X_j_ by: The likelihood function is rewritten as: where _S_={U11, U12, U22}. Note that the first _n_ individuals are genotyped and the last _N_ − _n_ individuals are

ungenotyped, and _p_(X_j_) takes its prior value for _j_=_n_ + 1, …, _N_. The phenotypic values of ungenotyped individuals contribute very little information to linkage analysis. Their

inclusion in the likelihood serves solely as a way to correct the bias in estimation of the QTL effects caused by selective genotyping. When the number of ungenotyped individuals becomes very

large, the maximum likelihood method implemented this way is computationally inefficient. An alternative way to construct the likelihood function is to include only the phenotypic values of

the genotyped individuals while still taking into account the selection process. The likelihood function then becomes: where: The denominator of _g_(_y__j_|X_j_) is Pr[(_y__j_ ≤ _t_1) ⋃

(_y__j_ ≥ _t_2)| X_j_], the conditional probability that the _j_th individual is selected for genotyping given X_j_, where Φ(τ1|X_j_)=Pr(_y__j_ ≤ _t_1|X_j_) and Φ(τ2|X_j_)=Pr(_y__j_ ≤

_t_2|X_j_) are standardized normal functions, and τ1|X_j_=(_t_1 − X_j_B)/σ and τ2|X_j_= (_t_2 − X_j_B)/σ are the standardized truncation points. The likelihood function can be searched via

an EM algorithm that is described below. In the E step, the conditional posterior distribution of X_j_ is obtained using initial values of B and σ2, The posterior distribution is then used

to calculate the expectations of various quantities that involve X_j_. In the M step, we estimate the parameters based on the following equations: and where φ(τ1|X_j_) and φ(τ2|X_j_) denote

the standardized normal densities, different from Φ(τ1|X_j_) and Φ(τ2|X_j_). The notation E_x_ stands for expectation with respect to X_j_, the missing genotype. The initial values of

parameters are then replaced by B^ and _σ^_2, forming a new cycle of iteration. After convergence, B^ and _σ^_2 will be the MLEs of B and σ2. Note that the terms involving σ and σ2 in the

right hand sides of eqns (7) and (8) are because of selective genotyping. Without selection, these terms will vanish and the EM equations will reduce to the standard ones (Zeng, 1994). In

the simple model described in this study, only one non-QTL effect, μ, is included in the model. If the model includes many covariates, as seen in composite interval mapping, the ECM approach

should be adopted (Jiang & Zeng, 1995). The derivations of B^ and _σ^_2 are given in the Appendix. MULTIPLE-TRAIT ANALYSIS Let us define a 1 × _m_ vector for _m_ traits measured in the

_j_th individual by Y_j_=[_y__j_1, _y__j_2, …, _y__jm_]. The multivariate linear model is expressed by: Y_j_=X_j_B + E_j_, where X_j_=[1, _z__j_, _w__j_] remains the same as in the single-trait model, and E_j_ is a

1 × _m_ vector for the residuals with a multivariate normal distribution, i.e. E_j_ ∼ N_m_(0, V), where: Assume that the criterion of selection is a linear combination of all traits, called

the selection index and denoted by _I__j_=∑_m__k_=1_c__k__y__jk_=Y_j_C. The selection index is a generalized criterion of selection. If _c_1=1 and _c__k_=0 for all _k_≠1, then the index becomes the

phenotypic value of the first trait. The score of the selection index can be similarly partitioned into a genetic and residual component: where B_I_=∑_m__k_=1_c__k_B_k_=BC and

_e__jI_=∑_m__k_=1_c__k__e__jk_=E_j_C. The expectation and variance of _I__j_ are E(_I__j_|X_j_)=X_j_BC and Var(_I__j_|X_j_)=σ2_I_=CTVar(E_j_)C=CTVC, respectively. The two truncation points

of selection in the scale of the index are defined, again, by _t_1 and _t_2, respectively. The probability density of Y_j_ without selection is: After truncation selection on index _I__j_,

the joint density becomes _g_(Y_j_|X_j_) = [_f_(Y_j_|X_j_)]/[1 + Φ(τ1|X_j_) − Φ(τ2|X_j_)], where the denominator is the probability that the _j_th individual is selected for genotyping, i.e.

Pr[(_I__j_ ≤ _t_1) ⋃ (_I__j_ ≥ _t_2)|X_j_], and τ1|X_j_=(_t_1 − X_j_BC)/σ_I_ and τ2|X_j_=(_t_2 − X_j_BC)/σ_I_ are the standardized truncation points in the scale of the index. The likelihood function appears the same as eqn (4). Again, the MLEs can be obtained by using an EM

algorithm, which requires first calculating the posterior distribution of X_j_ and then maximizing the expectation of the log likelihood. The EM equations are given as follows: and where

Β=(1/σ2_I_)[σ1_I_, σ2_I_, …, σ_mI_], is a 1 × _m_ vector for the simple regression coefficients of the traits on the index. Note that the multivariate EM equations are simple extensions of

the univariate EM by multiplying Β and ΒTΒ by the appropriate terms in eqns (7) and (8). Again, eqns (12) and (13) will reduce to the standard ones (Jiang & Zeng, 1995) under random

selection. STATISTICAL POWER UNDER SELECTIVE GENOTYPING It is difficult to evaluate the power of QTL mapping when a genome-wide chromosomal scan is performed because the distribution of

the test statistic under either hypothesis (null or alternative) is unknown. The usual practice is to evaluate the power under the assumption that the position of the QTL is known so that

only point-wise test statistics are considered (Muranty, 1996). The distribution of a point-wise test statistic is usually known, at least asymptotically. Although the power calculated this

way cannot be applied to a whole genome-wide analysis, it may be used to compare relative efficiencies of different methods. It is certainly appropriate to use this power to evaluate mapping

procedures under the candidate gene approach. Theoretical work has been conducted for systems with two contrasting genotypes in the segregating population, e.g. backcrosses or half-sibs

(Darvasi & Soller, 1992). In this study, we evaluate the statistical power of QTL detection for a single-trait model in systems with three possible genotypes, e.g. F2 families, under the

assumption that the trait is controlled by a single QTL whose genotype is observable. Throughout the discussion, we will emphasize the difference in power between QTL detection with and

without selective genotyping. Power calculation without selective genotyping has been extensively investigated (e.g. Soller & Brody, 1976; Muranty, 1996). Denote the

general linear model in matrix notation by: The null hypothesis is H0 : _a_=_d_=0, which is expressed in matrix notation by H0 : KB=0, where: The generalized likelihood ratio test statistic

(Graybill, 1976) for testing this hypothesis is: where _q_=2 is the rank of K, _p_=3 is the number of parameters in the full model, and _σ^_2Ω1 and _σ^_2Ω0 are the residual variances

estimated from the full model and the reduced model (KB=0), respectively. Graybill (1976) showed that λ follows a noncentral _F_ distribution denoted by _F_(λ : _q_, _n_ − _p_, δ), where δ

is the noncentrality parameter given by: Muranty (1996) called λ the _F_-test statistic because of the nature of _F_ distribution. In genetic studies, a different sample will involve a

different X because a completely different segregation process will occur for a different experiment. When the sample size is not too small, however, XTX will be fairly constant from sample

to sample. Therefore, we can substitute XTX by its expectation. Defining and we have Substituting XTX by E(XTX) and after some algebraic manipulation, we get: where σ2G=_a_2/2 + _d_2 is the

total genetic variance. The statistical power is then calculated as: where 1 − Ψ is the Type II error and _F_−1(1 − α : _q_, _n_ − _p_, 0) is the critical value for testing H0 at a Type I

error rate of α. Under selective genotyping, the exact form of the distribution of the likelihood-ratio test statistic is unknown. To derive the power under selective genotyping, we must

assume that the test statistic still follows a noncentral _F_ distribution but with a different noncentrality parameter. This approximation is valid when the selection intensity is weak or

the QTL in question has a small effect. In fact, Darvasi & Soller (1992) have already made this approximation when calculating the number of genotyped individuals required to achieve a

given power under an additive effect model in a backcross design. Selective genotyping will change the conditional distribution of _y__j_ given its genotype and the frequencies of the three

genotypes in the mapping population. These changes will eventually modify _a_, _d_, σ2 and E(XTX), leading to an increase in the noncentrality parameter and thus an increase in the power.
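The shift in genotype frequencies can be illustrated numerically. The following Python code is a minimal sketch rather than the authors' implementation. It assumes that the QTL genotype is observed, that the genotypes _Q_1_Q_1, _Q_1_Q_2 and _Q_2_Q_2 are coded as (_z_, _w_) = (1, −1), (0, 1) and (−1, −1) (an assumed coding, chosen because it gives σ2G=_a_2/2 + _d_2 in an F2), and that the truncation points are the hypothetical values _t_1=−1.7 and _t_2=1.7. The probability that a genotype is selected is Φ(τ1) + 1 − Φ(τ2), and weighting these probabilities by the F2 priors gives the genotype frequencies among the selected individuals. All function names are illustrative.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def selection_prob(mean, sigma, t1, t2):
    """Pr(y <= t1 or y >= t2) for y ~ N(mean, sigma^2)."""
    tau1 = (t1 - mean) / sigma  # standardized lower truncation point
    tau2 = (t2 - mean) / sigma  # standardized upper truncation point
    return norm_cdf(tau1) + 1.0 - norm_cdf(tau2)


def selected_genotype_freqs(mu, a, d, sigma, t1, t2):
    """Selection probability and post-selection frequency per genotype."""
    # Assumed (z, w) coding and F2 prior frequencies 1/4, 1/2, 1/4.
    coding = {"Q1Q1": (1, -1), "Q1Q2": (0, 1), "Q2Q2": (-1, -1)}
    prior = {"Q1Q1": 0.25, "Q1Q2": 0.5, "Q2Q2": 0.25}
    q = {
        g: selection_prob(mu + a * z + d * w, sigma, t1, t2)
        for g, (z, w) in coding.items()
    }
    total = sum(prior[g] * q[g] for g in q)
    p = {g: prior[g] * q[g] / total for g in q}
    return q, p


# Effects taken from the simulation settings; truncation points are hypothetical.
q, p = selected_genotype_freqs(mu=0.0, a=0.229, d=0.162, sigma=1.0,
                               t1=-1.7, t2=1.7)
for genotype in q:
    print(genotype, round(q[genotype], 4), round(p[genotype], 4))
```

With symmetric truncation points, the genotype whose mean lies furthest from zero (here _Q_2_Q_2 under the assumed coding) is over-represented among the selected individuals relative to its prior frequency of 1/4.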

Let us denote the phenotypic value in the selected population by _y_*. Using the theory of truncated selection (Cohen, 1991), we found that the conditional expectation and variance of _y_*

given genotype _Q__k__Q__l_ are: and respectively, where τ1|U_kl_=(_t_1 − U_kl_B)/σ for _k_ ≤ _l_=1,2. Let us now define the probability that an individual with genotype _Q__k__Q__l_ is

selected for genotyping by _q__kl_=1 + Φ(τ1|U_kl_) − Φ(τ2|U_kl_). According to Bayes’ theorem, the frequency of genotype _Q__k__Q__l_ in the selected population can be defined by _p__kl_=[(1

+ |_k_ − _l_|)_q__kl_]/(_q_11 + 2_q_12 + _q_22) for _k_ ≤ _l_=1, 2. The modified additive and dominance effects after selective genotyping become: and respectively. The altered residual

variance takes the weighted average of the within-genotype residual variances, i.e. The modified E(XTX) resulting from selective genotyping is: where and Denote B*=[μ* _a_* _d_*]T as the

vector of parameters after selection, then the noncentrality parameter under selective genotyping is Subsequently, the statistical power under selective genotyping is calculated using eqn

(18) but with the noncentrality parameter replaced by δ*. ILLUSTRATION In this section we demonstrate the application of the method using simulated data and show the general behaviour of the

method that one expects to observe in QTL mapping experiments. SINGLE-TRAIT QTL MAPPING In the first simulation study, we assumed that a single QTL is located at position 25 cM of a 100-cM

chromosome segment covered by 11 evenly spaced codominant markers. The size of the QTL (measured by the percentage of phenotypic variance explained by the QTL) is 0.05. The actual genetic

effects that generate such a QTL are _a_=0.229 and _d_=0.162. In an F2 population, these genetic effects will make up a genetic variance of σ2G=_a_2/2 + _d_2=0.0525. The residual variance

was set at σ2=1.0, leading to _h_2=0.0525/(0.0525 + 1.0)=0.05. The number of individuals genotyped was fixed at 100. We then varied the total number of individuals measured for the phenotype

to control different levels of selection pressure. We set up four levels of proportion genotyped: 100%, 50%, 10% and 5%. The total numbers of phenotypically measured individuals

corresponding to the four proportions were 100, 200, 1000 and 2000, respectively. Under selective genotyping, three methods of QTL mapping were compared: (i) full data analysis (FULL) where

all phenotypic values, including ungenotyped individuals, were included in the data analysis with the marker genotypes of ungenotyped individuals treated as missing values; (ii) biased

analysis (BIAS) where only phenotypic values of genotyped individuals were included in the analysis with the likelihood function constructed as if there were no selection; and (iii) the true

method of selective genotyping (SELECT) proposed in this study where only genotyped individuals were included and the likelihood was constructed with correction for the bias. The QTL

location was estimated as the mean chromosomal position that shows the highest value of the test statistic. Each simulation was repeated 100 times. The mean and standard deviations of the

estimates are given in Table 1. Under random selection with the low variance explained by the QTL and the small sample size (_n_=100), estimation of the QTL position is not only severely

biased towards the centre of the chromosome but is also subject to a large estimation error. Estimates of the QTL effects and the residual variance are quite close to the expected values,

although with relatively large errors. With selective genotyping (SELECT), although the same numbers of individuals are included in the analysis, the bias in QTL position estimate has been

progressively corrected as the selection intensity increases; for instance, when the proportion selected is 5%, the estimation is almost unbiased with the estimation error reduced to

one-third of what is observed under random selection. Compared with the FULL method, the SELECT method has a slightly increased estimation error in the QTL position estimate. This indicates

that inclusion of the large number of ungenotyped individuals does provide a little information about linkage for a reason to be explained later. The BIAS method, using the same amount of

phenotypic information as the SELECT method, has almost identical estimation error of the QTL position as the SELECT method. Both the FULL and the SELECT methods provide estimates of the QTL

effects close to the expectations with similar estimation errors. The BIAS method, however, gives severely biased estimates of the QTL effects, because of the use of an incorrect likelihood

function. The residual variance is estimated very closely to the expectation by both the FULL and SELECT methods. However, the FULL method further reduces the already small

estimation error of this estimate. This explains the slightly decreased estimation error of the QTL position by the FULL method. The BIAS method, again, gives a very biased estimate of the residual variance.
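The bias of an uncorrected analysis can be reproduced in a few lines. The Python code below is an illustrative simplification, not the interval-mapping procedure used in the study. It assumes that the QTL genotype is observed directly, uses an assumed (_z_, _w_) coding of (1, −1), (0, 1) and (−1, −1) for the three F2 genotypes, and estimates _a_ naively as half the difference between the two homozygote means, with no correction for selection. The function names, the random seed and the population size of 50 000 are hypothetical choices made so that the sketch is stable.

```python
import random
import statistics


def simulate_f2(n_total, a, d, sigma, rng):
    """Return (z, phenotype) pairs for an F2 with one observed QTL."""
    # Assumed coding (z, w); listing the heterozygote twice gives 1:2:1.
    coding = [(1, -1), (0, 1), (0, 1), (-1, -1)]
    data = []
    for _ in range(n_total):
        z, w = rng.choice(coding)
        y = a * z + d * w + rng.gauss(0.0, sigma)
        data.append((z, y))
    return data


def select_tails(data, proportion):
    """Keep the lowest and highest proportion/2 of individuals by phenotype."""
    ranked = sorted(data, key=lambda pair: pair[1])
    k = int(len(ranked) * proportion / 2)
    return ranked[:k] + ranked[-k:]


def naive_additive_effect(data):
    """Half the difference between the two homozygote phenotype means."""
    upper = [y for z, y in data if z == 1]
    lower = [y for z, y in data if z == -1]
    return (statistics.fmean(upper) - statistics.fmean(lower)) / 2.0


rng = random.Random(1)
population = simulate_f2(50_000, a=0.229, d=0.162, sigma=1.0, rng=rng)
selected = select_tails(population, proportion=0.05)

a_all = naive_additive_effect(population)  # uses every individual
a_sel = naive_additive_effect(selected)    # uses the two tails only
print("estimate from all individuals:", round(a_all, 3))
print("naive estimate from selected tails:", round(a_sel, 3))
```

Comparing the two printed values shows the naive estimate from the selected tails lying far above the simulated _a_=0.229, which is the distortion that the truncated density _g_(_y__j_|X_j_) in the SELECT likelihood is designed to remove.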

Finally, selective genotyping has increased the score of the test statistic up to threefold (see the last column of Table 2). MULTIPLE TRAIT QTL MAPPING In the second simulation study, we

investigated QTL mapping for two correlated traits under selective genotyping. The marker map remains the same as previously described. The first trait is controlled by one QTL at the same

location (25 cM) with the same effect as described in the previous experiment, i.e. _a_1=0.229 and _d_1= 0.162. The second trait is controlled by a QTL located at 85 cM with identical

effects, i.e. _a_2=0.229 and _d_2=0.162. The residual variances were set at σ21=σ22=1 and the residual covariance set at σ12=0.5. The selection criterion was _I__j_=_c_1_y__j_1 + _c_2_y__j_2

where _c_1=1 and _c_2=0, i.e. only the first trait was selected. A total of 2500 individuals were measured for phenotype, but 250 (10%) were selectively genotyped. The simulation was

replicated 50 times. Figure 1 gives the average likelihood ratio test statistic profiles under selective genotyping (10%). The solid (_T_1), dotted (_T_2) and dashed (_T_12) lines represent

the likelihood-ratio test statistic profiles for the first trait, the second trait and both traits (joint test), respectively. Note that the likelihood-ratio test statistic profiles

(functions of the _F_-test statistics) are used here. They are defined as: and where _σ^_2Ω is the estimated residual variance under model Ω which defines the linear model by the set of

parameters included in the model: We used _T_ instead of λ to depict the test statistic profiles because _T_ approximates a χ2_q_ distribution and thus bears the additive property, i.e.

_T_12=_T_1 + _T_2. Although the two traits have an identical genetic variance, the first trait has a substantially higher test statistic profile than the second one because the first trait

is directly selected for genotyping. As a comparison, we repeated the simulation under random selection, i.e. we generated 250 individuals and genotyped all of them (100%) for mapping. The

corresponding test statistic profiles are given in Fig. 1(b). Compared with random selection (Fig. 1b), the increase in the test statistic profile for the first trait (Fig. 1a) is obvious. A

slight increase in the test statistic profile for the second trait is also observed because of its correlation to the first trait. POWER UNDER SELECTIVE GENOTYPING As reported in this

section, we first calculated the predicted powers under various proportions of genotyped individuals using the theoretical formula given in eqn (18). We then conducted simulation experiments

to verify our theoretical prediction. The effects of the QTL were again set at _a_=0.229 and _d_=0.162. The number of individuals genotyped remained at _n_=100. We varied the total number

of individuals measured (_N_) to control the proportions selected for genotyping (see column 2 of Table 3). Because the population mean was set at μ = 0, the truncation points are

symmetrical, and thus _t_2=−_t_1, where the values of _t_1 were found by trial and error so that the theoretical proportions selected equal the predetermined proportions (see column 3 of

Table 3). The values of the noncentrality parameter are listed in column 4 of Table 3. The critical value for testing the hypothesis at a Type I error rate of α=0.05 is _F_−1(0.95 : 2, 97,

0)=3.09, which was used to calculate the theoretical powers (listed in column 5 of Table 3). We then simulated 1000 samples under each level of proportion and conducted QTL analysis for each

sampled data set. The empirical power under each setting was calculated as the proportion of the samples that have the _F-_like test statistic greater than 3.09. These empirical powers,

given in the last column of the table, are fairly close to the corresponding theoretical predictions. DISCUSSION When the phenotypic values of ungenotyped individuals are included in the

data analysis, standard methods with proper handling of missing markers are used (Lander & Botstein, 1989; Muranty & Goffinet, 1997a,b; Henshall & Goddard, 1999; Johnson et al.,

1999). A problem occurs if the number of ungenotyped individuals is large because of the increased computational burden; for example, if 10% of the test population is genotyped, to genotype

250 individuals, one needs to measure an additional 2250 individuals for their phenotypes. The total sample size will be 2500. Because the 2250 ungenotyped individuals contribute very little

to linkage analysis but serve as bias correctors, their phenotypic values do not have to be included in the analysis. These individuals, however, do contribute to the estimation of the

residual variance. The estimate of the residual variance usually has very small estimation error. When the number of individuals genotyped is small, however, the residual variance estimate

from only the genotyped individuals may not be sufficiently accurate. In this case, it is important to include the ungenotyped individuals. The methods described above (e.g. Muranty &

Goffinet, 1997a,b; Johnson et al., 1999) are not the only ways to include the ungenotyped individuals. An alternative way is to ignore completely the genetic effects for the ungenotyped

individuals, partition the residual variance of an ungenotyped individual into a genetic and a pure environmental component, and use a mixed-model approach. This can be accomplished via the

following maximum likelihood analysis. Define the model for an ungenotyped individual by _y__j_=μ + _r__j_ for _j_=_n_ + 1, …, _N_, where _r__j_ is the residual effect with an N(0, σ2G + σ2)

distribution and σ2G=_a_2/2 + _d_2, as defined previously. The probability density of _y__j_ for the ungenotyped individual will be: The likelihood function including all individuals will

be: Note that ungenotyped individuals do not contribute to the estimation of B except μ, but they are used to estimate σ2G + σ2. The MLE may be directly searched or obtained via an EM

algorithm. In either way, the speed of convergence may be faster than the methods that treat X_j_ as missing values because there is no need to update _p_(X_j_) for an ungenotyped

individual. Further investigation is required to explore the properties of this alternative approach. It is not hard to imagine that ungenotyped individuals may not have a full measurement

of phenotypic values. This may occur, for example, in QTL mapping for the trait of flowering time. An investigator may choose to visit the field for the first few days when the population of

plants begins to flower and the last few days when the population approaches the end of the flowering season. In this case, plants that flower in the middle of the season may not have a

record of phenotype. Another example comes from multiple trait analysis in forest trees. One may decide to select early growth rate for QTL mapping, but later the investigator may want to

map QTLs for later growth rate as well. If selection is on the early trait, because of limited space, the investigator may not keep the culled individuals in the field. Then the final

population would be a selected population with regard to the later growth rate. The maximum likelihood analysis proposed in this study is the proper tool for handling such centrally

truncated data. It is undesirable to use only one tail of the trait distribution to carry out QTL mapping because the total variance of the trait is artificially deflated. However, if the

data happen to be single-tail truncated for some technical reasons, the proposed method can readily be applied for correcting the bias. A typical example of single-tail truncation can be

seen in artificial selection in plant and animal breeding. Another example may come from longitudinal data analysis where the phenotypic value of an individual depends on its longevity. Only

surviving individuals have a complete measurement of phenotype, whereas individuals not surviving only have partial information; for example, the yearly egg production of a chicken strongly

depends on the viability of the chicken. If a chicken dies in the middle of the year, we do not know the phenotypic record of her yearly production, but we do know that her yearly egg

production is greater than the current production in her record at the time when she dies. An unbiased analysis must be performed by taking into account these partial records. For

multiple-trait analysis, selective genotyping has been a problem because if all traits are deemed to be important to the researcher, which traits should be selected? The selection index

approach of Muranty & Goffinet (1997b) is a compromise between the traits. Because the selection criterion now becomes a single ‘trait’, it is easy to apply in practice. Lin &

Ritland (1997) suggested that an individual should be genotyped if at least one of _m_ traits exhibits the extreme value. Under this selection regime, different individuals seem to have

different criteria of selection; for example, if individual _j_ is selected because its _k_th phenotypic value is first observed as being extreme, then the criterion for _j_ is (_y__jk_ ≤

_t_1_k_) ⋃ (_y__jk_ ≥ _t_2_k_). On the other hand, if individual _i_ is selected because its _l_th phenotypic value is first observed as being extreme, then the criterion for _i_ is (_y__il_

≤ _t_1_l_) ⋃ (_y__il_ ≥ _t_2_l_). In both the index selection and the method of Lin & Ritland (1997), the selection criterion of each individual is a single trait (one-dimensional

selection), and thus the proposed method will apply. Another selection regime may be the so-called independent culling level selection where an individual will not be genotyped if any one of

the _m_ phenotypic values fails to reach the extreme. This selection regime is perhaps more rigorous than the previous two methods, but it is hard to programme because it is a

multidimensional selection (requiring multiple integration). Further study may be necessary to compare different selection regimes. Nonetheless, when phenotypic values of ungenotyped individuals

are included, methods of selection will be irrelevant to the statistical issue. Once selection is carried out on the phenotypic value of one trait (primary trait), QTL mapping for a highly

correlated trait (secondary trait) will also benefit. However, if the two traits are not correlated, the effective sample size in terms of the secondary trait will be comparable to a

_random_ sample of _n_, where _n_ is the number of genotyped individuals. Therefore, one should be cautious about the power of QTL mapping for traits less correlated to a highly selected

primary trait. An advantage of the logistic regression of Henshall & Goddard (1999) is that selection does not bias estimates of QTL effects, irrespective of whether phenotypic values of

ungenotyped individuals are included in the data analysis. This is because the roles of marker genotypes and the phenotypes in the likelihood function have been altered, just like the

discordant sib-pair mapping of Risch & Zhang (1995). Further investigation of the logistic regression, however, shows that selective genotyping can alter the estimation of the QTL

effect. The equivalence between logistic regression and the maximum likelihood holds only approximately when the effect of a QTL is small. This can be shown by looking at the posterior

probability of a QTL genotype given the phenotypic value of individual _j_: _p_*(X_j_)=[_p_(X_j_)_g_(_y__j_|X_j_)]/[∑X_j_∈_S_ _p_(X_j_)_g_(_y__j_|X_j_)], where _p_(X_j_) is the prior probability of the QTL genotype, independent of marker information, and

_g_(_y__j_|X_j_)=[_f_(_y__j_|X_j_)]/{Φ(τ1|X_j_)+[1 − Φ(τ2|X_j_)]}. However, the logistic regression model uses _r_(X_j_)=[_p_(X_j_)_f_(_y__j_|X_j_)]/[∑X_j_∈_S_ _p_(X_j_)_f_(_y__j_|X_j_)],

i.e. the term Φ(τ1|X_j_)+[1 − Φ(τ2|X_j_)] in the denominator of _g_(_y__j_|X_j_) has vanished. The exact maximum likelihood function should be built using _p_*(X_j_) instead of

_r_(X_j_). However, using _r_(X_j_) may still be justifiable because: (i) when the size of the QTL is small, Φ(τ1|X_j_) + [1 − Φ(τ2|X_j_)] can be considered as a constant across different

genotypes so that the corresponding terms in the numerator and denominator cancel each other out, leading to _p_*(X_j_) ≈ _r_(X_j_); and (ii) _r_(X_j_) is much easier to handle than

_p_*(X_j_) in the maximum likelihood analysis. In addition to the approximate nature of the logistic regression, there are two unsolved problems: (a) modification is required to map a QTL in

an F2 population; and (b) an exact interval mapping has not been available. An approximate interval mapping was accomplished via interpolation (Henshall & Goddard, 1999). Solving

the first problem requires a multicategorical response model, e.g. models for nominal or ordinal responses (Fahrmeir & Tutz, 1994). The second problem involves missing QTL genotypes and

may be solved using the EM or MCMC algorithm of C. Vogl & S. Xu (unpublished results) for mapping viability loci. Quantitative trait loci mapping is usually performed after a marker map

is fully developed. If the trait under selection has a strong genetic component, selective genotyping can also cause distortion of the inferred marker map from the true one. The distortion

is reflected by the change in both the marker order and the distances between markers. Because the severity of marker map distortion depends on the sizes and locations of QTLs, marker

mapping and QTL mapping should be carried out concurrently under selective genotyping. One can apply the general idea of EM to concurrent mapping. To do this, one first maps QTLs under the

assumption that the marker map is known without error, and then corrects the marker map by taking into account the distortion caused by selective genotyping under the assumption that the

sizes and locations of QTLs are known. This completes one cycle of iteration, and the iteration should continue until a criterion of convergence is reached. The problem can be very

complicated, especially when the marker order is allowed to change. Many theoretical and practical problems may exist in concurrent mapping, and further investigation is deemed necessary.
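The alternating scheme described above can be sketched in a few lines. This is only an illustration of the iteration pattern: `update_qtl` and `update_map` are hypothetical placeholders for a QTL mapping step and a marker map correction step, and the two toy functions are simple stand-ins (coordinate-wise minimisers of a quadratic), not genetic models.

```python
def alternate(update_qtl, update_map, qtl, marker_map, tol=1e-10, max_iter=500):
    """Alternate two conditional updates until both estimates stop changing.

    update_qtl(marker_map) -> new QTL estimate, treating the map as known.
    update_map(qtl)        -> new map estimate, treating the QTL as known.
    Both callables are hypothetical placeholders, not part of any package.
    """
    for cycle in range(1, max_iter + 1):
        new_qtl = update_qtl(marker_map)   # step 1: map QTLs given the map
        new_map = update_map(new_qtl)      # step 2: correct the map given the QTLs
        change = max(abs(new_qtl - qtl), abs(new_map - marker_map))
        qtl, marker_map = new_qtl, new_map
        if change < tol:                   # convergence criterion
            return qtl, marker_map, cycle
    raise RuntimeError("no convergence within max_iter cycles")


# Toy stand-ins: coordinate-wise minimisers of
# (a - 1)^2 + (b - 2)^2 + 0.5 * a * b,
# with a playing the "QTL" role and b the "marker map" role.
def toy_update_qtl(b):
    return 1.0 - b / 4.0


def toy_update_map(a):
    return 2.0 - a / 4.0


if __name__ == "__main__":
    a, b, cycles = alternate(toy_update_qtl, toy_update_map, 0.0, 0.0)
    print(a, b, cycles)
```

In a real application each estimate would be a vector or a whole map rather than a scalar, and the convergence check would need a suitable norm; allowing the marker order to change would make the second step a discrete search.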

Maximization of the likelihood function is not an easy task. Special algorithms and computer programs are required. We developed an EM algorithm that appears to be a simple modification of

the existing EM algorithm for standard interval mapping. As a result, it can be readily incorporated into standard QTL mapping software, e.g. MAPMAKER (Lander & Botstein, 1989). One

caveat about the EM algorithm is that when the selection intensity is too high, the EM algorithm may take a very large number of iterations to converge and sometimes may not converge at all.
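As a rough illustration of where the truncation enters, the posterior genotype probabilities used in an E-step can be computed as below. This is a minimal sketch, not the authors' program: the function names, the two-genotype example and the numerical values are assumptions, and the genotype means stand in for X_j_B. With `truncated=True` each genotype is weighted by _g_(_y__j_|X_j_); with `truncated=False` it is weighted by _f_(_y__j_|X_j_), which corresponds to _r_(X_j_).

```python
import math


def norm_pdf(z):
    """Standard normal density, phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal distribution function, Phi(z)."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def genotype_posterior(y, priors, means, sigma, t1, t2, truncated=True):
    """Posterior probabilities of the QTL genotypes for one genotyped individual.

    truncated=True  weights each genotype by
                    g(y|X) = f(y|X) / {Phi(tau1) + [1 - Phi(tau2)]};
    truncated=False weights by the untruncated normal density f(y|X).
    """
    weights = []
    for prior, mean in zip(priors, means):
        f = norm_pdf((y - mean) / sigma) / sigma
        if truncated:
            tau1 = (t1 - mean) / sigma
            tau2 = (t2 - mean) / sigma
            # Probability of being selected (y <= t1 or y >= t2) given the genotype.
            f /= norm_cdf(tau1) + norm_cdf(-tau2)
        weights.append(prior * f)
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Hypothetical backcross: two genotypes with equal prior probability.
    for effect in (0.05, 1.0):
        p_star = genotype_posterior(2.0, [0.5, 0.5], [0.0, effect], 1.0, -1.0, 1.0)
        r = genotype_posterior(2.0, [0.5, 0.5], [0.0, effect], 1.0, -1.0, 1.0,
                               truncated=False)
        print(effect, p_star, r)
```

In this toy setting the two sets of probabilities nearly coincide for the small effect and differ visibly for the large one.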

This is not a problem of the EM algorithm itself; rather, it is caused by numerical overflow when the bias adjustment of the residual variance is conducted. Recall that we added an adjustment term to the estimate of the residual variance; when the selection intensity

is high, this term can be numerically unstable. Proper handling of this numerical overflow is required, which is, unfortunately, beyond our technical ability. We found that when the

proportion selected is 40% or more, the problem rarely happens. In our simulation studies, when numerical overflow occurred, the EM algorithm was replaced by the simplex algorithm (Nelder

& Mead, 1965) for direct search of MLEs. The simplex method is usually slower than the EM algorithm, but it can handle highly selected data. Another caveat is the sensitivity of the

proposed method to departure from normality. Because the likelihood function involves Φ(τ), in addition to φ(τ), we anticipate that the method is more sensitive to deviation from normality

than methods that also use the ungenotyped individuals. In conclusion, we developed an exact maximum likelihood approach to map QTLs under selective genotyping using phenotypic values of

genotyped individuals only. Compared with the full data analysis (using all phenotypic values), the proposed method performs well: the average test statistic is slightly lower; estimates of

QTL parameters are almost identical; and the estimate of residual variance is subject to a relatively large error. The slightly lower test statistic value may be caused by the relatively

large increase in the estimation error of the residual variance. A general recommendation is that whenever full data analysis is possible, the full maximum likelihood analysis should be

performed. If it is impossible or difficult to analyse the full data, e.g. the sample size is too large, the phenotypic values of ungenotyped individuals are missing or composite interval

mapping is to be performed, then the proposed method should be applied with the understanding that there is little to lose.

REFERENCES

* Cohen, A. C. (1991). _Truncated and Censored Samples: Theory and Applications_. Marcel Dekker, New York.
* Darvasi, A. and Soller, M. (1992). Selective genotyping for determination of linkage between a marker locus and a quantitative trait locus. _Theor Appl Genet_ 85: 353–359.
* Fahrmeir, L. and Tutz, G. (1994). _Multivariate Statistical Modelling Based on Generalized Linear Models_. Springer-Verlag, New York, NY.
* Graybill, F. A. (1976). _Theory and Application of the Linear Model_. Duxbury Press, North Scituate, MA.
* Groover, A., Devey, M., Fiddler, T., Lee, J., Megraw, R., Mitchell-Olds, T. _et al_. (1994). Identification of quantitative trait loci influencing wood specific gravity in an outbred pedigree of loblolly pine. _Genetics_ 138: 1293–1300.
* Henshall, J. M. and Goddard, M. E. (1999). Multiple-trait mapping of quantitative trait loci after selective genotyping using logistic regression. _Genetics_ 151: 885–894.
* Jansen, R. C. and Stam, P. (1994). High resolution of quantitative traits into multiple loci via interval mapping. _Genetics_ 136: 1447–1455.
* Jiang, C. and Zeng, Z. B. (1995). Multiple trait analysis of genetic mapping for quantitative trait loci. _Genetics_ 140: 1111–1127.
* Johnson, D. L., Jansen, R. C. and van Arendonk, J. A. M. (1999). Mapping quantitative trait loci in a selectively genotyped outbred population using a mixture model approach. _Genet Res_ 73: 75–83.
* Lander, E. S. and Botstein, D. (1989). Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. _Genetics_ 121: 185–199.
* Lin, J. Z. and Ritland, K. (1997). Quantitative trait loci differentiating the outbreeding _Mimulus guttatus_ from the inbreeding _M. platycalyx_. _Genetics_ 146: 1115–1121.
* Muranty, H. (1996). Power of tests for quantitative trait loci detection using full-sib families in different schemes. _Heredity_ 76: 156–165.
* Muranty, H. and Goffinet, B. (1997a). Selective genotyping for location and estimation of the effect of a quantitative trait locus. _Biometrics_ 53: 629–643.
* Muranty, H. and Goffinet, B. (1997b). Multitrait and multipopulation QTL search using selective genotyping. _Genet Res_ 70: 259–265.
* Nelder, J. A. and Mead, R. (1965). A simplex method for function minimization. _Comput J_ 7: 308–313.
* Risch, N. and Zhang, H. (1995). Extreme discordant sib pairs for mapping quantitative trait loci in humans. _Science_ 268: 1584–1589.
* Soller, M. and Brody, T. (1976). On the power of experimental designs for the detection of linkage between marker loci and quantitative trait loci in crosses between inbred lines. _Theor Appl Genet_ 47: 35–39.
* Zeng, Z. B. (1994). Precision mapping of quantitative trait loci. _Genetics_ 136: 1457–1468.

ACKNOWLEDGEMENTS

Many thanks are due to Drs W. M. Muir and O.

Savolainen for their encouragement and support on the project. We thank Drs D. Gessler, J. Lin, R. Whitkus and M. Sillanpaa for their helpful comments on an earlier version of the

manuscript. We would also like to thank two anonymous reviewers for their criticisms and comments which have greatly improved the presentation of the manuscript. This research was supported

by the National Institutes of Health Grant GM55321 and the USDA National Research Initiative Competitive Grants Program 97–35205–5075 to S.X.

AUTHOR INFORMATION

Author notes

* Claus Vogl. Present address: Department of Botany and Plant Sciences, University of California, Riverside, CA 92521, U.S.A.

AUTHORS AND AFFILIATIONS

* Shizhong Xu: Department of Botany and Plant Sciences, University of California, Riverside, CA 92521, USA
* Claus Vogl: Department of Biology, University of Oulu, PO Box 3000, Oulu, FIN-90401, Finland

CORRESPONDING AUTHOR

Correspondence to Shizhong Xu.

APPENDIX

DERIVATION OF THE EM ALGORITHM FOR SINGLE-TRAIT ANALYSIS

First, let us define

the log likelihood function by:

The MLEs of B and σ² are obtained by solving ∂_l_/∂B=0 and ∂_l_/∂σ²=0 simultaneously.

Derivation of ∂_l_/∂B

where

In the above equation, the truncation points have been standardized, i.e. τ1=(_t_1 − X_j_B)/σ and τ2=(_t_2 − X_j_B)/σ, where _t_1 and _t_2 are the truncation points in the original scale. Note that:

Hence,

Define _p_*(X_j_)=[_p_(X_j_)_g_(_y__j_|X_j_)]/[∑X_j_∈_S_ _p_(X_j_)_g_(_y__j_|X_j_)] as the posterior distribution of X_j_. We now have:

Solving for ∂_l_/∂B=0, we have:

Derivation of ∂_l_/∂σ²

where

Note that

Therefore,

Finally, we have

Solving for ∂_l_/∂σ²=0, we obtain

ABOUT THIS ARTICLE

CITE THIS ARTICLE

Xu, S., Vogl, C. Maximum

likelihood analysis of quantitative trait loci under selective genotyping. _Heredity_ 84, 525–537 (2000). https://doi.org/10.1046/j.1365-2540.2000.00653.x

* Received: 22 June 1999
* Accepted: 13 October 1999
* Published: 01 May 2000
* Issue Date: 01 May 2000
* DOI: https://doi.org/10.1046/j.1365-2540.2000.00653.x

KEYWORDS

* EM algorithm
* QTL mapping
* simplex algorithm
* truncated selection
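As a companion to the appendix, the following is a sketch of how the score equations can be worked out from the definitions given there. It assumes the single-trait model y_j = X_j B + e_j with e_j normally distributed with mean 0 and variance σ², where X_j is a row vector, n is the number of genotyped individuals and A(X_j) is shorthand (introduced here) for the selection probability Φ(τ1)+[1 − Φ(τ2)]. The fixed-point arrangement in the last two lines is one possible form and may differ from the one used by the authors.

```latex
\begin{align}
l &= \sum_{j=1}^{n} \ln \sum_{X_j \in S} p(X_j)\, g(y_j \mid X_j),
\qquad
g(y_j \mid X_j) = \frac{f(y_j \mid X_j)}{A(X_j)},
\qquad
A(X_j) = \Phi(\tau_1) + 1 - \Phi(\tau_2), \\[4pt]
\frac{\partial l}{\partial B}
 &= \sum_{j=1}^{n} \sum_{X_j \in S} p^{*}(X_j)\, X_j^{T}
    \left[ \frac{y_j - X_j B}{\sigma^{2}}
         + \frac{\phi(\tau_1) - \phi(\tau_2)}{\sigma\, A(X_j)} \right], \\[4pt]
\frac{\partial l}{\partial \sigma^{2}}
 &= \sum_{j=1}^{n} \sum_{X_j \in S} p^{*}(X_j)
    \left[ -\frac{1}{2\sigma^{2}}
         + \frac{(y_j - X_j B)^{2}}{2\sigma^{4}}
         + \frac{\tau_1 \phi(\tau_1) - \tau_2 \phi(\tau_2)}{2\sigma^{2} A(X_j)} \right], \\[4pt]
B &= \left[ \sum_{j} \sum_{X_j \in S} p^{*}(X_j)\, X_j^{T} X_j \right]^{-1}
     \sum_{j} \sum_{X_j \in S} p^{*}(X_j)\, X_j^{T}
     \left[ y_j + \sigma\, \frac{\phi(\tau_1) - \phi(\tau_2)}{A(X_j)} \right], \\[4pt]
\sigma^{2} &= \frac{1}{n} \sum_{j} \sum_{X_j \in S} p^{*}(X_j)
     \left[ (y_j - X_j B)^{2}
          + \sigma^{2}\, \frac{\tau_1 \phi(\tau_1) - \tau_2 \phi(\tau_2)}{A(X_j)} \right].
\end{align}
```

The right-hand sides of the last two lines are evaluated at the current parameter values. Without truncation the terms involving φ drop out and the expressions reduce to the familiar weighted least squares and residual variance updates of standard interval mapping.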
