# 2.3: Maximum Likelihood - Biology

## Section 2.3a: What is a likelihood?

Since all of the approaches described in the remainer of this chapter involve calculating likelihoods, I will first briefly describe this concept. A good general review of likelihood is Edwards (1992). Likelihood is defined as the probability, given a model and a set of parameter values, of obtaining a particular set of data. That is, given a mathematical description of the world, what is the probability that we would see the actual data that we have collected?

To calculate a likelihood, we have to consider a particular model that may have generated the data. That model will almost always have parameter values that need to be specified. We can refer to this specified model (with particular parameter values) as a hypothesis, H. The likelihood is then:

[L(H|D)=Pr(D|H) label{2.1}]

Here, L and Pr stand for likelihood and probability, D for the data, and H for the hypothesis, which again includes both the model being considered and a set of parameter values. The | symbol stands for “given,” so equation 2.1 can be read as “the likelihood of the hypothesis given the data is equal to the probability of the data given the hypothesis.” In other words, the likelihood represents the probability under a given model and parameter values that we would obtain the data that we actually see.

For any given model, using different parameter values will generally change the likelihood. As you might guess, we favor parameter values that give us the highest probability of obtaining the data that we see. One way to estimate parameters from data, then, is by finding the parameter values that maximize the likelihood; that is, the parameter values that give the highest likelihood, and the highest probability of obtaining the data. These estimates are then referred to as maximum likelihood (ML) estimates. In an ML framework, we suppose that the hypothesis that has the best fit to the data is the one that has the highest probability of having generated that data.

For the example above, we need to calculate the likelihood as the probability of obtaining heads 63 out of 100 lizard flips, given some model of lizard flipping. In general, we can write the likelihood for any combination of H “successes” (flips that give heads) out of n trials. We will also have one parameter, pH, which will represent the probability of “success,” that is, the probability that any one flip comes up heads. We can calculate the likelihood of our data using the binomial theorem:

\$\$ L(H|D)=Pr(D|p)= {n choose H} p_H^H (1-p_H)^{n-H} label{2.2} \$\$

In the example given, n = 100 and H = 63, so:

\$\$ L(H|D)= {100 choose 63} p_H^{63} (1-p_H)^{37} label{2.3} \$\$ Figure 2.2. Likelihood surface for the parameter pH, given a coin that has been flipped as heads 63 times out of 100. Image by the author, can be reused under a CC-BY-4.0 license.

We can make a plot of the likelihood, L, as a function of pH (Figure 2.2). When we do this, we see that the maximum likelihood value of pH, which we can call \$hat{p}_H\$, is at \$hat{p}_H = 0.63\$. This is the “brute force” approach to finding the maximum likelihood: try many different values of the parameters and pick the one with the highest likelihood. We can do this much more efficiently using numerical methods as described in later chapters in this book.

We could also have obtained the maximum likelihood estimate for pH through differentiation. This problem is much easier if we work with the ln-likelihood rather than the likelihood itself (note that whatever value of pH that maximizes the likelihood will also maximize the ln-likelihood, because the log function is strictly increasing). So:

\$\$ ln{L} = ln{n choose H} + H ln{p_H}+ (n-H)ln{(1-p_H)} label{2.4} \$\$

Note that the natural log (ln) transformation changes our equation from a power function to a linear function that is easy to solve. We can differentiate:

\$\$ frac{d ln{L}}{dp_H} = frac{H}{p_H} - frac{(n-H)}{(1-p_H)}label{2.5} \$\$

The maximum of the likelihood represents a peak, which we can find by setting the derivative \$frac{d ln{L}}{dp_H}\$ to zero. We then find the value of pH that solves that equation, which will be our estimate \$hat{p}_H\$. So we have:

\$\$ egin{array}{lcl} frac{H}{hat{p}_H} - frac{n-H}{1-hat{p}_H} & = & 0 frac{H}{hat{p}_H} & = & frac{n-H}{1-hat{p}_H} H (1-hat{p}_H) & = & hat{p}_H (n-H) H-Hhat{p}_H & = & nhat{p}_H-Hhat{p}_H H & = & nhat{p}_H hat{p}_H &=& H / n end{array} label{2.6}\$\$

Notice that, for our simple example, H/n = 63/100 = 0.63, which is exactly equal to the maximum likelihood from figure 2.2.

Maximum likelihood estimates have many desirable statistical properties. It is worth noting, however, that they will not always return accurate parameter estimates, even when the data is generated under the actual model we are considering. In fact, ML parameters can sometimes be biased. To understand what this means, we need to formally introduce two new concepts: bias and precision. Imagine that we were to simulate datasets under some model A with parameter a. For each simulation, we then used ML to estimate the parameter \$hat{a}\$ for the simulated data. The precision of our ML estimate tells us how different, on average, each of our estimated parameters \$hat{a}_i\$ are from one another. Precise estimates are estimated with less uncertainty. Bias, on the other hand, measures how close our estimates \$hat{a}_i\$ are to the true value a. If our ML parameter estimate is biased, then the average of the \$hat{a}_i\$ will differ from the true value a. It is not uncommon for ML estimates to be biased in a way that depends on sample size, so that the estimates get closer to the truth as sample size increases, but can be quite far off in a particular direction when the number of data points is small compared to the number of parameters being estimated.

In our example of lizard flipping, we estimated a parameter value of \$hat{p}_H = 0.63\$. For the particular case of estimating the parameter of a binomial distribution, our ML estimate is known to be unbiased. And this estimate is different from 0.5 – which was our expectation under the null hypothesis. So is this lizard fair? Or, alternatively, can we reject the null hypothesis that pH = 0.5? To evaluate this, we need to use model selection.

## Section 2.3b: The likelihood ratio test

Model selection involves comparing a set of potential models and using some criterion to select the one that provides the “best” explanation of the data. Different approaches define “best” in different ways. I will first discuss the simplest, but also the most limited, of these techniques, the likelihood ratio test. Likelihood ratio tests can only be used in one particular situation: to compare two models where one of the models is a special case of the other. This means that model A is exactly equivalent to the more complex model B with parameters restricted to certain values. We can always identify the simpler model as the model with fewer parameters. For example, perhaps model B has parameters x, y, and z that can take on any values. Model A is the same as model B but with parameter z fixed at 0. That is, A is the special case of B when parameter z = 0. This is sometimes described as model A is nested within model B, since every possible version of model A is equal to a certain case of model B, but model B also includes more possibilities.

For likelihood ratio tests, the null hypothesis is always the simpler of the two models. We compare the data to what we would expect if the simpler (null) model were correct.

For example, consider again our example of flipping a lizard. One model is that the lizard is “fair:” that is, that the probability of heads is equal to 1/2. A different model might be that the probability of heads is some other value p, which could be 1/2, 1/3, or any other value between 0 and 1. Here, the latter (complex) model has one additional parameter, pH, compared to the former (simple) model; the simple model is a special case of the complex model when pH = 1/2.

For such nested models, one can calculate the likelihood ratio test statistic as

\$\$ Delta = 2 cdot ln{frac{L_1}{L_2}} = 2 cdot (ln{L_1}-ln{L_2}) label{2.7}\$\$

Here, Δ is the likelihood ratio test statistic, L2 the likelihood of the more complex (parameter rich) model, and L1 the likelihood of the simpler model. Since the models are nested, the likelihood of the complex model will always be greater than or equal to the likelihood of the simple model. This is a direct consequence of the fact that the models are nested. If we find a particular likelihood for the simpler model, we can always find a likelihood equal to that for the complex model by setting the parameters so that the complex model is equivalent to the simple model. So the maximum likelihood for the complex model will either be that value, or some higher value that we can find through searching the parameter space. This means that the test statistic Δ will never be negative. In fact, if you ever obtain a negative likelihood ratio test statistic, something has gone wrong – either your calculations are wrong, or you have not actually found ML solutions, or the models are not actually nested.

To carry out a statistical test comparing the two models, we compare the test statistic Δ to its expectation under the null hypothesis. When sample sizes are large, the null distribution of the likelihood ratio test statistic follows a chi-squared (χ2) distribution with degrees of freedom equal to the difference in the number of parameters between the two models. This means that if the simpler hypothesis were true, and one carried out this test many times on large independent datasets, the test statistic would approximately follow this χ2 distribution. To reject the simpler (null) model, then, one compares the test statistic with a critical value derived from the appropriate χ2 distribution. If the test statistic is larger than the critical value, one rejects the null hypothesis. Otherwise, we fail to reject the null hypothesis. In this case, we only need to consider one tail of the χ2 test, as every deviation from the null model will push us towards higher Δ values and towards the right tail of the distribution.

For the lizard flip example above, we can calculate the ln-likelihood under a hypothesis of pH = 0.5 as:

\$\$ egin{array}{lcl} ln{L_1} &=& ln{left(frac{100}{63} ight)} + 63 cdot ln{0.5} + (100-63) cdot ln{(1-0.5)} onumber ln{L_1} &=& -5.92 onumber end{array} label{2.8}\$\$

We can compare this to the likelihood of our maximum-likelihood estimate :

\$\$ egin{array}{lcl} ln{L_2} &=& ln{left(frac{100}{63} ight)} + 63 cdot ln{0.63} + (100-63) cdot ln{(1-0.63)} onumber ln{L_2} &=& -2.50 onumber end{array} label{2.9}\$\$

We then calculate the likelihood ratio test statistic:

\$\$ egin{array}{lcl} Delta &=& 2 cdot (ln{L_2}-ln{L_1}) onumber Delta &=& 2 cdot (-2.50 - -5.92) onumber Delta &=& 6.84 onumber end{array} label{2.10}\$\$

If we compare this to a χ2 distribution with one d.f., we find that P = 0.009. Because this P-value is less than the threshold of 0.05, we reject the null hypothesis, and support the alternative. We conclude that this is not a fair lizard. As you might expect, this result is consistent with our answer from the binomial test in the previous section. However, the approaches are mathematically different, so the two P-values are not identical.

Although described above in terms of two competing hypotheses, likelihood ratio tests can be applied to more complex situations with more than two competing models. For example, if all of the models form a sequence of increasing complexity, with each model a special case of the next more complex model, one can compare each pair of hypotheses in sequence, stopping the first time the test statistic is non-significant. Alternatively, in some cases, hypotheses can be placed in a bifurcating choice tree, and one can proceed from simple to complex models down a particular path of paired comparisons of nested models. This approach is commonly used to select models of DNA sequence evolution (Posada and Crandall 1998).

## Section 2.3c: The Akaike Information Criterion (AIC)

You might have noticed that the likelihood ratio test described above has some limitations. Especially for models involving more than one parameter, approaches based on likelihood ratio tests can only do so much. For example, one can compare a series of models, some of which are nested within others, using an ordered series of likelihood ratio tests. However, results will often depend strongly on the order in which tests are carried out. Furthermore, often we want to compare models that are not nested, as required by likelihood ratio tests. For these reasons, another approach, based on the Akaike Information Criterion (AIC), can be useful.

The AIC value for a particular model is a simple function of the likelihood L and the number of parameters k:

[AIC = 2k − 2ln L label{2.11}]

This function balances the likelihood of the model and the number of parameters estimated in the process of fitting the model to the data. One can think of the AIC criterion as identifying the model that provides the most efficient way to describe patterns in the data with few parameters. However, this shorthand description of AIC does not capture the actual mathematical and philosophical justification for equation (2.11). In fact, this equation is not arbitrary; instead, its exact trade-off between parameter numbers and log-likelihood difference comes from information theory (for more information, see Burnham and Anderson 2003, Akaike (1998)).

The AIC equation (2.11) above is only valid for quite large sample sizes relative to the number of parameters being estimated (for n samples and k parameters, n/k > 40). Most empirical data sets include fewer than 40 independent data points per parameter, so a small sample size correction should be employed:

\$\$ AIC_C = AIC + frac{2k(k+1)}{n-k-1} label{2.12}\$\$

This correction penalizes models that have small sample sizes relative to the number of parameters; that is, models where there are nearly as many parameters as data points. As noted by Burnham and Anderson (2003), this correction has little effect if sample sizes are large, and so provides a robust way to correct for possible bias in data sets of any size. I recommend always using the small sample size correction when calculating AIC values.

To select among models, one can then compare their AICc scores, and choose the model with the smallest value. It is easier to make comparisons in AICc scores between models by calculating the difference, ΔAICc. For example, if you are comparing a set of models, you can calculate ΔAICc for model i as:

[ΔAIC_{c_i} = AIC_{c_i} − AIC_{c_{min}} label{2.13}]

where AICci is the AICc score for model i and AICcmin is the minimum AICc score across all of the models.

As a broad rule of thumb for comparing AIC values, any model with a ΔAICci of less than four is roughly equivalent to the model with the lowest AICc value. Models with ΔAICci between 4 and 8 have little support in the data, while any model with a ΔAICci greater than 10 can safely be ignored.

Additionally, one can calculate the relative support for each model using Akaike weights. The weight for model i compared to a set of competing models is calculated as:

\$\$ w_i = frac{e^{-Delta AIC_{c_i}/2}}{sum_i{e^{-Delta AIC_{c_i}/2}}} label{2.14} \$\$

The weights for all models under consideration sum to 1, so the wi for each model can be viewed as an estimate of the level of support for that model in the data compared to the other models being considered.

Returning to our example of lizard flipping, we can calculate AICc scores for our two models as follows:

\$\$ egin{array}{lcl} AIC_1 &=& 2 k_1 - 2 ln{L_1} = 2 cdot 0 - 2 cdot -5.92 AIC_1 &=& 11.8 AIC_2 &=& 2 k_2 - 2 ln{L_2} = 2 cdot 1 - 2 cdot -2.50 AIC_2 &=& 7.0 end{array} label{2.15} \$\$

Our example is a bit unusual in that model one has no estimated parameters; this happens sometimes but is not typical for biological applications. We can correct these values for our sample size, which in this case is n = 100 lizard flips:

\$\$ egin{array}{lcl} AIC_{c_1} &=& AIC_1 + frac{2 k_1 (k_1 + 1)}{n - k_1 - 1} AIC_{c_1} &=& 11.8 + frac{2 cdot 0 (0 + 1)}{100-0-1} AIC_{c_1} &=& 11.8 AIC_{c_2} &=& AIC_2 + frac{2 k_2 (k_2 + 1)}{n - k_2 - 1} AIC_{c_2} &=& 7.0 + frac{2 cdot 1 (1 + 1)}{100-1-1} AIC_{c_2} &=& 7.0 end{array} label{2.16} \$\$

Notice that, in this particular case, the correction did not affect our AIC values, at least to one decimal place. This is because the sample size is large relative to the number of parameters. Note that model 2 has the smallest AICc score and is thus the model that is best supported by the data. Noting this, we can now convert these AICc scores to a relative scale:

\$\$ egin{array}{lcl} Delta AIC_{c_1} &=& AIC_{c_1}-AIC{c_{min}} &=& 11.8-7.0 &=& 4.8 end{array} label{2.17} \$\$

\$\$ egin{array}{lcl} Delta AIC_{c_2} &=& AIC_{c_2}-AIC{c_{min}} &=& 7.0-7.0 &=& 0 end{array} \$\$

Note that the ΔAICci for model 1 is greater than four, suggesting that this model (the “fair” lizard) has little support in the data. This is again consistent with all of the results that we've obtained so far using both the binomial test and the likelihood ratio test. Finally, we can use the relative AICc scores to calculate Akaike weights:

\$\$ egin{array}{lcl} sum_i{e^{-Delta_i/2}} &=& e^{-Delta_1/2} + e^{-Delta_2/2} &=& e^{-4.8/2} + e^{-0/2} &=& 0.09 + 1 &=& 1.09 end{array} label{2.18}\$\$

\$\$ egin{array}{lcl} w_1 &=& frac{e^{-Delta AIC_{c_1}/2}}{sum_i{e^{-Delta AIC_{c_i}/2}}} &=& frac{0.09}{1.09} &=& 0.08 end{array} \$\$

\$\$ egin{array}{lcl} w_2 &=& frac{e^{-Delta AIC_{c_2}/2}}{sum_i{e^{-Delta AIC_{c_i}/2}}} &=& frac{1.00}{1.09} &=& 0.92 end{array} \$\$

Our results are again consistent with the results of the likelihood ratio test. The relative likelihood of an unfair lizard is 0.92, and we can be quite confident that our lizard is not a fair flipper.

AIC weights are also useful for another purpose: we can use them to get model-averaged parameter estimates. These are parameter estimates that are combined across different models proportional to the support for those models. As a thought example, imagine that we are considering two models, A and B, for a particular dataset. Both model A and model B have the same parameter p, and this is the parameter we are particularly interested in. In other words, we do not know which model is the best model for our data, but what we really need is a good estimate of p. We can do that using model averaging. If model A has a high AIC weight, then the model-averaged parameter estimate for p will be very close to our estimate of p under model A; however, if both models have about equal support then the parameter estimate will be close to the average of the two different estimates. Model averaging can be very useful in cases where there is a lot of uncertainty in model choice for models that share parameters of interest. Sometimes the models themselves are not of interest, but need to be considered as possibilities; in this case, model averaging lets us estimate parameters in a way that is not as strongly dependent on our choice of models.

### The Method

Suppose again that we have an observable random variable (s) for an experiment, that takes values in a set (S). Suppose also that distribution of (s) depends on an unknown parameter ( heta), taking values in a parameter space (Theta). Of course, our data variable (s) will almost always be vector valued. The parameter ( heta) may also be vector valued. We will denote the probability density function of (s) on (S) by (f_ heta) for ( heta in Theta). The distribution of ( s ) could be discrete or continuous.

The likelihood function is the function obtained by reversing the roles of (s) and ( heta) in the probability density function that is, we view ( heta) as the variable and (s) as the given information (which is precisely the point of view in estimation).

The at ( s in S ) is the function ( L_<s>: Theta o [0, infty) ) given by [ L_s( heta) = f_ heta(s), quad heta in Theta ]

In the method of , we try to find the value of the parameter that maximizes the likelihood function for each value of the data vector.

Suppose that the maximum value of ( L_<s> ) occurs at ( u(s) in Theta ) for each ( s in S ). Then the statistic ( u(s) ) is a of ( heta ).

The method of maximum likelihood is intuitively appealing&mdashwe try to find the value of the parameter that would have most likely produced the data we in fact observed.

Since the natural logarithm function is strictly increasing on ( (0, infty) ), the maximum value of the likelihood function, if it exists, will occur at the same points as the maximum value of the logarithm of the likelihood function.

The at ( s in S ) is the function ( ln L_<s> ): [ ln L_<s>( heta) = ln f_ heta(s), quad heta in Theta ] If the maximum value of ( ln L_<s> ) occurs at ( u(s) in Theta ) for each ( s in S ). Then the statistic ( u(s) ) is a maximum likelihood estimator of ( heta )

The log-likelihood function is often easier to work with than the likelihood function (typically because the probability density function (f_ heta(s)) has a product structure).

### Vector of Parameters

An important special case is when (s < heta>= ( heta_1, heta_2, ldots, heta_k)) is a vector of (k) real parameters, so that (Theta subseteq R^k). In this case, the maximum likelihood problem is to maximize a function of several variables. If (Theta) is a continuous set, the methods of calculus can be used. If the maximum value of (L_s) occurs at a point (s< heta>) in the interior of (Theta), then (L_s) has a local maximum at (s< heta>). Therefore, assuming that the likelihood function is differentiable, we can find this point by solving [ frac L_s(s< heta>) = 0, quad i in <1, 2, ldots, k>] or equivalently [ frac ln L_s(s< heta>) = 0, quad i in <1, 2, ldots, k>] On the other hand, the maximum value may occur at a boundary point of (Theta), or may not exist at all.

### Random Samples

The most important special case is when the data variables form a random sample from a distribution.

Suppose that (s = (X_1, X_2, ldots, X_n)) is a random sample of size (n) from the distribution of a random variable (X) taking values in (R), with probability density function (g_ heta) for ( heta in Theta). Then (s) takes values in (S = R^n), and the likelihood and log-likelihood functions for ( s = (x_1, x_2, ldots, x_n) in S ) are egin L_s( heta) & = prod_^n g_ heta(x_i), quad heta in Theta ln L_s( heta) & = sum_^n ln g_ heta(x_i), quad heta in Theta end

### Extending the Method and the Invariance Property

Returning to the general setting, suppose now that (h) is a one-to-one function from the parameter space (Theta) onto a set (Lambda). We can view (lambda = h( heta)) as a new parameter taking values in the space (Lambda), and it is easy to re-parameterize the probability density function with the new parameter. Thus, let ( hat_lambda(s) = f_(lambda)>(s)) for ( s in S ) and ( lambda in Lambda ). The corresponding likelihood function for ( s in S ) is [ hat_s(lambda) = L_sleft[h^<-1>(lambda) ight], quad lambda in Lambda ] Clearly if (u(s) in Theta) maximizes (L_s) for (s in S). Then (hleft[u(s) ight] in Lambda) maximizes (hat_s) for (s in S). It follows that if (U) is a maximum likelihood estimator for ( heta), then (V = h(U)) is a maximum likelihood estimator for ( lambda = h( heta) ).

If the function (h) is not one-to-one, the maximum likelihood function for the new parameter (lambda = h( heta)) is not well defined, because we cannot parameterize the probability density function in terms of (lambda). However, there is a natural generalization of the method.

Suppose that ( h: Theta o Lambda ), and let ( lambda = h( heta) ) denote the new parameter. Define the for ( lambda ) at ( s in S) by [ hat_s(lambda) = maxleft<>( heta): heta in h^<-1> ight> quad lambda in Lambda ] If ( v(s) in Lambda ) maximizes ( hat_<s> ) for each ( s in S ), then ( V = v(s) ) is a of ( lambda ).

This definition extends the maximum likelihood method to cases where the probability density function is not completely parameterized by the parameter of interest. The following theorem is known as the : if we can solve the maximum likelihood problem for ( heta ) then we can solve the maximum likelihood problem for ( lambda = h( heta) ).

In the setting of the previous theorem, if ( U ) is a maximum likelihood estimator of ( heta ), then ( V = h(U) ) is a maximum likelihood estimator of ( lambda ).

As before, if (u(s) in Theta) maximizes (L_s) for (s in S). Then (hleft[u(s) ight] in Lambda) maximizes (hat_s) for (s in S).

## 2. Installation

System Requirements

In order to run the PhyloNet toolkit, you must have Java 1.8.0 or later installed on your system. All references to the java command assume that Java 1.7 is being used.

• To check your Java version, type "java -version" on your command line.

To link to the new downloaded Java 1.8, for mac, try these two commands from command line:

Acquire the current release of PhyloNet by downloading the most recent version of the PhyloNet JAR file. You will have a file named PhyloNet_X.Y.Z.jar , where X is the major version number and Y and Z are the minor version numbers.

Installing the file

Place the jar file in the desired installation directory. The remainder of this document assumes that it is located in directory \$PHYLONET_DIRECTORY . Installation is now complete. In order to run PhyloNet, you must execute the file PhyloNet_X.Y.Z.jar , as described in the next section.

## 3.2 Exploratory analysis

We now explore the available data and analyze the number of claim counts per insured.

### 3.2.1 Summary statistics disregarding exposure

We start our analysis by computing the mean and variance of the number of observed claims. If we denote by (n_i) the number of claims observed for policyholder (i) , we can compute the mean and variance as

[ mu = E(X) = frac<1> cdot sum_^m n_i]

[ sigma^2 = E((X - mu)^2) = frac<1> cdot sum_^m (n_i - mu)^2. ]

In these formulas (m) denotes the number of observations.

### 3.2.2 Summary statistics taking into account exposure

The previous calculation of the mean and variance does not consider the difference in exposure between policyholders. However, it is important to take exposure into account. Let (d_i) be the exposure for policyholder (i) , then we calculate the mean as

[ mu_< ext> = sum_^m frac^m d_i> frac = frac^m n_i>^m d_i>] and the variance as [ sigma^2_< ext>=frac^m(n_i-mu_ cdot d_i)^2>^m d_i> . ] For more intuition behind these estimators, check out the blog of Arthur Charpentier and Section 15.6.6 from Klugman et al..

This is the expected number of accidents for a policyholder who is insurerd throughout the whole year, i.e. (d_i = 1) .

### 3.2.3 Empirical probability distribution

table allows us to easily construct a contingency table of the counts. prop.table can be used to obtain the empirical probability distribution

We can create a better barplot using ggplot

• ggplot() : starts the construction of a ggplot figure
• geom_bar(. ) : creates a bar plot
• aes(<var>) : specifies the variables used to create the plot. To specify your own theme, you define some visualisation parameters and colors that will be used in your ggplot calls.

Instead of manually changing all details of the plot, ggplot also offers some general layout schemes. In this tutorial we use the black and white theme theme_bw() . The weight argument in aes allows you to weight the number of policyholders who file 0 claims, 1 claim and so on by exposure instead of simply counting the number of policyholders. ### 3.2.4 The (a, b, 0) class of distributions

We test whether the data could come from a distribution in the (a, b, 0) class of distributions. Distributions in this family satisfy [ frac<>> = a cdot k+ b, quad k = 1,ldots,infty ]

• geom_point : adds a scatterplot to ggplot. Two variables have to be specified in aes .
• xlab : specifies the name of the label on the x-axis. The observations ((k, frac<>>)) seem to be on a straight line with positive intercept. This indicates that the Negative Binomial distribution might be a good fit for the data.

### References

. 1921 Risk, uncertainty, and profit . Boston, MA : Riverside Press . Google Scholar

. 1972 Production, information costs, and economic organization . Am. Econ. Rev. 62, 777–795. Google Scholar

. 1988 The firm, the market, and the law . Chicago, IL : University of Chicago Press . Google Scholar

. 1992 Dynamics of organizational populations . New York, NY : Oxford University Press . Google Scholar

. 2000 The demography of corporations and industries . Princeton, NJ : Princeton University Press . Google Scholar

. 2001 Fitness and age: review of Carroll and Hannan's demography of corporations and industries . J. Econ. Lit. 39, 105–119. (doi:10.1257/jel.39.1.105). Crossref, Google Scholar

. 1958 The size distribution of business firms . Am. Econ. Rev. 48, 607–617. Google Scholar

Stanley MHR, Amaral LAN, Buldyrev SV, Havlin S, Leschhorn H, Maass P, Salinger MA& Stanley HE

. 1996 Scaling behavior in the growth of companies . Nature 379, 804–806. (doi:10.1038/379804a0). Crossref, ISI, Google Scholar

Amaral LAN, Buldyrev SV, Havlin S, Leschhorn H, Maass P, Salinger MA, Stanley HE& Stanley MHR

. 1997 Scaling behavior in economics. I. Empirical results for company growth . J. Phys. France 7, 621–633. (doi:10.1051/jp1:1997180). Crossref, Google Scholar

Buldyrev SV, Amaral LAN, Havlin S, Leschhorn H, Maass P, Salinger MA, Stanley HE& Stanley MHR

. 1997 Scaling behavior in economics. II. Modeling of company growth . J. Phys. France 7, 635–650. (doi:10.1051/jp1:1997181). Crossref, Google Scholar

. 2001 Zipf distribution of U.S. firm sizes . Science 293, 1818–1820. (doi:10.1126/science.1062081). Crossref, PubMed, Google Scholar

. 2010 Investigating the exponential age distribution of firms . Economics Discussion Papers No. 2010–12, Kiel Institute for the World Economy. Google Scholar

. 1965 Social structure and organizations . Handbook of organizations (ed.

), pp. 153–193. Chicago, IL : Rand McNally . Google Scholar

Dunne T, Roberts MJ& Samuelson L

. 1989 The growth and failure of U.S. manufacturing plants . Q. J. Econ. 104, 671–698. (doi:10.2307/2937862). Crossref, Google Scholar

. 1982 Organizational mortality in the newspaper industries of Argentina and Ireland: an ecological approach . Adm. Sci. Q. 27, 169–198. (doi:10.2307/2392299). Crossref, Google Scholar

. 1990 Organizational mortality: the liabilities of newness and adolescence . Adm. Sci. Q. 35, 530–547. (doi:10.2307/2393316). Crossref, Google Scholar

. 1937 The nature of the firm . Economica 4, 386–405. (doi:10.1111/j.1468-0335.1937.tb00002.x). Crossref, Google Scholar

. 1960 The problem of social cost . J. Law Econ. 3, 1–44. (doi:10.1086/466560). Crossref, Google Scholar

(eds) 1991 The nature of the firm, origins, evolution, and development . Oxford, UK : Oxford University Press . Google Scholar

. 1985 Asset specificity and economic organization . Int. J. Ind. Organ. 3, 365–378. (doi:10.1016/0167-7187(85)90030-X). Crossref, Google Scholar

. 1969 The organization of economic activity: issues pertinent to the choice of market versus nonmarket allocation . The analysis and evaluation of public expenditure: the PPB system , vol. 1, US Joint Economic Committee, 91st Congress, 1st Session, 5973. Washington, DC : US Government Printing Office . Google Scholar

. 1986 The costs and benefits of ownership: a theory of vertical and lateral integration . J. Political Econ. 94, 691–719. (doi:10.1086/261404). Crossref, Google Scholar

. 1990 Property rights and the nature of the firm . J. Political Econ. 98, 1119–1158. (doi:10.1086/261729). Crossref, Google Scholar

. 2002 Complexity, flexibility, and the make-or-buy decision . Am. Econ. Rev. 92, 433–437. (doi:10.1257/000282802320191750). Crossref, Google Scholar

. 1947 Production, information costs, and economic organization . New York, NY : Macmillan . Google Scholar

. 1959 The theory of the growth of the firm . New York, NY : Oxford University Press . Google Scholar

. 1981 A dynamic model of voluntary affiliation . Soc. Forces 59, 705–728. (doi:10.1093/sf/59.3.705). Crossref, Google Scholar

. 1983 An ecology of affiliation . Am. Sociol. Rev. 48, 519–532. (doi:10.2307/2117719). Crossref, Google Scholar

. 1998 Rethinking age dependence in organizational mortality . Am. J. Sociol. 104, 126–164. (doi:10.1086/210004). Crossref, Google Scholar

Freeman J, Carroll GR& Hannan MT

. 1983 The liability of newness: age dependence in organizational death rates . Am. Sociol. Rev. 48, 692–710. (doi:10.2307/2094928). Crossref, Google Scholar

. 2005 Is failure good? Strateg. Manag. J. 26, 617–641. (doi:10.1002/smj.470). Crossref, Google Scholar

. 1990 Generational innovation: the reconfiguration of existing systems and the failure of established firms . Adm. Sci. Q. 35, 9–30. (doi:10.2307/2393549). Crossref, Google Scholar

. 2000 Aging, obsolescence, and organizational innovation . Adm. Sci. Q. 45, 81–112. (doi:10.2307/2666980). Crossref, Google Scholar

Navaretti GB, Castellani D& Pieri F

. 2014 Age and firm growth: evidence from three European countries . Small Business Econ. 43, 823–837. (doi:10.1007/s11187-014-9564-6). Crossref, Google Scholar

. 2010 The exponential age distribution and the Pareto firm size distribution . J. Ind. Competition Trade 10, 389–395. (doi:10.1007/s10842-010-0071-4). Crossref, Google Scholar

. 2003 Statistical methods for survival data analysis , p. 534. New York, NY : Wiley Interscience . Google Scholar

. 1958 Nonparametric estimation from incomplete observations . J. Am. Stat. Assoc. 53, 457–481. (doi:10.1080/01621459.1958.10501452). Crossref, ISI, Google Scholar

. 1972 Theory and applications of hazard plotting for censored failure data . Technometrics 14, 945–966. (doi:10.1080/00401706.1972.10488991). Crossref, Google Scholar

. 1978 Nonparametric inference for a family of counting processes . Ann. Stat. 6, 701–726. (doi:10.1214/aos/1176344247). Crossref, Google Scholar

. 2008 The births and deaths of business establishments in the United States . Mon. Labor Rev. , December 2008, pp. 3–18. See http://www.bls.gov/opub/mlr/2008/12/art1full.pdf. Google Scholar

. 1979 Some additional evidence on survival biases . J. Finance 34, 197–206. (doi:10.1111/j.1540-6261.1979.tb02080.x). Crossref, Google Scholar

## Mathematical Formulation

We present a mathematical formulation of the three-period life cycle model.

Set
P = set of periods =

Parameters
(w_p) = wage income in period (p), (forall p in P)
(r) = interest rate
(eta) = discount factor

Decision Variables
(c_p) = consumption in period (p), (forall p in P)

Objective Function
Let (u()) be the utility function and let (u(c_p)) be the utility value associated with consuming (c_p). Utility in future periods is discounted by a factor of (eta). Then, the objective function is to maximize the total discounted utility:

maximize (u(c_1) + eta u(c_2) + eta^ <2>u(c_3))

Constraints
The main constraint in the life cycle model is the lifetime budget constraint, which asserts that, over the life cycle, the present value of consumption equals the present value of wage income. From above, (r) is the interest rate therefore, (R = 1 + r) is the gross interest rate. If I invest one dollar in this period, then I receive (R) dollars in the next period. The expression for the present value of the consumption stream over the life cycle is

Similarly, the expression for the present value of the wage income stream over the life cycle is
[w_1 + frac + frac>.]

The lifetime budget constraint states that the present value of the two streams must be equal:
[c_1 + frac + frac> = w_1 + frac + frac>.]

To avoid numerical difficulties, we add constraints requiring the consumption variables to take a non-negative value:
(c_1 geq 0.0001, c_2 geq 0.0001, c_3 geq 0.0001)

Providing an initial starting point is helpful for some solvers as an example, one possible starting point for this example is
(c_1 approx 1, c_2 approx 1, c_3 approx 1)

To solve the three-period life cycle consumption problem, we need to specify a utility function and the values of the parameters. The solution specifies the amount that Joey should consume in each period to maximize his utility. Note that, in the next case study, the Life Cycle Consumption Problem, we generalize the model from three periods to (n) periods.

To solve your own three-period life cycle problems, check out the Three-Period Life Cycle Problem demo.

## Collaborative Research Projects

Biostatistics Center researchers advance the center’s mission of collaborative research by developing and implementing innovative practical methods for the design, execution, data monitoring, analyses and reporting of clinical studies. Current Biostatistics Center research includes the design and analyses of studies that focus on patient-focused outcome measures that integrate efficacy and safety, personalized treatment, cost-effectiveness analyses, response-adaptive randomization, and pragmatic evaluation of diagnostic technologies.

## Practical data analysis using real biological examples

Now available with Macmillan’s new online learning platform Achieve, Analysis of Biological Data provides a practical foundation of statistics for biology students. Every chapter has several biological or medical examples related to key statistics concepts, and each example is prefaced by a substantial description of the biological setting. The emphasis on real and interesting examples carries into the problem sets where students have a wealth of practice problems based on real data.

The third edition features over 200 new examples and problems. These include new calculation practice problems, which guide the student step by step through the methods, and a greater number of examples and topics that come from medical and human health research. Every chapter has been carefully edited for even greater clarity and ease of use, and is easier than ever to access through Achieve.

Achieve for Analysis of Biological Data connects the problem-solving approach and real world examples in the book to rich digital resources that foster further understanding and application of statistics. Assets in Achieve support learning before, during, and after class for students, while providing instructors with class performance analytics in an easy-to-use interface.

• Over 3,000 homework questions of varying difficulty, Bloom’s level, and question type. Every homework question includes a hint, answer-specific feedback, and a fully worked solution. Question types in Achieve include
• Multiple choice
• Ranking
• Sorting
• Numeric entry
• Multi-part questions
• Questions with algorithmically regenerating values
• Whiteboard-style problem-solving videos
• StatTutor video lessons
• Animated lectures and documentary-style videos that illustrate real world scenarios involving statistics

Statistical software options

• CrunchIt!, Macmillan’s proprietary online statistical software powered by R, handles every computation and graphing function an introductory statistics student needs. CrunchIt! is preloaded with data sets, and it allows editing and importing additional data.
• Students also receive access to JMP Student Edition (developed by SAS). With the student edition of JMP, students handle large data, visualizations, and analysis for which the professional version is renowned. Additionally, text-specific data sets are included for download.
• For other statistical software, Achieve includes data sets, including those for
• Excel
• Minitab
• R & RCmdr
• SPSS
• TI Calculators
• Mac-text & PC-text
• CSV file export

Achieve online homework. Based on research from Macmillan’s Learning Science team, Achieve marries the powerful, tutorial-style assessment of Sapling Learning with rich book-specific resources in one easy-to-use, accessible platform.

New practice and assignment problems to every chapter covering all major concepts and skills.

Integrated online activities with the text for learning the R statistical software environment.

New chapter added on survival analysis , a vital topic in biostatistics.

New instructor resources , including answers to assignment problems and R Code labs, are available at whitlockschluter3e.zoology.ubc.ca.

Look Inside The Analysis of Biological Data

Michael C. Whitlock Dolph Schluter

## 3.1 MLE Regression

Okay, so we have a method for calculating ML estimates in R, so now we want to apply that to some regression models. Let’s return to our formula for linear model. Let’s use the same data fom our least squares example to see if we can get similar results. Let’s return to our linear model. We will start by looking at only a single predictor, mpg so our system of linear equations looks like this:

Recall from the video, we are assuming each datum is drawn from some distribution with a mean equal to (eta_0 + eta_1x_1) .

So, this distribution, then becomes our distribution we are maximizing, where

[ y_i sim mathcal(eta_0 + eta_1x_i, sigma) ]

Let’s try plugging this distribution into our loglik function to calculate the likelihood of this distribution.

Fantastic! We have found similar parameter estimates to our least squares method! Lets see how this compares to what glm gives us.

So what about applying this to multiple parameters? Well, let’s see. We will use the same model formula as our multiple parameter OLS:

Which can be represented by the system of linear equations:

This means each y value is drawn from some hypothetical distribution where:

[ y_i sim mathcal(eta extbf, sigma) ]

And we can compare these results to glm .

### 3.1.1 Another tangent on optimizers

So one thing you may notice is that I specified the values where the algorithm should start looking for our parameter estimates

This is a result of us having at least a decent idea of where the idea minimum should be. What happens when we specify really bad starting values for the optimization algorithm?

You see we find new optima. NOw what if we use a different optimization function

This optimizer does actually find the global maxima despite the bad starting values.

I firmly believe understanding optimizers will make anyone better at diagnosing the functionality of their linear models.

## 5. Discussion

### 5.1. Tensor Estimation

Simulations demonstrate that tensor estimates may be considerably improved by exploiting the Rician noise distributions of MR data ( Fig. 1 ). However, using this information requires estimating two additional parameters (the intensity of the reference signal and noise level) which are not needed for LLMMSE. Thus, a minimum of seven diffusion weighted images and a reference image are required. At moderate and high SNR, DTEMRL improves the reliability of tensor estimation, and the magnitude of improvements is greater for tensors of high anisotropy. For very low SNR, simulations indicate that the DTEMRL method may reduce estimation performance, likely due to variability introduced with the additional parameters. These results suggest the need for regularization of DTEMRL when the SNR is very low or, similarly, when very few DW images are acquired. Experiments with clinical data demonstrate consistent improvements with DTEMRL. DTEMRL operates in the stable, “high SNR” regime for DTI studies using only one acquisition at 1.5T. Typically, 3 to 5 averages are acquired at 1.5T to improve SNR by 70 to 120 percent. Simulations indicate that improvement in SNR would reduce the likelihood that DTEMRL would continue to out perform LLMMSE without decreasing proportional improvement of DTEMRL over LLMMSE.

Experiments with clinical data demonstrate that DTEMRL remains robust in spite of the approximate nature of the tensor model, presence of artifact, and spatially heterogeneous tissue. Reproducibilities of FA ( Fig. 3 E, F ), tensor coefficients ( Fig. 5 A ), and MD ( Fig. 5 B ) are greater (lower standard deviation) for DTEMRL than LLMMSE. The percent improvements are greatest for tensor coefficients in white matter. Negative impacts of reduced SNR are mitigated by high FA. Once data is of sufficient SNR for DTEMRL to offer improved reliability, the proportional benefits are essentially constant across SNR while the magnitude of the improvement increases with FA ( Fig. 1 B, E ). Although numeric optimization depends upon the initialization accuracy, DTEMRL tensor estimates remain stable in spite of a 20 percent mis-specification of initial noise level ( Fig. 1 C, F ).

Without a valid ground truth, the full reliability cannot be assessed with in vivo data. The “high SNR” estimates are not a suitable proxy because the estimates with LLMMSE and DTEMRL are different. In LLMMSE, including additional observations reduces the variability in the DW image intensity, but also reinforces bias on each DW image. With DTEMRL, additional observations enable refinement of the noise estimate, and reduce both variability and bias in the estimated DW image intensities during tensor estimation. Low SNR tends to positively bias FA in regions of low anisotropy and negatively bias FA in regions of high anisotropy with the LLMMSE method . The systematic bias between FA estimated with LLMMSE and DTEMRL ( Fig. 4 ) is in the opposite direction, which suggests that DTEMRL would reduce bias in the estimated tensors. However, additional acquisitions using k-space averaging and/or complex-valued imaging data are required to generate unbiased, high SNR clinical data and verify potential DTEMRL bias correction properties.

### 5.2. Noise Level Estimation

The underlying noise estimation procedure ( Fig. 2 ) is stable, accurate, and does not depend on spatial correlations or the existence of a background region. It also avoids dealing with spatially correlated noise, which is common in DTI due to up-sampling and/or interpolation. With the widespread use of parallel imaging methods, this noise level estimator – while specifically developed for use in our improved tensor estimation procedure – could also have far wider utility beyond diffusion tensor imaging.

### 5.3. Conclusion

The bimodal performance of DTEMRL suggests an opportunity for a hybrid approach to tensor estimation even when SNR is unknown. Simulations indicate that DTEMRL either substantially improves tensor estimation or results in degraded reliability ( Fig. 1 ) which is influenced by initialization. The newly presented noise level estimation method provides a robust SNR estimate that does not depend on tensor estimation, while the LLMMSE method estimates FA. Together, these estimates may enable a decision framework to transition between DTEMRL and LLMSE based on expected performance.

DTEMRL provides a platform on which to develop ML approaches for robust DW image analysis, regularization, and spatial filtering. MR images are often corrupted by artifacts which are not well modeled by additive or Rician noise. Detection and/or removal of these artifacts could be accomplished directly with likelihood measures. Alternatively, DTEMRL could be desensitized to outliers through use of a robust likelihood function. Furthermore, prior probabilities could be associated with spatial distribution for tensor field regularization or with the tensors themselves to transform this maximum likelihood approach into a Bayesian maximum a posteriori approach. To facilitate clinical applications and further research, the DTEMRL research software may be optimized, as the current Matlab implementation requires 200 ms per voxel on a PC.