# How to calculate a confidence interval

In mathematical statistics, confidence intervals are often used to draw practical conclusions when analyzing and systematizing data. The method gives an interval estimate of a mean or a proportion that takes the standard error into account. This makes the estimate more reliable, since it extends in both directions from the point value under study.

The method is based on the model of classical mathematical statistics, which assumes that infinitely many samples can be drawn from the general population. Let there be a population $\xi$ with a distribution function $F_\xi(x, \theta)$ that is known up to a parameter $\theta$. From this population, a sample of size $n$ is obtained, $X_n = (x_1, \dots, x_n)$. The parameter can be considered one-dimensional and belonging to a set $T$ of real numbers, which is written as $\theta \in T \subset \mathbb{R}$.

Suppose that for some $\alpha$ lying between zero and one there are statistics $S^-(X_n, \alpha)$ and $S^+(X_n, \alpha)$ that satisfy $P\{S^-(X_n, \alpha) < \theta < S^+(X_n, \alpha)\} = 1 - \alpha$. Then the interval $(S^-, S^+)$ is called a confidence interval for the parameter $\theta$. The bounds of the interval depend on the sample through the statistics $S^-$ and $S^+$. The probability that the interval from $S^-$ to $S^+$ covers the true parameter $\theta$ is the confidence probability $P = 1 - \alpha$.
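The meaning of the confidence probability can be checked by simulation. The sketch below is an illustration only: it assumes a normal population with a known standard deviation and uses the standard interval for the mean; the function name `coverage` and all numeric values are hypothetical choices, not part of the definition above.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def coverage(theta=10.0, sigma=2.0, n=25, alpha=0.05, trials=4000, seed=0):
    """Return the fraction of simulated intervals that contain the true mean theta."""
    rng = random.Random(seed)
    # Quantile of order 1 - alpha/2 of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sigma / sqrt(n)
    hits = 0
    for _ in range(trials):
        sample_mean = fmean(rng.gauss(theta, sigma) for _ in range(n))
        # The bounds S- and S+ are random; theta is fixed.
        if sample_mean - half_width < theta < sample_mean + half_width:
            hits += 1
    return hits / trials


observed = coverage()
print(observed)  # should be close to 1 - alpha = 0.95
```

The parameter stays fixed while the bounds change from sample to sample, so the share of intervals that cover it should settle near $1 - \alpha$.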

The general method for constructing confidence intervals relies on a statistic $Y(S(X_n), \theta)$, where $\theta$ is the parameter being estimated and $S(X_n)$ is a point estimate. The statistic must have the following properties:

the distribution function $F_Y(x)$ of the statistic is known and does not depend on $\theta$.

the statistic, viewed as a function of $\theta$, is continuous and monotonic.

After such a function is found, the significance level $\alpha$ has to be set. As a rule, its value is taken small so that the confidence probability $P = 1 - \alpha$ is as large as possible. The constructed interval will then contain the true value of the parameter with high probability, although not with certainty. The interval is usually built so that the probabilities of falling outside it on either side are equal, that is, $\alpha/2$ for each of the regions $(-\infty, S^-(X_n, \alpha))$ and $(S^+(X_n, \alpha), +\infty)$.

Then it is necessary to find the quantiles of the statistic $Y$ of orders $\alpha/2$ and $1 - \alpha/2$, denoted $y_{\alpha/2}$ and $y_{1-\alpha/2}$, where $\alpha$ is the significance level. By the definition of a quantile, the probability that the statistic falls between them is the difference of the distribution function at the two points and equals $P = 1 - \alpha$:

$$P\left(y_{\alpha/2} < Y(S(X_n), \theta) < y_{1-\alpha/2}\right) = F_Y(y_{1-\alpha/2}) - F_Y(y_{\alpha/2}) = 1 - \alpha.$$
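The quantile identity above can be verified numerically. In this minimal sketch the standard normal distribution stands in for the distribution of the statistic, which is an assumption made only for illustration; the helper name `central_probability` is hypothetical.

```python
from statistics import NormalDist


def central_probability(alpha, dist=NormalDist()):
    """Probability mass between the quantiles of orders alpha/2 and 1 - alpha/2."""
    q_low = dist.inv_cdf(alpha / 2)
    q_high = dist.inv_cdf(1 - alpha / 2)
    # Difference of the distribution function at the two quantiles.
    return dist.cdf(q_high) - dist.cdf(q_low)


for level in (0.10, 0.05, 0.01):
    print(level, central_probability(level))
```

For each significance level the printed probability should agree with one minus that level up to rounding error.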

Property of Statistics and Distribution

Since the statistic $Y$ is constructed to be monotonic and continuous in the parameter $\theta$, its inverse function $Y^{-1}$ with respect to $\theta$ exists. For definiteness, assume that $Y$ is monotonically increasing in $\theta$. With $\alpha$ the significance level and $y_{\alpha/2}$, $y_{1-\alpha/2}$ the quantiles of $Y$, the event $y_{\alpha/2} < Y < y_{1-\alpha/2}$ is equivalent to the inequality $Y^{-1}(y_{\alpha/2}) < \theta < Y^{-1}(y_{1-\alpha/2})$. This gives the confidence interval for $\theta$:

$$P\left(S^-(X_n, \alpha) < \theta < S^+(X_n, \alpha)\right) = 1 - \alpha,$$

where $S^-(X_n, \alpha) = Y^{-1}(y_{\alpha/2})$ and $S^+(X_n, \alpha) = Y^{-1}(y_{1-\alpha/2})$.

Thus, the bounds $S^-$ and $S^+$ of the confidence interval for the parameter are the values of the inverse function at the quantiles of the statistic of orders $\alpha/2$ and $1 - \alpha/2$, where $\alpha$ is the significance level. When the function under consideration is monotonically decreasing, the signs in the inequality are reversed, so the two quantiles swap roles.
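The inversion step can be sketched numerically. The code below is one possible illustration, not a prescribed method: it inverts a monotone statistic by bisection and applies it to the mean of a normal sample with a known standard deviation. The helper names, the sample values, and `sigma = 1.0` are all assumptions made for the example.

```python
from math import sqrt
from statistics import NormalDist, fmean


def solve_monotone(g, target, lo, hi, tol=1e-10):
    """Solve g(theta) = target by bisection; g must be monotone on [lo, hi]."""
    increasing = g(hi) > g(lo)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (g(mid) < target) == increasing:
            lo = mid  # the root lies to the right of mid
        else:
            hi = mid  # the root lies to the left of mid
    return (lo + hi) / 2


def pivot_interval(pivot, q_low, q_high, lo, hi):
    """Invert the statistic at both quantiles and order the resulting bounds."""
    a = solve_monotone(pivot, q_low, lo, hi)
    b = solve_monotone(pivot, q_high, lo, hi)
    # For a decreasing statistic the bounds come out swapped, hence min/max.
    return min(a, b), max(a, b)


sample = [9.1, 10.4, 11.2, 9.8, 10.9, 10.1]  # hypothetical data
sigma = 1.0                                   # assumed known
alpha = 0.05
n = len(sample)
xbar = fmean(sample)


def pivot(theta):
    """Standardized mean; it decreases as theta grows."""
    return (xbar - theta) * sqrt(n) / sigma


z = NormalDist().inv_cdf(1 - alpha / 2)
low, high = pivot_interval(pivot, -z, z, xbar - 100, xbar + 100)
print(low, high)
```

Because this statistic decreases in the parameter, the lower quantile yields the upper bound, which is the sign reversal mentioned above.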

Using the general approach, one can build confidence intervals for a normal general population based on a number of statements. Let the sample $X_n$ be taken from the population $\xi \sim N(\mu, \sigma^2)$, i.e. one that follows the normal distribution law with mathematical expectation $\mu$ and variance $\sigma^2$. In this setting, the following statements hold:

The function $(\bar{X} - \mu)\sqrt{n} / \sigma$ follows the standard normal distribution law. Here $\bar{X}$ is the sample mean, which estimates the unknown expectation. Subtracting the true value $\mu$ gives a quantity with zero expectation, and dividing by the standard deviation of the mean, $\sigma / \sqrt{n}$, normalizes it. Since the law of the original general population is normal, the arithmetic mean of the random variables is also a normally distributed random variable.

If $S^2$ is the unbiased estimate of the variance, then the function $(\bar{X} - \mu)\sqrt{n} / S$ follows Student's distribution with $n - 1$ degrees of freedom.

The statistic $(n - 1)S^2 / \sigma^2$, i.e. $n - 1$ multiplied by the unbiased estimate of the variance and divided by its true value, follows a chi-square distribution with $n - 1$ degrees of freedom. The numerator contains a sum of squares of normal variables, which division by $\sigma^2$ reduces to standard normal form.

When the expectation $\mu$ is known and the variance is estimated by $S^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$, the statistic $nS^2 / \sigma^2$ follows a chi-square distribution with $n$ degrees of freedom.
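The two chi-square statements can be checked roughly by simulation, using the fact that the mean of a chi-square variable equals its number of degrees of freedom. This is only a sanity check under assumed values (`n = 5`, `mu = 3.0`, `sigma = 2.0`); the function name `simulate` is hypothetical.

```python
import random
from statistics import fmean


def simulate(n=5, mu=3.0, sigma=2.0, reps=20000, seed=1):
    """Average the two variance statistics over many simulated normal samples."""
    rng = random.Random(seed)
    with_sample_mean, with_known_mean = [], []
    for _ in range(reps):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = fmean(xs)
        # Unbiased estimate: deviations from the sample mean, divisor n - 1.
        s2_unbiased = sum((x - xbar) ** 2 for x in xs) / (n - 1)
        # Known expectation: deviations from mu, divisor n.
        s2_known = sum((x - mu) ** 2 for x in xs) / n
        with_sample_mean.append((n - 1) * s2_unbiased / sigma ** 2)
        with_known_mean.append(n * s2_known / sigma ** 2)
    return fmean(with_sample_mean), fmean(with_known_mean)


mean_unbiased, mean_known = simulate()
print(mean_unbiased, mean_known)  # expected to be near n - 1 = 4 and n = 5
```

Matching means do not prove the distributional claims, but a clear mismatch would point to an error in the degrees of freedom.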

Exact interval

There are a number of rules for building exact intervals for the mathematical expectation and the variance of a normally distributed random variable. Two cases are distinguished: the variance is either known or unknown. The exact confidence interval is constructed using the general scheme. The following rules apply:

When the variance $\sigma^2$ is known, the confidence interval for the expectation with confidence probability $1 - \alpha$ has the bounds $\bar{X} - \frac{\sigma}{\sqrt{n}} z_{1-\alpha/2}$ and $\bar{X} + \frac{\sigma}{\sqrt{n}} z_{1-\alpha/2}$, where $\bar{X}$ is the sample mean and $z_{1-\alpha/2}$ is the quantile of order $1 - \alpha/2$ of the standard normal distribution, which is symmetric.

When the variance is unknown, the unbiased estimate $S^2$ is taken in its place. The bounds then become $\bar{X} - \frac{S}{\sqrt{n}} t_{1-\alpha/2}$ and $\bar{X} + \frac{S}{\sqrt{n}} t_{1-\alpha/2}$, where $t_{1-\alpha/2}$ is the quantile of Student's distribution with $n - 1$ degrees of freedom.

For the variance with a known expectation $\mu$, the estimate $S^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$ is used. With probability $1 - \alpha$ the true variance lies in the interval from $nS^2 / \chi^2_{1-\alpha/2}(n)$ to $nS^2 / \chi^2_{\alpha/2}(n)$, where $\chi^2_{p}(n)$ is the quantile of order $p$ of the chi-square distribution with $n$ degrees of freedom.

If the expectation is unknown, the unbiased estimate $S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{X})^2$ is taken as the variance estimate. This changes the number of degrees of freedom of the chi-square distribution to $n - 1$, and the factor $n$ is replaced by $n - 1$: the interval runs from $(n-1)S^2 / \chi^2_{1-\alpha/2}(n-1)$ to $(n-1)S^2 / \chi^2_{\alpha/2}(n-1)$. Everything else remains unchanged.
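The rules above can be written as a few small functions. This is a sketch under stated assumptions: the function names are hypothetical, the Student and chi-square quantiles are passed in as arguments because they are normally read from tables, and the numeric inputs (including the table values 3.247 and 20.483 for 10 degrees of freedom) are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def mean_interval_known_sigma(xbar, sigma, n, alpha):
    """Interval for the expectation when the standard deviation sigma is known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width


def mean_interval_unknown_sigma(xbar, s, n, t_quantile):
    """Interval for the expectation; t_quantile is the Student quantile
    of order 1 - alpha/2 with n - 1 degrees of freedom, taken from a table."""
    half_width = t_quantile * s / sqrt(n)
    return xbar - half_width, xbar + half_width


def variance_interval(scaled_sum, chi2_low, chi2_high):
    """Interval for the variance.

    scaled_sum is n * S^2 (known expectation) or (n - 1) * S^2 (unknown);
    chi2_low and chi2_high are the chi-square quantiles of orders
    alpha/2 and 1 - alpha/2 with the matching degrees of freedom.
    """
    return scaled_sum / chi2_high, scaled_sum / chi2_low


mean_low, mean_high = mean_interval_known_sigma(50.0, 4.0, 16, 0.05)
var_low, var_high = variance_interval(40.0, 3.247, 20.483)
print(mean_low, mean_high)
print(var_low, var_high)
```

Note that the larger chi-square quantile produces the lower bound for the variance, because the quantile sits in the denominator.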

Asymptotic approximation

However, it is not always possible to calculate an exact confidence interval. In this case an approximate, asymptotic interval is constructed. Suppose that for some $\alpha \in (0, 1)$ there exist statistics $S^-(X_n, \alpha)$ and $S^+(X_n, \alpha)$ such that $\lim P\{S^-(X_n, \alpha) < \theta < S^+(X_n, \alpha)\} = 1 - \alpha$ as $n$ tends to infinity. Then the interval $(S^-(X_n, \alpha), S^+(X_n, \alpha))$ is an asymptotic confidence interval for the parameter $\theta$. Its construction is based on the properties of asymptotically normal estimates. That is, the first step is to choose an estimate of the parameter that has the property of asymptotic normality.

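The parameter $\theta$ is estimated by a statistic $\hat{\theta} = \hat{\theta}(X_n)$ such that $\sqrt{n}(\hat{\theta} - \theta) \xrightarrow{d} N(0, \sigma^2(\theta))$ as $n \to \infty$, where $\sigma^2(\theta)$ is the asymptotic scattering coefficient (asymptotic variance). If several estimates of one parameter are available, the one with the lowest coefficient is considered the best.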

Applying the continuity theorem, we can show that the function $\sqrt{n}(\hat{\theta} - \theta)$, divided by the standard deviation $\sigma(\hat{\theta})$, converges in distribution as $n \to \infty$ to a random variable with the standard normal distribution. Here $\hat{\theta}$ is the asymptotically normal estimate of the parameter $\theta$. The theorem states that if a sequence of random vectors $k_n = (k_n^1, \dots, k_n^m)$ converges in distribution to $k$ and the function $H: \mathbb{R}^m \to \mathbb{R}$ is continuous, then $H(k_n) \xrightarrow{d} H(k)$ as $n \to \infty$. It follows that $\sqrt{n}(\hat{\theta} - \theta) / \sigma(\hat{\theta}) \xrightarrow{d} N(0, 1)$.

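Hence the following relation holds for an asymptotically normal estimate $\hat{\theta}$ of the parameter $\theta$:

$$P\left\{-z_{1-\alpha/2} < \frac{\sqrt{n}(\hat{\theta} - \theta)}{\sigma(\hat{\theta})} < z_{1-\alpha/2}\right\} \to \frac{1}{\sqrt{2\pi}} \int_{-z_{1-\alpha/2}}^{z_{1-\alpha/2}} e^{-y^2/2}\,dy = 1 - \alpha.$$

Thus, the probability that the normalized statistic falls in the region $(-z_{1-\alpha/2}, z_{1-\alpha/2})$ tends to $1 - \alpha$. Here $z_{1-\alpha/2}$ is the quantile of order $1 - \alpha/2$ of the standard normal distribution. The accuracy of the interval estimate is characterized by the width of the confidence region. The larger the sample size, the narrower the interval and the more accurate the interval estimate.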
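The dependence of the width on the sample size can be made concrete. The sketch below assumes an asymptotic standard deviation of 1.0 and a significance level of 0.05, both arbitrary illustrative values; `asymptotic_width` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist


def asymptotic_width(sigma, n, alpha=0.05):
    """Width of the asymptotic interval: 2 * z * sigma / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * z * sigma / sqrt(n)


widths = {n: asymptotic_width(1.0, n) for n in (10, 100, 1000, 10000)}
for n, width in widths.items():
    print(n, round(width, 4))
```

Since the width is proportional to $1/\sqrt{n}$, a hundredfold increase in the sample size narrows the interval only tenfold.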

Examples of problem solving

Let there be $n$ trials, of which $m$ are successful. The sample result is $X_n = (a_1, \dots, a_n)$ and consists of zeros and ones. The likelihood function has the form $L(X_n, p) = p^m q^{n-m}$, where $q = 1 - p$ and $p \in (0, 1)$. To find the maximum likelihood estimate of the parameter $p$, take the logarithm: $\ln L = m \ln p + (n - m) \ln(1 - p)$. The maximum is found from the condition

$$\frac{d \ln L}{dp} = \frac{m}{p} - \frac{n - m}{1 - p} = \frac{m - mp - np + mp}{p(1 - p)} = \frac{m - np}{p(1 - p)} = 0.$$

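From here we get the estimate $\hat{p} = m / n$ for the success probability $p$, where $m$ is the number of successes in $n$ trials. Now we need to make sure that $\hat{p}$ maximizes the likelihood function $L$, that is,

$$\frac{d^2 \ln L}{dp^2} = -\frac{m}{p^2} - \frac{n - m}{(1 - p)^2} < 0.$$

It is easy to show that the estimate is asymptotically normal. With $a_i$ the individual zero-one outcomes and $q = 1 - p$,

$$\sqrt{n}\left(\frac{m}{n} - p\right) = \frac{m - np}{\sqrt{n}} = \frac{\sum_{i=1}^{n}(a_i - p)}{\sqrt{n}} \xrightarrow{d} N(0, pq) \quad \text{as } n \to \infty.$$

Using this statement, the following convergence is obtained:

$$\frac{\sqrt{n}\left(\frac{m}{n} - p\right)}{\sqrt{\frac{m}{n}\left(1 - \frac{m}{n}\right)}} \xrightarrow{d} N(0, 1) \quad \text{as } n \to \infty.$$

Therefore the confidence interval with significance level $\alpha$ is

$$\left(\frac{m}{n} - z_{1-\alpha/2}\frac{\sqrt{\frac{m}{n}\left(1 - \frac{m}{n}\right)}}{\sqrt{n}},\ \frac{m}{n} + z_{1-\alpha/2}\frac{\sqrt{\frac{m}{n}\left(1 - \frac{m}{n}\right)}}{\sqrt{n}}\right),$$

where $z_{1-\alpha/2}$ is the quantile of order $1 - \alpha/2$ of the standard normal distribution.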
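The resulting interval for a proportion can be computed directly. The following is a minimal sketch of that formula; the function name `proportion_interval` and the counts (40 successes in 100 trials) are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(m, n, alpha=0.05):
    """Asymptotic interval for a success probability from m successes in n trials."""
    p_hat = m / n  # maximum likelihood estimate
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sqrt(p_hat * (1 - p_hat)) / sqrt(n)
    return p_hat - half_width, p_hat + half_width


p_low, p_high = proportion_interval(40, 100)
print(round(p_low, 4), round(p_high, 4))
```

The interval is asymptotic, so it is an approximation that is better suited to large samples and to proportions that are not close to zero or one.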

The following problem is more practical. Suppose an enterprise wants to find out its average salary. The units of measurement do not matter. Thirty employees were selected at random, and an analysis of their income showed a mean monthly salary of 30 thousand with a standard deviation of five thousand. It is necessary to find the interval for the average salary for half a month with a confidence probability of 0.99, that is, with an error probability of 0.01.

First, briefly write down the conditions. It is known that $n = 30$, the sample mean is $\bar{X} = 30000$, $S = 5000$, and $P = 0.99$. To solve the problem, a table of Student's distribution is needed. It contains reference values of the t-criterion for different probabilities. According to it, for the given $P$ and $n - 1 = 29$ degrees of freedom the criterion is 2.756.

Substituting the initial data into the formula and performing the calculations shows that the required confidence region for the true average monthly salary $\mu$ is the interval from 27484 to 32516: $30000 - 2.756 \cdot (5000 / \sqrt{30}) < \mu < 30000 + 2.756 \cdot (5000 / \sqrt{30})$. That is, the average salary of employees for half a month lies in the interval (13742, 16258). Problem solved.
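The arithmetic of this example can be reproduced in a few lines. The sketch uses the values given in the problem and the table value 2.756 for the t-criterion; the variable names are illustrative.

```python
from math import sqrt

n = 30            # sample size
xbar = 30000.0    # sample mean of the monthly salary
s = 5000.0        # sample standard deviation
t_crit = 2.756    # Student table value for P = 0.99 and 29 degrees of freedom

half_width = t_crit * s / sqrt(n)
low, high = xbar - half_width, xbar + half_width

# Half-month salary: both bounds are simply halved.
half_month = (low / 2, high / 2)

print(round(low), round(high))                    # 27484 32516
print(round(half_month[0]), round(half_month[1])) # 13742 16258
```

Halving the bounds is valid because a confidence interval scales linearly with the quantity being measured.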

How to find confidence interval?

In practice, it is often not so easy to calculate the confidence region. A high confidence probability usually goes together with a large sample, so the calculations become cumbersome. The confidence probability determines the reliability of the results obtained; in other words, it shows the probability with which the true value falls within the found interval. A confidence level of 95 to 99.9% is usually used.

To obtain the range with high accuracy, online calculation services are often used.