How to calculate a confidence interval. Confidence interval. ABC of medical statistics. Chapter III. How to find a confidence interval in statistics

An appraiser often has to analyze the real estate market segment in which the appraised property is located. If the market is developed, it can be difficult to analyze the entire set of listed properties, so a sample of properties is used for the analysis. This sample is not always homogeneous; sometimes it has to be cleared of extremes, that is, of offers priced too high or too low. A confidence interval is used for this purpose. The purpose of this study is to compare two methods of calculating the confidence interval and to choose the best option for working with different samples in the estimatica.pro system.

A confidence interval is an interval of values of a characteristic, calculated from a sample, that contains the estimated parameter of the general population with a known probability.

The point of calculating a confidence interval is to build, from the sample data, an interval for which it can be asserted with a given probability that the value of the estimated parameter lies within it. In other words, the confidence interval contains the unknown value of the estimated quantity with a certain probability. The wider the interval, the less precise the estimate.

There are different methods for determining the confidence interval. In this article, we will consider 2 ways:

  • through the median and standard deviation;
  • through the critical value of the t-statistic (Student's coefficient).

Stages of the comparative analysis of the CI calculation methods:

1. form a data sample;

2. process it with statistical methods: calculate the mean, median, variance, etc.;

3. calculate the confidence interval in two ways;

4. analyze the cleaned samples and the resulting confidence intervals.

Stage 1. Data sampling

The sample was formed using the estimatica.pro system. It included 91 offers for the sale of 1-room apartments of the "Khrushchev" layout type in the 3rd price zone.

Table 1. Initial sample (price of 1 sq. m, c.u.)

Fig.1. Initial sample



Stage 2. Processing of the initial sample

Sample processing by statistical methods requires the calculation of the following values:

1. Arithmetic mean.

2. Median: a number that characterizes the sample such that half of the sample elements are greater than the median and the other half are less. For a sample with an odd number of values it is the middle element of the sorted sample.

3. Range: the difference between the maximum and minimum values in the sample.

4. Variance: used to estimate the variation in the data more accurately.

5. Standard deviation of the sample (hereinafter SD): the most common indicator of the dispersion of values around the arithmetic mean.

6. Coefficient of variation: reflects the degree of dispersion of the values relative to the mean.

7. Coefficient of oscillation: reflects the relative fluctuation of the extreme price values in the sample around the mean.
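The indicators listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the prices are hypothetical (the article's sample is not reproduced here), the variance is the sample variance with the n - 1 divisor, the coefficient of variation is taken as SD / mean, and the coefficient of oscillation as range / mean, both in percent.

```python
import statistics

def describe(prices):
    """Return the descriptive statistics listed above for a list of prices."""
    mean = statistics.mean(prices)
    value_range = max(prices) - min(prices)
    std = statistics.stdev(prices)  # sample SD (n - 1 divisor), an assumption
    return {
        "mean": mean,
        "median": statistics.median(prices),
        "range": value_range,
        "variance": statistics.variance(prices),
        "std": std,
        "cv_percent": 100 * std / mean,                   # coefficient of variation
        "oscillation_percent": 100 * value_range / mean,  # coefficient of oscillation
    }

# Hypothetical prices per sq. m (c.u.), not the article's data
sample = [52000, 54500, 49800, 61000, 47500, 55200, 53100]
stats = describe(sample)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

A large coefficient of oscillation next to a moderate coefficient of variation, as in the article's sample, points to a few extreme offers rather than to a generally scattered market.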

Table 2. Statistical indicators of the original sample

The coefficient of variation, which characterizes the homogeneity of the data, is 12.29%, but the coefficient of oscillation is too large. Thus, we can state that the original sample is not homogeneous, so let's move on to calculating the confidence interval.

Stage 3. Calculation of the confidence interval

Method 1. Calculation through the median and standard deviation.

The confidence interval is determined as follows: the lower bound is the median minus the standard deviation; the upper bound is the median plus the standard deviation.

Thus, the confidence interval is (47179 c.u.; 60689 c.u.).
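A minimal sketch of Method 1, assuming the sample standard deviation (n - 1 divisor) and using hypothetical prices rather than the article's 91 offers. The function name `trim_by_median_std` is illustrative only.

```python
import statistics

def trim_by_median_std(prices):
    """Keep only the offers inside (median - SD, median + SD)."""
    median = statistics.median(prices)
    std = statistics.stdev(prices)
    low, high = median - std, median + std
    kept = [p for p in prices if low <= p <= high]
    return (low, high), kept

# Hypothetical prices per sq. m (c.u.), for illustration only
sample = [52000, 54500, 49800, 61000, 47500, 55200, 53100]
interval, kept = trim_by_median_std(sample)
print("interval:", interval)
print("kept:", kept, "removed:", len(sample) - len(kept))
```

Because the bounds are only one standard deviation from the median, this rule removes relatively few offers from a tightly clustered sample and more from a scattered one.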

Fig. 2. Values within confidence interval 1.



Method 2. Building a confidence interval through the critical value of t-statistics (Student's coefficient)

In the book "Mathematical Methods for Assessing the Value of Property", S.V. Gribovsky describes a method for calculating the confidence interval through Student's coefficient. With this method, the appraiser must set the significance level α, which determines the probability with which the confidence interval is built. Significance levels of 0.1, 0.05 and 0.01 are commonly used; they correspond to confidence probabilities of 0.9, 0.95 and 0.99. The true values of the mathematical expectation and variance are treated as unknown, which is almost always the case in practical appraisal problems.

Confidence interval formula:

n is the sample size;

t is the critical value of the t-statistic (Student's distribution) at significance level α with n-1 degrees of freedom, taken from statistical tables or computed in MS Excel with the TINV function ("Statistical" category; STUDRASPOBR in the Russian-language version);

α is the significance level; we take α = 0.01.
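The formula itself is not reproduced in the text above. As a hedged sketch only, assuming it has the standard form mean ± t · s/√n (s being the sample standard deviation), the calculation could look like this. The prices are hypothetical, and the critical value 3.707 is the tabulated two-sided value for α = 0.01 and 6 degrees of freedom, matching the 7 hypothetical observations.

```python
import math
import statistics

def t_interval(values, t_crit):
    """Return (low, high) = mean -/+ t_crit * s / sqrt(n).

    t_crit is the critical value of Student's distribution for the chosen
    significance level and n - 1 degrees of freedom, taken from a table.
    """
    n = len(values)
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean - t_crit * sem, mean + t_crit * sem

# Hypothetical prices per sq. m (c.u.), for illustration only
sample = [52000, 54500, 49800, 61000, 47500, 55200, 53100]
T_CRIT = 3.707  # alpha = 0.01 (two-sided), n - 1 = 6 degrees of freedom
low, high = t_interval(sample, T_CRIT)
print(f"99% interval for the mean: ({low:.0f}; {high:.0f})")
```

Under this assumed form the half-width shrinks as 1/√n, so for a large sample the interval becomes narrow and more offers fall outside it.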

Fig. 2. Values within confidence interval 2.

Stage 4. Analysis of the different ways of calculating the confidence interval

The two methods of calculating the confidence interval, through the median and through Student's coefficient, produced different intervals. Accordingly, two different cleaned samples were obtained.

Table 3. Statistical indicators for the three samples.

Columns: Indicator; Initial sample; Option 1; Option 2.

Rows: Mean; Variance; Coefficient of variation; Coefficient of oscillation; Number of removed objects, pcs.

Based on the calculations performed, we can say that the confidence intervals obtained by the two methods overlap, so either method can be used at the appraiser's discretion.

However, we believe that when working in the estimatica.pro system, it is advisable to choose the method of calculating the confidence interval according to the degree of market development:

  • if the market is undeveloped, use the calculation through the median and standard deviation, since few objects are removed in this case;
  • if the market is developed, use the calculation through the critical value of the t-statistic (Student's coefficient), since a large initial sample can be formed.

The following were used in preparing the article:

1. Gribovsky S.V., Sivets S.A., Levykina I.A. Mathematical Methods for Assessing the Value of Property. Moscow, 2014.

2. Data from the estimatica.pro system.

The method for estimating a random error is based on the provisions of the theory of probability and mathematical statistics. It is possible to estimate a random error only in the case when repeated measurements of the same quantity have been carried out.

Suppose that n measurements of the quantity x have produced the values x_1, x_2, …, x_n. Denote by x̄ the arithmetic mean:

$\bar{x} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$. (3)

Probability theory proves that as the number of measurements n increases, the arithmetic mean of the measured quantity approaches the true value:

$\bar{x} \to x_{\text{true}}$ as $n \to \infty$.

With a small number of measurements (n ≤ 10), the mean may differ significantly from the true value. To know how accurately x̄ characterizes the measured quantity, it is necessary to determine the so-called confidence interval of the result.

Since an absolutely accurate measurement is impossible, the probability that the statement "x has a value exactly equal to x̄" is correct is zero. The probability of the statement "x has some value" is one (100%). Thus, the probability that any intermediate statement is correct lies between 0 and 1. The purpose of the measurement is to find an interval that contains the true value of the measured quantity with a predetermined probability α (0 < α < 1). This interval is called the confidence interval, and the value α inextricably linked with it is called the confidence level (or reliability factor). The mean value calculated by formula (3) is taken as the middle of the interval. Half the width of the confidence interval is the random error Δ_s x (Fig. 1).



Obviously, the width of the confidence interval (and hence the error Δ_s x) depends on how much the individual measurements x_i deviate from the mean value x̄. The scatter of the measurement results about the mean is characterized by the root mean square error s, which is found by the formula

$s = \sqrt{\dfrac{\sum_{i=1}^{n} (\Delta x_i)^2}{n(n-1)}}$, (4)

where Δx_i = x_i - x̄.

The width of the desired confidence interval is directly proportional to the root mean square error:

$\Delta_s x = t_{n,\alpha}\, s$. (5)

The proportionality factor t_{n,α} is called Student's coefficient; it depends on the number of experiments n and the confidence level α.

Fig. 1, a, b clearly shows that, other things being equal, to increase the probability that the true value falls within the confidence interval, the interval must be made wider (a wider interval is more likely to "cover" the value x). Therefore, the higher the confidence level α, the greater t_{n,α} must be.

As the number of experiments increases, the mean approaches the true value, so for the same probability α the confidence interval can be made narrower (see Fig. 1, a, c). Thus, as n grows, Student's coefficient should decrease. A table of values of Student's coefficient as a function of n and α is given in the appendices to this manual.

It should be noted that the confidence level has nothing to do with the accuracy of the measurement result. The value of α is set in advance, based on the required reliability of the results. In most technical experiments and in laboratory practice, α is taken to be 0.95.

The random error in measuring a quantity x is calculated in the following order:

1) the sum of the measured values is calculated, and then the mean value x̄ by formula (3);

2) for each i-th experiment, the difference between the measured and mean values, Δx_i = x_i - x̄, is calculated, as well as the square of this difference (deviation), (Δx_i)²;

3) the sum of the squared deviations is found, and then the root mean square error s by formula (4);

4) for the given confidence level α and number of experiments n, the corresponding value of Student's coefficient t_{n,α} is selected from the table on p. 149 of the appendices, and the random error Δ_s x is calculated by formula (5).

For convenience of calculation and verification of intermediate results, the data are entered in a table whose last three columns are filled in following the pattern of Table 1.

Table 1

Experiment No.    x        Δx       (Δx)²
1
…
n
                  Σ =               Σ =

In each particular case, the quantity x has a definite physical meaning and corresponding units of measurement. It may be, for example, the acceleration of free fall g (m/s²), the viscosity of a fluid η (Pa·s), etc. The columns of Table 1 not shown here may contain intermediate measured values needed to calculate the corresponding values of x.

Example 1. To determine the acceleration a of a body, the time t it takes to travel a distance S with no initial velocity was measured. Using the known relation S = at²/2, we obtain the calculation formula

$a = \dfrac{2S}{t^2}$. (6)

The measured values of the distance S and time t are given in the second and third columns of Table 2. After performing the calculations by formula (6), we fill in the fourth column with the acceleration values a_i and find their sum, which we write under this column in the cell "Σ =". Then we calculate the mean value by formula (3):

$\bar{a} = \dfrac{8.11}{4} \approx 2.03$ m/s².

Table 2

Experiment No.   S, m   t, s    a, m/s²    Δa, m/s²   (Δa)², (m/s²)²
1                5      2.20    2.07        0.04      0.0016
2                7      2.68    1.95       -0.08      0.0064
3                9      2.91    2.13        0.10      0.0100
4                11     3.35    1.96       -0.07      0.0049
                                Σ = 8.11              Σ = 0.0229

Subtracting the mean from each value a_i, we find the differences Δa_i and enter them in the fifth column of the table. Squaring these differences, we fill in the last column. Then we calculate the sum of the squared deviations and write it in the second cell "Σ =". By formula (4), we determine the root mean square error:

$s = \sqrt{\dfrac{0.0229}{4 \cdot 3}} \approx 0.0437$ m/s².

For the confidence level α = 0.95 and the number of experiments n = 4, we select from the table in the appendices (p. 149) the value of Student's coefficient t_{n,α} = 3.18; using formula (5), we estimate the random error in measuring the acceleration:

Δ_s a = 3.18 × 0.0437 ≈ 0.139 m/s².
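The four-step procedure and this example can be checked with a short Python sketch. It assumes the root mean square error is the one the example's arithmetic implies, s = sqrt(Σ(Δx_i)² / (n(n-1))), and it replaces the manual's appendix table with a small hand-entered dictionary of standard Student's coefficients for a confidence level of 0.95. The names `STUDENT_095` and `random_error` are illustrative.

```python
import math

# Student's coefficients for confidence level 0.95, keyed by the number of
# experiments n (standard tabulated values for n - 1 degrees of freedom)
STUDENT_095 = {2: 12.71, 3: 4.30, 4: 3.18, 5: 2.78, 6: 2.57}

def random_error(values, table=STUDENT_095):
    """Return (mean, s, random error) following steps 1-4 of the procedure."""
    n = len(values)
    mean = sum(values) / n                          # step 1
    sum_sq = sum((v - mean) ** 2 for v in values)   # steps 2 and 3
    s = math.sqrt(sum_sq / (n * (n - 1)))           # root mean square error
    return mean, s, table[n] * s                    # step 4

# Acceleration values a_i from the fourth column of Table 2, m/s^2
mean_a, s_a, delta_a = random_error([2.07, 1.95, 2.13, 1.96])
print(f"mean = {mean_a:.2f}, s = {s_a:.4f}, random error = {delta_a:.3f} m/s^2")
```

The results agree with the worked example to within rounding; small differences arise because the example rounds the mean to 2.03 before computing the deviations.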

Let's build a confidence interval in MS EXCEL for estimating the mean of a distribution when the variance is known.

Of course, the choice of confidence level depends entirely on the task at hand. For example, an air passenger's confidence in the reliability of an aircraft should certainly be higher than a buyer's confidence in the reliability of a light bulb.

Problem statement

Suppose a sample of size n has been taken from a population. The standard deviation σ of this distribution is assumed to be known. Based on this sample, we need to estimate the unknown mean of the distribution (μ) and construct the corresponding two-sided confidence interval.

Point Estimation

As is known, the sample mean (let's call it X̄) is an unbiased estimate of the mean of this population and has the distribution N(μ; σ²/n).

Note: What if you need to build a confidence interval for a distribution that is not normal? In this case the Central Limit Theorem (CLT) comes to the rescue: for a sufficiently large sample size n from a non-normal distribution, the sampling distribution of the statistic X̄ is approximately normal with parameters N(μ; σ²/n).

So, our point estimate of the distribution mean is the sample mean, X̄. Now let's turn to the confidence interval.

Building a confidence interval

Usually, knowing a distribution and its parameters, we can calculate the probability that a random variable takes a value in a given interval. Now let's do the opposite: find the interval into which the random variable falls with a given probability. For example, it follows from the properties of the normal distribution that a normally distributed random variable falls within approximately ±2 standard deviations of the mean with a probability of 95%. This interval will serve as the prototype for our confidence interval.

Now let's see whether we know the distribution well enough to calculate this interval. To answer this question, we must specify the form of the distribution and its parameters.

We know the form of the distribution: it is normal (remember that we are talking about the sampling distribution of the statistic X̄).

The parameter μ is unknown to us (it is exactly what we need to estimate with the confidence interval), but we have its estimate X̄, calculated from the sample, which can be used instead.

The second parameter, the standard deviation of the sample mean, is treated as known: it equals σ/√n.

Because we do not know μ, we will build the interval of ±2 standard deviations not around the mean but around its known estimate X̄. That is, when calculating the confidence interval we will NOT assume that X̄ falls within ±2 standard deviations of μ with a probability of 95%; instead, we will assume that the interval of ±2 standard deviations around X̄ covers μ, the mean of the population from which the sample was taken, with a probability of 95%. These two statements are equivalent, but the second one allows us to construct the confidence interval.

In addition, let's refine the interval: a normally distributed random variable falls with 95% probability within ±1.960 standard deviations, not ±2. This value can be calculated with the formula =NORM.S.INV((1+0.95)/2); see the example file, sheet Spacing.

Now we can formulate the probabilistic statement that will serve to form the confidence interval:
"The probability that the population mean lies within 1.960 "standard deviations of the sample mean" of the sample mean is 95%."

The probability value mentioned in the statement has a special name, the confidence level, which is related to the significance level α (alpha) by a simple expression: confidence level = 1 - α. In our case the significance level is α = 1 - 0.95 = 0.05.

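Now, based on this probabilistic statement, we write the expression for calculating the confidence interval:

$\bar{X} - Z_{\alpha/2}\,\dfrac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + Z_{\alpha/2}\,\dfrac{\sigma}{\sqrt{n}}$

where Zα/2 is the upper α/2-quantile of the standard normal distribution (the value of the random variable z such that P(z >= Zα/2) = α/2).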

Note: The upper α/2-quantile defines the width of the confidence interval in standard deviations of the sample mean. The upper α/2-quantile of the standard normal distribution is always greater than 0, which is very convenient.

In our case, at α = 0.05, the upper α/2-quantile equals 1.960. For other significance levels α (10%; 1%) the upper α/2-quantile Zα/2 can be calculated with the formula =NORM.S.INV(1-α/2) or, if the confidence level is known, =NORM.S.INV((1+confidence level)/2).
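Outside Excel, the same quantile can be obtained with Python's standard library. In this minimal sketch `NormalDist().inv_cdf` plays the role of the Excel inverse standard normal function; the helper names are illustrative.

```python
from statistics import NormalDist

def z_from_alpha(alpha):
    """Upper alpha/2-quantile of the standard normal distribution."""
    return NormalDist().inv_cdf(1 - alpha / 2)

def z_from_confidence(level):
    """The same quantile expressed through the confidence level."""
    return NormalDist().inv_cdf((1 + level) / 2)

for alpha in (0.10, 0.05, 0.01):
    print(f"alpha = {alpha:.2f}: z = {z_from_alpha(alpha):.3f}")
```

Both helpers return the same number for matching inputs, since (1 + confidence level)/2 = 1 - α/2 when the confidence level is 1 - α.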

Usually, when building confidence intervals for estimating the mean, only the upper α/2-quantile is used and the lower α/2-quantile is not. This is possible because the standard normal distribution is symmetric about zero (its density is symmetric about the mean, i.e. 0). Therefore, there is no need to calculate the lower α/2-quantile (it is simply called the α/2-quantile), because it equals the upper α/2-quantile with a minus sign.

Recall that, regardless of the shape of the distribution of x, the corresponding random variable X̄ is approximately normally distributed, N(μ; σ²/n). Therefore, in general, the above expression for the confidence interval is only approximate. If x itself is normally distributed, N(μ; σ²), the expression for the confidence interval is exact.

Calculation of confidence interval in MS EXCEL

Let's solve the problem.
The response time of an electronic component to an input signal is an important characteristic of a device. An engineer wants to construct a confidence interval for the mean response time at a confidence level of 95%. From previous experience, the engineer knows that the standard deviation of the response time is 8 ms. The engineer made 25 measurements to estimate the response time; the average value was 78 ms.

Solution: The engineer wants to know the response time of the electronic device, but understands that the response time is not a fixed value but a random variable with its own distribution. So the best the engineer can hope for is to determine the parameters and shape of this distribution.

Unfortunately, the problem statement does not tell us the form of the distribution of the response time (it does not have to be normal). The mean of this distribution is also unknown. Only its standard deviation σ = 8 is known. Therefore, for now we cannot calculate probabilities or construct a confidence interval.

However, although we do not know the distribution of an individual response time, we know that, according to the Central Limit Theorem (CLT), the sampling distribution of the mean response time is approximately normal (we will assume that the conditions of the CLT are met, because the sample size is large enough (n = 25)).

Furthermore, the mean of this distribution equals the mean of the distribution of an individual response, i.e. μ. And the standard deviation of this distribution (σ/√n) can be calculated with the formula =8/SQRT(25).

It is also known that the engineer obtained a point estimate of the parameter μ equal to 78 ms (X̄). Therefore, we can now calculate probabilities, because we know the form of the distribution (normal) and its parameters (X̄ and σ/√n).

The engineer wants to know the expected value μ of the response time distribution. As stated above, this μ equals the expected value of the sampling distribution of the mean response time. If we use the normal distribution N(X̄; σ/√n), the desired μ will lie within ±2·σ/√n of X̄ with a probability of approximately 95%.

The significance level is 1 - 0.95 = 0.05.

Finally, let's find the left and right boundaries of the confidence interval.
Left boundary: =78-NORM.S.INV(1-0.05/2)*8/SQRT(25) = 74.864
Right boundary: =78+NORM.S.INV(1-0.05/2)*8/SQRT(25) = 81.136

Equivalently:
Left boundary: =NORM.INV(0.05/2, 78, 8/SQRT(25))
Right boundary: =NORM.INV(1-0.05/2, 78, 8/SQRT(25))

Answer: the confidence interval at the 95% confidence level and σ = 8 ms is 78 ± 3.136 ms.

In the example file, on the sheet Sigma known, there is a form for calculating and constructing a two-sided confidence interval for arbitrary samples with a given σ and significance level.
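The engineer's problem can also be reproduced outside Excel. The sketch below uses only Python's standard library; the function name `z_interval` is illustrative, and the known-σ assumption is the same as in the problem statement.

```python
import math
from statistics import NormalDist

def z_interval(sample_mean, sigma, n, confidence=0.95):
    """Two-sided confidence interval for the mean when sigma is known."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # upper alpha/2-quantile
    half_width = z * sigma / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width

# Data from the problem: mean 78 ms, sigma 8 ms, 25 measurements
left, right = z_interval(78, 8, 25)
print(f"95% CI: ({left:.3f}; {right:.3f}) ms")
```

Passing `confidence=0.99` instead widens the interval, because the quantile grows from about 1.960 to about 2.576.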

CONFIDENCE.NORM() function

If the sample values are in the range B20:B79 and the significance level is 0.05, then the MS EXCEL formula:
=AVERAGE(B20:B79)-CONFIDENCE.NORM(0.05,σ,COUNT(B20:B79))
returns the left boundary of the confidence interval.

The same boundary can be calculated with the formula:
=AVERAGE(B20:B79)-NORM.S.INV(1-0.05/2)*σ/SQRT(COUNT(B20:B79))

Note: The CONFIDENCE.NORM() function appeared in MS EXCEL 2010. Earlier versions of MS EXCEL used the CONFIDENCE() function.

Any sample gives only an approximate idea of the general population, and all sample statistics (mean, mode, variance, ...) are approximations, or estimates, of the general parameters, which in most cases cannot be calculated because the general population is inaccessible (Figure 20).

Figure 20. Sampling error

But one can specify an interval in which the true (general) value of the statistical characteristic lies with a certain probability. This interval is called the confidence interval (CI).

So the general mean lies, with a probability of 95%, within the range

from $\bar{x} - t \cdot s_{\bar{x}}$ to $\bar{x} + t \cdot s_{\bar{x}}$, (20)

where x̄ is the sample mean, s_x̄ is the standard error of the mean, and t is the tabular value of Student's criterion for α = 0.05 and f = n-1 degrees of freedom.

A 99% CI can also be found; in this case t is chosen for α = 0.01.

What is the practical significance of a confidence interval?

    A wide confidence interval indicates that the sample mean does not accurately reflect the population mean. This is usually due to an insufficient sample size or to heterogeneity of the sample, i.e. large dispersion. Both lead to a large error of the mean and, accordingly, a wider CI. This is a reason to return to the research planning stage.

    The upper and lower limits of the CI make it possible to assess whether the results are clinically significant.

Let us dwell in more detail on the statistical and clinical significance of the results of studying group properties. Recall that the task of statistics is to detect at least some differences between general populations based on sample data. The clinician's task is to find those differences (not just any) that will help diagnosis or treatment. Statistical conclusions are not always a basis for clinical conclusions. Thus, a statistically significant decrease in hemoglobin of 3 g/l is not a cause for concern. Conversely, if some problem in the human body is not widespread at the level of the entire population, this is no reason not to deal with it.

Let us consider this with an example.

The researchers wondered whether boys who had had a certain infectious disease lag behind their peers in growth. To find out, a sample study was conducted with 10 boys who had had this disease. The results are presented in Table 23.

Table 23. Statistical results

Columns: Characteristic (cm); Mean; Lower limit; Upper limit.

From these calculations it follows that the sample mean height of the 10-year-old boys who have had this infectious disease is close to normal (132.5 cm). However, the lower limit of the confidence interval (126.6 cm) shows that the 95% confidence interval includes values that correspond to "short stature", so it cannot be ruled out that these children are stunted.

In this example, the results of the confidence interval calculations are clinically significant.

Confidence intervals

General overview

Taking a sample from the population, we obtain a point estimate of the parameter of interest and calculate the standard error to indicate the precision of the estimate.

However, in most cases the standard error on its own is not very useful. It is much more useful to combine this measure of precision with an interval estimate for the population parameter.

This can be done by using knowledge of the theoretical probability distribution of the sample statistic to calculate a confidence interval (CI) for the parameter.

In general, the confidence interval extends the estimate in both directions by some multiple of the standard error (of the given parameter); the two values (confidence limits) that define the interval are usually separated by a comma and enclosed in parentheses.

Confidence interval for mean

Using the normal distribution

The sample mean has an approximately normal distribution if the sample size is large, so knowledge of the normal distribution can be applied when considering the sample mean.

In particular, 95% of the distribution of the sample means is within 1.96 standard deviations (SD) of the population mean.

When we have only one sample, the standard deviation of the sample means is called the standard error of the mean (SEM), and the 95% confidence interval for the mean is calculated as follows:

$(\bar{x} - 1.96 \times \mathrm{SEM},\ \bar{x} + 1.96 \times \mathrm{SEM})$

If this experiment is repeated several times, the interval will contain the true population mean 95% of the time.
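The repeated-sampling interpretation can be illustrated with a small simulation. This is a sketch under stated assumptions: the population is normal with hypothetical parameters (mean 100, SD 15), the population SD is treated as known so that the 1.96 multiplier applies exactly, and the seed is fixed only to make the run reproducible.

```python
import math
import random
from statistics import NormalDist

def coverage(mu=100.0, sigma=15.0, n=30, repeats=2000, seed=1):
    """Fraction of simulated 95% intervals that contain the true mean mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.975)        # about 1.96
    half_width = z * sigma / math.sqrt(n)  # 1.96 * SEM with sigma known
    hits = 0
    for _ in range(repeats):
        sample_mean = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if sample_mean - half_width <= mu <= sample_mean + half_width:
            hits += 1
    return hits / repeats

share = coverage()
print(f"share of intervals containing the true mean: {share:.3f}")
```

The printed share should be close to 0.95; it is the procedure, not any single interval, that has the 95% property.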

The confidence interval is usually interpreted as the range of values within which the true population mean (general mean) lies with 95% confidence.

Although interpreting the confidence interval in this way is not quite strict (the population mean is a fixed value and therefore cannot have a probability attached to it), it is conceptually easier to understand.

Using the t-distribution

You can use the normal distribution if you know the value of the variance in the population. Also, when the sample size is small, the sample mean follows a normal distribution only if the underlying population data are normally distributed.

If the population variance is unknown and has to be estimated from the sample, and/or the sample is small, the interval is based on Student's t-distribution instead.

Calculate the 95% confidence interval for the population mean as follows:

$(\bar{x} - t_{0.05} \times \mathrm{SEM},\ \bar{x} + t_{0.05} \times \mathrm{SEM})$

where t_0.05 is the percentage point (percentile) of Student's t-distribution with (n-1) degrees of freedom that gives a two-tailed probability of 0.05, and SEM = s/√n is the standard error of the mean estimated from the sample standard deviation s.

In general, it provides a wider interval than when using a normal distribution, because it takes into account the additional uncertainty that is introduced by estimating the population standard deviation and/or due to the small sample size.
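The difference between the two intervals can be seen on a small sample. In the sketch below the heights are hypothetical, and the t percentage points come from a short hand-entered excerpt of a standard table (the standard library has no t-distribution); `mean_ci` and `T_TWO_TAILED_05` are illustrative names.

```python
import math
import statistics
from statistics import NormalDist

# Two-tailed 5% points of Student's t-distribution, keyed by degrees of freedom
T_TWO_TAILED_05 = {4: 2.776, 9: 2.262, 19: 2.093, 29: 2.045}

def mean_ci(values, multiplier):
    """Return (lower, upper) = mean -/+ multiplier * SEM, with SEM = s / sqrt(n)."""
    n = len(values)
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean - multiplier * sem, mean + multiplier * sem

# Hypothetical heights (cm) of 10 children, for illustration only
heights = [128.0, 135.5, 130.2, 138.1, 126.4, 133.0, 131.7, 129.9, 136.3, 134.4]

t_ci = mean_ci(heights, T_TWO_TAILED_05[len(heights) - 1])
z_ci = mean_ci(heights, NormalDist().inv_cdf(0.975))

print("t-based 95% CI:     ", tuple(round(v, 2) for v in t_ci))
print("normal-based 95% CI:", tuple(round(v, 2) for v in z_ci))
```

With 9 degrees of freedom the multiplier is 2.262 instead of about 1.96, so the t-based interval is roughly 15% wider.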

When the sample size is large (of the order of 100 or more), the difference between the two distributions (Student's t and normal) is negligible. However, the t-distribution is always used when calculating confidence intervals, even if the sample size is large.

Usually the 95% CI is given. Other confidence intervals can be calculated, such as the 99% CI for the mean.

Instead of multiplying the standard error by the tabular value of the t-distribution that corresponds to a two-tailed probability of 0.05, multiply it by the value that corresponds to a two-tailed probability of 0.01. This gives a wider confidence interval than in the 95% case, because it reflects increased confidence that the interval does indeed include the population mean.

Confidence interval for proportion

The sampling distribution of a proportion is binomial. However, if the sample size n is reasonably large, the sampling distribution of the proportion is approximately normal with mean π, the true proportion in the population.

We estimate π by the sample proportion p = r/n (where r is the number of individuals in the sample with the characteristic of interest), and its standard error is estimated by:

$\mathrm{SE}(p) = \sqrt{\dfrac{p(1-p)}{n}}$

The 95% confidence interval for the proportion is estimated by:

$\left(p - 1.96\sqrt{\dfrac{p(1-p)}{n}},\ p + 1.96\sqrt{\dfrac{p(1-p)}{n}}\right)$

If the sample size is small (usually when np or n(1-p) is less than 5), the binomial distribution must be used to calculate exact confidence intervals.

Note that if p is expressed as a percentage, then (1-p) is replaced by (100-p).
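A minimal sketch of the normal-approximation interval for a proportion, with the small-sample rule from the text turned into an explicit check. The counts are hypothetical, and `proportion_ci` is an illustrative name; the exact binomial interval is deliberately not implemented.

```python
import math

def proportion_ci(r, n, z=1.96):
    """95% CI for a proportion by the normal approximation.

    r is the number of individuals with the characteristic, n the sample size.
    Raises ValueError when np or n(1-p) is below 5, where the text says the
    binomial distribution should be used instead.
    """
    p = r / n
    if n * p < 5 or n * (1 - p) < 5:
        raise ValueError("sample too small for the normal approximation")
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

# Hypothetical study: 40 of 200 individuals have the characteristic
low, high = proportion_ci(40, 200)
print(f"p = {40 / 200:.2f}, 95% CI: ({low:.3f}, {high:.3f})")
```

Refusing to compute in the small-sample case keeps the sketch from silently returning an interval the approximation does not justify.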

Interpretation of confidence intervals

When interpreting the confidence interval, we are interested in the following questions:

How wide is the confidence interval?

A wide confidence interval indicates that the estimate is imprecise; a narrow one indicates a precise estimate.

The width of the confidence interval depends on the size of the standard error, which in turn depends on the sample size and, for a numerical variable, on the variability of the data. Therefore, small studies of highly variable data give wider confidence intervals than large studies of less variable data.

Does the CI include any values of particular interest?

You can check whether a hypothesized value of the population parameter falls within the confidence interval. If it does, the results are consistent with this value. If not, it is unlikely (for a 95% confidence interval, the chance is at most 5%) that the parameter has this value.