CS460/660- Machine Learning

Confidence Interval | October 2020

Documented by: Akshay Priyadarshi, 5th Yr. Int. MSc., School of Physical Sciences, NISER | akshay.priyadarshi@niser.ac.in
Instructor: Dr. Subhankar Mishra, Asst. Prof., School of Computer Sciences (CSS), NISER

Confidence Interval

Jump to:
    Example Problem
    A quick recap
    Methodology
    Computational Analysis
    Further Comments

Example Problem

Problem: We want to predict the outcome of an election between two candidates (0 and 1) in a region (R). The region has a population of 1 lakh (100,000). Ideally, the prediction would be made after asking every person. In practice, limited resources and time make that impossible.

Suggested solution: We ask a small random selection of people (sample S) for their opinion and use it to predict the overall outcome.
Say we survey 1000 people selected at random from the total population, and 540 of them choose 'Option 1'. The estimated probability of Option 1 is then p' = 540/1000 = 0.54.

The Snag: This probability (p') might differ from the probability (p) of Option 1 in the total population (region R). The error in the prediction would be p - p'.

Using confidence interval: Since the predicted outcome p' is based on a limited sample (S), it is accurate only to a certain degree. According to the book "Using R for Introductory Statistics" by John Verzani, random samples are not exactly “random”, but they obey the laws of probability. If the original population (R) satisfies some assumptions, then the precision of the estimate can be quantified, and some conclusions about a population parameter (such as the mean) can be drawn.

A confidence interval is an interval estimate, i.e. a range of values that may contain the desired population parameter, reported together with a confidence level. The confidence level is the probability that an interval constructed this way contains the population parameter, i.e. the confidence that we have in the interval.

A quick recap

The Normal distribution, whose curve looks like a bell, is also known as the Gaussian distribution. It is a probability distribution that is symmetric about the mean (average), so values near the mean are more frequent than values far from it. In this distribution, mean = median = mode.

Mean is the average of all the values: \(\mu=\frac{1}{N}\Sigma x_i\), where the \(x_i\) are the elements. In our case of 0's and 1's, the sum \(\Sigma x_i\) is the number of 1's, \(N p\), where N is the total population and p is the probability of occurrence of 1, so the mean is simply \(\mu=p\).
Standard deviation is a measure of how spread out the numbers are. For 0's and 1's, the standard deviation of the individual values is \(\sigma_{p}=\sqrt{p(1-p)}\), and the standard deviation of the number of 1's in N independent draws is \(\sqrt{N p(1-p)}\).
Z-score states the separation of an element from the mean in units of standard deviation: \(Z=\frac{x_i-\mu}{\sigma}\).
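
As a small illustration of these three quantities, the sketch below uses a made-up list of ten 0/1 votes (the data and variable names are assumptions for illustration, not part of the course code). For 0/1 data, the mean of the individual values is the fraction of 1's, while N times that fraction is the count of 1's.

```python
import math
import statistics

# Hypothetical toy data: 1 = vote for candidate 1, 0 = vote for candidate 0.
votes = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]

N = len(votes)
count_ones = sum(votes)            # number of 1's, equal to N * p
p = count_ones / N                 # mean of the 0/1 values = fraction of 1's

# Standard deviation of the individual 0/1 values.
sigma = math.sqrt(p * (1 - p))
# Standard deviation of the count of 1's in N independent draws.
sigma_count = math.sqrt(N * p * (1 - p))

# Z-score of each element: distance from the mean in units of sigma.
z_scores = [(x - p) / sigma for x in votes]

print("mean:", p)
print("SD of values:", sigma, "check:", statistics.pstdev(votes))
print("SD of count:", sigma_count)
print("z-score of a 1:", z_scores[0], " z-score of a 0:", z_scores[1])
```

The value from `statistics.pstdev` agrees with \(\sqrt{p(1-p)}\), which is a quick way to check the formula on any 0/1 list.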

More summarised info related to Normal Distributions can be found over here.

Methodology

The normal distribution has the probability density function \[P(x)=\frac{e^{-\frac{(x-\mu)^2}{2\sigma^2}}}{\sigma\sqrt{2\pi}}\] Integrating this function between two limits gives the area under the curve between them, and hence the fraction of the distribution that lies in that range. E.g. nearly 95% of the values lie within 2 standard deviations (SD) of the mean, so the area between \(\mu-2\sigma\) and \(\mu+2\sigma\) is ~95% of the total area under the curve. Hence, for a random selection from the whole distribution, the probability of getting a value within 2 SD of the mean is ~95%. An idea of how the percentage between two limits is calculated can be read from here.
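
The ~68% and ~95% figures can be checked numerically. The following is a minimal sketch, not the course code: it integrates the density above with a simple trapezoidal rule and compares the result with the closed form based on the error function (`math.erf`). The function names and the step count are assumptions chosen for illustration.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def area_between(a, b, mu=0.0, sigma=1.0, steps=10_000):
    """Approximate the integral of the density from a to b (trapezoidal rule)."""
    h = (b - a) / steps
    total = 0.5 * (normal_pdf(a, mu, sigma) + normal_pdf(b, mu, sigma))
    for i in range(1, steps):
        total += normal_pdf(a + i * h, mu, sigma)
    return total * h

def prob_within(k):
    """Closed form for the probability of lying within k SD of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(area_between(-k, k), 4), round(prob_within(k), 4))
```

Both columns give about 0.6827, 0.9545 and 0.9973 for 1, 2 and 3 SD, so "2 SD" corresponds to slightly more than 95%.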

The probability of occurrence of 1 in region R is p. Suppose p is 0.54. Then 0.54×N people will vote 1, so if we select a person at random, the chance that they vote 1 is (number of people voting 1)/(total number) = \(\frac{0.54N}{N}=0.54\). This probability is equal to the probability (p) of getting 1 from R. The theory assumes that this probability is known. Here, since we are trying to solve a case where we don't know p (and hence \(\sigma_{p}\)), we have to make an adjustment: we substitute the standard error \(\sigma_{p'}\) of the sample (S) for the unknown \(\sigma_{p}\) of the total population (R).

For a random sample of size n taken from the total population, the expected value of the sample proportion p' is p (i.e. \(\mu'=\mu\)), and the standard deviation of p' is \(SD(p')=\sqrt{\frac{p(1-p)}{n}}\). If n is large, a different formula is also applicable. The standard error of p' is this standard deviation with the unknown parameter (p) replaced by its sample estimate (p'): \(\sigma_{p'}=SE(p')=\sqrt{\frac{p'(1-p')}{n}}\).
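
The formula for \(SD(p')\) can be checked empirically. The sketch below is an illustration under assumed values (p = 0.54, n = 400, 2000 repeated samples, a fixed seed); none of these numbers come from the course code. It draws many samples, computes p' for each, and compares the spread of the p' values with \(\sqrt{p(1-p)/n}\).

```python
import math
import random
import statistics

rng = random.Random(1)   # fixed seed so the run is repeatable
p = 0.54                 # assumed true probability of a 1
n = 400                  # sample size (illustrative)
trials = 2000            # number of independent samples (illustrative)

# Sample proportion p' for each trial: fraction of n draws that came out as 1.
p_hats = [sum(1 for _ in range(n) if rng.random() < p) / n for _ in range(trials)]

mean_p_hat = statistics.mean(p_hats)      # should be close to p
sd_p_hat = statistics.pstdev(p_hats)      # observed spread of p'
sd_theory = math.sqrt(p * (1 - p) / n)    # SD(p') from the formula

print("mean of p':", round(mean_p_hat, 4), " true p:", p)
print("observed SD of p':", round(sd_p_hat, 4), " formula:", round(sd_theory, 4))
```

The observed mean and spread should land close to p and to the formula value; they will not match exactly because the number of trials is finite.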

Since p' is known, \(\sigma_{p'}\) can be calculated, and according to Figure 2, we can say that the unknown p of the total population (R) lies within \(p'\pm2\sigma_{p'}\) with a probability of ~95%. Similarly, p lies within \(p'\pm\sigma_{p'}\) with a probability of ~68%.
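
Applied to the example problem (540 votes for Option 1 out of 1000 people asked), the calculation looks as follows. This is a minimal sketch; the function name `two_sigma_interval` and its parameter `z` are hypothetical names introduced here for illustration.

```python
import math

def two_sigma_interval(successes, n, z=2.0):
    """Return p', its standard error, and the limits p' -/+ z * SE."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, se, p_hat - z * se, p_hat + z * se

p_hat, se, lower, upper = two_sigma_interval(540, 1000)
print(f"p' = {p_hat:.3f}, SE = {se:.4f}")
print(f"~95% interval: ({lower:.3f}, {upper:.3f})")
```

For these numbers the standard error is about 0.0158 and the interval is roughly 0.508 to 0.572, so the whole ~95% interval lies above 0.5. Passing `z=1.0` gives the narrower ~68% interval.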

Computational Analysis

The code:

This code window shows the python code and output for Confidence interval.

Description of the code:

We can select the total population size and the approximate probability of occurrence of 1. Here, each member of the population has one of two options: 0 or 1.

The code starts by generating random numbers between 1 and 100 which are uniformly distributed (have equal probability). If the required probability of 1 is 0.54, then all random numbers less than 54 (i.e. 0.54*100) are replaced by 1, and the rest are replaced by 0.

The code generates the required population and also reports the actual probability (p) of occurrence of 1 in the generated population.

A sample of a given size (n) is selected randomly from the total population. The probability (p') of getting 1 in the sample and the corresponding standard error (\(\sigma_{p'}\)) are calculated.

An upper limit and a lower limit are calculated using \(p'\pm(2\times\sigma_{p'})\).

The code checks whether the original probability (p) lies between these limits.

The code repeats this a specified number of times (here, 1000) to show the percentage of runs in which the original probability lies in the range.

The graph and the output give us the summary of the runs.
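
The steps described above can be sketched as follows. This is not the course's code window, only a minimal reconstruction under stated assumptions: the population is built by comparing `random()` with p rather than by thresholding numbers between 1 and 100, the function names and the seed are made up for illustration, sampling is with replacement, and plotting is left out.

```python
import math
import random

def make_population(size, p, rng):
    """Build a list of 0's and 1's where each entry is 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(size)]

def coverage(population, n, runs, rng, z=2.0):
    """Fraction of runs in which p lies within p' -/+ z * SE."""
    p_true = sum(population) / len(population)   # actual p of the population
    hits = 0
    for _ in range(runs):
        sample = rng.choices(population, k=n)    # sampling with replacement
        p_hat = sum(sample) / n                  # p' of this sample
        se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of p'
        if p_hat - z * se <= p_true <= p_hat + z * se:
            hits += 1
    return p_true, hits / runs

rng = random.Random(0)                           # fixed seed for repeatability
population = make_population(100_000, 0.54, rng)
p_true, frac = coverage(population, n=1000, runs=1000, rng=rng)
print(f"actual p = {p_true:.4f}")
print(f"p inside the limits in {100 * frac:.1f}% of runs")
```

With 1000 runs the reported percentage should be close to 95%, though it fluctuates by a percentage point or so from seed to seed.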

The graph from the code:

A description of the Graph: In the graph generated using the python code,

the x-axis denotes the probability of '1' in the sample.

The red line corresponds to the probability in the total population.

The height of the blue bars shows the frequency of each value of the probability of '1' in the randomly selected samples.

The green lines show the lower limits, and the black lines show the upper limits (each is 2 standard errors away from the sample's probability).

The overlap of such lines from multiple samples results in darker lines.

Ideally, we expect the red line to lie between the green and black lines in nearly 95% of the runs.

Note:

Here, the Python code is shown for sampling with replacement. This means that, after a number is randomly selected from the list, it remains available for the subsequent selections. According to the book "Using R for Introductory Statistics" by John Verzani, samples drawn with replacement are truly independent. If replacement is not possible, sampling without replacement gives a close approximation.

Example: Suppose one has to select two items from the list [Apple, Ball, Cat] and chooses "Ball" as the first item. With sampling with replacement, the second choice could be "Ball" again. With sampling without replacement, "Ball" cannot be the second choice as well.

By changing one line of the code, we can get the results for selection without replacement. The line
sample = random.choices(Total, k=n)
is for selection with replacement. It can be replaced with
sample = random.sample(list(Total), k=n)
for selection without replacement.
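
The difference between the two calls can be seen on the three-item list from the example. The sketch below is only an illustration (the seed and the count of 100 draws are arbitrary assumptions): it draws pairs both ways and counts how often the same item appears twice.

```python
import random

rng = random.Random(0)   # fixed seed so the run is repeatable
items = ["Apple", "Ball", "Cat"]

# 100 pairs drawn with replacement: the same item may be picked twice.
with_replacement = [rng.choices(items, k=2) for _ in range(100)]
# 100 pairs drawn without replacement: the two items always differ.
without_replacement = [rng.sample(items, k=2) for _ in range(100)]

repeats_with = sum(1 for a, b in with_replacement if a == b)
repeats_without = sum(1 for a, b in without_replacement if a == b)

print("pairs with a repeated item (with replacement):", repeats_with)
print("pairs with a repeated item (without replacement):", repeats_without)
```

Repeated pairs show up only with `random.choices`; `random.sample` never returns the same position of the list twice.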

Fig.4 Graph from the python code for selection without replacement.

Computationally, the results from the two methods did not show any major difference.

More information on confidence intervals and their different types can be found in Chapter 8 of the book "Using R for Introductory Statistics" by John Verzani.

References:

"Using R for Introductory Statistics" by John Verzani.
"Characteristics of a Normal Distribution" at SPHweb.

If you come across any error, kindly inform the author.