CS460/660- Machine Learning

Confidence Interval | October 2020

Documented by: Akshay Priyadarshi, 5th Yr. Int. MSc., School of Physical Sciences, NISER | akshay.priyadarshi@niser.ac.in
Instructor: Dr. Subhankar Mishra, Asst. Prof., School of Computer Sciences (CSS), NISER


Confidence Interval

Jump to:
    Example Problem
    A quick recap
    Methodology
    Computational Analysis
    Further Comments

Example Problem

Problem: We want to predict the outcome of an election between two candidates (0 and 1) in a region (R). The region has a population of 1 lakh (100,000). Ideally, the prediction would be made after asking every person. But, naturally, with limited resources and time, we cannot ask everyone.

Fig.1 Example Problem.

Suggested solution: We ask a random selection of a few people (region S) for their opinion and use the answers to predict the general outcome.
Let's say we surveyed 1000 people randomly selected from the total population, and 540 of them chose 'Option 1'. The estimated probability of Option 1, p', would then be 540/1000 = 0.54.

The Snag: This probability (p') might differ from the probability (p) of Option 1 in the total population (region R). The error in the prediction would be p - p'.
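The gap between p' and p can be illustrated with a small simulation. This is only a sketch: the true proportion `TRUE_P = 0.52`, the seed, and the helper name `sample_proportion` are assumptions made for the illustration. In a real survey p is unknown, so this error cannot be computed directly.

```python
import random

POPULATION_SIZE = 100_000   # 1 lakh people in region R
TRUE_P = 0.52               # assumed true share of Option 1 (hypothetical; unknown in practice)
SAMPLE_SIZE = 1000          # number of people surveyed (region S)


def sample_proportion(population, sample_size, rng):
    """Return p', the share of Option 1 in a simple random sample."""
    sample = rng.sample(population, sample_size)  # sampling without replacement
    return sum(sample) / sample_size


rng = random.Random(0)  # fixed seed so the run is repeatable

# Build region R as a list of votes: 1 for Option 1, 0 for Option 0.
n_ones = round(TRUE_P * POPULATION_SIZE)
population = [1] * n_ones + [0] * (POPULATION_SIZE - n_ones)

p = sum(population) / POPULATION_SIZE                      # true proportion in R
p_prime = sample_proportion(population, SAMPLE_SIZE, rng)  # estimate from S
error = p - p_prime

print(f"p = {p:.3f}, p' = {p_prime:.3f}, error = {error:+.3f}")
```

Changing the seed gives a different sample and hence a different p', which is exactly the variability that a confidence interval tries to quantify.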

Using confidence interval: Since the predicted outcome p' is based on a limited sample (S), it carries only a certain level of accuracy. From the book "Using R for Introductory Statistics" by John Verzani, we learn that random samples are not exactly “random”, but obey the laws of probability. If the original population (R) satisfies certain assumptions, then conclusions about a population parameter (like the mean) can be drawn with a certain precision.

A confidence interval is an interval estimate, i.e. a range of values that may contain the desired population parameter, reported together with a confidence level. The confidence level is the probability that the population parameter lies in the interval estimate, i.e. the confidence that we have in the interval.


A quick recap

The Normal Distribution, whose graph looks like a bell curve, is also known as the Gaussian distribution. It is a probability distribution that is symmetric about the mean (average), so data near the mean are more frequent than data far from the mean. For this distribution, mean = median = mode.

More summarised info related to Normal Distributions can be found over here.

Fig.2 Normal Distribution.
(\(\sigma\) is standard deviation & \(\mu\) is mean.)
Image credits: SPHweb.


Methodology

The normal distribution has the probability density function \[P(x)=\frac{e^{-\frac{(x-\mu)^2}{2\sigma^2}}}{\sigma\sqrt{2\pi}}\] Integrating this function between two limits gives the area under the curve between them, and hence the fraction of the distribution that lies in that range. E.g. nearly 95% of the values lie within 2 standard deviations (SD) of the mean, so the area between \(\mu-2\sigma\) and \(\mu+2\sigma\) is ~95% of the total area under the curve. So, for a random selection from the whole population, the probability of getting an element from this 2 SD interval would be ~95%. An idea of how the percentage between two limits is calculated can be read from here.
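The ~68% and ~95% figures can be checked numerically. The sketch below integrates the density with a simple trapezoidal rule and compares the result with the closed form `math.erf(k / sqrt(2))`. The function names `normal_pdf` and `area_within` and the step count are choices made for this illustration.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Probability density of the normal distribution at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def area_within(k, mu=0.0, sigma=1.0, steps=10_000):
    """Approximate the area under the pdf between mu - k*sigma and mu + k*sigma
    using the trapezoidal rule."""
    a, b = mu - k * sigma, mu + k * sigma
    h = (b - a) / steps
    total = 0.5 * (normal_pdf(a, mu, sigma) + normal_pdf(b, mu, sigma))
    for i in range(1, steps):
        total += normal_pdf(a + i * h, mu, sigma)
    return total * h


for k in (1, 2, 3):
    numeric = area_within(k)
    exact = math.erf(k / math.sqrt(2))  # closed form for the same area
    print(f"within {k} SD: numeric = {numeric:.4f}, erf = {exact:.4f}")
```

The areas come out as roughly 0.6827, 0.9545 and 0.9973 for 1, 2 and 3 SD, and they do not depend on the particular values of \(\mu\) and \(\sigma\).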

The probability of occurrence of 1 in region R is p. For illustration, suppose p is 0.54. Then there are 0.54N people who will vote 1, where N is the total population. Therefore, if we select a person randomly, the chance that they'd vote 1 is (number of people voting 1)/(total number) = \(\frac{0.54N}{N} = 0.54\). This probability is equal to the probability (p) of getting 1 from R. In theory, this probability should be known. Here, since we are trying to solve a case where we don't know p (and hence \(\sigma_{p}\)), we have to make an adjustment. The adjustment is to substitute the standard error \(\sigma_{p'}\) of the sample (S) for the unknown \(\sigma_{p}\) of the total population (R).

For a random sample of size n taken from the total population, the mean of the sample proportion is \(\mu'=\mu\) (here \(\mu=p\)), and its standard deviation is \(SD(p')=\sqrt{\frac{p(1-p)}{n}}\). If n is large, p' is also approximately normally distributed, which is why the percentages of the normal curve can be applied to it. The standard error of p' is this standard deviation with the unknown parameter (p) replaced by its sample estimate (p'). Here, \(\sigma_{p'}=SE(p')=\sqrt{\frac{p'(1-p')}{n}}\).
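The formula for SD(p') can be checked by repeating the survey many times in simulation. In this sketch each respondent is drawn independently with probability `P`, which approximates sampling from a large population. The values of `P` and `TRIALS`, the seed, and the helper name are assumptions made for the check.

```python
import math
import random
import statistics


def simulate_sample_proportions(p, n, trials, rng):
    """Run `trials` surveys of n independent respondents and return each p'."""
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(trials)]


P = 0.54        # assumed true proportion (hypothetical, for the check only)
N = 1000        # sample size per survey
TRIALS = 2000   # number of repeated surveys (assumption)

rng = random.Random(1)  # fixed seed so the run is repeatable
p_primes = simulate_sample_proportions(P, N, TRIALS, rng)

empirical_mean = statistics.mean(p_primes)
empirical_sd = statistics.pstdev(p_primes)
theoretical_sd = math.sqrt(P * (1 - P) / N)

print(f"mean of p' = {empirical_mean:.4f} (expected {P})")
print(f"SD of p'   = {empirical_sd:.4f} (formula gives {theoretical_sd:.4f})")
```

With these settings the empirical mean should sit close to p and the empirical spread close to \(\sqrt{p(1-p)/n}\). A histogram of `p_primes` would show the bell shape of Figure 2.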

Since p' is known, \(\sigma_{p'}\) can be calculated and, according to Figure 2, we can say that the unknown p of the total population (R) lies within \(p'\pm2\sigma_{p'}\) with a probability of ~95%. Similarly, p lies within \(p'\pm\sigma_{p'}\) with a probability of ~68%.
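Applied to the numbers of the example problem (540 out of 1000), the calculation can be sketched as follows. The function name `proportion_interval` is a hypothetical helper for this illustration. It uses the rule-of-thumb multipliers 1 and 2 from Figure 2 rather than exact normal quantiles.

```python
import math


def proportion_interval(p_prime, n, k=2):
    """Return (lower, upper) for p' ± k * SE(p'), with SE(p') = sqrt(p'(1 - p') / n)."""
    se = math.sqrt(p_prime * (1 - p_prime) / n)
    return p_prime - k * se, p_prime + k * se


p_prime = 540 / 1000   # sample proportion from the example problem
n = 1000               # sample size

se = math.sqrt(p_prime * (1 - p_prime) / n)
low95, high95 = proportion_interval(p_prime, n, k=2)   # ~95% interval
low68, high68 = proportion_interval(p_prime, n, k=1)   # ~68% interval

print(f"SE(p') = {se:.4f}")
print(f"~95% interval: ({low95:.3f}, {high95:.3f})")
print(f"~68% interval: ({low68:.3f}, {high68:.3f})")
```

Here SE(p') is about 0.0158, so the ~95% interval is roughly 0.508 to 0.572. Since the whole interval lies above 0.5, the sample suggests, at this confidence level, that Option 1 is ahead in region R.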


Computational Analysis

The code:

This code window shows the python code and output for Confidence interval.

Description of the code:

The graph from the code:

Fig.3 Graph from the python code

A description of the Graph: In the graph generated using the python code,


Note:

More information on the Confidence Interval and its different types can be found in Chapter 8 of the book "Using R for Introductory Statistics" by John Verzani.


References:

  1. "Using R for Introductory Statistics" by John Verzani.
  2. "Characteristics of a Normal Distribution" at SPHweb.

If you come across any error, kindly inform.