Documented by: Akshay Priyadarshi, 5th Yr. Int. MSc., School of
Physical Sciences, NISER | akshay.priyadarshi@niser.ac.in
Instructor: Dr. Subhankar Mishra, Asst. Prof., School of Computer
Sciences (CSS), NISER
Problem: We want to predict the outcome of an election between two candidates (0 and 1) in a region (R). The region has a population of 1 lakh (100,000). Ideally, the prediction would be made after asking every person. But, naturally, with limited resources and time, we cannot ask everyone.
Suggested solution: We ask a small random selection of people
(the sample, region S) about their opinion and use it to predict the general outcome.
Let's say we asked 1000 people, randomly selected from the
total population, and 540 of them chose 'Option 1'. The estimated
probability of Option 1 is then p' = 540/1000 = 0.54.
The Snag: This sample probability (p') might differ
from the probability (p) of Option 1 in the total population (region
R). The error in the prediction would be p - p'.
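The gap between p' and p can be illustrated with a small simulation. The sketch below assumes a hypothetical true proportion of 0.52 (in a real poll this value is unknown), builds a population of 1 lakh votes, and draws a random sample of 1000 from it. The names `TRUE_P` and `poll`, and the fixed seed, are illustrative choices and not part of the original notes.

```python
import random

POPULATION_SIZE = 100_000   # 1 lakh people in region R
SAMPLE_SIZE = 1_000         # people actually asked (region S)
TRUE_P = 0.52               # ASSUMPTION: hypothetical true share voting 1


def poll(population, sample_size, rng):
    """Return p', the share of 1s in a random sample drawn without replacement."""
    sample = rng.sample(population, sample_size)
    return sum(sample) / sample_size


rng = random.Random(0)  # fixed seed so the run is repeatable

# Build the full population: 1 = votes for candidate 1, 0 = votes for candidate 0.
ones = round(TRUE_P * POPULATION_SIZE)
population = [1] * ones + [0] * (POPULATION_SIZE - ones)

p = sum(population) / POPULATION_SIZE          # true proportion in R
p_prime = poll(population, SAMPLE_SIZE, rng)   # estimate from S
error = p - p_prime                            # error in prediction

print(f"p = {p:.3f}, p' = {p_prime:.3f}, error = {error:+.3f}")
```

Changing the seed gives a different sample and hence a different p', which is exactly the variation the following sections try to quantify.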
Using confidence interval:
The predicted outcome p', since it is made from a limited sample (S),
is accurate only to a certain degree. From the book "Using R for
Introductory Statistics" by John Verzani, we learn that
random samples are not entirely unpredictable: they obey the laws of
probability. If the original population (R) satisfies certain assumptions,
then conclusions about a population parameter (such as the mean) can be
drawn with a known precision.
A confidence interval is an interval estimate, that is, a range of
values that may contain the desired population parameter, together with
a confidence level. The confidence level is the probability that the
parameter will be in the interval estimate, i.e. the confidence that we
have in the interval.
The Normal Distribution, whose curve looks like a bell (the bell curve), is also known as the Gaussian distribution. It is a probability distribution that is symmetric about the mean (average), so values near the mean are more frequent than values far from it. In this distribution, mean = median = mode.
(\(\sigma\) is standard deviation & \(\mu\) is mean.)
Image credits: SPHweb.
The normal distribution has the probability density function \[P(x)=\frac{e^{-\frac{(x-\mu)^2}{2\sigma^2}}}{\sigma\sqrt{2\pi}}\] Integrating this function between two limits gives the area under the curve between them, and hence the fraction of the distribution lying in that range. E.g. nearly 95% of the values lie within 2 standard deviations (SD) of the mean, so the area between \(\mu-2\sigma\) and \(\mu+2\sigma\) is ~95% of the total area under the curve. So, for a random selection from the whole population, the probability of getting an element from this interval of 2 SD is ~95%. An idea of how the percentage between two margins is calculated can be read from here.
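The area calculation can be checked numerically. The following is a minimal sketch, not the method used to produce the figure: it integrates the density above with the trapezoidal rule for a standard normal (\(\mu=0\), \(\sigma=1\)) and compares the result with the closed form based on the error function. The helper names `normal_pdf` and `area_within` are made up for this illustration.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal probability density P(x) with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def area_within(k, steps=20_000):
    """Area under the standard normal curve between -k and +k SD (trapezoidal rule)."""
    h = 2 * k / steps
    total = 0.5 * (normal_pdf(-k) + normal_pdf(k))
    for i in range(1, steps):
        total += normal_pdf(-k + i * h)
    return total * h


for k in (1, 2, 3):
    numeric = area_within(k)
    closed_form = math.erf(k / math.sqrt(2))  # same area via the error function
    print(f"within {k} SD: numeric = {numeric:.4f}, erf = {closed_form:.4f}")
```

The printed areas should come out close to 0.68, 0.95 and 0.997 for 1, 2 and 3 SD respectively.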
Let the probability of occurrence of 1 in region R be p. Suppose, for
illustration, that p is 0.54. Then \(0.54N\) people will vote 1, where N
is the total population. Therefore, if we select a person randomly, the
chance that they'd vote 1 would be
(number of people voting 1)/(total number) = \(\frac{0.54N}{N}
=0.54\). This probability is equal to the probability (p) of
getting 1 from R. In theory, this probability should be known. Here,
since we are trying to solve a case where we don't know p (and hence
\(\sigma_{p}\)), we have to make an adjustment. The adjustment
is to substitute the standard error \(\sigma_{p'}\) of the sample (S)
for the unknown \(\sigma_{p}\) of the total population (R).
For a random sample of size n taken from the total population, the mean of
the sample estimate equals that of the population, \(\mu'=\mu\) (that is,
the expected value of p' is p), and the standard deviation of the
sample estimate is \(SD(p')=\sqrt{\frac{p(1-p)}{n}}\).
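These two properties can be checked by repeating the poll many times on a computer. The sketch below is an illustration under stated assumptions: it takes p = 0.54 as in the text, treats each respondent as an independent draw (sampling with replacement, a reasonable approximation when 1000 people are drawn from 1 lakh), and uses a hypothetical count of 2000 repeated polls.

```python
import math
import random
import statistics

P = 0.54          # ASSUMPTION: illustrative population proportion from the text
N_SAMPLE = 1_000  # size n of each poll
N_POLLS = 2_000   # ASSUMPTION: hypothetical number of repeated polls

rng = random.Random(1)  # fixed seed so the run is repeatable


def one_poll(p, n, rng):
    """Simulate one poll of n people; each votes 1 independently with probability p."""
    return sum(rng.random() < p for _ in range(n)) / n


# Each entry is one value of p' from an independent poll.
estimates = [one_poll(P, N_SAMPLE, rng) for _ in range(N_POLLS)]

mean_of_estimates = statistics.mean(estimates)
sd_of_estimates = statistics.pstdev(estimates)
sd_formula = math.sqrt(P * (1 - P) / N_SAMPLE)

print(f"mean of p' over polls = {mean_of_estimates:.4f} (p = {P})")
print(f"SD of p' over polls   = {sd_of_estimates:.4f}")
print(f"sqrt(p(1-p)/n)        = {sd_formula:.4f}")
```

The mean of the simulated p' values should sit close to p, and their spread should sit close to the value given by the formula.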
If n is large, p' is close to p, so p' can stand in for p in this formula. The standard error of
p' is the standard deviation with the unknown parameter (p) replaced by the sample
estimate (p'). Here, \(\sigma_{p'}=SE(p')=\sqrt{\frac{p'(1-p')}{n}}\).
Since p' is known, \(\sigma_{p'}\) can be calculated, and according
to Figure 2, we can say that the unknown p of the total population (R)
lies within \(p'\pm2\sigma_{p'}\) with a probability of ~95%.
Similarly, p lies within \(p'\pm\sigma_{p'}\) with a probability
of ~68%.
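Applied to the poll from the beginning (540 out of 1000 choosing Option 1), the calculation can be sketched as follows. The function name `proportion_interval` and its parameter `k` (the number of standard errors on each side) are illustrative and not taken from the original code.

```python
import math


def proportion_interval(successes, n, k=2):
    """Return (p', SE, lower, upper) for the interval p' ± k·SE."""
    p_prime = successes / n
    se = math.sqrt(p_prime * (1 - p_prime) / n)  # standard error of p'
    return p_prime, se, p_prime - k * se, p_prime + k * se


# ~95% interval: two standard errors on each side of p'.
p_prime, se, low, high = proportion_interval(540, 1000, k=2)
print(f"p' = {p_prime:.3f}, SE = {se:.4f}")
print(f"~95% interval: ({low:.3f}, {high:.3f})")

# ~68% interval: one standard error on each side of p'.
_, _, low1, high1 = proportion_interval(540, 1000, k=1)
print(f"~68% interval: ({low1:.3f}, {high1:.3f})")
```

For this poll the standard error is about 0.016, so the ~95% interval runs from roughly 0.508 to 0.572. Its lower end is above 0.5, which suggests, under these assumptions, that Option 1 is ahead in the total population.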
A description of the Graph: In the graph generated using the python code,
Note:
More information on the confidence interval and its different types can be found in Chapter 8 of the book "Using R for Introductory Statistics" by John Verzani.
If you come across any error, kindly let me know.