Dated: 22nd October, 2020
— Aman Dimri (aman.dimri@niser.ac.in)
Data by itself is not useful and needs to be interpreted to draw interesting conclusions and results. In statistics, we take a probabilistic view of our outcomes, and hence statistical methods are used to provide a sense of confidence in the results we are interpreting from the data. In the literature these methods are called Statistical Hypothesis Tests[1] or, in layman's terms, simply Significance Tests.
Before we move on, let us understand what the term hypothesis means. According to the Oxford dictionary, "A hypothesis is a supposition or proposed explanation made on the basis of limited evidence as a starting point for further investigation." In simpler words, it is a proposed explanation for some phenomenon. However, for a hypothesis to be called scientific it must be falsifiable. Generally, hypotheses in statistics and machine learning take the following forms:[2]
Hypothesis in Machine Learning: Candidate model that approximates a target function for mapping examples of inputs to outputs.
Hypothesis in Statistics: Probabilistic explanation about the presence of a relationship between observations.
Necessity Of Hypothesis Testing In Machine Learning (ML):
After reading up a lot, I found two major uses for hypothesis testing in ML.
Making Sure Sample Data Is Representative Of Real Data: Consider the following scenario. Suppose you were trying to build an ML algorithm to test whether a set of people have COVID-19 or not based on several features such as body temperature, pulse rate, breathing ability, places visited in the past two months and hand-sanitizing frequency. Now you collect sample data by surveying your own neighbourhood. Based on this training data your algorithm is going to fit a model and make predictions. For the predictions to be of any importance, the sample data must be a true reflection of the real population data on which the predictions are to be performed. How can we be sure that this is actually the case? Further, if the sample distribution is not a carbon copy of the population distribution (which is generally the case), then how large a difference is significant enough to lead to bad modelling? Such questions can be answered using hypothesis testing.
Testing The Validity Of Some Model Given By An ML Algorithm: This feels obvious, but consider the following scenario to cement the notion. Suppose you are a shop owner considering installing a coffee vending machine. You develop an ML algorithm to predict the proportion of customers who would be interested in buying coffee from your vending machine. How can you be sure that this prediction is a true reflection of the proportion of customers who would actually buy coffee from you? Such questions can be answered, up to a certain confidence level, using hypothesis testing.
Key Concepts And Definitions:
Before I actually define the methodology of hypothesis testing, we need to know some key definitions. Consider the following classic scenario of drug testing. Suppose from earlier experiments we know that the mean reaction time for the general population, assumed to follow a normal distribution, is 1.30 μs. Suppose you now administer a drug to 100 people. You find that the drugged people have an average reaction time of 1.27 μs with a standard deviation of 0.1 μs. Now you want to answer whether the drug has any effect on the reaction time or not.
Null Hypothesis (H0): It is a statement about the status quo, or about some assumption/distribution that is assumed to be true. Generally, it is a statement regarding some distribution parameter under the belief that no anomaly is present. That is why it is called the 'Null Hypothesis': a hypothesis under which nothing special happens. In our scenario:
H0 : The drug has no effect on people and the mean reaction time remains 1.30 μs (μτ = 1.30 μs)
Alternative Hypothesis (H1/Ha): This is the hypothesis that directly contradicts the Null hypothesis and is the statement for which we are trying to find evidence. It is basically the outcome of the experiment that we hope to show. In our scenario:
Ha: The drug has some effect on people and the mean reaction time changes (μτ ≠ 1.30μs)
p-value (p): To show evidence for your alternative hypothesis you will need some new sample data. In our scenario this sample data corresponds to the reaction times of the 100 people after administering the drug to them. The p-value is the conditional probability of getting sample data at least as extreme as the data you observed, given that the Null hypothesis is true. Mathematically:
p-value = P(Sample Data|H0 is true)
In our scenario this is the probability of getting a sample mean reaction time as far from 1.30 μs as the observed 1.27 μs, given that the distribution for the 100 people is the same normal distribution with mean reaction time 1.30 μs assumed for the population. For the expert readers: since the number of samples > 30, the sampling distribution of the sample mean can be approximated as normal by the central limit theorem (with the sample standard deviation standing in for the population value), and a Z-statistic can be used to quickly compute the p-value.
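For the curious, here is a minimal sketch of that quick computation in Python, assuming a two-tailed test and using the numbers from the scenario above (scipy.stats.norm plays the role of a Z-table):

import numpy as np
from scipy.stats import norm

# Numbers from the drug scenario above
mu_0 = 1.30     # mean reaction time under the Null hypothesis (in microseconds)
x_bar = 1.27    # observed sample mean reaction time (in microseconds)
sigma = 0.1     # sample standard deviation (in microseconds)
N = 100         # number of people administered the drug

# Z-score: how many standard errors the sample mean lies from the hypothesised mean
z = (x_bar - mu_0) / (sigma / np.sqrt(N))

# Two-tailed p-value: probability, under H0, of a sample mean at least this far from mu_0
p_value = 2 * norm.sf(abs(z))

print('Z-score:', z)         # -3.0
print('p-value:', p_value)   # ~0.0027, so H0 would be rejected at alpha = 0.05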
Significance Level (α): This is a pre-chosen threshold value. The significance level is compared to the p-value obtained for the sample data using which we reject or fail to reject the Null hypothesis. This value is generally taken to be 5% or 0.05 but it can be anything depending upon the task at hand.
Confidence Level: Alternatively, one can define:
Confidence Level = 1 - Significance Level
and use the notion of confidence level instead of significance level equivalently. A confidence level of 95% is equivalent to a significance level of 5%.
Methodology Of Hypothesis Testing:
The algorithm for hypothesis testing is beautifully summarised in the following flowchart.
Fig.1 - Hypothesis Testing In A Nutshell
Interpreting The Results Of Hypothesis Testing:
There are two main ways in which the results of hypothesis testing are stated based on p-value and significance level (α):-
'Reject' Vs 'Failure To Reject' The Null Hypothesis:
The p-value is compared to a pre-chosen α. Then:
If p ≤ α, we reject the Null hypothesis and infer that there is some evidence that the alternative hypothesis is true. This is a statistically significant result.
If p > α, we fail to reject the Null hypothesis and no significant result is obtained.
For example, consider a scenario where you are trying to show that the average unemployment rate in your town is greater than the national average of 15%. You form the following hypotheses:
H0- The unemployment rate in my town is equal to the national average of 15%.
Ha- The unemployment rate in my town is greater than the national average of 15%.
You set a significance level of 3%. Now let's look at the following cases:
You computed a p-value of 0.05 or 5% for your sample data. Then you would make the following conclusion:
The test failed to reject the Null hypothesis that the unemployment rate in my town is equal to the national average of 15%, at a 3% significance level (or a 97% confidence level).
You computed a p-value of 0.01 or 1% for your sample data. Then you would make the following conclusion:
The test found that the unemployment rate in my town is greater than the national average of 15%, rejecting the Null hypothesis at a 3% significance level (or a 97% confidence level).
By now, a keen reader will have grasped that the p-value is simply a probabilistic measure. Hence, at the end of this test, we can only comment on how likely our sample data would be if the Null hypothesis were true. Rejecting the Null hypothesis implies that the available statistical evidence is enough to infer that the Null hypothesis is unlikely to be true, and hence that the statement of the alternative hypothesis (which directly contradicts the Null hypothesis) is likely to be true. Otherwise, it means that there is not sufficient statistical evidence to reject the Null hypothesis, and no realistic comment can be made about the alternative hypothesis.
Common Misinterpretations In Hypothesis Testing:
p-value And The Conditional Probability:
A common mistake people make while using this technique is to treat the p-value as the probability of the Null hypothesis (H0) being true given our sample data. This is completely incorrect, because that quantity is mathematically represented as:
P(H0 is true|Sample Data)
However the p-value is defined as:
p-value = P(Sample Data|H0 is true)
This is the classic mistake of confusing P(A|B) with P(B|A), and anyone with a basic knowledge of probability knows that these are two separate quantities.
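A quick numerical sketch, with made-up numbers purely for illustration, shows just how different the two conditional probabilities can be:

# Illustrative (made-up) numbers: a rarely-true hypothesis H and some observed data D
p_H = 0.01               # P(H): prior probability that H is true
p_D_given_H = 0.9        # P(D|H): probability of seeing data D if H is true
p_D_given_notH = 0.05    # P(D|not H): probability of seeing D if H is false

# Total probability of seeing D
p_D = p_D_given_H * p_H + p_D_given_notH * (1 - p_H)

# Bayes' rule: P(H|D) is very different from P(D|H)
p_H_given_D = p_D_given_H * p_H / p_D

print('P(D|H) =', p_D_given_H)             # 0.9
print('P(H|D) =', round(p_H_given_D, 3))   # ~0.154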
Making Exact Claims About Null Hypothesis:
Often, rejecting or failing to reject the Null hypothesis makes people infer that the Null hypothesis is false or true respectively. This is not correct. It only implies that we have chosen to reject or fail to reject the Null hypothesis at a specific statistical significance level (α), based on empirical evidence and the chosen statistical test. Using hypothesis testing one can only make probabilistic claims; you cannot make an exact binary classification of the Null hypothesis as true or false. This notion should become clearer once I introduce the possible errors that might arise when following this method.
Post-hoc Tuning Of The Significance Level Parameter (α):
Often people misinterpret this and think that α is a tunable parameter which can be optimized at the end of the hypothesis testing algorithm. If you have developed this notion, you couldn't be further from reality. Not only is this completely wrong, but if you are a researcher it is HIGHLY UNETHICAL to do so. You would essentially be lying to the whole scientific community.
Consider the following scenario to understand this. Suppose you want to claim that there is a toxic amount of lead in Maggi, and if you manage to prove it you will be compensated with $1 million. Suppose the maximum non-toxic concentration of lead that can be present in a food product (set by the government) is 10 ppm. To give evidence for your claim you form the following hypotheses:
H0- The concentration of lead in Maggi is 10 ppm, which is in compliance with the government standards.
Ha- The concentration of lead in Maggi is greater than 10 ppm, which is a toxic dose for the consumers ingesting the product.
Now you set a significance level (α) of 3%, compile sample data, and find that the p-value for your test is 4%. Hypothesis testing would then fail to reject the Null hypothesis (that the concentration of lead in Maggi is 10 ppm) at a 3% significance level. This inference cannot win you the million dollars. However, you get greedy and decide to tune α post hoc to 5%. Now hypothesis testing concludes that the concentration of lead in Maggi is greater than 10 ppm, rejecting the Null hypothesis at a 5% significance level. You decide to frame it alternatively by claiming that Nestle is harming its consumers by allowing the concentration of lead in Maggi to exceed 10 ppm, rejecting the Null hypothesis at a 95% confidence level. This time around, your inference might win you the million dollars.
I hope that anyone reading this now realizes how sensitive the outcome of hypothesis testing is to the significance parameter α. Post-hoc tuning of α to conclude results is therefore equivalent to lying using statistics. This is why the value of α is ALWAYS fixed according to the nature of the problem, before the sample data is even collected.
Errors In Hypothesis Testing:
Hypothesis testing only makes probabilistic claims about its outcomes. Since its classification is not a crisp binary one (true/false), there is a possibility that we might wrongly reject or fail to reject the Null hypothesis. For example, suppose you set a significance level of 1%. Then, if the Null hypothesis is actually true, there is a 1 in 100 chance that you will nevertheless reject it. Following are the two possible ways in which you can draw wrong conclusions from hypothesis testing.
Type-I Error: This error occurs when you reject the Null hypothesis when in reality the Null hypothesis is true. The significance level (α) sets the probability of making a Type-I error, since we reject the Null hypothesis whenever the sample data returns a value whose probability of occurring under H0 is less than α. Mathematically:
P(Making A Type-I Error) = α
Type-II Error: This error occurs when you fail to reject the Null hypothesis when in reality the Null hypothesis is false. The probability of this error depends not only on α but also on the overlap between the true distribution and the distribution given by the Null hypothesis. The higher the overlap, the greater the possibility of a Type-II error. This type of error is quantified mathematically by defining a parameter called power as follows:
Power = 1 - P(Making A Type-II Error)
The only parameter in our hands that can be used to control these errors is α. Decreasing α reduces the probability of a Type-I error but increases the probability of a Type-II error. Intuitively, the Type-II error grows with decreasing α because a smaller α widens the fail-to-reject region, increasing its overlap with the true distribution. Hence a value of α is chosen such that this trade-off is acceptable for the problem at hand. However, as explained earlier, α must be chosen before any sample data analysis is performed.
Reality \ Based On Hypothesis Test | Failed To Reject H0 | Rejected H0
-----------------------------------|---------------------|--------------------
H0 Is True                         | Correct Conclusion  | Type-I Error
H0 Is False                        | Type-II Error       | Correct Conclusion
Table-1: Possible Errors In Hypothesis Testing
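To make the trade-off between the two errors concrete, here is a minimal simulation sketch (the distribution parameters below are made up purely for illustration) that estimates both error rates and the power for two choices of α:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N, trials = 30, 10_000
mu_null, mu_alt, sigma = 0.0, 0.5, 1.0   # made-up parameters, purely for illustration

def rejection_rate(true_mean, alpha):
    # Fraction of simulated experiments in which H0 (mean = mu_null) gets rejected
    rejections = 0
    for _ in range(trials):
        sample = rng.normal(true_mean, sigma, N)
        z = (sample.mean() - mu_null) / (sigma / np.sqrt(N))
        p = 2 * norm.sf(abs(z))          # two-tailed p-value
        rejections += (p <= alpha)
    return rejections / trials

for alpha in (0.05, 0.01):
    type1 = rejection_rate(mu_null, alpha)      # H0 is true but gets rejected
    type2 = 1 - rejection_rate(mu_alt, alpha)   # H0 is false but is not rejected
    print(f"alpha={alpha}: Type-I rate ~ {type1:.3f}, Type-II rate ~ {type2:.3f}, power ~ {1 - type2:.3f}")

In this particular setup, lowering α from 5% to 1% pushes the Type-I rate down but raises the Type-II rate noticeably, which is exactly the trade-off described above.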
Hypothesis Tests:
There are an incredible variety of hypothesis tests to tackle a variety of problems. This section gives the list compiled by Jason Brownlee[4] in his blog at Machine Learning Mastery, giving some examples of hypothesis tests and where they are generally used.
Variable Distribution Type Tests (Gaussian)
D’Agostino’s K2 Test
Shapiro-Wilk Test
Anderson-Darling Test
Variable Relationship Tests (correlation)
Chi-Squared Test
Spearman’s Rank Correlation
Pearson’s Correlation Coefficient
Kendall’s Rank Correlation
Compare Sample Means (parametric)
Student’s t-test
The Z-Test
Paired Student’s t-test
Analysis of Variance Test (ANOVA)
Repeated Measures ANOVA Test
Compare Sample Means (nonparametric)
Kruskal-Wallis H Test
Mann-Whitney U Test
Wilcoxon Signed-Rank Test
Friedman Test
It is not really possible to explain all of these tests in this one lecture note. However, I am going to touch upon the t-test and the Z-test, as they come up a lot and are very easy to implement. Both of these tests are generally used to compare sample means and follow the same hypothesis-testing methodology (of course!!), with the underlying sampling distributions arising from a t-statistic and a Z-statistic respectively[5]. The p-value for these tests is computed using what is called a T table or a Z table (one- or two-tailed depending upon the problem). To use the T/Z table, we first compute a Z-score, which basically tells us how many standard errors the observed sample mean is from the hypothesised mean value. Mathematically:
\[\text{Z-Score} = \frac{\bar{x}-\mu}{\sigma/\sqrt{N}} \]
\[\bar{x}=\text{Sample mean Value}, \hspace{3mm} \mu=\text{Mean Value From Null Hypothesis}\]
\[\sigma=\text{Sample Standard Deviation Value}, \hspace{3mm} N=\text{Number Of Sample Values}\]
The Z-score, along with the Z/T tables[5], can be directly used to compute the p-value and finish the hypothesis test. The reason I explain these two tests is that they are incredibly easy to understand and implement, and yet are quite useful. The next part deals with implementing them in Python using the scipy.stats library.
Implementing T-Test Using Python:
Consider the following scenario. A government survey says that the national average height among its male population is 2.4 m. However, you believe that this statistic does not hold for your town. To support your claim you put forth the following hypotheses:
H0- The average male height in my town is exactly equal to the national average of 2.4m
Ha- The average male height in my town is not equal to the national average of 2.4m
You set the significance level to 5%. Now you collect sample data by randomly asking 10 people. Since the sample size is less than 30, a t-statistic must be used[5].
The following code snippet runs the T-test in Python:
from scipy.stats import ttest_1samp
import numpy as np
import pandas as pd
#Importing Data
data = pd.read_csv(r'C:\Users\PRADEEP\Downloads\height.csv')
print(data.shape)
data.columns = ['Index','Height(in m)']
#Setting Up The Sample Data
heights = data['Height(in m)']
#Printing Sample Data
print(heights)
#Calculating The Mean Parameter For Sample Data
heights_mean = np.mean(heights)
#Printing Mean Of Sample Data
print('The mean value for our sample data is:', heights_mean)
#Performing The T-Test.
#Setting The Mean Parameter For Null Hypothesis (2.4). Computing P-value.
tset, pval = ttest_1samp(heights, 2.4)
#Printing P-value
print('The p-value for our hypothesis is:', pval)
# Significance level (alpha) = 5%
#Interpreting The Results
if pval < 0.05:
    print("We are rejecting the null hypothesis at a 5% significance level.")
else:
    print("We are failing to reject the null hypothesis at a 5% significance level.")
Implementing Z-Test Using Python:
Consider the same scenario as when we implemented the T-test. However, this time around you decide to sample 200 people instead of just 10. Since the sample size is greater than 30, a Z-statistic must be used[5].
To run the Z-test, first import the following Python libraries:
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats import weightstats as stests
Now in the code snippet of the T-Test replace the following lines:
#Performing The T-Test.
#Setting The Mean Parameter For Null Hypothesis (2.4). Computing P-value.
tset, pval = ttest_1samp(heights, 2.4)
with the following lines:
#Performing The Z-Test.
#Setting The Mean Parameter For Null Hypothesis (2.4). Computing P-value.
ztest, pval = stests.ztest(heights, x2=None, value=2.4)
and voila!! We now have successfully implemented the Z-test. The only thing left to do is to interpret the results.
One-Tailed And Two-Tailed Tests:
This is the final subsection of this lecture note, wherein I want to explain the notion of one-tailed and two-tailed tests.
One-Tailed Test - In this statistical test, the region for rejecting the Null hypothesis at a certain significance level lies on only one side of the sampling distribution.
Fig.2 - One-Tailed Hypothesis Testing[6] With A significance level of 5%
Two-Tailed Test - This is a statistical test wherein the region for rejecting the Null hypothesis lies on both sides of the sampling distribution; it tests whether a sample statistic is significantly greater than or less than a certain range of values.
Fig.3 - Two-Tailed Hypothesis Testing[6] With A significance level of 2.5%
Whether you run a one-tailed or a two-tailed test depends entirely on the alternative hypothesis. To understand this, consider the following scenario. Suppose you have designed a new motor engine and you want to claim that it is different from the older motor engines on the basis of carbon emissions. Suppose the older motor engines have a carbon emission of 2 g for every litre of fuel burned. To prove your claim you put forward the following hypotheses:
H0- My motor engine has carbon emission of 2g for every litre of fuel burned which is exactly equal to that of older engines
Ha- My motor engine is different from the older engines and does not have a carbon emission of 2g for every litre of fuel burned
In this case, you don't care whether you have higher or lower emissions, but simply that they are not the same as those of the previous generation. Hence a two-tailed test must be employed.
On the other hand, suppose you wanted to claim that your motor engine is more environmentally friendly. Then you would have set up your hypotheses as:
H0- My motor engine has carbon emission of 2g for every litre of fuel burned which is exactly equal to that of older engines
Ha- My motor engine is more environmentally friendly than the older engines and has a carbon emission of less than 2g for every litre of fuel burned
In this case, you explicitly want the carbon emissions to be lower, and hence a one-tailed test must be employed. A keen reader will notice that both the Z-test and the t-test can be one- or two-tailed. To switch between them for the Z-test, statsmodels' ztest has an additional keyword parameter called alternative, which takes the values 'two-sided', 'larger' or 'smaller' to run a two-tailed test or a one-tailed test in the upper or lower direction respectively; if the parameter is not mentioned explicitly, the default value is 'two-sided'. scipy's ttest_1samp (one-sample) and ttest_ind (two-sample) both return two-tailed p-values by default; a one-tailed p-value can be obtained by halving the two-tailed p-value when the observed effect lies in the hypothesised direction. Thus, in the previous code snippets, I was implementing two-tailed t/Z tests, which matches the 'not equal to 2.4 m' alternative hypothesis set up there.
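As a minimal sketch of the difference (reusing the synthetic-heights idea from earlier, with made-up numbers), the alternative keyword of ztest switches between the two kinds of test:

import numpy as np
from statsmodels.stats import weightstats as stests

# Hypothetical sample of 200 heights (in metres), made up purely for illustration
heights = np.random.default_rng(1).normal(2.35, 0.1, 200)

# Two-tailed: Ha is "mean height is not equal to 2.4 m" (the default)
zstat, p_two = stests.ztest(heights, x2=None, value=2.4, alternative='two-sided')

# One-tailed: Ha is "mean height is smaller than 2.4 m"
zstat, p_one = stests.ztest(heights, x2=None, value=2.4, alternative='smaller')

print('two-tailed p-value:', p_two)
print('one-tailed p-value:', p_one)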