Empirical Risk Minimization and the PAC Learning Framework


Several questions arise when we design and analyze algorithms that learn from data: for example, what can a learner guarantee about the quality of its output hypothesis, and how much data does it need to give such a guarantee?

To address these questions we introduce the Probably Approximately Correct (PAC) learning framework. Before doing so, we first need to understand what Empirical Risk Minimization (ERM) is. In this lecture we first look at the empirical risk minimization algorithm for learning and analyze its performance on finite hypothesis classes; we then prove some fundamental theorems related to ERM and introduce PAC learning along with its fundamental results.


1.1 Empirical Risk Minimization

We consider the following model of supervised learning. Given a set of training data $S$ sampled from an unknown data generating distribution $D$ and labelled by some true labelling function $f$, the goal of the learner is to come up with a hypothesis function $h$ which has small expected loss $L_{D,f}(h)$. Since the learner has no information about $D$, a natural strategy is to output a hypothesis $h$ which performs well on the sampled training data $S$ and hope that it will also perform well on the underlying input distribution. This approach is known as Empirical Risk Minimization (ERM). Instead of minimizing the expected loss, the ERM algorithm minimizes the sample (empirical) loss, defined as $$L_S(h) = \frac{1}{|S|} \sum_{i=1}^{|S|}l(h,(x_i,f(x_i)))$$

There is a problem with this approach: ERM can simply memorize the sample data and output the hypothesis \begin{equation} \bar h_S(x)=\begin{cases} f(x_i), & \text{if $(x_i,f(x_i)) \in S$}.\\ 0, & \text{otherwise}. \end{cases} \end{equation} which has zero sample loss. Although $\bar h_S$ performs well on the sampled data, it has learned nothing about the input distribution beyond the finitely many sampled points and thus will not generalize well for an arbitrary distribution. It can do well on an input distribution that places most of its mass on the sampled data points, but for an arbitrary distribution it is a bad choice. This problem is known as overfitting.

A common solution to overfitting is to incorporate prior knowledge (an inductive bias) about the learning task into the learning algorithm. For example, restricting the learning algorithm to output a hypothesis from some hypothesis class $H$, together with a guarantee that there exists a good $h \in H$, guards us against overfitting; using $H$ with such a guarantee is the prior knowledge we add to the learning process. Adding this prior knowledge to the ERM algorithm gives the $ERM_H$ algorithm, which minimizes the empirical loss over $h \in H$ instead of over all possible functions. Formally, $$ERM_H(S) \in \underset{h \in H}{\mathrm{argmin}}\; L_S(h)$$
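To make this concrete, here is a minimal Python sketch of $ERM_H$ over a finite hypothesis class; the class of threshold functions, the uniform marginal, and all names are illustrative assumptions rather than anything fixed by these notes.

```python
# A minimal sketch of ERM over a finite hypothesis class (illustrative setup):
# H is a set of threshold classifiers on [0, 1], D is uniform on [0, 1], and the
# labels come from a true threshold f that belongs to H.
import random

def make_threshold(t):
    """h_t(x) = 1 if x >= t else 0."""
    return lambda x: 1 if x >= t else 0

def empirical_loss(h, S):
    """L_S(h): average 0-1 loss of h on the labelled sample S."""
    return sum(1 for x, y in S if h(x) != y) / len(S)

def erm(H, S):
    """ERM_H(S): any hypothesis in H minimizing the empirical loss."""
    return min(H, key=lambda h: empirical_loss(h, S))

thresholds = [i / 100 for i in range(101)]      # finite class, |H| = 101
H = [make_threshold(t) for t in thresholds]

f = make_threshold(0.37)                        # true labelling function, f in H
S = [(x, f(x)) for x in (random.random() for _ in range(200))]  # i.i.d. sample

h_S = erm(H, S)
print("empirical loss of ERM output:", empirical_loss(h_S, S))  # 0 here
```

In the realizable setting sketched here ($f \in H$), the returned hypothesis always has zero empirical loss, exactly as used in the proof below.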


1.2 $ERM_H$ on Finite Hypothesis Classes

In this section, we prove the following theorem.
Theorem 1.1. Let $X$ be any domain set, $Y=\{0,1\}$ be the label set, and let $H$ be a finite set of functions from $X \rightarrow Y$. Assume that the training data $S \subset X \times Y$ is sampled i.i.d. (independent and identically distributed) from some distribution $D$ over $X$ and labelled using some $f \in H$, and that the loss function $l$ is bounded. Then, given a sufficiently large sample $S$, $ERM_H$ is guaranteed to output an $h$ that has small expected loss.

Proof. Let us define bad samples to be those on which $ERM_H$ may output a bad $h$.
Definition 1.2. For some fixed accuracy parameter $\epsilon > 0$, we say $S$ is a bad sample if $L_{D,f}(h_S) > \epsilon$, where $h_S$ is the output of $ERM_H$ on $S$.

Let us also define the set of bad hypotheses $H_B = \{h \in H \mid L_{D,f}(h)> \epsilon\}$. To prove the theorem, we wish to upper bound the probability of drawing a bad sample $S$, which is given by $$D^m[\{S|_x \mid L_{D,f}(h_S) >\epsilon \}]$$ Note that since we assumed that the true labelling function is contained in our hypothesis class, the output hypothesis $h_S$ always has zero empirical loss, that is, $L_S(h_S) = 0$. Having made that observation, define the set of misleading samples to be the samples that make some bad hypothesis a candidate output of $ERM_H$: $$M = \{S|_x \mid \exists h\in H_B \text{ s.t. } L_S(h) = 0\}$$ Now observe that the set of bad samples is a subset of the misleading samples, which gives us \begin{equation} \begin{split} D^m[\{ S|_x \mid L_{D,f}(h_S) > \epsilon\}] & \leq D^m[\{S|_x \mid \exists h \in H_B \text{ s.t. } L_S(h) = 0 \}] \\ & = D^m[\cup_{h\in H_B} \{ S|_x \mid L_S(h)=0\} ] \\ & \leq \sum_{h \in H_B} D^m[\{ S|_x \mid L_S(h) = 0\}] \text{ (union bound)} \end{split} \end{equation}

Let us first consider the case of the $0$-$1$ loss function; we will generalize the result to bounded loss functions later. With the $0$-$1$ loss, $L_{D,f}(h)$ is the probability mass of the points on which $h$ errs, so a bad hypothesis $h \in H_B$ errs on at least an $\epsilon$ fraction of the mass. Hence, for $h \in H_B$, $$D^m[\{S|_x \mid L_S(h) = 0 \}] \leq (1 - \epsilon)^m$$ where we have used the i.i.d. assumption on the sampling process. This gives us an upper bound on the probability of drawing a bad sample, $$D^m[\{ S|_x \mid L_{D,f}(h_S) > \epsilon \}] \leq |H_B|(1- \epsilon)^m \leq |H| e^{-\epsilon m}$$ which is at most $\delta$ for $m \geq \frac{1}{\epsilon} \log(|H|/\delta)$.

We now give a similar result for bounded loss functions. Let $l$ be a loss function bounded in the interval $[0,B]$. Observe that the expectation of $L_S(h) = \frac{1}{|S|} \sum_{i=1}^{|S|} l(h(x_i),f(x_i))$ with respect to $D^m$ is $L_{D,f}(h)$. With that observation, we note that for $h \in H_B$ $$D^m[\{S|_x \mid L_S(h)= 0\}] = P_{S \sim D^m}[L_S(h) = 0] \leq P_{S \sim D^m}[E[L_S(h)]-L_S(h) \geq \epsilon]$$ Since we have a random variable, $L_S(h)$, deviating from its expectation by at least $\epsilon$, we call on concentration inequalities to bound the probability. Since $L_S(h)$ is an average of i.i.d. bounded random variables, we use Hoeffding's inequality: $$P_{S \sim D^m} [E[L_S(h)] - L_S(h) \geq \epsilon] \leq P_{S \sim D^m}[|L_S(h)-E[L_S(h)]| \geq \epsilon ] \leq 2\exp\Big(-\frac{2m\epsilon^2}{B^2}\Big)$$ Using this, the probability of drawing a bad sample for a bounded loss function is upper bounded by $$D^m[\{S|_x \mid L_{D,f}(h_S) > \epsilon\}] \leq 2|H_B|\exp\Big(-\frac{2m\epsilon^2}{B^2}\Big)$$ which is at most $\delta$ for $m = O\big(\frac{1}{\epsilon^2}\log(\frac{2|H|}{\delta})\big)$.

Theorem 1.2 (Hoeffding's Inequality). Let $X_1,X_2,...,X_n$ be i.i.d. random variables taking values in the interval $[0,B]$. Let $\bar X = \frac{1}{n}(X_1 + ...+X_n)$ be the empirical mean of these variables. Then $$P[|\bar X - E[\bar X]| \geq t] \leq 2\exp\Big(-\frac{2nt^2}{B^2}\Big)$$
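As a quick sanity check, the following Monte Carlo sketch (the Bernoulli parameter and all names are illustrative assumptions) compares the observed frequency of large deviations of an empirical mean with the Hoeffding bound above, here with $B = 1$.

```python
# Monte Carlo check of Hoeffding's inequality for X_1,...,X_n i.i.d. Bernoulli(p),
# so B = 1: compare the observed frequency of |X_bar - p| >= t with 2*exp(-2*n*t^2).
import math
import random

def deviation_frequency(n, p, t, trials=20000):
    """Fraction of trials in which |X_bar - E[X_bar]| >= t."""
    hits = 0
    for _ in range(trials):
        x_bar = sum(1 for _ in range(n) if random.random() < p) / n
        if abs(x_bar - p) >= t:
            hits += 1
    return hits / trials

n, p, t = 100, 0.3, 0.1
observed = deviation_frequency(n, p, t)
bound = 2 * math.exp(-2 * n * t * t)
print(f"observed: {observed:.4f}, Hoeffding bound: {bound:.4f}")  # observed <= bound
```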


Above we introduced the $ERM_H$ learning algorithm and proved that, under some assumptions, $ERM_H$ succeeds on finite hypothesis classes in outputting a good $h \in H$, i.e. one with $L_{D,f}(h) \leq \epsilon$, with probability at least $1-\delta$, given a sufficiently large sample $S$ whose size is a function of $\epsilon$ and $\delta$.


In the lecture notes above, we introduced the $ERM_H$ algorithm and proved that on finite hypothesis classes it succeeds in outputting a hypothesis which is probably approximately correct (PAC). Below, we introduce the notion of Probably Approximately Correct learning. We also show that $ERM_H$ succeeds on finite hypothesis classes under relaxed assumptions.


2.1 PAC Learning

Definition 2.1 (PAC Learnability): A hypothesis class $H$ is PAC learnable with respect to a loss function $l$ if there exist a function $m_H : (0,1) \times (0,1) \rightarrow \mathbb{N}$ and a learning algorithm $A$ with the following property: for every $\epsilon,\delta \in (0,1)$, for every distribution $D$ over $X$, and for every labelling function $f \in H$, we have $P_{S \sim D^m}(L_{D,f}(A(S)) \geq \epsilon) \leq \delta$ for every $m \geq m_H(\epsilon,\delta)$, where the examples in $S$ are generated i.i.d. from $D$ and labelled by $f$.

The definition of PAC learning has two parameters: an accuracy parameter $\epsilon$, determining the quality of the output hypothesis, and a confidence parameter $\delta$, indicating how often the learning algorithm is allowed to fail to meet the accuracy requirement. Also note that PAC learning gives a distribution-free guarantee: the sample complexity $m_H$ depends only on $\epsilon$ and $\delta$ and is independent of the data generating process $(D,f)$. With the PAC learning definition in place, we rephrase the result shown in the lecture notes above.

Corollary 2.1 Every finite hypothesis class $H$ is PAC learnable with respect to the $0$-$1$ loss with sample complexity $m_H(\epsilon,\delta) \leq \lceil \frac{1}{\epsilon} \log(\frac{|H|}{\delta}) \rceil$. Furthermore, the $ERM_H$ algorithm achieves this.
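For a rough numeric feel, the sketch below plugs illustrative values of $\epsilon$, $\delta$ and $|H|$ into the two sample-size bounds derived above (taking the loss bound $B = 1$; the function names are ours, not from the notes).

```python
# Sample-size bounds from the proof above, evaluated numerically (a sketch).
import math

def m_realizable(eps, delta, H_size):
    """m >= (1/eps) * log(|H|/delta): realizable case with 0-1 loss."""
    return math.ceil(math.log(H_size / delta) / eps)

def m_bounded_loss(eps, delta, H_size, B=1.0):
    """m >= (B^2/(2*eps^2)) * log(2*|H|/delta): bounded loss via Hoeffding."""
    return math.ceil(B * B * math.log(2 * H_size / delta) / (2 * eps * eps))

print(m_realizable(0.05, 0.01, H_size=10**6))    # 369 samples
print(m_bounded_loss(0.05, 0.01, H_size=10**6))  # 3823 samples
```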

In the next section we generalize the learning framework introduced in the above lecture, and extend the notion of PAC learning to this new framework.


2.2 A New Framework for Learning

One of the main weaknesses of the learning framework defined above is the assumption that there exists a true labelling function $f$. Assuming that the labels are fully determined by the features of the input elements is not realistic, so we modify the old framework to remove this assumption. We now assume that our data is generated by the following model: the data generating distribution $D$ is a joint distribution over $X \times Y$; a sample is drawn by first drawing $x$ according to the marginal of $D$ over $X$ and then drawing its label $y$ according to the conditional distribution $D_{y|x}$.

The goal of the learning algorithm remains the same: we wish to find $h \in H$ that minimizes the true risk. Note that for any $D$ over $X \times \{0,1\}$, the best predictor in the above model is given by \begin{equation} \bar h(x)=\begin{cases} 1, & \text{if $D_{y|x}(y=1|x) >1/2$}.\\ 0, & \text{otherwise}. \end{cases} \end{equation} $\bar h(x)$ is known as the Bayes Optimal Predictor. Although $\bar h(x)$ is the best hypothesis, its definition requires knowledge of $D$ and thus we cannot use it directly.
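As a small illustration (the conditional probability function $\eta$ and all names below are assumed for the example), the following sketch thresholds $D_{y|x}(y=1|x)$ at $1/2$ and evaluates the resulting Bayes risk; note that both steps need $D_{y|x}$ itself, which is exactly why the Bayes optimal predictor cannot be used directly.

```python
# Bayes optimal predictor sketch: eta(x) stands for D_{y|x}(y = 1 | x).
# The predictor thresholds eta at 1/2; its expected 0-1 loss (the Bayes risk)
# is E_x[min(eta(x), 1 - eta(x))], which no hypothesis can beat.
def bayes_predictor(eta, x):
    """Return 1 iff P(y = 1 | x) > 1/2."""
    return 1 if eta(x) > 0.5 else 0

def bayes_risk(eta, xs):
    """Average over a finite domain xs (uniform marginal, for simplicity)."""
    return sum(min(eta(x), 1 - eta(x)) for x in xs) / len(xs)

eta = lambda x: x                       # noisy labels: P(y = 1 | x) = x on [0, 1]
xs = [i / 1000 for i in range(1001)]
print(bayes_predictor(eta, 0.8))        # 1
print(round(bayes_risk(eta, xs), 3))    # ~0.25: irreducible error even for the best h
```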
In the section to come, we define the notion of PAC learning for this new model, known as agnostic PAC learning, a natural extension of the previous one.


2.3 Agnostic PAC Learning

Note that in PAC learning we not only assumed that there exists a true labelling function $f$, we also assumed that $f$ belongs to our hypothesis class $H$. In the new model we make no such assumptions about the labelling process, so when we talk about successful learning, what we can hope to achieve is to do well with respect to the best function in the given hypothesis class $H$. We now define agnostic PAC learnability.

Definition 2.3 (Agnostic PAC Learnability): A hypothesis class $H$ is agnostic PAC learnable with respect to a loss function $l$ if there exist a function $m_H:(0,1) \times (0,1) \rightarrow \mathbb{N}$ and a learning algorithm $A$ with the following property: for every $\epsilon, \delta \in (0,1)$ and for every distribution $D$ over $X \times Y$, we have $P_{S \sim D^m}(L_D(A(S)) \geq \min_{h \in H} L_D(h)+\epsilon) \leq \delta$ for every $m\geq m_H(\epsilon,\delta)$, where the examples in $S$ are generated i.i.d. from $D$.

Note that the definition of agnostic PAC learning reduces to PAC learning when a true labelling function $f$ belonging to $H$ exists, since then $\min_{h \in H} L_D(h)$ is $0$. So if a class is agnostic PAC learnable, it is also PAC learnable. Below we extend the result of the previous lecture and show that every finite hypothesis class is also agnostic PAC learnable by $ERM_H$.


2.4 $ERM_H$ on Finite Hypothesis Classes in the New Framework

Recall that the goal of learning is to minimize the expected loss with respect to the underlying unknown data generating distribution $D$. Since we have no knowledge of $D$ but only access to i.i.d. samples from $D$, we minimize the empirical loss and rely on the law of large numbers, which states that the average of i.i.d. random variables is close to its expected value with high probability, to get good performance in expectation. To make this intuition work, we need the empirical loss to be close to the expected loss for all $h \in H$ simultaneously; the empirical loss then acts as a good proxy for the expected loss and $ERM_H$ ends up performing well. Formally, we want the empirical loss to converge uniformly over $H$ to the expected loss as the number of samples increases. Recall the notion of uniform convergence.

Definition 2.4 (Uniform Convergence): Given a domain set $E$ and real-valued functions $f_n : E \rightarrow \mathbb{R}$ for $n= 1,2,3,...$, we say that the sequence $(f_n)_{n\in \mathbb{N}}$ converges uniformly on $E$ to the limit $f:E \rightarrow \mathbb{R}$ if for every $\epsilon > 0$ there exists a natural number $N$ such that for all $n \geq N$ and all $x \in E$, $$|f_n(x)-f(x)| < \epsilon$$ Observe the following simple claim.

Claim 2.5. Given a sequence of functions $(f_n)_{n\in \mathbb{N}}$ uniformly convergent on $E$ with limit $f:E \rightarrow \mathbb{R}$, there exists $N$ such that for all $n \geq N$ we have $f(\bar x) \leq f(x^*)+ \epsilon$, where $\bar x = \arg\min_x f_n(x)$ and $x^*= \arg\min_x f(x)$.

Proof. Let $N$ be such that for all $n \geq N$ and all $x \in E$, we have $|f_n(x)-f(x)| < \epsilon/2$. Then for all $n \geq N$, $$ f(\bar x) \leq f_n(\bar x) + \epsilon/2 \leq f_n(x^*) +\epsilon/2 \leq f(x^*) + \epsilon$$

What the previous claim says is that if two functions are $\epsilon/2$ close at every point, then minimizing one yields an $\epsilon$ (additive) approximate minimizer of the other. So if the number of samples is large enough that $L_S(h)$ is $\epsilon/2$ close to $L_D(h)$ for all $h \in H$, then by the previous claim, minimizing $L_S(h)$ outputs a hypothesis whose performance is within $\epsilon$ of the best hypothesis in $H$. We complete the details in the next theorem and show that $ERM_H$ succeeds in providing agnostic PAC learnability over finite hypothesis classes.
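Here is a tiny numeric sketch of this step (the objective below is made up purely for illustration): if a proxy function is within $\epsilon/2$ of the true function everywhere, its minimizer is an $\epsilon$-approximate minimizer of the true function.

```python
# Minimal illustration of Claim 2.5 on a finite grid: f_n is a perturbation of f
# with |f_n - f| <= eps/2 everywhere, so minimizing f_n is eps-optimal for f.
import random

eps = 0.2
xs = [i / 100 for i in range(101)]
f = {x: (x - 0.6) ** 2 for x in xs}                              # "true" objective, like L_D
f_n = {x: f[x] + random.uniform(-eps / 2, eps / 2) for x in xs}  # proxy, like L_S

x_bar = min(xs, key=lambda x: f_n[x])   # minimizer of the proxy
x_star = min(xs, key=lambda x: f[x])    # true minimizer
assert f[x_bar] <= f[x_star] + eps      # guaranteed by Claim 2.5
print(f[x_bar], f[x_star])
```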


Theorem 2.6 Every finite hypothesis class is agnostic PAC learnable by $ERM_H$ with respect to any bounded loss function.

Proof. Let us define $\epsilon$-representative samples.
Definition 2.7 $S$ is called $\epsilon$-representative if $\forall h \in H$, $|L_S(h) - L_D(h)| \leq \epsilon.$

Observe that if we sample an $\epsilon/2$-representative sample $S$, then from Claim 2.5 we know that $L_S(h)$ is a good proxy for $L_D(h)$, and moreover $$L_D(h_S) \leq \underset{h \in H}{\min}\, L_D(h) + \epsilon$$ where $h_S$ is the output of $ERM_H$. So all that is left to do is to bound the probability of not sampling an $\epsilon/2$-representative sample. Let us call samples that are not $\epsilon/2$-representative bad samples. Then \begin{equation} \begin{split} P_{S \sim D^m}(\text{$S$ is not } \epsilon/2 \text{-representative}) & = D^m[\{ S \mid \exists h \in H \text{ s.t. } |L_S(h)-L_D(h)| \geq \epsilon/2\}]\\ & = D^m[\cup_{h \in H}\{S \mid |L_S(h) - L_D(h)| \geq \epsilon/2\}] \\ & \leq \sum_{h \in H} D^m[\{S \mid |L_S(h) - L_D(h)| \geq \epsilon/2\}] \text{ (union bound)}\\ \end{split} \end{equation} For a loss function $l$ bounded in the interval $[0,B]$, observe that $L_S(h)$ is an average of i.i.d. bounded random variables with expectation equal to $L_D(h)$; using Hoeffding's inequality we have, for $h \in H$, \begin{equation} \begin{split} D^m[\{S \mid |L_S(h)-L_D(h)| \geq \epsilon/2\}] & = P_{S \sim D^m}[|L_S(h)-L_D(h)| \geq \epsilon/2]\\ & = P_{S \sim D^m}[|L_S(h)-E[L_S(h)]| \geq \epsilon/2 ]\\ & \leq 2\exp\Big(-\frac{m\epsilon^2}{2B^2}\Big) \text{ (Hoeffding's inequality)}\\ \end{split} \end{equation} Combining everything, we have \begin{equation} \begin{split} P_{S \sim D^m}\big(L_D(h_S) \geq \min_{h \in H} L_D(h)+ \epsilon\big)& \leq P_{S \sim D^m}(\text{$S$ is not } \epsilon/2 \text{-representative})\\ & = D^m[\{S \mid \exists h \in H \text{ s.t. } |L_S(h)-L_D(h)| \geq \epsilon/2\}]\\ & \leq \sum_{h \in H} D^m[\{S \mid |L_S(h) - L_D(h)| \geq \epsilon/2\}]\\ & \leq 2|H|\exp\Big(-\frac{m \epsilon^2}{2B^2}\Big)\\ \end{split} \end{equation} which is at most $\delta$ for $m = O\big(\frac{1}{\epsilon^2}\log(\frac{2|H|}{\delta})\big)$.
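To see the guarantee in action, here is a Monte Carlo sketch (the noisy threshold distribution, the class of thresholds, and all names are illustrative assumptions): it draws samples of roughly the size suggested by the bound above and checks that the $ERM_H$ output exceeds the best-in-class loss by more than $\epsilon$ in at most a $\delta$ fraction of trials.

```python
# Agnostic ERM sketch: H holds 11 threshold predictors, labels are noisy,
# m is taken from m >= (2/eps^2) * log(2|H|/delta) with B = 1, and we check
# how often L_D(h_S) > min_h L_D(h) + eps over repeated trials.
import math
import random

thresholds = [i / 10 for i in range(11)]                 # |H| = 11
NOISE, TRUE_T = 0.1, 0.4

def true_loss(t):
    """L_D(h_t) under x ~ U[0,1], y = 1[x >= TRUE_T] flipped with prob. NOISE."""
    return abs(t - TRUE_T) * (1 - 2 * NOISE) + NOISE

def sample(m):
    S = []
    for _ in range(m):
        x = random.random()
        clean = 1 if x >= TRUE_T else 0
        y = 1 - clean if random.random() < NOISE else clean
        S.append((x, y))
    return S

def erm(S):
    """Threshold in H minimizing the empirical 0-1 loss on S."""
    return min(thresholds, key=lambda t: sum(((x >= t) != (y == 1)) for x, y in S))

eps, delta = 0.1, 0.05
m = math.ceil(2 * math.log(2 * len(thresholds) / delta) / eps ** 2)   # ~1218 samples
best = min(true_loss(t) for t in thresholds)

trials = 200
failures = sum(true_loss(erm(sample(m))) > best + eps for _ in range(trials))
print(f"m = {m}, observed failure rate = {failures / trials:.3f} (should be <= {delta})")
```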


Conclusion

In this lecture we introduced the notions of ERM, PAC learning, and agnostic PAC learning, and showed that every finite hypothesis class is agnostic PAC learnable by $ERM_H$.

