# diochnos/teaching/CS5970/2020F

## CS 5970 – Computational Learning Theory (Fall 2020)

### Course Description

Topics of machine learning theory. Learning using membership queries, equivalence queries, version spaces, decision trees, perceptrons. Probably approximately correct (PAC) learning, Occam algorithms, VC-dimension, sample sizes for distribution-independent learning. Representation issues, proper learning, reductions, intractability, learning in the realizable case, agnostic learning. Noise models, statistical queries, PAC learning under noise. Adversarially robust learning against poisoning attacks and against adversarial examples. Distribution-specific learning and evolvability. Online learning and learning with expert advice in the mistake bound model. Weak and strong learning (boosting).

### Basic Information

#### Syllabus

The syllabus is available here.

#### Time and Location

Mondays, Wednesdays, and Fridays, 9:30am – 10:20am, Devon Energy Hall 0120.

#### Office Hours

I will be holding my office hours at the following times.

Mondays
10:30am – 12:30pm, 244 Devon Energy Hall
Wednesdays
10:30am – 12:30pm, 244 Devon Energy Hall

Please note that while anyone is welcome during the entire 2-hour time period on Mondays and Wednesdays, CS 5970 will have precedence during the first hour (10:30am – 11:30am) and CS 4033/5033 will have precedence during the second hour (11:30am – 12:30pm).

##### Exceptions to the Regular Schedule of Office Hours

If you want to meet me outside of my office hours, please send me an email and arrange an appointment.

Thursday, September 3, 2020: I will be holding office hours between 10:00am – 12:00pm. My regular office hours that were planned for Wednesday, September 2, 2020 are canceled. Please see Canvas for the Zoom link for these makeup office hours.

Thursday, September 10, 2020: I will be holding office hours between 10:00am – 12:00pm. My regular office hours that were planned for Wednesday, September 9, 2020 are canceled. Please see Canvas for the Zoom link for these makeup office hours.

Wednesday, September 16, 2020: I will be holding my office hours between 10:30am – 12:00pm; that is, 30 minutes less today.

### Homework Assignments

Assignment 1: Announced Friday, August 28. Due Friday, September 11.

Assignment 2: Announced Friday, September 11. Due Monday, September 21.

Assignment 3: Announced Tuesday, September 22. Due Thursday, October 1.

Assignment 4: Announced Monday, October 12. Due Wednesday, October 21.

Assignment 5: Announced Monday, November 2. Due Friday, November 13.

Assignment 6: Announced Monday, November 16. Due Tuesday, December 1.

### Final Projects

In general I am open to discussion for various ideas regarding these projects. The route that I expect most people will take, is to either work alone or form a group with someone else, read a research paper, and present what they read to others during the poster session that takes place at the end of the semester. Another idea is for 2 people to form a group and implement a library with some algorithms that we saw in class, run some experiments with these algorithms, and present this work during the poster session at the end of the semester.

For a list of candidate papers or book chapters to be presented at the end of the course, please have a look below. The candidate material below is split into one- and multi-person projects.

I am happy to discuss final projects on other subjects that are related to the course. For example, you may want to use a data set that you acquired online for training a classifier and then predicting its generalization ability by performing cross-validation and ultimately compare how close this empirical estimate is with the prediction that you have from PAC learning (for an appropriate tradeoff between the accuracy parameter ε and the confidence parameter δ). I am also open to other suggestions; for example one can implement Shannon's mind reading machine. However, in any case, you need to talk to me well in advance and make sure that I agree with the path that you want to follow for what you plan to deliver for your final project.

#### Candidate Material

Clicking on the title of a paper takes you to the paper. It might be easier if you first download the pdf file locally to your computer.

### Computational Learning Theory Resources

#### Books

The two main books that we plan to use for the course are available for free in electronic format in the following links:

Another book that I like a lot and recommend to people who are starting with machine learning is the following one:

### Class Log

A log for the class will be held online here.

#### Class 1 (Aug 24, 2020)

Discussion on syllabus and policies.

Mentioned the existence of the following handouts:

Discussion of coupon collector's problem. We derived the expected running time and gave an upper bound on the running time (with high probability) using Markov's inequality.

There is still discussion that we should do here next time.

#### Class 2 (Aug 26, 2020)

Continued our discussion on coupon collection.

Started discussion on terminology that we will be using. Agnostic vs non-agnostic learning.

Concept learning and concept class.

#### Class 3 (Aug 28, 2020)

Assigned today: Homework 1.

Continued our discussion on terminology. Hypothesis space, hypothesis.

Proper learning vs `improper'/representation-independent learning. Learning in the realizable case.

Defined the different types of queries that one can ask for query learning.

#### Class 4 (Aug 31, 2020)

Clarified some points on the representation of functions.

Learning monotone conjunctions using membership and equivalence queries. Learning general conjunctions using equivalence queries. Showed negative result (exponential bound in the worst case) when using membership queries.

Lemma 1 from Angluin's paper – a sunflower lemma.

#### Class 5 (Sep 2, 2020)

The double sunflower. Using the same concept class we can achieve an exponential lower bound for any algorithm that performs exact identification using any kind of queries.

Assigned Reading: From the paper Queries and Concept Learning, Sections 1 to 2.4, as well as Section 6 are mandatory; rest is optional.

The more-general-than-or-equal-to relation among functions. Lattice with partial ordering of hypotheses in the hypothesis space. Correspondence between hypotheses in the hypothesis space and sets of examples in the instance space.

#### Class 6 (Sep 4, 2020)

Consistency and version spaces.

Algorithms Find-S and List-Then-Eliminate.

Most specific and most general boundaries.

Version space representation theorem. (Theorem 2.1 in Mitchell's book.)

#### Sep 7, 2020

No class. Labor day.

#### Class 7 (Sep 9, 2020)

Candidate Elimination Algorithm. Execution on our toy example.

Discussion on the convergence of Candidate Elimination Algorithm (no noise and realizability assumption).

What queries the learners should request next and the Halving algorithm.

#### Class 8 (Sep 11, 2020)

Due today: Homework 1.

Assigned today: Homework 2.

Using partially learned concepts to make predictions.

Inductive bias and the futility of bias-free learners.

Assigned Reading: Chapter 2, from Tom Mitchell's book.

Introduction to decision trees.

Entropy, information gain and description of ID3 algorithm for decision tree learning.

#### Class 9 (Sep 14, 2020)

Partial execution of the ID3 algorithm in order to identify the root node.

Characteristics of the search performed by ID3. Contrasting this search (and bias) of ID3 with the search performed by the Candidate-Elimination algorithm. Occam's razor.

Definition of overfitting. Approaches to deal with overfitting.

Partitioning original examples to: training set, validation set, test set.

#### Class 10 (Sep 16, 2020)

Handling overfitting: Reduced-Error Pruning and Rule Post-Pruning.

Handling continuous-valued attributes. One needs only to examine boundary points for ID3.

Discussion on missing attributes or attributes that have different costs for testing.

Assigned Reading: Chapter 3 from Tom Mitchell's book.

Perceptron and geometry of the solution (hypothesis).

The perceptron update rule for learning from linearly separable data.

Assigned Reading: Sections 4.4 – 4.4.2 from Tom Mitchell's book. Feel free to also read the beginning of Chapter 4 in Tom Mitchell's book.

#### Class 11 (Sep 18, 2020)

Discussion for a few minutes on the perceptron.

No free-lunch theorems.

Definition of PAC learning and efficient PAC learning.

Assigned Reading: Started Chapter 2 from the book Foundations of Machine Learning. Equivalently, from the book Understanding Machine Learning, we started discussing the content of Chapters 2 and 3.

#### Class 12 (Sep 21, 2020)

Due today: Homework 2.

Definition of agnostic PAC learning and efficient agnostic PAC learning.

Efficient PAC learnability of axis-aligned rectangles in the plane (using Find-S).

Assigned Reading: This is Example 2.4 from the book Foundations of Machine Learning. Equivalently, this is the first example in Chapter 1 in the Kearns-Vazirani book.

#### Sep 22, 2020

Assigned today: Homework 3.

#### Class 13 (Sep 23, 2020)

Efficient PAC learnability of conjunctions (using Find-S).

Assigned Reading: This is Section 1.3 in the Kearns-Vazirani book.

#### Class 14 (Sep 25, 2020)

True risk and empirical risk.

Consistent learners and generalization. Theorem for sample complexity of PAC learning using finite hypothesis spaces (in the consistency model).

Assigned Reading: Sections 7.1, 7.2, and 7.3 (up until 7.3.1 but not 7.3.1 yet) from Tom Mitchell's book.

Optional Reading: Occam's Razor, by Anselm Blumer, Andrej Ehrenfeucht, David Haussler, and Manfred K. Warmuth.

#### Class 15 (Sep 28, 2020)

Some clarifications on the perceptron, so that we can have decision boundaries that do not go through the origin of axes.

Discussed some implications of the theorem that we proved last time; e.g., we improved the sample complexity of PAC learning conjunctions without even arguing about the algorithm that can be used for learning. We also saw a specific example of plugging-in certain values for ε and δ.

Went through the Computational Learning Theory Overview slides and briefly discussed how we plan to extend our theorem for finite hypothesis spaces to situations where a hypothesis space is infinite.

We then starting going through the slides on Elements of Computational Learning Theory that summarize with more details some of the things that we have seen so far (no free-lunch theorems, functions that are used as toy examples in computational learning theory, sample size for finite hypothesis spaces) and started our discussion on k-term DNFs where we want to prove the intractability of efficiently (properly) PAC learning k-term DNFs. We determined the sample size that follows immediately from our theorem for finite concept classes and next time we will discuss the reduction.

#### Class 16 (Sep 30, 2020)

Intractability of learning 3-term DNF formulae.

Assigned Reading: This is discussed in Section 1.4 in the Kearns-Vazirani Book. Have a look in the notes that we have in the Elements of Computational Learning Theory slides between pages 24/48 and 29/48 (numbering shown on bottom right part of each slide).

#### Oct 1, 2020

Due today: Homework 3.

#### Class 17 (Oct 2, 2020)

Went through the solutions for homework assignments 1 and 2.

#### Class 18 (Oct 5, 2020)

Midterm.

Midterm was pushed back to Friday, October 9.

Efficiently PAC learning k-term DNF formulae using k-CNF formulae. This was a reduction that allowed us to solve a problem using an algorithm (Find-S) for solving another problem (simple conjunctions). So, the representation of the hypotheses is very important in PAC learning, and whereas it is hard to PAC learn k-term DNF formulae properly, nevertheless, we can efficiently PAC learn k-term DNF formulae in the realizable case using k-CNF formulae.

Set covering and finding an O(log(m)) optimal approximation with a greedy algorithm.

Application of the algorithm to PAC learning conjunctions with few relevant literals by covering the negative examples (that Find-S would otherwise ignore) with the literals obtained from the solution of Find-S (that initially only takes into account the positive examples).

Assigned Reading: For the discussion on learning k-term DNF formulae using k-CNF formulae you can have a look at any one of the following sources:

However, the discussion on the approach with Set Cover and the learnability of conjunctions with few relevant variables is only available in the Kearns-Vazirani book (and in our notes).

#### Class 19 (Oct 7, 2020)

Optional Reading: A Few Useful Things to Know About Machine Learning, by Pedro Domingos.

Went over the solutions for homework 3.

Introduction to the VC-dimension.

Midterm.

#### Class 21 (Oct 12, 2020)

Assigned today: Homework 4.

Announce that a list of candidate papers for final projects are now proposed online.

Mention Occam's Razor paper by Anselm Blumer, Andrej Ehrenfeucht, David Haussler, and Manfred K. Warmuth.

Went over the solutions for the midterm problems.

#### Class 22 (Oct 14, 2020)

Continued our discussion on the VC-dimension. What it means to prove a lower bound and what it means to prove an upper bound.

The growth function $\Pi_{\mathcal{C}}(m)$.

Discussed the VC-dimension of thresholds on a line. Difference compared to a perceptron in 1 dimension.

Discussed the VC-dimension of axis-aligned rectangles in 2 dimensions as well as in d dimensions.

#### Class 23 (Oct 16, 2020)

Discussion on the basic idea for active learning. Example with learning a threshold on a line by using binary search and thus achieving an exponential improvement on the number of labeled examples that can be used in order to accomplish the PAC criterion.

Upper bound on the VC dimension $d$ of a finite hypothesis space; that is, $d \le \lg(|\mathcal{H}|)$.

Application of the above bound so that we can compute tight bounds on the VC dimension of monotone and general conjunctions.

#### Class 24 (Oct 19, 2020)

Using the VC dimension in order to obtain a lower bound on the number of examples that are necessary for PAC learning a concept class in a distribution-free sense.

There is still some details that we need to discuss next time and actually conclude the proof.

A similar bound is given in the Kearns-Vazirani book, in Section 3.6.

#### Class 25 (Oct 21, 2020)

Due today: Homework 4.

Concluded the proof on the lower bound on the number of examples needed by any distibution-free PAC learning algorithm. Namely, we proved that any hypothesis space that has VC dimension d, implies a lower bound of $$m > \frac{d-1}{64\varepsilon}$$ training examples that will be needed by any distribution-free PAC learning algorithm.

Discussion on the VC dimension and what the different bounds mean, what our goals are, and further discussion between distribution-free and distribution-specific learning.

Proved that $\left(1 + \frac{1}{n}\right)^n \le e \le \left(1+\frac{1}{n}\right)^{n+1}$.

Definition of the $\Phi_d(m)$ function.

#### Class 26 (Oct 23, 2020)

Proved properties of the $\Phi_d(m)$ function.

• $\Phi_d(m) = \sum_{i=0}^d \binom{m}{i}$.
• $\Phi_d(m) \le \left(\frac{em}{d}\right)^d$.

Assigned Reading: From the book Foundations of Machine Learning, see Corollary 3.18.

Optional Reading: Lemma 3.2 (Section 3.4) and closing remarks in Section 3.4 of the Kearns-Vazirani book.

We stated Sauer's lemma (i.e., that $\Pi_{\mathcal{H}}(m) \le \Phi_d(m)$), but we will prove it next time.

Instead of the proof of Sauer's lemma, we discussed what adversarial machine learning is about. Hence, we stated high-level ideas about poisoning and evasion attacks. Evasion attacks are also known as adversarial examples. These basic ideas are available in the slides 7/50 and 8/50 that are available here.

#### Class 27 (Oct 26, 2020)

Proof of Sauer's lemma.

Assigned Reading: From the book Foundations of Machine Learning, see Theorem 3.17

Optional Reading: Sauer's lemma is proved in Lemma 3.1 (Section 3.4) of the Kearns-Vazirani book.

We wrote down the statement (for the upper bound on the number of examples) of the Fundamental Theorem of Computational Learning Theory.

#### Class 28 (Oct 28, 2020)

No class due to inclement weather.

No class today.

#### Class 30 (Nov 2, 2020)

Assigned today: Homework 5.

We started the proof of the upper bound of the Fundamental Theorem of Computational Learning Theory (upper bound on the number of examples that are enough for PAC learning in the consistency model, using a hypothesis class with finite VC dimension).

We have almost proved that, for a hypothesis $h$ (drawn from a hypothesis-space $\mathcal{H}$ that has finite VC dimension) that is consistent on a training sample $S$, it holds $$\text{Pr}\left[\ \text{Risk}_{\mathcal{D}}(h, c) \le \frac{2}{m}\cdot\left(\lg\left(\Pi_{\mathcal{H}}(2m)\right) + \lg\left(\frac{2}{\delta}\right)\right) \right] \ge 1 - \delta\,.$$ (This is proved in Section 3.5 from the Kearns-Vazirani book.)

Assigned Reading: Section 28.3 from the book Understanding Machine Learning, which is the upper bound for the realizable case.

Optional Reading: Section 3.5 from the Kearns-Vazirani book.

We will finish the proof next time.

Optional Reading: Learnability and the Vapnik-Chervonenkis Dimension, by Anselm Blumer, Andrej Ehrenfeucht, David Haussler, and Manfred K. Warmuth

Optional Reading: A survey of some aspects of computational learning theory, by György Turán.
The paper gives an overview of several important results that are related to computational learning theory; some of which we have already touched, such as the VC dimension, and others that we have not yet touched, such as mistake bounds in the mistake bound model of learning. You can read this paper in pieces as we are making progress with the class, but now is a good time to start digesting the notions that we have covered already in class.

#### Class 31 (Nov 4, 2020)

Concluded the proof that we had almost finished last time.

Discussion on different noise models.

• Bob Sloan has a nice description for several cases of noise models in Four types of noise in data for PAC learning.
• Apart from the models mentioned in Bob Sloan's paper we also mentioned the nasty noise model as a bounded-budget version of the malicious noise model. We will also make a related discussion here when we will be discussing more about poisoning attacks (p-tampering being the analogue of malicious noise and strong "clean label" p-budget attacks being the analogue of nasty noise).

Showed that the malicious noise that can be tolerated by PAC learning algorithms is less than $\varepsilon$, regardless of the concept class being learnt, where $\varepsilon$ is the usual parameter for bounding the risk in PAC learning. The result that we showed was a simple variant of the result of Kearns and Li from their paper Learning in the Presence of Malicious Errors using the method of induced distributions; in particular Theorem 1 from that paper.

Assigned Reading: Four types of noise in data for PAC learning, by Robert Sloan. Up to and including Section 2.2 is mandatory reading; the rest is optional.

Assigned Reading: Notes on what we discussed in class.

Optional Reading: Learning in the Presence of Malicious Errors, by Michael Kearns and Ming Li.

Optional Reading: Learning Disjunctions of Conjunctions, by Leslie Valiant. (The paper that introduced malicious noise.)

#### Class 32 (Nov 6, 2020)

Discussion on random (mis-)classification noise.

We proved that $$m \ge \frac{2}{\varepsilon^2(1-2\eta)^2}\cdot\ln\left(2|\mathcal{H}|/\delta\right)$$ training examples are enough for PAC learning a concept class $\mathcal{C}$ using a hypothesis space $\mathcal{H}$, in the realizable case, under random (mis-)classification noise rate $\eta$.

Of course it is a question of how we can actually obtain $\eta$ to begin with, as well as solve the computational problem that is hidden. We will discuss these in sequence.

Assigned Reading: Learning From Noisy Examples, by Dana Angluin and Philip Laird. For now please read up to (and including) Section 2.2.

#### Class 33 (Nov 9, 2020)

We discussed the algorithm for computing an upper bound $\eta_b < 1/2$ of the true noise rate $\eta$. There are a few things that we need to discuss next time.

Assigned Reading: Section 2.3 from the paper Learning From Noisy Examples, by Dana Angluin and Philip Laird.

#### Class 34 (Nov 11, 2020)

Concluded the discussion on the sample size needed for determining an upper bound $\eta_b$ of the true misclassification noise rate $\eta$.

Minimizing disagreements is $NP$-hard for monotone conjunctions.

#### Class 35 (Nov 13, 2020)

Due today: Homework 5.

Completed the proof on the hardness result.

Assigned Reading: Section 2.4 from the paper Learning From Noisy Examples, by Dana Angluin and Philip Laird.

A new algorithm for PAC learning conjunctions in the presence of misclassification noise.

Optional Reading: Section 3 from the paper Learning From Noisy Examples, by Dana Angluin and Philip Laird. The new algorithm for conjunctions is a special case of the algorithm presented in Section 3 of this paper.

Poisoning attacks. We used the slides that are available here.

Optional Reading: The paper Learning under p-tampering poisoning attacks, by Saeed Mahloujifar, Dimitrios Diochnos and Mohammad Mahmoody.

#### Class 36 (Nov 16, 2020)

Assigned today: Homework 6.

Evasion attacks / adversarial examples. We used the slides that are available here.

Optional Reading: The paper Adversarial Risk and Robustness: General Definitions and Implications for the Uniform Distribution, by Dimitrios Diochnos, Saeed Mahloujifar and Mohammad Mahmoody.

#### Class 37 (Nov 18, 2020)

Introduction to statistical queries. An SQ algorithm for learning monotone conjunctions.

#### Class 38 (Nov 20, 2020)

Advantages of learning algorithms that are based on Statistical Queries.

We derived a formula for computing $P_{\chi}$ which is the probability that the statistical query $\chi$ has to be satisfied (and will subsequently be returned within $\tau$ of its true value by the STAT oracle).

$P_{\chi} = Pr_{EX(c, \mathcal{D})}[\chi = 1] = \frac{(1-\eta)Pr_{EX_{CN}^\eta(c,\mathcal{D})}[\chi = 1] - \eta Pr_{EX_{CN}^\eta(\overline{c}, \mathcal{D})}[\chi=1]}{1-2\eta}\,.$

The good thing with this last formula is that we can compute the noise-free estimate that we want based on noisy estimates.

Assigned Reading: The lecture notes by Javed Aslam.

Assigned Reading: Sections 1 – 3 from the paper Improved Noise-Tolerant Learning and Generalized Statistical Queries, by Javed A. Aslam and Scott E. Decatur. The rest of the paper is optional.

Optional Reading: The Statistical Query model was introduced in the paper Efficient Noise-Tolerant Learning from Statistical Queries, by Michael Kearns. However, reading this paper is optional.

Though we are far from exhausting the subject on SQ learning, by now we have see several ideas around PAC learning, noise, and learning with Statistical Queries. The slides from A Tutorial on Statistical Queries, by Lev Reyzin can prove to be useful as a backup and provide additional pointers for further reading. The slides also have a discussion on other topics connecting to SQ learning such as evolvability, differential privacy and adaptive data analysis.

Note that some of the results mentioned in the slides are from papers that are suggested for presentation.

#### Class 39 (Nov 23, 2020)

Introduction to online learning and learning with expert advice.

Upper bound on the number of mistakes for the perceptron learning algorithm.

#### Nov 25, 2020

Thanksgiving vacation; no classes.

#### Nov 27, 2020

Thanksgiving vacation; no classes.

#### Class 40 (Nov 30, 2020)

Halving algorithm and an upper bound on the number of mistakes. Randomized halving algorithms and expected number of mistakes.

VC-dimension lower bound on the number of mistakes of any online learning algorithm.

Weighted Majority Algorithm.

Assigned Reading: Section 7.5 from Tom Mitchell's book.

Assigned Reading: Sections 1 and 2 from Avrim Blum's paper On-Line Algorithms in Machine Learning.

Assigned Reading: Theorem 2.1 from Wofgang Maass's paper On-Line Learning with an Oblivious Environment and the Power of Randomization. The rest of the paper is optional.

#### Tue, Dec 1

Due today: Homework 6.

#### Class 41 (Dec 2, 2020)

Randomized weighted majority algorithm.

Introduction to Winnow.

Assigned Reading: Section 3.2 from Avrim Blum's paper On-Line Algorithms in Machine Learning.

#### Class 42 (Dec 4, 2020)

Concluded our discussion on Winnow.

Assigned Reading: Section 3.2 from Avrim Blum's paper On-Line Algorithms in Machine Learning.

Introduction to evolvability.

#### Class 43 (Dec 7, 2020)

Evolvability and the Swapping Algorithm.

Optional Reading: Notes on Evolution and Learning, Sections 1 – 7. We used the slides Evolvability in Learning Theory in class.

Optional Reading: The paper On Evolvability: The Swapping Algorithm, Product Distributions, and Covariance, by Diochnos and Turán.

Introduction to boosting. Boosting the confidence and boosting the accuracy.

Overview of boosting the confidence procedure.

Optional Reading: Sections 4.1 and 4.2 from the Kearns-Vazirani book.

#### Class 44 (Dec 9, 2020)

Overview of the ideas for boosting the accuracy.

We plan to devote 10 – 12 minutes for evaluating the course.

Discussion on Gather Town and presentations.

Optional Reading: A modest procedure for boosting the accuracy. Sections 4.3.1 and 4.3.2 from the Kearns-Vazirani book.

The above analysis is from the paper The Strength of Weak Learnability, by Robert E. Shapire which was the first paper presenting a boosting method and thus converting a weak learner to a strong learner.

A much better method with applications in our days is AdaBoost. A short introduction to boosting with AdaBoost is available in the paper A Short Introduction to Boosting, by Yoav Freund and Robert E. Shapire.

Also, a survey paper (book chapter) that has a large overlap with the above paper, is The Boosting Approach to Machine Learning: An Overview, by Robert E. Shapire where this time boosting and logistic regression are also discussed.

All the above-mentioned papers are well worth your time; however, they are optional readings for this class given how late in the semester we discussed boosting this year..

#### Class 45 (Dec 11, 2020)

No class today. Poster presentations.

#### Tuesday, December 15, 2020 (8:00am – 10:00am)

Final exam. In-class (Devon Energy Hall 0120 — where we met throughout the semester).