# diochnos/teaching/CS5970/2021F

## CS 5970 – Computational Learning Theory (Fall 2021)

### Course Description

Topics of machine learning theory. Learning using membership queries, equivalence queries, version spaces, decision trees, perceptrons. Probably approximately correct (PAC) learning, Occam algorithms, VC-dimension, sample sizes for distribution-independent learning. Representation issues, proper learning, reductions, intractability, learning in the realizable case, agnostic learning. Noise models, statistical queries, PAC learning under noise. Adversarially robust learning against poisoning attacks and against adversarial examples. Distribution-specific learning and evolvability. Online learning and learning with expert advice in the mistake bound model. Weak and strong learning (boosting).

### Basic Information

#### Syllabus

The syllabus is available here.

#### Time and Location

Mondays, Wednesdays, and Fridays, 9:30am – 10:20am, Carson Engineering Center 0121.

#### Office Hours

I will be holding my office hours at the following times.

Mondays
10:30am – 11:30am, 244 Devon Energy Hall
Wednesdays
10:30am – 11:30am, 244 Devon Energy Hall
##### Exceptions to the Regular Schedule of Office Hours

If you want to meet me outside of my office hours, please send me an email and arrange an appointment.

Monday, August 23, 2021: I will not hold office hours on 8/23. I have to give a talk to IJCAI-2021 at the same time.

### Homework Assignments

Assignment 1: Announced on Friday, August 27. Due on Friday, September 10.

Assignment 2: Announced on Friday, September 10. Due on Monday, September 20.

Assignment 3: Announced on Friday, September 20. Due on Thursday, September 30.

Assignment 4: Announced on Monday, October 11. Due on Wednesday, October 20 Sunday, October 24.

Assignment 5: Announced Monday, October 25. Due on Monday, November 8.

Assignment 6: Announced Monday, November 9. Due Tuesday, November 30.

### Final Projects

In general I am open to discussion for various ideas regarding these projects. The route that I expect most people will take, is to either work alone or form a group with someone else, read a research paper, and present what they read to others during the poster session that takes place at the end of the semester. Another idea is for 2 people to form a group and implement a library with some algorithms that we saw in class, run some experiments with these algorithms, and present this work during the poster session at the end of the semester.

For a list of candidate papers or book chapters to be presented at theend of the course, please have a look below.

I am happy to discuss final projects on other subjects that are related to the course. For example, you may want to use a data set that you acquired online for training a classifier and then predicting its generalization ability by performing cross-validation and ultimately compare how close this empirical estimate is with the prediction that you have from PAC learning (for an appropriate tradeoff between the accuracy parameter ε and the confidence parameter δ). I am also open to other suggestions; for example one can implement Shannon's mind reading machine. However, in any case, you need to talk to me well in advance and make sure that I agree with the path that you want to follow for what you plan to deliver for your final project.

#### Candidate Material

Clicking on the title of a paper takes you to the paper. It might be easier if you first download the file locally to your computer and then click on the links.

### Computational Learning Theory Resources

#### Books

The two main books that we plan to use for the course are available for free in electronic format in the following links:

Another book that I like a lot and recommend to people who are starting with machine learning is the following one:

### Class Log

A log for the class will be held online here.

#### Class 1 (Aug 23, 2021)

Discussion on syllabus and policies.

Discussion of coupon collector's problem. We derived the expected running time of the process of collecting all $N$ coupons.

There are several things that we can (and we will) discuss here next time.

#### Class 2 (Aug 25, 2021)

Continued our discussion on coupon collection.

Mentioned the existence of the following handouts:

#### Class 3 (Aug 27, 2021)

Started discussion on terminology that we will be using. Agnostic vs non-agnostic learning.

Concept learning and concept class.

Hypothesis space, hypothesis.

Proper learning vs `improper'/representation-independent learning. Learning in the realizable case.

Assigned today: Homework 1.

#### Class 4 (Aug 30, 2021)

Further discussion on machine learning terminology and semantics.

Discussion on the representation of Boolean functions.

#### Class 5 (Sep 1, 2021)

Defined the different types of queries that one can ask for query learning.

Learning monotone conjunctions using membership and equivalence queries. Learning general conjunctions using equivalence queries. Showed negative result (exponential bound in the worst case) when using membership queries.

Lemma 1 from Angluin's paper – a sunflower lemma.

Assigned Reading: From the paper Queries and Concept Learning, Sections 1 to 2.4, as well as Section 6 are mandatory; rest is optional.

#### Class 6 (Sep 3, 2021)

The double sunflower. Using the same concept class we can achieve an exponential lower bound for any algorithm that performs exact identification using any kind of queries.

The more-general-than-or-equal-to relation among functions. Lattice with partial ordering of hypotheses in the hypothesis space. Correspondence between hypotheses in the hypothesis space and sets of examples in the instance space.

Algorithm Find-S.

#### Sep 6, 2021

No class. Labor day.

#### Class 7 (Sep 8, 2021)

Consistency and version spaces.

Algorithm List-Then-Eliminate.

Most specific and most general boundaries.

Version space representation theorem. (Theorem 2.1 in Mitchell's book.)

Candidate Elimination Algorithm. Execution on our toy example.

#### Class 8 (Sep 10, 2021)

Due today: Homework 1.

Assigned today: Homework 2.

Using partially learned concepts to make predictions.

Inductive bias and the futility of bias-free learners.

Assigned Reading: Chapter 2, from Tom Mitchell's book.

Introduction to decision trees.

Entropy, information gain and basic idea for the construction of decision trees.

#### Class 9 (Sep 13, 2021)

Partial execution of the ID3 algorithm in order to identify the root node.

Characteristics of the search performed by ID3. Contrasting this search (and bias) of ID3 with the search performed by the Candidate-Elimination algorithm. Occam's razor.

#### Class 10 (Sep 15, 2021)

Definition of overfitting. Approaches to deal with overfitting.

Partitioning original examples to: training set, validation set, test set.

Handling overfitting: Reduced-Error Pruning and Rule Post-Pruning.

Handling continuous-valued attributes. One needs only to examine boundary points for ID3.

Discussion on missing attributes or attributes that have different costs for testing.

Assigned Reading: Chapter 3 from Tom Mitchell's book.

#### Class 11 (Sep 17, 2021)

Perceptron and geometry of the solution (hypothesis).

The perceptron update rule for learning from linearly separable data.

Assigned Reading: Sections 4.4 – 4.4.2 from Tom Mitchell's book. Feel free to also read the beginning of Chapter 4 in Tom Mitchell's book.

No free-lunch theorems.

Assigned Reading: Perceptrons (beginning of Chapter 4) from Tom Mitchell's book.

#### Class 12 (Sep 20, 2021)

Due today: Homework 2.

Assigned today: Homework 3.

Definition of PAC learning and efficient PAC learning.

Definition of agnostic PAC learning and efficient agnostic PAC learning.

Started our discussion on learning axis-aligned rectangles in the plane.

Assigned Reading: Started Chapter 2 from the book Foundations of Machine Learning. Equivalently, from the book Understanding Machine Learning, we started discussing the content of Chapters 2 and 3.

#### Class 13 (Sep 22, 2021)

Completed our discussion on PAC learning axis-aligned rectangles in the plane.

Assigned Reading: This is Example 2.4 from the book Foundations of Machine Learning. Equivalently, this is the first example in Chapter 1 in the Kearns-Vazirani book.

Efficient PAC learnability of conjunctions (using Find-S).

Optional Reading: This is Section 1.3 in the Kearns-Vazirani book.

#### Class 14 (Sep 24, 2021)

Discussion on Haussler's example for blowing up the set G that the Candidate-Elimination algorithm maintains.

Discussion on version spaces, consistent learners, Occam algorithms and generalization.

Theorem for sample complexity of PAC learning using finite hypothesis spaces (in the realizable case).

Discussed some implications of the theorem that we proved last time; e.g., we improved the sample complexity of PAC learning conjunctions without even arguing about the algorithm that can be used for learning. We also saw a specific example of plugging-in certain values for ε and δ.

Assigned Reading: Sections 7.1, 7.2, and 7.3 (up until 7.3.1 but not 7.3.1 yet) from Tom Mitchell's book. This is where Mitchell is introducing PAC learning and is proving the theorem with the sample size for finite hypothesis spaces in the realizable case.

Assigned Reading: From the book Foundations of Machine Learning, Theorem 2.5 corresponds to the Theorem that we proved in class about finite hypothesis spaces in the realizable case and Example 2.6 has the discussion that we did in class for improving the sample size needed for PAC learning conjunctions using Find-S.

Optional Reading: Occam's Razor, by Anselm Blumer, Andrej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. This is the paper that proved this simple but very important theorem.

#### Class 15 (Sep 27, 2021)

Set covering and finding an O(log(m)) optimal approximation with a greedy algorithm.

Application of the algorithm to PAC learning conjunctions with few relevant literals by covering the negative examples (that Find-S would otherwise ignore) with the literals obtained from the solution of Find-S (that initially only takes into account the positive examples).

#### Class 16 (Sep 29, 2021)

Theorem for agnostic PAC learning using finite hypotheses spaces and empirical risk minimization.

Intractability of learning 3-term DNF formulae.

Assigned Reading: The rest of Section 7.3 from Tom Mitchell's book.

Optional Reading: This is discussed in Section 1.4 in the Kearns-Vazirani Book.

#### Sep 30, 2021

Due today: Homework 3.

#### Class 17 (Oct 1, 2021)

Discussion of solutions on homework assignments (all 3) so far.

Preparation for the midterm.

Midterm

#### Class 19 (Oct 6, 2021)

Solutions of the midterm.

#### Oct 8, 2021

No class will take place on this day. It has been designated as the 2021 Fall Student Holiday.

#### Class 20 (Oct 11, 2021)

Intractability of learning 3-term DNF properly.

Assigned today: Homework 4.

#### Class 21 (Oct 13, 2021)

Efficiently PAC learning k-term DNF formulae using k-CNF formulae. This was a reduction that allowed us to solve a problem using an algorithm (Find-S) for solving another problem (simple conjunctions). So, the representation of the hypotheses is very important in PAC learning, and whereas it is hard to PAC learn k-term DNF formulae properly, nevertheless, we can efficiently PAC learn k-term DNF formulae in the realizable case using k-CNF formulae.

Introduction to the VC-dimension. What it means to prove a lower bound and what it means to prove an upper bound.

The growth function $\Pi_{\mathcal{H}}(m)$.

Discussed the VC-dimension of thresholds on a line. Difference compared to a perceptron in 1 dimension.

Discussed the VC-dimension of axis-aligned rectangles in 2 dimensions as well as in d dimensions.

Assigned Reading: For the discussion on learning k-term DNF formulae using k-CNF formulae you can have a look at any one of the following sources:

• From Tom Mitchell's book, Sections 7.3.2 and 7.3.3. For the discussion on the VC dimension see Sections 7.4 – 7.4.2 (including 7.4.2).
• From the book Foundations of Machine Learning, see Example 2.6, Example 2.8, and Example 2.9. See also Sections 3.1 and 3.2 for the discussion on the VC dimension. It also has several examples, some of which we discuss in class.

Optional Reading: From the Kearns-Vazirani book, Sections 1.4 and 1.5 (learning 3-term DNF formulae) as well as Sections 3.1 – 3.3 (VC dimension).

#### Class 22 (Oct 15, 2021)

We reviewed the notion of the VC dimension and other notions that are related to that (e.g., shattering, the growth function, etc).

Upper bound on the VC dimension $d$ of a finite hypothesis space; that is, $d \le \lg(|\mathcal{H}|)$.

#### Class 23 (Oct 18, 2021)

Application of the bound on the VC dimension for finite hypotheses spaces, so that we can compute tight bounds on the VC dimension of monotone and general conjunctions.

Definition of the $\Phi_d(m)$ function.

Proved that $\left(1 + \frac{1}{n}\right)^n \le e \le \left(1+\frac{1}{n}\right)^{n+1}$.

Proved properties of the $\Phi_d(m)$ function.

• $\Phi_d(m) = \sum_{i=0}^d \binom{m}{i}$.
• $\Phi_d(m) \le \left(\frac{em}{d}\right)^d$.

Assigned Reading: From the book Foundations of Machine Learning, see Corollary 3.18.

Optional Reading: Lemma 3.2 (Section 3.4) and closing remarks in Section 3.4 of the Kearns-Vazirani book.

We stated the Sauer-Shelah lemma (i.e., that $\Pi_{\mathcal{H}}(m) \le \Phi_d(m)$), but we will prove it next time.

#### Class 24 (Oct 20, 2021)

Proof of the Sauer-Shelah lemma.

Assigned Reading: From the book Foundations of Machine Learning, see Theorem 3.17

Optional Reading: Sauer's lemma is proved in Lemma 3.1 (Section 3.4) of the Kearns-Vazirani book.

We discussed the big picture on the theorems regarding sample sizes that we have proved as well as what we plan to prove. We then started our discussion on the Fundamental Theorem of Computational Learning Theory (for the upper bound on the number of examples).

#### Class 25 (Oct 22, 2021)

Proof of the upper bound of the Fundamental Theorem of Computational Learning Theory (upper bound on the number of examples that are enough for PAC learning in the consistency model, using a hypothesis class with finite VC dimension).

We proved that, for a hypothesis $h$ (drawn from a hypothesis-space $\mathcal{H}$ that has finite VC dimension) that is consistent on a training sample $S$, it holds $$\text{Pr}\left[\ R_{\mathcal{D}}(h, c) \le \frac{2}{m}\cdot\left(\lg\left(\Pi_{\mathcal{H}}(2m)\right) + \lg\left(\frac{2}{\delta}\right)\right) \right] \ge 1 - \delta\,.$$ (This is proved in Section 3.5 from the Kearns-Vazirani book.)

Assigned Reading: Section 28.3 from the book Understanding Machine Learning, which is the upper bound for the realizable case.

Optional Reading: Section 3.5 from the Kearns-Vazirani book.

Optional Reading: A survey of some aspects of computational learning theory, by György Turán.
The paper gives an overview of several important results that are related to computational learning theory; some of which we have already touched, such as the VC dimension, and others that we have not yet touched, such as mistake bounds in the mistake bound model of learning. You can read this paper in pieces as we are making progress with the class, but now is a good time to start digesting the notions that we have covered already in class.

#### Oct 24, 2021

Due today: Homework 4.

#### Class 26 (Oct 25, 2021)

Assigned today: Homework 5.

Proof on the lower bound on the number of examples needed by any distibution-free PAC learning algorithm. Namely, we proved that any hypothesis space that has VC dimension d, implies a lower bound of $$m > \frac{d-1}{64\varepsilon}$$ training examples that will be needed by any distribution-free PAC learning algorithm.

A similar bound is given in the Kearns-Vazirani book, in Section 3.6.

Overview of the results that we have proved for (distribution-free) PAC learning.

#### Class 27 (Oct 27, 2021)

Training time attacks: Noise and poisoning attacks.

Test-time attacks: Evasion attacks (adversarial examples).

Attacks involving both phases: Backdoor and hybrid attacks.

For now we focus on noise. Discussion on different noise models.

• Bob Sloan has a nice description for several cases of noise models in Four types of noise in data for PAC learning.
• Apart from the models mentioned in Bob Sloan's paper we will mentioned the nasty noise model as a bounded-budget version of the malicious noise model. We will also make a related discussion here we will be discussing more about poisoning attacks (p-tampering being the analogue of malicious noise and strong "clean label" p-budget attacks being the analogue of nasty noise).

Assigned Reading: Four types of noise in data for PAC learning, by Robert Sloan. Up to and including Section 2.2 is mandatory reading; the rest is optional.

#### Class 28 (Oct 29, 2021)

We showed that the malicious noise that can be tolerated by PAC learning algorithms is less than $\varepsilon$, regardless of the concept class being learnt, where $\varepsilon$ is the usual parameter for bounding the risk in PAC learning. The result that we showed was a simple variant of the result of Kearns and Li from their paper Learning in the Presence of Malicious Errors using the method of induced distributions; in particular Theorem 1 from that paper.

Assigned Reading: Notes on what we discussed in class.

Optional Reading: Learning in the Presence of Malicious Errors, by Michael Kearns and Ming Li.

Optional Reading: Learning Disjunctions of Conjunctions, by Leslie Valiant. (The paper that introduced malicious noise.)

Brief discussion on some of the available papers for presentation.

#### Class 29 (Nov 1, 2021)

Presentation on financial education by Cami Sheaffer.

#### Class 30 (Nov 3, 2021)

Discussion on random classification noise.

We proved that $$m \ge \frac{2}{\varepsilon^2(1-2\eta)^2}\cdot\ln\left(2|\mathcal{H}|/\delta\right)$$ training examples are enough for PAC learning a concept class $\mathcal{C}$ using a hypothesis space $\mathcal{H}$, in the realizable case, under random classification noise rate $\eta$.

Of course it is a question of how we can actually obtain $\eta$ to begin with, as well as solve the computational problem that is hidden. We will discuss these in sequence.

Assigned Reading: Learning From Noisy Examples, by Dana Angluin and Philip Laird. For now please read up to (and including) Section 2.2.

Brief discussion on some of the candidate papers. We will discuss more next time.

#### Class 31 (Nov 5, 2021)

We will discuss an algorithm for computing an upper bound $\eta_b < 1/2$ of the true noise rate $\eta$.

Assigned Reading: Section 2.3 from the paper Learning From Noisy Examples, by Dana Angluin and Philip Laird.

#### Class 32 (Nov 8, 2021)

Due today: Homework 5.

Assigned today: Homework 6.

PAC learning under class imbalance. Situations where low risk is not the only goal; recall and precision may be very important. A bisection algorithm for computing a lower bound on the rate of the minority class. Experiments on monotone conjunctions and performance guarantees.

#### Class 33 (Nov 10, 2021)

Minimizing disagreements is $NP$-hard for monotone conjunctions.

Introduction to statistical queries. An SQ algorithm for learning monotone conjunctions.

We need to continue our discussion here.

#### Class 34 (Nov 12, 2021)

Advantages of learning algorithms that are based on statistical queries.

We will derive a formula for computing $P_{\chi}$ which is the probability that the statistical query $\chi$ has to be satisfied (and will subsequently be returned within $\tau$ of its true value by the STAT oracle).

$P_{\chi} = Pr_{EX(c, \mathcal{D})}[\chi = 1] = \frac{(1-\eta)Pr_{EX_{CN}^\eta(c,\mathcal{D})}[\chi = 1] - \eta Pr_{EX_{CN}^\eta(\overline{c}, \mathcal{D})}[\chi=1]}{1-2\eta}\,.$

The good thing with this last formula is that we can compute the noise-free estimate that we want based on noisy estimates.

Assigned Reading: The lecture notes by Javed Aslam.

Assigned Reading: Sections 1 – 3 from the paper Improved Noise-Tolerant Learning and Generalized Statistical Queries, by Javed A. Aslam and Scott E. Decatur. The rest of the paper is optional.

Optional Reading: The Statistical Query model was introduced in the paper Efficient Noise-Tolerant Learning from Statistical Queries, by Michael Kearns. However, reading this paper is optional.

Though we are far from exhausting the subject on SQ learning, by now we have see several ideas around PAC learning, noise, and learning with statistical queries. The slides from A Tutorial on Statistical Queries, by Lev Reyzin can prove to be useful as a backup and provide additional pointers for further reading. The slides also have a discussion on other topics connecting to SQ learning such as evolvability, differential privacy and adaptive data analysis.

#### Class 35 (Nov 15, 2021)

Poisoning attacks. We used the slides that are available

Optional Reading: The paper Learning under p-tampering poisoning attacks, by Saeed Mahloujifar, Dimitrios Diochnos and Mohammad Mahmoody.

#### Class 36 (Nov 17, 2021)

Evasion attacks / adversarial examples. We used the slides that are available here.

Optional Reading: The paper Adversarial Risk and Robustness: General Definitions and Implications for the Uniform Distribution, by Dimitrios Diochnos, Saeed Mahloujifar and Mohammad Mahmoody.

#### Class 37 (Nov 19, 2021)

Concluded our discussion on adversarial machine learning.

#### Class 38 (Nov 22, 2021)

Distribution-specific learning, evolvability, and the swapping algorithm.

#### Nov 24, 2021

Thanksgiving vacation; no classes.

#### Nov 26, 2021

Thanksgiving vacation; no classes.

#### Class 39 (Nov 29, 2021)

Introduction to online learning and learning with expert advice.

Upper bound on the number of mistakes for the perceptron learning algorithm.

#### Nov 30, 2021

Due today: Homework 6.

#### Class 40 (Dec 1, 2021)

Halving algorithm and an upper bound on the number of mistakes. Randomized halving algorithms and expected number of mistakes.

VC-dimension lower bound on the number of mistakes of any online learning algorithm.

Weighted Majority Algorithm.

Assigned Reading: Section 7.5 from Tom Mitchell's book.

Assigned Reading: Sections 1 and 2 from Avrim Blum's paper On-Line Algorithms in Machine Learning.

Assigned Reading: Theorem 2.1 from Wofgang Maass's paper On-Line Learning with an Oblivious Environment and the Power of Randomization. The rest of the paper is optional.

#### Class 41 (Dec 3, 2021)

Randomized weighted majority algorithm.

Winnow.

Assigned Reading: Section 3.2 from Avrim Blum's paper On-Line Algorithms in Machine Learning.

#### Class 42 (Dec 6, 2021)

Dr Maiti will discuss modern topics in computer security.

Boosting.

#### Class 44 (Dec 10, 2021)

No class will be held. Instead we will have poster presentations.

#### Tuesday, December 14, 2021 (8:00am – 10:00am)

Final exam. In-class (Carson Engineering Center 0121 — where we met throughout the semester).