
CS269 (Spring 2021): Foundations of Deep Learning

Overview

Deep learning has achieved great success in many applications, such as image processing, speech recognition, and the game of Go. However, the reason why deep learning is so powerful remains elusive. The goal of this course is to understand the successes of deep learning by studying and building its theoretical foundations. Topics covered by this course include, but are not limited to: the approximation power of neural networks, optimization for deep learning, generalization error analysis of deep learning, and benign overfitting of overparameterized learning models. The instructor will give lectures on the selected topics. Students will present and discuss papers on the reading list, and complete a course project.

Prerequisites

CS 260A, STAT 200A and 200B, ECE 236B and 236C, or equivalent courses.

Logistics

There is no required textbook. The following are recommended textbooks:

  1. [T] Matus Telgarsky. Deep learning theory lecture notes, 2020.
  2. [A] Sanjeev Arora et al. Theory of Deep Learning book draft, 2020. (Thanks to Prof. Sanjeev Arora for sharing the latest version of the book draft!)
  3. [SSBD] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
  4. [MRT] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2012.
  5. [GBCB] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
  6. [ZLLS] Aston Zhang, Zachary C. Lipton, Mu Li, and Alex J. Smola. Dive into Deep Learning, 2018.

Other References

  1. [SHNGS] Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S., & Srebro, N. (2018). The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1), 2822-2878.
  2. [GLSS] Gunasekar, S., Lee, J., Soudry, D., & Srebro, N. (2018, July). Characterizing implicit bias in terms of optimization geometry. In International Conference on Machine Learning (pp. 1832-1841). PMLR.
  3. [NLGSSS] Nacson, M. S., Lee, J., Gunasekar, S., Savarese, P. H. P., Srebro, N., & Soudry, D. (2019, April). Convergence of gradient descent on separable data. In 22nd International Conference on Artificial Intelligence and Statistics (pp. 3420-3428). PMLR.
  4. [DZPS] Du, S. S., Zhai, X., Poczos, B., & Singh, A. (2019). Gradient descent provably optimizes over-parameterized neural networks. ICLR.
  5. [MMN] Mei, S., Montanari, A., & Nguyen, P. M. (2018). A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences.
  6. [CB] Chizat, L., & Bach, F. (2018). On the global convergence of gradient descent for over-parameterized models using optimal transport. In NeurIPS.
  7. [FDZ] Fang, C., Dong, H., & Zhang, T. (2019). Over parameterized two-level neural networks can learn near optimal feature representations. arXiv preprint arXiv:1910.11508.
  8. [BLLT] Bartlett, P. L., Long, P. M., Lugosi, G., & Tsigler, A. (2020). Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48), 30063-30070.
  9. [ZWBGK] Zou, D., Wu, J., Braverman, V., Gu, Q., & Kakade, S. M. (2021). Benign Overfitting of Constant-Stepsize SGD for Linear Regression. In COLT.

Grading Policy

Grades will be computed based on the following factors: lecture note scribing, homework, paper presentation, and the course project.

Schedule

| # | Date | Topic | Note | Scribed Note | Reading Materials | Homework |
|---|------|-------|------|--------------|-------------------|----------|
| 1 | 3/29 | Introduction | note | scribe note | CH0-1 of [T] | |
| 2 | 3/31 | Approximation I | note | scribe note | CH2-3 of [T] | |
| 3 | 4/5 | Approximation II | note | scribe note | CH3-4 of [T] | |
| 4 | 4/7 | Approximation III | note | scribe note | CH4-5 of [T] | HW1 out |
| 5 | 4/12 | Implicit Bias of Gradient Descent I | note | scribe note | CH9 of [A], [SHNGS] | |
| 6 | 4/14 | Implicit Bias of Gradient Descent II | note | scribe note | CH9 of [A], [GLSS], [NLGSSS] | |
| 7 | 4/19 | Clarke Subdifferential and Positive Homogeneity | note | scribe note | CH14 of [T] | HW1 due |
| 8 | 4/21 | Implicit Bias of Gradient Descent III | note | | CH15 of [T] | HW2 out |
| 9 | 4/26 | NTK Analysis of NNs I | note | scribe note | CH10 of [A], [DZPS] | |
| 10 | 4/28 | NTK Analysis of NNs II | note | | CH10 of [A], [DZPS] | |
| 11 | 5/3 | Lazy Training | note | | CH13 of [T] | HW2 due |
| 12 | 5/5 | Mean Field Analysis of NNs I | note | | [FDZ], [MMN] | |
| 13 | 5/10 | Mean Field Analysis of NNs II | note | | [FDZ], [MMN] | HW3 out |
| 14 | 5/12 | Mean Field Analysis of NNs III | note | | [FDZ], [MMN] | |
| 15 | 5/17 | Generalization Bounds of DNNs I | note | | CH19 of [T] | |
| 16 | 5/19 | Generalization Bounds of DNNs II | note | | CH21 of [T] | HW3 due, HW4 out |
| 17 | 5/24 | Generalization Bounds of DNNs III | note | | CH21 of [T] | |
| 18 | 5/26 | Paper Presentation | | | | |
| | 5/31 | Memorial Day Holiday | | | | HW4 due, HW5 out |
| 19 | 6/2 | Generalization Bounds of DNNs IV | note | | CH21 of [T] | |
| | 6/11 | | | | | HW5 due |

Academic Integrity Policy

Students are encouraged to read the UCLA Student Conduct Code for Academic Integrity.

Lecture Note Scribing

Each student is required to scribe the notes for one lecture. A LaTeX template for the lecture notes will be provided. The scribed lecture note should be submitted on CCLE as a zip file that compiles without errors, and is due 4 days after the lecture. This note will be graded. For example, if 2 students are assigned to scribe a given lecture, I expect to receive 2 separate notes. The individual notes are primarily for grading purposes (and also to make sure that each student scribes their own lecture notes), while the final version of the lecture note will be posted on the course website after being proofread and edited by the instructor.

Homework

There will be about 5 homework assignments. The lowest homework score will be dropped. Homework must be written in LaTeX; a LaTeX homework template will be provided. Unless otherwise indicated, you may talk to other students about the homework problems, but each student must hand in their own answers. You must also indicate on each homework with whom you collaborated and cite any other references and sources you use, including Internet sites. Homework is worth full credit before the due time and zero credit after the due time.

Paper Presentation

After each lecture, there will be a few recommended readings. Each student is required to select one paper from the list and prepare a 20-minute presentation for the class. Each paper can be presented by only one student. Students are expected to prepare the slides by themselves, but the original authors' slides may be used with proper citation.

Paper presentations will start in week 5.

Both the instructor and the other students will grade each presentation (no self-grading). Detailed grading criteria will be provided later.

Project

Students are required to do a project in this class. The goal of the course project is to give students an opportunity to explore research directions in optimization or machine learning; therefore, the project should be related to the course content. Expected projects include, but are not limited to, the following:

The best outcome of the project is a manuscript that is publishable in a major machine learning conference (COLT, ICML, NeurIPS, ICLR, AISTATS, UAI, etc.) or journal (e.g., the Journal of Machine Learning Research). The detailed course project guideline can be found here. Students cannot use their own published work as the course project.

Relevant Courses

There are many other great deep learning theory and statistical learning theory courses. To mention a few:

Matus Telgarsky’s deep learning theory course

Sanjeev Arora’s theoretical deep learning course

Peter Bartlett’s statistical learning theory course

Sham Kakade’s statistical learning theory course

Maxim Raginsky’s statistical learning theory course