Deep Learning

Ian Goodfellow, Yoshua Bengio and Aaron Courville
Contents

Website
Acknowledgments
Notation

1 Introduction
   1.1 Who Should Read This Book?
   1.2 Historical Trends in Deep Learning

I Applied Math and Machine Learning Basics

2 Linear Algebra
   2.1 Scalars, Vectors, Matrices and Tensors
   2.2 Multiplying Matrices and Vectors
   2.3 Identity and Inverse Matrices
   2.4 Linear Dependence and Span
   2.5 Norms
   2.6 Special Kinds of Matrices and Vectors
   2.7 Eigendecomposition
   2.8 Singular Value Decomposition
   2.9 The Moore-Penrose Pseudoinverse
   2.10 The Trace Operator
   2.11 The Determinant
   2.12 Example: Principal Components Analysis

3 Probability and Information Theory
   3.1 Why Probability?
   3.2 Random Variables
   3.3 Probability Distributions
   3.4 Marginal Probability
   3.5 Conditional Probability
   3.6 The Chain Rule of Conditional Probabilities
   3.7 Independence and Conditional Independence
   3.8 Expectation, Variance and Covariance
   3.9 Common Probability Distributions
   3.10 Useful Properties of Common Functions
   3.11 Bayes’ Rule
   3.12 Technical Details of Continuous Variables
   3.13 Information Theory
   3.14 Structured Probabilistic Models

4 Numerical Computation
   4.1 Overflow and Underflow
   4.2 Poor Conditioning
   4.3 Gradient-Based Optimization
   4.4 Constrained Optimization
   4.5 Example: Linear Least Squares

5 Machine Learning Basics
   5.1 Learning Algorithms
   5.2 Capacity, Overfitting and Underfitting
   5.3 Hyperparameters and Validation Sets
   5.4 Estimators, Bias and Variance
   5.5 Maximum Likelihood Estimation
   5.6 Bayesian Statistics
   5.7 Supervised Learning Algorithms
   5.8 Unsupervised Learning Algorithms
   5.9 Stochastic Gradient Descent
   5.10 Building a Machine Learning Algorithm
   5.11 Challenges Motivating Deep Learning

II Deep Networks: Modern Practices

6 Deep Feedforward Networks
   6.1 Example: Learning XOR
   6.2 Gradient-Based Learning
   6.3 Hidden Units
   6.4 Architecture Design
   6.5 Back-Propagation and Other Differentiation Algorithms
   6.6 Historical Notes

7 Regularization for Deep Learning
   7.1 Parameter Norm Penalties
   7.2 Norm Penalties as Constrained Optimization
   7.3 Regularization and Under-Constrained Problems
   7.4 Dataset Augmentation
   7.5 Noise Robustness
   7.6 Semi-Supervised Learning
   7.7 Multi-Task Learning
   7.8 Early Stopping
   7.9 Parameter Tying and Parameter Sharing
   7.10 Sparse Representations
   7.11 Bagging and Other Ensemble Methods
   7.12 Dropout
   7.13 Adversarial Training
   7.14 Tangent Distance, Tangent Prop, and Manifold Tangent Classifier

8 Optimization for Training Deep Models
   8.1 How Learning Differs from Pure Optimization
   8.2 Challenges in Neural Network Optimization
   8.3 Basic Algorithms
   8.4 Parameter Initialization Strategies
   8.5 Algorithms with Adaptive Learning Rates
   8.6 Approximate Second-Order Methods
   8.7 Optimization Strategies and Meta-Algorithms

9 Convolutional Networks
   9.1 The Convolution Operation
   9.2 Motivation
   9.3 Pooling
   9.4 Convolution and Pooling as an Infinitely Strong Prior
   9.5 Variants of the Basic Convolution Function
   9.6 Structured Outputs
   9.7 Data Types
   9.8 Efficient Convolution Algorithms
   9.9 Random or Unsupervised Features
   9.10 The Neuroscientific Basis for Convolutional Networks
   9.11 Convolutional Networks and the History of Deep Learning

10 Sequence Modeling: Recurrent and Recursive Nets
   10.1 Unfolding Computational Graphs
   10.2 Recurrent Neural Networks
   10.3 Bidirectional RNNs
   10.4 Encoder-Decoder Sequence-to-Sequence Architectures
   10.5 Deep Recurrent Networks
   10.6 Recursive Neural Networks
   10.7 The Challenge of Long-Term Dependencies
   10.8 Echo State Networks
   10.9 Leaky Units and Other Strategies for Multiple Time Scales
   10.10 The Long Short-Term Memory and Other Gated RNNs
   10.11 Optimization for Long-Term Dependencies
   10.12 Explicit Memory

11 Practical Methodology
   11.1 Performance Metrics
   11.2 Default Baseline Models
   11.3 Determining Whether to Gather More Data
   11.4 Selecting Hyperparameters
   11.5 Debugging Strategies
   11.6 Example: Multi-Digit Number Recognition

12 Applications
   12.1 Large Scale Deep Learning
   12.2 Computer Vision
   12.3 Speech Recognition
   12.4 Natural Language Processing
   12.5 Other Applications

III Deep Learning Research

13 Linear Factor Models
   13.1 Probabilistic PCA and Factor Analysis
   13.2 Independent Component Analysis (ICA)
   13.3 Slow Feature Analysis
   13.4 Sparse Coding
   13.5 Manifold Interpretation of PCA

14 Autoencoders
   14.1 Undercomplete Autoencoders
   14.2 Regularized Autoencoders
   14.3 Representational Power, Layer Size and Depth
   14.4 Stochastic Encoders and Decoders
   14.5 Denoising Autoencoders
   14.6 Learning Manifolds with Autoencoders
   14.7 Contractive Autoencoders
   14.8 Predictive Sparse Decomposition
   14.9 Applications of Autoencoders

15 Representation Learning
   15.1 Greedy Layer-Wise Unsupervised Pretraining
   15.2 Transfer Learning and Domain Adaptation
   15.3 Semi-Supervised Disentangling of Causal Factors
   15.4 Distributed Representation
   15.5 Exponential Gains from Depth
   15.6 Providing Clues to Discover Underlying Causes

16 Structured Probabilistic Models for Deep Learning
   16.1 The Challenge of Unstructured Modeling
   16.2 Using Graphs to Describe Model Structure
   16.3 Sampling from Graphical Models
   16.4 Advantages of Structured Modeling
   16.5 Learning about Dependencies
   16.6 Inference and Approximate Inference
   16.7 The Deep Learning Approach to Structured Probabilistic Models

17 Monte Carlo Methods
   17.1 Sampling and Monte Carlo Methods
   17.2 Importance Sampling
   17.3 Markov Chain Monte Carlo Methods
   17.4 Gibbs Sampling
   17.5 The Challenge of Mixing between Separated Modes

18 Confronting the Partition Function
   18.1 The Log-Likelihood Gradient
   18.2 Stochastic Maximum Likelihood and Contrastive Divergence
   18.3 Pseudolikelihood
   18.4 Score Matching and Ratio Matching
   18.5 Denoising Score Matching
   18.6 Noise-Contrastive Estimation
   18.7 Estimating the Partition Function

19 Approximate Inference
   19.1 Inference as Optimization
   19.2 Expectation Maximization
   19.3 MAP Inference and Sparse Coding
   19.4 Variational Inference and Learning
   19.5 Learned Approximate Inference

20 Deep Generative Models
   20.1 Boltzmann Machines
   20.2 Restricted Boltzmann Machines
   20.3 Deep Belief Networks
   20.4 Deep Boltzmann Machines
   20.5 Boltzmann Machines for Real-Valued Data
   20.6 Convolutional Boltzmann Machines
   20.7 Boltzmann Machines for Structured or Sequential Outputs
   20.8 Other Boltzmann Machines
   20.9 Back-Propagation through Random Operations
   20.10 Directed Generative Nets
   20.11 Drawing Samples from Autoencoders
   20.12 Generative Stochastic Networks
   20.13 Other Generation Schemes
   20.14 Evaluating Generative Models
   20.15 Conclusion

Bibliography

Index
Website

www.deeplearningbook.org

This book is accompanied by the above website. The website provides a variety of supplementary material, including exercises, lecture slides, corrections of mistakes, and other resources that should be useful to both readers and instructors.
Ackno knowledgmen wledgmen wledgments ts Acknowledgments
This book would not have been possible without the contributions of many people.

We would like to thank those who commented on our proposal for the book and helped plan its contents and organization: Guillaume Alain, Kyunghyun Cho, Çağlar Gülçehre, David Krueger, Hugo Larochelle, Razvan Pascanu and Thomas Rohée.

We would like to thank the people who offered feedback on the content of the book itself. Some offered feedback on many chapters: Martín Abadi, Guillaume Alain, Ion Androutsopoulos, Fred Bertsch, Olexa Bilaniuk, Ufuk Can Biçici, Matko Bošnjak, John Boersma, Greg Brockman, Pierre Luc Carrier, Sarath Chandar, Pawel Chilinski, Mark Daoust, Oleg Dashevskii, Laurent Dinh, Stephan Dreseitl, Jim Fan, Miao Fan, Meire Fortunato, Frédéric Francis, Nando de Freitas, Çağlar Gülçehre, Jurgen Van Gael, Javier Alonso García, Jonathan Hunt, Gopi Jeyaram, Chingiz Kabytayev, Lukasz Kaiser, Varun Kanade, Akiel Khan, John King, Diederik P. Kingma, Yann LeCun, Rudolf Mathey, Matías Mattamala, Abhinav Maurya, Kevin Murphy, Oleg Mürk, Roman Novak, Augustus Q. Odena, Simon Pavlik, Karl Pichotta, Kari Pulli, Tapani Raiko, Anurag Ranjan, Johannes Roith, Halis Sak, César Salgado, Grigory Sapunov, Mike Schuster, Julian Serban, Nir Shabat, Ken Shirriff, Scott Stanley, David Sussillo, Ilya Sutskever, Carles Gelada Sáez, Graham Taylor, Valentin Tolmer, An Tran, Shubhendu Trivedi, Alexey Umnov, Vincent Vanhoucke, Marco Visentini-Scarzanella, David Warde-Farley, Dustin Webb, Kelvin Xu, Wei Xue, Li Yao, Zygmunt Zając and Ozan Çağlayan.

We would also like to thank those who provided us with useful feedback on individual chapters:

• Chapter 1, Introduction: Yusuf Akgul, Sebastien Bratieres, Samira Ebrahimi, Charlie Gorichanaz, Brendan Loudermilk, Eric Morris, Cosmin Pârvulescu and Alfredo Solano.

• Chapter 2, Linear Algebra: Amjad Almahairi, Nikola Banić, Kevin Bennett,
Philippe Castonguay, Oscar Chang, Eric Fosler-Lussier, Andrey Khalyavin, Sergey Oreshkov, István Petrás, Dennis Prangle, Thomas Rohée, Colby Toland, Massimiliano Tomassoli, Alessandro Vitale and Bob Welland.

• Chapter 3, Probability and Information Theory: John Philip Anderson, Kai Arulkumaran, Vincent Dumoulin, Rui Fa, Stephan Gouws, Artem Oboturov, Antti Rasmus, Andre Simpelo, Alexey Surkov and Volker Tresp.

• Chapter 4, Numerical Computation: Tran Lam An, Ian Fischer, and Hu Yuhuang.

• Chapter 5, Machine Learning Basics: Dzmitry Bahdanau, Nikhil Garg, Makoto Otsuka, Bob Pepin, Philip Popien, Emmanuel Rayner, Kee-Bong Song, Zheng Sun and Andy Wu.

• Chapter 6, Deep Feedforward Networks: Uriel Berdugo, Fabrizio Bottarel, Elizabeth Burl, Ishan Durugkar, Jeff Hlywa, Jong Wook Kim, David Krueger and Aditya Kumar Praharaj.

• Chapter 7, Regularization for Deep Learning: Inkyu Lee, Sunil Mohan and Joshua Salisbury.

• Chapter 8, Optimization for Training Deep Models: Marcel Ackermann, Rowel Atienza, Andrew Brock, Tegan Maharaj, James Martens and Klaus Strobl.

• Chapter 9, Convolutional Networks: Martín Arjovsky, Eugene Brevdo, Eric Jensen, Asifullah Khan, Mehdi Mirza, Alex Paino, Eddie Pierce, Marjorie Sayer, Ryan Stout and Wentao Wu.

• Chapter 10, Sequence Modeling: Recurrent and Recursive Nets: Gökçen Eraslan, Steven Hickson, Razvan Pascanu, Lorenzo von Ritter, Rui Rodrigues, Mihaela Rosca, Dmitriy Serdyuk, Dongyu Shi and Kaiyu Yang.

• Chapter 11, Practical Methodology: Daniel Beckstein.

• Chapter 12, Applications: George Dahl and Ribana Roscher.

• Chapter 15, Representation Learning: Kunal Ghosh.

• Chapter 16, Structured Probabilistic Models for Deep Learning: Minh Lê and Anton Varfolom.

• Chapter 18, Confronting the Partition Function: Sam Bowman.
• Chapter 20, Deep Generative Models: Nicolas Chapados, Daniel Galvez, Wenming Ma, Fady Medhat, Shakir Mohamed and Grégoire Montavon.

• Bibliography: Leslie N. Smith.

We also want to thank those who allowed us to reproduce images, figures or data from their publications. We indicate their contributions in the figure captions throughout the text.

We would like to thank Ian's wife Daniela Flori Goodfellow for patiently supporting Ian during the writing of the book as well as for help with proofreading.

We would like to thank the Google Brain team for providing an intellectual environment where Ian could devote a tremendous amount of time to writing this book and receive feedback and guidance from colleagues. We would especially like to thank Ian's former manager, Greg Corrado, and his current manager, Samy Bengio, for their support of this project.

Finally, we would like to thank Geoffrey Hinton for encouragement when writing was difficult.
Notation
This section provides a concise reference describing the notation used throughout this book. If you are unfamiliar with any of the corresponding mathematical concepts, this notation reference may seem intimidating. However, do not despair; we describe most of these ideas in chapters 2-4.

Numbers and Arrays

a                A scalar (integer or real)
a                A vector
A                A matrix
A                A tensor
I_n              Identity matrix with n rows and n columns
I                Identity matrix with dimensionality implied by context
e^(i)            Standard basis vector [0, ..., 0, 1, 0, ..., 0] with a 1 at position i
diag(a)          A square, diagonal matrix with diagonal entries given by a
a                A scalar random variable
a                A vector-valued random variable
A                A matrix-valued random variable
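To make a few of these array notations concrete, here is a minimal NumPy sketch (NumPy is our illustration choice here, not something the notation itself prescribes):

```python
import numpy as np

n = 3
I_n = np.eye(n)                 # I_n: identity matrix with n rows and n columns
e_2 = np.eye(n)[1]              # e^(2): standard basis vector with a 1 at position 2
a = np.array([1.0, 2.0, 3.0])
D = np.diag(a)                  # diag(a): square diagonal matrix with entries given by a

# Multiplying by diag(a) scales coordinate i by a_i:
assert np.allclose(D @ np.ones(n), a)
assert e_2.tolist() == [0.0, 1.0, 0.0]
```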
Sets and Graphs

A                A set
R                The set of real numbers
{0, 1}           The set containing 0 and 1
{0, 1, ..., n}   The set of all integers between 0 and n
[a, b]           The real interval including a and b
(a, b]           The real interval excluding a but including b
A \ B            Set subtraction, i.e., the set containing the elements of A that are not in B
G                A graph
Pa_G(x_i)        The parents of x_i in G
Indexing

a_i              Element i of vector a, with indexing starting at 1
a_{-i}           All elements of vector a except for element i
A_{i,j}          Element i, j of matrix A
A_{i,:}          Row i of matrix A
A_{:,i}          Column i of matrix A
A_{i,j,k}        Element (i, j, k) of a 3-D tensor A
A_{:,:,i}        2-D slice of a 3-D tensor
a_i              Element i of the random vector a
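These one-based conventions line up with zero-based NumPy indexing; the sketch below (an illustrative aside, not from the book) shows the correspondence:

```python
import numpy as np

# The book indexes from 1; NumPy indexes from 0, so the book's A_{i,j}
# corresponds to A[i-1, j-1] below.
A = np.arange(1, 13).reshape(3, 4)           # a 3 x 4 matrix

row_1 = A[0, :]                              # A_{1,:}: row 1 of A
col_2 = A[:, 1]                              # A_{:,2}: column 2 of A
a = np.array([10, 20, 30])
a_minus_1 = np.delete(a, 0)                  # a_{-1}: all elements except element 1

assert A[0, 1] == 2                          # A_{1,2}
assert col_2.tolist() == [2, 6, 10]
assert a_minus_1.tolist() == [20, 30]
```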
Linear Algebra Operations

A^T              Transpose of matrix A
A^+              Moore-Penrose pseudoinverse of A
A ⊙ B            Element-wise (Hadamard) product of A and B
det(A)           Determinant of A
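In NumPy these operations look as follows (a sketch for illustration; note that `*` is the element-wise product, not matrix multiplication):

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])

A_T = A.T                        # A^T: transpose of A
A_pinv = np.linalg.pinv(A)       # A^+: Moore-Penrose pseudoinverse of A
H = A * B                        # A ⊙ B: element-wise (Hadamard) product
d = np.linalg.det(A)             # det(A): determinant of A

# For an invertible matrix, the pseudoinverse coincides with the inverse:
assert np.allclose(A_pinv, np.linalg.inv(A))
assert np.isclose(d, -2.0)       # det(A) = 1*4 - 2*3 = -2
```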
Calculus

dy/dx                    Derivative of y with respect to x
∂y/∂x                    Partial derivative of y with respect to x
∇_x y                    Gradient of y with respect to x
∇_X y                    Matrix derivatives of y with respect to X
∇_X y                    Tensor containing derivatives of y with respect to X
∂f/∂x                    Jacobian matrix J ∈ R^(m×n) of f : R^n → R^m
∇²_x f(x) or H(f)(x)     The Hessian matrix of f at input point x
∫ f(x) dx                Definite integral over the entire domain of x
∫_S f(x) dx              Definite integral with respect to x over the set S

Probability and Information Theory

a ⊥ b                    The random variables a and b are independent
a ⊥ b | c                They are conditionally independent given c
P(a)                     A probability distribution over a discrete variable
p(a)                     A probability distribution over a continuous variable, or over a variable whose type has not been specified
a ~ P                    Random variable a has distribution P
E_{x~P}[f(x)] or E f(x)  Expectation of f(x) with respect to P(x)
Var(f(x))                Variance of f(x) under P(x)
Cov(f(x), g(x))          Covariance of f(x) and g(x) under P(x)
H(x)                     Shannon entropy of the random variable x
D_KL(P ‖ Q)              Kullback-Leibler divergence of P and Q
N(x; μ, Σ)               Gaussian distribution over x with mean μ and covariance Σ
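As a small illustration of the entropy and KL divergence notations (a sketch for discrete distributions over two outcomes, using natural logarithms; the numbers are made up):

```python
import math

P = [0.5, 0.5]   # a fair coin
Q = [0.9, 0.1]   # a biased coin

# Shannon entropy H(x) = -sum_x P(x) log P(x)
H_P = -sum(p * math.log(p) for p in P if p > 0)

# Kullback-Leibler divergence D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x))
D_KL = sum(p * math.log(p / q) for p, q in zip(P, Q))

assert math.isclose(H_P, math.log(2))   # entropy of a fair coin is log 2 nats
assert D_KL > 0                          # D_KL is nonnegative, and 0 only if P == Q
```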
Functions

f : A → B        The function f with domain A and range B
f ∘ g            Composition of the functions f and g
f(x; θ)          A function of x parametrized by θ. (Sometimes we just write f(x) and ignore the argument θ to lighten notation.)
log x            Natural logarithm of x
σ(x)             Logistic sigmoid, 1 / (1 + exp(−x))
ζ(x)             Softplus, log(1 + exp(x))
||x||_p          L^p norm of x
||x||            L^2 norm of x
x^+              Positive part of x, i.e., max(0, x)
1_condition      Is 1 if the condition is true, 0 otherwise

Sometimes we use a function f whose argument is a scalar but apply it to a vector, matrix, or tensor: f(x), f(X), or f(X). This means to apply f to the array element-wise. For example, if C = σ(X), then C_{i,j,k} = σ(X_{i,j,k}) for all valid values of i, j and k.
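The element-wise convention can be sketched in NumPy (the array values here are made up for illustration):

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid sigma(x) = 1 / (1 + exp(-x)), written for scalars."""
    return 1.0 / (1.0 + np.exp(-x))

X = np.array([[[0.0, 1.0], [-1.0, 2.0]]])   # a 3-D tensor of shape (1, 2, 2)
C = sigmoid(X)                               # applies sigma element-wise

# C has the same shape as X, and C[i, j, k] == sigmoid(X[i, j, k]):
assert C.shape == X.shape
assert np.isclose(C[0, 0, 0], 0.5)           # sigmoid(0) = 0.5
```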
Datasets and Distributions

p_data           The data generating distribution
p̂_data           The empirical distribution defined by the training set
X                A set of training examples
x^(i)            The i-th example (input) from a dataset
y^(i)            The target associated with x^(i) for supervised learning
X                The m × n matrix with input example x^(i) in row X_{i,:}
Chapter 1
Introduction

Inventors have long dreamed of creating machines that think. This desire dates back to at least the time of ancient Greece. The mythical figures Pygmalion, Daedalus, and Hephaestus may all be interpreted as legendary inventors, and Galatea, Talos, and Pandora may all be regarded as artificial life (Ovid and Martin, 2004; Sparkes, 1996; Tandy, 1997).

When programmable computers were first conceived, people wondered whether they might become intelligent, over a hundred years before one was built (Lovelace, 1842). Today, artificial intelligence (AI) is a thriving field with many practical applications and active research topics. We look to intelligent software to automate routine labor, understand speech or images, make diagnoses in medicine and support basic scientific research.

In the early days of artificial intelligence, the field rapidly tackled and solved problems that are intellectually difficult for human beings but relatively straightforward for computers—problems that can be described by a list of formal, mathematical rules. The true challenge to artificial intelligence proved to be solving the tasks that are easy for people to perform but hard for people to describe formally—problems that we solve intuitively, that feel automatic, like recognizing spoken words or faces in images.

This book is about a solution to these more intuitive problems. This solution is to allow computers to learn from experience and understand the world in terms of a hierarchy of concepts, with each concept defined in terms of its relation to simpler concepts. By gathering knowledge from experience, this approach avoids the need for human operators to formally specify all of the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones. If we draw a graph showing how these
concepts are built on top of each other, the graph is deep, with many layers. For this reason, we call this approach to AI deep learning.

Many of the early successes of AI took place in relatively sterile and formal environments and did not require computers to have much knowledge about the world. For example, IBM's Deep Blue chess-playing system defeated world champion Garry Kasparov in 1997 (Hsu, 2002). Chess is of course a very simple world, containing only sixty-four locations and thirty-two pieces that can move in only rigidly circumscribed ways. Devising a successful chess strategy is a tremendous accomplishment, but the challenge is not due to the difficulty of describing the set of chess pieces and allowable moves to the computer. Chess can be completely described by a very brief list of completely formal rules, easily provided ahead of time by the programmer.

Ironically, abstract and formal tasks that are among the most difficult mental undertakings for a human being are among the easiest for a computer. Computers have long been able to defeat even the best human chess player, but are only recently matching some of the abilities of average human beings to recognize objects or speech. A person's everyday life requires an immense amount of knowledge about the world. Much of this knowledge is subjective and intuitive, and therefore difficult to articulate in a formal way. Computers need to capture this same knowledge in order to behave in an intelligent way. One of the key challenges in artificial intelligence is how to get this informal knowledge into a computer.

Several artificial intelligence projects have sought to hard-code knowledge about the world in formal languages. A computer can reason about statements in these formal languages automatically using logical inference rules. This is known as the knowledge base approach to artificial intelligence. None of these projects has led to a major success. One of the most famous such projects is Cyc (Lenat and Guha, 1989). Cyc is an inference engine and a database of statements in a language called CycL. These statements are entered by a staff of human supervisors. It is an unwieldy process. People struggle to devise formal rules with enough complexity to accurately describe the world. For example, Cyc failed to understand a story about a person named Fred shaving in the morning (Linde, 1992). Its inference engine detected an inconsistency in the story: it knew that people do not have electrical parts, but because Fred was holding an electric razor, it believed the entity "FredWhileShaving" contained electrical parts. It therefore asked whether Fred was still a person while he was shaving.

The difficulties faced by systems relying on hard-coded knowledge suggest that AI systems need the ability to acquire their own knowledge, by extracting patterns from raw data. This capability is known as machine learning. The introduction
of machine learning allowed computers to tackle problems involving knowledge of the real world and make decisions that appear subjective. A simple machine learning algorithm called logistic regression can determine whether to recommend cesarean delivery (Mor-Yosef et al., 1990). A simple machine learning algorithm called naive Bayes can separate legitimate e-mail from spam e-mail.

The performance of these simple machine learning algorithms depends heavily on the representation of the data they are given. For example, when logistic regression is used to recommend cesarean delivery, the AI system does not examine the patient directly. Instead, the doctor tells the system several pieces of relevant information, such as the presence or absence of a uterine scar. Each piece of information included in the representation of the patient is known as a feature. Logistic regression learns how each of these features of the patient correlates with various outcomes. However, it cannot influence the way that the features are defined in any way. If logistic regression was given an MRI scan of the patient, rather than the doctor's formalized report, it would not be able to make useful predictions. Individual pixels in an MRI scan have negligible correlation with any complications that might occur during delivery.

This dependence on representations is a general phenomenon that appears throughout computer science and even daily life. In computer science, operations such as searching a collection of data can proceed exponentially faster if the collection is structured and indexed intelligently. People can easily perform arithmetic on Arabic numerals, but find arithmetic on Roman numerals much more time-consuming. It is not surprising that the choice of representation has an enormous effect on the performance of machine learning algorithms. For a simple visual example, see Fig. 1.1.

Many artificial intelligence tasks can be solved by designing the right set of features to extract for that task, then providing these features to a simple machine learning algorithm. For example, a useful feature for speaker identification from sound is an estimate of the size of the speaker's vocal tract. It therefore gives a strong clue as to whether the speaker is a man, woman, or child.

However, for many tasks, it is difficult to know what features should be extracted. For example, suppose that we would like to write a program to detect cars in photographs. We know that cars have wheels, so we might like to use the presence of a wheel as a feature. Unfortunately, it is difficult to describe exactly what a wheel looks like in terms of pixel values. A wheel has a simple geometric shape but its image may be complicated by shadows falling on the wheel, the sun glaring off the metal parts of the wheel, the fender of the car or an object in the foreground obscuring part of the wheel, and so on.
Figure 1.1: Example of different representations: suppose we want to separate two categories of data by drawing a line between them in a scatterplot. In the plot on the left, we represent some data using Cartesian coordinates, and the task is impossible. In the plot on the right, we represent the data with polar coordinates and the task becomes simple to solve with a vertical line. (Figure produced in collaboration with David Warde-Farley)
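The change of representation in Fig. 1.1 can be made concrete with a small sketch (an illustrative example, not from the book; the data points are invented): two categories that no single straight line separates in Cartesian coordinates become separable by one threshold on the radius after conversion to polar coordinates.

```python
import math

# Two categories: points near the origin (class 0) and points on a
# larger ring around it (class 1). No single straight line in the
# (x, y) plane separates them.
inner = [(0.5, 0.2), (-0.3, 0.4), (0.1, -0.6)]   # class 0
outer = [(2.0, 0.1), (-1.5, 1.5), (0.3, -2.2)]   # class 1

def to_polar(x, y):
    """Map Cartesian (x, y) to polar (r, theta)."""
    return math.hypot(x, y), math.atan2(y, x)

# After the change of representation, a single vertical line r = 1
# in the (r, theta) plane separates the two categories perfectly.
assert all(to_polar(x, y)[0] < 1.0 for x, y in inner)
assert all(to_polar(x, y)[0] > 1.0 for x, y in outer)
```

The learning problem has not changed; only the representation has, and with it the difficulty of the task.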
One solution to this problem is to use machine learning to discover not only the mapping from representation to output but also the representation itself. This approach is known as representation learning. Learned representations often result in much better performance than can be obtained with hand-designed representations. They also allow AI systems to rapidly adapt to new tasks, with minimal human intervention. A representation learning algorithm can discover a good set of features for a simple task in minutes, or for a complex task in hours to months. Manually designing features for a complex task requires a great deal of human time and effort; it can take decades for an entire community of researchers.

The quintessential example of a representation learning algorithm is the autoencoder. An autoencoder is the combination of an encoder function that converts the input data into a different representation, and a decoder function that converts the new representation back into the original format. Autoencoders are trained to preserve as much information as possible when an input is run through the encoder and then the decoder, but are also trained to make the new representation have various nice properties. Different kinds of autoencoders aim to achieve different kinds of properties.

When designing features or algorithms for learning features, our goal is usually to separate the factors of variation that explain the observed data. In this context, we use the word “factors” simply to refer to separate sources of influence; the factors are usually not combined by multiplication. Such factors are often not quantities
that are directly observed. Instead, they may exist either as unobserved objects or unobserved forces in the physical world that affect observable quantities. They may also exist as constructs in the human mind that provide useful simplifying explanations or inferred causes of the observed data. They can be thought of as concepts or abstractions that help us make sense of the rich variability in the data. When analyzing a speech recording, the factors of variation include the speaker's age, their sex, their accent and the words that they are speaking. When analyzing an image of a car, the factors of variation include the position of the car, its color, and the angle and brightness of the sun.

A major source of difficulty in many real-world artificial intelligence applications is that many of the factors of variation influence every single piece of data we are able to observe. The individual pixels in an image of a red car might be very close to black at night. The shape of the car's silhouette depends on the viewing angle. Most applications require us to disentangle the factors of variation and discard the ones that we do not care about.

Of course, it can be very difficult to extract such high-level, abstract features from raw data. Many of these factors of variation, such as a speaker's accent, can be identified only using sophisticated, nearly human-level understanding of the data. When it is nearly as difficult to obtain a representation as to solve the original problem, representation learning does not, at first glance, seem to help us.
Deep learning solves this central problem in representation learning by introducing representations that are expressed in terms of other, simpler representations. Deep learning allows the computer to build complex concepts out of simpler concepts. Fig. 1.2 shows how a deep learning system can represent the concept of an image of a person by combining simpler concepts, such as corners and contours, which are in turn defined in terms of edges.

The quintessential example of a deep learning model is the feedforward deep network or multilayer perceptron (MLP). A multilayer perceptron is just a mathematical function mapping some set of input values to output values. The function is formed by composing many simpler functions. We can think of each application of a different mathematical function as providing a new representation of the input.

The idea of learning the right representation for the data provides one perspective on deep learning. Another perspective on deep learning is that depth allows the computer to learn a multi-step computer program. Each layer of the representation can be thought of as the state of the computer's memory after executing another set of instructions in parallel. Networks with greater depth can execute more instructions in sequence. Sequential instructions offer great power because later instructions can refer back to the results of earlier instructions. According to this
[Figure 1.2: schematic of a deep network: visible layer (input pixels); 1st hidden layer (edges); 2nd hidden layer (corners and contours); 3rd hidden layer (object parts); output (object identity: CAR, PERSON, ANIMAL).]
Figure 1.2: Illustration of a deep learning model. It is difficult for a computer to understand the meaning of raw sensory input data, such as this image represented as a collection of pixel values. The function mapping from a set of pixels to an object identity is very complicated. Learning or evaluating this mapping seems insurmountable if tackled directly. Deep learning resolves this difficulty by breaking the desired complicated mapping into a series of nested simple mappings, each described by a different layer of the model. The input is presented at the visible layer, so named because it contains the variables that we are able to observe. Then a series of hidden layers extracts increasingly abstract features from the image. These layers are called “hidden” because their values are not given in the data; instead the model must determine which concepts are useful for explaining the relationships in the observed data. The images here are visualizations of the kind of feature represented by each hidden unit. Given the pixels, the first layer can easily identify edges, by comparing the brightness of neighboring pixels. Given the first hidden layer's description of the edges, the second hidden layer can easily search for corners and extended contours, which are recognizable as collections of edges. Given the second hidden layer's description of the image in terms of corners and contours, the third hidden layer can detect entire parts of specific objects, by finding specific collections of contours and corners. Finally, this description of the image in terms of the object parts it contains can be used to recognize the objects present in the image. Images reproduced with permission from Zeiler and Fergus (2014).
[Figure 1.3: two computational graphs for the same model; left: built from an element set of + and × operations applied to inputs x1, x2 and weights w1, w2; right: built from a single “logistic regression” element applied to inputs x and w.]
Figure 1.3: Illustration of computational graphs mapping an input to an output where each node performs an operation. Depth is the length of the longest path from input to output but depends on the definition of what constitutes a possible computational step. The computation depicted in these graphs is the output of a logistic regression model, σ(wᵀx), where σ is the logistic sigmoid function. If we use addition, multiplication and logistic sigmoids as the elements of our computer language, then this model has depth three. If we view logistic regression as an element itself, then this model has depth one.
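The two depth measurements in Fig. 1.3 can be made concrete (an illustrative sketch with made-up weights and inputs): computed from the element set {+, ×, σ}, the longest input-to-output path for σ(wᵀx) visits three operations, while treating logistic regression itself as a primitive gives a path of length one.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

w = [2.0, -1.0]
x = [0.5, 0.25]

# Element set {+, x, sigmoid}: multiplications, then an addition, then
# the sigmoid -- the longest path visits three operations (depth 3).
products = [wi * xi for wi, xi in zip(w, x)]   # the "x" nodes
z = sum(products)                              # the "+" node
y_deep = sigmoid(z)                            # the sigmoid node

# Element set {logistic regression}: the same computation viewed as one
# primitive operation (depth 1).
def logistic_regression(w, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

y_shallow = logistic_regression(w, x)
assert abs(y_deep - y_shallow) < 1e-12  # same function, different depth
```

The point of the sketch is that depth is a property of how we describe the computation, not of the function being computed.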
view of deep learning, not all of the information in a layer's activations necessarily encodes factors of variation that explain the input. The representation also stores state information that helps to execute a program that can make sense of the input. This state information could be analogous to a counter or pointer in a traditional computer program. It has nothing to do with the content of the input specifically, but it helps the model to organize its processing.

There are two main ways of measuring the depth of a model. The first view is based on the number of sequential instructions that must be executed to evaluate the architecture. We can think of this as the length of the longest path through a flow chart that describes how to compute each of the model's outputs given its inputs. Just as two equivalent computer programs will have different lengths depending on which language the program is written in, the same function may be drawn as a flowchart with different depths depending on which functions we allow to be used as individual steps in the flowchart. Fig. 1.3 illustrates how this choice of language can give two different measurements for the same architecture.

Another approach, used by deep probabilistic models, regards the depth of a model as being not the depth of the computational graph but the depth of the graph describing how concepts are related to each other. In this case, the depth of the flowchart of the computations needed to compute the representation of
each concept may be much deeper than the graph of the concepts themselves. This is because the system's understanding of the simpler concepts can be refined given information about the more complex concepts. For example, an AI system observing an image of a face with one eye in shadow may initially only see one eye. After detecting that a face is present, it can then infer that a second eye is probably present as well. In this case, the graph of concepts only includes two layers—a layer for eyes and a layer for faces—but the graph of computations includes 2n layers if we refine our estimate of each concept given the other n times.
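The eye/face refinement loop just described can be sketched as alternating belief updates (a toy illustration; the update rules and numbers are invented): the concept graph has only two layers, but n rounds of mutual refinement execute 2n sequential computation steps.

```python
# Two concepts: "a face is present" and "two eyes are present".
# Each refinement step updates one belief from the other, so n rounds
# of mutual refinement execute 2n sequential steps even though the
# concept graph has only two layers. The update rules and initial
# values here are invented purely for illustration.
def refine_face(face, eyes):
    return face + 0.5 * eyes * (1.0 - face)   # seen eyes support the face belief

def refine_eyes(eyes, face):
    return eyes + 0.5 * face * (1.0 - eyes)   # a face makes a second eye likely

face, eyes, steps = 0.6, 0.3, 0
n = 3
for _ in range(n):
    face = refine_face(face, eyes); steps += 1
    eyes = refine_eyes(eyes, face); steps += 1

assert steps == 2 * n              # depth of the computation graph
assert face > 0.6 and eyes > 0.3   # both beliefs were strengthened
```

Under the concept-graph view this model has depth two; under the computational-graph view its depth grows with the number of refinement rounds.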
Because it is not always clear which of these two views—the depth of the computational graph, or the depth of the probabilistic modeling graph—is most relevant, and because different people choose different sets of smallest elements from which to construct their graphs, there is no single correct value for the depth of an architecture, just as there is no single correct value for the length of a computer program. Nor is there a consensus about how much depth a model requires to qualify as “deep.” However, deep learning can safely be regarded as the study of models that involve a greater amount of composition of either learned functions or learned concepts than traditional machine learning does.

To summarize, deep learning, the subject of this book, is an approach to AI. Specifically, it is a type of machine learning, a technique that allows computer systems to improve with experience and data. According to the authors of this book, machine learning is the only viable approach to building AI systems that can operate in complicated, real-world environments. Deep learning is a particular kind of machine learning that achieves great power and flexibility by learning to represent the world as a nested hierarchy of concepts, with each concept defined in relation to simpler concepts, and more abstract representations computed in terms of less abstract ones. Fig. 1.4 illustrates the relationship between these different AI disciplines. Fig. 1.5 gives a high-level schematic of how each works.
1.1
Who Should Read This Book?
This book can be useful for a variety of readers, but we wrote it with two main target audiences in mind. One of these target audiences is university students (undergraduate or graduate) learning about machine learning, including those who are beginning a career in deep learning and artificial intelligence research. The other target audience is software engineers who do not have a machine learning or statistics background, but want to rapidly acquire one and begin using deep learning in their product or platform. Deep learning has already proven useful in many software disciplines including computer vision, speech and audio processing,
[Figure 1.4: nested Venn diagram: AI (example: knowledge bases) contains machine learning (example: logistic regression), which contains representation learning (example: shallow autoencoders), which contains deep learning (example: MLPs).]
Figure 1.4: A Venn diagram showing how deep learning is a kind of representation learning, which is in turn a kind of machine learning, which is used for many but not all approaches to AI. Each section of the Venn diagram includes an example of an AI technology.
[Figure 1.5: four flowcharts; rule-based systems: input, hand-designed program, output; classic machine learning: input, hand-designed features, mapping from features, output; representation learning: input, features, mapping from features, output; deep learning: input, simple features, additional layers of more abstract features, mapping from features, output.]
Figure 1.5: Flowcharts showing how the different parts of an AI system relate to each other within different AI disciplines. Shaded boxes indicate components that are able to learn from data.
natural language processing, robotics, bioinformatics and chemistry, video games, search engines, online advertising and finance.

This book has been organized into three parts in order to best accommodate a variety of readers. Part I introduces basic mathematical tools and machine learning concepts. Part II describes the most established deep learning algorithms that are essentially solved technologies. Part III describes more speculative ideas that are widely believed to be important for future research in deep learning.

Readers should feel free to skip parts that are not relevant given their interests or background. Readers familiar with linear algebra, probability, and fundamental machine learning concepts can skip Part I, for example, while readers who just want to implement a working system need not read beyond Part II. To help choose which chapters to read, Fig. 1.6 provides a flowchart showing the high-level organization of the book.

We do assume that all readers come from a computer science background. We assume familiarity with programming, a basic understanding of computational performance issues, complexity theory, introductory level calculus and some of the terminology of graph theory.
1.2
Historical Trends in Deep Learning
It is easiest to understand deep learning with some historical context. Rather than providing a detailed history of deep learning, we identify a few key trends:

• Deep learning has had a long and rich history, but has gone by many names reflecting different philosophical viewpoints, and has waxed and waned in popularity.

• Deep learning has become more useful as the amount of available training data has increased.

• Deep learning models have grown in size over time as computer hardware and software infrastructure for deep learning has improved.

• Deep learning has solved increasingly complicated applications with increasing accuracy over time.
[Figure 1.6: chapter dependency chart: 1. Introduction; Part I: Applied Math and Machine Learning Basics (2. Linear Algebra, 3. Probability and Information Theory, 4. Numerical Computation, 5. Machine Learning Basics); Part II: Deep Networks: Modern Practices (6. Deep Feedforward Networks, 7. Regularization, 8. Optimization, 9. CNNs, 10. RNNs, 11. Practical Methodology, 12. Applications); Part III: Deep Learning Research (13. Linear Factor Models, 14. Autoencoders, 15. Representation Learning, 16. Structured Probabilistic Models, 17. Monte Carlo Methods, 18. Partition Function, 19. Inference, 20. Deep Generative Models).]
Figure 1.6: The high-level organization of the book. An arrow from one chapter to another indicates that the former chapter is prerequisite material for understanding the latter.
1.2.1
The Many Names and Changing Fortunes of Neural Networks

We expect that many readers of this book have heard of deep learning as an exciting new technology, and are surprised to see a mention of "history" in a book about an emerging field. In fact, deep learning dates back to the 1940s. Deep learning only appears to be new because it was relatively unpopular for several years preceding its current popularity, and because it has gone through many different names, only recently becoming called "deep learning." The field has been rebranded many times, reflecting the influence of different researchers and different perspectives.

A comprehensive history of deep learning is beyond the scope of this textbook. However, some basic context is useful for understanding deep learning. Broadly speaking, there have been three waves of development of deep learning: deep learning known as cybernetics in the 1940s–1960s, deep learning known as connectionism in the 1980s–1990s, and the current resurgence under the name deep learning beginning in 2006. This is quantitatively illustrated in Fig. 1.7.

Some of the earliest learning algorithms we recognize today were intended to be computational models of biological learning, i.e. models of how learning happens or could happen in the brain. As a result, one of the names that deep learning has gone by is artificial neural networks (ANNs). The corresponding perspective on deep learning models is that they are engineered systems inspired by the biological brain (whether the human brain or the brain of another animal). While the kinds of neural networks used for machine learning have sometimes been used to understand brain function (Hinton and Shallice, 1991), they are generally not designed to be realistic models of biological function.

The neural perspective on deep learning is motivated by two main ideas. One idea is that the brain provides a proof by example that intelligent behavior is possible, and a conceptually straightforward path to building intelligence is to reverse engineer the computational principles behind the brain and duplicate its functionality. Another perspective is that it would be deeply interesting to understand the brain and the principles that underlie human intelligence, so machine learning models that shed light on these basic scientific questions are useful apart from their ability to solve engineering applications.

The modern term "deep learning" goes beyond the neuroscientific perspective on the current breed of machine learning models. It appeals to a more general principle of learning multiple levels of composition, which can be applied in machine learning frameworks that are not necessarily neurally inspired.
[Figure 1.7 appears here: frequency of word or phrase (y-axis) versus year, 1940–2000 (x-axis), for the phrases "cybernetics" and "connectionism + neural networks".]
Figure 1.7: The figure shows two of the three historical waves of artificial neural nets research, as measured by the frequency of the phrases "cybernetics" and "connectionism" or "neural networks" according to Google Books (the third wave is too recent to appear). The first wave started with cybernetics in the 1940s–1960s, with the development of theories of biological learning (McCulloch and Pitts, 1943; Hebb, 1949) and implementations of the first models such as the perceptron (Rosenblatt, 1958) allowing the training of a single neuron. The second wave started with the connectionist approach of the 1980–1995 period, with back-propagation (Rumelhart et al., 1986a) to train a neural network with one or two hidden layers. The current and third wave, deep learning, started around 2006 (Hinton et al., 2006; Bengio et al., 2007; Ranzato et al., 2007a), and is just now appearing in book form as of 2016. The other two waves similarly appeared in book form much later than the corresponding scientific activity occurred.
The earliest predecessors of modern deep learning were simple linear models motivated from a neuroscientific perspective. These models were designed to take a set of n input values x1, . . . , xn and associate them with an output y. These models would learn a set of weights w1, . . . , wn and compute their output f(x, w) = x1w1 + · · · + xnwn. This first wave of neural networks research was known as cybernetics, as illustrated in Fig. 1.7.

The McCulloch-Pitts Neuron (McCulloch and Pitts, 1943) was an early model of brain function. This linear model could recognize two different categories of inputs by testing whether f(x, w) is positive or negative. Of course, for the model to correspond to the desired definition of the categories, the weights needed to be set correctly. These weights could be set by the human operator. In the 1950s, the perceptron (Rosenblatt, 1958, 1962) became the first model that could learn the weights defining the categories given examples of inputs from each category. The adaptive linear element (ADALINE), which dates from about the same time, simply returned the value of f(x) itself to predict a real number (Widrow and Hoff, 1960), and could also learn to predict these numbers from data.

These simple learning algorithms greatly affected the modern landscape of machine learning. The training algorithm used to adapt the weights of the ADALINE was a special case of an algorithm called stochastic gradient descent. Slightly modified versions of the stochastic gradient descent algorithm remain the dominant training algorithms for deep learning models today.

Models based on the f(x, w) used by the perceptron and ADALINE are called linear models.
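As a concrete illustration, the sketch below trains such a linear unit with the ADALINE rule, a special case of stochastic gradient descent on squared error. This is an illustrative reconstruction, not code from the book; the learning rate, epoch count, and toy dataset are arbitrary choices.

```python
# ADALINE-style linear unit: f(x, w) = x1*w1 + ... + xn*wn,
# trained by stochastic gradient descent on squared error.
# Illustrative sketch only; hyperparameters and data are arbitrary.

def f(x, w):
    """Linear model output: dot product of inputs and weights."""
    return sum(xi * wi for xi, wi in zip(x, w))

def train_adaline(data, n, lr=0.1, epochs=50):
    """data: list of (x, y) pairs; n: number of input values."""
    w = [0.0] * n
    for _ in range(epochs):
        for x, y in data:
            error = f(x, w) - y
            # SGD step on squared error: w_i <- w_i - lr * error * x_i
            w = [wi - lr * error * xi for wi, xi in zip(w, x)]
    return w

# Learn the target y = 2*x1 - x2 from three examples.
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
w = train_adaline(data, n=2)
print(w)  # close to [2.0, -1.0]
```

The per-example update is the point of contact with modern practice: the same gradient step, applied to far larger models, is still how deep networks are trained today.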
These models remain some of the most widely used machine learning models, though in many cases they are trained in different ways than the original models were trained.

Linear models have many limitations. Most famously, they cannot learn the XOR function, where f([0, 1], w) = 1 and f([1, 0], w) = 1 but f([1, 1], w) = 0 and f([0, 0], w) = 0. Critics who observed these flaws in linear models caused a backlash against biologically inspired learning in general (Minsky and Papert, 1969). This was the first major dip in the popularity of neural networks.

Today, neuroscience is regarded as an important source of inspiration for deep learning researchers, but it is no longer the predominant guide for the field.
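Returning to the XOR limitation above: it can be verified mechanically. The sketch below (an illustration, not from the book) searches a grid of weights and a bias for a linear threshold unit, using the positive/negative classification rule described for the McCulloch-Pitts neuron: it finds settings that implement OR, but none that implement XOR, since no line can separate {(0,1), (1,0)} from {(0,0), (1,1)}.

```python
# Brute-force check that a linear threshold unit
# f(x, w) = w1*x1 + w2*x2 + b, classifying by whether f is positive,
# can implement OR but not XOR. Illustrative sketch; the weight grid
# is an arbitrary choice.

import itertools

def fits(target):
    """Return a (w1, w2, b) triple realizing `target`, or None."""
    grid = [k / 2 for k in range(-6, 7)]        # -3.0, -2.5, ..., 3.0
    inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
    for w1, w2, b in itertools.product(grid, repeat=3):
        if all((w1 * x1 + w2 * x2 + b > 0) == target[(x1, x2)]
               for x1, x2 in inputs):
            return (w1, w2, b)
    return None

OR  = {(0, 0): False, (0, 1): True, (1, 0): True, (1, 1): True}
XOR = {(0, 0): False, (0, 1): True, (1, 0): True, (1, 1): False}

print(fits(OR))   # some solution exists, e.g. w1 = w2 = 1 with a small negative bias
print(fits(XOR))  # None: no linear threshold unit computes XOR
```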
The main reason for the diminished role of neuroscience in deep learning research today is that we simply do not have enough information about the brain to use it as a guide. To obtain a deep understanding of the actual algorithms used by the brain, we would need to be able to monitor the activity of (at the very least) thousands of interconnected neurons simultaneously. Because we are not able to do this, we are far from understanding even some of the most simple and
well-studied parts of the brain (Olshausen and Field, 2005).

Neuroscience has given us a reason to hope that a single deep learning algorithm can solve many different tasks. Neuroscientists have found that ferrets can learn to "see" with the auditory processing region of their brain if their brains are rewired to send visual signals to that area (Von Melchner et al., 2000). This suggests that much of the mammalian brain might use a single algorithm to solve most of the different tasks that the brain solves. Before this hypothesis, machine learning research was more fragmented, with different communities of researchers studying natural language processing, vision, motion planning and speech recognition. Today, these application communities are still separate, but it is common for deep learning research groups to study many or even all of these application areas simultaneously.

We are able to draw some rough guidelines from neuroscience. The basic idea of having many computational units that become intelligent only via their interactions with each other is inspired by the brain. The Neocognitron (Fukushima, 1980) introduced a powerful model architecture for processing images that was inspired by the structure of the mammalian visual system and later became the basis for the modern convolutional network (LeCun et al., 1998b), as we will see in Sec. 9.10. Most neural networks today are based on a model neuron called the rectified linear unit. The original Cognitron (Fukushima, 1975) introduced a more complicated version that was highly inspired by our knowledge of brain function. The simplified modern version was developed incorporating ideas from many viewpoints, with Nair and Hinton (2010) and Glorot et al. (2011a) citing neuroscience as an influence, and Jarrett et al. (2009) citing more engineering-oriented influences. While neuroscience is an important source of inspiration, it need not be taken as a rigid guide. We know that actual neurons compute very different functions than modern rectified linear units, but greater neural realism has not yet led to an improvement in machine learning performance. Also, while neuroscience has successfully inspired several neural network architectures, we do not yet know enough about biological learning for neuroscience to offer much guidance for the learning algorithms we use to train these architectures.

Media accounts often emphasize the similarity of deep learning to the brain. While it is true that deep learning researchers are more likely to cite the brain as an influence than researchers working in other machine learning fields such as kernel machines or Bayesian statistics, one should not view deep learning as an attempt to simulate the brain. Modern deep learning draws inspiration from many fields, especially applied math fundamentals like linear algebra, probability, information theory, and numerical optimization. While some deep learning researchers cite neuroscience as an important source of inspiration, others are not concerned with
neuroscience at all.

It is worth noting that the effort to understand how the brain works on an algorithmic level is alive and well. This endeavor is primarily known as "computational neuroscience" and is a separate field of study from deep learning. It is common for researchers to move back and forth between both fields. The field of deep learning is primarily concerned with how to build computer systems that are able to successfully solve tasks requiring intelligence, while the field of computational neuroscience is primarily concerned with building more accurate models of how the brain actually works.

In the 1980s, the second wave of neural network research emerged in great part via a movement called connectionism or parallel distributed processing (Rumelhart et al., 1986c; McClelland et al., 1995). Connectionism arose in the context of cognitive science. Cognitive science is an interdisciplinary approach to understanding the mind, combining multiple different levels of analysis. During the early 1980s, most cognitive scientists studied models of symbolic reasoning. Despite their popularity, symbolic models were difficult to explain in terms of how the brain could actually implement them using neurons. The connectionists began to study models of cognition that could actually be grounded in neural implementations (Touretzky and Minton, 1985), reviving many ideas dating back to the work of psychologist Donald Hebb in the 1940s (Hebb, 1949).

The central idea in connectionism is that a large number of simple computational units can achieve intelligent behavior when networked together. This insight applies equally to neurons in biological nervous systems and to hidden units in computational models.

Several key concepts arose during the connectionism movement of the 1980s that remain central to today's deep learning.

One of these concepts is that of distributed representation (Hinton et al., 1986). This is the idea that each input to a system should be represented by many features, and each feature should be involved in the representation of many possible inputs. For example, suppose we have a vision system that can recognize cars, trucks, and birds, and these objects can each be red, green, or blue. One way of representing these inputs would be to have a separate neuron or hidden unit that activates for each of the nine possible combinations: red truck, red car, red bird, green truck, and so on. This requires nine different neurons, and each neuron must independently learn the concept of color and object identity. One way to improve on this situation is to use a distributed representation, with three neurons describing the color and three neurons describing the object identity. This requires only six neurons total instead of nine, and the neuron describing redness is able to learn about redness
from images of cars, trucks and birds, not only from images of one specific category of objects. The concept of distributed representation is central to this book, and will be described in greater detail in Chapter 15.

Another major accomplishment of the connectionist movement was the successful use of back-propagation to train deep neural networks with internal representations and the popularization of the back-propagation algorithm (Rumelhart et al., 1986a; LeCun, 1987). This algorithm has waxed and waned in popularity but as of this writing is currently the dominant approach to training deep models.

During the 1990s, researchers made important advances in modeling sequences with neural networks. Hochreiter (1991) and Bengio et al. (1994) identified some of the fundamental mathematical difficulties in modeling long sequences, described
Hochreiter and Sc Schmidh hmidh hmidhub ub uber er (1997) introduced the long short-term of the fundamental mathematical difficulties modeling long sequences, describ ed memory or LSTM net netw work to resolv resolvee some ofinthese difficulties. Toda day y, the LSTM in widely Sec. 10.7 . Hochreiter and Schmidh uber tasks, (1997)including introduced thenatural long short-term is used for many sequence mo modeling deling many language memory or LSTM net w ork to resolv e some of these difficulties. T o da y , the LSTM pro processing cessing tasks at Go Google. ogle. is widely used for many sequence modeling tasks, including many natural language The second wa wave ve of neural netw networks orks research lasted un until til the mid-1990s. Venprocessing tasks at Google. tures based on neural netw networks orks and other AI technologies began to make unrealistisecond wa ve ofwhile neural networks research lasted the mid-1990s. VencallyThe ambitious claims seeking inv investments. estments. When un AItil research did not fulfill tures based on neuralexp netw orks and inv other AI technologies bointed. egan to Simultaneously make unrealisti-, these unreasonable expectations, ectations, investors estors were disapp disappointed. Simultaneously, cally ambitious claims while seeking inv estments. When AI research not et fulfill other fields of mac machine hine learning made adv advances. ances. Kernel mac machines hines (did Boser al., these unreasonable exp ectations, inv estors were disapp ointed. Simultaneously 1992; Cortes and Vapnik, 1995; Schölk Schölkopf opf et al. al.,, 1999) and graphical mo models dels (Jor-, other fields hineedlearning madeonadv ances. Kernel machines (Boser et al., dan , 1998 ) bof othmac achiev achieved go goood results many imp importan ortan ortantt tasks. These two factors 1992 ; Cortes andinVthe apnik , 1995; Schölk opf etnetw al., orks 1999)that andlasted graphical dels (Jorled to a decline popularity of neural networks untilmo 2007. 
dan, 1998) both achieved good results on many important tasks. These two factors During this time, neural net netw works con contin tin tinued ued to obtain impressiv impressivee performance led to a decline in the popularity of neural networks that lasted until 2007. on some tasks (LeCun et al., 1998b; Bengio et al., 2001). The Canadian Institute thisResearc time, neural net works con to neural obtain net impressiv performance for During Adv Advanced anced Research h (CIF (CIFAR) AR) help helped ed tin toued keep netw works eresearch alive on some tasks ( LeCun et al. , 1998b ; Bengio et al. , 2001 ). The Canadian Institute via its Neural Computation and Adaptiv Adaptivee Perception (NCAP) research initiative. for Adv anced Researc h (CIF AR) helpedresearc to keep neural led netw orks research alive This program united machine learning research h groups by Geoffrey Hinton via its Neural Computation and Adaptiv e P erception (NCAP) research initiative. at Universit University y of Toron oronto, to, Yosh oshua ua Bengio at Univ Universit ersit ersity y of Montreal, and Yann This program learning researc groupsresearch led by Geoffrey LeCun at Newunited York machine Universit University y. The CIF CIFAR AR hNCAP initiativeHinton had a at Universit y of T oron to, Y osh ua Bengio at Univ ersit y of Montreal, and Yann multi-disciplinary nature that also included neuroscien neuroscientists tists and experts in human LeCun at New Y ork Universit y . The CIF AR NCAP research initiative had a and computer vision. multi-disciplinary nature that also included neuroscientists and experts in human At this poin ointt in time, deep netw networks orks were generally believ elieved ed to be very difficult and computer vision. to train. W Wee now know that algorithms that hav havee existed since the 1980s work Atwell, this but pointhis t in w time, deep networks were generally believ to be vsimply ery difficult quite as not apparent circa 2006. The issue isedperhaps that to train. 
W e now know that algorithms that hav e existed since the 1980s work these algorithms were to too o computationally costly to allo allow w muc much h exp experimentation erimentation quite well, but this w as not apparent circa 2006. The issue is p erhaps simply that with the hardware av available ailable at the time. these algorithms were too computationally costly to allow much experimentation The third wa wav ve of neural netw networks orks research began with a breakthrough in with the hardware available at the time. The third wave of neural networks18research began with a breakthrough in
CHAPTER 1. INTRODUCTION
2006. Geoffrey Hinton showed that a kind of neural network called a deep belief network could be efficiently trained using a strategy called greedy layer-wise pretraining (Hinton et al., 2006), which will be described in more detail in Sec. 15.1. The other CIFAR-affiliated research groups quickly showed that the same strategy could be used to train many other kinds of deep networks (Bengio et al., 2007; Ranzato et al., 2007a) and systematically helped to improve generalization on test examples. This wave of neural networks research popularized the use of the term deep learning to emphasize that researchers were now able to train deeper neural networks than had been possible before, and to focus attention on the theoretical importance of depth (Bengio and LeCun, 2007; Delalleau and Bengio, 2011; Pascanu et al., 2014a; Montufar et al., 2014).

At this time, deep neural networks outperformed competing AI systems based on other machine learning technologies as well as hand-designed functionality. This third wave of popularity of neural networks continues to the time of this writing, though the focus of deep learning research has changed dramatically within the time of this wave. The third wave began with a focus on new unsupervised learning techniques and the ability of deep models to generalize well from small datasets, but today there is more interest in much older supervised learning algorithms and the ability of deep models to leverage large labeled datasets.
1.2.2
Increasing Dataset Sizes
One may wonder why deep learning has only recently become recognized as a crucial technology though the first experiments with artificial neural networks were conducted in the 1950s. Deep learning has been successfully used in commercial applications since the 1990s, but was often regarded as being more of an art than a technology and something that only an expert could use, until recently. It is true that some skill is required to get good performance from a deep learning algorithm. Fortunately, the amount of skill required reduces as the amount of training data increases. The learning algorithms reaching human performance on complex tasks today are nearly identical to the learning algorithms that struggled to solve toy problems in the 1980s, though the models we train with these algorithms have undergone changes that simplify the training of very deep architectures.

The most important new development is that today we can provide these algorithms with the resources they need to succeed. Fig. 1.8 shows how the size of benchmark datasets has increased remarkably over time. This trend is driven by the increasing digitization of society. As more and more of our activities take place on computers, more and more of what we do is recorded. As our computers are increasingly networked together, it becomes easier to centralize these records and curate them
into a dataset appropriate for machine learning applications. The age of "Big Data" has made machine learning much easier because the key burden of statistical estimation (generalizing well to new data after observing only a small amount of data) has been considerably lightened. As of 2016, a rough rule of thumb is that a supervised deep learning algorithm will generally achieve acceptable performance with around 5,000 labeled examples per category, and will match or exceed human performance when trained with a dataset containing at least 10 million labeled examples. Working successfully with datasets smaller than this is an important research area, focusing in particular on how we can take advantage of large quantities of unlabeled examples, with unsupervised or semi-supervised learning.
1.2.3
Increasing Model Sizes

Another key reason that neural networks are wildly successful today after enjoying comparatively little success since the 1980s is that we have the computational resources to run much larger models today. One of the main insights of connectionism is that animals become intelligent when many of their neurons work together. An individual neuron or small collection of neurons is not particularly useful.

Biological neurons are not especially densely connected. As seen in Fig. 1.10, our machine learning models have had a number of connections per neuron that was within an order of magnitude of even mammalian brains for decades.

In terms of the total number of neurons, neural networks have been astonishingly small until quite recently, as shown in Fig. 1.11. Since the introduction of hidden units, artificial neural networks have doubled in size roughly every 2.4 years. This growth is driven by faster computers with larger memory and by the availability of larger datasets. Larger networks are able to achieve higher accuracy on more complex tasks. This trend looks set to continue for decades. Unless new technologies allow faster scaling, artificial neural networks will not have the same number of neurons as the human brain until at least the 2050s. Biological neurons may represent more complicated functions than current artificial neurons, so biological neural networks may be even larger than this plot portrays.

In retrospect, it is not particularly surprising that neural networks with fewer neurons than a leech were unable to solve sophisticated artificial intelligence problems. Even today's networks, which we consider quite large from a computational systems point of view, are smaller than the nervous system of even relatively primitive vertebrate animals like frogs.

The increase in model size over time, due to the availability of faster CPUs,
[Figure 1.8 plot: dataset size (number of examples, log scale from 10^0 to 10^9) versus year (1900-2015), marking datasets including Iris, T vs G vs F, Rotated T vs C, Criminals, MNIST, Public SVHN, ImageNet, CIFAR-10, ILSVRC 2014, ImageNet10k, Sports-1M, WMT, and the Canadian Hansard.]
Figure 1.8: Dataset sizes have increased greatly over time. In the early 1900s, statisticians studied datasets using hundreds or thousands of manually compiled measurements (Garson, 1900; Gosset, 1908; Anderson, 1935; Fisher, 1936). In the 1950s through 1980s, the pioneers of biologically inspired machine learning often worked with small, synthetic datasets, such as low-resolution bitmaps of letters, that were designed to incur low computational cost and demonstrate that neural networks were able to learn specific kinds of functions (Widrow and Hoff, 1960; Rumelhart et al., 1986b). In the 1980s and 1990s, machine learning became more statistical in nature and began to leverage larger datasets containing tens of thousands of examples such as the MNIST dataset (shown in Fig. 1.9) of scans of handwritten numbers (LeCun et al., 1998b). In the first decade of the 2000s, more sophisticated datasets of this same size, such as the CIFAR-10 dataset (Krizhevsky and Hinton, 2009) continued to be produced. Toward the end of that decade and throughout the first half of the 2010s, significantly larger datasets, containing hundreds of thousands to tens of millions of examples, completely changed what was possible with deep learning. These datasets included the public Street View House Numbers dataset (Netzer et al., 2011), various versions of the ImageNet dataset (Deng et al., 2009, 2010a; Russakovsky et al., 2014a), and the Sports-1M dataset (Karpathy et al., 2014). At the top of the graph, we see that datasets of translated sentences, such as IBM's dataset constructed from the Canadian Hansard (Brown et al., 1990) and the WMT 2014 English to French dataset (Schwenk, 2014) are typically far ahead of other dataset sizes.
Figure 1.9: Example inputs from the MNIST dataset. The "NIST" stands for National Institute of Standards and Technology, the agency that originally collected this data. The "M" stands for "modified," since the data has been preprocessed for easier use with machine learning algorithms. The MNIST dataset consists of scans of handwritten digits and associated labels describing which of the digits 0-9 is contained in each image. This simple classification problem is one of the simplest and most widely used tests in deep learning research. It remains popular despite being quite easy for modern techniques to solve. Geoffrey Hinton has described it as "the drosophila of machine learning," meaning that it allows machine learning researchers to study their algorithms in controlled laboratory conditions, much as biologists often study fruit flies.
the advent of general purpose GPUs (described in Sec. 12.1.2), faster network connectivity and better software infrastructure for distributed computing, is one of the most important trends in the history of deep learning. This trend is generally expected to continue well into the future.
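The growth rate quoted earlier in this section (network size doubling roughly every 2.4 years) is enough to reproduce the 2050s projection with back-of-the-envelope arithmetic. A minimal sketch; the starting size of 10^7 neurons for a large mid-2010s network and the figure of 10^11 neurons for the human brain are round numbers assumed for illustration:

```python
import math

def years_to_reach(start, target, doubling_time=2.4):
    """Years for a quantity doubling every `doubling_time` years
    to grow from `start` to `target`."""
    return math.log2(target / start) * doubling_time

# Assumed round figures: ~1e7 neurons for a large mid-2010s network,
# ~1e11 neurons in the human brain.
years = years_to_reach(1e7, 1e11)
print(2015 + years)  # roughly 2047 with these inputs
```

With a smaller assumed starting size (say 10^6 neurons), the same arithmetic gives the mid-2050s, consistent with the "at least the 2050s" estimate in the text.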
1.2.4
Increasing Accuracy, Complexity and Real-World Impact
Since the 1980s, deep learning has consistently improved in its ability to provide accurate recognition or prediction. Moreover, deep learning has consistently been applied with success to broader and broader sets of applications.

The earliest deep models were used to recognize individual objects in tightly cropped, extremely small images (Rumelhart et al., 1986a). Since then there has been a gradual increase in the size of images neural networks could process. Modern object recognition networks process rich high-resolution photographs and do not have a requirement that the photo be cropped near the object to be recognized (Krizhevsky et al., 2012). Similarly, the earliest networks could only recognize two kinds of objects (or in some cases, the absence or presence of a single kind of object), while these modern networks typically recognize at least 1,000 different categories of objects.

The largest contest in object recognition is the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) held each year. A dramatic moment in the meteoric rise of deep learning came when a convolutional network won this challenge for the first time and by a wide margin, bringing down the state-of-the-art top-5 error rate from 26.1% to 15.3% (Krizhevsky et al., 2012), meaning that the convolutional network produces a ranked list of possible categories for each image and the correct category appeared in the first five entries of this list for all but 15.3% of the test examples. Since then, these competitions are consistently won by deep convolutional nets, and as of this writing, advances in deep learning have brought the latest top-5 error rate in this contest down to 3.6%, as shown in Fig. 1.12.
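The top-5 error rate used in this contest can be computed directly from a model's ranked class scores. A minimal sketch (the scores and labels below are invented for illustration):

```python
def top5_error(scores, labels):
    """Fraction of examples whose true label is not among the
    five highest-scoring predicted categories."""
    errors = 0
    for class_scores, true_label in zip(scores, labels):
        # Indices of the five highest-scoring classes for this example.
        top5 = sorted(range(len(class_scores)),
                      key=lambda c: class_scores[c], reverse=True)[:5]
        if true_label not in top5:
            errors += 1
    return errors / len(labels)

# Invented scores over 10 classes for two images:
scores = [
    [0.05, 0.30, 0.20, 0.10, 0.08, 0.07, 0.06, 0.05, 0.05, 0.04],
    [0.50, 0.20, 0.10, 0.08, 0.05, 0.03, 0.02, 0.01, 0.005, 0.005],
]
labels = [2, 9]  # true label 2 is in the top five; label 9 is not
print(top5_error(scores, labels))  # 0.5
```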
Deep learning has also had a dramatic impact on speech recognition. After improving throughout the 1990s, the error rates for speech recognition stagnated starting in about 2000. The introduction of deep learning (Dahl et al., 2010; Deng et al., 2010b; Seide et al., 2011; Hinton et al., 2012a) to speech recognition resulted in a sudden drop of error rates, with some error rates cut in half. We will explore this history in more detail in Sec. 12.3.

Deep networks have also had spectacular successes for pedestrian detection and image segmentation (Sermanet et al., 2013; Farabet et al., 2013; Couprie et al., 2013) and yielded superhuman performance in traffic sign classification (Ciresan
[Figure 1.10 plot: number of connections per neuron (log scale, 10^1 to 10^4) versus year (1950-2015), with reference levels for fruit fly, mouse, cat, and human, and numbered markers 1-10 identifying the models listed in the caption.]
Figure 1.10: Initially, the number of connections between neurons in artificial neural networks was limited by hardware capabilities. Today, the number of connections between neurons is mostly a design consideration. Some artificial neural networks have nearly as many connections per neuron as a cat, and it is quite common for other neural networks to have as many connections per neuron as smaller mammals like mice. Even the human brain does not have an exorbitant amount of connections per neuron. Biological neural network sizes from Wikipedia (2015).
1. Adaptive linear element (Widrow and Hoff, 1960)
2. Neocognitron (Fukushima, 1980)
3. GPU-accelerated convolutional network (Chellapilla et al., 2006)
4. Deep Boltzmann machine (Salakhutdinov and Hinton, 2009a)
5. Unsupervised convolutional network (Jarrett et al., 2009)
6. GPU-accelerated multilayer perceptron (Ciresan et al., 2010)
7. Distributed autoencoder (Le et al., 2012)
8. Multi-GPU convolutional network (Krizhevsky et al., 2012)
9. COTS HPC unsupervised convolutional network (Coates et al., 2013)
10. GoogLeNet (Szegedy et al., 2014a)
et al. al.,, 2012). At the same time that the scale and accuracy of deep netw networks orks has increased, et al., 2012). so has the complexity of the tasks that they can solve. Go Goo odfellow et al. (2014d) A t the same time that the scale and accuracy of deep netw orks has increased, sho show wed that neural netw networks orks could learn to output an entire sequence of characters so has the of therather tasksthan thatjust theyidentifying can solve.aGo odfellow et al. (2014d), transcrib transcribed edcomplexity from an image, single ob object. ject. Previously Previously, showwasedwidely that neural netw orksthis could learn to output an entire sequence of characters it believed that kind of learning required lab labeling eling of the individual transcrib ed from an image, rather than just identifying a single ob ject. Previously elemen elements ts of the sequence (Gülçehre and Bengio, 2013). Recurren Recurrentt neural net netw works,, it whasaswidely believedsequence that thismo kind learning required the individual suc such the LSTM model delofmentioned ab abov ov ove, e, lab areeling nowofused to mo model del elemen ts of the sequence ( Gülçehre and Bengio , 2013 ). Recurren t neural net w orks, relationships bet etw ween se sequenc quenc quences es and other se sequenc quenc quences es rather than just fixed inputs. such sequence-to-sequence as the LSTM sequence modelseems mentioned e, are used to model This learning to be ab onovthe cuspnow of rev revolutionizing olutionizing relationships betweenmachine sequences and other(Sutskev sequences rather than; just fixed inputs. another application: translation Sutskever er et al. al.,, 2014 Bahdanau et al. al.,, This sequence-to-sequence learning seems to b e on the cusp of rev olutionizing 2015 2015). ). another application: machine translation (Sutskever et al., 2014; Bahdanau et al., This trend of increasing complexit complexity y has been pushed to its logical conclusion 2015). 
with the introduction of neural Turing machines (Grav Graves es et al. al.,, 2014a) that learn This trend of increasing complexit y has b een pushed to logicalcells. conclusion to read from memory cells and write arbitrary con conten ten tentt to its memory Suc Such h with the introduction of neural T uring machines ( Grav es et al. , 2014a ) that learn neural net netw works can learn simple programs from examples of desired behavior. For to read from andlists write conten t to memory cells. Suc h example, they memory can learncells to sort of arbitrary num umbers bers given examples of scrambled and neural net works canThis learn simple programs technology from examples desired behavior. For sorted sequences. self-programming is inofits infancy infancy, , but in the example, theyincan learn to lists to of nearly numbers future could principle besort applied an any ygiven task.examples of scrambled and sorted sequences. This self-programming technology is in its infancy, but in the Another crowning achiev achievement ement of deep learning is its extension to the domain future could in principle be applied to nearly any task. of reinfor einforccement le learning arning arning.. In the context of reinforcement learning, an autonomous Another crowning achievement is its extension to the domain agen agent t must learn to perform a task of bydeep triallearning and error, without an any y guidance from of r einfor c ement le arning . In the context of reinforcement learning, an autonomous the human op operator. erator. DeepMind demonstrated that a reinforcement learning system agen t must learn to perform a taskofby trial and without anygames, guidance from based on deep learning is capable learning to error, play Atari video reaching the human op erator. DeepMind demonstrated that a reinforcement learning system human-lev uman-level el performance on many tasks (Mnih et al., 2015). 
Deep learning has also significantly improved the performance of reinforcement learning for robotics (Finn et al., 2015).

Many of these applications of deep learning are highly profitable. Deep learning is now used by many top technology companies including Google, Microsoft, Facebook, IBM, Baidu, Apple, Adobe, Netflix, NVIDIA and NEC.

Advances in deep learning have also depended heavily on advances in software infrastructure. Software libraries such as Theano (Bergstra et al., 2010; Bastien et al., 2012), PyLearn2 (Goodfellow et al., 2013c), Torch (Collobert et al., 2011b), DistBelief (Dean et al., 2012), Caffe (Jia, 2013), MXNet (Chen et al., 2015), and TensorFlow (Abadi et al., 2015) have all supported important research projects or commercial products.
Deep learning has also made contributions back to other sciences. Modern convolutional networks for object recognition provide a model of visual processing
CHAPTER 1. INTRODUCTION
that neuroscientists can study (DiCarlo, 2013). Deep learning also provides useful tools for processing massive amounts of data and making useful predictions in scientific fields. It has been successfully used to predict how molecules will interact in order to help pharmaceutical companies design new drugs (Dahl et al., 2014), to search for subatomic particles (Baldi et al., 2014), and to automatically parse microscope images used to construct a 3-D map of the human brain (Knowles-Barley et al., 2014). We expect deep learning to appear in more and more scientific fields in the future.

In summary, deep learning is an approach to machine learning that has drawn heavily on our knowledge of the human brain, statistics and applied math as it developed over the past several decades. In recent years, it has seen tremendous growth in its popularity and usefulness, due in large part to more powerful computers, larger datasets and techniques to train deeper networks. The years ahead are full of challenges and opportunities to improve deep learning even further and bring it to new frontiers.
Figure 1.11: Increasing neural network size over time. Since the introduction of hidden units, artificial neural networks have doubled in size roughly every 2.4 years. Biological neural network sizes from Wikipedia (2015). (The plot shows number of neurons, on a logarithmic scale from 10^-2 to 10^11, against year, 1950 to 2056, with biological reference points including the sponge, roundworm, leech, ant, bee, frog, octopus and human.)

1. Perceptron (Rosenblatt, 1958, 1962)
2. Adaptive linear element (Widrow and Hoff, 1960)
3. Neocognitron (Fukushima, 1980)
4. Early back-propagation network (Rumelhart et al., 1986b)
5. Recurrent neural network for speech recognition (Robinson and Fallside, 1991)
6. Multilayer perceptron for speech recognition (Bengio et al., 1991)
7. Mean field sigmoid belief network (Saul et al., 1996)
8. LeNet-5 (LeCun et al., 1998b)
9. Echo state network (Jaeger and Haas, 2004)
10. Deep belief network (Hinton et al., 2006)
11. GPU-accelerated convolutional network (Chellapilla et al., 2006)
12. Deep Boltzmann machine (Salakhutdinov and Hinton, 2009a)
13. GPU-accelerated deep belief network (Raina et al., 2009)
14. Unsupervised convolutional network (Jarrett et al., 2009)
15. GPU-accelerated multilayer perceptron (Ciresan et al., 2010)
16. OMP-1 network (Coates and Ng, 2011)
17. Distributed autoencoder (Le et al., 2012)
18. Multi-GPU convolutional network (Krizhevsky et al., 2012)
19. COTS HPC unsupervised convolutional network (Coates et al., 2013)
20. GoogLeNet (Szegedy et al., 2014a)
Figure 1.12: Decreasing error rate over time. Since deep networks reached the scale necessary to compete in the ImageNet Large Scale Visual Recognition Challenge, they have consistently won the competition every year, and yielded lower and lower error rates each time. Data from Russakovsky et al. (2014b) and He et al. (2015). (The plot shows ILSVRC classification error rate, on a scale from 0.00 to 0.30, against year, 2010 through 2015.)
Part I
Applied Math and Machine Learning Basics
This part of the book introduces the basic mathematical concepts needed to understand deep learning. We begin with general ideas from applied math that allow us to define functions of many variables, find the highest and lowest points on these functions and quantify degrees of belief.

Next, we describe the fundamental goals of machine learning. We describe how to accomplish these goals by specifying a model that represents certain beliefs, designing a cost function that measures how well those beliefs correspond with reality and using a training algorithm to minimize that cost function.

This elementary framework is the basis for a broad variety of machine learning algorithms, including approaches to machine learning that are not deep. In the subsequent parts of the book, we develop deep learning algorithms within this framework.
Chapter 2
Linear Algebra

Linear algebra is a branch of mathematics that is widely used throughout science and engineering. However, because linear algebra is a form of continuous rather than discrete mathematics, many computer scientists have little experience with it. A good understanding of linear algebra is essential for understanding and working with many machine learning algorithms, especially deep learning algorithms. We therefore precede our introduction to deep learning with a focused presentation of the key linear algebra prerequisites.

If you are already familiar with linear algebra, feel free to skip this chapter. If you have previous experience with these concepts but need a detailed reference sheet to review key formulas, we recommend The Matrix Cookbook (Petersen and Pedersen, 2006).
If you have no exposure at all to linear algebra, this chapter will teach you enough to read this book, but we highly recommend that you also consult another resource focused exclusively on teaching linear algebra, such as Shilov (1977). This chapter will completely omit many important linear algebra topics that are not essential for understanding deep learning.
2.1 Scalars, Vectors, Matrices and Tensors
The study of linear algebra involves several types of mathematical objects:

• Scalars: A scalar is just a single number, in contrast to most of the other objects studied in linear algebra, which are usually arrays of multiple numbers. We write scalars in italics. We usually give scalars lower-case variable names. When we introduce them, we specify what kind of number they are. For
example, we might say "Let s ∈ R be the slope of the line," while defining a real-valued scalar, or "Let n ∈ N be the number of units," while defining a natural number scalar.

• Vectors: A vector is an array of numbers. The numbers are arranged in order. We can identify each individual number by its index in that ordering. Typically we give vectors lower case names written in bold typeface, such as x. The elements of the vector are identified by writing its name in italic typeface, with a subscript. The first element of x is x_1, the second element is x_2 and so on. We also need to say what kind of numbers are stored in the vector. If each element is in R, and the vector has n elements, then the vector lies in the set formed by taking the Cartesian product of R n times, denoted as R^n.
When we need to explicitly identify the elements of a vector, we write them as a column enclosed in square brackets:

        [ x_1 ]
    x = [ x_2 ]    (2.1)
        [  ⋮  ]
        [ x_n ]

We can think of vectors as identifying points in space, with each element giving the coordinate along a different axis.

Sometimes we need to index a set of elements of a vector. In this case, we define a set containing the indices and write the set as a subscript. For example, to access x_1, x_3 and x_6, we define the set S = {1, 3, 6} and write x_S. We use the − sign to index the complement of a set. For example x_{−1} is the vector containing all elements of x except for x_1, and x_{−S} is the vector containing all of the elements of x except for x_1, x_3 and x_6.

• Matrices: A matrix is a 2-D array of numbers, so each element is identified by two indices instead of just one. We usually give matrices upper-case variable names with bold typeface, such as A. If a real-valued matrix A has a height
of m and a width of n, then we say that A ∈ R^{m×n}. We usually identify the elements of a matrix using its name in italic but not bold font, and the indices are listed with separating commas. For example, A_{1,1} is the upper left entry of A and A_{m,n} is the bottom right entry. We can identify all the numbers with vertical coordinate i by writing a ":" for the horizontal coordinate. For example, A_{i,:} denotes the horizontal cross section of A with vertical coordinate i. This is known as the i-th row of A. Likewise, A_{:,i} is
        [ A_{1,1}  A_{1,2} ]
    A = [ A_{2,1}  A_{2,2} ]   ⇒   A^T = [ A_{1,1}  A_{2,1}  A_{3,1} ]
        [ A_{3,1}  A_{3,2} ]             [ A_{1,2}  A_{2,2}  A_{3,2} ]

Figure 2.1: The transpose of the matrix can be thought of as a mirror image across the main diagonal.
the i-th column of A. When we need to explicitly identify the elements of a matrix, we write them as an array enclosed in square brackets:

    [ A_{1,1}  A_{1,2} ]
    [ A_{2,1}  A_{2,2} ]    (2.2)

Sometimes we may need to index matrix-valued expressions that are not just a single letter. In this case, we use subscripts after the expression, but do not convert anything to lower case. For example, f(A)_{i,j} gives element (i, j) of the matrix computed by applying the function f to A.

• Tensors: In some cases we will need an array with more than two axes. In the general case, an array of numbers arranged on a regular grid with a variable number of axes is known as a tensor. We denote a tensor named "A" with this typeface: A. We identify the element of A at coordinates (i, j, k) by writing A_{i,j,k}.

One important operation on matrices is the transpose. The transpose of a matrix is the mirror image of the matrix across a diagonal line, called the main diagonal, running down and to the right, starting from its upper left corner. See Fig. 2.1 for a graphical depiction of this operation. We denote the transpose of a matrix A as A^T, and it is defined such that

    (A^T)_{i,j} = A_{j,i}.    (2.3)

Vectors can be thought of as matrices that contain only one column. The transpose of a vector is therefore a matrix with only one row. Sometimes we
define a vector by writing out its elements in the text inline as a row matrix, then using the transpose operator to turn it into a standard column vector, e.g., x = [x_1, x_2, x_3]^T.

A scalar can be thought of as a matrix with only a single entry. From this, we can see that a scalar is its own transpose: a = a^T.

We can add matrices to each other, as long as they have the same shape, just by adding their corresponding elements: C = A + B where C_{i,j} = A_{i,j} + B_{i,j}.

We can also add a scalar to a matrix or multiply a matrix by a scalar, just by performing that operation on each element of a matrix: D = a · B + c where D_{i,j} = a · B_{i,j} + c.

In the context of deep learning, we also use some less conventional notation. We allow the addition of a matrix and a vector, yielding another matrix: C = A + b, where C_{i,j} = A_{i,j} + b_j. In other words, the vector b is added to each row of the matrix. This shorthand eliminates the need to define a matrix with b copied into each row
before doing the addition. This implicit copying of b to many locations is called broadcasting.
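This broadcasting convention is implemented directly by numerical array libraries. A minimal sketch using NumPy (our choice of library for illustration; the values are our own, not from the text):

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.]])    # matrix of shape (2, 3)
b = np.array([10., 20., 30.])   # vector of length 3

# C = A + b: b is implicitly copied into each row of A,
# so C[i, j] = A[i, j] + b[j], matching the shorthand above.
C = A + b
print(C)
# [[11. 22. 33.]
#  [14. 25. 36.]]
```

No copy of b is ever materialized; the library applies the row-wise addition directly.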
2.2 Multiplying Matrices and Vectors
One of the most important operations involving matrices is multiplication of two matrices. The matrix product of matrices A and B is a third matrix C. In order for this product to be defined, A must have the same number of columns as B has rows. If A is of shape m × n and B is of shape n × p, then C is of shape m × p. We can write the matrix product just by placing two or more matrices together, e.g.

    C = AB.    (2.4)

The product operation is defined by
    C_{i,j} = Σ_k A_{i,k} B_{k,j}.    (2.5)

Note that the standard product of two matrices is not just a matrix containing the product of the individual elements. Such an operation exists and is called the element-wise product or Hadamard product, and is denoted as A ⊙ B.

The dot product between two vectors x and y of the same dimensionality is the matrix product x^T y. We can think of the matrix product C = AB as computing C_{i,j} as the dot product between row i of A and column j of B.
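The two products are easy to confuse in code, so a small sketch may help (NumPy, with example matrices of our own choosing; `@` is matrix multiplication and `*` is element-wise):

```python
import numpy as np

A = np.array([[1., 2.],
              [3., 4.]])
B = np.array([[5., 6.],
              [7., 8.]])

C = A @ B   # matrix product, Eq. 2.5: C[i, j] = sum_k A[i, k] * B[k, j]
H = A * B   # element-wise (Hadamard) product: a different operation

# C[i, j] is the dot product of row i of A with column j of B.
assert C[0, 1] == A[0, :] @ B[:, 1]

# The dot product of two vectors x and y is the matrix product x^T y.
x = np.array([1., 2., 3.])
y = np.array([4., 5., 6.])
print(x @ y)  # 32.0
```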
Matrix product operations have many useful properties that make mathematical analysis of matrices more convenient. For example, matrix multiplication is distributive:

    A(B + C) = AB + AC.    (2.6)

It is also associative:

    A(BC) = (AB)C.    (2.7)

Matrix multiplication is not commutative (the condition AB = BA does not always hold), unlike scalar multiplication. However, the dot product between two vectors is commutative:

    x^T y = y^T x.    (2.8)

The transpose of a matrix product has a simple form:
    (AB)^T = B^T A^T.    (2.9)

This allows us to demonstrate Eq. 2.8, by exploiting the fact that the value of such a product is a scalar and therefore equal to its own transpose:
    x^T y = (x^T y)^T = y^T x.    (2.10)

Since the focus of this textbook is not linear algebra, we do not attempt to develop a comprehensive list of useful properties of the matrix product here, but the reader should be aware that many more exist.

We now know enough linear algebra to write down a system of linear equations:

    Ax = b    (2.11)

where A ∈ R^{m×n} is a known matrix, b ∈ R^m is a known vector, and x ∈ R^n is a vector of unknown variables we would like to solve for. Each element x_i of x is one of these unknown variables. Each row of A and each element of b provide another constraint. We can rewrite Eq. 2.11 as:

    A_{1,:} x = b_1    (2.12)
    A_{2,:} x = b_2    (2.13)
            ...        (2.14)
    A_{m,:} x = b_m    (2.15)

or, even more explicitly, as:

    A_{1,1} x_1 + A_{1,2} x_2 + ... + A_{1,n} x_n = b_1    (2.16)
    [ 1  0  0 ]
    [ 0  1  0 ]
    [ 0  0  1 ]

Figure 2.2: Example identity matrix: This is I_3.
    A_{2,1} x_1 + A_{2,2} x_2 + ... + A_{2,n} x_n = b_2    (2.17)
            ...                                            (2.18)
    A_{m,1} x_1 + A_{m,2} x_2 + ... + A_{m,n} x_n = b_m.   (2.19)

Matrix-vector product notation provides a more compact representation for equations of this form.
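The correspondence between the compact form Ax = b and the row-by-row constraints of Eqs. 2.12-2.15 can be checked numerically. A brief sketch (a small system of our own invention, in NumPy):

```python
import numpy as np

A = np.array([[2., 1.],
              [1., 3.]])   # known matrix
x = np.array([1., 2.])     # pretend the unknowns are known, so we can check
b = A @ x                  # then b = [4., 7.]

# Each row of A together with one element of b is one constraint:
for i in range(A.shape[0]):
    assert A[i, :] @ x == b[i]   # A_{i,:} x = b_i
```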
2.3 Identity and Inverse Matrices
Linear algebra offers a powerful tool called matrix inversion that allows us to analytically solve Eq. 2.11 for many values of A.

To describe matrix inversion, we first need to define the concept of an identity matrix. An identity matrix is a matrix that does not change any vector when we multiply that vector by that matrix. We denote the identity matrix that preserves n-dimensional vectors as I_n. Formally, I_n ∈ R^{n×n}, and

    ∀x ∈ R^n, I_n x = x.    (2.20)

The structure of the identity matrix is simple: all of the entries along the main diagonal are 1, while all of the other entries are zero. See Fig. 2.2 for an example.

The matrix inverse of A is denoted as A^{-1}, and it is defined as the matrix such that

    A^{-1} A = I_n.    (2.21)

We can now solve Eq. 2.11 by the following steps:

    Ax = b                  (2.22)
    A^{-1} Ax = A^{-1} b    (2.23)
    I_n x = A^{-1} b        (2.24)
    x = A^{-1} b.    (2.25)

Of course, this depends on it being possible to find A^{-1}. We discuss the conditions for the existence of A^{-1} in the following section.

When A^{-1} exists, several different algorithms exist for finding it in closed form. In theory, the same inverse matrix can then be used to solve the equation many times for different values of b. However, A^{-1} is primarily useful as a theoretical tool, and should not actually be used in practice for most software applications. Because A^{-1} can be represented with only limited precision on a digital computer, algorithms that make use of the value of b can usually obtain more accurate estimates of x.
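This advice maps directly onto standard library routines: a solver that works from A and b together is preferred over explicitly forming A^{-1}. A sketch in NumPy (the example system is our own):

```python
import numpy as np

A = np.array([[3., 1.],
              [1., 2.]])
b = np.array([9., 8.])

# Theoretical route, Eq. 2.25: x = A^{-1} b
x_via_inverse = np.linalg.inv(A) @ b

# Practical route: solve Ax = b directly, without forming A^{-1}.
# This is generally the more accurate choice in finite precision.
x_via_solve = np.linalg.solve(A, b)

assert np.allclose(x_via_inverse, x_via_solve)
assert np.allclose(A @ x_via_solve, b)
print(x_via_solve)   # [2. 3.]
```

For this small well-conditioned system both routes agree; the accuracy gap favoring the direct solver grows as A becomes ill-conditioned.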
2.4
Linear Dependence and Span
In order for A^{-1} to exist, Eq. 2.11 must have exactly one solution for every value of b. However, it is also possible for the system of equations to have no solutions or infinitely many solutions for some values of b. It is not possible to have more than one but less than infinitely many solutions for a particular b; if both x and y are solutions then

z = αx + (1 − α)y    (2.26)

is also a solution for any real α.

To analyze how many solutions the equation has, we can think of the columns of A as specifying different directions we can travel from the origin (the point specified by the vector of all zeros), and determine how many ways there are of reaching b. In this view, each element of x specifies how far we should travel in each of these directions, with x_i specifying how far to move in the direction of column i:

Ax = Σ_i x_i A_{:,i}.    (2.27)

In general, this kind of operation is called a linear combination. Formally, a linear combination of some set of vectors {v^{(1)}, ..., v^{(n)}} is given by multiplying each vector v^{(i)} by a corresponding scalar coefficient and adding the results:

Σ_i c_i v^{(i)}.    (2.28)

The span of a set of vectors is the set of all points obtainable by linear combination of the original vectors.
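The column-combination view of Eq. 2.27 can be verified directly; this NumPy sketch uses an arbitrary illustrative matrix:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [3.0, 0.0]])   # columns are the "directions" we can travel
x = np.array([2.0, -1.0])    # how far to travel in each direction

# Eq. 2.27: Ax equals the columns of A weighted by the entries of x.
combo = x[0] * A[:, 0] + x[1] * A[:, 1]
assert np.allclose(A @ x, combo)
```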
Determining whether Ax = b has a solution thus amounts to testing whether b is in the span of the columns of A. This particular span is known as the column space or the range of A.

In order for the system Ax = b to have a solution for all values of b ∈ R^m, we therefore require that the column space of A be all of R^m. If any point in R^m is excluded from the column space, that point is a potential value of b that has no solution. The requirement that the column space of A be all of R^m immediately implies that A must have at least m columns, i.e., n ≥ m. Otherwise, the dimensionality of the column space would be less than m. For example, consider a 3 × 2 matrix. The target b is 3-D, but x is only 2-D, so modifying the value of x at best allows us to trace out a 2-D plane within R^3. The equation has a solution if and only if b lies on that plane.

Having n ≥ m is only a necessary condition for every point to have a solution. It is not a sufficient condition, because it is possible for some of the columns to be redundant. Consider a 2 × 2 matrix where both of the columns are equal to each other. This has the same column space as a 2 × 1 matrix containing only one copy of the replicated column. In other words, the column space is still just a line, and fails to encompass all of R^2, even though there are two columns.

Formally, this kind of redundancy is known as linear dependence. A set of vectors is linearly independent if no vector in the set is a linear combination of the other vectors. If we add a vector to a set that is a linear combination of the other vectors in the set, the new vector does not add any points to the set's span. This means that for the column space of the matrix to encompass all of R^m, the matrix must contain at least one set of m linearly independent columns. This condition is both necessary and sufficient for Eq. 2.11 to have a solution for every value of b.

Note that the requirement is for a set to have exactly m linearly independent columns, not at least m. No set of m-dimensional vectors can have more than m mutually linearly independent columns, but a matrix with more than m columns may have more than one such set.

In order for the matrix to have an inverse, we additionally need to ensure that Eq. 2.11 has at most one solution for each value of b. To do so, we need to ensure that the matrix has at most m columns. Otherwise there is more than one way of parametrizing each solution.

Together, this means that the matrix must be square, that is, we require that m = n and that all of the columns must be linearly independent. A square matrix with linearly dependent columns is known as singular.

If A is not square or is square but singular, it can still be possible to solve the equation. However, we cannot use the method of matrix inversion to find the solution.

So far we have discussed matrix inverses as being multiplied on the left. It is also possible to define an inverse that is multiplied on the right:

AA^{-1} = I.    (2.29)

For square matrices, the left inverse and right inverse are equal.
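The singularity condition can be checked numerically. In this NumPy sketch (with made-up matrices), a square matrix with linearly dependent columns has rank below m, and attempting to invert it fails:

```python
import numpy as np

singular = np.array([[1.0, 1.0],
                     [2.0, 2.0]])   # columns are linearly dependent
invertible = np.array([[1.0, 0.0],
                       [0.0, 2.0]])

# Rank below m = 2 means the column space misses part of R^2.
assert np.linalg.matrix_rank(singular) == 1
assert np.linalg.matrix_rank(invertible) == 2

# A singular matrix has no inverse; NumPy raises an error.
try:
    np.linalg.inv(singular)
except np.linalg.LinAlgError:
    print("singular matrix has no inverse")
```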
2.5
Norms
Sometimes we need to measure the size of a vector. In machine learning, we usually measure the size of vectors using a function called a norm. Formally, the L^p norm is given by

||x||_p = (Σ_i |x_i|^p)^{1/p}    (2.30)

for p ∈ R, p ≥ 1.

Norms, including the L^p norm, are functions mapping vectors to non-negative values. On an intuitive level, the norm of a vector x measures the distance from the origin to the point x. More rigorously, a norm is any function f that satisfies the following properties:

• f(x) = 0 ⇒ x = 0
• f(x + y) ≤ f(x) + f(y) (the triangle inequality)
• ∀α ∈ R, f(αx) = |α| f(x)

The L^2 norm, with p = 2, is known as the Euclidean norm. It is simply the Euclidean distance from the origin to the point identified by x. The L^2 norm is used so frequently in machine learning that it is often denoted simply as ||x||, with the subscript 2 omitted. It is also common to measure the size of a vector using the squared L^2 norm, which can be calculated simply as x^⊤x.

The squared L^2 norm is more convenient to work with mathematically and computationally than the L^2 norm itself. For example, the derivatives of the squared L^2 norm with respect to each element of x each depend only on the corresponding element of x, while all of the derivatives of the L^2 norm depend on the entire vector. In many contexts, the squared L^2 norm may be undesirable because it increases very slowly near the origin. In several machine learning applications, it is important to discriminate between elements that are exactly zero and elements that are small but nonzero. In these cases, we turn to a function that grows at the same rate in all locations, but retains mathematical simplicity: the L^1 norm. The L^1 norm may be simplified to

||x||_1 = Σ_i |x_i|.    (2.31)

The L^1 norm is commonly used in machine learning when the difference between zero and nonzero elements is very important. Every time an element of x moves away from 0 by ε, the L^1 norm increases by ε.

We sometimes measure the size of the vector by counting its number of nonzero elements. Some authors refer to this function as the "L^0 norm," but this is incorrect terminology. The number of non-zero entries in a vector is not a norm, because scaling the vector by α does not change the number of nonzero entries. The L^1 norm is often used as a substitute for the number of nonzero entries.

One other norm that commonly arises in machine learning is the L^∞ norm, also known as the max norm. This norm simplifies to the absolute value of the element with the largest magnitude in the vector,

||x||_∞ = max_i |x_i|.    (2.32)

Sometimes we may also wish to measure the size of a matrix. In the context of deep learning, the most common way to do this is with the otherwise obscure Frobenius norm

||A||_F = (Σ_{i,j} A_{i,j}^2)^{1/2},    (2.33)

which is analogous to the L^2 norm of a vector.

The dot product of two vectors can be rewritten in terms of norms. Specifically,

x^⊤y = ||x||_2 ||y||_2 cos θ,    (2.34)

where θ is the angle between x and y.
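All of the norms above are exposed by NumPy's `np.linalg.norm` (the `ord` argument selects p); this sketch, with illustrative vectors, checks Eqs. 2.30 through 2.34:

```python
import numpy as np

x = np.array([3.0, -4.0])

l2 = np.linalg.norm(x)                # Eq. 2.30 with p = 2: sqrt(9 + 16) = 5
l1 = np.linalg.norm(x, ord=1)         # Eq. 2.31: |3| + |-4| = 7
linf = np.linalg.norm(x, ord=np.inf)  # Eq. 2.32: max(|3|, |-4|) = 4
assert l2 == 5.0 and l1 == 7.0 and linf == 4.0

# The squared L^2 norm is simply x^T x.
assert np.isclose(x @ x, l2 ** 2)

# The Frobenius norm (Eq. 2.33) is NumPy's default matrix norm.
A = np.array([[1.0, 2.0], [3.0, 4.0]])
assert np.isclose(np.linalg.norm(A), np.sqrt(1 + 4 + 9 + 16))

# Eq. 2.34: the dot product in terms of norms and the angle between vectors.
y = np.array([4.0, 3.0])
cos_theta = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
assert np.isclose(x @ y, np.linalg.norm(x) * np.linalg.norm(y) * cos_theta)
```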
2.6
Special Kinds of Matrices and Vectors

Some special kinds of matrices and vectors are particularly useful.
Diagonal matrices consist mostly of zeros and have non-zero entries only along the main diagonal. Formally, a matrix D is diagonal if and only if D_{i,j} = 0 for all i ≠ j. We have already seen one example of a diagonal matrix: the identity matrix, where all of the diagonal entries are 1. We write diag(v) to denote a square diagonal matrix whose diagonal entries are given by the entries of the vector v. Diagonal matrices are of interest in part because multiplying by a diagonal matrix is very computationally efficient. To compute diag(v)x, we only need to scale each element x_i by v_i. In other words, diag(v)x = v ⊙ x. Inverting a square diagonal matrix is also efficient. The inverse exists only if every diagonal entry is nonzero, and in that case, diag(v)^{-1} = diag([1/v_1, ..., 1/v_n]^⊤). In many cases, we may derive some very general machine learning algorithm in terms of arbitrary matrices, but obtain a less expensive (and less descriptive) algorithm by restricting some matrices to be diagonal.

Not all diagonal matrices need be square. It is possible to construct a rectangular diagonal matrix. Non-square diagonal matrices do not have inverses but it is still possible to multiply by them cheaply. For a non-square diagonal matrix D, the product Dx will involve scaling each element of x, and either concatenating some zeros to the result if D is taller than it is wide, or discarding some of the last elements of the vector if D is wider than it is tall.

A symmetric matrix is any matrix that is equal to its own transpose:

A = A^⊤.    (2.35)

Symmetric matrices often arise when the entries are generated by some function of two arguments that does not depend on the order of the arguments. For example, if A is a matrix of distance measurements, with A_{i,j} giving the distance from point i to point j, then A_{i,j} = A_{j,i} because distance functions are symmetric.

A unit vector is a vector with unit norm:

||x||_2 = 1.    (2.36)

A vector x and a vector y are orthogonal to each other if x^⊤y = 0. If both vectors have nonzero norm, this means that they are at a 90 degree angle to each other. In R^n, at most n vectors may be mutually orthogonal with nonzero norm. If the vectors are not only orthogonal but also have unit norm, we call them orthonormal.

An orthogonal matrix is a square matrix whose rows are mutually orthonormal and whose columns are mutually orthonormal:

A^⊤A = AA^⊤ = I.    (2.37)
This implies that

A^{-1} = A^⊤,    (2.38)

so orthogonal matrices are of interest because their inverse is very cheap to compute. Pay careful attention to the definition of orthogonal matrices. Counterintuitively, their rows are not merely orthogonal but fully orthonormal. There is no special term for a matrix whose rows or columns are orthogonal but not orthonormal.
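The special matrices of this section are easy to experiment with numerically. This NumPy sketch (illustrative values only) checks the diagonal, symmetric, and orthogonal properties above:

```python
import numpy as np

# Diagonal: multiplying by diag(v) is just elementwise scaling,
# and its inverse reciprocates the diagonal entries.
v = np.array([2.0, 4.0, 5.0])
x = np.array([1.0, -1.0, 3.0])
assert np.allclose(np.diag(v) @ x, v * x)
assert np.allclose(np.linalg.inv(np.diag(v)), np.diag(1.0 / v))

# Symmetric: a pairwise-distance matrix satisfies A = A^T (Eq. 2.35).
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 0.0]])
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
assert np.allclose(D, D.T)

# Orthogonal: a rotation matrix satisfies Q^T Q = Q Q^T = I (Eq. 2.37),
# so its inverse is just its transpose (Eq. 2.38).
theta = 0.3  # an arbitrary rotation angle
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
assert np.allclose(Q.T @ Q, np.eye(2)) and np.allclose(Q @ Q.T, np.eye(2))
assert np.allclose(np.linalg.inv(Q), Q.T)
```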
2.7
Eigendecomposition
Many mathematical objects can be understood better by breaking them into constituent parts, or finding some properties of them that are universal, not caused by the way we choose to represent them.

For example, integers can be decomposed into prime factors. The way we represent the number 12 will change depending on whether we write it in base ten or in binary, but it will always be true that 12 = 2 × 2 × 3. From this representation we can conclude useful properties, such as that 12 is not divisible by 5, or that any integer multiple of 12 will be divisible by 3.

Much as we can discover something about the true nature of an integer by decomposing it into prime factors, we can also decompose matrices in ways that show us information about their functional properties that is not obvious from the representation of the matrix as an array of elements.

One of the most widely used kinds of matrix decomposition is called eigendecomposition, in which we decompose a matrix into a set of eigenvectors and eigenvalues.

An eigenvector of a square matrix A is a non-zero vector v such that multiplication by A alters only the scale of v:

Av = λv.    (2.39)

The scalar λ is known as the eigenvalue corresponding to this eigenvector. (One can also find a left eigenvector such that v^⊤A = λv^⊤, but we are usually concerned with right eigenvectors).

If v is an eigenvector of A, then so is any rescaled vector sv for s ∈ R, s ≠ 0. Moreover, sv still has the same eigenvalue. For this reason, we usually only look for unit eigenvectors.

Suppose that a matrix A has n linearly independent eigenvectors, {v^{(1)}, ..., v^{(n)}}, with corresponding eigenvalues {λ_1, ..., λ_n}. We may concatenate all of the
[Figure 2.3 omitted: two panels, "Before multiplication" (axes x_0, x_1) and "After multiplication" (axes x'_0, x'_1), showing the unit circle and its image, with the eigenvectors v^{(1)} and v^{(2)} scaled to λ_1 v^{(1)} and λ_2 v^{(2)}.]

Figure 2.3: An example of the effect of eigenvectors and eigenvalues. Here, we have a matrix A with two orthonormal eigenvectors, v^{(1)} with eigenvalue λ_1 and v^{(2)} with eigenvalue λ_2. (Left) We plot the set of all unit vectors u ∈ R^2 as a unit circle. (Right) We plot the set of all points Au. By observing the way that A distorts the unit circle, we can see that it scales space in direction v^{(i)} by λ_i.
eigenvectors to form a matrix V with one eigenvector per column: V = [v^{(1)}, ..., v^{(n)}]. Likewise, we can concatenate the eigenvalues to form a vector λ = [λ_1, ..., λ_n]^⊤. The eigendecomposition of A is then given by

A = V diag(λ) V^{-1}.    (2.40)

We have seen that constructing matrices with specific eigenvalues and eigenvectors allows us to stretch space in desired directions. However, we often want to decompose matrices into their eigenvalues and eigenvectors. Doing so can help us to analyze certain properties of the matrix, much as decomposing an integer into its prime factors can help us understand the behavior of that integer.

Not every matrix can be decomposed into eigenvalues and eigenvectors. In some
cases, the decomposition exists, but may involve complex rather than real numbers. Fortunately, in this book, we usually need to decompose only a specific class of matrices that have a simple decomposition. Specifically, every real symmetric matrix can be decomposed into an expression using only real-valued eigenvectors and eigenvalues:

A = QΛQ^⊤,    (2.41)

where Q is an orthogonal matrix composed of eigenvectors of A, and Λ is a diagonal matrix. The eigenvalue Λ_{i,i} is associated with the eigenvector in column i of Q, denoted as Q_{:,i}. Because Q is an orthogonal matrix, we can think of A as scaling space by λ_i in direction v^{(i)}. See Fig. 2.3 for an example.

While any real symmetric matrix A is guaranteed to have an eigendecomposition, the eigendecomposition may not be unique. If any two or more eigenvectors share the same eigenvalue, then any set of orthogonal vectors lying in their span are also eigenvectors with that eigenvalue, and we could equivalently choose a Q using those eigenvectors instead. By convention, we usually sort the entries of Λ in descending order. Under this convention, the eigendecomposition is unique only if all of the eigenvalues are unique.

The eigendecomposition of a matrix tells us many useful facts about the matrix. The matrix is singular if and only if any of the eigenvalues are 0. The eigendecomposition of a real symmetric matrix can also be used to optimize quadratic expressions of the form f(x) = x^⊤Ax subject to ||x||_2 = 1. Whenever x is equal to an eigenvector of A, f takes on the value of the corresponding eigenvalue. The maximum value of f within the constraint region is the maximum eigenvalue and its minimum value within the constraint region is the minimum eigenvalue.
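For real symmetric matrices, the decomposition of Eq. 2.41 can be computed with NumPy's `np.linalg.eigh` (which returns eigenvalues in ascending rather than our descending convention); a sketch with an illustrative matrix:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])      # real symmetric, so Eq. 2.41 applies

eigvals, Q = np.linalg.eigh(A)  # eigh is specialized to symmetric matrices

# Eq. 2.41: A = Q Lambda Q^T, with Q orthogonal.
assert np.allclose(Q @ np.diag(eigvals) @ Q.T, A)
assert np.allclose(Q.T @ Q, np.eye(2))

# The eigenvalues here are 1 and 3; both positive, so A is positive definite.
assert np.allclose(eigvals, [1.0, 3.0])

# x^T A x over unit vectors stays between the min and max eigenvalue.
rng = np.random.default_rng(0)
u = rng.standard_normal(2)
u /= np.linalg.norm(u)
assert eigvals[0] - 1e-9 <= u @ A @ u <= eigvals[-1] + 1e-9
```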
A matrix whose eigenvalues are all positive is called positive definite. A matrix whose eigenvalues are all positive or zero-valued is called positive semidefinite. Likewise, if all eigenvalues are negative, the matrix is negative definite, and if all eigenvalues are negative or zero-valued, it is negative semidefinite. Positive semidefinite matrices are interesting because they guarantee that ∀x, x^T A x ≥ 0. Positive definite matrices additionally guarantee that x^T A x = 0 ⇒ x = 0.

2.8 Singular Value Decomposition

In Sec. 2.7, we saw how to decompose a matrix into eigenvectors and eigenvalues. The singular value decomposition (SVD) provides another way to factorize a matrix, into singular vectors and singular values. The SVD allows us to discover some of the same kind of information as the eigendecomposition.
However, the SVD is
CHAPTER 2. LINEAR ALGEBRA
more generally applicable. Every real matrix has a singular value decomposition, but the same is not true of the eigenvalue decomposition. For example, if a matrix is not square, the eigendecomposition is not defined, and we must use a singular value decomposition instead.

Recall that the eigendecomposition involves analyzing a matrix A to discover a matrix V of eigenvectors and a vector of eigenvalues λ such that we can rewrite A as

A = V diag(λ) V^{-1}.  (2.42)

The singular value decomposition is similar, except this time we will write A as a product of three matrices:

A = U D V^T.  (2.43)

Suppose that A is an m × n matrix. Then U is defined to be an m × m matrix, D to be an m × n matrix, and V to be an n × n matrix.

Each of these matrices is defined to have a special structure. The matrices U and V are both defined to be orthogonal matrices. The matrix D is defined to be
The matrix D is defined to be The elemen elements ts along the diagonal of D are kno known wn as the singular values of the a diagonal matrix. Note that D is not necessarily square. matrix A. The columns of U are kno known wn as the left-singular ve vectors ctors ctors.. The columns D The elemen ts along the diagonal of are kno wn as the singular values of the of V are kno known wn as as the right-singular ve vectors ctors ctors.. matrix A. The columns of U are known as the left-singular vectors. The columns e can actually the singular value decomposition osition of A in terms of of VWare kno wn as asinterpret the right-singular vectors . decomp the eigendecomposition of functions of A . The left-singular vectors of A are the A in terms Wveectors can actually the singular valueofdecomp osition of of. A are the eigen eigenv of AA> .interpret The righ right-singular t-singular vectors eigen eigenvectors vectors of A>A A A the eigendecomposition of functions of . The left-singular v ectors of are the The non-zero singular values of A are the square ro roots ots of the eigen eigenv values of A>A. AA A eigen v ectors of . The righ t-singular v ectors of are the eigen vectors of A A. > The same is true for AA . The non-zero singular values of A are the square roots of the eigenvalues of A A. Perhaps the most useful feature of the SVD is that we can use it to partially The same is true for AA . generalize matrix in inversion version to non-square matrices, as we will see in the next P erhaps the most useful feature of the SVD is that we can use it to partially section. generalize matrix inversion to non-square matrices, as we will see in the next section.
2.9 The Moore-Penrose Pseudoinverse
Matrix inversion is not defined for matrices that are not square. Suppose we want to make a left-inverse B of a matrix A, so that we can solve a linear equation

Ax = y  (2.44)
by left-multiplying each side to obtain

x = By.  (2.45)
Depending on the structure of the problem, it may not be possible to design a unique mapping from A to B.

If A is taller than it is wide, then it is possible for this equation to have no solution. If A is wider than it is tall, then there could be multiple possible solutions.

The Moore-Penrose pseudoinverse allows us to make some headway in these cases. The pseudoinverse of A is defined as a matrix

A^+ = lim_{α → 0+} (A^T A + αI)^{-1} A^T.  (2.46)
Practical algorithms for computing the pseudoinverse are not based on this definition, but rather the formula

A^+ = V D^+ U^T,  (2.47)

where U, D and V are the singular value decomposition of A, and the pseudoinverse D^+ of a diagonal matrix D is obtained by taking the reciprocal of its non-zero elements, then taking the transpose of the resulting matrix.

When A has more columns than rows, then solving a linear equation using the pseudoinverse provides one of the many possible solutions. Specifically, it provides the solution x = A^+ y with minimal Euclidean norm ||x||_2 among all possible solutions.

When A has more rows than columns, it is possible for there to be no solution. In this case, using the pseudoinverse gives us the x for which Ax is as close as possible to y in terms of Euclidean norm ||Ax − y||_2.
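The SVD formula of Eq. 2.47 and the minimal-norm property can both be checked numerically. This sketch is not from the book; it assumes NumPy, and the wide matrix A and target y are arbitrary examples:

```python
import numpy as np

rng = np.random.default_rng(2)

# A wide matrix: more columns than rows, so Ax = y has many solutions.
A = rng.standard_normal((3, 5))
y = rng.standard_normal(3)

# The SVD formula A+ = V D+ U^T (Eq. 2.47) matches the library pseudoinverse.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T
assert np.allclose(A_pinv, np.linalg.pinv(A))

# x = A+ y solves the system exactly (A has full row rank here) ...
x = A_pinv @ y
assert np.allclose(A @ x, y)

# ... and has minimal Euclidean norm: shifting x along a null-space direction
# of A yields another valid solution with a strictly larger norm.
_, _, Vt_full = np.linalg.svd(A, full_matrices=True)
null_dir = Vt_full[3]               # unit vector with A @ null_dir ~ 0
x_alt = x + 0.5 * null_dir
assert np.allclose(A @ x_alt, y)
assert np.linalg.norm(x_alt) > np.linalg.norm(x)
```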
2.10 The Trace Operator

The trace operator gives the sum of all of the diagonal entries of a matrix:

Tr(A) = Σ_i A_{i,i}.  (2.48)
The trace operator is useful for a variety of reasons. Some operations that are difficult to specify without resorting to summation notation can be specified using
matrix products and the trace operator. For example, the trace operator provides an alternative way of writing the Frobenius norm of a matrix:

||A||_F = sqrt(Tr(A A^T)).  (2.49)
Writing an expression in terms of the trace operator opens up opportunities to manipulate the expression using many useful identities. For example, the trace operator is invariant to the transpose operator:

Tr(A) = Tr(A^T).  (2.50)
The trace of a square matrix composed of many factors is also invariant to moving the last factor into the first position, if the shapes of the corresponding matrices allow the resulting product to be defined:

Tr(ABC) = Tr(CAB) = Tr(BCA)  (2.51)

or more generally,
Tr(∏_{i=1}^{n} F^{(i)}) = Tr(F^{(n)} ∏_{i=1}^{n-1} F^{(i)}).  (2.52)
This invariance to cyclic permutation holds even if the resulting product has a different shape. For example, for A ∈ R^{m×n} and B ∈ R^{n×m}, we have

Tr(AB) = Tr(BA)  (2.53)

even though AB ∈ R^{m×m} and BA ∈ R^{n×n}.

Another useful fact to keep in mind is that a scalar is its own trace: a = Tr(a).
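The identities in Eqs. 2.49, 2.50, and 2.53 can be confirmed on small examples. As before, this is only a sketch under assumed tooling (NumPy); the matrices are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 6))
B = rng.standard_normal((6, 4))

# Frobenius norm via the trace operator (Eq. 2.49).
assert np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.trace(A @ A.T)))

# Trace is invariant to transposition of a square matrix (Eq. 2.50).
M = rng.standard_normal((5, 5))
assert np.isclose(np.trace(M), np.trace(M.T))

# Cyclic permutation: Tr(AB) = Tr(BA), even though AB is 4x4
# while BA is 6x6 (Eq. 2.53).
assert np.isclose(np.trace(A @ B), np.trace(B @ A))
```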
2.11 The Determinant
The determinant of a square matrix, denoted det(A), is a function mapping matrices to real scalars. The determinant is equal to the product of all the eigenvalues of the matrix. The absolute value of the determinant can be thought of as a measure of how much multiplication by the matrix expands or contracts space. If the determinant is 0, then space is contracted completely along at least one dimension, causing it to lose all of its volume. If the determinant is 1, then the transformation is volume-preserving.
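The claim that the determinant equals the product of the eigenvalues is easy to spot-check. This sketch assumes NumPy and an arbitrary example matrix; note that a general real matrix may have complex eigenvalues, though their product is still real up to floating-point noise:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((4, 4))

# det(A) equals the product of the eigenvalues (taking the real part,
# since complex eigenvalues come in conjugate pairs for a real matrix).
eigvals = np.linalg.eigvals(A)
assert np.isclose(np.linalg.det(A), np.prod(eigvals).real)

# An orthogonal matrix has |det| = 1: it preserves volume.
Q, _ = np.linalg.qr(A)
assert np.isclose(abs(np.linalg.det(Q)), 1.0)
```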
2.12 Example: Principal Components Analysis
One simple machine learning algorithm, principal components analysis or PCA, can be derived using only knowledge of basic linear algebra.

Suppose we have a collection of m points {x^{(1)}, ..., x^{(m)}} in R^n. Suppose we would like to apply lossy compression to these points. Lossy compression means storing the points in a way that requires less memory but may lose some precision. We would like to lose as little precision as possible.

One way we can encode these points is to represent a lower-dimensional version of them. For each point x^{(i)} ∈ R^n we will find a corresponding code vector c^{(i)} ∈ R^l. If l is smaller than n, it will take less memory to store the code points than the original data. We will want to find some encoding function that produces the code for an input, f(x) = c, and a decoding function that produces the reconstructed input given its code, x ≈ g(f(x)).
PCA is defined by our choice of the decoding function. Specifically, to make the decoder very simple, we choose to use matrix multiplication to map the code back into R^n. Let g(c) = Dc, where D ∈ R^{n×l} is the matrix defining the decoding.

Computing the optimal code for this decoder could be a difficult problem. To keep the encoding problem easy, PCA constrains the columns of D to be orthogonal to each other. (Note that D is still not technically "an orthogonal matrix" unless l = n.)

With the problem as described so far, many solutions are possible, because we can increase the scale of D_{:,i} if we decrease c_i proportionally for all points. To give the problem a unique solution, we constrain all of the columns of D to have unit norm.

In order to turn this basic idea into an algorithm we can implement, the first
thing we need to do is figure out how to generate the optimal code point c* for each input point x. One way to do this is to minimize the distance between the input point x and its reconstruction, g(c*). We can measure this distance using a norm. In the principal components algorithm, we use the L^2 norm:

c* = arg min_c ||x − g(c)||_2.  (2.54)
We can switch to the squared L^2 norm instead of the L^2 norm itself, because both are minimized by the same value of c. This is because the L^2 norm is non-negative and the squaring operation is monotonically increasing for non-negative
arguments.

c* = arg min_c ||x − g(c)||_2^2.  (2.55)

The function being minimized simplifies to

(x − g(c))^T (x − g(c))  (2.56)

(by the definition of the L^2 norm, Eq. 2.30)

= x^T x − x^T g(c) − g(c)^T x + g(c)^T g(c)  (2.57)

(by the distributive property)

= x^T x − 2 x^T g(c) + g(c)^T g(c)  (2.58)

(because the scalar g(c)^T x is equal to the transpose of itself).

We can now change the function being minimized again, to omit the first term, since this term does not depend on c:

c* = arg min_c −2 x^T g(c) + g(c)^T g(c).  (2.59)
To make further progress, we must substitute in the definition of g(c):

c* = arg min_c −2 x^T D c + c^T D^T D c  (2.60)

= arg min_c −2 x^T D c + c^T I_l c  (2.61)

(by the orthogonality and unit norm constraints on D)

= arg min_c −2 x^T D c + c^T c.  (2.62)

We can solve this optimization problem using vector calculus (see Sec. 4.3 if you do not know how to do this):

∇_c (−2 x^T D c + c^T c) = 0  (2.63)

−2 D^T x + 2c = 0  (2.64)

c = D^T x.  (2.65)

This makes the algorithm efficient: we can optimally encode x using just a matrix-vector operation. To encode a vector, we apply the encoder function

f(x) = D^T x.  (2.66)
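The closed-form code of Eq. 2.65 can be cross-checked against a generic least-squares solver: when D has orthonormal columns, the minimizer of ||x − Dc||_2 is exactly D^T x. This sketch assumes NumPy; D and x are arbitrary examples, not from the book:

```python
import numpy as np

rng = np.random.default_rng(5)

# A decoder D in R^{n x l} with orthonormal columns, built via a reduced QR
# factorization of a random matrix.
n, l = 6, 2
D, _ = np.linalg.qr(rng.standard_normal((n, l)))
x = rng.standard_normal(n)

# Closed form from Eq. 2.65: the optimal code is c* = D^T x.
c_closed = D.T @ x

# Generic least-squares solve of min_c ||x - Dc||_2 gives the same answer,
# because D^T D = I_l makes the normal-equations solution collapse to D^T x.
c_lstsq, *_ = np.linalg.lstsq(D, x, rcond=None)
assert np.allclose(c_closed, c_lstsq)
```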
Using a further matrix multiplication, we can also define the PCA reconstruction operation:

r(x) = g(f(x)) = D D^T x.  (2.67)

Next, we need to choose the encoding matrix D. To do so, we revisit the idea of minimizing the L^2 distance between inputs and reconstructions. However, since we will use the same matrix D to decode all of the points, we can no longer consider the points in isolation. Instead, we must minimize the Frobenius norm of the matrix of errors computed over all dimensions and all points:

D* = arg min_D sqrt( Σ_{i,j} ( x_j^{(i)} − r(x^{(i)})_j )^2 )  subject to D^T D = I_l.  (2.68)

To derive the algorithm for finding D*, we will start by considering the case where l = 1. In this case, D is just a single vector, d. Substituting Eq. 2.67 into Eq. 2.68 and simplifying D into d, the problem reduces to

d* = arg min_d Σ_i ||x^{(i)} − d d^T x^{(i)}||_2^2  subject to ||d||_2 = 1.  (2.69)

The above formulation is the most direct way of performing the substitution, but is not the most stylistically pleasing way to write the equation. It places the scalar value d^T x^{(i)} on the right of the vector d. It is more conventional to write scalar coefficients on the left of the vector they operate on. We therefore usually write such a formula as

d* = arg min_d Σ_i ||x^{(i)} − d^T x^{(i)} d||_2^2  subject to ||d||_2 = 1,  (2.70)

or, exploiting the fact that a scalar is its own transpose, as

d* = arg min_d Σ_i ||x^{(i)} − x^{(i)T} d d||_2^2  subject to ||d||_2 = 1.  (2.71)

The reader should aim to become familiar with such cosmetic rearrangements.

At this point, it can be helpful to rewrite the problem in terms of a single design matrix of examples, rather than as a sum over separate example vectors. This will allow us to use more compact notation. Let X ∈ R^{m×n} be the matrix defined by stacking all of the vectors describing the points, such that X_{i,:} = x^{(i)T}. We can now rewrite the problem as

d* = arg min_d ||X − X d d^T||_F^2  subject to d^T d = 1.  (2.72)
Disregarding the constraint for the moment, we can simplify the Frobenius norm portion as follows:

arg min_d ||X − X d d^T||_F^2  (2.73)

= arg min_d Tr( (X − X d d^T)^T (X − X d d^T) )  (2.74)

(by Eq. 2.49)

= arg min_d Tr( X^T X − X^T X d d^T − d d^T X^T X + d d^T X^T X d d^T )  (2.75)

= arg min_d Tr(X^T X) − Tr(X^T X d d^T) − Tr(d d^T X^T X) + Tr(d d^T X^T X d d^T)  (2.76)

= arg min_d −Tr(X^T X d d^T) − Tr(d d^T X^T X) + Tr(d d^T X^T X d d^T)  (2.77)

(because terms not involving d do not affect the arg min)

= arg min_d −2 Tr(X^T X d d^T) + Tr(d d^T X^T X d d^T)  (2.78)

(because we can cycle the order of the matrices inside a trace, Eq. 2.52)

= arg min_d −2 Tr(X^T X d d^T) + Tr(X^T X d d^T d d^T)  (2.79)

(using the same property again)

At this point, we re-introduce the constraint:

arg min_d −2 Tr(X^T X d d^T) + Tr(X^T X d d^T d d^T)  subject to d^T d = 1  (2.80)

= arg min_d −2 Tr(X^T X d d^T) + Tr(X^T X d d^T)  subject to d^T d = 1  (2.81)

(due to the constraint)

= arg min_d −Tr(X^T X d d^T)  subject to d^T d = 1  (2.82)

= arg max_d Tr(X^T X d d^T)  subject to d^T d = 1  (2.83)

= arg max_d Tr(d^T X^T X d)  subject to d^T d = 1  (2.84)
This optimization problem may be solved using eigendecomposition. Specifically, the optimal d is given by the eigenvector of X^T X corresponding to the largest eigenvalue.

In the general case, where l > 1, the matrix D is given by the l eigenvectors corresponding to the largest eigenvalues. This may be shown using proof by induction. We recommend writing this proof as an exercise.

Linear algebra is one of the fundamental mathematical disciplines that is necessary to understand deep learning. Another key area of mathematics that is ubiquitous in machine learning is probability theory, presented next.
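Putting the derivation together, the following sketch (not from the book; it assumes NumPy, and the synthetic data matrix X is an arbitrary example) implements PCA exactly as derived here: D is taken to be the l eigenvectors of X^T X with the largest eigenvalues, encoding is f(x) = D^T x and reconstruction is r(x) = D D^T x. Following the book's formulation, the data are not mean-centered. It also checks that this choice of D reconstructs better than a basis of the smallest eigenvectors:

```python
import numpy as np

rng = np.random.default_rng(6)

# m points in R^n, with most of the variance concentrated in two directions.
m, n, l = 200, 5, 2
X = rng.standard_normal((m, n)) * np.array([5.0, 3.0, 0.5, 0.2, 0.1])

# D = the l eigenvectors of X^T X with the largest eigenvalues.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)   # eigenvalues in ascending order
D = eigvecs[:, -l:]                          # columns for the top-l eigenvalues
assert np.allclose(D.T @ D, np.eye(l))       # the constraint D^T D = I_l holds

# Encode every point (rows of X), then reconstruct (Eqs. 2.66-2.67).
codes = X @ D                # row i is f(x^{(i)}) = D^T x^{(i)}
recons = codes @ D.T         # row i is r(x^{(i)}) = D D^T x^{(i)}
err = np.linalg.norm(X - recons, 'fro')

# The top-l eigenvector basis is optimal: the l *smallest* eigenvectors,
# for instance, give a strictly worse reconstruction on this data.
D_bad = eigvecs[:, :l]
err_bad = np.linalg.norm(X - (X @ D_bad) @ D_bad.T, 'fro')
assert err < err_bad
```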
Chapter 3
Probability and Information Theory

In this chapter, we describe probability theory and information theory.
Probability theory is a mathematical framework for representing uncertain statements. It provides a means of quantifying uncertainty and axioms for deriving new uncertain statements. In artificial intelligence applications, we use probability theory in two major ways. First, the laws of probability tell us how AI systems should reason, so we design our algorithms to compute or approximate various expressions derived using probability theory. Second, we can use probability and statistics to theoretically analyze the behavior of proposed AI systems.

Probability theory is a fundamental tool of many disciplines of science and engineering. We provide this chapter to ensure that readers whose background is primarily in software engineering, with limited exposure to probability theory, can understand the material in this book.
While probability theory allows us to make uncertain statements and reason in the presence of uncertainty, information theory allows us to quantify the amount of uncertainty in a probability distribution.

If you are already familiar with probability theory and information theory, you may wish to skip all of this chapter except for Sec. 3.14, which describes the graphs we use to describe structured probabilistic models for machine learning. If you have absolutely no prior experience with these subjects, this chapter should be sufficient to successfully carry out deep learning research projects, but we do suggest that you consult an additional resource, such as Jaynes (2003).
CHAPTER 3. PROBABILITY AND INFORMATION THEORY
3.1 Why Probability?
Many branches of computer science deal mostly with entities that are entirely deterministic and certain. A programmer can usually safely assume that a CPU will execute each machine instruction flawlessly. Errors in hardware do occur, but are rare enough that most software applications do not need to be designed to account for them. Given that many computer scientists and software engineers work in a relatively clean and certain environment, it can be surprising that machine learning makes heavy use of probability theory.

This is because machine learning must always deal with uncertain quantities, and sometimes may also need to deal with stochastic (non-deterministic) quantities. Uncertainty and stochasticity can arise from many sources. Researchers have made compelling arguments for quantifying uncertainty using probability since at least the 1980s.
Many of the arguments presented here are summarized from or inspired by Pearl (1988).

Nearly all activities require some ability to reason in the presence of uncertainty. In fact, beyond mathematical statements that are true by definition, it is difficult to think of any proposition that is absolutely true or any event that is absolutely guaranteed to occur.

There are three possible sources of uncertainty:

1. Inherent stochasticity in the system being modeled. For example, most interpretations of quantum mechanics describe the dynamics of subatomic particles as being probabilistic. We can also create theoretical scenarios that we postulate to have random dynamics, such as a hypothetical card game where we assume that the cards are truly shuffled into a random order.
2. Incomplete observability. Even deterministic systems can appear stochastic when we cannot observe all of the variables that drive the behavior of the system. For example, in the Monty Hall problem, a game show contestant is asked to choose between three doors and wins a prize held behind the chosen door. Two doors lead to a goat while a third leads to a car. The outcome given the contestant's choice is deterministic, but from the contestant's point of view, the outcome is uncertain.

3. Incomplete modeling. When we use a model that must discard some of the information we have observed, the discarded information results in uncertainty in the model's predictions. For example, suppose we build a robot that can exactly observe the location of every object around it. If the
robot discretizes space when predicting the future location of these objects, then the discretization makes the robot immediately become uncertain about the precise position of objects: each object could be anywhere within the discrete cell that it was observed to occupy.

In many cases, it is more practical to use a simple but uncertain rule rather than a complex but certain one, even if the true rule is deterministic and our modeling system has the fidelity to accommodate a complex rule. For example, the simple rule "Most birds fly" is cheap to develop and is broadly useful, while a rule of the form, "Birds fly, except for very young birds that have not learned to fly, sick or injured birds that have lost the ability to fly, flightless species of birds including the cassowary, ostrich and kiwi. . .
" is expensive to develop, maintain and communicate, and after all of this effort is still very brittle and prone to failure.

Given that we need a means of representing and reasoning about uncertainty, it is not immediately obvious that probability theory can provide all of the tools we want for artificial intelligence applications. Probability theory was originally developed to analyze the frequencies of events. It is easy to see how probability theory can be used to study events like drawing a certain hand of cards in a game of poker. These kinds of events are often repeatable. When we say that an outcome has a probability p of occurring, it means that if we repeated the experiment (e.g., draw a hand of cards) infinitely many times, then proportion p of the repetitions would result in that outcome. This kind of reasoning does not seem immediately applicable to propositions that are not repeatable. If a doctor
kind of reasoning es not analyzes patientwould and sa says ys that has This a 40% chance of havingdothe flu, seem immediately applicable to prop ositions that are not rep eatable. If a do ctor this means something very different—w different—wee can not mak makee infinitely man many y replicas of analyzes a patient and sa ys that the patient has a 40% chance of having flu, the patient, nor is there an any y reason to believe that differen differentt replicas of the the patien patient t this means something very different—w e can not mak e infinitely man y replicas of would present with the same symptoms yet hav havee varying underlying conditions. In the patient, nor do is there an y reason to b elieve that t replicasto of represent the patienat the case of the diagnosing the patient, we differen use probability doctor ctor would the1 same symptoms yet hav e varying underlying conditions. In de degr gr greee present of belief elief,with , with indicating absolute certaint certainty y that the patient has the flu the case of the doabsolute ctor diagnosing patient, we usedo probability represent a and 0 indicating certain certainttythe that the patient does es not hav haveetothe flu. The de gr e e of b elief , with 1 indicating absolute certaint y that the patient has the flu former kind of probability probability,, related directly to the rates at which even events ts occur, is and 0 indicating absolute certain t y that the patient do es not hav e the The kno known wn as fr freequentist pr prob ob obability ability ability,, while the latter, related to qualitative flu. lev levels els of former of probability , related directly certain certainttkind y, is known as Bayesian pr prob ob obability ability ability.. to the rates at which events occur, is known as frequentist probability, while the latter, related to qualitative levels of If we list several properties that we expect common sense reasoning ab about out certainty, is known as Bayesian probability. 
certainty, is known as Bayesian probability.

If we list several properties that we expect common sense reasoning about uncertainty to have, then the only way to satisfy those properties is to treat Bayesian probabilities as behaving exactly the same as frequentist probabilities. For example, if we want to compute the probability that a player will win a poker game given that she has a certain set of cards, we use exactly the same formulas as when we compute the probability that a patient has a disease given that she
has certain symptoms. For more details about why a small set of common sense assumptions implies that the same axioms must control both kinds of probability, see Ramsey (1926).

Probability can be seen as the extension of logic to deal with uncertainty. Logic provides a set of formal rules for determining what propositions are implied to be true or false given the assumption that some other set of propositions is true or false. Probability theory provides a set of formal rules for determining the likelihood of a proposition being true given the likelihood of other propositions.
3.2 Random Variables
A random variable is a variable that can take on different values randomly. We typically denote the random variable itself with a lower case letter in plain typeface, and the values it can take on with lower case script letters. For example, x1 and x2 are both possible values that the random variable x can take on. For vector-valued variables, we would write the random variable as x and one of its values as x. On its own, a random variable is just a description of the states that are possible; it must be coupled with a probability distribution that specifies how likely each of these states are.

Random variables may be discrete or continuous. A discrete random variable is one that has a finite or countably infinite number of states. Note that these states are not necessarily the integers; they can also just be named states that are not considered to have any numerical value.
A continuous random variable is associated with a real value.
3.3 Probability Distributions
A probability distribution is a description of how likely a random variable or set of random variables is to take on each of its possible states. The way we describe probability distributions depends on whether the variables are discrete or continuous.
3.3.1 Discrete Variables and Probability Mass Functions
A probability distribution over discrete variables may be described using a probability mass function (PMF). We typically denote probability mass functions with a capital P. Often we associate each random variable with a different probability
mass function and the reader must infer which probability mass function to use based on the identity of the random variable, rather than the name of the function; P(x) is usually not the same as P(y).

The probability mass function maps from a state of a random variable to the probability of that random variable taking on that state. The probability that x = x is denoted as P(x), with a probability of 1 indicating that x = x is certain and a probability of 0 indicating that x = x is impossible. Sometimes to disambiguate which PMF to use, we write the name of the random variable explicitly: P(x = x). Sometimes we define a variable first, then use ∼ notation to specify which distribution it follows later: x ∼ P(x).

Probability mass functions can act on many variables at the same time. Such a probability distribution over many variables is known as a joint probability
distribution. P(x = x, y = y) denotes the probability that x = x and y = y simultaneously. We may also write P(x, y) for brevity.

To be a probability mass function on a random variable x, a function P must satisfy the following properties:

• The domain of P must be the set of all possible states of x.

• ∀x ∈ x, 0 ≤ P(x) ≤ 1. An impossible event has probability 0 and no state can be less probable than that. Likewise, an event that is guaranteed to happen has probability 1, and no state can have a greater chance of occurring.

• Σ_{x∈x} P(x) = 1. We refer to this property as being normalized. Without this property, we could obtain probabilities greater than one by computing the probability of one of many events occurring.

For example, consider a single discrete random variable x with k different states. We can place a uniform distribution on x, that is, make each of its states equally
likely, by setting its probability mass function to

P(x = x_i) = 1/k        (3.1)

for all i. We can see that this fits the requirements for a probability mass function. The value 1/k is positive because k is a positive integer. We also see that

Σ_i P(x = x_i) = Σ_i 1/k = k/k = 1,        (3.2)

so the distribution is properly normalized.
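These two requirements can be checked mechanically. The sketch below (the `uniform_pmf` helper is a name invented for this example) builds the uniform PMF of Eq. 3.1 with exact rational arithmetic, so the normalization check of Eq. 3.2 holds exactly rather than only up to floating-point rounding:

```python
from fractions import Fraction

def uniform_pmf(k):
    """Uniform PMF over k named states (Eq. 3.1): each state gets 1/k."""
    return {f"x_{i}": Fraction(1, k) for i in range(1, k + 1)}

pmf = uniform_pmf(6)

# Every probability lies in [0, 1] ...
assert all(0 <= p <= 1 for p in pmf.values())
# ... and the probabilities sum to exactly 1 (Eq. 3.2), so the
# distribution is properly normalized.
assert sum(pmf.values()) == 1
```

Using `Fraction` rather than floats is just a convenience here; with floats the sum would only be approximately 1.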
3.3.2 Continuous Variables and Probability Density Functions
When working with continuous random variables, we describe probability distributions using a probability density function (PDF) rather than a probability mass function. To be a probability density function, a function p must satisfy the following properties:

• The domain of p must be the set of all possible states of x.

• ∀x ∈ x, p(x) ≥ 0. Note that we do not require p(x) ≤ 1.

• ∫ p(x)dx = 1.

A probability density function p(x) does not give the probability of a specific state directly; instead, the probability of landing inside an infinitesimal region with volume δx is given by p(x)δx.

We can integrate the density function to find the actual probability mass of a set of points. Specifically, the probability that x lies in some set S is given by the integral of p(x) over that set.
In the univariate example, the probability that x lies in the interval [a, b] is given by ∫_{[a,b]} p(x)dx.

For an example of a probability density function corresponding to a specific probability density over a continuous random variable, consider a uniform distribution on an interval of the real numbers. We can do this with a function u(x; a, b), where a and b are the endpoints of the interval, with b > a. The ";" notation means "parametrized by"; we consider x to be the argument of the function, while a and b are parameters that define the function. To ensure that there is no probability mass outside the interval, we say u(x; a, b) = 0 for all x ∉ [a, b]. Within [a, b], u(x; a, b) = 1/(b − a). We can see that this is nonnegative everywhere. Additionally, it integrates to 1. We often denote that x follows the uniform distribution on [a, b] by writing x ∼ U(a, b).
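A quick numerical check of these claims, assuming a hypothetical `uniform_density` helper and endpoints chosen arbitrarily for the example; the integral is approximated with a midpoint Riemann sum:

```python
def uniform_density(x, a, b):
    """u(x; a, b): equal to 1/(b - a) inside [a, b] and 0 outside."""
    assert b > a
    return 1.0 / (b - a) if a <= x <= b else 0.0

# Midpoint Riemann sum over [a, b]; the endpoints and step count are
# arbitrary choices for this example.
a, b, n = 2.0, 5.0, 100_000
dx = (b - a) / n
total = sum(uniform_density(a + (i + 0.5) * dx, a, b) * dx for i in range(n))

assert uniform_density(1.0, a, b) == 0.0   # no probability mass outside [a, b]
assert abs(total - 1.0) < 1e-6             # the density integrates to 1
```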
3.4 Marginal Probability

Sometimes we know the probability distribution over a set of variables and we want to know the probability distribution over just a subset of them. The probability distribution over the subset is known as the marginal probability distribution.

For example, suppose we have discrete random variables x and y, and we know P(x, y). We can find P(x) with the sum rule:
∀x ∈ x, P(x = x) = Σ_y P(x = x, y = y).        (3.3)
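When a joint distribution over discrete variables is stored as a table, the sum rule can be applied directly. The sketch below uses a small made-up joint P(x, y) over two binary variables; the `marginal_x` helper is hypothetical:

```python
# A made-up joint distribution P(x, y) over two binary variables,
# stored as a table keyed by (x, y) pairs.
joint = {
    (0, 0): 0.1, (0, 1): 0.2,
    (1, 0): 0.3, (1, 1): 0.4,
}

def marginal_x(joint):
    """Sum rule (Eq. 3.3): P(x = x) = sum over y of P(x = x, y = y)."""
    p = {}
    for (x, y), prob in joint.items():
        p[x] = p.get(x, 0.0) + prob
    return p

p_x = marginal_x(joint)
assert abs(p_x[0] - 0.3) < 1e-12   # 0.1 + 0.2
assert abs(p_x[1] - 0.7) < 1e-12   # 0.3 + 0.4
```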
The name "marginal probability" comes from the process of computing marginal probabilities on paper. When the values of P(x, y) are written in a grid with different values of x in rows and different values of y in columns, it is natural to sum across a row of the grid, then write P(x) in the margin of the paper just to the right of the row.

For continuous variables, we need to use integration instead of summation:
p(x) = ∫ p(x, y)dy.        (3.4)

3.5 Conditional Probability
In many cases, we are interested in the probability of some event, given that some other event has happened. This is called a conditional probability. We denote the conditional probability that y = y given x = x as P(y = y | x = x). This conditional probability can be computed with the formula

P(y = y | x = x) = P(y = y, x = x) / P(x = x).        (3.5)

The conditional probability is only defined when P(x = x) > 0. We cannot compute the conditional probability conditioned on an event that never happens.

It is important not to confuse conditional probability with computing what would happen if some action were undertaken. The conditional probability that a person is from Germany given that they speak German is quite high, but if a randomly selected person is taught to speak German, their country of origin does not change. Computing the consequences of an action is called making an intervention query.
Intervention queries are the domain of causal modeling, which we do not explore in this book.
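Eq. 3.5 translates directly into code for discrete variables stored as a joint table. The joint distribution and the `conditional` helper below are made up for the example; note that the division is refused when P(x = x) = 0, matching the caveat above:

```python
# A made-up joint distribution P(x, y) over two binary variables.
joint = {
    (0, 0): 0.1, (0, 1): 0.2,
    (1, 0): 0.3, (1, 1): 0.4,
}

def conditional(joint, y, x):
    """Eq. 3.5: P(y = y | x = x) = P(y = y, x = x) / P(x = x)."""
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)
    if p_x == 0:
        # The conditional is undefined when P(x = x) = 0.
        raise ValueError("cannot condition on an event that never happens")
    return joint[(x, y)] / p_x

# P(y = 1 | x = 1) = 0.4 / (0.3 + 0.4)
assert abs(conditional(joint, y=1, x=1) - 0.4 / 0.7) < 1e-12
```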
3.6 The Chain Rule of Conditional Probabilities
Any joint probability distribution over many random variables may be decomposed into conditional distributions over only one variable:

P(x^(1), . . . , x^(n)) = P(x^(1)) Π_{i=2}^{n} P(x^(i) | x^(1), . . . , x^(i−1)).        (3.6)

This observation is known as the chain rule or product rule of probability. It follows immediately from the definition of conditional probability in Eq. 3.5. For
example, applying the definition twice, we get

P(a, b, c) = P(a | b, c)P(b, c)
P(b, c) = P(b | c)P(c)
P(a, b, c) = P(a | b, c)P(b | c)P(c).
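The factorization can be confirmed numerically on a small example. Below, a made-up joint P(a, b, c) over three binary variables is marginalized with a hypothetical `marg` helper, and the product P(a | b, c)P(b | c)P(c) is checked against P(a, b, c) for every state:

```python
import itertools

# A made-up joint distribution P(a, b, c) over three binary variables.
states = list(itertools.product([0, 1], repeat=3))
probs = [0.02, 0.08, 0.10, 0.05, 0.20, 0.15, 0.25, 0.15]   # sums to 1
joint = dict(zip(states, probs))

def marg(joint, keep):
    """Marginalize the joint down to the variables at the given index positions."""
    out = {}
    for state, p in joint.items():
        key = tuple(state[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

p_bc = marg(joint, (1, 2))   # P(b, c)
p_c = marg(joint, (2,))      # P(c)

# Check Eq. 3.6 state by state: P(a,b,c) = P(a | b,c) P(b | c) P(c),
# where each conditional is formed by dividing, per Eq. 3.5.
for (a, b, c), p in joint.items():
    rhs = (joint[(a, b, c)] / p_bc[(b, c)]) * (p_bc[(b, c)] / p_c[(c,)]) * p_c[(c,)]
    assert abs(p - rhs) < 1e-12
```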
3.7 Independence and Conditional Independence

Two random variables x and y are independent if their probability distribution can be expressed as a product of two factors, one involving only x and one involving only y:

∀x ∈ x, y ∈ y, p(x = x, y = y) = p(x = x)p(y = y).        (3.7)

Two random variables x and y are conditionally independent given a random variable z if the conditional probability distribution over x and y factorizes in this way for every value of z:

∀x ∈ x, y ∈ y, z ∈ z, p(x = x, y = y | z = z) = p(x = x | z = z)p(y = y | z = z).        (3.8)

We can denote independence and conditional independence with compact notation: x⊥y means that x and y are independent, while x⊥y | z means that x and y are conditionally independent given z.
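For discrete variables with the joint stored as a table, Eq. 3.7 can be checked by comparing each joint probability against the product of the marginals. The `is_independent` helper and both example joints below are invented for illustration:

```python
def is_independent(joint, tol=1e-9):
    """Check Eq. 3.7: p(x, y) = p(x) p(y) for every pair of states."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    return all(abs(p - p_x[x] * p_y[y]) < tol for (x, y), p in joint.items())

# Independent: the joint is the outer product of its two marginals.
indep = {(0, 0): 0.12, (0, 1): 0.18, (1, 0): 0.28, (1, 1): 0.42}
# Dependent: x and y always take the same value.
dep = {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.5}

assert is_independent(indep)
assert not is_independent(dep)
```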
3.8 Expectation, Variance and Covariance

The expectation or expected value of some function f(x) with respect to a probability distribution P(x) is the average or mean value that f takes on when x is drawn from P. For discrete variables this can be computed with a summation:

    E_{x∼P}[f(x)] = Σ_x P(x)f(x),    (3.9)

while for continuous variables, it is computed with an integral:

    E_{x∼p}[f(x)] = ∫ p(x)f(x)dx.    (3.10)
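Eq. 3.9 can be illustrated with a short numerical sketch (the distribution and the function f below are made up): the exact summation agrees with a Monte Carlo average over samples drawn from P.

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up discrete distribution P over states {0, 1, 2} and a function f.
states = np.array([0, 1, 2])
P = np.array([0.2, 0.5, 0.3])
f = lambda v: v ** 2

# Exact expectation via the summation of Eq. 3.9:
# 0.2*0 + 0.5*1 + 0.3*4 = 1.7.
exact = np.sum(P * f(states))

# Monte Carlo estimate: draw x ~ P and average f(x).
samples = rng.choice(states, size=100_000, p=P)
estimate = f(samples).mean()
print(exact, estimate)
```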
When the identity of the distribution is clear from the context, we may simply write the name of the random variable that the expectation is over, as in E_x[f(x)]. If it is clear which random variable the expectation is over, we may omit the subscript entirely, as in E[f(x)]. By default, we can assume that E[·] averages over the values of all the random variables inside the brackets. Likewise, when there is no ambiguity, we may omit the square brackets.

Expectations are linear, for example,

    E_x[αf(x) + βg(x)] = αE_x[f(x)] + βE_x[g(x)],    (3.11)

when α and β are not dependent on x.

The variance gives a measure of how much the values of a function of a random variable x vary as we sample different values of x from its probability distribution:

    Var(f(x)) = E[(f(x) − E[f(x)])²].    (3.12)

When the variance is low, the values of f(x) cluster near their expected value. The square root of the variance is known as the standard deviation.

The covariance gives some sense of how much two values are linearly related to each other, as well as the scale of these variables:

    Cov(f(x), g(y)) = E[(f(x) − E[f(x)])(g(y) − E[g(y)])].    (3.13)

High absolute values of the covariance mean that the values change very much and are both far from their respective means at the same time. If the sign of the covariance is positive, then both variables tend to take on relatively high values simultaneously. If the sign of the covariance is negative, then one variable tends to take on a relatively high value at the times that the other takes on a relatively low value and vice versa. Other measures such as correlation normalize the contribution of each variable in order to measure only how much the variables are related, rather than also being affected by the scale of the separate variables.

The notions of covariance and dependence are related, but are in fact distinct concepts. They are related because two variables that are independent have zero covariance, and two variables that have non-zero covariance are dependent. However, independence is a distinct property from covariance. For two variables to have zero covariance, there must be no linear dependence between them. Independence is a stronger requirement than zero covariance, because independence also excludes nonlinear relationships. It is possible for two variables to be dependent but have zero covariance. For example, suppose we first sample a real number x from a uniform distribution over the interval [−1, 1]. We next sample a random variable
s. With probability 1/2, we choose the value of s to be 1. Otherwise, we choose the value of s to be −1. We can then generate a random variable y by assigning y = sx. Clearly, x and y are not independent, because x completely determines the magnitude of y. However, Cov(x, y) = 0.

The covariance matrix of a random vector x ∈ R^n is an n × n matrix, such that

    Cov(x)_{i,j} = Cov(x_i, x_j).    (3.14)

The diagonal elements of the covariance give the variance:

    Cov(x_i, x_i) = Var(x_i).    (3.15)
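The dependent-but-uncorrelated construction above translates directly into code; a sketch with NumPy (the sample size is chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000  # sample size chosen arbitrarily

# x ~ Uniform[-1, 1]; s = +1 or -1 with probability 1/2 each; y = s x.
# x determines the magnitude of y, yet the sign of y is random, so
# the sample covariance is close to zero even though y depends on x.
x = rng.uniform(-1.0, 1.0, size=n)
s = rng.choice([-1.0, 1.0], size=n)
y = s * x

cov = np.cov(x, y)[0, 1]
print(cov)
```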
3.9 Common Probability Distributions

Several simple probability distributions are useful in many contexts in machine learning.
3.9.1 Bernoulli Distribution

The Bernoulli distribution is a distribution over a single binary random variable. It is controlled by a single parameter φ ∈ [0, 1], which gives the probability of the random variable being equal to 1. It has the following properties:

    P(x = 1) = φ    (3.16)
    P(x = 0) = 1 − φ    (3.17)
    P(x = x) = φ^x (1 − φ)^(1−x)    (3.18)
    E_x[x] = φ    (3.19)
    Var_x(x) = φ(1 − φ)    (3.20)
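As a quick sanity check of Eqs. 3.19 and 3.20 (the value φ = 0.3 is arbitrary), Bernoulli samples should have mean φ and variance φ(1 − φ):

```python
import numpy as np

rng = np.random.default_rng(0)

phi = 0.3  # arbitrary choice of the Bernoulli parameter

# Draw x in {0, 1} with P(x = 1) = phi, then compare the sample
# mean and variance with Eqs. 3.19 and 3.20.
x = rng.binomial(n=1, p=phi, size=200_000)
print(x.mean())   # should be close to phi
print(x.var())    # should be close to phi * (1 - phi) = 0.21
```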
3.9.2 Multinoulli Distribution

The multinoulli or categorical distribution is a distribution over a single discrete variable with k different states, where k is finite.¹

¹ “Multinoulli” is a term that was recently coined by Gustavo Lacerda and popularized by Murphy (2012). The multinoulli distribution is a special case of the multinomial distribution. A multinomial distribution is the distribution over vectors in {0, . . . , n}^k representing how many times each of the k categories is visited when n samples are drawn from a multinoulli distribution. Many texts use the term “multinomial” to refer to multinoulli distributions without clarifying that they refer only to the n = 1 case.
The multinoulli distribution is parametrized by a vector p ∈ [0, 1]^(k−1), where p_i gives the probability of the i-th state. The final, k-th state’s probability is given by 1 − 1⊤p. Note that we must constrain 1⊤p ≤ 1. Multinoulli distributions are often used to refer to distributions over categories of objects, so we do not usually assume that state 1 has numerical value 1, etc. For this reason, we do not usually need to compute the expectation or variance of multinoulli-distributed random variables.

The Bernoulli and multinoulli distributions are sufficient to describe any distribution over their domain. This is because they model discrete variables for which it is feasible to simply enumerate all of the states. When dealing with continuous variables, there are uncountably many states, so any distribution described by a small number of parameters must impose strict limits on the distribution.
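A sketch of the k − 1 parametrization described above (the probabilities are made up): the final state’s probability is recovered as 1 − 1⊤p, and sample frequencies match the full vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# k = 4 states, parametrized by the first k - 1 probabilities; the
# final state's probability is 1 - sum(p), as described in the text.
p = np.array([0.1, 0.2, 0.3])
full = np.append(p, 1.0 - p.sum())      # [0.1, 0.2, 0.3, 0.4]

# Draw multinoulli samples (one trial over k categories) and
# compare empirical frequencies with the parameter vector.
samples = rng.choice(len(full), size=100_000, p=full)
freqs = np.bincount(samples, minlength=len(full)) / len(samples)
print(freqs)
```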
3.9.3 Gaussian Distribution

The most commonly used distribution over real numbers is the normal distribution, also known as the Gaussian distribution:

    N(x; µ, σ²) = √(1/(2πσ²)) exp(−(1/(2σ²))(x − µ)²).    (3.21)

See Fig. 3.1 for a plot of the density function.

The two parameters µ ∈ R and σ ∈ (0, ∞) control the normal distribution. The parameter µ gives the coordinate of the central peak. This is also the mean of the distribution: E[x] = µ. The standard deviation of the distribution is given by σ, and the variance by σ².

When we evaluate the PDF, we need to square and invert σ. When we need to frequently evaluate the PDF with different parameter values, a more efficient way of parametrizing the distribution is to use a parameter β ∈ (0, ∞) to control the precision or inverse variance of the distribution:

    N(x; µ, β^(−1)) = √(β/(2π)) exp(−(1/2)β(x − µ)²).    (3.22)

Normal distributions are a sensible choice for many applications. In the absence of prior knowledge about what form a distribution over the real numbers should take, the normal distribution is a good default choice for two major reasons.

First, many distributions we wish to model are truly close to being normal distributions. The central limit theorem shows that the sum of many independent random variables is approximately normally distributed. This means that in
Figure 3.1: The normal distribution: The normal distribution N(x; µ, σ²) exhibits a classic “bell curve” shape, with the x coordinate of its central peak given by µ, and the width of its peak controlled by σ. The maximum is at x = µ, and the inflection points are at x = µ ± σ. In this example, we depict the standard normal distribution, with µ = 0 and σ = 1.
practice, many complicated systems can be modeled successfully as normally distributed noise, even if the system can be decomposed into parts with more structured behavior.

Second, out of all possible probability distributions with the same variance, the normal distribution encodes the maximum amount of uncertainty over the real numbers. We can thus think of the normal distribution as being the one that inserts the least amount of prior knowledge into a model. Fully developing and justifying this idea requires more mathematical tools, and is postponed to Sec. 19.4.2.

The normal distribution generalizes to R^n, in which case it is known as the multivariate normal distribution. It may be parametrized with a positive definite symmetric matrix Σ:

    N(x; µ, Σ) = √(1/((2π)^n det(Σ))) exp(−(1/2)(x − µ)⊤Σ^(−1)(x − µ)).    (3.23)

The parameter µ still gives the mean of the distribution, though now it is vector-valued. The parameter Σ gives the covariance matrix of the distribution.

As in the univariate case, when we wish to evaluate the PDF several times for
many different values of the parameters, the covariance is not a computationally efficient way to parametrize the distribution, since we need to invert Σ to evaluate the PDF. We can instead use a precision matrix β:

    N(x; µ, β^(−1)) = √(det(β)/(2π)^n) exp(−(1/2)(x − µ)⊤β(x − µ)).    (3.24)

We often fix the covariance matrix to be a diagonal matrix. An even simpler version is the isotropic Gaussian distribution, whose covariance matrix is a scalar times the identity matrix.
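A sketch of the precision parametrization of Eq. 3.24 (the function name and the 2-D parameter values are illustrative, not from the text): the matrix inverse is paid once, after which each evaluation needs only products.

```python
import numpy as np

def gaussian_pdf_precision(x, mu, beta):
    """Density of Eq. 3.24, parametrized by the precision matrix beta."""
    n = len(mu)
    d = x - mu
    norm = np.sqrt(np.linalg.det(beta) / (2 * np.pi) ** n)
    return norm * np.exp(-0.5 * d @ beta @ d)

# Illustrative 2-D parameters; the precision is the inverse covariance.
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
beta = np.linalg.inv(Sigma)   # invert once, then evaluate cheaply

x = np.array([0.5, 0.5])
print(gaussian_pdf_precision(x, mu, beta))
```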
3.9.4 Exponential and Laplace Distributions

In the context of deep learning, we often want to have a probability distribution with a sharp point at x = 0. To accomplish this, we can use the exponential distribution:

    p(x; λ) = λ 1_{x≥0} exp(−λx).    (3.25)

The exponential distribution uses the indicator function 1_{x≥0} to assign probability zero to all negative values of x.

A closely related probability distribution that allows us to place a sharp peak of probability mass at an arbitrary point µ is the Laplace distribution

    Laplace(x; µ, γ) = (1/(2γ)) exp(−|x − µ|/γ).    (3.26)

3.9.5 The Dirac Distribution and Empirical Distribution

In some cases, we wish to specify that all of the mass in a probability distribution clusters around a single point. This can be accomplished by defining a PDF using the Dirac delta function, δ(x):

    p(x) = δ(x − µ).    (3.27)

The Dirac delta function is defined such that it is zero-valued everywhere except 0, yet integrates to 1.
The Dirac delta function is not an ordinary function that associates each value x with a real-valued output; instead it is a different kind of mathematical object called a generalized function that is defined in terms of its properties when integrated. We can think of the Dirac delta function as being the limit point of a series of functions that put less and less mass on all points other than µ.
By defining p(x) to be δ shifted by −µ we obtain an infinitely narrow and infinitely high peak of probability mass where x = µ.

A common use of the Dirac delta distribution is as a component of an empirical distribution,

    p̂(x) = (1/m) Σ_{i=1}^{m} δ(x − x^(i))    (3.28)

which puts probability mass 1/m on each of the m points x^(1), . . . , x^(m) forming a given data set or collection of samples. The Dirac delta distribution is only necessary to define the empirical distribution over continuous variables. For discrete variables, the situation is simpler: an empirical distribution can be conceptualized as a multinoulli distribution, with a probability associated to each possible input value that is simply equal to the empirical frequency of that value in the training set.

We can view the empirical distribution formed from a dataset of training examples as specifying the distribution that we sample from when we train a model on this dataset. Another important perspective on the empirical distribution is that it is the probability density that maximizes the likelihood of the training data (see Sec. 5.5).
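Sampling from the empirical distribution of Eq. 3.28 amounts to drawing training points uniformly at random; a minimal sketch with a made-up dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up training set of m = 4 real-valued points (0.2 repeats).
data = np.array([-1.5, 0.2, 0.2, 3.0])

# Sampling from the empirical distribution of Eq. 3.28 means picking
# one of the m training points uniformly at random.
samples = rng.choice(data, size=100_000)

# Each distinct value occurs with its empirical frequency in the data.
print(np.mean(samples == 0.2))   # should be close to 2/4 = 0.5
```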
3.9.6 Mixtures of Distributions

It is also common to define probability distributions by combining other simpler probability distributions. One common way of combining distributions is to construct a mixture distribution. A mixture distribution is made up of several component distributions. On each trial, the choice of which component distribution generates the sample is determined by sampling a component identity from a multinoulli distribution:

    P(x) = Σ_i P(c = i)P(x | c = i)    (3.29)

where P(c) is the multinoulli distribution over component identities.

We have already seen one example of a mixture distribution: the empirical distribution over real-valued variables is a mixture distribution with one Dirac component for each training example.

The mixture model is one simple strategy for combining probability distributions to create a richer distribution. In Chapter 16, we explore the art of building complex probability distributions from simple ones in more detail.
The mixture model allows us to briefly glimpse a concept that will be of paramount importance later: the latent variable. A latent variable is a random variable that we cannot observe directly. The component identity variable c of the mixture model provides an example. Latent variables may be related to x through the joint distribution, in this case, P(x, c) = P(x | c)P(c). The distribution P(c) over the latent variable and the distribution P(x | c) relating the latent variable to the visible variables determines the shape of the distribution P(x) even though it is possible to describe P(x) without reference to the latent variable. Latent variables are discussed further in Sec. 16.5.

A very powerful and common type of mixture model is the Gaussian mixture model, in which the components p(x | c = i) are Gaussians. Each component has a separately parametrized mean µ^(i) and covariance Σ^(i). Some mixtures can have more constraints. For example, the covariances could be shared across components via the constraint Σ^(i) = Σ ∀i. As with a single Gaussian distribution, the mixture of Gaussians might constrain the covariance matrix for each component to be diagonal or isotropic.

In addition to the means and covariances, the parameters of a Gaussian mixture specify the prior probability α_i = P(c = i) given to each component i. The word “prior” indicates that it expresses the model’s beliefs about c before it has observed x. By comparison, P(c | x) is a posterior probability, because it is computed after observation of x. A Gaussian mixture model is a universal approximator of densities, in the sense that any smooth density can be approximated with any specific, non-zero amount of error by a Gaussian mixture model with enough components.

Fig. 3.2 shows samples from a Gaussian mixture model.
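Generating samples like those in Fig. 3.2 uses ancestral sampling: first draw the component identity c from the multinoulli prior α, then draw x from the chosen Gaussian. A 1-D sketch (all parameter values below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 1-D Gaussian mixture with three components.
alpha = np.array([0.3, 0.4, 0.3])   # prior P(c = i) over components
mu = np.array([-4.0, 0.0, 4.0])     # component means
sigma = np.array([1.0, 0.5, 1.0])   # component standard deviations

def sample_gmm(n):
    """Ancestral sampling: draw c ~ P(c), then x ~ p(x | c)."""
    c = rng.choice(len(alpha), size=n, p=alpha)
    return rng.normal(mu[c], sigma[c])

x = sample_gmm(200_000)
# The mixture mean is sum_i alpha_i mu_i = 0 for these parameters.
print(x.mean())
```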
3.10 Useful Properties of Common Functions

Certain functions arise often while working with probability distributions, especially the probability distributions used in deep learning models.

One of these functions is the logistic sigmoid:

    σ(x) = 1/(1 + exp(−x)).    (3.30)

The logistic sigmoid is commonly used to produce the φ parameter of a Bernoulli distribution because its range is (0, 1), which lies within the valid range of values for the φ parameter. See Fig. 3.3 for a graph of the sigmoid function. The sigmoid
Figure 3.2: Samples from a Gaussian mixture model. In this example, there are three components. From left to right, the first component has an isotropic covariance matrix, meaning it has the same amount of variance in each direction. The second has a diagonal covariance matrix, meaning it can control the variance separately along each axis-aligned direction. This example has more variance along the x_2 axis than along the x_1 axis. The third component has a full-rank covariance matrix, allowing it to control the variance separately along an arbitrary basis of directions.
function satur saturates ates when its argument is very positiv ositivee or very negative, meaning that the function becomes very flat and insensitiv insensitivee to small changes in its input. function saturates when its argument is very positive or very negative, meaning Another commonly encountered function is the softplus function (Dugas et al., that the function becomes very flat and insensitive to small changes in its input. 2001 2001): ): Another commonly encountered function is the function (Dugas(3.31) et al., ζ (x) = log (1 + exp( x))softplus . 2001): The softplus function can be useful producing the ζ (x) =for logpro (1 ducing + exp(x )) .β or σ parameter of a normal (3.31) distribution because its range is (0, ∞ ). It also arises commonly when manipulating β or σ The softplusin function be useful producing parameter of afrom normal expressions inv volving can sigmoids. Theforname of the the softplus function comes the , distribution b ecause its range is (0 ) . It also arises commonly when manipulating fact that it is a smo smoothed othed or “softened” version of expressions involving sigmoids. The ∞name of the softplus function comes from the x+ = max(0 , x)of . (3.32) fact that it is a smoothed or “softened” version See Fig. 3.4 for a graph of the softplus function. x = max(0 , x). (3.32) following properties erties aresoftplus all useful enough that you may wish to memorize See The Fig. follo 3.4 wing for a prop graph of the function. them: The following properties are all useful enough that you may wish to memorize them: exp(x) σ(x) = (3.33) exp(x) + exp(0) exp(x) σd(x) = (3.33) exp( ) +− exp(0) σ(x) = σ(x)(1 σ(x)) (3.34) dx d σ(x) = σ68 (x)(1 σ(x)) (3.34) dx −
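The identities in Eqs. 3.33 and 3.34 are easy to confirm numerically. The sketch below (illustrative, not from the text) checks Eq. 3.33 directly and Eq. 3.34 against a finite-difference derivative:

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))    # logistic sigmoid, Eq. 3.30

def zeta(x):
    return np.log1p(np.exp(x))         # softplus, Eq. 3.31

x = np.linspace(-5, 5, 101)

# Eq. 3.33: sigma(x) = exp(x) / (exp(x) + exp(0))
assert np.allclose(sigma(x), np.exp(x) / (np.exp(x) + np.exp(0)))

# Eq. 3.34: d/dx sigma(x) = sigma(x)(1 - sigma(x)),
# checked against a central finite difference.
h = 1e-6
num_grad = (sigma(x + h) - sigma(x - h)) / (2 * h)
assert np.allclose(num_grad, sigma(x) * (1 - sigma(x)), atol=1e-6)

# Softplus lies above the positive part x+ = max(0, x) of Eq. 3.32.
assert np.all(zeta(x) >= np.maximum(0, x))
print("all checks passed")
```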
Figure 3.3: The logistic sigmoid function.
Figure 3.4: The softplus function.
1 − σ(x) = σ(−x)    (3.35)

log σ(x) = −ζ(−x)    (3.36)

d/dx ζ(x) = σ(x)    (3.37)

∀x ∈ (0, 1), σ⁻¹(x) = log ( x / (1 − x) )    (3.38)

∀x > 0, ζ⁻¹(x) = log (exp(x) − 1)    (3.39)

ζ(x) = ∫_{−∞}^{x} σ(y) dy    (3.40)

ζ(x) − ζ(−x) = x    (3.41)
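Several of these identities can be spot-checked numerically; the sketch below (assuming nothing beyond NumPy) verifies Eqs. 3.35, 3.36, 3.38, 3.39 and 3.41 on a grid:

```python
import numpy as np

def sigma(x): return 1.0 / (1.0 + np.exp(-x))   # logistic sigmoid
def zeta(x):  return np.log1p(np.exp(x))        # softplus

x = np.linspace(-4, 4, 81)
assert np.allclose(1 - sigma(x), sigma(-x))            # Eq. 3.35
assert np.allclose(np.log(sigma(x)), -zeta(-x))        # Eq. 3.36

p = np.linspace(0.01, 0.99, 50)
assert np.allclose(sigma(np.log(p / (1 - p))), p)      # Eq. 3.38: logit inverts sigma

y = np.linspace(0.1, 4, 40)
assert np.allclose(zeta(np.log(np.expm1(y))), y)       # Eq. 3.39: softplus inverse

assert np.allclose(zeta(x) - zeta(-x), x)              # Eq. 3.41
print("identities hold numerically")
```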
The function σ⁻¹(x) is called the logit in statistics, but this term is more rarely used in machine learning.

Eq. 3.41 provides extra justification for the name "softplus." The softplus function is intended as a smoothed version of the positive part function, x⁺ = max{0, x}. The positive part function is the counterpart of the negative part function, x⁻ = max{0, −x}. To obtain a smooth function that is analogous to the negative part, one can use ζ(−x). Just as x can be recovered from its positive and negative parts via the identity x⁺ − x⁻ = x, it is also possible to recover x using the same relationship between ζ(x) and ζ(−x), as shown in Eq. 3.41.

3.11 Bayes' Rule

We often find ourselves in a situation where we know P(y | x) and need to know P(x | y). Fortunately, if we also know P(x), we can compute the desired quantity using Bayes' rule:

P(x | y) = P(x) P(y | x) / P(y).    (3.42)

Note that while P(y) appears in the formula, it is usually feasible to compute P(y) = Σ_x P(y | x) P(x), so we do not need to begin with knowledge of P(y).

Bayes' rule is straightforward to derive from the definition of conditional probability, but it is useful to know the name of this formula since many texts refer to it by name. It is named after the Reverend Thomas Bayes, who first discovered a special case of the formula. The general version presented here was independently discovered by Pierre-Simon Laplace.
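A minimal worked example of Eq. 3.42 with made-up numbers, computing P(y) by marginalization as described above:

```python
# Suppose x ∈ {0, 1} with prior P(x), and a likelihood P(y = 1 | x).
# All numbers here are hypothetical, chosen only to illustrate the arithmetic.
P_x = {0: 0.8, 1: 0.2}
P_y1_given_x = {0: 0.1, 1: 0.9}

# P(y = 1) = sum_x P(y = 1 | x) P(x): no separate knowledge of P(y) needed.
P_y1 = sum(P_y1_given_x[x] * P_x[x] for x in P_x)   # 0.08 + 0.18 = 0.26

# Posterior P(x = 1 | y = 1) via Bayes' rule, Eq. 3.42.
posterior = P_x[1] * P_y1_given_x[1] / P_y1
print(posterior)  # 0.18 / 0.26 ≈ 0.692
```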
3.12 Technical Details of Continuous Variables

A proper formal understanding of continuous random variables and probability density functions requires developing probability theory in terms of a branch of mathematics known as measure theory. Measure theory is beyond the scope of this textbook, but we can briefly sketch some of the issues that measure theory is employed to resolve.

In Sec. 3.3.2, we saw that the probability of a continuous vector-valued x lying in some set S is given by the integral of p(x) over the set S. Some choices of set S can produce paradoxes. For example, it is possible to construct two sets S_1 and S_2 such that p(x ∈ S_1) + p(x ∈ S_2) > 1 but S_1 ∩ S_2 = ∅. These sets are generally constructed making very heavy use of the infinite precision of real numbers, for example by making fractal-shaped sets or sets that are defined by transforming the set of rational numbers.²
One of the key contributions of measure theory is to provide a characterization of the set of sets that we can compute the probability of without encountering paradoxes. In this book, we only integrate over sets with relatively simple descriptions, so this aspect of measure theory never becomes a relevant concern.

For our purposes, measure theory is more useful for describing theorems that apply to most points in ℝ^n but do not apply to some corner cases. Measure theory provides a rigorous way of describing that a set of points is negligibly small. Such a set is said to have "measure zero." We do not formally define this concept in this textbook. However, it is useful to understand the intuition that a set of measure zero occupies no volume in the space we are measuring. For example, within ℝ², a line has measure zero, while a filled polygon has positive measure. Likewise, an individual point has measure zero.
Any union of countably many sets that each have measure zero also has measure zero (so the set of all the rational numbers has measure zero, for instance).

Another useful term from measure theory is "almost everywhere." A property that holds almost everywhere holds throughout all of space except for on a set of measure zero. Because the exceptions occupy a negligible amount of space, they can be safely ignored for many applications. Some important results in probability theory hold for all discrete values but only hold "almost everywhere" for continuous values.

Another technical detail of continuous variables relates to handling continuous random variables that are deterministic functions of one another. Suppose we have two random variables, x and y, such that y = g(x), where g is an invertible, continuous, differentiable transformation. One might expect that p_y(y) = p_x(g⁻¹(y)). This is actually not the case.

As a simple example, suppose we have scalar random variables x and y. Suppose y = x/2 and x ∼ U(0, 1). If we use the rule p_y(y) = p_x(2y) then p_y will be 0 everywhere except the interval [0, 1/2], and it will be 1 on this interval. This means

∫ p_y(y) dy = 1/2,    (3.43)

which violates the definition of a probability distribution.

This common mistake is wrong because it fails to account for the distortion of space introduced by the function g. Recall that the probability of x lying in an infinitesimally small region with volume δx is given by p(x)δx. Since g can expand or contract space, the infinitesimal volume surrounding x in x space may have different volume in y space.

To see how to correct the problem, we return to the scalar case. We need to preserve the property

|p_y(g(x)) dy| = |p_x(x) dx|.    (3.44)

Solving from this, we obtain

p_y(y) = p_x(g⁻¹(y)) |∂x/∂y|    (3.45)

or equivalently

p_x(x) = p_y(g(x)) |∂g(x)/∂x|.    (3.46)

In higher dimensions, the derivative generalizes to the determinant of the Jacobian matrix, the matrix with J_{i,j} = ∂x_i/∂y_j. Thus, for real-valued vectors x and y,

p_x(x) = p_y(g(x)) |det (∂g(x)/∂x)|.    (3.47)

² The Banach-Tarski theorem provides a fun example of such sets.
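The U(0, 1) example above can be reproduced numerically. The sketch below (with hypothetical helper names) integrates both the naive rule and the corrected density of Eq. 3.45 on a grid:

```python
import numpy as np

# y = x/2 with x ~ U(0, 1), so y lives on [0, 1/2] and |dx/dy| = 2.
y = np.linspace(0.0, 0.5, 100001)[1:-1]   # interior grid points of [0, 1/2]
dy = 0.5 / 100000

p_x = lambda x: np.where((x >= 0) & (x <= 1), 1.0, 0.0)  # U(0, 1) density
g_inv = lambda y: 2.0 * y                                # x = g^{-1}(y) = 2y

naive = p_x(g_inv(y))              # forgets the volume distortion of g
correct = p_x(g_inv(y)) * 2.0      # multiplies by |dx/dy| = 2, Eq. 3.45

print(round(float((naive * dy).sum()), 3))    # ≈ 0.5: violates normalization, Eq. 3.43
print(round(float((correct * dy).sum()), 3))  # ≈ 1.0: a proper density
```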
3.13 Information Theory

Information theory is a branch of applied mathematics that revolves around quantifying how much information is present in a signal. It was originally invented to study sending messages from discrete alphabets over a noisy channel, such as communication via radio transmission. In this context, information theory tells how to design optimal codes and calculate the expected length of messages sampled from specific probability distributions using various encoding schemes. In the context of machine learning, we can also apply information theory to continuous variables where some of these message length interpretations do not apply. This field is fundamental to many areas of electrical engineering and computer science. In this textbook, we mostly use a few key ideas from information theory to characterize probability distributions or quantify similarity between probability distributions. For more detail on information theory, see Cover and Thomas (2006) or MacKay (2003).

The basic intuition behind information theory is that learning that an unlikely event has occurred is more informative than learning that a likely event has occurred. A message saying "the sun rose this morning" is so uninformative as to be unnecessary to send, but a message saying "there was a solar eclipse this morning" is very informative.

We would like to quantify information in a way that formalizes this intuition. Specifically,

• Likely events should have low information content, and in the extreme case, events that are guaranteed to happen should have no information content whatsoever.

• Less likely events should have higher information content.

• Independent events should have additive information. For example, finding out that a tossed coin has come up as heads twice should convey twice as much information as finding out that a tossed coin has come up as heads once.

In order to satisfy all three of these properties, we define the self-information of an event x = x to be

I(x) = − log P(x).    (3.48)

In this book, we always use log to mean the natural logarithm, with base e. Our definition of I(x) is therefore written in units of nats. One nat is the amount of information gained by observing an event of probability 1/e. Other texts use base-2 logarithms and units called bits or shannons; information measured in bits is just a rescaling of information measured in nats.

When x is continuous, we use the same definition of information by analogy, but some of the properties from the discrete case are lost. For example, an event with unit density still has zero information, despite not being an event that is guaranteed to occur.
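The three desired properties are visible directly in the definition of Eq. 3.48; a small sketch in nats (illustrative only):

```python
import math

# Self-information I(x) = -log P(x), Eq. 3.48, in nats (natural log).
def self_information(p):
    return -math.log(p)

# A guaranteed event carries no information.
print(self_information(1.0) == 0.0)   # True

# Independent events add: two heads from a fair coin carry twice
# the information of one head, since P = 0.5 * 0.5.
one_head = self_information(0.5)
two_heads = self_information(0.5 * 0.5)
print(two_heads / one_head)           # ≈ 2.0

# One nat: the information in observing an event of probability 1/e.
print(self_information(1 / math.e))   # ≈ 1.0
```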
Figure 3.5: This plot shows how distributions that are closer to deterministic have low Shannon entropy while distributions that are close to uniform have high Shannon entropy. On the horizontal axis, we plot p, the probability of a binary random variable being equal to 1. The entropy is given by (p − 1) log(1 − p) − p log p. When p is near 0, the distribution is nearly deterministic, because the random variable is nearly always 0. When p is near 1, the distribution is nearly deterministic, because the random variable is nearly always 1. When p = 0.5, the entropy is maximal, because the distribution is uniform over the two outcomes.
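The entropy formula in the caption of Fig. 3.5 can be checked numerically; a small sketch (not from the text) confirming the maximum at p = 0.5:

```python
import numpy as np

# Binary Shannon entropy in nats: (p - 1) log(1 - p) - p log p.
def binary_entropy(p):
    return (p - 1) * np.log(1 - p) - p * np.log(p)

p = np.linspace(0.01, 0.99, 99)
H = binary_entropy(p)
print(p[np.argmax(H)])             # ≈ 0.5: entropy peaks at the uniform distribution
print(float(binary_entropy(0.5)))  # log(2) ≈ 0.693 nats
```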
Self-information deals only with a single outcome. We can quantify the amount of uncertainty in an entire probability distribution using the Shannon entropy:

H(x) = E_{x∼P}[I(x)] = −E_{x∼P}[log P(x)],    (3.49)

also denoted H(P). In other words, the Shannon entropy of a distribution is the expected amount of information in an event drawn from that distribution. It gives a lower bound on the number of bits (if the logarithm is base 2, otherwise the units are different) needed on average to encode symbols drawn from a distribution P. Distributions that are nearly deterministic (where the outcome is nearly certain) have low entropy; distributions that are closer to uniform have high entropy. See Fig. 3.5 for a demonstration. When x is continuous, the Shannon entropy is known as the differential entropy.

If we have two separate probability distributions P(x) and Q(x) over the same random variable x, we can measure how different these two distributions are using the Kullback-Leibler (KL) divergence:

D_KL(P‖Q) = E_{x∼P}[log (P(x)/Q(x))] = E_{x∼P}[log P(x) − log Q(x)].    (3.50)
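For discrete distributions the expectation in Eq. 3.50 is a finite sum, so the KL divergence and its properties are easy to check; the two distributions below are made up for illustration:

```python
import numpy as np

# Discrete KL divergence, Eq. 3.50, in nats.
def kl(P, Q):
    return float(np.sum(P * (np.log(P) - np.log(Q))))

P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.3, 0.4, 0.3])

print(kl(P, P))                        # 0.0: zero iff the distributions are equal
print(kl(P, Q) >= 0)                   # True: non-negativity
print(np.isclose(kl(P, Q), kl(Q, P)))  # False: asymmetric in general
```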
In the case of discrete variables, it is the extra amount of information (measured in bits if we use the base 2 logarithm, but in machine learning we usually use nats and the natural logarithm) needed to send a message containing symbols drawn from probability distribution P, when we use a code that was designed to minimize the length of messages drawn from probability distribution Q.

The KL divergence has many useful properties, most notably that it is non-negative. The KL divergence is 0 if and only if P and Q are the same distribution in the case of discrete variables, or equal "almost everywhere" in the case of continuous variables. Because the KL divergence is non-negative and measures the difference between two distributions, it is often conceptualized as measuring some sort of distance between these distributions. However, it is not a true distance measure because it is not symmetric: D_KL(P‖Q) ≠ D_KL(Q‖P) for some P and Q. This asymmetry means that there are important consequences to the choice of whether to use D_KL(P‖Q) or D_KL(Q‖P). See Fig. 3.6 for more detail.

A quantity that is closely related to the KL divergence is the cross-entropy H(P, Q) = H(P) + D_KL(P‖Q), which is similar to the KL divergence but lacking the term on the left:

H(P, Q) = −E_{x∼P} log Q(x).    (3.51)

Minimizing the cross-entropy with respect to Q is equivalent to minimizing the KL divergence, because Q does not participate in the omitted term.

When computing many of these quantities, it is common to encounter expressions of the form 0 log 0. By convention, in the context of information theory, we treat these expressions as lim_{x→0} x log x = 0.
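The identity H(P, Q) = H(P) + D_KL(P‖Q) and the 0 log 0 = 0 convention can both be exercised in a short sketch (the helper `xlogy` is a hypothetical name introduced here):

```python
import numpy as np

def xlogy(x, y):
    # Treat 0 log 0 as 0, per the information-theory convention.
    return np.where(x == 0, 0.0, x * np.log(np.where(x == 0, 1.0, y)))

def entropy(P):
    return float(-np.sum(xlogy(P, P)))          # Eq. 3.49

def kl(P, Q):
    return float(np.sum(xlogy(P, P) - xlogy(P, Q)))  # Eq. 3.50

def cross_entropy(P, Q):
    return float(-np.sum(xlogy(P, Q)))          # Eq. 3.51

P = np.array([0.5, 0.5, 0.0])   # includes a zero-probability outcome
Q = np.array([0.25, 0.5, 0.25])

# Identity: H(P, Q) = H(P) + D_KL(P||Q)
print(np.isclose(cross_entropy(P, Q), entropy(P) + kl(P, Q)))  # True
```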
3.14 Structured Probabilistic Models

Machine learning algorithms often involve probability distributions over a very large number of random variables. Often, these probability distributions involve direct interactions between relatively few variables. Using a single function to describe the entire joint probability distribution can be very inefficient (both computationally and statistically).

Instead of using a single function to represent a probability distribution, we can split a probability distribution into many factors that we multiply together. For example, suppose we have three random variables: a, b and c. Suppose that a influences the value of b and b influences the value of c, but that a and c are independent given b. We can represent the probability distribution over all three
Figure 3.6: The KL divergence is asymmetric. Suppose we ha hav ve a distribution p(x) and wish to approximate it with another distribution q ( x). We hav havee the choice of minimizing p(x) and Figure D 3.6: divergence iseasymmetric. havcehoice a distribution either or D illustrate theSuppose effect ofwethis using a mixture of pkq) KL KL (The KL ( q kp). W ) q ( x wish to approximate it with another distribution . W e hav e the c hoice of minimizing two Gaussians for p, and a single Gaussian for q . The choice of whic which h direction of the D (p qto ( q p). We illustrate ) or either the effect of this crequire hoice using a mixture of KL divergence useDis problem-dep problem-dependen enden endent. t. Some applications an approximation two Gaussians for p,high and aprobabilit for qthat . Thethe choice which direction the k single Gaussian that usually kplaces probability y anywhere true ofdistribution placesofhigh KL divergence to use is problem-dep enden t. Some applications require an approximation probabilit probability y, while other applications require an appro approximation ximation that rarely places high that usually places y anywhereplaces that the distribution placesofhigh probabilit probability y an anywhere ywherehigh that probabilit the true distribution low true probabilit probability y. The choice the probabilitofy, the while applications requireofan appro ximation that rarely places direction KLother div divergence ergence reflects which these considerations takes priorit priority y for high eac each h probabilit y an ywhere that the true distribution places low probabilit y . The choice of the application. (L (Left) eft) The effect of minimizing DKL ( pkq). In this case, we select a q that has direction of theyKL divergence reflects which yof. these priorit y for p has high p has multipletakes q cho high probabilit probability where probabilit probability Whenconsiderations mo modes, des, hooses oseseac toh (Left) ( p q). 
yInmass q that The application. The effect of minimizing this on case, select a(Right) has blur the modes together, in order to put highDprobabilit probability allwofe them. p (has high probabilit y where . When multiple moprobability des, q cho oses to k pa has DKL q that effect of minimizing . In probabilit this case, ywe select has low where q kp)high (Right) together, in order tomput highmo probabilit mass on all of The hasthe lo low wmodes probabilit probability y. When ultiple modes des thaty are sufficien sufficiently tlythem. widely separated, pblur p has q that ahas ). In this q pergence effect of minimizing case, we select low mo probability where as in this figure, theDKL (div divergence is minimized by cahoosing single mode, de, in order to p p has lo w probabilit y . When has m ultiple mo des that are sufficien tly widely separated, k avoid putting probability mass in the low-probabilit low-probability y areas b et etw ween mo modes des of p. Here, we as in this the figure, the KL divergence is minimized by choosing a singleWmo de, inalso order to q is chosen illustrate outcome when to emphasize the left mode. e could hav have e avhiev oid putting probability low-probabilit y areas bthe etwright een mo des ofIfpthe . Here, we ac achiev hieved ed an equal value ofmass the in KLthe div divergence ergence by choosing mode. mo modes des illustrate the outcome when qnis emphasize the left mode. We could alsoofhav are not separated by a sufficie sufficien tlychosen strongto lo low w probabilit probability y region, then this direction thee ac hiev ed an equal v alue of the KL div ergence b y c hoosing the right mode. If the mo des KL div divergence ergence can still choose to blur the mo modes. des. are not separated by a sufficiently strong low probability region, then this direction of the KL divergence can still choose to blur the mo des.
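The asymmetry is easy to verify numerically for discrete distributions. Below is a minimal sketch; the vectors p and q are arbitrary illustrative distributions (a bimodal p and a q concentrated on one mode), not the Gaussian mixtures shown in the figure:

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) for discrete distributions given as probability vectors."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# A "bimodal" p, and a q that puts most of its mass on only one of p's modes.
p = np.array([0.45, 0.05, 0.05, 0.45])
q = np.array([0.80, 0.10, 0.05, 0.05])

print(kl(p, q))  # D_KL(p || q)
print(kl(q, p))  # D_KL(q || p): a different value, since KL is asymmetric
```

Both values are non-negative and zero only when the two distributions are equal, but in general kl(p, q) != kl(q, p), which is why the two minimization objectives in the figure behave so differently.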
CHAPTER 3. PROBABILITY AND INFORMATION THEORY
variables as a product of probability distributions over two variables:

p(a, b, c) = p(a)p(b | a)p(c | b).    (3.52)
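This factorization can be checked numerically. The sketch below verifies the general chain rule, p(a, b, c) = p(a)p(b | a)p(c | a, b), for an arbitrary joint over three binary variables; the simpler factor p(c | b) appearing in Eq. 3.52 additionally assumes that c is conditionally independent of a given b:

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary joint distribution p(a, b, c) over three binary variables.
joint = rng.random((2, 2, 2))
joint /= joint.sum()

p_a = joint.sum(axis=(1, 2))                    # p(a)
p_b_given_a = joint.sum(axis=2) / p_a[:, None]  # p(b | a)
# For a generic joint we need the full conditional p(c | a, b); the form
# p(c | b) in Eq. 3.52 holds only under the conditional independence above.
p_c_given_ab = joint / joint.sum(axis=2, keepdims=True)

reconstructed = p_a[:, None, None] * p_b_given_a[:, :, None] * p_c_given_ab
print(np.allclose(reconstructed, joint))  # True: the chain rule is exact
```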
These factorizations can greatly reduce the number of parameters needed to describe the distribution. Each factor uses a number of parameters that is exponential in the number of variables in the factor. This means that we can greatly reduce the cost of representing a distribution if we are able to find a factorization into distributions over fewer variables.

We can describe these kinds of factorizations using graphs. Here we use the word "graph" in the sense of graph theory: a set of vertices that may be connected to each other with edges. When we represent the factorization of a probability distribution with a graph, we call it a structured probabilistic model or graphical model.

There are two main kinds of structured probabilistic models: directed and undirected. Both kinds of graphical models use a graph G in which each node in the graph corresponds to a random variable, and an edge connecting two
random variables means that the probability distribution is able to represent direct interactions between those two random variables.

Directed models use graphs with directed edges, and they represent factorizations into conditional probability distributions, as in the example above. Specifically, a directed model contains one factor for every random variable x_i in the distribution, and that factor consists of the conditional distribution over x_i given the parents of x_i, denoted Pa_G(x_i):

p(x) = ∏_i p(x_i | Pa_G(x_i)).    (3.53)

See Fig. 3.7 for an example of a directed graph and the factorization of probability distributions it represents.

Undirected models use graphs with undirected edges, and they represent factorizations into a set of functions; unlike in the directed case, these functions are usually not probability distributions of any kind.
edges, An Any y set of they no nodes desrepresen that are all torizations to into a set of functions; unlik in the Each directed case, functions are ) in an connected each other in G is called a eclique. clique undirected C (ithese usually probability kind. factors Any setare of just nodes that arenot all ) C (iany mo model del isnot asso associated ciated withdistributions a factor φ(i) (of ). These functions, connectedytodistributions. each other inThe is output called aofclique. Each clique in an undirected probabilit probability each factor must be non-negative, but φ ( mo del is asso ciated with a factor ) . These factors are just functions, not G C there is no constraint that the factor must sum or integrate to 1 like a probability probability distributions. The outputCof each factor must be non-negative, but distribution. there is no constraint that the factor must sum or integrate to 1 like a probability The probability of a configuration of random variables is pr prop op oportional ortional to the distribution. pro product duct of all of these factors—assignments that result in larger factor values are The probability of a configuration of random variables is proportional to the 77 product of all of these factors—assignments that result in larger factor values are
Figure 3.7: A directed graphical model over random variables a, b, c, d and e. This graph corresponds to probability distributions that can be factored as

p(a, b, c, d, e) = p(a)p(b | a)p(c | a, b)p(d | b)p(e | c).    (3.54)

This graph allows us to quickly see some properties of the distribution. For example, a and c interact directly, but a and e interact only indirectly via c.
more likely. Of course, there is no guarantee that this product will sum to 1. We therefore divide by a normalizing constant Z, defined to be the sum or integral over all states of the product of the φ functions, in order to obtain a normalized probability distribution:

p(x) = (1/Z) ∏_i φ^(i)(C^(i)).    (3.55)

See Fig. 3.8 for an example of an undirected graph and the factorization of probability distributions it represents.

Keep in mind that these graphical representations of factorizations are a language for describing probability distributions. They are not mutually exclusive families of probability distributions. Being directed or undirected is not a property of a probability distribution; it is a property of a particular description of a probability distribution, but any probability distribution may be described in both ways.
Throughout Part I and Part II of this book, we will use structured probabilistic models merely as a language to describe which direct probabilistic relationships different machine learning algorithms choose to represent. No further understanding of structured probabilistic models is needed until the discussion of research topics, in Part III, where we will explore structured probabilistic models in much greater detail.
Figure 3.8: An undirected graphical model over random variables a, b, c, d and e. This graph corresponds to probability distributions that can be factored as

p(a, b, c, d, e) = (1/Z) φ^(1)(a, b, c) φ^(2)(b, d) φ^(3)(c, e).    (3.56)

This graph allows us to quickly see some properties of the distribution. For example, a and c interact directly, but a and e interact only indirectly via c.
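For small discrete models, the normalizing constant Z can be computed by brute-force enumeration of all joint states. A minimal sketch using the factor structure of Eq. 3.56, with arbitrary non-negative factor values over binary variables:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Non-negative factors mirroring Eq. 3.56: phi1(a, b, c), phi2(b, d),
# phi3(c, e). The factor values here are arbitrary illustrative numbers.
phi1 = rng.random((2, 2, 2))
phi2 = rng.random((2, 2))
phi3 = rng.random((2, 2))

def unnormalized(a, b, c, d, e):
    return phi1[a, b, c] * phi2[b, d] * phi3[c, e]

# Z is the sum over all joint states of the product of the factors.
states = list(itertools.product([0, 1], repeat=5))
Z = sum(unnormalized(*s) for s in states)

p = {s: unnormalized(*s) / Z for s in states}
print(sum(p.values()))  # 1.0 (up to floating-point error): p is normalized
```

This enumeration has cost exponential in the number of variables, which is why computing Z exactly is intractable for large models.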
This chapter has reviewed the basic concepts of probability theory that are most relevant to deep learning. One more set of fundamental mathematical tools remains: numerical methods.
Chapter 4
Numerical Computation

Machine learning algorithms usually require a high amount of numerical computation. This typically refers to algorithms that solve mathematical problems by methods that update estimates of the solution via an iterative process, rather than analytically deriving a formula providing a symbolic expression for the correct solution. Common operations include optimization (finding the value of an argument that minimizes or maximizes a function) and solving systems of linear equations. Even just evaluating a mathematical function on a digital computer can be difficult when the function involves real numbers, which cannot be represented precisely using a finite amount of memory.
4.1 Overflow and Underflow

The fundamental difficulty in performing continuous math on a digital computer is that we need to represent infinitely many real numbers with a finite number of bit patterns. This means that for almost all real numbers, we incur some approximation error when we represent the number in the computer. In many cases, this is just rounding error. Rounding error is problematic, especially when it compounds across many operations, and can cause algorithms that work in theory to fail in practice if they are not designed to minimize the accumulation of rounding error.

One form of rounding error that is particularly devastating is underflow. Underflow occurs when numbers near zero are rounded to zero. Many functions behave qualitatively differently when their argument is zero rather than a small positive number. For example, we usually want to avoid division by zero (some software
CHAPTER 4. NUMERICAL COMPUTATION
environments will raise exceptions when this occurs, others will return a result with a placeholder not-a-number value) or taking the logarithm of zero (this is usually treated as −∞, which then becomes not-a-number if it is used for many further arithmetic operations).

Another highly damaging form of numerical error is overflow. Overflow occurs when numbers with large magnitude are approximated as ∞ or −∞. Further arithmetic will usually change these infinite values into not-a-number values.

One example of a function that must be stabilized against underflow and overflow is the softmax function. The softmax function is often used to predict the probabilities associated with a multinoulli distribution. The softmax function is defined to be

softmax(x)_i = exp(x_i) / Σ_{j=1}^n exp(x_j).    (4.1)

Consider what happens when all of the x_i are equal to some constant c. Analytically, we can see that all of the outputs should be equal to 1/n. Numerically, this may
Numerically Numerically,, this may x Consider what happ ens when all of the are equal to some constan t c. Analytically exp exp((c) will, not occur when c has large magnitude. If c is very negativ negative, e, then w e can see that means all of the should be equal to will . Numerically , this underflo underflow. w. This the outputs denominator of the softmax become 0, so the may final c c exp ( c not occur when has large magnitude. If is v ery negativ e, then ) will exp((c) will ov result is undefined. When c is very large P and positiv ositive, e, exp overflo erflo erflow, w, again underflow.inThis means the denominator of theundefined. softmax will become 0, so the final resulting the expression as a whole being Both of these difficulties c exp ( c result is undefined. When is very large and p ositiv e, ) will ov erflo w, again softmax((z ) where z = x − max i xi . Simple can be resolved by instead ev evaluating aluating softmax resultingshows in the expression whole beingfunction undefined. of these difficulties algebra that the valueasofathe softmax is notBoth changed analytically by softmax z ) where z = x max can be resolved by instead evaluating . Simple maxx x adding or subtracting a scalar from the input (vector. Subtracting results i i algebra shows argument that the value of bthe softmax function is not by − analytically in the largest to exp eing 0, whic which h rules out thechanged possibility of ov overflo erflo erflow. w. max x adding or subtracting a scalar from the input vector. Subtracting results Lik Likewise, ewise, at least one term in the denominator has a value of 1, which rules out in the largest argument to exp being 0, which rules out the ossibilityby of zero. overflow. the possibility of underflow in the denominator leading to apdivision Likewise, at least one term in the denominator has a value of 1, which rules out There is still one small problem. 
Underflow in the numerator can still cause the possibility of underflow in the denominator leading to a division by zero. the expression as a whole to ev evaluate aluate to zero. This means that if we implement There is still one small problem. Underflow in the numerator can cause log softmax softmax((x) by first running the softmax subroutine then passing thestill result to the expression as waewhole evaluate to zero. −∞ This means that if we implement implement the log function, could to erroneously obtain . Instead, we must softmaxfunction (x) by first running the softmax subroutine then passing thewa result to alogseparate that calculates in a numerically stable way y. The log softmax the softmax log function, we could obtain . Instead, log function can beerroneously stabilized using the same trick aswe we must used implement to stabilize a separate function that calculates in a n umerically stable way. The log softmax −∞ the softmax function. log softmax function can be stabilized using the same trick as we used to stabilize For the most part, we do not explicitly detail all of the numerical considerations the softmax function. in inv volv olved ed in implementing the various algorithms describ described ed in this book. Developers F or the most part, we do not explicitly detail all of the numerical of low-lev low-level el libraries should keep numerical issues in mind when considerations implementing in v olv ed in implementing the v arious algorithms describ ed in this b o ok. Developers deep learning algorithms. Most readers of this book can simply rely on lowof low-lev el libraries should k eep numerical issues in mind when implementing lev level el libraries that provide stable implementations. In some cases, it is possible deep learning aalgorithms. Most of this ook can simplyautomatically rely on lowto implement new algorithm andreaders hav havee the new b implementation level libraries that provide stable implementations. 
In some cases, it is possible to implement a new algorithm and hav81 e the new implementation automatically
stabilized. Theano (Bergstra et al., 2010; Bastien et al., 2012) is an example of a software package that automatically detects and stabilizes many common numerically unstable expressions that arise in the context of deep learning.
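The stabilization trick described above can be sketched in a few lines; this is a minimal illustration, not the implementation used by any particular library:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax: shift by max(x) before exponentiating."""
    z = x - np.max(x)  # largest argument to exp is now 0, ruling out overflow
    e = np.exp(z)
    return e / np.sum(e)

def log_softmax(x):
    """Stable log softmax: never exponentiate and then take the log."""
    z = x - np.max(x)
    return z - np.log(np.sum(np.exp(z)))

x = np.array([1e4, 1e4, 1e4])  # naive exp(1e4) would overflow to inf
print(softmax(x))              # each output is 1/3, as the analysis predicts
print(log_softmax(np.array([-1e4, 0.0])))  # finite values, no erroneous -inf
```

Note that log_softmax subtracts the max inside the log-sum-exp rather than calling softmax and then log, which is exactly the failure mode described in the text.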
4.2 Poor Conditioning

Conditioning refers to how rapidly a function changes with respect to small changes in its inputs. Functions that change rapidly when their inputs are perturbed slightly can be problematic for scientific computation because rounding errors in the inputs can result in large changes in the output.

Consider the function f(x) = A⁻¹x. When A ∈ ℝ^(n×n) has an eigenvalue decomposition, its condition number is

max_{i,j} |λ_i / λ_j|.    (4.2)

This is the ratio of the magnitude of the largest and smallest eigenvalue. When this number is large, matrix inversion is particularly sensitive to error in the input.

This sensitivity is an intrinsic property of the matrix itself, not the result of rounding error during matrix inversion. Poorly conditioned matrices amplify pre-existing errors when we multiply by the true matrix inverse. In practice, the error will be compounded further by numerical errors in the inversion process itself.
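A small numerical sketch of Eq. 4.2: a diagonal matrix with widely separated eigenvalues has a large condition number, and a tiny perturbation of the input to f(x) = A⁻¹x is amplified accordingly. The specific matrix and perturbation here are arbitrary illustrative choices:

```python
import numpy as np

# A poorly conditioned matrix: its eigenvalues differ by 8 orders of magnitude.
A = np.diag([1.0, 1e-8])

eigvals = np.linalg.eigvals(A)
cond = np.max(np.abs(eigvals)) / np.min(np.abs(eigvals))
print(cond)  # ~1e8, the ratio in Eq. 4.2

# A tiny change in the input produces a large change in A^{-1} x.
x = np.array([1.0, 1.0])
dx = np.array([0.0, 1e-6])  # small input error
y = np.linalg.solve(A, x)
y_perturbed = np.linalg.solve(A, x + dx)
print(np.linalg.norm(y_perturbed - y))  # ~1e2: the 1e-6 error, amplified
```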
4.3 Gradient-Based Optimization

Most deep learning algorithms involve optimization of some sort. Optimization refers to the task of either minimizing or maximizing some function f(x) by altering x. We usually phrase most optimization problems in terms of minimizing f(x). Maximization may be accomplished via a minimization algorithm by minimizing −f(x).

The function we want to minimize or maximize is called the objective function or criterion. When we are minimizing it, we may also call it the cost function, loss function, or error function. In this book, we use these terms interchangeably, though some machine learning publications assign special meaning to some of these terms.

We often denote the value that minimizes or maximizes a function with a superscript *. For example, we might say x* = arg min f(x).
[Figure 4.1 plots f(x) = ½x² and f′(x) = x. Annotations: the global minimum is at x = 0, where f′(x) = 0, so gradient descent halts there; for x < 0 we have f′(x) < 0, so we can decrease f by moving rightward; for x > 0 we have f′(x) > 0, so we can decrease f by moving leftward.]
Figure 4.1: An illustration of how the derivatives of a function can be used to follow the function downhill to a minimum. This technique is called gradient descent.
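The procedure shown in Fig. 4.1 can be sketched in a few lines; the learning rate and step count here are arbitrary illustrative choices:

```python
# Gradient descent on f(x) = 0.5 * x**2, whose derivative is f'(x) = x,
# the same function shown in Fig. 4.1.
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)  # step against the derivative's sign
    return x

x_min = gradient_descent(grad=lambda x: x, x0=2.0)
print(x_min)  # very close to 0, the global minimum
```

With this choice of learning rate, each step multiplies x by 0.9, so the iterate converges geometrically toward the minimum at x = 0.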
We assume the reader is already familiar with calculus, but provide a brief review of how calculus concepts relate to optimization here.

Suppose we have a function y = f(x), where both x and y are real numbers. The derivative of this function is denoted as f′(x) or as dy/dx. The derivative f′(x) gives the slope of f(x) at the point x. In other words, it specifies how to scale a small change in the input in order to obtain the corresponding change in the output: f(x + ε) ≈ f(x) + εf′(x).

The derivative is therefore useful for minimizing a function because it tells us how to change x in order to make a small improvement in y. For example, we know that f(x − ε sign(f′(x))) is less than f(x) for small enough ε. We can thus reduce f(x) by moving x in small steps with opposite sign of the derivative. This technique is called gradient descent (Cauchy, 1847). See Fig. 4.1 for an example of this technique.

When f′(x) = 0, the derivative provides no information about which direction to move. Points where f′(x) = 0 are known as critical points or stationary points. A local minimum is a point where f(x) is lower than at all neighboring points, so it is no longer possible to decrease f(x) by making infinitesimal steps. A local maximum is a point where f(x) is higher than at all neighboring points, so it is
Figure 4.2: Examples of each of the three types of critical points in 1-D. A critical point is a point with zero slope. Such a point can either be a local minimum, which is lower than the neighboring points, a local maximum, which is higher than the neighboring points, or a saddle point, which has neighbors that are both higher and lower than the point itself.
not possible to increase f(x) by making infinitesimal steps. Some critical points are neither maxima nor minima. These are known as saddle points. See Fig. 4.2 for examples of each type of critical point.

A point that obtains the absolute lowest value of f(x) is a global minimum. It is possible for there to be only one global minimum or multiple global minima of the function. It is also possible for there to be local minima that are not globally optimal. In the context of deep learning, we optimize functions that may have many local minima that are not optimal, and many saddle points surrounded by very flat regions. All of this makes optimization very difficult, especially when the input to the function is multidimensional. We therefore usually settle for finding a value of f that is very low, but not necessarily minimal in any formal sense. See Fig. 4.3 for an example.

We often minimize functions that have multiple inputs: f : ℝⁿ → ℝ. For the
concept of "minimization" to make sense, there must still be only one (scalar) output.

For functions with multiple inputs, we must make use of the concept of partial derivatives. The partial derivative $\frac{\partial}{\partial x_i} f(x)$ measures how f changes as only the variable $x_i$ increases at point x. The gradient generalizes the notion of derivative to the case where the derivative is with respect to a vector: the gradient of f is the vector containing all of the partial derivatives, denoted $\nabla_x f(x)$. Element i of the gradient is the partial derivative of f with respect to $x_i$. In multiple dimensions,
CHAPTER 4. NUMERICAL COMPUTATION
[Figure 4.3 plots f(x) against x under the title "Approximate minimization," with annotations: "Ideally, we would like to arrive at the global minimum, but this might not be possible"; "This local minimum performs nearly as well as the global one, so it is an acceptable halting point"; "This local minimum performs poorly, and should be avoided."]
Figure 4.3: Optimization algorithms may fail to find a global minimum when there are multiple local minima or plateaus present. In the context of deep learning, we generally accept such solutions even though they are not truly minimal, so long as they correspond to significantly low values of the cost function.
critical points are points where every element of the gradient is equal to zero.

The directional derivative in direction u (a unit vector) is the slope of the function f in direction u. In other words, the directional derivative is the derivative of the function $f(x + \alpha u)$ with respect to α, evaluated at α = 0. Using the chain rule, we can see that $\frac{\partial}{\partial \alpha} f(x + \alpha u)$ evaluates to $u^\top \nabla_x f(x)$ when α = 0.

To minimize f, we would like to find the direction in which f decreases the fastest. We can do this using the directional derivative:

$$\min_{u,\, u^\top u = 1} u^\top \nabla_x f(x) \tag{4.3}$$
$$= \min_{u,\, u^\top u = 1} \|u\|_2 \, \|\nabla_x f(x)\|_2 \cos\theta \tag{4.4}$$

where θ is the angle between u and the gradient. Substituting in $\|u\|_2 = 1$ and ignoring factors that do not depend on u, this simplifies to $\min_u \cos\theta$. This is minimized when u points in the opposite direction as the gradient. In other words, the gradient points directly uphill, and the negative gradient points directly downhill. We can decrease f by moving in the direction of the negative gradient. This is known as the method of steepest descent or gradient descent.

Steepest descent proposes a new point
$$x' = x - \epsilon \nabla_x f(x) \tag{4.5}$$
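Eq. 4.5 can be sketched in code. This is a minimal illustration, not an algorithm given in the text: the quadratic objective, learning rate, and stopping threshold below are all arbitrary choices.

```python
import numpy as np

# A minimal sketch of gradient descent (Eq. 4.5) on an illustrative
# quadratic f(x) = 1/2 x^T A x, whose gradient is A x. The matrix A,
# learning rate, and stopping threshold are arbitrary choices.
A = np.array([[2.0, 0.0],
              [0.0, 4.0]])

def f(x):
    return 0.5 * x @ A @ x

def grad_f(x):
    return A @ x

epsilon = 0.1                       # learning rate
x = np.array([3.0, 3.0])
for _ in range(200):
    g = grad_f(x)
    if np.linalg.norm(g) < 1e-8:    # every element of the gradient near zero
        break
    x = x - epsilon * g             # x' = x - epsilon * grad f(x)  (Eq. 4.5)
```

Each step moves x opposite the gradient, and the loop halts once every element of the gradient is numerically close to zero, matching the convergence criterion of steepest descent.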
CHAPTER 4. NUMERICAL COMPUTATION
where ε is the learning rate, a positive scalar determining the size of the step. We can choose ε in several different ways. A popular approach is to set ε to a small constant. Sometimes, we can solve for the step size that makes the directional derivative vanish. Another approach is to evaluate $f(x - \epsilon \nabla_x f(x))$ for several values of ε and choose the one that results in the smallest objective function value. This last strategy is called a line search.

Steepest descent converges when every element of the gradient is zero (or, in practice, very close to zero). In some cases, we may be able to avoid running this iterative algorithm, and just jump directly to the critical point by solving the equation $\nabla_x f(x) = 0$ for x.

Although gradient descent is limited to optimization in continuous spaces, the general concept of making small moves (that are approximately the best small move) towards better configurations can be generalized to discrete spaces.
Ascending an objective function of discrete parameters is called hill climbing (Russell and Norvig, 2003).
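The line search strategy described above can be sketched as follows; the candidate step sizes, helper name, and toy objective are illustrative choices, not values from the text.

```python
import numpy as np

# A sketch of a simple line search: evaluate f(x - eps * grad f(x)) for
# several candidate step sizes eps and keep the point with the smallest
# objective value. The candidate grid and the objective are illustrative.
def line_search_step(f, grad_f, x, candidates=(0.01, 0.1, 0.5, 1.0)):
    g = grad_f(x)
    trials = [x - eps * g for eps in candidates]
    return min(trials, key=f)       # smallest objective function value wins

f = lambda x: np.sum(x ** 2)        # toy objective
grad_f = lambda x: 2.0 * x
x = np.array([2.0, -1.0])
x = line_search_step(f, grad_f, x)
```

For this toy objective, the candidate ε = 0.5 happens to land exactly on the minimum; in general the line search only picks the best of the tried step sizes.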
4.3.1 Beyond the Gradient: Jacobian and Hessian Matrices
Sometimes we need to find all of the partial derivatives of a function whose input and output are both vectors. The matrix containing all such partial derivatives is known as a Jacobian matrix. Specifically, if we have a function $f : \mathbb{R}^m \to \mathbb{R}^n$, then the Jacobian matrix $J \in \mathbb{R}^{n \times m}$ of f is defined such that $J_{i,j} = \frac{\partial}{\partial x_j} f(x)_i$.

We are also sometimes interested in a derivative of a derivative. This is known as a second derivative. For example, for a function $f : \mathbb{R}^n \to \mathbb{R}$, the derivative with respect to $x_i$ of the derivative of f with respect to $x_j$ is denoted as $\frac{\partial^2}{\partial x_i \partial x_j} f$. In a single dimension, we can denote $\frac{d^2}{dx^2} f$ by $f''(x)$. The second derivative tells us how the first derivative will change as we vary the input. This is important because it tells us whether a gradient step will cause as much of an improvement as we would expect based on the gradient alone. We can think of the second
derivative as measuring curvature. Suppose we have a quadratic function (many functions that arise in practice are not quadratic but can be approximated well as quadratic, at least locally). If such a function has a second derivative of zero, then there is no curvature. It is a perfectly flat line, and its value can be predicted using only the gradient. If the gradient is 1, then we can make a step of size ε along the negative gradient, and the cost function will decrease by ε. If the second derivative is negative, the function curves downward, so the cost function will actually decrease by more than ε. Finally, if the second derivative is positive, the function curves upward, so the cost function can decrease by less than ε. See Fig.
[Figure 4.4 shows three panels plotting f(x) against x, titled "Negative curvature," "No curvature," and "Positive curvature."]
Figure 4.4: The second derivative determines the curvature of a function. Here we show quadratic functions with various curvature. The dashed line indicates the value of the cost function we would expect based on the gradient information alone as we make a gradient step downhill. In the case of negative curvature, the cost function actually decreases faster than the gradient predicts. In the case of no curvature, the gradient predicts the decrease correctly. In the case of positive curvature, the function decreases slower than expected and eventually begins to increase, so too large of step sizes can actually increase the function inadvertently.
4.4 to see how different forms of curvature affect the relationship between the value of the cost function predicted by the gradient and the true value.

When our function has multiple input dimensions, there are many second derivatives. These derivatives can be collected together into a matrix called the Hessian matrix. The Hessian matrix $H(f)(x)$ is defined such that

$$H(f)(x)_{i,j} = \frac{\partial^2}{\partial x_i \partial x_j} f(x). \tag{4.6}$$

Equivalently, the Hessian is the Jacobian of the gradient.

Anywhere that the second partial derivatives are continuous, the differential operators are commutative, i.e. their order can be swapped:

$$\frac{\partial^2}{\partial x_i \partial x_j} f(x) = \frac{\partial^2}{\partial x_j \partial x_i} f(x). \tag{4.7}$$

This implies that $H_{i,j} = H_{j,i}$, so the Hessian matrix is symmetric at such points. Most of the functions we encounter in the context of deep learning have a symmetric Hessian almost everywhere. Because the Hessian matrix is real and symmetric, we can decompose it into a set of real eigenvalues and an orthogonal basis of
eigenvectors. The second derivative in a specific direction represented by a unit vector d is given by $d^\top H d$. When d is an eigenvector of H, the second derivative in that direction is given by the corresponding eigenvalue. For other directions of d, the directional second derivative is a weighted average of all of the eigenvalues, with weights between 0 and 1, and eigenvectors that have smaller angle with d receiving more weight. The maximum eigenvalue determines the maximum second derivative and the minimum eigenvalue determines the minimum second derivative.

The (directional) second derivative tells us how well we can expect a gradient descent step to perform. We can make a second-order Taylor series approximation to the function f(x) around the current point $x^{(0)}$:

$$f(x) \approx f(x^{(0)}) + (x - x^{(0)})^\top g + \frac{1}{2}(x - x^{(0)})^\top H (x - x^{(0)}). \tag{4.8}$$
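The claims above about $d^\top H d$ can be checked numerically. This sketch uses an arbitrary symmetric matrix as the Hessian; `numpy.linalg.eigh` returns its real eigenvalues in ascending order together with orthonormal eigenvectors.

```python
import numpy as np

# Sketch: the second derivative in unit direction d is d^T H d. Along an
# eigenvector it equals the corresponding eigenvalue; along any other
# unit direction it lies between the smallest and largest eigenvalues.
# The symmetric matrix H here is an arbitrary example.
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])
eigvals, eigvecs = np.linalg.eigh(H)    # ascending eigenvalues, orthonormal eigenvectors

d = eigvecs[:, 0]                       # unit eigenvector for the smallest eigenvalue
second_deriv = d @ H @ d                # equals eigvals[0]

rng = np.random.default_rng(0)
d = rng.normal(size=2)
d /= np.linalg.norm(d)                  # arbitrary unit direction
val = d @ H @ d                         # weighted average of the eigenvalues
```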
where g is the gradient and H is the Hessian at $x^{(0)}$. If we use a learning rate of ε, then the new point x will be given by $x^{(0)} - \epsilon g$. Substituting this into our approximation, we obtain

$$f(x^{(0)} - \epsilon g) \approx f(x^{(0)}) - \epsilon g^\top g + \frac{1}{2}\epsilon^2 g^\top H g. \tag{4.9}$$

There are three terms here: the original value of the function, the expected improvement due to the slope of the function, and the correction we must apply to account for the curvature of the function. When this last term is too large, the gradient descent step can actually move uphill. When $g^\top H g$ is zero or negative, the Taylor series approximation predicts that increasing ε forever will decrease f forever. In practice, the Taylor series is unlikely to remain accurate for large ε, so one must resort to more heuristic choices of ε in this case. When $g^\top H g$ is positive, solving for the optimal step size that decreases the Taylor series approximation of the function the most yields

$$\epsilon^* = \frac{g^\top g}{g^\top H g}. \tag{4.10}$$
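For a purely quadratic objective, the second-order Taylor series in Eq. 4.9 is exact, so the step size of Eq. 4.10 is exactly optimal. A sketch, with an illustrative H and starting point:

```python
import numpy as np

# Sketch of Eq. 4.10 on a quadratic f(x) = 1/2 x^T H x, whose gradient
# at x0 is H @ x0. H and x0 are illustrative choices. For a quadratic,
# eps_star minimizes f(x0 - eps * g) exactly over the step size eps.
H = np.array([[4.0, 0.0],
              [0.0, 1.0]])

def f(x):
    return 0.5 * x @ H @ x

x0 = np.array([1.0, 2.0])
g = H @ x0                          # gradient at x0
eps_star = (g @ g) / (g @ H @ g)    # optimal step size (Eq. 4.10)
```

No fixed step size beats `eps_star` here: moving to `x0 - eps_star * g` gives at least as large a decrease as any other step along the negative gradient.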
In the worst case, when g aligns with the eigenvector of H corresponding to the maximal eigenvalue $\lambda_{\max}$, then this optimal step size is given by $\frac{1}{\lambda_{\max}}$. To the extent that the function we minimize can be approximated well by a quadratic function, the eigenvalues of the Hessian thus determine the scale of the learning rate.

The second derivative can be used to determine whether a critical point is a local maximum, a local minimum, or saddle point. Recall that on a critical point, $f'(x) = 0$. When $f''(x) > 0$, this means that $f'(x)$ increases as we move to the right, and $f'(x)$ decreases as we move to the left. This means $f'(x - \epsilon) < 0$ and
$f'(x + \epsilon) > 0$ for small enough ε. In other words, as we move right, the slope begins to point uphill to the right, and as we move left, the slope begins to point uphill to the left. Thus, when $f'(x) = 0$ and $f''(x) > 0$, we can conclude that x is a local minimum. Similarly, when $f'(x) = 0$ and $f''(x) < 0$, we can conclude that x is a local maximum. This is known as the second derivative test. Unfortunately, when $f''(x) = 0$, the test is inconclusive. In this case x may be a saddle point, or a part of a flat region.

In multiple dimensions, we need to examine all of the second derivatives of the function. Using the eigendecomposition of the Hessian matrix, we can generalize the second derivative test to multiple dimensions. At a critical point, where $\nabla_x f(x) = 0$, we can examine the eigenvalues of the Hessian to determine whether the critical point is a local maximum, local minimum, or saddle point. When the
Hessian is positive definite (all its eigenvalues are positive), the point is a local minimum. This can be seen by observing that the directional second derivative in any direction must be positive, and making reference to the univariate second derivative test. Likewise, when the Hessian is negative definite (all its eigenvalues are negative), the point is a local maximum. In multiple dimensions, it is actually possible to find positive evidence of saddle points in some cases. When at least one eigenvalue is positive and at least one eigenvalue is negative, we know that x is a local maximum on one cross section of f but a local minimum on another cross section. See Fig. 4.5 for an example. Finally, the multidimensional second derivative test can be inconclusive, just like the univariate version. The test is
inconclusive whenever all of the non-zero eigenvalues have the same sign, but at least one eigenvalue is zero. This is because the univariate second derivative test is inconclusive in the cross section corresponding to the zero eigenvalue.

In multiple dimensions, there can be a wide variety of different second derivatives at a single point, because there is a different second derivative for each direction. The condition number of the Hessian measures how much the second derivatives vary. When the Hessian has a poor condition number, gradient descent performs poorly. This is because in one direction, the derivative increases rapidly, while in another direction, it increases slowly. Gradient descent is unaware of this change in the derivative so it does not know that it needs to explore preferentially in
the direction where the derivative remains negative for longer. It also makes it difficult to choose a good step size. The step size must be small enough to avoid overshooting the minimum and going uphill in directions with strong positive curvature. This usually means that the step size is too small to make significant progress in other directions with less curvature. See Fig. 4.6 for an example.

This issue can be resolved by using information from the Hessian matrix to
[Figure 4.5 shows a 3-D surface plot of f(x1, x2) over x1 and x2 ranging from −15 to 15, with function values from −500 to 500.]
Figure 4.5: A saddle point containing both positive and negative curvature. The function in this example is $f(x) = x_1^2 - x_2^2$. Along the axis corresponding to $x_1$, the function curves upward. This axis is an eigenvector of the Hessian and has a positive eigenvalue. Along the axis corresponding to $x_2$, the function curves downward. This direction is an eigenvector of the Hessian with negative eigenvalue. The name "saddle point" derives from the saddle-like shape of this function. This is the quintessential example of a function with a saddle point. In more than one dimension, it is not necessary to have an eigenvalue of 0 in order to get a saddle point: it is only necessary to have both positive and negative eigenvalues. We can think of a saddle point with both signs of eigenvalues as being a local maximum within one cross section and a local minimum within another cross section.
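The eigenvalue-based second derivative test described above can be sketched numerically; applied to the Hessian of the saddle function in Fig. 4.5, it reports a saddle point. The helper name and tolerance are illustrative choices.

```python
import numpy as np

# A sketch of the multidimensional second derivative test: classify a
# critical point by the signs of the Hessian's eigenvalues. The helper
# name and tolerance are illustrative choices.
def classify_critical_point(H, tol=1e-10):
    eigvals = np.linalg.eigvalsh(H)
    if np.all(eigvals > tol):
        return "local minimum"      # positive definite Hessian
    if np.all(eigvals < -tol):
        return "local maximum"      # negative definite Hessian
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point"       # mixed-sign eigenvalues
    return "inconclusive"           # some eigenvalues numerically zero

# Hessian of f(x) = x1^2 - x2^2 at its critical point, the origin.
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])
print(classify_critical_point(H))   # prints "saddle point"
```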
[Figure 4.6 shows a contour plot over x1 and x2, with axes ranging from roughly −30 to 20 and the path of gradient descent overlaid.]
Figure 4.6: Gradient descent fails to exploit the curvature information contained in the Hessian matrix. Here we use gradient descent to minimize a quadratic function f(x) whose Hessian matrix has condition number 5. This means that the direction of most curvature has five times more curvature than the direction of least curvature. In this case, the most curvature is in the direction $[1, 1]^\top$ and the least curvature is in the direction $[1, -1]^\top$. The red lines indicate the path followed by gradient descent. This very elongated quadratic function resembles a long canyon. Gradient descent wastes time repeatedly descending canyon walls, because they are the steepest feature. Because the step size is somewhat too large, it has a tendency to overshoot the bottom of the function and thus needs to descend the opposite canyon wall on the next iteration. The large positive eigenvalue of the Hessian corresponding to the eigenvector pointed in this direction indicates that this directional derivative is rapidly increasing, so an optimization algorithm based on the Hessian could predict that the steepest direction is not actually a promising search direction in this context.
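The condition number in the caption can be checked numerically. The exact quadratic used for Fig. 4.6 is not given in the text, so the matrix below is an assumed stand-in constructed to have eigenvectors along [1, 1] and [1, −1] with a 5:1 eigenvalue ratio.

```python
import numpy as np

# A sketch of computing a Hessian's condition number. This matrix is an
# assumed stand-in for the Fig. 4.6 quadratic: its eigenvectors lie
# along [1, 1] and [1, -1] with eigenvalues 5 and 1, i.e. condition
# number 5.
H = np.array([[3.0, 2.0],
              [2.0, 3.0]])
eigvals = np.linalg.eigvalsh(H)               # ascending: [1, 5]
condition_number = eigvals[-1] / eigvals[0]   # ratio of extreme curvatures
print(condition_number)                       # approximately 5
```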
guide the search. The simplest method for doing so is known as Newton's method. Newton's method is based on using a second-order Taylor series expansion to approximate f(x) near some point x^(0):

f(x) ≈ f(x^(0)) + (x − x^(0))^⊤ ∇_x f(x^(0)) + ½ (x − x^(0))^⊤ H(f)(x^(0)) (x − x^(0)). (4.11)

If we then solve for the critical point of this function, we obtain:

x* = x^(0) − H(f)(x^(0))^(−1) ∇_x f(x^(0)). (4.12)

When f is a positive definite quadratic function, Newton's method consists of applying Eq. 4.12 once to jump to the minimum of the function directly. When f is not truly quadratic but can be locally approximated as a positive definite quadratic, Newton's method consists of applying Eq. 4.12 multiple times. Iteratively updating the approximation and jumping to the minimum of the approximation can reach the critical point much faster than gradient descent would. This is a useful property near a local minimum, but it can be a harmful property near a saddle point.
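As an illustration of Eq. 4.12, the following sketch (assuming NumPy; the matrix A and vector b are arbitrary illustrative choices, not from the text) applies a single Newton step to a positive definite quadratic and lands exactly on its minimum:

```python
import numpy as np

# Hypothetical example: minimize f(x) = 1/2 x^T A x - b^T x, whose
# Hessian is the constant matrix A. Newton's update (Eq. 4.12),
#   x* = x0 - H^{-1} grad f(x0),
# reaches the minimum of a quadratic in a single step.

A = np.array([[5.0, 0.0],
              [0.0, 1.0]])   # positive definite Hessian, condition number 5
b = np.array([1.0, 2.0])

def grad_f(x):
    return A @ x - b

x0 = np.array([10.0, -3.0])                     # arbitrary starting point
x_star = x0 - np.linalg.solve(A, grad_f(x0))    # one Newton step

# The critical point satisfies grad f(x*) = 0, i.e. A x* = b.
assert np.allclose(grad_f(x_star), 0.0)
```

Gradient descent on the same function would need many iterations whose number grows with the condition number; the Newton step is condition-number-independent here because it rescales by the inverse Hessian.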
As discussed in Sec. 8.2.3, Newton's method is only appropriate when the nearby critical point is a minimum (all the eigenvalues of the Hessian are positive), whereas gradient descent is not attracted to saddle points unless the gradient points toward them.

Optimization algorithms that use only the gradient, such as gradient descent, are called first-order optimization algorithms. Optimization algorithms that also use the Hessian matrix, such as Newton's method, are called second-order optimization algorithms (Nocedal and Wright, 2006).

The optimization algorithms employed in most contexts in this book are applicable to a wide variety of functions, but come with almost no guarantees. This is because the family of functions used in deep learning is quite complicated. In many other fields, the dominant approach to optimization is to design optimization algorithms for a limited family of functions.
In the context of deep learning, we sometimes gain some guarantees by restricting ourselves to functions that are either Lipschitz continuous or have Lipschitz continuous derivatives. A Lipschitz continuous function is a function f whose rate of change is bounded by a Lipschitz constant L:

∀x, ∀y, |f(x) − f(y)| ≤ L ||x − y||₂. (4.13)

This property is useful because it allows us to quantify our assumption that a small change in the input made by an algorithm such as gradient descent will have a small change in the output.
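Equation 4.13 can be spot-checked numerically. A minimal sketch, assuming NumPy and using f(x) = sin(x), which is Lipschitz continuous with L = 1 because |f'(x)| = |cos(x)| ≤ 1 everywhere:

```python
import numpy as np

# Spot-check the Lipschitz bound of Eq. 4.13 on random pairs of points.
# This is an illustration, not a proof: f(x) = sin(x), L = 1.

rng = np.random.default_rng(0)
L = 1.0
xs = rng.uniform(-10.0, 10.0, size=1000)
ys = rng.uniform(-10.0, 10.0, size=1000)

# |sin(x) - sin(y)| <= L |x - y| for every sampled pair
# (small slack for floating-point rounding).
assert np.all(np.abs(np.sin(xs) - np.sin(ys)) <= L * np.abs(xs - ys) + 1e-12)
```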
Lipschitz continuity is also a fairly weak constraint, and many optimization problems in deep learning can be made Lipschitz continuous with relatively minor modifications.

Perhaps the most successful field of specialized optimization is convex optimization. Convex optimization algorithms are able to provide many more guarantees by making stronger restrictions. Convex optimization algorithms are applicable only to convex functions—functions for which the Hessian is positive semidefinite everywhere. Such functions are well-behaved because they lack saddle points and all of their local minima are necessarily global minima. However, most problems in deep learning are difficult to express in terms of convex optimization. Convex optimization is used only as a subroutine of some deep learning algorithms. Ideas from the analysis of convex optimization algorithms can be useful for proving the convergence of deep learning algorithms. However, in general, the importance of convex optimization is greatly diminished in the context of deep learning. For more information about convex optimization, see Boyd and Vandenberghe (2004) or Rockafellar (1997).
4.4
Constrained Optimization
Sometimes we wish not only to maximize or minimize a function f(x) over all possible values of x. Instead we may wish to find the maximal or minimal value of f(x) for values of x in some set S. This is known as constrained optimization. Points x that lie within the set S are called feasible points in constrained optimization terminology.

We often wish to find a solution that is small in some sense. A common approach in such situations is to impose a norm constraint, such as ||x|| ≤ 1.

One simple approach to constrained optimization is simply to modify gradient descent taking the constraint into account. If we use a small constant step size ε, we can make gradient descent steps, then project the result back into S. If we use a line search, we can search only over step sizes that yield new x points that are feasible, or we can project each point on the line back into the constraint region.
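A minimal sketch of the projection idea, assuming NumPy and taking S to be the unit L² ball, so that projection is just rescaling; the objective here is a hypothetical quadratic, not one from the text:

```python
import numpy as np

# Projected gradient descent on f(x) = ||x - c||^2 subject to ||x||_2 <= 1.
# The unconstrained minimum c lies outside the ball, so the constrained
# solution sits on the boundary, at c rescaled to unit norm.

c = np.array([3.0, 0.0])

def grad(x):
    return 2.0 * (x - c)

def project(x):
    # Projection onto the unit L2 ball: rescale only if outside.
    n = np.linalg.norm(x)
    return x / n if n > 1.0 else x

x = np.zeros(2)
step = 0.1
for _ in range(100):
    x = project(x - step * grad(x))   # gradient step, then project into S

# Converges to the boundary point [1, 0].
assert np.allclose(x, [1.0, 0.0], atol=1e-6)
```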
When possible, this method can be made more efficient by projecting the gradient into the tangent space of the feasible region before taking the step or beginning the line search (Rosen, 1960).

A more sophisticated approach is to design a different, unconstrained optimization problem whose solution can be converted into a solution to the original, constrained optimization problem. For example, if we want to minimize f(x) for x ∈ ℝ² with x constrained to have exactly unit L² norm, we can instead minimize
g(θ) = f([cos θ, sin θ]^⊤) with respect to θ, then return [cos θ, sin θ] as the solution to the original problem. This approach requires creativity; the transformation between optimization problems must be designed specifically for each case we encounter.

The Karush–Kuhn–Tucker (KKT) approach¹ provides a very general solution to constrained optimization. With the KKT approach, we introduce a new function called the generalized Lagrangian or generalized Lagrange function.

To define the Lagrangian, we first need to describe S in terms of equations and inequalities. We want a description of S in terms of m functions g^(i) and n functions h^(j) so that S = {x | ∀i, g^(i)(x) = 0 and ∀j, h^(j)(x) ≤ 0}. The equations involving g are called the equality constraints and the inequalities involving h are called inequality constraints.
We introduce new variables λ_i and α_j for each constraint; these are called the KKT multipliers. The generalized Lagrangian is then defined as

L(x, λ, α) = f(x) + Σ_i λ_i g^(i)(x) + Σ_j α_j h^(j)(x). (4.14)

We can now solve a constrained minimization problem using unconstrained optimization of the generalized Lagrangian. Observe that, so long as at least one feasible point exists and f(x) is not permitted to have value ∞, then

min_x max_λ max_{α, α≥0} L(x, λ, α) (4.15)

has the same optimal objective function value and set of optimal points x as

min_{x∈S} f(x). (4.16)

This follows because any time the constraints are satisfied,

max_λ max_{α, α≥0} L(x, λ, α) = f(x), (4.17)

while any time a constraint is violated,

max_λ max_{α, α≥0} L(x, λ, α) = ∞. (4.18)

These properties guarantee that no infeasible point will ever be optimal, and that the optimum within the feasible points is unchanged.

¹The KKT approach generalizes the method of Lagrange multipliers, which allows equality constraints but not inequality constraints.
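The feasibility argument of Eqs. 4.17 and 4.18 can be illustrated numerically. A sketch assuming NumPy, with a toy one-dimensional problem whose objective and constraint are illustrative assumptions:

```python
import numpy as np

# Toy problem: one inequality constraint h(x) = x^2 - 1 <= 0,
# i.e. the feasible set S = [-1, 1], with objective f(x) = x.
# The generalized Lagrangian is L(x, alpha) = f(x) + alpha * h(x),
# with alpha >= 0 (no equality constraints here).

def f(x):
    return x

def h(x):
    return x**2 - 1.0

def max_over_alpha(x, alphas):
    # Approximate max_{alpha >= 0} L(x, alpha) over a finite grid.
    return max(f(x) + a * h(x) for a in alphas)

alphas = np.linspace(0.0, 1e6, 1001)

# Feasible point (h(x) <= 0): the max is attained at alpha = 0,
# recovering f(x) as in Eq. 4.17.
assert np.isclose(max_over_alpha(0.5, alphas), f(0.5))

# Infeasible point (h(x) > 0): L grows without bound in alpha (Eq. 4.18);
# on a finite grid it is merely very large.
assert max_over_alpha(2.0, alphas) > 1e5
```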
To perform constrained maximization, we can construct the generalized Lagrange function of −f(x), which leads to this optimization problem:

min_x max_λ max_{α, α≥0} −f(x) + Σ_i λ_i g^(i)(x) + Σ_j α_j h^(j)(x). (4.19)

We may also convert this to a problem with maximization in the outer loop:

max_x min_λ min_{α, α≥0} f(x) + Σ_i λ_i g^(i)(x) − Σ_j α_j h^(j)(x). (4.20)

The sign of the term for the equality constraints does not matter; we may define it with addition or subtraction as we wish, because the optimization is free to choose any sign for each λ_i.

The inequality constraints are particularly interesting. We say that a constraint h^(i)(x) is active if h^(i)(x*) = 0. If a constraint is not active, then the solution to the problem found using that constraint would remain at least a local solution if that constraint were removed. It is possible that an inactive constraint excludes other solutions. For example, a convex problem with an entire region of globally optimal points (a wide, flat region of equal cost) could have a subset of this region eliminated by constraints, or a non-convex problem could have better local stationary points excluded by a constraint that is inactive at convergence. However, the point found at convergence remains a stationary point whether or not the inactive constraints are included. Because an inactive h^(i) has negative value, then the solution to min_x max_λ max_{α, α≥0} L(x, λ, α) will have α_i = 0. We can thus observe that at the solution, α ⊙ h(x) = 0. In other words, for all i, we know that at least one of the constraints α_i ≥ 0 and h^(i)(x) ≤ 0 must be active at the solution.

To gain some intuition for this idea, we can say that either the solution is on the boundary imposed by the inequality and we must use its KKT multiplier to influence the solution to x, or the inequality has no influence on the solution and we represent this by zeroing out its KKT multiplier.

The properties that the gradient of the generalized Lagrangian is zero, all constraints on both x and the KKT multipliers are satisfied, and α ⊙ h(x) = 0 are called the Karush–Kuhn–Tucker (KKT) conditions (Karush, 1939; Kuhn and Tucker, 1951). Together, these properties describe the optimal points of constrained optimization problems.

For more information about the KKT approach, see Nocedal and Wright (2006).
4.5
Example: Linear Least Squares
Suppose we want to find the value of x that minimizes

f(x) = ½ ||Ax − b||₂². (4.21)

There are specialized linear algebra algorithms that can solve this problem efficiently. However, we can also explore how to solve it using gradient-based optimization as a simple example of how these techniques work.

First, we need to obtain the gradient:

∇_x f(x) = A^⊤(Ax − b) = A^⊤Ax − A^⊤b. (4.22)

We can then follow this gradient downhill, taking small steps. See Algorithm 4.1 for details.

Algorithm 4.1 An algorithm to minimize f(x) = ½||Ax − b||₂² with respect to x using gradient descent.

  Set the step size (ε) and tolerance (δ) to small, positive numbers.
  while ||A^⊤Ax − A^⊤b||₂ > δ do
    x ← x − ε(A^⊤Ax − A^⊤b)
  end while

One can also solve this problem using Newton's method. In this case, because the true function is quadratic, the quadratic approximation employed by Newton's method is exact, and the algorithm converges to the global minimum in a single step.

Now suppose we wish to minimize the same function, but subject to the constraint x^⊤x ≤ 1. To do so, we introduce the Lagrangian

L(x, λ) = f(x) + λ(x^⊤x − 1). (4.23)

We can now solve the problem

min_x max_{λ, λ≥0} L(x, λ). (4.24)
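Algorithm 4.1 translates almost line for line into code. A sketch assuming NumPy; the particular A and b are arbitrary illustrations:

```python
import numpy as np

# A direct rendering of Algorithm 4.1: gradient descent on
# f(x) = 1/2 ||Ax - b||_2^2, using the gradient A^T A x - A^T b (Eq. 4.22).

def least_squares_gd(A, b, eps=0.01, delta=1e-8):
    x = np.zeros(A.shape[1])
    while np.linalg.norm(A.T @ A @ x - A.T @ b) > delta:
        x = x - eps * (A.T @ A @ x - A.T @ b)
    return x

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
b = np.array([1.0, 2.0, 2.0])

x = least_squares_gd(A, b)
# Agrees with the pseudoinverse solution x = A^+ b.
assert np.allclose(x, np.linalg.pinv(A) @ b, atol=1e-4)
```

Note that ε must be small enough relative to the largest eigenvalue of A^⊤A for the loop to converge; the fixed value above is a hypothetical choice that works for this particular A.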
The smallest-norm solution to the unconstrained least squares problem may be found using the Moore–Penrose pseudoinverse: x = A⁺b. If this point is feasible, then it is the solution to the constrained problem. Otherwise, we must find a
solution where the constraint is active. By differentiating the Lagrangian with respect to x, we obtain the equation

A^⊤Ax − A^⊤b + 2λx = 0. (4.25)

This tells us that the solution will take the form

x = (A^⊤A + 2λI)^(−1) A^⊤b. (4.26)
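Equation 4.26 gives x in closed form for any fixed λ; λ itself can then be adjusted by gradient ascent on the Lagrangian, whose derivative with respect to λ is x^⊤x − 1. A sketch assuming NumPy (the problem data and ascent rate are illustrative assumptions):

```python
import numpy as np

# Constrained least squares subject to x^T x <= 1, in the case where the
# unconstrained minimum is infeasible so the constraint is active.
# Alternate between the closed-form solve of Eq. 4.26,
#   x = (A^T A + 2*lam*I)^{-1} A^T b,
# and gradient ascent on lam using dL/dlam = x^T x - 1.

A = np.eye(2)
b = np.array([3.0, 4.0])   # unconstrained solution [3, 4] has norm 5 > 1

lam = 1.0
lr = 0.1                   # hypothetical ascent rate on lam
I = np.eye(A.shape[1])
for _ in range(2000):
    x = np.linalg.solve(A.T @ A + 2.0 * lam * I, A.T @ b)
    lam = max(0.0, lam + lr * (x @ x - 1.0))   # keep lam >= 0

# At convergence x lies on the unit sphere: here x = [0.6, 0.8].
assert np.isclose(x @ x, 1.0, atol=1e-3)
```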
= (A suc A + λI ) theAresult b. ob (4.26) The magnitude of λ must bex chosen such h 2that obeys eys the constrain constraint. t. We can find this value by performing gradient ascent on λ. To do so, observ observee The magnitude of λ must be chosen such that the result obeys the constraint. We ∂ gradient ascent can find this value by performing on λ. To do so, observe L(x, λ) = x> x − 1. (4.27) ∂λ ∂ L(x , λ) ative = x isx positiv 1. e, so to follow the deriv (4.27) When the norm of x exceeds 1,∂ this derivative ositive, derivativ ativ ativee λ deriv − uphill and increase the Lagrangian with resp respect ect to λ, we increase λ . Because the When thet norm exceeds 1, this increased, derivative solving is positiv e, so to follow the deriv co coefficien efficien efficient on theofxx> x penalt enalty y has the linear equation for xativ wille uphill andaincrease Lagrangian withThe resppro ect cess to λof , we increase . Because the no now w yield solutionthe with smaller norm. process solving theλlinear equation x x x coefficien t on the penalt y has increased, solving theand linear for will λ con x has and adjusting contin tin tinues ues until the correct norm theequation deriv derivativ ativ ative e on λ is no w yield a solution with smaller norm. The pro cess of solving the linear equation 0. and adjusting λ continues until x has the correct norm and the derivative on λ is This concludes the mathematical preliminaries that we use to dev develop elop machine 0. learning algorithms. We are no now w ready to build and analyze some full-fledged This concludes the mathematical preliminaries that we use to develop machine learning systems. learning algorithms. We are now ready to build and analyze some full-fledged learning systems.
Chapter 5
Machine Learning Basics

Deep learning is a specific kind of machine learning. In order to understand deep learning well, one must have a solid understanding of the basic principles of machine learning. This chapter provides a brief course in the most important general principles that will be applied throughout the rest of the book. Novice readers or those who want a wider perspective are encouraged to consider machine learning textbooks with a more comprehensive coverage of the fundamentals, such as Murphy (2012) or Bishop (2006). If you are already familiar with machine learning basics, feel free to skip ahead to Sec. 5.11. That section covers some perspectives on traditional machine learning techniques that have strongly influenced the development of deep learning algorithms.

We begin with a definition of what a learning algorithm is, and present an example: the linear regression algorithm. We then proceed to describe how the challenge of fitting the training data differs from the challenge of finding patterns that generalize to new data. Most machine learning algorithms have settings called hyperparameters that must be determined external to the learning algorithm itself; we discuss how to set these using additional data. Machine learning is essentially a form of applied statistics with increased emphasis on the use of computers to statistically estimate complicated functions and a decreased emphasis on proving confidence intervals around these functions; we therefore present the two central approaches to statistics: frequentist estimators and Bayesian inference. Most machine learning algorithms can be divided into the categories of supervised learning and unsupervised learning; we describe these categories and give some examples of simple learning algorithms from each category. Most deep learning algorithms are based on an optimization algorithm called stochastic gradient descent. We describe how to combine various algorithm components such as an
optimization algorithm, a cost function, a model, and a dataset to build a machine learning algorithm. Finally, in Sec. 5.11, we describe some of the factors that have limited the ability of traditional machine learning to generalize. These challenges have motivated the development of deep learning algorithms that overcome these obstacles.
5.1
Learning Algorithms
A machine learning algorithm is an algorithm that is able to learn from data. But what do we mean by learning? Mitchell (1997) provides the definition "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." One can imagine a very wide variety of experiences E, tasks T, and performance measures P, and we do not make any attempt in this book to provide a formal definition of what may be used for each of these entities. Instead, the following sections provide intuitive descriptions and examples of the different kinds of tasks, performance measures and experiences that can be used to construct machine learning algorithms.
5.1.1
The Task, T

Machine learning allows us to tackle tasks that are too difficult to solve with fixed programs written and designed by human beings. From a scientific and philosophical point of view, machine learning is interesting because developing our understanding of machine learning entails developing our understanding of the principles that underlie intelligence.

In this relatively formal definition of the word “task,” the process of learning itself is not the task. Learning is our means of attaining the ability to perform the task. For example, if we want a robot to be able to walk, then walking is the task. We could program the robot to learn to walk, or we could attempt to directly write a program that specifies how to walk manually.

Machine learning tasks are usually described in terms of how the machine learning system should process an example.
An example is a collection of features that have been quantitatively measured from some object or event that we want the machine learning system to process. We typically represent an example as a vector x ∈ R^n where each entry x_i of the vector is another feature. For example, the features of an image are usually the values of the pixels in the image.
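The idea of an example as a feature vector can be sketched in a few lines of Python. This is an illustrative sketch, not code from the book, and the pixel values are made up:

```python
# A tiny illustrative sketch (values made up): a 2x2 grayscale "image"
# whose pixel brightness values become the entries x_i of a feature
# vector x in R^4, obtained by flattening the image row by row.
image = [
    [0.0, 0.5],
    [0.9, 1.0],
]

x = [pixel for row in image for pixel in row]  # the example as a vector

print(x)       # four features, one per pixel
print(len(x))  # n = 4
```

A real image would produce a much longer vector, but the principle is the same: one entry per measured feature.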
CHAPTER 5. MACHINE LEARNING BASICS
Many kinds of tasks can be solved with machine learning. Some of the most common machine learning tasks include the following:

• Classification: In this type of task, the computer program is asked to specify
which of k categories some input belongs to. To solve this task, the learning algorithm is usually asked to produce a function f : R^n → {1, . . . , k}. When y = f(x), the model assigns an input described by vector x to a category identified by numeric code y. There are other variants of the classification task, for example, where f outputs a probability distribution over classes. An example of a classification task is object recognition, where the input is an image (usually described as a set of pixel brightness values), and the output is a numeric code identifying the object in the image. For example, the Willow Garage PR2 robot is able to act as a waiter that can recognize different kinds of drinks and deliver them to people on command (Goodfellow et al., 2010). Modern object recognition is best accomplished with deep learning (Krizhevsky et al., 2012; Ioffe and Szegedy, 2015). Object
recognition is the same basic technology that allows computers to recognize faces (Taigman et al., 2014), which can be used to automatically tag people in photo collections and allow computers to interact more naturally with their users.

• Classification with missing inputs: Classification becomes more challenging if the computer program is not guaranteed that every measurement in its input vector will always be provided. In order to solve the classification task, the learning algorithm only has to define a single function mapping from a vector input to a categorical output. When some of the inputs may be missing, rather than providing a single classification function, the learning algorithm must learn a set of functions.
Each function corresponds to classifying x with a different subset of its inputs missing. This kind of situation arises frequently in medical diagnosis, because many kinds of medical tests are expensive or invasive. One way to efficiently define such a large set of functions is to learn a probability distribution over all of the relevant variables, then solve the classification task by marginalizing out the missing variables. With n input variables, we can now obtain all 2^n different classification functions needed for each possible set of missing inputs, but we only need to learn a single function describing the joint probability distribution. See Goodfellow et al. (2013b) for an example of a deep probabilistic model applied to such a task in this way. Many of the other tasks described in this section can also be generalized to work with missing inputs; classification with missing inputs is just one example of what machine learning can do.
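The marginalization scheme just described can be sketched with a toy discrete model. The joint probabilities below are arbitrary made-up numbers standing in for a learned model, but the mechanics are the real ones: a single joint distribution over the inputs and the label yields a classifier for every pattern of missing inputs.

```python
from itertools import product

# Toy sketch: one joint distribution p(x1, x2, y) over binary variables
# (numbers are arbitrary stand-ins for a learned model) replaces 2^n
# separate classifiers, one per pattern of missing inputs.
scores = {}
for x1, x2, y in product([0, 1], repeat=3):
    scores[(x1, x2, y)] = 1.0 + 2.0 * (y == x1) + 1.5 * (y == x2)
total = sum(scores.values())
joint = {k: v / total for k, v in scores.items()}  # normalize to sum to 1

def classify(x1=None, x2=None):
    """argmax_y p(y | observed inputs); None marks a missing input,
    which is marginalized out by summing over its possible values."""
    posterior = {0: 0.0, 1: 0.0}
    for (v1, v2, y), p in joint.items():
        if (x1 is None or v1 == x1) and (x2 is None or v2 == x2):
            posterior[y] += p
    return max(posterior, key=posterior.get)

print(classify(x1=1, x2=1))  # both inputs observed
print(classify(x1=0))        # x2 missing, marginalized out
```

The same `joint` table answers every query; only the pattern of `None` values changes, which is exactly the economy the text describes.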
• Regression: In this type of task, the computer program is asked to predict a numerical value given some input. To solve this task, the learning algorithm is asked to output a function f : R^n → R. This type of task is similar to classification, except that the format of output is different. An example of a regression task is the prediction of the expected claim amount that an insured person will make (used to set insurance premiums), or the prediction of future prices of securities. These kinds of predictions are also used for algorithmic trading.

• Transcription: In this type of task, the machine learning system is asked to observe a relatively unstructured representation of some kind of data and transcribe it into discrete, textual form. For example, in optical character recognition, the computer program is shown a photograph containing an image of text and is asked to return this text in the form of a sequence of characters (e.g., in ASCII or Unicode format).
Google Street View uses deep learning to process address numbers in this way (Goodfellow et al., 2014d). Another example is speech recognition, where the computer program is provided an audio waveform and emits a sequence of characters or word ID codes describing the words that were spoken in the audio recording. Deep learning is a crucial component of modern speech recognition systems used at major companies including Microsoft, IBM and Google (Hinton et al., 2012b).

• Machine translation: In a machine translation task, the input already consists of a sequence of symbols in some language, and the computer program must convert this into a sequence of symbols in another language.
This is commonly applied to natural languages, such as to translate from English to French. Deep learning has recently begun to have an important impact on this kind of task (Sutskever et al., 2014; Bahdanau et al., 2015).

• Structured output: Structured output tasks involve any task where the output is a vector (or other data structure containing multiple values) with important relationships between the different elements. This is a broad category, and subsumes the transcription and translation tasks described above, but also many other tasks. One example is parsing—mapping a natural language sentence into a tree that describes its grammatical structure and tagging nodes of the trees as being verbs, nouns, or adverbs, and so on. See Collobert (2011) for an example of deep learning applied to a parsing task. Another example is pixel-wise segmentation of images, where the computer program assigns every pixel in an image to a specific category. For example, deep learning can
be used to annotate the locations of roads in aerial photographs (Mnih and Hinton, 2010). The output need not have its form mirror the structure of the input as closely as in these annotation-style tasks. For example, in image captioning, the computer program observes an image and outputs a natural language sentence describing the image (Kiros et al., 2014a,b; Mao et al., 2015; Vinyals et al., 2015b; Donahue et al., 2014; Karpathy and Li, 2015; Fang et al., 2015; Xu et al., 2015). These tasks are called structured output tasks because the program must output several values that are all tightly inter-related. For example, the words produced by an image captioning program must form a valid sentence.

• Anomaly detection: In this type of task, the computer program sifts through a set of events or objects, and flags some of them as being unusual or atypical. An example of an anomaly detection task is credit card fraud detection.
By modeling your purchasing habits, a credit card company can detect misuse of your cards. If a thief steals your credit card or credit card information, the thief’s purchases will often come from a different probability distribution over purchase types than your own. The credit card company can prevent fraud by placing a hold on an account as soon as that card has been used for an uncharacteristic purchase. See Chandola et al. (2009) for a survey of anomaly detection methods.

• Synthesis and sampling: In this type of task, the machine learning algorithm is asked to generate new examples that are similar to those in the training data. Synthesis and sampling via machine learning can be useful for media applications where it can be expensive or boring for an artist to generate large volumes of content by hand.
For example, video games can automatically generate textures for large objects or landscapes, rather than requiring an artist to manually label each pixel (Luo et al., 2013). In some cases, we want the sampling or synthesis procedure to generate some specific kind of output given the input. For example, in a speech synthesis task, we provide a written sentence and ask the program to emit an audio waveform containing a spoken version of that sentence. This is a kind of structured output task, but with the added qualification that there is no single correct output for each input, and we explicitly desire a large amount of variation in the output, in order for the output to seem more natural and realistic.

• Imputation of missing values: In this type of task, the machine learning algorithm is given a new example x ∈ R^n, but with some entries x_i of x
Theofalgorithm a prediction of thethe values of thelearning missing R x x algorithm is giv en a new example , but with some entries of x • en entries. tries. missing. The algorithm must provide a∈prediction of the values of the missing 102 entries.
• Denoising: In this type of task, the machine learning algorithm is given as input a corrupted example x̃ ∈ R^n obtained by an unknown corruption process from a clean example x ∈ R^n. The learner must predict the clean example x from its corrupted version x̃, or more generally predict the conditional probability distribution p(x | x̃).

• Density estimation or probability mass function estimation: In the density estimation problem, the machine learning algorithm is asked to learn a function p_model : R^n → R, where p_model(x) can be interpreted as a probability density function (if x is continuous) or a probability mass function (if x is discrete) on the space that the examples were drawn from. To do such a task well (we will specify exactly what that means when we discuss performance measures P), the algorithm needs to learn the structure of the data it has seen. It must know where examples cluster tightly and where they are unlikely to occur.
Most of the tasks described above require that the learning algorithm has at least implicitly captured the structure of the probability distribution. Density estimation allows us to explicitly capture that distribution. In principle, we can then perform computations on that distribution in order to solve the other tasks as well. For example, if we have performed density estimation to obtain a probability distribution p(x), we can use that distribution to solve the missing value imputation task. If a value x_i is missing and all of the other values, denoted x_−i, are given, then we know the distribution over it is given by p(x_i | x_−i). In practice, density estimation does not always allow us to solve all of these related tasks, because in many cases the required operations on p(x) are computationally intractable.
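For a discrete distribution small enough to enumerate, computing p(x_i | x_−i) from the estimated joint really is just a normalization. A toy sketch with made-up probabilities:

```python
# Toy sketch (made-up numbers): from an estimated joint p(x1, x2) over
# two binary variables, recover p(x1 | x2) = p(x1, x2) / sum_v p(v, x2),
# i.e. the distribution over a missing x1 given an observed x2.
joint = {
    (0, 0): 0.4, (0, 1): 0.1,
    (1, 0): 0.2, (1, 1): 0.3,
}

def p_x1_given_x2(x2):
    norm = joint[(0, x2)] + joint[(1, x2)]  # marginal p(x2)
    return {v: joint[(v, x2)] / norm for v in (0, 1)}

print(p_x1_given_x2(1))  # distribution over the missing x1
```

The intractability the text warns about appears when x has many dimensions: the joint table and the normalizing sums grow exponentially, so implicit models cannot simply be enumerated like this.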
The typ ypes es of tasks we list here are in intended tended only to pro provide vide examples of what machin machinee learning can Of course, many other tasks and types of tasks are p ossible. The types of tasks do, not to define a rigid taxonomy of tasks. we list here are intended only to provide examples of what machine learning can do, not to define a rigid taxonomy of tasks.
5.1.2
The Performance Measure, P

In order to evaluate the abilities of a machine learning algorithm, we must design a quantitative measure of its performance. Usually this performance measure P is specific to the task T being carried out by the system.

For tasks such as classification, classification with missing inputs, and transcription, we often measure the accuracy of the model. Accuracy is just the proportion of examples for which the model produces the correct output. We can also obtain
equivalent information by measuring the error rate, the proportion of examples for which the model produces an incorrect output. We often refer to the error rate as the expected 0-1 loss. The 0-1 loss on a particular example is 0 if it is correctly classified and 1 if it is not. For tasks such as density estimation, it does not make sense to measure accuracy, error rate, or any other kind of 0-1 loss. Instead, we must use a different performance metric that gives the model a continuous-valued score for each example. The most common approach is to report the average log-probability the model assigns to some examples.

Usually we are interested in how well the machine learning algorithm performs on data that it has not seen before, since this determines how well it will work when deployed in the real world. We therefore evaluate these performance measures using a test set of data that is separate from the data used for training the machine
learning system.

The choice of performance measure may seem straightforward and objective, but it is often difficult to choose a performance measure that corresponds well to the desired behavior of the system.

In some cases, this is because it is difficult to decide what should be measured. For example, when performing a transcription task, should we measure the accuracy of the system at transcribing entire sequences, or should we use a more fine-grained performance measure that gives partial credit for getting some elements of the sequence correct? When performing a regression task, should we penalize the system more if it frequently makes medium-sized mistakes or if it rarely makes very large mistakes? These kinds of design choices depend on the application.

In other cases, we know what quantity we would ideally like to measure, but measuring it is impractical.
For example, this arises frequently in the context of density estimation. Many of the best probabilistic models represent probability distributions only implicitly. Computing the actual probability value assigned to a specific point in space in many such models is intractable. In these cases, one must design an alternative criterion that still corresponds to the design objectives, or design a good approximation to the desired criterion.
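The accuracy and error-rate measures described earlier in this section are simple to state in code. A generic sketch, with made-up test-set labels:

```python
# Accuracy = proportion of test-set examples the model gets right;
# error rate = average 0-1 loss = 1 - accuracy. Labels are made up.
def accuracy(predictions, targets):
    correct = sum(1 for p, t in zip(predictions, targets) if p == t)
    return correct / len(targets)

test_targets = [0, 1, 1, 0, 1]      # held-out labels
test_predictions = [0, 1, 0, 0, 1]  # hypothetical model outputs

acc = accuracy(test_predictions, test_targets)
print(acc)        # proportion correct
print(1.0 - acc)  # error rate (expected 0-1 loss)
```

The important part is which examples the measure is computed on: the lists above stand for a test set kept separate from training, as the text requires.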
5.1.3
The Experience, E

Machine learning algorithms can be broadly categorized as unsupervised or supervised by what kind of experience they are allowed to have during the learning process.

Most of the learning algorithms in this book can be understood as being allowed to experience an entire dataset. A dataset is a collection of many examples, as
CHAPTER 5. MACHINE LEARNING BASICS
defined in Sec. 5.1.1. Sometimes we will also call examples data points.

One of the oldest datasets studied by statisticians and machine learning researchers is the Iris dataset (Fisher, 1936). It is a collection of measurements of different parts of 150 iris plants. Each individual plant corresponds to one example. The features within each example are the measurements of each of the parts of the plant: the sepal length, sepal width, petal length and petal width. The dataset also records which species each plant belonged to. Three different species are represented in the dataset.

Unsupervised learning algorithms experience a dataset containing many features, then learn useful properties of the structure of this dataset. In the context of deep learning, we usually want to learn the entire probability distribution that generated a dataset, whether explicitly as in density estimation or implicitly for tasks like synthesis or denoising. Some other unsupervised learning algorithms perform other roles, like clustering, which consists of dividing the dataset into clusters of similar examples.

Supervised learning algorithms experience a dataset containing features, but each example is also associated with a label or target. For example, the Iris dataset is annotated with the species of each iris plant. A supervised learning algorithm can study the Iris dataset and learn to classify iris plants into three different species based on their measurements.

Roughly speaking, unsupervised learning involves observing several examples of a random vector x, and attempting to implicitly or explicitly learn the probability distribution p(x), or some interesting properties of that distribution, while supervised learning involves observing several examples of a random vector x and an associated value or vector y, and learning to predict y from x, usually by estimating p(y | x). The term supervised learning originates from the view of the target y being provided by an instructor or teacher who shows the machine learning system what to do. In unsupervised learning, there is no instructor or teacher, and the algorithm must learn to make sense of the data without this guide.

Unsupervised learning and supervised learning are not formally defined terms. The lines between them are often blurred. Many machine learning technologies can be used to perform both tasks. For example, the chain rule of probability states that for a vector x ∈ Rⁿ, the joint distribution can be decomposed as

    p(x) = ∏_{i=1}^{n} p(x_i | x_1, …, x_{i−1}).    (5.1)

This decomposition means that we can solve the ostensibly unsupervised problem of modeling p(x) by splitting it into n supervised learning problems. Alternatively, we
can solve the supervised learning problem of learning p(y | x) by using traditional unsupervised learning technologies to learn the joint distribution p(x, y) and inferring

    p(y | x) = p(x, y) / Σ_{y′} p(x, y′).    (5.2)

Though unsupervised learning and supervised learning are not completely formal or distinct concepts, they do help to roughly categorize some of the things we do with machine learning algorithms. Traditionally, people refer to regression, classification and structured output problems as supervised learning. Density estimation in support of other tasks is usually considered unsupervised learning.

Other variants of the learning paradigm are possible. For example, in semi-supervised learning, some examples include a supervision target but others do not. In multi-instance learning, an entire collection of examples is labeled as containing or not containing an example of a class, but the individual members of the collection are not labeled.
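Eqs. 5.1 and 5.2 can be checked numerically on a small discrete distribution. The joint table below is random and purely illustrative; the point is only that the chain-rule factorization and the conditional computed from the joint behave as the equations say.

```python
# Numerical check of Eqs. 5.1 and 5.2 on a tiny made-up discrete distribution.
import numpy as np

rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2))          # axes: x1, x2, y (arbitrary toy table)
joint /= joint.sum()                   # normalize into a valid distribution

# Eq. 5.1 (chain rule over the x components): p(x1, x2) = p(x1) p(x2 | x1)
p_x = joint.sum(axis=2)                # marginalize out y -> p(x1, x2)
p_x1 = p_x.sum(axis=1)                 # p(x1)
p_x2_given_x1 = p_x / p_x1[:, None]    # p(x2 | x1)
reconstructed = p_x1[:, None] * p_x2_given_x1
assert np.allclose(reconstructed, p_x)

# Eq. 5.2: p(y | x) = p(x, y) / sum over y' of p(x, y')
p_y_given_x = joint / joint.sum(axis=2, keepdims=True)
assert np.allclose(p_y_given_x.sum(axis=2), 1.0)  # each conditional sums to 1
```

Each conditional factor here is itself a supervised-style prediction target, which is exactly the sense in which Eq. 5.1 turns one unsupervised problem into n supervised ones.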
For a recent example of multi-instance learning with deep models, see Kotzias et al. (2015).

Some machine learning algorithms do not just experience a fixed dataset. For example, reinforcement learning algorithms interact with an environment, so there is a feedback loop between the learning system and its experiences. Such algorithms are beyond the scope of this book. Please see Sutton and Barto (1998) or Bertsekas and Tsitsiklis (1996) for information about reinforcement learning, and Mnih et al. (2013) for the deep learning approach to reinforcement learning.

Most machine learning algorithms simply experience a dataset. A dataset can be described in many ways. In all cases, a dataset is a collection of examples, which are in turn collections of features.

One common way of describing a dataset is with a design matrix. A design matrix is a matrix containing a different example in each row. Each column of the matrix corresponds to a different feature. For instance, the Iris dataset contains 150 examples with four features for each example. This means we can represent the dataset with a design matrix X ∈ R^{150×4}, where X_{i,1} is the sepal length of plant i, X_{i,2} is the sepal width of plant i, etc. We will describe most of the learning algorithms in this book in terms of how they operate on design matrix datasets.

Of course, to describe a dataset as a design matrix, it must be possible to describe each example as a vector, and each of these vectors must be the same size. This is not always possible. For example, if you have a collection of photographs with different widths and heights, then different photographs will contain different numbers of pixels, so not all of the photographs may be described with the same length of vector. Sec. 9.7 and Chapter 10 describe how to handle different types
of such heterogeneous data. In cases like these, rather than describing the dataset as a matrix with m rows, we will describe it as a set containing m elements: {x^(1), x^(2), …, x^(m)}. This notation does not imply that any two example vectors x^(i) and x^(j) have the same size.

In the case of supervised learning, the example contains a label or target as well as a collection of features. For example, if we want to use a learning algorithm to perform object recognition from photographs, we need to specify which object appears in each of the photos. We might do this with a numeric code, with 0 signifying a person, 1 signifying a car, 2 signifying a cat, etc. Often when working with a dataset containing a design matrix of feature observations X, we also provide a vector of labels y, with y_i providing the label for example i.

Of course, sometimes the label may be more than just a single number. For example, if we want to train a speech recognition system to transcribe entire sentences, then the label for each example sentence is a sequence of words.

Just as there is no formal definition of supervised and unsupervised learning, there is no rigid taxonomy of datasets or experiences. The structures described here cover most cases, but it is always possible to design new ones for new applications.
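The design-matrix and label-vector conventions above can be sketched in a few lines of NumPy. The numbers here are random stand-ins, not the real Iris measurements; only the shapes and indexing conventions match the text.

```python
# Sketch of the design-matrix convention: X holds one example per row and one
# feature per column; y holds one integer species label per example.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((150, 4))           # 150 iris plants, 4 measurements each (toy values)
y = rng.integers(0, 3, size=150)   # species label in {0, 1, 2} for each plant

# In the book's notation, X[i, 0] would play the role of X_{i,1},
# e.g. the sepal length of plant i.
sepal_length_of_plant_0 = X[0, 0]

print(X.shape, y.shape)            # (150, 4) (150,)
```

Every row of X is one example vector, which is exactly why the design matrix only works when all examples can be described by same-sized vectors.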
5.1.4
Example: Linear Regression
Our definition of a machine learning algorithm as an algorithm that is capable of improving a computer program's performance at some task via experience is somewhat abstract. To make this more concrete, we present an example of a simple machine learning algorithm: linear regression. We will return to this example repeatedly as we introduce more machine learning concepts that help to understand its behavior.

As the name implies, linear regression solves a regression problem. In other words, the goal is to build a system that can take a vector x ∈ Rⁿ as input and predict the value of a scalar y ∈ R as its output. In the case of linear regression, the output is a linear function of the input. Let ŷ be the value that our model predicts y should take on. We define the output to be

    ŷ = wᵀx    (5.3)

where w ∈ Rⁿ is a vector of parameters.

Parameters are values that control the behavior of the system. In this case, w_i is
the coefficient that we multiply by feature x_i before summing up the contributions from all the features. We can think of w as a set of weights that determine how each feature affects the prediction. If a feature x_i receives a positive weight w_i,
then increasing the value of that feature increases the value of our prediction ŷ. If a feature receives a negative weight, then increasing the value of that feature decreases the value of our prediction. If a feature's weight is large in magnitude, then it has a large effect on the prediction. If a feature's weight is zero, it has no effect on the prediction.

We thus have a definition of our task T: to predict y from x by outputting ŷ = wᵀx. Next we need a definition of our performance measure, P.

Suppose that we have a design matrix of m example inputs that we will not use for training, only for evaluating how well the model performs. We also have a vector of regression targets providing the correct value of y for each of these examples. Because this dataset will only be used for evaluation, we call it the test set. We refer to the design matrix of inputs as X^(test) and the vector of regression targets as y^(test).

One way of measuring the performance of the model is to compute the mean squared error of the model on the test set. If ŷ^(test) gives the predictions of the model on the test set, then the mean squared error is given by

    MSE_test = (1/m) Σ_i (ŷ^(test) − y^(test))_i².    (5.4)

Intuitively, one can see that this error measure decreases to 0 when ŷ^(test) = y^(test). We can also see that

    MSE_test = (1/m) ||ŷ^(test) − y^(test)||₂²,    (5.5)

so the error increases whenever the Euclidean distance between the predictions and the targets increases.

To make a machine learning algorithm, we need to design an algorithm that will improve the weights w in a way that reduces MSE_test when the algorithm is allowed to gain experience by observing a training set (X^(train), y^(train)). One intuitive way of doing this (which we will justify later, in Sec. 5.5.1) is just to minimize the mean squared error on the training set, MSE_train.
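The equivalence of the two forms of the mean squared error, Eqs. 5.4 and 5.5, is easy to verify numerically. The data below is synthetic; the check is just that the per-element average of squared errors equals the squared Euclidean norm divided by m.

```python
# Eqs. 5.4 and 5.5 in NumPy: the mean of per-example squared errors equals
# the squared Euclidean distance between predictions and targets, divided by m.
import numpy as np

rng = np.random.default_rng(0)
y_test = rng.random(20)                            # synthetic targets
y_hat = y_test + 0.1 * rng.standard_normal(20)     # noisy synthetic predictions

m = len(y_test)
mse_elementwise = np.mean((y_hat - y_test) ** 2)       # Eq. 5.4
mse_norm = np.linalg.norm(y_hat - y_test) ** 2 / m     # Eq. 5.5
assert np.isclose(mse_elementwise, mse_norm)
```

If the predictions were set exactly equal to the targets, both expressions would evaluate to 0, matching the intuition stated after Eq. 5.4.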
To minimize MSE_train, we can simply solve for where its gradient is 0:

    ∇_w MSE_train = 0    (5.6)

    ⇒ ∇_w (1/m) ||ŷ^(train) − y^(train)||₂² = 0    (5.7)

    ⇒ (1/m) ∇_w ||X^(train)w − y^(train)||₂² = 0    (5.8)
[Figure 5.1. Left panel, "Linear regression example": the data plotted over axes x₁ and y, with the fitted line ŷ = w₁x₁. Right panel, "Optimization of w": MSE(train) plotted as a function of w₁.]
    ⇒ ∇_w (X^(train)w − y^(train))ᵀ(X^(train)w − y^(train)) = 0    (5.9)

    ⇒ ∇_w (wᵀX^(train)ᵀX^(train)w − 2wᵀX^(train)ᵀy^(train) + y^(train)ᵀy^(train)) = 0    (5.10)

    ⇒ 2X^(train)ᵀX^(train)w − 2X^(train)ᵀy^(train) = 0    (5.11)

    ⇒ w = (X^(train)ᵀX^(train))⁻¹ X^(train)ᵀ y^(train)    (5.12)

The system of equations whose solution is given by Eq. 5.12 is known as the normal equations. Evaluating Eq. 5.12 constitutes a simple learning algorithm. For an example of the linear regression learning algorithm in action, see Fig. 5.1.

It is worth noting that the term linear regression is often used to refer to a slightly more sophisticated model with one additional parameter, an intercept term b. In this model

    ŷ = wᵀx + b    (5.13)

so the mapping from parameters to predictions is still a linear function but the mapping from features to predictions is now an affine function. This extension to affine functions means that the plot of the model's predictions still looks like a line, but it need not pass through the origin. Instead of adding the bias parameter b, one can continue to use the model with only weights but augment x with an
extra entry that is always set to 1. The weight corresponding to the extra 1 entry plays the role of the bias parameter. We will frequently use the term "linear" when referring to affine functions throughout this book.

The intercept term b is often called the bias parameter of the affine transformation. This terminology derives from the point of view that the output of the transformation is biased toward being b in the absence of any input. This term is different from the idea of a statistical bias, in which a statistical estimation algorithm's expected estimate of a quantity is not equal to the true quantity.

Linear regression is of course an extremely simple and limited learning algorithm, but it provides an example of how a learning algorithm can work. In the subsequent sections we will describe some of the basic principles underlying learning algorithm design and demonstrate how these principles can be used to build more complicated learning algorithms.
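The whole learning algorithm of Eq. 5.12, including the augment-with-1s trick for the bias, fits in a few lines of NumPy. The data below is synthetic and noiseless so the recovered parameters can be checked exactly; in practice one would usually prefer `np.linalg.lstsq` over forming the normal equations, for numerical stability.

```python
# Linear regression via the normal equations (Eq. 5.12), with the bias b
# handled by appending a constant-1 feature to each input, as described above.
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3
X = rng.standard_normal((m, n))
true_w = np.array([2.0, -1.0, 0.5])
true_b = 4.0
y = X @ true_w + true_b                    # noiseless targets for a clean check

X_aug = np.hstack([X, np.ones((m, 1))])    # augment x with an extra 1 entry
# Eq. 5.12: w = (X^T X)^{-1} X^T y, computed with solve() rather than an
# explicit matrix inverse.
w_hat = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)

print(w_hat)   # first n entries recover w; the last entry recovers the bias b
```

The weight learned for the constant-1 column is exactly the bias parameter, illustrating why the augmented model and the explicit-intercept model of Eq. 5.13 are equivalent.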
5.2
Capacity, Overfitting and Underfitting
The central challenge in machine learning is that we must perform well on new, previously unseen inputs—not just those on which our model was trained. The ability to perform well on previously unobserved inputs is called generalization.

Typically, when training a machine learning model, we have access to a training set; we can compute some error measure on the training set, called the training error; and we reduce this training error. So far, what we have described is simply an optimization problem. What separates machine learning from optimization is that we want the generalization error, also called the test error, to be low as well. The generalization error is defined as the expected value of the error on a new input. Here the expectation is taken across different possible inputs, drawn from the distribution of inputs we expect the system to encounter in practice.

We typically estimate the generalization error of a machine learning model by measuring its performance on a test set of examples that were collected separately from the training set.

In our linear regression example, we trained the model by minimizing the training error,

    (1/m^(train)) ||X^(train)w − y^(train)||₂²,    (5.14)

but we actually care about the test error, (1/m^(test)) ||X^(test)w − y^(test)||₂².

How can we affect performance on the test set when we get to observe only the training set? The field of statistical learning theory provides some answers. If the
training and the test set are collected arbitrarily arbitrarily,, there is indeed little we can do. If we are allo allowed wed to make some assumptions about ho how w the training and test set training and the test set are collected arbitrarily , there is indeed little we can do. are collected, then we can mak makee some progress. If we are allowed to make some assumptions about how the training and test set train and are generated by a probability distribution ov over er datasets are The collected, thentest wedata can mak e some progress. called the data gener generating ating pr pro ocess ess.. We typically mak makee a set of assumptions kno known wn The train dataassumptions are generated by a probability distribution overexamples datasets collectiv collectively ely asand thetest i.i.d. These assumptions are that the called data gener atingendent processfrom . We each typically mak e athat set of knotest wn in eac each hthe dataset are indep independent other, and theassumptions train set and collectiv ely as al the i.i.d. assumptions Thesethe assumptions are thatdistribution the examples set are identic identical ally ly distribute distributed d, dra drawn wn from same probability as in eac h dataset are indep endent from each other, and that the train set and test eac each h other. This assumption allo allows ws us to describe the data generating process set are identical lyy distribution distributed, dra wna from same probability with a probabilit probability ov over er singlethe example. The same distribution distribution as is eac h other. This assumption allo ws us to describe the data generating process then used to generate ev every ery train example and every test example. We call that with a probabilit y distribution ovdata er a gener singleating example. The same distribution is p data. This shared underlying distribution the generating distribution distribution, , denoted then used to generate ev ery train example and every test example. 
W e call that probabilistic framework and the i.i.d. assumptions allo allow w us to mathematically sharedthe underlying distribution data gener . This study relationship betw etween eenthe training errorating and distribution test error. , denoted p probabilistic framework and the i.i.d. assumptions allow us to mathematically One immediate connection can observe betw between een error. the training and test error study the relationship betweenwe training error and test is that the expected training error of a randomly selected mo model del is equal to the One immediate connection we can observe betw een the training test error exp expected ected test error of that mo model. del. Suppose we ha hav ve a probabilityand distribution the expected error of a randomly moset deland is equal to the pis(xthat , y ) and we sampletraining from it rep repeatedly eatedly to generateselected the train the test set. exp ected test error of that mo del. Suppose w e ha v e a probability distribution For some fixed value w , then the exp expected ected training set error is exactly the same as p ( x , y ) and w e sample from it rep eatedly generate theare train set and thethe test set. the exp expected ected test set error, b ecause both to exp expectations ectations formed using same w F or some fixed v alue , then the exp ected training set error is exactly the same as dataset sampling process. The only difference betw etween een the tw two o conditions is the the exp testtoset b ecause both expectations are formed using the same name wected e assign theerror, dataset we sample. dataset sampling process. The only difference between the two conditions is the Ofwcourse, we e use awe machine name e assignwhen to thew dataset sample.learning algorithm, we do not fix the parameters ahead of time, then sample b oth datasets. 
We sample the training set, then use it to choose the parameters to reduce training set error, then sample the test set. Under this process, the expected test error is greater than or equal to the expected value of training error. The factors determining how well a machine learning algorithm will perform are its ability to:

1. Make the training error small.

2. Make the gap between training and test error small.

These two factors correspond to the two central challenges in machine learning: underfitting and overfitting. Underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set. Overfitting occurs when the gap between the training error and test error is too large.

We can control whether a model is more likely to overfit or underfit by altering its capacity.
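The claims above about expected error can be checked with a small simulation: for parameters fixed before seeing any data, training and test error have the same expectation, while parameters fitted to the training set open a gap. The linear data generating distribution and all numeric values below are illustrative choices, not part of the book's formal argument.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dataset(m, w_true=2.0, b_true=1.0, noise=0.5):
    """Draw m examples from one fixed data generating distribution p(x, y)."""
    x = rng.uniform(-1.0, 1.0, size=m)
    y = b_true + w_true * x + rng.normal(0.0, noise, size=m)
    return x, y

def mse(w, b, x, y):
    return np.mean((b + w * x - y) ** 2)

fixed_w, fixed_b = 1.5, 0.8        # parameters chosen before seeing any data
train_errs, test_errs, gaps = [], [], []
for _ in range(2000):
    x_tr, y_tr = sample_dataset(20)
    x_te, y_te = sample_dataset(20)
    # Fixed parameters: both errors are expectations under the same process.
    train_errs.append(mse(fixed_w, fixed_b, x_tr, y_tr))
    test_errs.append(mse(fixed_w, fixed_b, x_te, y_te))
    # Fitted parameters: chosen to reduce training error, so on average
    # test error is greater than or equal to training error.
    w_hat, b_hat = np.polyfit(x_tr, y_tr, 1)
    gaps.append(mse(w_hat, b_hat, x_te, y_te) - mse(w_hat, b_hat, x_tr, y_tr))

print(np.mean(train_errs), np.mean(test_errs))  # nearly identical
print(np.mean(gaps))                            # positive on average
```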
Informally, a model's capacity is its ability to fit a wide variety of
CHAPTER 5. MACHINE LEARNING BASICS
functions. Models with low capacity may struggle to fit the training set. Models with high capacity can overfit by memorizing properties of the training set that do not serve them well on the test set.

One way to control the capacity of a learning algorithm is by choosing its hypothesis space, the set of functions that the learning algorithm is allowed to select as being the solution. For example, the linear regression algorithm has the set of all linear functions of its input as its hypothesis space. We can generalize linear regression to include polynomials, rather than just linear functions, in its hypothesis space. Doing so increases the model's capacity.

A polynomial of degree one gives us the linear regression model with which we are already familiar, with prediction

    ŷ = b + wx.    (5.15)

By introducing x² as another feature provided to the linear regression model, we can learn a model that is quadratic as a function of x:

    ŷ = b + w₁x + w₂x².    (5.16)

Though this model implements a quadratic function of its input, the output is still a linear function of the parameters, so we can still use the normal equations to train the model in closed form. We can continue to add more powers of x as additional features, for example to obtain a polynomial of degree 9:

    ŷ = b + Σᵢ₌₁⁹ wᵢxⁱ.    (5.17)
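As a sketch of how these models can be trained in closed form, the code below builds the expanded feature vector (1, x, x², ..., x^degree) and solves the normal equations directly. The synthetic quadratic dataset is a hypothetical example introduced only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=30)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0.0, 0.1, size=30)  # quadratic truth

def polynomial_design(x, degree):
    """Feature matrix with columns [1, x, x^2, ..., x^degree]; the constant
    column plays the role of the bias b."""
    return np.column_stack([x**i for i in range(degree + 1)])

def fit_normal_equations(X, y):
    """Closed-form least squares: solve X^T X w = X^T y. The model can be
    nonlinear in x while remaining linear in the parameters w."""
    return np.linalg.solve(X.T @ X, X.T @ y)

train_mse = {}
for degree in (1, 2, 9):   # Eq. 5.15, Eq. 5.16 and Eq. 5.17 respectively
    X = polynomial_design(x, degree)
    w = fit_normal_equations(X, y)
    train_mse[degree] = np.mean((X @ w - y) ** 2)

print(train_mse)  # training error shrinks as the degree (capacity) grows
```

In practice, `np.linalg.lstsq` is numerically preferable to forming X⊤X at high degrees; the direct normal equations are shown only to mirror the text.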
Machine learning algorithms will generally perform best when their capacity is appropriate in regard to the true complexity of the task they need to perform and the amount of training data they are provided with. Models with insufficient capacity are unable to solve complex tasks. Models with high capacity can solve complex tasks, but when their capacity is higher than needed to solve the present task they may overfit.

Fig. 5.2 shows this principle in action. We compare a linear, quadratic and degree-9 predictor attempting to fit a problem where the true underlying function is quadratic. The linear function is unable to capture the curvature in the true underlying problem, so it underfits. The degree-9 predictor is capable of representing the correct function, but it is also capable of representing infinitely many other functions that pass exactly through the training points, because we have more
parameters than training examples. We have little chance of choosing a solution that generalizes well when so many wildly different solutions exist. In this example, the quadratic model is perfectly matched to the true structure of the task so it generalizes well to new data.
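The comparison in Fig. 5.2 can be reproduced numerically. The quadratic ground-truth function, noise level and sample sizes below are hypothetical choices; with 10 training points, the degree-9 polynomial has as many parameters as examples, so it passes exactly through the training set.

```python
import numpy as np

rng = np.random.default_rng(2)

def true_fn(x):
    return 0.5 - x + 2.0 * x**2        # quadratic underlying function

x_train = rng.uniform(-1.0, 1.0, size=10)
y_train = true_fn(x_train) + rng.normal(0.0, 0.1, size=10)
x_test = rng.uniform(-1.0, 1.0, size=200)
y_test = true_fn(x_test) + rng.normal(0.0, 0.1, size=200)

errors = {}
for degree in (1, 2, 9):
    coefs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    errors[degree] = (
        np.mean((np.polyval(coefs, x_train) - y_train) ** 2),  # train MSE
        np.mean((np.polyval(coefs, x_test) - y_test) ** 2),    # test MSE
    )

# Degree 1 underfits (high train and test error); degree 9 overfits (near-zero
# train error, large test error); degree 2 matches the task and generalizes.
print(errors)
```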
[Figure 5.2: a linear function, a quadratic function, and a degree-9 polynomial fit to data whose true underlying function is quadratic; axes x (horizontal) and y (vertical).]
So far we have only described changing a model's capacity by changing the number of input features it has (and simultaneously adding new parameters associated with those features). There are in fact many ways of changing a model's capacity. Capacity is not determined only by the choice of model. The model specifies which family of functions the learning algorithm can choose from when varying the parameters in order to reduce a training objective. This is called the representational capacity of the model. In many cases, finding the best function within this family is a very difficult optimization problem. In practice, the learning algorithm does not actually find the best function, but merely one that significantly reduces the training error. These additional limitations, such as the imperfection
of the optimization algorithm, mean that the learning algorithm's effective capacity may be less than the representational capacity of the model family.

Our modern ideas about improving the generalization of machine learning models are refinements of thought dating back to philosophers at least as early as Ptolemy. Many early scholars invoke a principle of parsimony that is now most widely known as Occam's razor (c. 1287-1347). This principle states that among competing hypotheses that explain known observations equally well, one should choose the "simplest" one. This idea was formalized and made more precise in the 20th century by the founders of statistical learning theory (Vapnik and Chervonenkis, 1971; Vapnik, 1982; Blumer et al., 1989; Vapnik, 1995).

Statistical learning theory provides various means of quantifying model capacity.
Among these, the most well-known is the Vapnik-Chervonenkis dimension, or VC dimension. The VC dimension measures the capacity of a binary classifier. The VC dimension is defined as being the largest possible value of m for which there exists a training set of m different x points that the classifier can label arbitrarily.

Quantifying the capacity of the model allows statistical learning theory to make quantitative predictions. The most important results in statistical learning theory show that the discrepancy between training error and generalization error is bounded from above by a quantity that grows as the model capacity grows but shrinks as the number of training examples increases (Vapnik and Chervonenkis, 1971; Vapnik, 1982; Blumer et
al., 1989; Vapnik, 1995). These bounds provide intellectual justification that machine learning algorithms can work, but they are rarely used in practice when working with deep learning algorithms. This is in part because the bounds are often quite loose and in part because it can be quite difficult to determine the capacity of deep learning algorithms. The problem of determining the capacity of a deep learning model is especially difficult because the effective capacity is limited by the capabilities of the optimization algorithm, and we have little theoretical understanding of the very general non-convex optimization problems involved in deep learning.

We must remember that while simpler functions are
more likely to generalize (to have a small gap between training and test error) we must still choose a sufficiently complex hypothesis to achieve low training error. Typically, training error decreases until it asymptotes to the minimum possible error value as model capacity increases (assuming the error measure has a minimum value). Typically, generalization error has a U-shaped curve as a function of model capacity. This is illustrated in Fig. 5.3.

To reach the most extreme case of arbitrarily high capacity, we introduce the concept of non-parametric models. So far, we have seen only parametric
models, such as linear regression. Parametric models learn a function described by a parameter vector whose size is finite and fixed before any data is observed. Non-parametric models have no such limitation.

Sometimes, non-parametric models are just theoretical abstractions (such as an algorithm that searches over all possible probability distributions) that cannot be implemented in practice. However, we can also design practical non-parametric models by making their complexity a function of the training set size. One example of such an algorithm is nearest neighbor regression. Unlike linear regression, which has a fixed-length vector of weights, the nearest neighbor regression model simply stores the X and y from the training set. When asked to classify a test point x, the model looks up the nearest entry in the training set and returns the associated regression target. In other words, ŷ = yᵢ where i = arg minᵢ ||Xᵢ,: − x||₂².
The algorithm can also be generalized to distance metrics other than the L² norm, such as learned distance metrics (Goldberger et al., 2005). If the algorithm is allowed to break ties by averaging the yᵢ values for all Xᵢ,: that are tied for nearest, then this algorithm is able to achieve the minimum possible training error (which might be greater than zero, if two identical inputs are associated with different outputs) on any regression dataset.

Finally, we can also create a non-parametric learning algorithm by wrapping a parametric learning algorithm inside another algorithm that increases the number
of parameters as needed. For example, we could imagine an outer loop of learning that changes the degree of the polynomial learned by linear regression on top of a polynomial expansion of the input.

The ideal model is an oracle that simply knows the true probability distribution that generates the data. Even such a model will still incur some error on many problems, because there may still be some noise in the distribution. In the case of supervised learning, the mapping from x to y may be inherently stochastic, or y may be a deterministic function that involves other variables besides those included in x. The error incurred by an oracle making predictions from the true distribution p(x, y) is called the Bayes error.

Training and generalization error vary as the size of the training set varies. Expected generalization error can
never increase as the number of training examples increases. For non-parametric models, more data yields better generalization until the best possible error is achieved. Any fixed parametric model with less than optimal capacity will asymptote to an error value that exceeds the Bayes error. See Fig. 5.4 for an illustration. Note that it is possible for the model to have optimal capacity and yet still have a large gap between training and generalization error. In this situation, we may be able to reduce this gap by gathering more training examples.
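To make the nearest neighbor regression procedure described above concrete, here is a minimal sketch, including the tie-breaking rule of averaging the targets of all equally near training points. The toy dataset is hypothetical.

```python
import numpy as np

def nearest_neighbor_regress(X_train, y_train, x):
    """Predict for a single test point x: find the nearest stored training
    input under the squared L2 norm and average the targets of all ties."""
    dists = np.sum((X_train - x) ** 2, axis=1)
    tied_nearest = dists == dists.min()
    return y_train[tied_nearest].mean()

# No fixed-length parameter vector: the model just stores X and y, so its
# capacity grows with the training set size.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.0, 4.0, 9.0])

# On training inputs without duplicates, the nearest neighbor of each point
# is the point itself, so training error is exactly zero.
train_preds = np.array([nearest_neighbor_regress(X_train, y_train, x)
                        for x in X_train])
print(np.mean((train_preds - y_train) ** 2))   # 0.0

# A query exactly halfway between two stored points averages their targets.
print(nearest_neighbor_regress(X_train, y_train, np.array([1.5])))  # 2.5
```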
5.2.1 The No Free Lunch Theorem
Learning theory claims that a machine learning algorithm can generalize well from a finite training set of examples. This seems to contradict some basic principles of logic. Inductive reasoning, or inferring general rules from a limited set of examples, is not logically valid. To logically infer a rule describing every member of a set, one must have information about every member of that set.

In part, machine learning avoids this problem by offering only probabilistic rules, rather than the entirely certain rules used in purely logical reasoning. Machine learning promises to find rules that are probably correct about most members of the set they concern.

Unfortunately, even this does not resolve the entire problem. The no free lunch theorem for machine learning (Wolpert, 1996) states that, averaged over all possible data generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points.
In other words, in some sense, no machine learning algorithm is universally any better than any other. The most sophisticated algorithm we can conceive of has the same average performance (over all possible tasks) as merely predicting that every point belongs to the same class.
Fortunately, these results hold only when we average over all possible data generating distributions. If we make assumptions about the kinds of probability distributions we encounter in real-world applications, then we can design learning algorithms that perform well on these distributions.

This means that the goal of machine learning research is not to seek a universal learning algorithm or the absolute best learning algorithm. Instead, our goal is to understand what kinds of distributions are relevant to the "real world" that an AI agent experiences, and what kinds of machine learning algorithms perform well on data drawn from the kinds of data generating distributions we care about.
5.2.2 Regularization
The no free lunch theorem implies that we must design our machine learning algorithms to perform well on a specific task. We do so by building a set of preferences into the learning algorithm. When these preferences are aligned with the learning problems we ask the algorithm to solve, it performs better.

So far, the only method of modifying a learning algorithm we have discussed is to increase or decrease the model's capacity by adding or removing functions from the hypothesis space of solutions the learning algorithm is able to choose. We gave the specific example of increasing or decreasing the degree of a polynomial for a regression problem. The view we have described so far is oversimplified.

The behavior of our algorithm is strongly affected not just by how large we make the set of functions allowed in its hypothesis space, but by the specific identity of those functions.
The learning algorithm we have studied so far, linear regression, has a hypothesis space consisting of the set of linear functions of its input. These linear functions can be very useful for problems where the relationship between inputs and outputs truly is close to linear. They are less useful for problems that behave in a very nonlinear fashion. For example, linear regression would not perform very well if we tried to use it to predict sin(x) from x. We can thus control the performance of our algorithms by choosing what kind of functions we allow them to draw solutions from, as well as by controlling the amount of these functions.

We can also give a learning algorithm a preference for one solution in its hypothesis space to another. This means that both functions are eligible, but one is preferred.
The unpreferred solution will be chosen only if it fits the training data significantly better than the preferred solution.

For example, we can modify the training criterion for linear regression to include weight decay. To perform linear regression with weight decay, we minimize
a sum comprising both the mean squared error on the training set and a criterion J(w) that expresses a preference for the weights to have smaller squared L² norm. Specifically,

    J(w) = MSE_train + λw⊤w,    (5.18)

where λ is a value chosen ahead of time that controls the strength of our preference for smaller weights. When λ = 0, we impose no preference, and larger λ forces the weights to become smaller. Minimizing J(w) results in a choice of weights that make a tradeoff between fitting the training data and being small. This gives us solutions that have a smaller slope, or put weight on fewer of the features. As an example of how we can control a model's tendency to overfit or underfit via weight decay, we can train a high-degree polynomial regression model with different values of λ. See Fig. 5.5 for the results.
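Since J(w) in Eq. 5.18 is quadratic in w, it has a closed-form minimizer: with MSE_train = (1/m)||Xw − y||², setting the gradient to zero gives (X⊤X + mλI)w = X⊤y. The sketch below applies this to degree-9 polynomial features. The sine-based dataset is a hypothetical example, and for simplicity the constant (bias) feature is regularized along with the rest, whereas the text's formulation keeps the bias b separate.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1.0, 1.0, size=20)
y = np.sin(2.0 * x) + rng.normal(0.0, 0.1, size=20)
X = np.column_stack([x**i for i in range(10)])   # degree-9 polynomial features

def fit_weight_decay(X, y, lam):
    """Minimize J(w) = MSE_train + lam * w^T w in closed form.
    With MSE_train = (1/m)||Xw - y||^2, the gradient vanishes when
    (X^T X + m * lam * I) w = X^T y."""
    m, n = X.shape
    return np.linalg.solve(X.T @ X + m * lam * np.eye(n), X.T @ y)

results = {}
for lam in (0.0, 1e-3, 1.0):
    w = fit_weight_decay(X, y, lam)
    results[lam] = (np.mean((X @ w - y) ** 2), float(w @ w))

# Larger lam trades higher training error for a smaller squared weight norm.
for lam, (err, norm) in results.items():
    print(lam, err, norm)
```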
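The effect of the weight decay criterion in Eq. 5.18 can be sketched numerically. The following is a minimal illustration, assuming a small synthetic dataset (the function name, data, and sizes are illustrative, not from the book): it minimizes (1/m)‖Xw − y‖² + λwᵀw in closed form, and larger λ shrinks the learned weights.

```python
import numpy as np

def weight_decay_fit(X, y, lam):
    # Closed-form minimizer of (1/m)||Xw - y||^2 + lam * w^T w:
    # setting the gradient to zero gives (X^T X + lam * m * I) w = X^T y.
    m, n = X.shape
    return np.linalg.solve(X.T @ X + lam * m * np.eye(n), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

w_no_decay = weight_decay_fit(X, y, lam=0.0)
w_decay = weight_decay_fit(X, y, lam=10.0)
# Larger lambda trades training fit for a smaller weight norm.
assert np.linalg.norm(w_decay) < np.linalg.norm(w_no_decay)
```

With lam=0 this reduces to ordinary least squares; increasing lam pulls the solution toward zero, the tradeoff described above.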
More generally, we can regularize a model that learns a function f(x; θ) by adding a penalty called a regularizer to the cost function. In the case of weight decay, the regularizer is Ω(w) = wᵀw. In Chapter 7, we will see that many other
regularizers are possible.

Expressing preferences for one function over another is a more general way of controlling a model's capacity than including or excluding members from the hypothesis space. We can think of excluding a function from a hypothesis space as expressing an infinitely strong preference against that function.

In our weight decay example, we expressed our preference for linear functions defined with smaller weights explicitly, via an extra term in the criterion we minimize. There are many other ways of expressing preferences for different solutions, both implicitly and explicitly. Together, these different approaches are known as regularization. Regularization is one of the central concerns of the field of machine learning, rivaled in its importance only by optimization.

The no free lunch theorem has made it clear that there is no best machine learning algorithm, and, in particular, no best form of regularization.
Instead we must choose a form of regularization that is well-suited to the particular task we want to solve. The philosophy of deep learning in general and this book in particular is that a very wide range of tasks (such as all of the intellectual tasks that people can do) may all be solved effectively using very general-purpose forms of regularization.
5.3
Hyperparameters and Validation Sets
Most machine learning algorithms have several settings that we can use to control the behavior of the learning algorithm. These settings are called hyperparameters. The values of hyperparameters are not adapted by the learning algorithm itself (though we can design a nested learning procedure where one learning algorithm learns the best hyperparameters for another learning algorithm).

In the polynomial regression example we saw in Fig. 5.2, there is a single hyperparameter: the degree of the polynomial, which acts as a capacity hyperparameter. The λ value used to control the strength of weight decay is another example of a hyperparameter.

Sometimes a setting is chosen to be a hyperparameter that the learning algorithm does not learn because it is difficult to optimize. More frequently, we do not learn the hyperparameter because it is not appropriate to learn that hyperparameter on the training set.
This applies to all hyperparameters that control model capacity. If learned on the training set, such hyperparameters would always
choose the maximum possible model capacity, resulting in overfitting (refer to Fig. 5.3). For example, we can always fit the training set better with a higher degree polynomial and a weight decay setting of λ = 0 than we could with a lower degree polynomial and a positive weight decay setting.

To solve this problem, we need a validation set of examples that the training algorithm does not observe.

Earlier we discussed how a held-out test set, composed of examples coming from the same distribution as the training set, can be used to estimate the generalization error of a learner, after the learning process has completed. It is important that the test examples are not used in any way to make choices about the model, including its hyperparameters. For this reason, no example from the test set can be used in the validation set. Therefore, we always construct the validation set from the training data. Specifically, we split the training data into two disjoint subsets.
One of these subsets is used to learn the parameters. The other subset is our validation set, used to estimate the generalization error during or after training, allowing for the hyperparameters to be updated accordingly. The subset of data used to learn the parameters is still typically called the training set, even though this may be confused with the larger pool of data used for the entire training process. The subset of data used to guide the selection of hyperparameters is called the validation set. Typically, one uses about 80% of the training data for training and 20% for validation. Since the validation set is used to "train" the hyperparameters, the validation set error will underestimate the generalization error, though typically by a smaller amount than the training error.
After all hyperparameter optimization is complete, the generalization error may be estimated using the test set.

In practice, when the same test set has been used repeatedly to evaluate performance of different algorithms over many years, and especially if we consider all the attempts from the scientific community at beating the reported state-of-the-art performance on that test set, we end up having optimistic evaluations with the test set as well. Benchmarks can thus become stale and then do not reflect the true field performance of a trained system. Thankfully, the community tends to move on to new (and usually more ambitious and larger) benchmark datasets.
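The 80%/20% train/validation split described above can be sketched as follows; the function name and the toy arrays are illustrative, not from the book:

```python
import numpy as np

def train_valid_split(X, y, valid_fraction=0.2, seed=0):
    # Shuffle the example indices, then reserve a fraction for validation.
    m = len(X)
    idx = np.random.default_rng(seed).permutation(m)
    n_valid = int(m * valid_fraction)
    valid_idx, train_idx = idx[:n_valid], idx[n_valid:]
    return X[train_idx], y[train_idx], X[valid_idx], y[valid_idx]

X = np.arange(100.0).reshape(50, 2)
y = np.arange(50.0)
X_tr, y_tr, X_va, y_va = train_valid_split(X, y)
assert len(X_tr) == 40 and len(X_va) == 10
```

The validation pair (X_va, y_va) is what the hyperparameter search would score against; the test set stays untouched until the very end.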
5.3.1
Cross-Validation
Dividing the dataset into a fixed training set and a fixed test set can be problematic if it results in the test set being small. A small test set implies statistical uncertainty around the estimated average test error, making it difficult to claim that algorithm A works better than algorithm B on the given task.
When the dataset has hundreds of thousands of examples or more, this is not a serious issue. When the dataset is too small, there are alternative procedures, which allow one to use all of the examples in the estimation of the mean test error, at the price of increased computational cost. These procedures are based on the idea of repeating the training and testing computation on different randomly chosen subsets or splits of the original dataset. The most common of these is the k-fold cross-validation procedure, shown in Algorithm 5.1, in which a partition of the dataset is formed by splitting it into k non-overlapping subsets. The test error may then be estimated by taking the average test error across k trials. On trial i, the i-th subset of the data is used as the test set and the rest of the data is used as the training set.
One problem is that there exist no unbiased estimators of the variance of such average error estimators (Bengio and Grandvalet, 2004), but approximations are typically used.
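A minimal sketch of k-fold cross-validation in the spirit of Algorithm 5.1, assuming a dataset given as arrays; `fit` and `loss` are hypothetical stand-ins for the learning algorithm A and the per-example loss L:

```python
import numpy as np

def k_fold_errors(X, y, fit, loss, k=5, seed=0):
    # Partition the shuffled indices into k non-overlapping subsets D_i,
    # train on the complement of each, and collect per-example test errors.
    idx = np.random.default_rng(seed).permutation(len(X))
    errors = []
    for fold in np.array_split(idx, k):        # the k test subsets D_i
        train = np.setdiff1d(idx, fold)        # D \ D_i
        model = fit(X[train], y[train])        # f_i = A(D \ D_i)
        errors.extend(loss(model, x, t) for x, t in zip(X[fold], y[fold]))
    return np.array(errors)  # mean estimates the generalization error

# Toy usage: "learn" the mean of y and score with squared error.
X = np.arange(20.0).reshape(10, 2)
y = np.arange(10.0)
errs = k_fold_errors(X, y, fit=lambda X, y: y.mean(),
                     loss=lambda f, x, t: (f - t) ** 2)
assert len(errs) == len(X)  # one error per example in D
```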
5.4
Estimators, Bias and Variance
The field of statistics gives us many tools that can be used to achieve the machine learning goal of solving a task not only on the training set but also to generalize. Foundational concepts such as parameter estimation, bias and variance are useful to formally characterize notions of generalization, underfitting and overfitting.
5.4.1
Point Estimation
Point estimation is the attempt to provide the single "best" prediction of some quantity of interest. In general the quantity of interest can be a single parameter or a vector of parameters in some parametric model, such as the weights in our linear regression example in Sec. 5.1.4, but it can also be a whole function.

In order to distinguish estimates of parameters from their true value, our convention will be to denote a point estimate of a parameter θ by θ̂.

Let {x^(1), …, x^(m)} be a set of m independent and identically distributed (i.i.d.) data points. A point estimator or statistic is any function of the data:

    θ̂_m = g(x^(1), …, x^(m)).    (5.19)

The definition does not require that g return a value that is close to the true θ or even that the range of g is the same as the set of allowable values of θ. This definition of a point estimator is very general and allows the designer of an estimator great flexibility.
While almost any function thus qualifies as an estimator,
Algorithm 5.1: The k-fold cross-validation algorithm. It can be used to estimate generalization error of a learning algorithm A when the given dataset D is too small for a simple train/test or train/valid split to yield accurate estimation of generalization error, because the mean of a loss L on a small test set may have too high variance. The dataset D contains as elements the abstract examples z^(i) (for the i-th example), which could stand for an (input, target) pair z^(i) = (x^(i), y^(i)) in the case of supervised learning, or for just an input z^(i) = x^(i) in the case of unsupervised learning. The algorithm returns the vector of errors e for each example in D, whose mean is the estimated generalization error. The errors on individual examples can be used to compute a confidence interval around the mean (Eq. 5.47).
While these confidence intervals are not well-justified after use of cross-validation, it is still common practice to use them to declare that algorithm A is better than algorithm B only if the confidence interval of the error of algorithm A lies below and does not intersect the confidence interval of algorithm B.

Define KFoldXV(D, A, L, k):
Require: D, the given dataset, with elements z^(i)
Require: A, the learning algorithm, seen as a function that takes a dataset as input and outputs a learned function
Require: L, the loss function, seen as a function from a learned function f and an example z^(i) ∈ D to a scalar ∈ ℝ
Require: k, the number of folds
Split D into k mutually exclusive subsets D_i, whose union is D.
for i from 1 to k do
    f_i = A(D \ D_i)
    for z^(j) in D_i do
        e_j = L(f_i, z^(j))
    end for
end for
Return e
a good estimator is a function whose output is close to the true underlying θ that generated the training data.

For now, we take the frequentist perspective on statistics. That is, we assume that the true parameter value θ is fixed but unknown, while the point estimate θ̂ is a function of the data. Since the data is drawn from a random process, any function of the data is random. Therefore θ̂ is a random variable.

Point estimation can also refer to the estimation of the relationship between input and target variables. We refer to these types of point estimates as function estimators.

As we mentioned above, sometimes we are interested in performing function estimation (or function approximation). Here we are trying to predict a variable y given an input vector x. We assume that there is a function f(x) that describes the approximate relationship between y and x. For example, we may assume that y = f(x) + ε, where ε stands for the part of y that is not predictable from x.
In function estimation, we are interested in approximating f with a model or estimate f̂. Function estimation is really just the same as estimating a parameter θ; the function estimator f̂ is simply a point estimator in function space. The linear regression example (discussed above in Sec. 5.1.4) and the polynomial regression example (discussed in Sec. 5.2) are both examples of scenarios that may be interpreted either as estimating a parameter w or estimating a function f̂ mapping from x to y.

We now review the most commonly studied properties of point estimators and discuss what they tell us about these estimators.
5.4.2
Bias
The bias of an estimator is defined as:

    bias(θ̂_m) = E(θ̂_m) − θ,    (5.20)

where the expectation is over the data (seen as samples from a random variable) and θ is the true underlying value of θ used to define the data generating distribution. An estimator θ̂_m is said to be unbiased if bias(θ̂_m) = 0, which implies that E(θ̂_m) = θ. An estimator θ̂_m is said to be asymptotically unbiased if lim_{m→∞} bias(θ̂_m) = 0, which implies that lim_{m→∞} E(θ̂_m) = θ.

Consider a set of samples {x^(1), …, x^(m)} that are independently and identically distributed according to a Bernoulli distribution with mean θ:

    P(x^(i); θ) = θ^{x^(i)} (1 − θ)^{(1 − x^(i))}.    (5.21)

A common estimator for the θ parameter of this distribution is the mean of the training samples:

    θ̂_m = (1/m) Σ_{i=1}^m x^(i).    (5.22)

To determine whether this estimator is biased, we can substitute Eq. 5.22 into Eq. 5.20:

    bias(θ̂_m) = E[θ̂_m] − θ    (5.23)
              = E[ (1/m) Σ_{i=1}^m x^(i) ] − θ    (5.24)
              = (1/m) Σ_{i=1}^m E[x^(i)] − θ    (5.25)
              = (1/m) Σ_{i=1}^m Σ_{x^(i)=0}^{1} ( x^(i) θ^{x^(i)} (1 − θ)^{(1 − x^(i))} ) − θ    (5.26)
              = (1/m) Σ_{i=1}^m (θ) − θ    (5.27)
              = θ − θ = 0    (5.28)

Since bias(θ̂) = 0, we say that our estimator θ̂ is unbiased.

Now, consider a set of samples {x^(1), …, x^(m)} that are independently and identically distributed according to a Gaussian distribution p(x^(i)) = N(x^(i); µ, σ²), where i ∈ {1, …, m}. Recall that the Gaussian probability density function is given by

    p(x^(i); µ, σ²) = (1/√(2πσ²)) exp( −(x^(i) − µ)² / (2σ²) ).    (5.29)

A common estimator of the Gaussian mean parameter is known as the sample mean:

    µ̂_m = (1/m) Σ_{i=1}^m x^(i)    (5.30)
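The conclusion of Eq. 5.28 can be checked empirically: averaging the Bernoulli sample-mean estimator over many simulated datasets recovers θ. A small sketch, with all values illustrative:

```python
import numpy as np

# Draw many datasets of m Bernoulli(theta) samples, compute the sample mean
# of each, and check that the estimates average out to the true theta.
rng = np.random.default_rng(0)
theta, m, trials = 0.3, 10, 100_000

estimates = rng.binomial(1, theta, size=(trials, m)).mean(axis=1)
# The empirical bias (mean estimate minus theta) is near zero, as Eq. 5.28
# predicts for an unbiased estimator.
assert abs(estimates.mean() - theta) < 0.01
```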
To determine the bias of the sample mean, we are again interested in calculating its expectation:

    bias(µ̂_m) = E[µ̂_m] − µ    (5.31)
              = E[ (1/m) Σ_{i=1}^m x^(i) ] − µ    (5.32)
              = (1/m) Σ_{i=1}^m E[x^(i)] − µ    (5.33)
              = (1/m) Σ_{i=1}^m µ − µ    (5.34)
              = µ − µ = 0    (5.35)

Thus we find that the sample mean is an unbiased estimator of the Gaussian mean parameter.

As an example, we compare two different estimators of the variance parameter σ² of a Gaussian distribution. We are interested in knowing if either estimator is biased.

The first estimator of σ² we consider is known as the sample variance:

    σ̂²_m = (1/m) Σ_{i=1}^m (x^(i) − µ̂_m)²,    (5.36)

where µ̂_m is the sample mean, defined above. More formally, we are interested in computing

    bias(σ̂²_m) = E[σ̂²_m] − σ².    (5.37)

We begin by evaluating the term E[σ̂²_m]:

    E[σ̂²_m] = E[ (1/m) Σ_{i=1}^m (x^(i) − µ̂_m)² ]    (5.38)
             = ((m − 1)/m) σ².    (5.39)

Returning to Eq. 5.37, we conclude that the bias of σ̂²_m is −σ²/m. Therefore, the sample variance is a biased estimator.
The unbiased sample variance estimator

    σ̃²_m = (1/(m − 1)) Σ_{i=1}^m (x^(i) − µ̂_m)²    (5.40)

provides an alternative approach. As the name suggests, this estimator is unbiased. That is, we find that E[σ̃²_m] = σ²:

    E[σ̃²_m] = E[ (1/(m − 1)) Σ_{i=1}^m (x^(i) − µ̂_m)² ]    (5.41)
             = (m/(m − 1)) E[σ̂²_m]    (5.42)
             = (m/(m − 1)) ((m − 1)/m) σ²    (5.43)
             = σ².    (5.44)

We have two estimators: one is biased and the other is not. While unbiased estimators are clearly desirable, they are not always the "best" estimators. As we will see, we often use biased estimators that possess other important properties.
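The bias computed above can be verified numerically by averaging both estimators over many resampled Gaussian datasets. This sketch (sizes and seed are illustrative) compares the divide-by-m estimator of Eq. 5.36 with the divide-by-(m − 1) estimator of Eq. 5.40:

```python
import numpy as np

rng = np.random.default_rng(0)
m, trials, sigma2 = 5, 200_000, 4.0

# Each row is one dataset of m Gaussian samples with variance sigma2.
samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, m))
mu_hat = samples.mean(axis=1, keepdims=True)
sq_dev = np.sum((samples - mu_hat) ** 2, axis=1)

biased = np.mean(sq_dev / m)          # Eq. 5.36, expectation (m-1)/m * sigma2
unbiased = np.mean(sq_dev / (m - 1))  # Eq. 5.40, expectation sigma2

assert abs(biased - (m - 1) / m * sigma2) < 0.05
assert abs(unbiased - sigma2) < 0.05
```

The biased estimator averages close to (m − 1)/m · σ² = 3.2 here, while the unbiased one averages close to σ² = 4.0, matching Eqs. 5.39 and 5.44.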
5.4.3
Variance and Standard Error
Another property of the estimator that we might want to consider is how much we expect it to vary as a function of the data sample. Just as we computed the expectation of the estimator to determine its bias, we can compute its variance. The variance of an estimator is simply the variance

    Var(θ̂)    (5.45)

where the random variable is the training set. Alternately, the square root of the variance is called the standard error, denoted SE(θ̂).

The variance or the standard error of an estimator provides a measure of how we would expect the estimate we compute from data to vary as we independently resample the dataset from the underlying data generating process. Just as we might like an estimator to exhibit low bias we would also like it to have relatively low variance.

When we compute any statistic using a finite number of samples, our estimate
of the true underlying parameter is uncertain, in the sense that we could have obtained other samples from the same distribution and their statistics would have
CHAPTER 5. MACHINE LEARNING BASICS
been different. The expected degree of variation in any estimator is a source of error that we want to quantify.

The standard error of the mean is given by

$$\mathrm{SE}(\hat\mu_m) = \sqrt{\mathrm{Var}\left[\frac{1}{m}\sum_{i=1}^{m} x^{(i)}\right]} = \frac{\sigma}{\sqrt{m}}, \tag{5.46}$$

where $\sigma^2$ is the true variance of the samples $x^{(i)}$. The standard error is often estimated by using an estimate of $\sigma$. Unfortunately, neither the square root of the sample variance nor the square root of the unbiased estimator of the variance provides an unbiased estimate of the standard deviation. Both approaches tend to underestimate the true standard deviation, but are still used in practice. The square root of the unbiased estimator of the variance is less of an underestimate. For large $m$, the approximation is quite reasonable.

The standard error of the mean is very useful in machine learning experiments. We often estimate the generalization error by computing the sample mean of the error on the test set.
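The $\sigma/\sqrt{m}$ scaling of Eq. 5.46 is easy to verify empirically. The following sketch (sample sizes and seed are arbitrary choices) resamples many datasets and measures the spread of their sample means:

```python
# Sketch: the spread of the sample mean over many resampled datasets
# matches sigma / sqrt(m) (Eq. 5.46).
import numpy as np

rng = np.random.default_rng(1)
m, trials = 100, 100_000
sigma = 2.0  # true standard deviation of each sample

# Each row is one resampled dataset of m points; take its mean.
means = rng.normal(0.0, sigma, size=(trials, m)).mean(axis=1)

print(means.std())  # close to sigma / sqrt(m) = 0.2
```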
The number of examples in the test set determines the accuracy of this estimate. Taking advantage of the central limit theorem, which tells us that the mean will be approximately distributed with a normal distribution, we can use the standard error to compute the probability that the true expectation falls in any chosen interval. For example, the 95% confidence interval centered on the mean $\hat\mu_m$ is

$$\left(\hat\mu_m - 1.96\,\mathrm{SE}(\hat\mu_m),\; \hat\mu_m + 1.96\,\mathrm{SE}(\hat\mu_m)\right), \tag{5.47}$$

under the normal distribution with mean $\hat\mu_m$ and variance $\mathrm{SE}(\hat\mu_m)^2$. In machine learning experiments, it is common to say that algorithm $A$ is better than algorithm $B$ if the upper bound of the 95% confidence interval for the error of algorithm $A$ is less than the lower bound of the 95% confidence interval for the error of algorithm $B$.

We once again consider a set of samples $\{x^{(1)}, \ldots, x^{(m)}\}$ drawn independently and identically from a Bernoulli distribution (recall $P(x^{(i)}; \theta) = \theta^{x^{(i)}}(1-\theta)^{(1-x^{(i)})}$). This time we are interested in computing the variance of the estimator $\hat\theta_m = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}$.

$$\mathrm{Var}\left(\hat\theta_m\right) = \mathrm{Var}\left(\frac{1}{m}\sum_{i=1}^{m} x^{(i)}\right) \tag{5.48}$$

$$= \frac{1}{m^2}\sum_{i=1}^{m}\mathrm{Var}\left(x^{(i)}\right) \tag{5.49}$$

$$= \frac{1}{m^2}\sum_{i=1}^{m}\theta(1-\theta) \tag{5.50}$$

$$= \frac{1}{m^2}\, m\,\theta(1-\theta) \tag{5.51}$$

$$= \frac{1}{m}\,\theta(1-\theta) \tag{5.52}$$

The variance of the estimator decreases as a function of $m$, the number of examples in the dataset. This is a common property of popular estimators that we will return to when we discuss consistency (see Sec. 5.4.5).
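As a check on Eq. 5.52, a short simulation (the parameter values here are arbitrary illustrative choices) estimates the variance of $\hat\theta_m$ directly:

```python
# Sketch verifying Eq. 5.52: the variance of the Bernoulli mean estimator
# is theta * (1 - theta) / m.
import numpy as np

rng = np.random.default_rng(2)
theta, m, trials = 0.3, 50, 200_000

# Each row is one dataset of m Bernoulli draws; theta_hat is its mean.
theta_hat = rng.binomial(1, theta, size=(trials, m)).mean(axis=1)

print(theta_hat.var())  # close to theta * (1 - theta) / m = 0.0042
```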
5.4.4
Trading off Bias and Variance to Minimize Mean Squared Error

Bias and variance measure two different sources of error in an estimator. Bias
measures the expected deviation from the true value of the function or parameter. Variance, on the other hand, provides a measure of the deviation from the expected estimator value that any particular sampling of the data is likely to cause.

What happens when we are given a choice between two estimators, one with more bias and one with more variance? How do we choose between them? For example, imagine that we are interested in approximating the function shown in Fig. 5.2 and we are only offered the choice between a model with large bias and one that suffers from large variance. How do we choose between them?

The most common way to negotiate this trade-off is to use cross-validation. Empirically, cross-validation is highly successful on many real-world tasks. Alternatively, we can also compare the mean squared error (MSE) of the estimates:

$$\mathrm{MSE} = \mathbb{E}\left[(\hat\theta_m - \theta)^2\right] \tag{5.53}$$

$$= \mathrm{Bias}(\hat\theta_m)^2 + \mathrm{Var}(\hat\theta_m) \tag{5.54}$$

The MSE measures the overall expected deviation, in a squared error sense, between the estimator and the true value of the parameter $\theta$. As is clear from Eq. 5.54, evaluating the MSE incorporates both the bias and the variance. Desirable estimators are those with small MSE, and these are estimators that manage to keep both their bias and variance somewhat in check.

The relationship between bias and variance is tightly linked to the machine learning concepts of capacity, underfitting and overfitting. In the case where generalization error is measured by the MSE (where bias and variance are meaningful components of generalization error), increasing capacity tends to increase variance and decrease bias. This is illustrated in Fig. 5.6, where we see again the U-shaped curve of generalization error as a function of capacity.
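The decomposition in Eq. 5.54 can be checked numerically; in fact, it holds exactly for the empirical moments of any collection of estimates. A sketch (arbitrary seed and sizes) using the biased variance estimator as $\hat\theta$:

```python
# Numerical sketch of Eq. 5.54: MSE decomposes into squared bias plus variance.
import numpy as np

rng = np.random.default_rng(3)
m, trials = 5, 200_000
sigma2 = 1.0  # true parameter being estimated

# One biased variance estimate (divide by m) per simulated dataset.
est = rng.normal(0.0, 1.0, size=(trials, m)).var(axis=1, ddof=0)

mse = ((est - sigma2) ** 2).mean()
bias2 = (est.mean() - sigma2) ** 2
var = est.var()
print(mse, bias2 + var)  # the two quantities agree up to floating point
```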
5.4.5
Consistency
So far we have discussed the properties of various estimators for a training set of fixed size. Usually, we are also concerned with the behavior of an estimator as the amount of training data grows. In particular, we usually wish that, as the number of data points $m$ in our dataset increases, our point estimates converge to the true value of the corresponding parameters. More formally, we would like that

$$\hat\theta_m \xrightarrow{\,p\,} \theta \quad \text{as } m \to \infty. \tag{5.55}$$

The symbol $\xrightarrow{\,p\,}$ means that the convergence is in probability, i.e. for any $\epsilon > 0$, $P(|\hat\theta_m - \theta| > \epsilon) \to 0$ as $m \to \infty$. The condition described by Eq. 5.55 is known as consistency. It is sometimes referred to as weak consistency, with strong consistency referring to the almost sure convergence of $\hat\theta$ to $\theta$. Almost sure
convergence of a sequence of random variables $x^{(1)}, x^{(2)}, \ldots$ to a value $x$ occurs when $p(\lim_{m\to\infty} x^{(m)} = x) = 1$.

Consistency ensures that the bias induced by the estimator is assured to diminish as the number of data examples grows. However, the reverse is not true: asymptotic unbiasedness does not imply consistency. For example, consider estimating the mean parameter $\mu$ of a normal distribution $\mathcal{N}(x; \mu, \sigma^2)$, with a dataset consisting of $m$ samples: $\{x^{(1)}, \ldots, x^{(m)}\}$. We could use the first sample of the dataset $x^{(1)}$ as an unbiased estimator: $\hat\theta = x^{(1)}$. In that case, $\mathbb{E}(\hat\theta_m) = \theta$, so the estimator is unbiased no matter how many data points are seen. This, of course, implies that the estimate is asymptotically unbiased. However, this is not a consistent estimator, as it is not the case that $\hat\theta_m \to \theta$ as $m \to \infty$.

5.5 Maximum Likelihood Estimation

Previously, we have seen some definitions of common estimators and analyzed their properties.
But where did these estimators come from? Rather than guessing that some function might make a good estimator and then analyzing its bias and variance, we would like to have some principle from which we can derive specific functions that are good estimators for different models.

The most common such principle is the maximum likelihood principle.

Consider a set of $m$ examples $\mathbb{X} = \{x^{(1)}, \ldots, x^{(m)}\}$ drawn independently from the true but unknown data generating distribution $p_{\mathrm{data}}(x)$.

Let $p_{\mathrm{model}}(x; \theta)$ be a parametric family of probability distributions over the same space indexed by $\theta$. In other words, $p_{\mathrm{model}}(x; \theta)$ maps any configuration $x$ to a real number estimating the true probability $p_{\mathrm{data}}(x)$.

The maximum likelihood estimator for $\theta$ is then defined as

$$\theta_{\mathrm{ML}} = \arg\max_\theta p_{\mathrm{model}}(\mathbb{X}; \theta) \tag{5.56}$$

$$= \arg\max_\theta \prod_{i=1}^{m} p_{\mathrm{model}}\left(x^{(i)}; \theta\right). \tag{5.57}$$

This product over many probabilities can be inconvenient for a variety of reasons. For example, it is prone to numerical underflow. To obtain a more convenient but equivalent optimization problem, we observe that taking the logarithm of the likelihood does not change its arg max but does conveniently transform a product
into a sum:

$$\theta_{\mathrm{ML}} = \arg\max_\theta \sum_{i=1}^{m} \log p_{\mathrm{model}}\left(x^{(i)}; \theta\right). \tag{5.58}$$
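The underflow problem that motivates the switch from the product in Eq. 5.57 to the sum of logs in Eq. 5.58 is easy to demonstrate directly:

```python
# Sketch: the raw product of many probabilities underflows to 0.0 in
# float64, while the equivalent sum of logs stays finite.
import numpy as np

probs = np.full(1000, 0.1)  # 1000 likelihood factors of 0.1 each

print(np.prod(probs))        # 0.0: 1e-1000 is far below the float64 range
print(np.sum(np.log(probs))) # finite: 1000 * log(0.1), about -2302.6
```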
Because the arg max does not change when we rescale the cost function, we can divide by $m$ to obtain a version of the criterion that is expressed as an expectation with respect to the empirical distribution $\hat p_{\mathrm{data}}$ defined by the training data:

$$\theta_{\mathrm{ML}} = \arg\max_\theta \mathbb{E}_{x \sim \hat p_{\mathrm{data}}} \log p_{\mathrm{model}}(x; \theta). \tag{5.59}$$

One way to interpret maximum likelihood estimation is to view it as minimizing the dissimilarity between the empirical distribution $\hat p_{\mathrm{data}}$ defined by the training set and the model distribution, with the degree of dissimilarity between the two measured by the KL divergence. The KL divergence is given by

$$D_{\mathrm{KL}}(\hat p_{\mathrm{data}} \,\|\, p_{\mathrm{model}}) = \mathbb{E}_{x \sim \hat p_{\mathrm{data}}}\left[\log \hat p_{\mathrm{data}}(x) - \log p_{\mathrm{model}}(x)\right]. \tag{5.60}$$

The term on the left is a function only of the data generating process, not the model. This means when we train the model to minimize the KL divergence, we need only minimize

$$-\mathbb{E}_{x \sim \hat p_{\mathrm{data}}}\left[\log p_{\mathrm{model}}(x)\right], \tag{5.61}$$

which is of course the same as the maximization in Eq. 5.59.

Minimizing this KL divergence corresponds exactly to minimizing the cross-entropy between the distributions. Many authors use the term "cross-entropy" to identify specifically the negative log-likelihood of a Bernoulli or softmax distribution, but that is a misnomer. Any loss consisting of a negative log-likelihood is a cross-entropy between the empirical distribution defined by the training set and the model. For example, mean squared error is the cross-entropy between the empirical distribution and a Gaussian model.

We can thus see maximum likelihood as an attempt to make the model distribution match the empirical distribution $\hat p_{\mathrm{data}}$. Ideally, we would like to match the true data generating distribution $p_{\mathrm{data}}$, but we have no direct access to this distribution.

While the optimal $\theta$ is the same regardless of whether we are maximizing the
likelihood or minimizing the KL divergence, the values of the objective functions are different. In software, we often phrase both as minimizing a cost function. Maximum likelihood thus becomes minimization of the negative log-likelihood (NLL), or equivalently, minimization of the cross entropy. The perspective of maximum likelihood as minimum KL divergence becomes helpful in this case because the KL divergence has a known minimum value of zero. The negative log-likelihood can actually become negative when $x$ is real-valued.
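The last point can be seen with a one-line computation: a Gaussian density with small $\sigma$ exceeds 1 at its mode, so the NLL there is negative. A sketch (the value of $\sigma$ is an arbitrary illustrative choice):

```python
# Sketch: for real-valued x, a density can exceed 1, so the negative
# log-likelihood of a single point can itself be negative.
import numpy as np

sigma = 0.01
x = 0.0  # a point at the mode of N(0, sigma^2)

# log of the Gaussian density N(x; 0, sigma^2)
log_density = -np.log(sigma) - 0.5 * np.log(2 * np.pi) - x**2 / (2 * sigma**2)
nll = -log_density
print(nll)  # negative, since the density at the mode is about 39.9 > 1
```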
5.5.1
Conditional Log-Likelihood and Mean Squared Error
The maximum likelihood estimator can readily be generalized to the case where our goal is to estimate a conditional probability $P(y \mid x; \theta)$ in order to predict $y$ given $x$. This is actually the most common situation because it forms the basis for most supervised learning. If $X$ represents all our inputs and $Y$ all our observed targets, then the conditional maximum likelihood estimator is

$$\theta_{\mathrm{ML}} = \arg\max_\theta P(Y \mid X; \theta). \tag{5.62}$$

If the examples are assumed to be i.i.d., then this can be decomposed into

$$\theta_{\mathrm{ML}} = \arg\max_\theta \sum_{i=1}^{m} \log P\left(y^{(i)} \mid x^{(i)}; \theta\right). \tag{5.63}$$
Linear regression, introduced earlier in Sec. 5.1.4, may be justified as a maximum likelihood procedure. Previously, we motivated linear regression as an algorithm that learns to take an input $x$ and produce an output value $\hat y$. The mapping from $x$ to $\hat y$ is chosen to minimize mean squared error, a criterion that we introduced more or less arbitrarily. We now revisit linear regression from the point of view of maximum likelihood estimation. Instead of producing a single prediction $\hat y$, we now think of the model as producing a conditional distribution $p(y \mid x)$. We can imagine that with an infinitely large training set, we might see several training examples with the same input value $x$ but different values of $y$. The goal of the learning algorithm is now to fit the distribution $p(y \mid x)$ to all of those different $y$ values that are all compatible with $x$.

To derive the same linear regression algorithm we obtained before, we define $p(y \mid x) = \mathcal{N}(y; \hat y(x; w), \sigma^2)$. The function $\hat y(x; w)$ gives the prediction of the mean of the Gaussian. In this example, we assume that the variance is fixed to some constant $\sigma^2$ chosen by the user. We will see that this choice of the functional form of $p(y \mid x)$ causes the maximum likelihood estimation procedure to yield the same learning algorithm as we developed before. Since the examples are assumed to be i.i.d., the conditional log-likelihood (Eq. 5.63) is given by

$$\sum_{i=1}^{m} \log p\left(y^{(i)} \mid x^{(i)}; \theta\right) \tag{5.64}$$

$$= -m \log \sigma - \frac{m}{2}\log(2\pi) - \sum_{i=1}^{m} \frac{\left\|\hat y^{(i)} - y^{(i)}\right\|^2}{2\sigma^2}, \tag{5.65}$$
where $\hat y^{(i)}$ is the output of the linear regression on the $i$-th input $x^{(i)}$ and $m$ is the number of the training examples. Comparing the log-likelihood with the mean squared error,

$$\mathrm{MSE}_{\mathrm{train}} = \frac{1}{m}\sum_{i=1}^{m}\left\|\hat y^{(i)} - y^{(i)}\right\|^2, \tag{5.66}$$

we immediately see that maximizing the log-likelihood with respect to $w$ yields the same estimate of the parameters $w$ as does minimizing the mean squared error. The two criteria have different values but the same location of the optimum. This justifies the use of the MSE as a maximum likelihood estimation procedure. As we will see, the maximum likelihood estimator has several desirable properties.
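This equivalence can be checked numerically: with $\sigma$ fixed, Eq. 5.65 is a decreasing affine function of the training MSE, so both criteria select the same $w$. A sketch on synthetic data (the one-parameter model, grid, and seed are arbitrary illustrative choices):

```python
# Sketch checking Eqs. 5.64-5.66: over a grid of candidate slopes w,
# the w minimizing MSE also maximizes the Gaussian log-likelihood.
import numpy as np

rng = np.random.default_rng(4)
m, sigma = 200, 1.0
x = rng.normal(size=m)
y = 3.0 * x + rng.normal(scale=sigma, size=m)  # true slope is 3.0

def mse(w):
    return np.mean((w * x - y) ** 2)

def log_likelihood(w):  # Eq. 5.65 with fixed sigma
    return (-m * np.log(sigma) - 0.5 * m * np.log(2 * np.pi)
            - np.sum((w * x - y) ** 2) / (2 * sigma ** 2))

ws = np.linspace(0.0, 6.0, 601)
w_mse = ws[np.argmin([mse(w) for w in ws])]
w_mle = ws[np.argmax([log_likelihood(w) for w in ws])]
print(w_mse, w_mle)  # same location of the optimum
```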
5.5.2
Properties of Maximum Likelihood
The main appeal of the maximum likelihood estimator is that it can be shown to be the best estimator asymptotically, as the number of examples $m \to \infty$, in terms of its rate of convergence as $m$ increases.

Under appropriate conditions, the maximum likelihood estimator has the property of consistency (see Sec. 5.4.5 above), meaning that as the number of training examples approaches infinity, the maximum likelihood estimate of a parameter converges to the true value of the parameter. These conditions are:

• The true distribution $p_{\mathrm{data}}$ must lie within the model family $p_{\mathrm{model}}(\cdot; \theta)$. Otherwise, no estimator can recover $p_{\mathrm{data}}$.

• The true distribution $p_{\mathrm{data}}$ must correspond to exactly one value of $\theta$. Otherwise, maximum likelihood can recover the correct $p_{\mathrm{data}}$, but will not be able to determine which value of $\theta$ was used by the data generating process.
There are other inductive principles besides the maximum likelihood estimator, many of which share the property of being consistent estimators. However, consistent estimators can differ in their statistic efficiency, meaning that one consistent estimator may obtain lower generalization error for a fixed number of samples $m$, or equivalently, may require fewer examples to obtain a fixed level of generalization error.

Statistical efficiency is typically studied in the parametric case (like in linear regression) where our goal is to estimate the value of a parameter (and assuming it is possible to identify the true parameter), not the value of a function. A way to measure how close we are to the true parameter is by the expected mean squared error, computing the squared difference between the estimated and true parameter
values, where the expectation is over $m$ training samples from the data generating distribution. That parametric mean squared error decreases as $m$ increases, and for $m$ large, the Cramér-Rao lower bound (Rao, 1945; Cramér, 1946) shows that no consistent estimator has a lower mean squared error than the maximum likelihood estimator.

For these reasons (consistency and efficiency), maximum likelihood is often considered the preferred estimator to use for machine learning. When the number of examples is small enough to yield overfitting behavior, regularization strategies such as weight decay may be used to obtain a biased version of maximum likelihood that has less variance when training data is limited.
5.6 Bayesian Statistics
So far we have discussed frequentist statistics and approaches based on estimating a single value of θ, then making all predictions thereafter based on that one estimate. Another approach is to consider all possible values of θ when making a prediction. The latter is the domain of Bayesian statistics.

As discussed in Sec. 5.4.1, the frequentist perspective is that the true parameter value θ is fixed but unknown, while the point estimate θ̂ is a random variable on account of it being a function of the dataset (which is seen as random).

The Bayesian perspective on statistics is quite different. The Bayesian uses probability to reflect degrees of certainty of states of knowledge. The dataset is directly observed and so is not random. On the other hand, the true parameter θ is unknown or uncertain and thus is represented as a random variable.
Before observing the data, we represent our knowledge of θ using the prior probability distribution, p(θ) (sometimes referred to as simply "the prior"). Generally, the machine learning practitioner selects a prior distribution that is quite broad (i.e. with high entropy) to reflect a high degree of uncertainty in the value of θ before observing any data. For example, one might assume a priori that θ lies in some finite range or volume, with a uniform distribution. Many priors instead reflect a preference for "simpler" solutions (such as smaller magnitude coefficients, or a function that is closer to being constant).

Now consider that we have a set of data samples {x^{(1)}, …, x^{(m)}}. We can recover the effect of data on our belief about θ by combining the data likelihood p(x^{(1)}, …, x^{(m)} | θ) with the prior via Bayes' rule:

p(θ | x^{(1)}, …, x^{(m)}) = p(x^{(1)}, …, x^{(m)} | θ) p(θ) / p(x^{(1)}, …, x^{(m)})    (5.67)
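The update in Eq. 5.67 can be made concrete with a small numerical sketch. The setup below is purely illustrative (the Bernoulli likelihood, the coin-flip data, and the discretized grid over θ are assumptions, not part of the text): a broad uniform prior is combined with the likelihood of the observed samples to produce a normalized posterior.

```python
import numpy as np

# Hypothetical example: infer a Bernoulli parameter theta from coin-flip data.
theta = np.linspace(0.001, 0.999, 999)      # discretized grid over theta
prior = np.ones_like(theta)                 # broad (uniform) prior, high entropy
prior /= prior.sum()

data = np.array([1, 1, 0, 1, 1, 1, 0, 1])   # observed samples x^(1..m)
k, m = data.sum(), len(data)

# Likelihood p(x^(1..m) | theta) for each grid value of theta.
likelihood = theta**k * (1 - theta)**(m - k)

# Bayes' rule (Eq. 5.67): posterior is proportional to likelihood * prior.
posterior = likelihood * prior
posterior /= posterior.sum()                # normalize by the evidence

print(theta[np.argmax(posterior)])          # posterior concentrates near k/m
```

Because the evidence p(x^{(1)}, …, x^{(m)}) does not depend on θ, it suffices to normalize the product of likelihood and prior over the grid.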
In the scenarios where Bayesian estimation is typically used, the prior begins as a relatively uniform or Gaussian distribution with high entropy, and the observation of the data usually causes the posterior to lose entropy and concentrate around a few highly likely values of the parameters.

Relative to maximum likelihood estimation, Bayesian estimation offers two important differences. First, unlike the maximum likelihood approach that makes predictions using a point estimate of θ, the Bayesian approach is to make predictions using a full distribution over θ. For example, after observing m examples, the predicted distribution over the next data sample, x^{(m+1)}, is given by

p(x^{(m+1)} | x^{(1)}, …, x^{(m)}) = ∫ p(x^{(m+1)} | θ) p(θ | x^{(1)}, …, x^{(m)}) dθ.    (5.68)

Here each value of θ with positive probability density contributes to the prediction of the next example, with the contribution weighted by the posterior density itself. After having observed {x^{(1)}, …, x^{(m)}}, if we are still quite uncertain about the value of θ, then this uncertainty is incorporated directly into any predictions we might make.

In Sec. 5.4, we discussed how the frequentist approach addresses the uncertainty in a given point estimate of θ by evaluating its variance. The variance of the estimator is an assessment of how the estimate might change with alternative samplings of the observed data. The Bayesian answer to the question of how to deal with the uncertainty in the estimator is to simply integrate over it, which tends to protect well against overfitting.
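On a discretized parameter space, the integral in Eq. 5.68 becomes a posterior-weighted sum. The following sketch uses a hypothetical Bernoulli-with-grid setup (none of the specifics come from the text) to approximate the predictive probability that the next sample equals 1:

```python
import numpy as np

# Hypothetical Bernoulli setup on a discretized grid over theta.
theta = np.linspace(0.001, 0.999, 999)
data = np.array([1, 1, 0, 1, 1, 1, 0, 1])
k, m = data.sum(), len(data)
posterior = theta**k * (1 - theta)**(m - k)   # uniform prior absorbed
posterior /= posterior.sum()

# Eq. 5.68: p(x^(m+1)=1 | data) = integral of p(x=1 | theta) p(theta | data),
# approximated here by a sum over the grid. Every theta with positive
# posterior density contributes, weighted by that density.
p_next_is_one = np.sum(theta * posterior)
print(p_next_is_one)   # close to (k+1)/(m+2) = 0.7 for this uniform prior
```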
This integral is of course just an application of the laws of probability, making the Bayesian approach simple to justify, while the frequentist machinery for constructing an estimator is based on the rather ad hoc decision to summarize all knowledge contained in the dataset with a single point estimate.

The second important difference between the Bayesian approach to estimation and the maximum likelihood approach is due to the contribution of the Bayesian prior distribution. The prior has an influence by shifting probability mass density towards regions of the parameter space that are preferred a priori. In practice, the prior often expresses a preference for models that are simpler or more smooth.
Critics of the Bayesian approach identify the prior as a source of subjective human judgment impacting the predictions.

Bayesian methods typically generalize much better when limited training data is available, but typically suffer from high computational cost when the number of training examples is large.
Here we consider the Bayesian estimation approach to learning the linear regression parameters. In linear regression, we learn a linear mapping from an input vector x ∈ R^n to predict the value of a scalar y ∈ R. The prediction is parametrized by the vector w ∈ R^n:

ŷ = w^⊤ x.    (5.69)

Given a set of m training samples (X^{(train)}, y^{(train)}), we can express the prediction of y over the entire training set as:

ŷ^{(train)} = X^{(train)} w.    (5.70)

Expressed as a Gaussian conditional distribution on y^{(train)}, we have
p(y^{(train)} | X^{(train)}, w) = N(y^{(train)}; X^{(train)} w, I)    (5.71)
∝ exp(−(1/2)(y^{(train)} − X^{(train)} w)^⊤ (y^{(train)} − X^{(train)} w)),    (5.72)

where we follow the standard MSE formulation in assuming that the Gaussian variance on y is one. In what follows, to reduce the notational burden, we refer to (X^{(train)}, y^{(train)}) as simply (X, y).

To determine the posterior distribution over the model parameter vector w, we first need to specify a prior distribution. The prior should reflect our naive belief about the value of these parameters. While it is sometimes difficult or unnatural to express our prior beliefs in terms of the parameters of the model, in practice we typically assume a fairly broad distribution expressing a high degree of uncertainty about θ.
For real-valued parameters it is common to use a Gaussian as a prior distribution:

p(w) = N(w; μ_0, Λ_0) ∝ exp(−(1/2)(w − μ_0)^⊤ Λ_0^{−1} (w − μ_0)),    (5.73)

where μ_0 and Λ_0 are the prior distribution mean vector and covariance matrix respectively.¹

With the prior thus specified, we can now proceed in determining the posterior distribution over the model parameters.

p(w | X, y) ∝ p(y | X, w) p(w)    (5.74)

¹ Unless there is a reason to assume a particular covariance structure, we typically assume a diagonal covariance matrix.
∝ exp(−(1/2)(y − Xw)^⊤ (y − Xw)) exp(−(1/2)(w − μ_0)^⊤ Λ_0^{−1} (w − μ_0))    (5.75)
∝ exp(−(1/2)(−2 y^⊤ X w + w^⊤ X^⊤ X w + w^⊤ Λ_0^{−1} w − 2 μ_0^⊤ Λ_0^{−1} w)).    (5.76)

We now define Λ_m = (X^⊤ X + Λ_0^{−1})^{−1} and μ_m = Λ_m (X^⊤ y + Λ_0^{−1} μ_0). Using these new variables, we find that the posterior may be rewritten as a Gaussian distribution:

p(w | X, y) ∝ exp(−(1/2)(w − μ_m)^⊤ Λ_m^{−1} (w − μ_m) + (1/2) μ_m^⊤ Λ_m^{−1} μ_m)    (5.77)
∝ exp(−(1/2)(w − μ_m)^⊤ Λ_m^{−1} (w − μ_m)).    (5.78)

All terms that do not include the parameter vector w have been omitted; they are implied by the fact that the distribution must be normalized to integrate to 1. Eq. 3.23 shows how to normalize a multivariate Gaussian distribution.

Examining this posterior distribution allows us to gain some intuition for the effect of Bayesian inference. In most situations, we set μ_0 to 0. If we set Λ_0 = (1/α) I, then μ_m gives the same estimate of w as does frequentist linear regression with
a weight decay penalty of α w^⊤ w. One difference is that the Bayesian estimate is undefined if α is set to zero: we are not allowed to begin the Bayesian learning process with an infinitely wide prior on w. The more important difference is that the Bayesian estimate provides a covariance matrix, showing how likely all the different values of w are, rather than providing only the estimate μ_m.
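This correspondence between μ_m and weight decay can be checked numerically. The sketch below uses synthetic data (the dataset, the value of α, and the random seed are illustrative assumptions) to compute the posterior parameters Λ_m and μ_m defined after Eq. 5.76 and to compare μ_m against the frequentist ridge solution (X^⊤ X + α I)^{−1} X^⊤ y:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data for a small linear regression problem.
m, n = 50, 3
X = rng.normal(size=(m, n))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(size=m)        # unit Gaussian noise, as in Eq. 5.71

# Prior (Eq. 5.73) with mu_0 = 0 and Lambda_0 = (1/alpha) I.
alpha = 0.1
mu0 = np.zeros(n)
Lambda0_inv = alpha * np.eye(n)

# Posterior parameters:
#   Lambda_m = (X^T X + Lambda_0^{-1})^{-1}
#   mu_m     = Lambda_m (X^T y + Lambda_0^{-1} mu_0)
Lambda_m = np.linalg.inv(X.T @ X + Lambda0_inv)
mu_m = Lambda_m @ (X.T @ y + Lambda0_inv @ mu0)

# Frequentist linear regression with a weight decay penalty gives the same
# point estimate...
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(n), X.T @ y)
assert np.allclose(mu_m, w_ridge)

# ...but unlike ridge, the Bayesian treatment also returns a covariance Lambda_m.
print(mu_m, np.diag(Lambda_m))
```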
5.6.1 Maximum A Posteriori (MAP) Estimation
While the most principled approach is to make predictions using the full Bayesian posterior distribution over the parameter θ, it is still often desirable to have a single point estimate. One common reason for desiring a point estimate is that most operations involving the Bayesian posterior for most interesting models are intractable, and a point estimate offers a tractable approximation. Rather than simply returning to the maximum likelihood estimate, we can still gain some of the benefit of the Bayesian approach by allowing the prior to influence the choice of the point estimate. One rational way to do this is to choose the maximum a posteriori (MAP) point estimate. The MAP estimate chooses the point of maximal
posterior probability (or maximal probability density in the more common case of continuous θ):

θ_MAP = arg max_θ p(θ | x) = arg max_θ [log p(x | θ) + log p(θ)].    (5.79)

We recognize, above on the right hand side, log p(x | θ), i.e. the standard log-likelihood term, and log p(θ), corresponding to the prior distribution.

As an example, consider a linear regression model with a Gaussian prior on the weights w. If this prior is given by N(w; 0, (1/λ) I), then the log-prior term in Eq. 5.79 is proportional to the familiar λ w^⊤ w weight decay penalty, plus a term that does not depend on w and does not affect the learning process. MAP Bayesian inference with a Gaussian prior on the weights thus corresponds to weight decay.

As with full Bayesian inference, MAP Bayesian inference has the advantage of leveraging information that is brought by the prior and cannot be found in the training data. This additional information helps to reduce the variance in the
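Eq. 5.79 can be illustrated on the same kind of hypothetical discretized Bernoulli problem used earlier (the data and the Beta-shaped prior are assumptions for the sketch, not from the text): adding the log-prior shifts the maximizing point away from the maximum likelihood estimate and toward values the prior favors.

```python
import numpy as np

theta = np.linspace(0.001, 0.999, 999)      # discretized grid over theta
data = np.array([1, 1, 0, 1, 1, 1, 0, 1])
k, m = data.sum(), len(data)

# Log-likelihood log p(x | theta), and a Beta(3, 3)-shaped log-prior
# log p(theta) (a hypothetical choice favoring theta near 0.5).
log_lik = k * np.log(theta) + (m - k) * np.log(1 - theta)
log_prior = 2 * np.log(theta) + 2 * np.log(1 - theta)

theta_ml = theta[np.argmax(log_lik)]                # maximum likelihood: k/m
theta_map = theta[np.argmax(log_lik + log_prior)]   # Eq. 5.79

print(theta_ml, theta_map)   # MAP lies between the ML estimate and 0.5
```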
MAP point estimate (in comparison to the ML estimate). However, it does so at the price of increased bias.

Many regularized estimation strategies, such as maximum likelihood learning regularized with weight decay, can be interpreted as making the MAP approximation to Bayesian inference. This view applies when the regularization consists of adding an extra term to the objective function that corresponds to log p(θ). Not all regularization penalties correspond to MAP Bayesian inference. For example, some regularizer terms may not be the logarithm of a probability distribution. Other regularization terms depend on the data, which of course a prior probability distribution is not allowed to do.

MAP Bayesian inference provides a straightforward way to design complicated yet interpretable regularization terms. For example, a more complicated penalty
term can be derived by using a mixture of Gaussians, rather than a single Gaussian distribution, as the prior (Nowlan and Hinton, 1992).
5.7 Supervised Learning Algorithms
Recall from Sec. 5.1.3 that supervised learning algorithms are, roughly speaking, learning algorithms that learn to associate some input with some output, given a training set of examples of inputs x and outputs y. In many cases the outputs y may be difficult to collect automatically and must be provided by a human "supervisor," but the term still applies even when the training set targets were collected automatically.
5.7.1 Probabilistic Supervised Learning
Most supervised learning algorithms in this book are based on estimating a probability distribution p(y | x). We can do this simply by using maximum likelihood estimation to find the best parameter vector θ for a parametric family of distributions p(y | x; θ).

We have already seen that linear regression corresponds to the family

p(y | x; θ) = N(y; θ^⊤ x, I).    (5.80)

We can generalize linear regression to the classification scenario by defining a different family of probability distributions. If we have two classes, class 0 and class 1, then we need only specify the probability of one of these classes. The probability of class 1 determines the probability of class 0, because these two values must add up to 1.

The normal distribution over real-valued numbers that we used for linear regression is parametrized in terms of a mean. Any value we supply for this mean is valid. A distribution over a binary variable is slightly more complicated, because
its mean must always be between 0 and 1. One way to solve this problem is to use the logistic sigmoid function to squash the output of the linear function into the interval (0, 1) and interpret that value as a probability:

p(y = 1 | x; θ) = σ(θ^⊤ x).    (5.81)

This approach is known as logistic regression (a somewhat strange name since we use the model for classification rather than regression).

In the case of linear regression, we were able to find the optimal weights by solving the normal equations. Logistic regression is somewhat more difficult. There is no closed-form solution for its optimal weights. Instead, we must search for them by maximizing the log-likelihood. We can do this by minimizing the negative log-likelihood (NLL) using gradient descent.

This same strategy can be applied to essentially any supervised learning problem, by writing down a parametric family of conditional probability distributions over the right kind of input and output variables.
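A minimal sketch of this procedure, on synthetic linearly separable data (the dataset, learning rate, and iteration count are illustrative assumptions): the weights of the model in Eq. 5.81 are found by batch gradient descent on the mean negative log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical binary classification data, separable by a line through the origin.
m, n = 200, 2
X = rng.normal(size=(m, n))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Minimize the negative log-likelihood of p(y=1 | x; theta) = sigmoid(theta^T x)
# by batch gradient descent; no closed-form solution exists.
theta = np.zeros(n)
lr = 0.1
for _ in range(1000):
    p = sigmoid(X @ theta)
    grad = X.T @ (p - y) / m       # gradient of the mean NLL
    theta -= lr * grad

preds = (sigmoid(X @ theta) > 0.5).astype(float)
print((preds == y).mean())         # high training accuracy on this separable data
```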
5.7.2 Support Vector Machines
One of the most influential approaches to supervised learning is the support vector machine (Boser et al., 1992; Cortes and Vapnik, 1995). This model is similar to logistic regression in that it is driven by a linear function w^⊤ x + b. Unlike logistic
regression, the support vector machine does not provide probabilities, but only outputs a class identity. The SVM predicts that the positive class is present when w^⊤ x + b is positive. Likewise, it predicts that the negative class is present when w^⊤ x + b is negative.

One key innovation associated with support vector machines is the kernel trick. The kernel trick consists of observing that many machine learning algorithms can be written exclusively in terms of dot products between examples. For example, it can be shown that the linear function used by the support vector machine can be re-written as

w^⊤ x + b = b + Σ_{i=1}^{m} α_i x^⊤ x^{(i)}    (5.82)
where x^{(i)} is a training example and α is a vector of coefficients. Rewriting the learning algorithm this way allows us to replace x by the output of a given feature function φ(x) and the dot product with a function k(x, x^{(i)}) = φ(x) · φ(x^{(i)}) called a kernel. The · operator represents an inner product analogous to φ(x)^⊤ φ(x^{(i)}). For some feature spaces, we may not use literally the vector inner product. In some infinite dimensional spaces, we need to use other kinds of inner products, for example, inner products based on integration rather than summation. A complete development of these kinds of inner products is beyond the scope of this book.

After replacing dot products with kernel evaluations, we can make predictions using the function

f(x) = b + Σ_i α_i k(x, x^{(i)}).    (5.83)
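Eq. 5.83 can be sketched directly in code. Everything below is a toy setup (the training points, the coefficients α, the bias b, and the kernel bandwidth are assumptions): predictions are formed as a kernel-weighted sum, and the final check confirms that f is linear in α even though it is nonlinear in x.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training set and coefficients for a kernelized predictor.
m, n = 5, 2
X_train = rng.normal(size=(m, n))
alpha = rng.normal(size=m)        # coefficients alpha (arbitrary for this demo)
b = 0.3

def gaussian_kernel(u, v, sigma=1.0):
    # Gaussian (RBF) kernel, proportional to N(u - v; 0, sigma^2 I);
    # the normalization constant is dropped here.
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma**2))

def f(x):
    # Eq. 5.83: f(x) = b + sum_i alpha_i k(x, x^(i)).
    return b + sum(a * gaussian_kernel(x, xi) for a, xi in zip(alpha, X_train))

# f is nonlinear in x, but linear in the coefficient vector alpha:
x = rng.normal(size=n)
k_vec = np.array([gaussian_kernel(x, xi) for xi in X_train])
assert np.isclose(f(x), b + alpha @ k_vec)
```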
The This function is nonlinear with resp ect to , but thecessing relationship betw x φ( x) kernel-based function is exactly equiv equivalent alent to prepro preprocessing the data by een applying (xall and ) isinputs, linear.then Also, the relationship between α and f (x) is linear. The φ (x) fto learning a linear Xmodel in the new transformed space. kernel-based function is exactly equivalent to preprocessing the data by applying The kernel tric trick k is pow owerful erful for tw two o reasons. First, it allows us to learn mo models dels φ(x) to all inputs, then learning a linear model in the new transformed space. that are nonlinear as a function of x using conv convex ex optimization techniques that are Theteed kernel trickerge is pefficiently owerful for two reasons. it allows us to learn moand dels guaran guaranteed to conv converge efficiently. . This is possibleFirst, because we consider φ fixed that are nonlinear as athe function of x using convex optimization that are optimize only α, i.e., optimization algorithm can view thetechniques decision function guaran teed to conv erge efficiently . This is p ossible b ecause w e consider fixed and φ as being linear in a different space. Second, the kernel function k often admits α, i.e., optimize onlytation theisoptimization can view the decision an implemen that significantly algorithm more computational efficien naiv implementation efficient t thanfunction naively ely k as b eing linear in a different space. Second, the k ernel function often admits constructing two φ(x) vectors and explicitly taking their dot pro product. duct. an implementation that is significantly more computational efficient than naively In some cases, can even e infinite taking dimensional, whic which h duct. would result in constructing two φ(φx(x ) )vectors andbexplicitly their dot pro an infinite computational cost for the naiv naive, e, explicit approach. 
In many cases, k(x, x′) is a nonlinear, tractable function of x even when φ(x) is intractable.
CHAPTER 5. MACHINE LEARNING BASICS
As an example of an infinite-dimensional feature space with a tractable kernel, we construct a feature mapping φ(x) over the non-negative integers x. Suppose that this mapping returns a vector containing x ones followed by infinitely many zeros. We can write a kernel function k(x, x^(i)) = min(x, x^(i)) that is exactly equivalent to the corresponding infinite-dimensional dot product.

The most commonly used kernel is the Gaussian kernel

k(u, v) = N(u − v; 0, σ^2 I),
(5.84)
where N(x; µ, Σ) is the standard normal density. This kernel is also known as the radial basis function (RBF) kernel, because its value decreases along lines in v space radiating outward from u. The Gaussian kernel corresponds to a dot product in an infinite-dimensional space, but the derivation of this space is less straightforward than in our example of the min kernel over the integers.

We can think of the Gaussian kernel as performing a kind of template matching. A training example x associated with training label y becomes a template for class y. When a test point x′ is near x according to Euclidean distance, the Gaussian kernel has a large response, indicating that x′ is very similar to the x template. The model then puts a large weight on the associated training label y. Overall, the prediction will combine many such training labels weighted by the similarity of the corresponding training examples.
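A minimal sketch of the Gaussian kernel of Eq. 5.84 (the width σ and the test vectors are illustrative choices), showing the template-matching behavior of a response that decays with Euclidean distance:

```python
import numpy as np

def gaussian_kernel(u, v, sigma=1.0):
    """k(u, v) = N(u - v; 0, sigma^2 I): an isotropic Gaussian density
    evaluated at the difference vector u - v."""
    d = u - v
    n = d.shape[0]
    norm = (2.0 * np.pi * sigma**2) ** (-n / 2.0)
    return norm * np.exp(-np.dot(d, d) / (2.0 * sigma**2))

u = np.array([0.0, 0.0])
# The response decays with Euclidean distance from the template u,
# which is what makes the kernel behave like template matching.
near = gaussian_kernel(u, np.array([0.1, 0.0]))
far = gaussian_kernel(u, np.array([3.0, 0.0]))
print(near > far)  # True
```

Many presentations drop the normalizing constant and use exp(−‖u − v‖² / (2σ²)) directly; the two differ only by a fixed positive factor.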
Support vector machines are not the only algorithm that can be enhanced using the kernel trick. Many other linear models can be enhanced in this way. The category of algorithms that employ the kernel trick is known as kernel machines or kernel methods (Williams and Rasmussen, 1996; Schölkopf et al., 1999).

A major drawback to kernel machines is that the cost of evaluating the decision function is linear in the number of training examples, because the i-th example contributes a term α_i k(x, x^(i)) to the decision function. Support vector machines are able to mitigate this by learning an α vector that contains mostly zeros. Classifying a new example then requires evaluating the kernel function only for the training examples that have non-zero α_i. These training examples are known as support vectors.
Kernel machines also suffer from a high computational cost of training when the dataset is large. We will revisit this idea in Sec. 5.9. Kernel machines with generic kernels struggle to generalize well. We will explain why in Sec. 5.11. The modern incarnation of deep learning was designed to overcome these limitations of kernel machines. The current deep learning renaissance began when Hinton et al. (2006) demonstrated that a neural network could outperform the RBF kernel SVM on the MNIST benchmark.
5.7.3
Other Simple Supervised Learning Algorithms
We have already briefly encountered another non-probabilistic supervised learning algorithm, nearest neighbor regression. More generally, k-nearest neighbors is a family of techniques that can be used for classification or regression. As a non-parametric learning algorithm, k-nearest neighbors is not restricted to a fixed number of parameters. We usually think of the k-nearest neighbors algorithm as not having any parameters, but rather implementing a simple function of the training data. In fact, there is not even really a training stage or learning process. Instead, at test time, when we want to produce an output y for a new test input x, we find the k nearest neighbors to x in the training data X. We then return the average of the corresponding y values in the training set.
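The procedure just described can be sketched directly; the toy dataset and the choice k = 3 below are assumptions for illustration:

```python
import numpy as np

def knn_predict(x, X_train, y_train, k=3):
    """Return the average y value of the k training points nearest to x."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]   # indices of the k closest examples
    return y_train[nearest].mean()

X_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train = np.array([0.0, 1.0, 2.0, 10.0])

# The three nearest neighbors of x = 1.0 are 0.0, 1.0 and 2.0,
# so the prediction is the mean of their labels.
print(knn_predict(np.array([1.0]), X_train, y_train, k=3))  # 1.0
```

Note that all the work happens at prediction time; there is no fitting step, matching the "no training stage" observation above.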
This w orks for essentially the case of classification, we can av average erage ov over er one-hot code vectors c with cy = 1 any ckind offor supall ervised learning where we can define an average ver y ovvalues. In and = 0 other v alues of . W e can then in interpret terpret the avoerage er these i i c with c = 1 the caseco ofdes classification, we can average over one-hot code vectors one-hot codes as giving a probability distribution ov over er classes. As a non-parametric and c =algorithm, 0 for all other values of i. Wecan canac then terpret average ovexample, er these k-nearest learning neighbor achiev hiev hievee in very highthe capacit capacity y. For one-hot co des as giving a probability distribution ov er classes. As a non-parametric supp suppose ose we ha hav ve a multiclass classification task and measure performance with 0-1 learning algorithm, ac hieve to very high capacit or example, loss. In this setting,k-nearest 1-nearestneighbor neighborcan con conv verges double the Ba Bay yy.esFerror as the supp ose we ha v e a multiclass classification task and measure p erformance with num umb ber of training examples approac approaches hes infinit infinity y. The error in excess of the Ba Bay y0-1 es loss. In this setting, 1 -nearest neighbor con v erges to double the Ba y es error as the error results from cho hoosing osing a single neighbor by breaking ties betw etween een equally n um b er of training examples approac hes infinit y . The error in excess of Bayx es distan distantt neighbors randomly randomly.. When there is infinite training data, all testthe points ointsx errorha results from cman hoosing a single ties bIf etw een equally will hav ve infinitely many y training set neighbor neigh neighbors borsby at breaking distance zero. we allow the x distan t neighbors randomly . 
When there is infinite training data, all test points x will have infinitely many training set neighbors at distance zero. If we allow the algorithm to use all of these neighbors to vote, rather than randomly choosing one of them, the procedure converges to the Bayes error rate. The high capacity of k-nearest neighbors allows it to obtain high accuracy given a large training set. However, it does so at high computational cost, and it may generalize very badly given a small, finite training set. One weakness of k-nearest neighbors is that it cannot learn that one feature is more discriminative than another. For example, imagine we have a regression task with x ∈ R^100 drawn from an isotropic Gaussian distribution, but only a single variable x_1 is relevant to the output. Suppose further that this feature simply encodes the output directly, i.e. that y = x_1 in all cases.
Nearest neighbor regression will not be able to detect this simple pattern. The nearest neighbor of most points x will be determined by the large number of features x_2 through x_100, not by the lone feature x_1. Thus the output on small training sets will essentially be random.
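This failure mode is easy to reproduce numerically. The dimensions, sample sizes, and seed below are arbitrary choices for a rough sketch, not a careful experiment:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 200, 100                    # small training set, many dimensions
X_train = rng.standard_normal((n, d))
y_train = X_train[:, 0]            # only the first feature matters: y = x_1

def one_nn(x):
    """1-nearest-neighbor regression."""
    i = np.argmin(np.linalg.norm(X_train - x, axis=1))
    return y_train[i]

X_test = rng.standard_normal((50, d))
pred = np.array([one_nn(x) for x in X_test])
mse = np.mean((pred - X_test[:, 0]) ** 2)

# Distances are dominated by the 99 irrelevant features, so the error is
# far from zero -- close to what a randomly chosen training label gives.
print(mse)
```

A method that can weight or select features (e.g. a linear model) would drive this error toward zero on the same data.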
[Figure 5.7: diagram of a decision tree and the regions into which it divides the input space.]
Another type of learning algorithm that also breaks the input space into regions and has separate parameters for each region is the decision tree (Breiman et al., 1984) and its many variants. As shown in Fig. 5.7, each node of the decision tree is associated with a region in the input space, and internal nodes break that region into one sub-region for each child of the node (typically using an axis-aligned cut). Space is thus sub-divided into non-overlapping regions, with a one-to-one correspondence between leaf nodes and input regions. Each leaf node usually maps every point in its input region to the same output. Decision trees are usually trained with specialized algorithms that are beyond the scope of this book.
The learning algorithm can be considered non-parametric if it is allowed to learn a tree of arbitrary size, though decision trees are usually regularized with size constraints that turn them into parametric models in practice. Decision trees as they are typically used, with axis-aligned splits and constant outputs within each node, struggle to solve some problems that are easy even for logistic regression. For example, if we have a two-class problem and the positive class occurs wherever x_2 > x_1, the decision boundary is not axis-aligned. The decision tree will thus need to approximate the decision boundary with many nodes, implementing a step function that constantly walks back and forth across the true decision function with axis-aligned steps.

As we have seen, nearest neighbor predictors and decision trees have many limitations.
Nonetheless, they are useful learning algorithms when computational resources are constrained. We can also build intuition for more sophisticated learning algorithms by thinking about the similarities and differences between sophisticated algorithms and k-NN or decision tree baselines.

See Murphy (2012), Bishop (2006), Hastie et al. (2001) or other machine learning textbooks for more material on traditional supervised learning algorithms.
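The axis-aligned weakness of decision trees described above can be illustrated with a brute-force sketch. Restricting attention to single-split "stumps" is a deliberate simplification of a full tree, and the dataset is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(500, 2))
y = (X[:, 1] > X[:, 0]).astype(int)   # positive class wherever x2 > x1

def best_stump_accuracy(X, y):
    """Best accuracy of any single axis-aligned split (a depth-1 tree)."""
    best = 0.0
    for j in range(X.shape[1]):             # feature to split on
        for t in X[:, j]:                   # candidate thresholds
            for sign in (1, -1):            # which side predicts class 1
                pred = (sign * (X[:, j] - t) > 0).astype(int)
                best = max(best, float(np.mean(pred == y)))
    return best

acc = best_stump_accuracy(X, y)
# No single axis-aligned cut can match the oblique boundary x2 = x1;
# a full tree must stack many such cuts in a staircase pattern.
print(acc < 1.0)  # True
```

A logistic regression with features (x_1, x_2) separates this problem exactly with one oblique line, which is the contrast the text draws.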
5.8
Unsupervised Learning Algorithms
Recall from Sec. 5.1.3 that unsupervised algorithms are those that experience only "features" but not a supervision signal. The distinction between supervised and unsupervised algorithms is not formally and rigidly defined because there is no objective test for distinguishing whether a value is a feature or a target provided by a supervisor. Informally, unsupervised learning refers to most attempts to extract information from a distribution that do not require human labor to annotate examples. The term is usually associated with density estimation, learning to draw samples from a distribution, learning to denoise data from some distribution, finding a manifold that the data lies near, or clustering the data into groups of related examples.
A classic unsupervised learning task is to find the "best" representation of the data. By "best" we can mean different things, but generally speaking we are looking for a representation that preserves as much information about x as possible while obeying some penalty or constraint aimed at keeping the representation simpler or more accessible than x itself.

There are multiple ways of defining a simpler representation. Three of the most common include lower dimensional representations, sparse representations and independent representations. Low-dimensional representations attempt to compress as much information about x as possible in a smaller representation. Sparse representations (Barlow, 1989; Olshausen and Field, 1996; Hinton and Ghahramani, 1997) embed the dataset into a representation whose entries are mostly zeroes for most inputs.
The use of sparse representations typically requires increasing the dimensionality of the representation, so that the representation becoming mostly zeroes does not discard too much information. This results in an overall structure of the representation that tends to distribute data along the axes of the representation space. Independent representations attempt to disentangle the sources of variation underlying the data distribution such that the dimensions of the representation are statistically independent.

Of course these three criteria are certainly not mutually exclusive. Low-dimensional representations often yield elements that have fewer or weaker dependencies than the original high-dimensional data. This is because one way to reduce the size of a representation is to find and remove redundancies.
Identifying ac achiev hiev hievee more compression while discarding less information. and removing more redundancy allows the dimensionality reduction algorithm to The notioncompression of representation is one of the central themes of deep learning and achiev e more while discarding less information. therefore one of the central themes in this book. In this section, we dev develop elop some The notion of representation is one of the central themes of deep learning and simple examples of represen representation tation learning algorithms. Together, these example therefore one of the central themes in this b o ok. In this section, w e dev elop some algorithms show how to op operationalize erationalize all three of the criteria ab abov ov ove. e. Most of the simple examples of represen tation learning algorithms. T ogether,algorithms these example remaining chapters in intro tro troduce duce additional representation learning that algorithms show how to op erationalize all three of the criteria ab ov e. Most of the dev develop elop these criteria in different ways or in intro tro troduce duce other criteria. remaining chapters introduce additional representation learning algorithms that develop these criteria in different ways or introduce other criteria.
5.8.1
Principal Components Analysis
In Sec. 2.12, we saw that the principal components analysis algorithm provides a means of compressing data. We can also view PCA as an unsupervised learning algorithm that learns a representation of data. This representation is based on two of the criteria for a simple representation described above.
[Figure 5.8: illustration of the PCA projection z = x⊤W of the data x.]
PCA learns a representation that has lower dimensionality than the original input. It also learns a representation whose elements have no linear correlation with each other. This is a first step toward the criterion of learning representations whose elements are statistically independent. To achieve full independence, a representation learning algorithm must also remove the nonlinear relationships between variables.

PCA learns an orthogonal, linear transformation of the data that projects an input x to a representation z, as shown in Fig. 5.8. In Sec. 2.12, we saw that we could learn a one-dimensional representation that best reconstructs the original data (in the sense of mean squared error) and that this representation actually corresponds to the first principal component of the data.
Thus we can use PCA as a simple and effective dimensionality reduction method that preserves as much of the information in the data as possible (again, as measured by least-squares reconstruction error). In the following, we will study how the PCA representation decorrelates the original data representation X.

Let us consider the m × n design matrix X. We will assume that the data has a mean of zero, E[x] = 0. If this is not the case, the data can easily be centered by subtracting the mean from all examples in a preprocessing step.

The unbiased sample covariance matrix associated with X is given by:

Var[x] = (1 / (m − 1)) X⊤X.    (5.85)
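Eq. 5.85 is straightforward to check numerically; the random data below is an illustrative example, and `np.cov` serves as an independent reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3
X = rng.standard_normal((m, n))
X = X - X.mean(axis=0)          # center so that the sample mean is zero

cov = X.T @ X / (m - 1)         # Eq. 5.85: unbiased sample covariance

# np.cov uses the same (m - 1) normalization; rowvar=False treats
# rows as examples, matching the design-matrix convention.
print(np.allclose(cov, np.cov(X, rowvar=False)))  # True
```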
PCA finds a representation (through linear transformation) z = x⊤W where Var[z] is diagonal.

In Sec. 2.12, we saw that the principal components of a design matrix X are given by the eigenvectors of X⊤X. From this view,

X⊤X = W Λ W⊤.    (5.86)

In this section, we exploit an alternative derivation of the principal components. The principal components may also be obtained via the singular value decomposition. Specifically, they are the right singular vectors of X. To see this, let W be the right singular vectors in the decomposition X = U Σ W⊤. We then recover the original eigenvector equation with W as the eigenvector basis:

X⊤X = (U Σ W⊤)⊤ U Σ W⊤ = W Σ² W⊤.    (5.87)
The SVD is helpful to show that PCA results in a diagonal Var[z]. Using the SVD of X, we can express the variance of X as:

Var[x] = (1 / (m − 1)) X⊤X    (5.88)
       = (1 / (m − 1)) (U Σ W⊤)⊤ U Σ W⊤    (5.89)
       = (1 / (m − 1)) W Σ⊤ U⊤ U Σ W⊤    (5.90)
       = (1 / (m − 1)) W Σ² W⊤,    (5.91)

where we use the fact that U⊤U = I because the U matrix of the singular value decomposition is defined to be orthogonal. This shows that if we take z = x⊤W, we can ensure that the covariance of z is diagonal as required:

Var[z] = (1 / (m − 1)) Z⊤Z    (5.92)
       = (1 / (m − 1)) W⊤X⊤X W    (5.93)
       = (1 / (m − 1)) W⊤W Σ² W⊤W    (5.94)
       = (1 / (m − 1)) Σ²,    (5.95)

where this time we use the fact that W⊤W = I, again from the definition of the SVD.
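The derivation above can be verified numerically. This sketch, with arbitrary random data, confirms that the projected covariance Var[z] is diagonal and equals Σ² / (m − 1):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 500, 4
A = rng.standard_normal((n, n))
X = rng.standard_normal((m, n)) @ A      # correlated columns
X = X - X.mean(axis=0)

U, S, Wt = np.linalg.svd(X, full_matrices=False)  # X = U diag(S) W^T
W = Wt.T                                          # right singular vectors

Z = X @ W                        # project each example: z = x^T W
var_z = Z.T @ Z / (m - 1)        # should equal diag(S^2) / (m - 1)

off_diag = var_z - np.diag(np.diag(var_z))
print(np.max(np.abs(off_diag)) < 1e-8)  # True: Var[z] is diagonal
```

Note that `np.linalg.svd` returns W⊤ (here `Wt`), not W, so the right singular vectors are its rows.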
The above analysis shows that when we project the data x to z, via the linear transformation W, the resulting representation has a diagonal covariance matrix (as given by Σ^2), which immediately implies that the individual elements of z are mutually uncorrelated.

This ability of PCA to transform data into a representation where the elements are mutually uncorrelated is a very important property of PCA. It is a simple example of a representation that attempts to disentangle the unknown factors of variation underlying the data. In the case of PCA, this disentangling takes the form of finding a rotation of the input space (described by W) that aligns the principal axes of variance with the basis of the new representation space associated with z.

While correlation is an important category of dependency between elements of the data, we are also interested in learning representations that disentangle more complicated forms of feature dependencies. For this, we will need more than what can be done with a simple linear transformation.
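The claim that Var[z] is diagonal can be checked numerically. Below is a minimal sketch with NumPy; the toy data and variable names are our own illustration, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 500
# Correlated, centered design matrix X with one example per row.
X = rng.normal(size=(m, 2)) @ np.array([[2.0, 0.0], [1.5, 0.5]])
X = X - X.mean(axis=0)

# SVD: X = U diag(s) W^T, where the columns of W are the principal directions.
U, s, Wt = np.linalg.svd(X, full_matrices=False)
W = Wt.T

Z = X @ W                   # z = x^T W for each example
cov_z = Z.T @ Z / (m - 1)   # covariance of the new representation

# Off-diagonal entries vanish: Var[z] = Sigma^2 / (m - 1) is diagonal.
print(np.allclose(cov_z, np.diag(s**2) / (m - 1)))
```

Because Z = XW = UΣ, the computed covariance matches Σ^2/(m − 1) up to floating-point error.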
5.8.2 k-means Clustering

Another example of a simple representation learning algorithm is k-means clustering.
The k-means clustering algorithm divides the training set into k different clusters of examples that are near each other. We can thus think of the algorithm as providing a k-dimensional one-hot code vector h representing an input x. If x belongs to cluster i, then h_i = 1 and all other entries of the representation h are zero.

The one-hot code provided by k-means clustering is an example of a sparse representation, because the majority of its entries are zero for every input. Later, we will develop other algorithms that learn more flexible sparse representations, where more than one entry can be non-zero for each input x. One-hot codes are an extreme example of sparse representations that lose many of the benefits of a distributed representation. The one-hot code still confers some statistical advantages (it naturally conveys the idea that all examples in the same cluster are similar to each other) and it confers the computational advantage that the entire representation may be captured by a single integer.

The k-means algorithm works by initializing k different centroids {µ^(1), . . . , µ^(k)} to different values, then alternating between two different steps until convergence. In one step, each training example is assigned to cluster i, where i is the index of the nearest centroid µ^(i). In the other step, each centroid µ^(i) is updated to the mean of all training examples x^(j) assigned to cluster i.
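The two alternating steps can be sketched directly in code. This is a minimal NumPy illustration of the algorithm as described; the initialization scheme, the stopping test, and the toy data are our own simplifications:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Alternate the assignment step and the centroid-update step."""
    rng = np.random.default_rng(seed)
    # Initialize the k centroids to k distinct training examples.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each example joins its nearest centroid's cluster.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned examples.
        new_centroids = np.array([X[assign == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assign

# Two tight, well-separated blobs of points.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(5.0, 0.1, (50, 2))])
centroids, assign = kmeans(X, k=2)
h = np.eye(2)[assign]   # the k-dimensional one-hot codes, one row per example
```

Each row of h has exactly one nonzero entry, so the whole representation of an example can equivalently be stored as the single integer assign[i].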
One difficulty pertaining to clustering is that the clustering problem is inherently ill-posed, in the sense that there is no single criterion that measures how well a clustering of the data corresponds to the real world. We can measure properties of the clustering such as the average Euclidean distance from a cluster centroid to the members of the cluster. This allows us to tell how well we are able to reconstruct the training data from the cluster assignments. We do not know how well the cluster assignments correspond to properties of the real world. Moreover, there may be many different clusterings that all correspond well to some property of the real world. We may hope to find a clustering that relates to one feature but obtain a different, equally valid clustering that is not relevant to our task. For example, suppose that we run two clustering algorithms on a dataset consisting of images of red trucks, images of red cars, images of gray trucks, and images of gray cars. If we ask each clustering algorithm to find two clusters, one algorithm may find a cluster of cars and a cluster of trucks, while another may find a cluster of red vehicles and a cluster of gray vehicles. Suppose we also run a third clustering algorithm, which is allowed to determine the number of clusters. This may assign the examples to four clusters: red cars, red trucks, gray cars, and gray trucks. This new clustering now at least captures information about both attributes, but it has lost information about similarity. Red cars are in a different cluster from gray cars, just as they are in a different cluster from gray trucks. The output of the clustering algorithm does not tell us that red cars are more similar to gray cars than they are to gray trucks. They are different from both things, and that is all we know.

These issues illustrate some of the reasons that we may prefer a distributed representation to a one-hot representation. A distributed representation could have two attributes for each vehicle: one representing its color and one representing whether it is a car or a truck. It is still not entirely clear what the optimal distributed representation is (how can the learning algorithm know whether the two attributes we are interested in are color and car-versus-truck rather than manufacturer and age?) but having many attributes reduces the burden on the algorithm to guess which single attribute we care about, and allows us to measure similarity between objects in a fine-grained way by comparing many attributes instead of just testing whether one attribute matches.
5.9 Stochastic Gradient Descent

Nearly all of deep learning is powered by one very important algorithm: stochastic gradient descent or SGD. Stochastic gradient descent is an extension of the gradient
descent algorithm introduced in Sec. 4.3.

A recurring problem in machine learning is that large training sets are necessary for good generalization, but large training sets are also more computationally expensive.

The cost function used by a machine learning algorithm often decomposes as a sum over training examples of some per-example loss function. For example, the negative conditional log-likelihood of the training data can be written as

    J(θ) = E_{x,y∼p̂_data} L(x, y, θ) = (1/m) Σ_{i=1}^m L(x^(i), y^(i), θ),    (5.96)

where L is the per-example loss L(x, y, θ) = −log p(y | x; θ).
For these additive cost functions, gradient descent requires computing

    ∇_θ J(θ) = (1/m) Σ_{i=1}^m ∇_θ L(x^(i), y^(i), θ).    (5.97)

The computational cost of this operation is O(m). As the training set size grows to billions of examples, the time to take a single gradient step becomes prohibitively long.

The insight of stochastic gradient descent is that the gradient is an expectation. The expectation may be approximately estimated using a small set of samples. Specifically, on each step of the algorithm, we can sample a minibatch of examples B = {x^(1), . . . , x^(m′)} drawn uniformly from the training set. The minibatch size m′ is typically chosen to be a relatively small number of examples, ranging from 1 to a few hundred. Crucially, m′ is usually held fixed as the training set size m
We may fit a training set with billions of examples using updates computed The estimate of the gradient is formed as on only a hundred examples. m0 The estimate of the gradient1 is formed as X g = 0 ∇θ L(x(i) , y(i) , θ). (5.98) m i=1 1 g= L(x , y , θ). (5.98) m B. The stochastic gradien using examples from the minibatch descen algorithm gradientt descentt ∇ then follo follows ws the estimated gradient B do downhill: wnhill: using examples from the minibatch . The stochastic gradient descent algorithm then follows the estimated gradientθ do wnhill: ← θ − g , (5.99) X
where is the learning rate. where is the learning rate.
θ
θ
← 151−
g ,
(5.99)
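Equations 5.96 through 5.99 can be sketched for a concrete per-example loss. The squared-error loss and the synthetic regression data below are our own illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 10_000, 5
true_w = rng.normal(size=n)
X = rng.normal(size=(m, n))
y = X @ true_w + 0.01 * rng.normal(size=m)

w = np.zeros(n)    # the parameters theta
eps = 0.1          # learning rate epsilon
m_prime = 100      # minibatch size m', held fixed as m grows

for step in range(500):
    # Sample a minibatch B uniformly from the training set.
    idx = rng.choice(m, size=m_prime, replace=False)
    Xb, yb = X[idx], y[idx]
    # g = (1/m') * sum of per-example gradients of (1/2)(x^T w - y)^2.
    g = Xb.T @ (Xb @ w - yb) / m_prime
    w = w - eps * g    # theta <- theta - eps * g

print(np.allclose(w, true_w, atol=0.05))
```

Note that each update touches only m′ = 100 of the 10,000 examples, yet the parameters converge close to the generating weights.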
Gradient descent in general has often been regarded as slow or unreliable. In the past, the application of gradient descent to non-convex optimization problems was regarded as foolhardy or unprincipled. Today, we know that the machine learning models described in Part II work very well when trained with gradient descent. The optimization algorithm may not be guaranteed to arrive at even a local minimum in a reasonable amount of time, but it often finds a very low value of the cost function quickly enough to be useful.

Stochastic gradient descent has many important uses outside the context of deep learning. It is the main way to train large linear models on very large datasets. For a fixed model size, the cost per SGD update does not depend on the training set size m. In practice, we often use a larger model as the training set size increases, but we are not forced to do so. The number of updates required to reach convergence usually increases with training set size. However, as m approaches infinity, the model will eventually converge to its best possible test error before SGD has sampled every example in the training set. Increasing m further will not extend the amount of training time needed to reach the model's best possible test error. From this point of view, one can argue that the asymptotic cost of training a model with SGD is O(1) as a function of m.

Prior to the advent of deep learning, the main way to learn nonlinear models was to use the kernel trick in combination with a linear model. Many kernel learning algorithms require constructing an m × m matrix G_{i,j} = k(x^(i), x^(j)). Constructing this matrix has computational cost O(m^2), which is clearly undesirable for datasets with billions of examples. In academia, starting in 2006, deep learning was initially interesting because it was able to generalize to new examples better than competing algorithms when trained on medium-sized datasets with tens of thousands of examples. Soon after, deep learning garnered additional interest in industry, because it provided a scalable way of training nonlinear models on large datasets.

Stochastic gradient descent and many enhancements to it are described further in Chapter 8.
5.10 Building a Machine Learning Algorithm

Nearly all deep learning algorithms can be described as particular instances of a fairly simple recipe: combine a specification of a dataset, a cost function, an optimization procedure and a model.

For example, the linear regression algorithm combines a dataset consisting of
X and y, the cost function

    J(w, b) = −E_{x,y∼p̂_data} log p_model(y | x),    (5.100)

the model specification p_model(y | x) = N(y; x^⊤w + b, 1), and, in most cases, the optimization algorithm defined by solving for where the gradient of the cost is zero using the normal equations.

By realizing that we can replace any of these components mostly independently from the others, we can obtain a very wide variety of algorithms.

The cost function typically includes at least one term that causes the learning process to perform statistical estimation. The most common cost function is the negative log-likelihood, so that minimizing the cost function causes maximum likelihood estimation.

The cost function may also include additional terms, such as regularization terms. For example, we can add weight decay to the linear regression cost function to obtain

    J(w, b) = λ‖w‖₂² − E_{x,y∼p̂_data} log p_model(y | x).    (5.101)

This still allows closed-form optimization.

If we change the model to be nonlinear, then most cost functions can no longer be optimized in closed form. This requires us to choose an iterative numerical optimization procedure, such as gradient descent.

The recipe for constructing a learning algorithm by combining models, costs, and optimization algorithms supports both supervised and unsupervised learning. The linear regression example shows how to support supervised learning. Unsupervised learning can be supported by defining a dataset that contains only X and providing an appropriate unsupervised cost and model. For example, we can obtain the first PCA vector by specifying that our loss function is

    J(w) = E_{x∼p̂_data} ‖x − r(x; w)‖₂²    (5.102)

while our model is defined to have w with norm one and reconstruction function r(x) = w^⊤x w.

In some cases, the cost function may be a function that we cannot actually evaluate, for computational reasons. In these cases, we can still approximately minimize it using iterative numerical optimization so long as we have some way of approximating its gradients.

Most machine learning algorithms make use of this recipe, though it may not immediately be obvious. If a machine learning algorithm seems especially unique or
hand-designed, it can usually be understood as using a special-case optimizer. Some models such as decision trees or k-means require special-case optimizers because their cost functions have flat regions that make them inappropriate for minimization by gradient-based optimizers. Recognizing that most machine learning algorithms can be described using this recipe helps to see the different algorithms as part of a taxonomy of methods for doing related tasks that work for similar reasons, rather than as a long list of algorithms that each have separate justifications.
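As a concrete instance of the recipe, a weight-decay cost in the spirit of Eq. 5.101 still admits a closed-form optimizer. The sketch below uses our own toy data and our own scaling of the objective; it solves the regularized normal equations and checks that weight decay shrinks the solution relative to ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 3
X = rng.normal(size=(m, n))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=m)

lam = 0.1
# Setting the gradient of (1/2m)||Xw - y||^2 + lam * w^T w to zero gives
# the regularized normal equations (X^T X / m + 2 lam I) w = X^T y / m.
w = np.linalg.solve(X.T @ X / m + 2 * lam * np.eye(n), X.T @ y / m)
w_ols = np.linalg.solve(X.T @ X, X.T @ y)   # lam = 0 recovers least squares

print(np.linalg.norm(w) < np.linalg.norm(w_ols))
```

Swapping the optimizer (normal equations) for gradient descent, or the model for a nonlinear one, changes only one component of the recipe while the others stay fixed.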
5.11 Challenges Motivating Deep Learning

The simple machine learning algorithms described in this chapter work very well on a wide variety of important problems. However, they have not succeeded in solving the central problems in AI, such as recognizing speech or recognizing objects.

The development of deep learning was motivated in part by the failure of traditional algorithms to generalize well on such AI tasks.

This section is about how the challenge of generalizing to new examples becomes exponentially more difficult when working with high-dimensional data, and how the mechanisms used to achieve generalization in traditional machine learning are insufficient to learn complicated functions in high-dimensional spaces. Such spaces also often impose high computational costs. Deep learning was designed to overcome these and other obstacles.
5.11.1 The Curse of Dimensionality

Many machine learning problems become exceedingly difficult when the number of dimensions in the data is high. This phenomenon is known as the curse of dimensionality. Of particular concern is that the number of possible distinct configurations of a set of variables increases exponentially as the number of variables increases.
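The exponential growth is easy to make concrete: with v distinguishable values per variable, d variables admit v**d joint configurations, which quickly dwarfs any fixed number of training examples. The particular numbers below are our own illustration:

```python
v = 10           # distinguishable values per variable
m = 1_000_000    # a generously sized training set
for d in (2, 4, 8, 16):
    n_configs = v ** d
    # Already at d = 8, the configurations outnumber the examples 100 to 1.
    print(d, n_configs, n_configs > m)
```

Even a million examples cannot begin to cover the 10^16 configurations of just sixteen ten-valued variables.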
[Figure 5.9: as the number of dimensions d grows, the number of distinct cell configurations grows as O(v^d), where v is the number of values distinguished along each axis, e.g. 10 × 10 × 10 = 1000 cells in three dimensions.]
The curse of dimensionality arises in many places in computer science, and especially so in machine learning.

One challenge posed by the curse of dimensionality is a statistical challenge. As illustrated in Fig. 5.9, a statistical challenge arises because the number of possible configurations of x is much larger than the number of training examples. To understand the issue, let us consider that the input space is organized into a grid, like in the figure. In low dimensions we can describe this space with a low number of grid cells that are mostly occupied by the data. When generalizing to a new data point, we can usually tell what to do simply by inspecting the training examples that lie in the same cell as the new input. For example, if estimating the probability density at some point x, we can just return the number of training examples in the same unit volume cell as x, divided by the total number of training examples. If we wish to classify an example, we can return the most common class of training examples in the same cell. If we are doing regression we can average the target values observed over the examples in that cell. But what about the cells for which we have seen no example? Because in high-dimensional spaces the number of configurations is going to be huge, much larger than our number of examples, most configurations will have no training example associated with it.
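We can measure this effect directly by counting how many grid cells actually receive a training example as the dimension grows. The uniform toy data and the 10-cell-per-axis grid below are our own choices:

```python
import numpy as np

def occupied_fraction(d, m=10_000, bins=10, seed=0):
    """Fraction of the bins**d grid cells containing at least one example."""
    X = np.random.default_rng(seed).uniform(size=(m, d))
    cells = {tuple(c) for c in (X * bins).astype(int).tolist()}
    return len(cells) / bins ** d

for d in (1, 2, 3, 6):
    # With fixed m, ever fewer cells contain any data as d grows.
    print(d, occupied_fraction(d))
```

With 10,000 examples, every cell is occupied in one or two dimensions, but in six dimensions at most one percent of the million cells can contain any data at all.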
How could we possibly say something meaningful about these new configurations? Many traditional machine learning algorithms simply assume that the output at a new point should be approximately the same as the output at the nearest training point.
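That assumption is itself a one-line algorithm. A minimal sketch, with hypothetical toy data of our own:

```python
import numpy as np

def predict_nearest(X_train, y_train, x):
    """Return the training output at the training point nearest to x."""
    i = np.argmin(np.linalg.norm(X_train - x, axis=1))
    return y_train[i]

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 5.0]])
y_train = np.array([0, 0, 1])
print(predict_nearest(X_train, y_train, np.array([3.5, 4.5])))
```

The query point inherits the label of its nearest training example, which is exactly the local-constancy assumption discussed in the next section.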
5.11.2 Local Constancy and Smoothness Regularization

In order to generalize well, machine learning algorithms need to be guided by prior beliefs about what kind of function they should learn. Previously, we have seen these priors incorporated as explicit beliefs in the form of probability distributions over parameters of the model. More informally, we may also discuss prior beliefs as directly influencing the function itself and only indirectly acting on the parameters via their effect on the function. Additionally, we informally discuss prior beliefs as being expressed implicitly, by choosing algorithms that are biased toward choosing some class of functions over another, even though these biases may not be expressed (or even possible to express) in terms of a probability distribution representing our degree of belief in various functions.

Among the most widely used of these implicit “priors” is the smoothness prior or local constancy prior.
This prior states that the function we learn should not change very much within a small region.

Many simpler algorithms rely exclusively on this prior to generalize well, and as a result they fail to scale to the statistical challenges involved in solving AI-level tasks. Throughout this book, we will describe how deep learning introduces additional (explicit and implicit) priors in order to reduce the generalization error on sophisticated tasks. Here, we explain why the smoothness prior alone is insufficient for these tasks.

There are many different ways to implicitly or explicitly express a prior belief that the learned function should be smooth or locally constant. All of these different methods are designed to encourage the learning process to learn a function f∗ that
All of these different satisfies the condition methods are designed to encourage the that f ∗ (x ) ≈learning f∗ (x + pro ) cess to learn a function f(5.103) satisfies the condition for most configurations x and small know w (5.103) a go goo od f (x)change f (x.+In ) other words, if we kno answ answer er for an input x (for example, if≈x is a lab labeled eled training example) then that x for most configurations and small change . In words, if we go kno a good answ answer er is probably go goood in the neigh neighb borho orhoood of x. other If we hav have e several goo o dwanswers x is athem answ er for an b input if bine labeled example) then that in some neigh neighb orho orhoo oxd(for we example, would com combine (b (by ytraining some form of av averaging eraging or answ erolation) is probably goduce od inanthe neighbthat orhoagrees od of xwith . If we e yseveral gooasd m answers in interp terp terpolation) to pro produce answer as hav man many of them uc uch h as inossible. some neighborhood we would combine them (by some form of averaging or p interpolation) to pro duce an answer that agrees with as many of them as much as An extreme example of the lo local cal constancy approac approach h is the k -nearest neighbors possible. 156 An extreme example of the local constancy approach is the k -nearest neighbors
CHAPTER 5. MACHINE LEARNING BASICS
family of learning algorithms. These predictors are literally constant over each region containing all the points x that have the same set of k nearest neighbors in the training set. For k = 1, the number of distinguishable regions cannot be more than the number of training examples.

While the k-nearest neighbors algorithm copies the output from nearby training examples, most kernel machines interpolate between training set outputs associated with nearby training examples. An important class of kernels is the family of local kernels where k(u, v) is large when u = v and decreases as u and v grow farther apart from each other. A local kernel can be thought of as a similarity function that performs template matching, by measuring how closely a test example x resembles each training example x^(i).
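The idea of a local kernel as template matching can be sketched numerically. The Gaussian (RBF) kernel below is one standard example of a local kernel; the bandwidth sigma and the toy training set are illustrative choices, not taken from the text.

```python
import numpy as np

def gaussian_kernel(u, v, sigma=0.3):
    """Local kernel: maximal when u == v, decaying as u and v move apart."""
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

def kernel_predict(x, X_train, y_train, sigma=0.3):
    """Interpolate between training outputs, weighting each training
    example by its similarity to the test point (template matching)."""
    w = np.array([gaussian_kernel(x, xi, sigma) for xi in X_train])
    return np.dot(w, y_train) / np.sum(w)

X_train = np.array([[0.0], [1.0], [2.0]])
y_train = np.array([0.0, 1.0, 4.0])
# With a narrow kernel, the prediction at x = 1 is dominated by the
# training example whose template it most closely matches.
print(kernel_predict(np.array([1.0]), X_train, y_train))  # close to 1.0
```

Because the weights decay with distance, the prediction far from all training examples degenerates toward a broad average: exactly the locality limitation discussed next.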
Much of the modern motivation for deep learning is derived from studying the limitations of local template matching and how deep models are able to succeed in cases where local template matching fails (Bengio et al., 2006b).

Decision trees also suffer from the limitations of exclusively smoothness-based learning because they break the input space into as many regions as there are leaves and use a separate parameter (or sometimes many parameters for extensions of decision trees) in each region. If the target function requires a tree with at least n leaves to be represented accurately, then at least n training examples are required to fit the tree. A multiple of n is needed to achieve some level of statistical confidence in the predicted output.

In general, to distinguish O(k) regions in input space, all of these methods require O(k) examples.
Typically there are O(k) parameters, with O(1) parameters associated with each of the O(k) regions. The case of a nearest neighbor scenario, where each training example can be used to define at most one region, is illustrated in Fig. 5.10.

Is there a way to represent a complex function that has many more regions to be distinguished than the number of training examples? Clearly, assuming only smoothness of the underlying function will not allow a learner to do that. For example, imagine that the target function is a kind of checkerboard. A checkerboard contains many variations but there is a simple structure to them. Imagine what happens when the number of training examples is substantially smaller than the number of black and white squares on the checkerboard. Based on only local generalization and the smoothness or local constancy prior, we would be guaranteed to correctly guess the color of a new point only if it lies within the same checkerboard square as a training example.
There is no guaran guarantee tee that the learner b e guaranteed to correctly guess the color of a new p oin t if it lies same could correctly extend the chec heck kerb erboard oard pattern to poin oints ts lying in within squaresthe that do checcon kerbtain oardtraining square examples. as a training example. There is no tee that the that learner not contain With this prior alone, theguaran only information an could correctly extend the checkerboard pattern to points lying in squares that do not contain training examples. With this 157prior alone, the only information that an
example tells us is the color of its square, and the only way to get the colors of the entire checkerboard right is to cover each of its cells with at least one example.

The smoothness assumption and the associated non-parametric learning algorithms work extremely well so long as there are enough examples for the learning algorithm to observe high points on most peaks and low points on most valleys of the true underlying function to be learned. This is generally true when the function to be learned is smooth enough and varies in few enough dimensions. In high dimensions, even a very smooth function can change smoothly but in a different way along each dimension. If the function additionally behaves differently in different regions, it can become extremely complicated to describe with a set of training examples. If the function is complicated (we want to distinguish a huge
number of regions compared to the number of examples), is there any hope to generalize well?

The answer to both of these questions is yes. The key insight is that a very large number of regions, e.g., O(2^k), can be defined with O(k) examples, so long as we introduce some dependencies between the regions via additional assumptions about the underlying data generating distribution. In this way, we can actually generalize non-locally (Bengio and Monperrus, 2005; Bengio et al., 2006c). Many different deep learning algorithms provide implicit or explicit assumptions that are reasonable for a broad range of AI tasks in order to capture these advantages.

Other approaches to machine learning often make stronger, task-specific assumptions. For example, we could easily solve the checkerboard task by providing the assumption that the target function is periodic.
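The checkerboard argument is easy to verify numerically. In the sketch below (an illustrative setup, not taken from the text), a 1-nearest-neighbor learner sees far fewer training points than there are squares, while a learner handed the true periodic structure recovers the whole pattern by construction:

```python
import numpy as np

rng = np.random.default_rng(0)

def checkerboard(x, y):
    """Target function: color alternates between adjacent unit squares."""
    return (np.floor(x) + np.floor(y)) % 2

# Far fewer training points (20) than squares on a 10x10 board (100).
X_train = rng.uniform(0, 10, size=(20, 2))
y_train = checkerboard(X_train[:, 0], X_train[:, 1])

def nn_predict(x):
    """1-nearest neighbor: purely local generalization."""
    d = np.sum((X_train - x) ** 2, axis=1)
    return y_train[np.argmin(d)]

X_test = rng.uniform(0, 10, size=(500, 2))
y_test = checkerboard(X_test[:, 0], X_test[:, 1])

nn_acc = np.mean(np.array([nn_predict(x) for x in X_test]) == y_test)
# The "periodic learner" here is simply given the true structure, so it
# is exact by construction -- the point is what the assumption buys.
periodic_acc = np.mean(checkerboard(X_test[:, 0], X_test[:, 1]) == y_test)
print(nn_acc, periodic_acc)  # nn_acc is typically well below 1.0
```

Most test points fall in squares that contain no training example, so the nearest-neighbor learner is near chance there; the periodicity assumption removes the need for an example in every square.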
Usually we do not include such strong, task-specific assumptions into neural networks so that they can generalize to a much wider variety of structures. AI tasks have structure that is much too complex to be limited to simple, manually specified properties such as periodicity, so we want learning algorithms that embody more general-purpose assumptions. The core idea in deep learning is that we assume that the data was generated by the composition of factors, or features, potentially at multiple levels in a hierarchy. Many other similarly generic assumptions can further improve deep learning algorithms. These apparently mild assumptions allow an exponential gain in the relationship between the number of examples and the number of regions that can be distinguished. These exponential gains are described more precisely in Sec. 6.4.1, Sec. 15.4, and Sec. 15.5.
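The exponential counting claim can be made concrete with a toy illustration (this construction is chosen for exposition; it is not the book's): the joint sign pattern of k linear features carves input space into up to 2^k distinguishable regions while using only O(k) parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

k = 10  # k feature directions -> up to 2**k = 1024 distinguishable regions
X = rng.normal(size=(5000, k))

# Each point's region is the joint sign pattern of k linear features
# (here the k coordinates themselves, i.e. k axis-aligned hyperplanes).
# The parameter count grows linearly in k; the region count, exponentially.
codes = (X > 0).astype(int)
regions = {tuple(c) for c in codes}
print(len(regions))  # most of the 1024 possible regions appear
```

A purely local learner would need an example inside each region; a learner that exploits the compositional structure of the sign code does not.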
The exponential advantages conferred by the use of deep, distributed representations counter the exponential challenges posed by the curse of dimensionality.
5.11.3
Manifold Learning
An important concept underlying many ideas in machine learning is that of a manifold.

A manifold is a connected region. Mathematically, it is a set of points, associated with a neighborhood around each point. From any given point, the manifold locally appears to be a Euclidean space. In everyday life, we experience the surface of the world as a 2-D plane, but it is in fact a spherical manifold in 3-D space.

The definition of a neighborhood surrounding each point implies the existence of transformations that can be applied to move on the manifold from one position to a neighboring one. In the example of the world's surface as a manifold, one can walk north, south, east, or west.

Although there is a formal mathematical meaning to the term “manifold,”
in machine learning it tends to be used more loosely to designate a connected set of points that can be approximated well by considering only a small number of degrees of freedom, or dimensions, embedded in a higher-dimensional space. Each dimension corresponds to a local direction of variation. See Fig. 5.11 for an example of training data lying near a one-dimensional manifold embedded in two-dimensional space. In the context of machine learning, we allow the dimensionality of the manifold to vary from one point to another. This often happens when a manifold intersects itself. For example, a figure eight is a manifold that has a single dimension in most places but two dimensions at the intersection at the center.
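The figure eight can be written down explicitly. The sketch below uses one standard parametrization (chosen for illustration): a single parameter t traces the whole curve, yet the curve crosses itself at the origin, where two local directions of variation meet.

```python
import numpy as np

t = np.linspace(0, 2 * np.pi, 1000)
# A single degree of freedom t generates the entire figure eight...
x, y = np.sin(t), np.sin(t) * np.cos(t)

# ...but the curve passes through the origin at t = 0, pi, and 2*pi,
# approaching with different tangent directions, so the manifold has
# two local directions of variation at the crossing point.
near_origin = t[(np.abs(x) < 1e-2) & (np.abs(y) < 1e-2)]
print(near_origin)  # clusters of t values near 0, pi, and 2*pi
```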
Many machine learning problems seem hopeless if we expect the machine learning algorithm to learn functions with interesting variations across all of R^n. Manifold learning algorithms surmount this obstacle by assuming that most of R^n consists of invalid inputs, and that interesting inputs occur only along a collection of manifolds containing a small subset of points, with interesting variations in the output of the learned function occurring only along directions that lie on the manifold, or with interesting variations happening only when we move from one manifold to another. Manifold learning was introduced in the case of continuous-valued data and the unsupervised learning setting, although this probability concentration idea can be generalized to both discrete data and the
supervised learning setting: the key assumption remains that probability mass is highly concentrated.

The assumption that the data lies along a low-dimensional manifold may not always be correct or useful. We argue that in the context of AI tasks, such as those that involve processing images, sounds, or text, the manifold assumption is at least approximately correct. The evidence in favor of this assumption consists of two categories of observations.

The first observation in favor of the manifold hypothesis is that the probability distribution over images, text strings, and sounds that occur in real life is highly concentrated. Uniform noise essentially never resembles structured inputs from these domains. Fig. 5.12 shows how, instead, uniformly sampled points look like the patterns of static that appear on analog television sets when no signal is available.
Similarly, if you generate a document by picking letters uniformly at random, what is the probability that you will get a meaningful English-language text? Almost zero, again, because most of the long sequences of letters do not correspond to a natural language sequence: the distribution of natural language sequences occupies a very small volume in the total space of sequences of letters.
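This thought experiment is easy to quantify with a rough sketch (the tiny stand-in dictionary below is an illustrative assumption): even the probability that a short uniformly random string is an English word is minuscule, and it shrinks exponentially with length.

```python
import random
import string

random.seed(0)
# Tiny stand-in dictionary: a real word list is far larger, yet still a
# vanishing fraction of all possible letter sequences.
words = {"the", "and", "cat", "dog", "deep", "learn"}

n_trials = 100_000
hits = sum(
    "".join(random.choices(string.ascii_lowercase, k=3)) in words
    for _ in range(n_trials)
)
# 4 of the 26**3 = 17576 three-letter strings are in this dictionary,
# so the hit probability is about 2e-4 -- and it shrinks exponentially
# as the string gets longer.
print(hits / n_trials)
```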
Of course, concentrated probability distributions are not sufficient to show that the data lies on a reasonably small number of manifolds. We must also establish that the examples we encounter are connected to each other by other
examples, with each example surrounded by other highly similar examples that may be reached by applying transformations to traverse the manifold. The second argument in favor of the manifold hypothesis is that we can also imagine such neighborhoods and transformations, at least informally. In the case of images, we can certainly think of many possible transformations that allow us to trace out a manifold in image space: we can gradually dim or brighten the lights, gradually move or rotate objects in the image, gradually alter the colors on the surfaces of objects, etc. It remains likely that there are multiple manifolds involved in most applications. For example, the manifold of images of human faces may not be connected to the manifold of images of cat faces.

These thought experiments supporting the manifold hypotheses convey some intuitive reasons supporting it.
More rigorous experiments (Cayton, 2005; Narayanan and Mitter, 2010; Schölkopf et al., 1998; Roweis and Saul, 2000; Tenenbaum et al., 2000; Brand, 2003; Belkin and Niyogi, 2003; Donoho and Grimes, 2003; Weinberger and Saul, 2004) clearly support the hypothesis for a large class of datasets of interest in AI.

When the data lies on a low-dimensional manifold, it can be most natural for machine learning algorithms to represent the data in terms of coordinates on the manifold, rather than in terms of coordinates in R^n. In everyday life, we can think of roads as 1-D manifolds embedded in 3-D space. We give directions to specific addresses in terms of address numbers along these 1-D roads, not in terms of coordinates in 3-D space. Extracting these manifold coordinates is challenging, but holds the promise to improve many machine learning algorithms. This general principle is applied in
many contexts. Fig. 5.13 shows the manifold structure of a dataset consisting of faces. By the end of this book, we will have developed the methods necessary to learn such a manifold structure. In Fig. 20.6, we will see how a machine learning algorithm can successfully accomplish this goal.

This concludes Part I, which has provided the basic concepts in mathematics and machine learning which are employed throughout the remaining parts of the book. You are now prepared to embark upon your study of deep learning.
Part II

Deep Networks: Modern Practices
This part of the book summarizes the state of modern deep learning as it is used to solve practical applications.

Deep learning has a long history and many aspirations. Several approaches have been proposed that have yet to entirely bear fruit. Several ambitious goals have yet to be realized. These less-developed branches of deep learning appear in the final part of the book.

This part focuses only on those approaches that are essentially working technologies that are already used heavily in industry.

Modern deep learning provides a very powerful framework for supervised learning. By adding more layers and more units within a layer, a deep network can represent functions of increasing complexity. Most tasks that consist of mapping an input vector to an output vector, and that are easy for a person to do rapidly, can be accomplished via deep learning, given sufficiently large models and sufficiently large datasets of labeled training examples. Other tasks, that can not be described as associating one vector to another, or that are difficult enough that a person would require time to think and reflect in order to accomplish the task, remain beyond the scope of deep learning for now.

This part of the book describes the core parametric function approximation technology that is behind nearly all modern practical applications of deep learning. We begin by describing the feedforward deep network model that is used to represent these functions. Next, we present advanced techniques for regularization and optimization of such models. Scaling these models to large inputs such as high resolution images or long temporal sequences requires specialization. We introduce the convolutional network for scaling to large images and the recurrent neural network for processing temporal sequences. Finally, we present general guidelines for the practical methodology involved in designing, building, and configuring an application involving deep learning, and review some of the applications of deep learning.

These chapters are the most important for a practitioner—someone who wants to begin implementing and using deep learning algorithms to solve real-world problems today.
Chapter 6
Deep Feedforward Networks

Deep feedforward networks, also often called feedforward neural networks, or multilayer perceptrons (MLPs), are the quintessential deep learning models. The goal of a feedforward network is to approximate some function f*. For example, for a classifier, y = f*(x) maps an input x to a category y. A feedforward network defines a mapping y = f(x; θ) and learns the value of the parameters θ that result in the best function approximation.

These models are called feedforward because information flows through the function being evaluated from x, through the intermediate computations used to define f, and finally to the output y. There are no feedback connections in which outputs of the model are fed back into itself. When feedforward neural networks are extended to include feedback connections, they are called recurrent neural networks, presented in Chapter 10.

Feedforward networks are of extreme importance to machine learning practitioners. They form the basis of many important commercial applications. For example, the convolutional networks used for object recognition from photos are a specialized kind of feedforward network. Feedforward networks are a conceptual stepping stone on the path to recurrent networks, which power many natural language applications.

Feedforward neural networks are called networks because they are typically represented by composing together many different functions. The model is associated with a directed acyclic graph describing how the functions are composed together. For example, we might have three functions f^(1), f^(2), and f^(3) connected in a chain, to form f(x) = f^(3)(f^(2)(f^(1)(x))). These chain structures are the most commonly used structures of neural networks. In this case, f^(1) is called the first layer of the network, f^(2) is called the second layer, and so on. The overall length
CHAPTER 6. DEEP FEEDFORWARD NETWORKS
of the chain gives the depth of the model. It is from this terminology that the name "deep learning" arises. The final layer of a feedforward network is called the output layer. During neural network training, we drive f(x) to match f*(x). The training data provides us with noisy, approximate examples of f*(x) evaluated at different training points. Each example x is accompanied by a label y ≈ f*(x). The training examples specify directly what the output layer must do at each point x; it must produce a value that is close to y. The behavior of the other layers is not directly specified by the training data. The learning algorithm must decide how to use those layers to produce the desired output, but the training data does not say what each individual layer should do. Instead, the learning algorithm must decide how to use these layers to best implement an approximation of f*. Because the training data does not show the desired output for each of these layers, these layers are called hidden layers.

Finally, these networks are called neural because they are loosely inspired by neuroscience. Each hidden layer of the network is typically vector-valued. The dimensionality of these hidden layers determines the width of the model. Each element of the vector may be interpreted as playing a role analogous to a neuron. Rather than thinking of the layer as representing a single vector-to-vector function, we can also think of the layer as consisting of many units that act in parallel, each representing a vector-to-scalar function. Each unit resembles a neuron in the sense that it receives input from many other units and computes its own activation value. The idea of using many layers of vector-valued representation is drawn from neuroscience. The choice of the functions f^(i)(x) used to compute these representations is also loosely guided by neuroscientific observations about the functions that biological neurons compute. However, modern neural network research is guided by many mathematical and engineering disciplines, and the goal of neural networks is not to perfectly model the brain. It is best to think of feedforward networks as function approximation machines that are designed to achieve statistical generalization, occasionally drawing some insights from what we know about the brain, rather than as models of brain function.

One way to understand feedforward networks is to begin with linear models and consider how to overcome their limitations. Linear models, such as logistic regression and linear regression, are appealing because they may be fit efficiently and reliably, either in closed form or with convex optimization. Linear models also have the obvious defect that the model capacity is limited to linear functions, so the model cannot understand the interaction between any two input variables.

To extend linear models to represent nonlinear functions of x, we can apply the linear model not to x itself but to a transformed input φ(x), where φ is a
nonlinear transformation. Equivalently, we can apply the kernel trick described in Sec. 5.7.2, to obtain a nonlinear learning algorithm based on implicitly applying the φ mapping. We can think of φ as providing a set of features describing x, or as providing a new representation for x.

The question is then how to choose the mapping φ.

1. One option is to use a very generic φ, such as the infinite-dimensional φ that is implicitly used by kernel machines based on the RBF kernel. If φ(x) is of high enough dimension, we can always have enough capacity to fit the training set, but generalization to the test set often remains poor. Very generic feature mappings are usually based only on the principle of local smoothness and do not encode enough prior information to solve advanced problems.

2. Another option is to manually engineer φ. Until the advent of deep learning, this was the dominant approach. This approach requires decades of human effort for each separate task, with practitioners specializing in different domains such as speech recognition or computer vision, and with little transfer between domains.

3. The strategy of deep learning is to learn φ. In this approach, we have a model y = f(x; θ, w) = φ(x; θ)⊤w. We now have parameters θ that we use to learn φ from a broad class of functions, and parameters w that map from φ(x) to the desired output. This is an example of a deep feedforward network, with φ defining a hidden layer. This approach is the only one of the three that gives up on the convexity of the training problem, but the benefits outweigh the harms. In this approach, we parametrize the representation as φ(x; θ) and use the optimization algorithm to find the θ that corresponds to a good representation. If we wish, this approach can capture the benefit of the first approach by being highly generic—we do so by using a very broad family φ(x; θ). This approach can also capture the benefit of the second approach. Human practitioners can encode their knowledge to help generalization by designing families φ(x; θ) that they expect will perform well. The advantage is that the human designer only needs to find the right general function family rather than finding precisely the right function.

This general principle of improving models by learning features extends beyond the feedforward networks described in this chapter. It is a recurring theme of deep learning that applies to all of the kinds of models described throughout this book. Feedforward networks are the application of this principle to learning deterministic
mappings from x to y that lack feedback connections. Other models presented later will apply these principles to learning stochastic mappings, learning functions with feedback, and learning probability distributions over a single vector.

We begin this chapter with a simple example of a feedforward network. Next, we address each of the design decisions needed to deploy a feedforward network. First, training a feedforward network requires making many of the same design decisions as are necessary for a linear model: choosing the optimizer, the cost function, and the form of the output units. We review these basics of gradient-based learning, then proceed to confront some of the design decisions that are unique to feedforward networks. Feedforward networks have introduced the concept of a hidden layer, and this requires us to choose the activation functions that will be used to compute the hidden layer values. We must also design the architecture of the network, including how many layers the network should contain, how these networks should be connected to each other, and how many units should be in each layer. Learning in deep neural networks requires computing the gradients of complicated functions. We present the back-propagation algorithm and its modern generalizations, which can be used to efficiently compute these gradients. Finally, we close with some historical perspective.
6.1 Example: Learning XOR
To make the idea of a feedforward network more concrete, we begin with an example of a fully functioning feedforward network on a very simple task: learning the XOR function.

The XOR function ("exclusive or") is an operation on two binary values, x1 and x2. When exactly one of these binary values is equal to 1, the XOR function returns 1. Otherwise, it returns 0. The XOR function provides the target function y = f*(x) that we want to learn. Our model provides a function y = f(x; θ) and our learning algorithm will adapt the parameters θ to make f as similar as possible to f*.

In this simple example, we will not be concerned with statistical generalization. We want our network to perform correctly on the four points X = {[0, 0]⊤, [0, 1]⊤, [1, 0]⊤, and [1, 1]⊤}. We will train the network on all four of these points. The only challenge is to fit the training set.

We can treat this problem as a regression problem and use a mean squared error loss function. We choose this loss function to simplify the math for this example as much as possible. We will see later that there are other, more appropriate
approaches for modeling binary data. Evaluated on our whole training set, the MSE loss function is

J(θ) = (1/4) Σ_{x∈X} ( f*(x) − f(x; θ) )².   (6.1)

Now we must choose the form of our model, f(x; θ). Suppose that we choose a linear model, with θ consisting of w and b. Our model is defined to be

f(x; w, b) = x⊤w + b.   (6.2)

We can minimize J(θ) in closed form with respect to w and b using the normal equations.

After solving the normal equations, we obtain w = 0 and b = 1/2. The linear model simply outputs 0.5 everywhere. Why does this happen? Fig. 6.1 shows how a linear model is not able to represent the XOR function. One way to solve this problem is to use a model that learns a different feature space in which a linear model is able to represent the solution.

Specifically, we will introduce a very simple feedforward network with one hidden layer containing two hidden units. See Fig. 6.2 for an illustration of this model. This feedforward network has a vector of hidden units h that are computed by a function f^(1)(x; W, c). The values of these hidden units are then used as the input for a second layer. The second layer is the output layer of the network.
The output layer is still just a linear regression model, but now it is applied to h rather than to x. The network now contains two functions chained together: h = f^(1)(x; W, c) and y = f^(2)(h; w, b), with the complete model being f(x; W, c, w, b) = f^(2)(f^(1)(x)).

What function should f^(1) compute? Linear models have served us well so far, and it may be tempting to make f^(1) be linear as well. Unfortunately, if f^(1) were linear, then the feedforward network as a whole would remain a linear function of its input. Ignoring the intercept terms for the moment, suppose f^(1)(x) = W⊤x and f^(2)(h) = h⊤w. Then f(x) = w⊤W⊤x. We could represent this function as f(x) = x⊤w′ where w′ = Ww.

Clearly, we must use a nonlinear function to describe the features. Most neural networks do so using an affine transformation controlled by learned parameters, followed by a fixed, nonlinear function called an activation function. We use that strategy here, by defining h = g(W⊤x + c), where W provides the weights of a linear transformation and c the biases. Previously, to describe a linear regression model, we used a vector of weights and a scalar bias parameter to describe an
[Figure 6.1 here: two scatter plots, the original x space (axes x1 and x2) and the learned h space (axes h1 and h2).]
Figure 6.1: Solving the XOR problem by learning a representation. The bold numbers printed on the plot indicate the value that the learned function must output at each point. (Left) A linear model applied directly to the original input cannot implement the XOR function. When x1 = 0, the model's output must increase as x2 increases. When x1 = 1, the model's output must decrease as x2 increases. A linear model must apply a fixed coefficient w2 to x2. The linear model therefore cannot use the value of x1 to change the coefficient on x2 and cannot solve this problem. (Right) In the transformed space represented by the features extracted by a neural network, a linear model can now solve the problem. In our example solution, the two points that must have output 1 have been collapsed into a single point in feature space. In other words, the nonlinear features have mapped both x = [1, 0]⊤ and x = [0, 1]⊤ to a single point in feature space, h = [1, 0]⊤. The linear model can now describe the function as increasing in h1 and decreasing in h2. In this example, the motivation for learning the feature space is only to make the model capacity greater so that it can fit the training set. In more realistic applications, learned representations can also help the model to generalize.
[Figure 6.2 here: the network drawn as a graph, with input x mapped by W to hidden layer h, then to output y.]
Figure 6.2: An example of a feedforward network, drawn in two different styles. Specifically, this is the feedforward network we use to solve the XOR example. It has a single hidden layer containing two units. (Left) In this style, we draw every unit as a node in the graph. This style is very explicit and unambiguous but for networks larger than this example it can consume too much space. (Right) In this style, we draw a node in the graph for each entire vector representing a layer's activations. This style is much more compact. Sometimes we annotate the edges in this graph with the name of the parameters that describe the relationship between two layers. Here, we indicate that a matrix W describes the mapping from x to h, and a vector w describes the mapping from h to y. We typically omit the intercept parameters associated with each layer when labeling this kind of drawing.
affine transformation from an input vector to an output scalar. Now, we describe an affine transformation from a vector x to a vector h, so an entire vector of bias parameters is needed. The activation function g is typically chosen to be a function that is applied element-wise, with h_i = g(x⊤ W_{:,i} + c_i). In modern neural networks, the default recommendation is to use the rectified linear unit or ReLU (Jarrett et al., 2009; Nair and Hinton, 2010; Glorot et al., 2011a), defined by the activation function g(z) = max{0, z}, depicted in Fig. 6.3.

We can now specify our complete network as

    f(x; W, c, w, b) = w⊤ max{0, W⊤x + c} + b.    (6.3)

We can now specify a solution to the XOR problem. Let

    W = [ 1  1 ]
        [ 1  1 ],    (6.4)

    c = [  0 ]
        [ -1 ],    (6.5)

    w = [  1 ]
        [ -2 ],    (6.6)
Figure 6.3: The rectified linear activation function. This activation function is the default activation function recommended for use with most feedforward neural networks. Applying this function to the output of a linear transformation yields a nonlinear transformation. However, the function remains very close to linear, in the sense that it is a piecewise linear function with two linear pieces. Because rectified linear units are nearly linear, they preserve many of the properties that make linear models easy to optimize with gradient-based methods. They also preserve many of the properties that make linear models generalize well. A common principle throughout computer science is that we can build complicated systems from minimal components. Much as a Turing machine's memory needs only to be able to store 0 or 1 states, we can build a universal function approximator from rectified linear functions.
and b = 0.

We can now walk through the way that the model processes a batch of inputs. Let X be the design matrix containing all four points in the binary input space, with one example per row:

        [ 0  0 ]
    X = [ 0  1 ]
        [ 1  0 ]
        [ 1  1 ].    (6.7)

The first step in the neural network is to multiply the input matrix by the first layer's weight matrix:

         [ 0  0 ]
    XW = [ 1  1 ]
         [ 1  1 ]
         [ 2  2 ].    (6.8)

Next, we add the bias vector c, to obtain

    [ 0 -1 ]
    [ 1  0 ]
    [ 1  0 ]
    [ 2  1 ].    (6.9)

In this space, all of the examples lie along a line with slope 1. As we move along this line, the output needs to begin at 0, then rise to 1, then drop back down to 0. A linear model cannot implement such a function. To finish computing the value of h for each example, we apply the rectified linear transformation:

    [ 0  0 ]
    [ 1  0 ]
    [ 1  0 ]
    [ 2  1 ].    (6.10)

This transformation has changed the relationship between the examples. They no longer lie on a single line. As shown in Fig. 6.1, they now lie in a space where a linear model can solve the problem.

We finish by multiplying by the weight vector w:

    [ 0 ]
    [ 1 ]
    [ 1 ]
    [ 0 ].    (6.11)
The neural network has obtained the correct answer for every example in the batch.

In this example, we simply specified the solution, then showed that it obtained zero error. In a real situation, there might be billions of model parameters and billions of training examples, so one cannot simply guess the solution as we did here. Instead, a gradient-based optimization algorithm can find parameters that produce very little error. The solution we described to the XOR problem is at a global minimum of the loss function, so gradient descent could converge to this point. There are other equivalent solutions to the XOR problem that gradient descent could also find. The convergence point of gradient descent depends on the initial values of the parameters. In practice, gradient descent would usually not find clean, easily understood, integer-valued solutions like the one we presented here.
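The whole forward pass above can be reproduced in a few lines of NumPy. This is an illustrative sketch, not part of the original text; it simply re-evaluates Eqs. 6.3 to 6.11 on the four XOR inputs:

```python
import numpy as np

# Parameters from Eqs. 6.4-6.6, with b = 0
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])
b = 0.0

# Design matrix of Eq. 6.7: all four binary inputs, one example per row
X = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0],
              [1.0, 1.0]])

h = np.maximum(0.0, X @ W + c)  # Eqs. 6.8-6.10: affine transform, then ReLU
y_hat = h @ w + b               # Eq. 6.11: final linear layer

print(y_hat)  # [0. 1. 1. 0.], the XOR of each row's two inputs
```

Running this reproduces the hand computation exactly: the intermediate array `h` matches Eq. 6.10 and the output matches Eq. 6.11.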
6.2
Gradient-Based Learning
Designing and training a neural network is not much different from training any other machine learning model with gradient descent. In Sec. 5.10, we described how to build a machine learning algorithm by specifying an optimization procedure, a cost function, and a model family.

The largest difference between the linear models we have seen so far and neural networks is that the nonlinearity of a neural network causes most interesting loss functions to become non-convex. This means that neural networks are usually trained by using iterative, gradient-based optimizers that merely drive the cost function to a very low value, rather than the linear equation solvers used to train linear regression models or the convex optimization algorithms with global convergence guarantees used to train logistic regression or SVMs.
Convex optimization converges starting from any initial parameters (in theory; in practice it is very robust but can encounter numerical problems). Stochastic gradient descent applied to non-convex loss functions has no such convergence guarantee, and is sensitive to the values of the initial parameters. For feedforward neural networks, it is important to initialize all weights to small random values. The biases may be initialized to zero or to small positive values. The iterative gradient-based optimization algorithms used to train feedforward networks and almost all other deep models will be described in detail in Chapter 8, with parameter initialization in particular discussed in Sec. 8.4. For the moment, it suffices to understand that the training algorithm is almost always based on using the gradient to descend the cost function in one way or another.
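The dependence on initialization is easy to see even in one dimension. The toy sketch below is illustrative and not from the text: it runs gradient descent on the non-convex function f(w) = (w^2 - 1)^2, which has two global minima, at w = -1 and w = +1, and which minimum is reached depends entirely on the starting point:

```python
# Gradient descent on the non-convex toy function f(w) = (w**2 - 1)**2,
# whose gradient is f'(w) = 4 * w * (w**2 - 1).
def grad(w):
    return 4.0 * w * (w * w - 1.0)

def descend(w0, lr=0.1, steps=100):
    """Plain gradient descent from starting point w0."""
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return w

w_from_positive = descend(0.5)   # converges to the minimum at +1
w_from_negative = descend(-0.5)  # converges to the minimum at -1
print(w_from_positive, w_from_negative)
```

Both runs use the same learning rate and step count; only the initial parameter differs, yet they settle into different minima.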
The specific algorithms are improvements and refinements on the ideas of gradient descent, introduced in Sec. 4.3, and,
more specifically, are most often improvements of the stochastic gradient descent algorithm, introduced in Sec. 5.9.

We can, of course, train models such as linear regression and support vector machines with gradient descent too, and in fact this is common when the training set is extremely large. From this point of view, training a neural network is not much different from training any other model. Computing the gradient is slightly more complicated for a neural network, but can still be done efficiently and exactly. Sec. 6.5 will describe how to obtain the gradient using the back-propagation algorithm and modern generalizations of the back-propagation algorithm.

As with other machine learning models, to apply gradient-based learning we must choose a cost function, and we must choose how to represent the output of the model.
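The point that the gradient can be computed "efficiently and exactly" can be made concrete with a finite-difference check. The sketch below is illustrative (the linear model and random data are made up, and back-propagation itself is the subject of Sec. 6.5): the exact analytic gradient of a mean squared error loss agrees closely with a central-difference approximation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # made-up inputs
y = rng.normal(size=8)        # made-up targets
w = rng.normal(size=3)        # parameters at which we evaluate the gradient

def loss(w):
    """Mean squared error of the linear model f(x; w) = x . w"""
    r = X @ w - y
    return (r ** 2).mean()

# Exact gradient, derived by hand: d/dw mean((Xw - y)^2) = (2/n) X^T (Xw - y)
analytic = 2.0 / len(y) * X.T @ (X @ w - y)

# Central finite differences: approximate, and needing one pair of extra
# loss evaluations per parameter
eps = 1e-6
numeric = np.zeros_like(w)
for i in range(len(w)):
    e = np.zeros_like(w)
    e[i] = eps
    numeric[i] = (loss(w + e) - loss(w - e)) / (2.0 * eps)

print(np.max(np.abs(analytic - numeric)))  # agreement to several decimal places
```

Back-propagation generalizes the hand derivation above to arbitrary compositions of layers, keeping the gradient exact while avoiding the per-parameter cost of finite differences.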
We now revisit these design considerations with special emphasis on the neural networks scenario.

6.2.1 Cost Functions

An important aspect of the design of a deep neural network is the choice of the cost function. Fortunately, the cost functions for neural networks are more or less the same as those for other parametric models, such as linear models.

In most cases, our parametric model defines a distribution p(y | x; θ) and we simply use the principle of maximum likelihood. This means we use the cross-entropy between the training data and the model's predictions as the cost function.

Sometimes, we take a simpler approach, where rather than predicting a complete probability distribution over y, we merely predict some statistic of y conditioned on x. Specialized loss functions allow us to train a predictor of these estimates.

The total cost function used to train a neural network will often combine one
of the primary cost functions described here with a regularization term. We have already seen some simple examples of regularization applied to linear models in Sec. 5.2.2. The weight decay approach used for linear models is also directly applicable to deep neural networks and is among the most popular regularization strategies. More advanced regularization strategies for neural networks will be described in Chapter 7.

6.2.1.1 Learning Conditional Distributions with Maximum Likelihood

Most modern neural networks are trained using maximum likelihood. This means that the cost function is simply the negative log-likelihood, equivalently described
as the cross-entropy between the training data and the model distribution. This cost function is given by

    J(θ) = −E_{x,y∼p̂_data} log p_model(y | x).    (6.12)

The specific form of the cost function changes from model to model, depending on the specific form of log p_model. The expansion of the above equation typically yields some terms that do not depend on the model parameters and may be discarded. For example, as we saw in Sec. 5.5.1, if p_model(y | x) = N(y; f(x; θ), I), then we recover the mean squared error cost,

    J(θ) = (1/2) E_{x,y∼p̂_data} ||y − f(x; θ)||² + const,    (6.13)

up to a scaling factor of 1/2 and a term that does not depend on θ. The discarded constant is based on the variance of the Gaussian distribution, which in this case we chose not to parametrize. Previously, we saw that the equivalence between
maximum likelihood estimation with a Gaussian output distribution and minimization of mean squared error holds for a linear model, but in fact, the equivalence holds regardless of the f(x; θ) used to predict the mean of the Gaussian.

An advantage of this approach of deriving the cost function from maximum likelihood is that it removes the burden of designing cost functions for each model. Specifying a model p(y | x) automatically determines a cost function log p(y | x).

One recurring theme throughout neural network design is that the gradient of the cost function must be large and predictable enough to serve as a good guide for the learning algorithm. Functions that saturate (become very flat) undermine this objective because they make the gradient become very small. In many cases this happens because the activation functions used to produce the output of the hidden units or the output units saturate.
negativ to negativee log-likelihoo log-likelihood d helps this happ ens b ecause the activ ation functions used to pro duce the output of the avoid this problem for many models. Man Many y output units inv involv olv olvee an exp function hidden units or the output units saturate. The negativ e log-likelihoo d helps to that can saturate when its argumen argumentt is very negative. The log function in the a void this problem for many models. Man y output units inv olv e an function exp negativ negativee log-lik log-likeliho eliho elihooo d cost function undo undoes es the exp of some output units. We will logoffunction that can saturate when its argumen t is very negative. in the discuss the interaction b et etwe we ween en the cost function and the The choice output unit in negativ e log-lik eliho o d cost function undo es the exp of some output units. We will Sec. 6.2.2 . discuss the interaction b etween the cost function and the choice of output unit in One un unusual usual prop propert ert erty y of the cross-entrop cross-entropy y cost used to p erform maximum Sec. 6.2.2. lik likeliho eliho elihooo d estimation is that it usually do does es not ha have ve a minimum value when applied One un usual prop ert y of the cross-entrop y usedoutput to p erform maximum to the mo models dels commonly used in practice. For cost discrete variables, most likeliho d estimation is that usually do notthey havecannot a minimum value awhen applied mo models delsoare parametrized initsuch a wa way y es that represent probability to zero the mo usedarbitrarily in practice.close Fortodiscrete output variables, most of or dels one,commonly but can come doing so. Logistic regression mo dels are parametrized in such a wa y that they cannot represent a probability is an example of such a mo model. del. For real-v real-valued alued output variables, if the mo model del of zero or one, but can come arbitrarily close to doing so. 
Logistic regression is an example of such a model. For real-valued output variables, if the model
can control the density of the output distribution (for example, by learning the variance parameter of a Gaussian output distribution) then it becomes possible to assign extremely high density to the correct training set outputs, resulting in cross-entropy approaching negative infinity. Regularization techniques described in Chapter 7 provide several different ways of modifying the learning problem so that the model cannot reap unlimited reward in this way.

6.2.1.2 Learning Conditional Statistics

Instead of learning a full probability distribution p(y | x; θ) we often want to learn just one conditional statistic of y given x.

For example, we may have a predictor f(x; θ) that we wish to use to predict the mean of y.
From this class pointofoffunctions, view, we with this class b eing limited only by features suc h as contin uit y and b oundedness can view the cost function as being a functional rather than just a function. A rather thanisbay mapping having a from specific parametric form. Fbrom pointth of functional functions to real num numb ers. this We can thus us view, think we of can viewas the cost function as being a than functional rather thanajust a function. A learning choosing a function rather merely choosing set of parameters. functional is a mapping from functions toe real numb ers. occur We can us think of W e can design our cost functional to hav have its minimum at th some sp specific ecific learning as rather thandesign merely choosing a set of parameters. function wechoosing desire. Faorfunction example, we can the cost functional to hav havee its W e can design our cost functional to hav e its minimum occur at some spen ecific x. minim minimum um lie on the function that maps x to the exp expected ected value of y giv given functionanwoptimization e desire. Forproblem example, weresp canect design the costrequires functional to have its Solving with respect to a function a mathematical x to y given to x. minim um lie on theoffunction that mapsed the exp ected alue to tool ol called calculus variations variations, , describ described in Sec. 19.4.2 . Itvis notofnecessary Solving an optimization with ect to a function requires a mathematical understand calculus of problem variations to resp understand the conten content t of this chapter. A Att to ol called c alculus of variations , describ ed in Sec. 19.4.2 . It is not necessary the moment, it is only necessary to understand that calculus of variations ma may y btoe understand calculus of variations to understand the content of this chapter. At used to derive the following two results. 
the moment, it is only necessary to understand that calculus of variations may be used to derive the following two results.

Our first result derived using calculus of variations is that solving the optimization problem

    f* = arg min_f E_{x,y∼p_data} ||y − f(x)||²    (6.14)

yields

    f*(x) = E_{y∼p_data(y|x)}[y],    (6.15)

so long as this function lies within the class we optimize over. In other words, if we could train on infinitely many samples from the true data-generating distribution, minimizing the mean squared error cost function gives a function that predicts the mean of y for each value of x.
Different cost functions give different statistics. A second result derived using calculus of variations is that

    f* = arg min_f E_{x,y∼p_data} ||y − f(x)||_1    (6.16)

yields a function that predicts the median value of y for each x, so long as such a function may be described by the family of functions we optimize over. This cost function is commonly called mean absolute error.

Unfortunately, mean squared error and mean absolute error often lead to poor results when used with gradient-based optimization. Some output units that saturate produce very small gradients when combined with these cost functions. This is one reason that the cross-entropy cost function is more popular than mean squared error or mean absolute error, even when it is not necessary to estimate an entire distribution p(y | x).
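Both variational results can be checked empirically with a one-dimensional search. In the sketch below (illustrative; the data are made up and include one outlier), candidate constant predictions are scanned over a grid: squared error is minimized at the sample mean, absolute error at the sample median:

```python
import numpy as np

y = np.array([0.0, 0.0, 0.0, 10.0])      # toy sample with an outlier
c_grid = np.linspace(0.0, 10.0, 1001)    # candidate predictions, step 0.01

# Average squared and absolute error of each candidate constant prediction
mse = ((y[None, :] - c_grid[:, None]) ** 2).mean(axis=1)
mae = np.abs(y[None, :] - c_grid[:, None]).mean(axis=1)

c_mse = c_grid[mse.argmin()]  # minimizer of squared error
c_mae = c_grid[mae.argmin()]  # minimizer of absolute error
print(c_mse, y.mean())        # squared error recovers the mean, 2.5
print(c_mae, np.median(y))    # absolute error recovers the median, 0.0
```

The outlier pulls the squared-error minimizer toward it (the mean) while leaving the absolute-error minimizer at the median, which is one reason mean absolute error is regarded as more robust.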
The choice of how to represent output then determines of the time, w e simply use the cross-entrop y b et ween the data distribution and the the form of the cross-entrop cross-entropy y function. mo del distribution. The choice of how to represent the output then determines Any y kind of cross-entrop neural netw network unit that may b e used as an output can also b e the An form of the york function. used as a hidden unit. Here, we fo focus cus on the use of these units as outputs of the An y kind of neural netw ork unit thatinternally may b e used as an also be mo model, del, but in principle they can b e used as well. Weoutput revisit can these units used as a hidden unit. Here, we fo cus on the use of these units as outputs of the with additional detail ab about out their use as hidden units in Sec. 6.3. mo del, but in principle they can b e used internally as well. We revisit these units Throughout this section, we supp suppose ose that the feedforw feedforward ard net network work pro provides vides a with additional detail ab out their use as hidden units in Sec. 6.3. set of hidden features defined by h = f (x; θ). The role of the output lay layer er is then Throughout this section, we supp ose that the feedforw ard net work pro vides a to provide some additional transformation from the features to complete the task set ofthe hidden features by h = f (x; θ). The role of the output layer is then that netw network ork mustdefined p erform. to provide some additional transformation from the features to complete the task that the network must p erform. One simple kind of output unit is an output unit based on an affine transformation with no nonlinearity nonlinearity.. These are often just called linear units. One simple kind of output unit is an output unit based on an affine transformation Given features h,.aThese la layer yer of linear units linear pro produces duces a vector yˆ = W > h+ b. withGiv noennonlinearity are oftenoutput just called units. 
Linear output layers are often used to produce the mean of a conditional Gaussian distribution:

    p(y | x) = N(y; ŷ, I).        (6.17)
CHAPTER 6. DEEP FEEDFORWARD NETWORKS
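As a numerical sketch of the linear-Gaussian output above (all shapes and values here are made up), the per-example negative log-likelihood of N(y; ŷ, I) is half the squared error plus a constant, so the two objectives share the same minimizer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical network: 5 examples, 3 hidden features, 2 output dimensions.
h = rng.normal(size=(5, 3))      # hidden features from earlier layers
W = rng.normal(size=(3, 2))      # linear output layer weights
b = np.zeros(2)                  # linear output layer biases
y = rng.normal(size=(5, 2))      # regression targets

y_hat = h @ W + b                # linear output units: y_hat = W^T h + b

d = y.shape[1]
# Per-example negative log-likelihood of N(y; y_hat, I):
nll = 0.5 * np.sum((y - y_hat) ** 2, axis=1) + 0.5 * d * np.log(2 * np.pi)
# Per-example half sum-of-squares error for the same predictions:
half_sse = 0.5 * np.sum((y - y_hat) ** 2, axis=1)

# They differ only by a constant, so gradients w.r.t. y_hat are identical.
assert np.allclose(nll - half_sse, 0.5 * d * np.log(2 * np.pi))
```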
Maximizing the log-likelihood is then equivalent to minimizing the mean squared error.

The maximum likelihood framework makes it straightforward to learn the covariance of the Gaussian too, or to make the covariance of the Gaussian be a function of the input. However, the covariance must be constrained to be a positive definite matrix for all inputs. It is difficult to satisfy such constraints with a linear output layer, so typically other output units are used to parametrize the covariance. Approaches to modeling the covariance are described shortly, in Sec. 6.2.2.4.

Because linear units do not saturate, they pose little difficulty for gradient-based optimization algorithms and may be used with a wide variety of optimization algorithms.

6.2.2.2 Sigmoid Units for Bernoulli Output Distributions

Many tasks require predicting the value of a binary variable y. Classification problems with two classes can be cast in this form.

The maximum-likelihood approach is to define a Bernoulli distribution over y conditioned on x.

A Bernoulli distribution is defined by just a single number. The neural net needs to predict only P(y = 1 | x). For this number to be a valid probability, it must lie in the interval [0, 1].

Satisfying this constraint requires some careful design effort. Suppose we were to use a linear unit, and threshold its value to obtain a valid probability:

    P(y = 1 | x) = max{0, min{1, wᵀh + b}}.        (6.18)
This would indeed define a valid conditional distribution, but we would not be able to train it very effectively with gradient descent. Any time that wᵀh + b strayed outside the unit interval, the gradient of the output of the model with respect to its parameters would be 0. A gradient of 0 is typically problematic because the learning algorithm no longer has a guide for how to improve the corresponding parameters.

Instead, it is better to use a different approach that ensures there is always a strong gradient whenever the model has the wrong answer. This approach is based on using sigmoid output units combined with maximum likelihood.

A sigmoid output unit is defined by

    ŷ = σ(wᵀh + b)        (6.19)
where σ is the logistic sigmoid function described in Sec. 3.10.

We can think of the sigmoid output unit as having two components. First, it uses a linear layer to compute z = wᵀh + b. Next, it uses the sigmoid activation function to convert z into a probability.

We omit the dependence on x for the moment to discuss how to define a probability distribution over y using the value z. The sigmoid can be motivated by constructing an unnormalized probability distribution P̃(y), which does not sum to 1. We can then divide by an appropriate constant to obtain a valid probability distribution. If we begin with the assumption that the unnormalized log probabilities are linear in y and z, we can exponentiate to obtain the unnormalized probabilities. We then normalize to see that this yields a Bernoulli distribution controlled by a sigmoidal transformation of z:
    log P̃(y) = yz        (6.20)
    P̃(y) = exp(yz)        (6.21)
    P(y) = exp(yz) / Σ_{y′=0}^{1} exp(y′z)        (6.22)
    P(y) = σ((2y − 1)z).        (6.23)

Probability distributions based on exponentiation and normalization are common throughout the statistical modeling literature. The z variable defining such a distribution over binary variables is called a logit.

This approach to predicting the probabilities in log-space is natural to use with maximum likelihood learning. Because the cost function used with maximum likelihood is −log P(y | x), the log in the cost function undoes the exp of the sigmoid. Without this effect, the saturation of the sigmoid could prevent gradient-based learning from making good progress. The loss function for maximum likelihood learning of a Bernoulli parametrized by a sigmoid is

    J(θ) = −log P(y | x)        (6.24)
         = −log σ((2y − 1)z)        (6.25)
         = ζ((1 − 2y)z).        (6.26)
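The identities in Eqs. 6.20–6.26 can be checked numerically; a small sketch (helper functions written out by hand, logits arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    # zeta(x) = log(1 + exp(x)), computed stably via log-add-exp
    return np.logaddexp(0.0, x)

# Check the identities for some arbitrary logits z and both values of y.
z = np.array([-3.0, -0.5, 0.0, 2.0])
for y in (0, 1):
    # Eq. 6.22: normalize the unnormalized probabilities exp(y'z) over y' in {0, 1}
    p_normalized = np.exp(y * z) / (np.exp(0 * z) + np.exp(1 * z))
    # Eq. 6.23: the same distribution via a sigmoidal transformation of z
    p_sigmoid = sigmoid((2 * y - 1) * z)
    assert np.allclose(p_normalized, p_sigmoid)
    # Eqs. 6.24-6.26: the negative log-likelihood is the softplus of (1 - 2y)z
    assert np.allclose(-np.log(p_sigmoid), softplus((1 - 2 * y) * z))
```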
This derivation makes use of some properties from Sec. 3.10. By rewriting the loss in terms of the softplus function, we can see that it saturates only when (1 − 2y)z is very negative. Saturation thus occurs only when the model already has the right answer—when y = 1 and z is very positive, or y = 0 and z is very negative. When z has the wrong sign, the argument to the softplus function, (1 − 2y)z, may be simplified to |z|. As |z| becomes large while z has the wrong sign, the softplus function asymptotes toward simply returning its argument |z|. The derivative with respect to z asymptotes to sign(z), so, in the limit of extremely incorrect z, the softplus function does not shrink the gradient at all. This property is very useful because it means that gradient-based learning can act to quickly correct a mistaken z.

When we use other loss functions, such as mean squared error, the loss can saturate anytime σ(z) saturates. The sigmoid activation function saturates to 0 when z becomes very negative and saturates to 1 when z becomes very positive. The gradient can shrink too small to be useful for learning whenever this happens, whether the model has the correct answer or the incorrect answer. For this reason, maximum likelihood is almost always the preferred approach to training sigmoid output units.
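The contrast can be sketched numerically. Differentiating each loss with respect to z gives σ(z) − y for the cross-entropy ζ((1 − 2y)z), and (σ(z) − y)·σ(z)·(1 − σ(z)) for the squared error 0.5(σ(z) − y)²; the extra factor vanishes when the sigmoid saturates (the values below are made up):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Compare gradients w.r.t. the logit z for a positive example (y = 1)
# whose logit is extremely wrong.
y = 1.0
z = -10.0

# Cross-entropy zeta((1 - 2y) z):  d/dz = sigmoid(z) - y
ce_grad = sigmoid(z) - y
# Squared error 0.5 * (sigmoid(z) - y)**2:
#   d/dz = (sigmoid(z) - y) * sigmoid(z) * (1 - sigmoid(z))
mse_grad = (sigmoid(z) - y) * sigmoid(z) * (1.0 - sigmoid(z))

assert abs(ce_grad + 1.0) < 1e-3     # strong signal: gradient is nearly -1
assert abs(mse_grad) < 1e-3          # vanished: the sigmoid has saturated
```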
Analytically, the logarithm of the sigmoid is always defined and finite, because the sigmoid returns values restricted to the open interval (0, 1), rather than using the entire closed interval of valid probabilities [0, 1]. In software implementations, to avoid numerical problems, it is best to write the negative log-likelihood as a function of z, rather than as a function of ŷ = σ(z). If the sigmoid function underflows to zero, then taking the logarithm of ŷ yields negative infinity.

6.2.2.3 Softmax Units for Multinoulli Output Distributions

Any time we wish to represent a probability distribution over a discrete variable with n possible values, we may use the softmax function. This can be seen as a generalization of the sigmoid function which was used to represent a probability distribution over a binary variable.

Softmax functions are most often used as the output of a classifier, to represent the probability distribution over n different classes. More rarely, softmax functions can be used inside the model itself, if we wish the model to choose between one of n options for some internal variable.

In the case of binary variables, we wished to produce a single number

    ŷ = P(y = 1 | x).        (6.27)

Because this number needed to lie between 0 and 1, and because we wanted the logarithm of the number to be well-behaved for gradient-based optimization of the log-likelihood, we chose to instead predict a number z = log P̃(y = 1 | x). Exponentiating and normalizing gave us a Bernoulli distribution controlled by the sigmoid function.
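A minimal illustration of the numerical point above (function names are my own): the loss computed directly from the logit z via the softplus identity stays finite even when σ(z) underflows to zero:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll_from_yhat(y, z):
    # Naive: take the logarithm of the sigmoid output y_hat = sigma(z).
    return -np.log(sigmoid(z) if y == 1 else 1.0 - sigmoid(z))

def nll_from_z(y, z):
    # Stable: the softplus form zeta((1 - 2y) z), written in terms of z itself.
    return np.logaddexp(0.0, (1 - 2 * y) * z)

z = -800.0   # sigmoid(z) underflows to exactly 0.0 in float64
with np.errstate(over='ignore', divide='ignore'):
    assert np.isinf(nll_from_yhat(1, z))     # log(0) gives an infinite loss
assert np.isclose(nll_from_z(1, z), 800.0)   # finite, and equal to |z| here
```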
To generalize to the case of a discrete variable with n values, we now need to produce a vector ŷ, with ŷ_i = P(y = i | x). We require not only that each element of ŷ be between 0 and 1, but also that the entire vector sums to 1 so that it represents a valid probability distribution. The same approach that worked for the Bernoulli distribution generalizes to the multinoulli distribution. First, a linear layer predicts unnormalized log probabilities:

    z = Wᵀh + b,        (6.28)

where z_i = log P̃(y = i | x). The softmax function can then exponentiate and normalize z to obtain the desired ŷ. Formally, the softmax function is given by

    softmax(z)_i = exp(z_i) / Σ_j exp(z_j).        (6.29)

As with the logistic sigmoid, the use of the exp function works very well when training the softmax to output a target value y using maximum log-likelihood. In this case, we wish to maximize log P(y = i; z) = log softmax(z)_i. Defining the softmax in terms of exp is natural because the log in the log-likelihood can undo the exp of the softmax:

    log softmax(z)_i = z_i − log Σ_j exp(z_j).        (6.30)
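A small numpy sketch of Eqs. 6.29 and 6.30 (toy logits, function names my own):

```python
import numpy as np

def softmax(z):
    # Eq. 6.29: exponentiate and normalize.
    e = np.exp(z)
    return e / e.sum()

def log_softmax(z):
    # Eq. 6.30: z_i minus the log of the normalizer.
    return z - np.log(np.exp(z).sum())

z = np.array([5.0, 1.0, -2.0])    # arbitrary unnormalized log probabilities
p = softmax(z)

assert np.isclose(p.sum(), 1.0)                # a valid probability distribution
assert np.allclose(np.log(p), log_softmax(z))  # the log undoes the exp

# The normalizer log sum_j exp(z_j) is close to max_j z_j when one logit
# dominates, so a correct, confident prediction (class 0 here) adds little loss:
lse = np.log(np.exp(z).sum())
assert abs(lse - z.max()) < 0.05
assert -log_softmax(z)[0] < 0.05
```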
The first term of Eq. 6.30 shows that the input z_i always has a direct contribution to the cost function. Because this term cannot saturate, we know that learning can proceed, even if the contribution of z_i to the second term of Eq. 6.30 becomes very small. When maximizing the log-likelihood, the first term encourages z_i to be pushed up, while the second term encourages all of z to be pushed down. To gain some intuition for the second term, log Σ_j exp(z_j), observe that this term can be roughly approximated by max_j z_j. This approximation is based on the idea that exp(z_k) is insignificant for any z_k that is noticeably less than max_j z_j. The intuition we can gain from this approximation is that the negative log-likelihood cost function always strongly penalizes the most active incorrect prediction. If the correct answer already has the largest input to the softmax, then the −z_i term and the log Σ_j exp(z_j) ≈ max_j z_j = z_i terms will roughly cancel. This example will then contribute little to the overall training cost, which will be dominated by other examples that are not yet correctly classified.

So far we have discussed only a single example. Overall, unregularized maximum likelihood will drive the model to learn parameters that drive the softmax to predict the fraction of counts of each outcome observed in the training set:

    softmax(z(x; θ))_i ≈ (Σ_{j=1}^m 1[y^(j) = i, x^(j) = x]) / (Σ_{j=1}^m 1[x^(j) = x]).        (6.31)

Because maximum likelihood is a consistent estimator, this is guaranteed to happen so long as the model family is capable of representing the training distribution. In practice, limited model capacity and imperfect optimization will mean that the model is only able to approximate these fractions.

Many objective functions other than the log-likelihood do not work as well with the softmax function. Specifically, objective functions that do not use a log to undo the exp of the softmax fail to learn when the argument to the exp becomes very negative, causing the gradient to vanish. In particular, squared error is a poor loss function for softmax units, and can fail to train the model to change its output, even when the model makes highly confident incorrect predictions (Bridle, 1990). To understand why these other loss functions can fail, we need to examine the softmax function itself.

Like the sigmoid, the softmax activation can saturate. The sigmoid function has a single output that saturates when its input is extremely negative or extremely positive. In the case of the softmax, there are multiple output values. These output values can saturate when the differences between input values become extreme. When the softmax saturates, many cost functions based on the softmax also saturate, unless they are able to invert the saturating activation function.

To see that the softmax function responds to the difference between its inputs, observe that the softmax output is invariant to adding the same scalar to all of its inputs:

    softmax(z) = softmax(z + c).        (6.32)

Using this property, we can derive a numerically stable variant of the softmax:

    softmax(z) = softmax(z − max_i z_i).        (6.33)
The reformulated version allows us to evaluate softmax with only small numerical errors even when z contains extremely large or extremely negative numbers. Examining the numerically stable variant, we see that the softmax function is driven by the amount that its arguments deviate from max_i z_i.

An output softmax(z)_i saturates to 1 when the corresponding input is maximal (z_i = max_i z_i) and z_i is much greater than all of the other inputs. The output softmax(z)_i can also saturate to 0 when z_i is not maximal and the maximum is much greater. This is a generalization of the way that sigmoid units saturate, and can cause similar difficulties for learning if the loss function is not designed to compensate for it.

The argument z to the softmax function can be produced in two different ways. The most common is simply to have an earlier layer of the neural network output every element of z, as described above using the linear layer z = Wᵀh + b. While straightforward, this approach actually overparametrizes the distribution. The constraint that the n outputs must sum to 1 means that only n − 1 parameters are necessary; the probability of the n-th value may be obtained by subtracting the first n − 1 probabilities from 1. We can thus impose a requirement that one element of z be fixed. For example, we can require that z_n = 0. Indeed, this is exactly what the sigmoid unit does. Defining P(y = 1 | x) = σ(z) is equivalent to defining P(y = 1 | x) = softmax(z)_1 with a two-dimensional z and z_1 = 0. Both the n − 1 argument and the n argument approaches to the softmax can describe the same set of probability distributions, but have different learning dynamics. In practice, there is rarely much difference between using the overparametrized version or the restricted version, and it is simpler to implement the overparametrized version.

From a neuroscientific point of view, it is interesting to think of the softmax as a way to create a form of competition between the units that participate in it: the softmax outputs always sum to 1 so an increase in the value of one unit necessarily corresponds to a decrease in the value of others. This is analogous to the lateral inhibition that is believed to exist between nearby neurons in the cortex.
At the magnitude) it b ecomes a form of winner-take-al winner-take-alll (one of the outputs is nearly 1 extreme (when the difference b et ween the maximal a and the others is large in and the others are nearly 0). magnitude) it b ecomes a form of winner-take-al l (one of the outputs is nearly 1 The name “softmax” can b e somewhat confusing. The function is more closely and the others are nearly 0). related to the argmax function than the max function. The term “soft” derives The can b e somewhat Theand function is more closely from thename fact “softmax” that the softmax function confusing. is con continuous tinuous differentiable. The related to the argmax function than the max function. The term “soft” derives argmax function, with its result represented as a one-hot vector, is not con continuous tinuous from the fact that the softmax function is con tinuous and differentiable. or differentiable. The softmax function thus pro provides vides a “softened” version ofThe the argmax function, with its result represented a one-hotfunction vector, isis not continuous softmax softmax( (z ) > z. argmax. The corresp corresponding onding soft version of theasmaximum or would differentiable. softmax provides a “softened” versionbut of the the It p erhapsThe b e better to function call the thus softmax function “softargmax,” softmax ( z ) z. argmax. The corresp onding softcon version of the maximum function is curren current t name is an en entrenched trenched conven ven vention. tion. It would p erhaps b e better to call the softmax function “softargmax,” but the current name is an entrenched convention. The linear, sigmoid, and softmax output units describ described ed ab abo ove are the most common. Neural net networks works can generalize to almost any kind of output lay layer er that The linear, sigmoid, and softmax output units describ ed ab o ve are the most we wish. The principle of maximum likelihoo likelihood d pro provides vides a guide for how to design common. 
Neural networks can generalize to almost any kind of output layer that we wish. The principle of maximum likelihoo d provides a guide for how to design 186
In general, if we define a conditional distribution p(y | x; θ), the principle of maximum likelihood suggests we use −log p(y | x; θ) as our cost function.

In general, we can think of the neural network as representing a function f(x; θ). The outputs of this function are not direct predictions of the value y. Instead, f(x; θ) = ω provides the parameters for a distribution over y. Our loss function can then be interpreted as −log p(y; ω(x)).

For example, we may wish to learn the variance of a conditional Gaussian for y, given x. In the simple case, where the variance σ² is a constant, there is a closed form expression because the maximum likelihood estimator of variance is simply the empirical mean of the squared difference between observations y and their expected value.
A computationally more expensive approach that does not require writing special-case code is to simply include the variance as one of the properties of the distribution p(y | x) that is controlled by ω = f(x; θ). The negative log-likelihood −log p(y; ω(x)) will then provide a cost function with the appropriate terms necessary to make our optimization procedure incrementally learn the variance. In the simple case where the standard deviation does not depend on the input, we can make a new parameter in the network that is copied directly into ω. This new parameter might be σ itself or could be a parameter v representing σ², or it could be a parameter β representing 1/σ², depending on how we choose to parametrize the distribution. We may wish our model to predict a different amount of variance in y for different values of x. This is called a heteroscedastic model.
In the heteroscedastic case, we simply make the specification of the variance be one of the values output by f(x; θ). A typical way to do this is to formulate the Gaussian distribution using precision, rather than variance, as described in Eq. 3.22. In the multivariate case it is most common to use a diagonal precision matrix

diag(β).    (6.34)

This formulation works well with gradient descent because the formula for the log-likelihood of the Gaussian distribution parametrized by β involves only multiplication by β_i and addition of log β_i. The gradient of multiplication, addition, and logarithm operations is well-behaved. By comparison, if we parametrized the output in terms of variance, we would need to use division. The division function becomes arbitrarily steep near zero. While large gradients can help learning, arbitrarily large gradients usually result in instability. If we parametrized the
output in terms of standard deviation, the log-likelihood would still involve division, and would also involve squaring. The gradient through the squaring operation can vanish near zero, making it difficult to learn parameters that are squared.
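The contrast between the two parametrizations can be sketched numerically. This is a minimal numpy illustration, not code from the book; the function names are our own:

```python
import numpy as np

def gaussian_nll_precision(y, mu, beta):
    """Negative log-likelihood of a diagonal Gaussian parametrized by
    precision beta = 1/sigma^2. The expression involves only
    multiplication by beta and addition of log(beta): no division."""
    return 0.5 * np.sum(beta * (y - mu) ** 2 - np.log(beta) + np.log(2 * np.pi))

def gaussian_nll_variance(y, mu, var):
    """The same quantity parametrized by variance: the division by `var`
    becomes arbitrarily steep as var approaches 0."""
    return 0.5 * np.sum((y - mu) ** 2 / var + np.log(var) + np.log(2 * np.pi))
```

The two functions compute the same value when var = 1/beta; the difference is only in which operations the gradient must flow through.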
CHAPTER 6. DEEP FEEDFORWARD NETWORKS
Regardless of whether we use standard deviation, variance, or precision, we must ensure that the covariance matrix of the Gaussian is positive definite. Because the eigenvalues of the precision matrix are the reciprocals of the eigenvalues of the covariance matrix, this is equivalent to ensuring that the precision matrix is positive definite. If we use a diagonal matrix, or a scalar times the diagonal matrix, then the only condition we need to enforce on the output of the model is positivity. If we suppose that a is the raw activation of the model used to determine the diagonal precision, we can use the softplus function to obtain a positive precision vector: β = ζ(a). This same strategy applies equally if using variance or standard deviation rather than precision or if using a scalar times identity rather than a diagonal matrix.

It is rare to learn a covariance or precision matrix with richer structure than diagonal.
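The softplus mapping β = ζ(a) can be sketched as follows (an illustrative numpy fragment with variable names of our own choosing; the rewritten form avoids overflow for large |a|):

```python
import numpy as np

def softplus(a):
    """zeta(a) = log(1 + exp(a)): smooth and strictly positive.
    Numerically stable form: max(a, 0) + log1p(exp(-|a|))."""
    return np.maximum(a, 0.0) + np.log1p(np.exp(-np.abs(a)))

# Hypothetical raw activations from a network's precision head:
a = np.array([-3.0, 0.0, 4.0])
beta = softplus(a)          # valid diagonal precision: every entry > 0
```

Because softplus is positive everywhere, no constraint needs to be enforced explicitly during optimization.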
If the covariance is full and conditional, then a parametrization must be chosen that guarantees positive-definiteness of the predicted covariance matrix. This can be achieved by writing Σ(x) = B(x)B⊤(x), where B is an unconstrained square matrix. One practical issue if the matrix is full rank is that computing the likelihood is expensive, with a d × d matrix requiring O(d³) computation for the determinant and inverse of Σ(x) (or equivalently, and more commonly done, its eigendecomposition or that of B(x)).

We often want to perform multimodal regression, that is, to predict real values that come from a conditional distribution p(y | x) that can have several different peaks in y space for the same value of x. In this case, a Gaussian mixture is a natural representation for the output (Jacobs et al., 1991; Bishop, 1994).
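The parametrization Σ(x) = B(x)B⊤(x) described above can be checked numerically. This is a toy numpy sketch with dimensions of our own choosing; BB⊤ is always symmetric positive semi-definite, and positive definite whenever B has full rank:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((3, 3))   # unconstrained square matrix
Sigma = B @ B.T                   # symmetric positive semi-definite by construction

eigvals = np.linalg.eigvalsh(Sigma)
```

Every eigenvalue of Sigma is non-negative regardless of the entries of B, which is exactly why no constraint on B is needed.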
Neural networks with Gaussian mixtures as their output are often called mixture density networks. A Gaussian mixture output with n components is defined by the conditional probability distribution

p(y | x) = Σ_{i=1}^{n} p(c = i | x) N(y; μ^(i)(x), Σ^(i)(x)).    (6.35)
The neural network must have three outputs: a vector defining p(c = i | x), a matrix providing μ^(i)(x) for all i, and a tensor providing Σ^(i)(x) for all i. These outputs must satisfy different constraints:

1. Mixture components p(c = i | x): these form a multinoulli distribution over the n different components associated with latent variable c (we consider c to be latent because we do not observe it in the data: given input x and target y, it is not possible to know with certainty which Gaussian component was responsible for y, but we can imagine that y was generated by picking one of them, and make that unobserved choice a random variable), and can
typically be obtained by a softmax over an n-dimensional vector, to guarantee that these outputs are positive and sum to 1.

2. Means μ^(i)(x): these indicate the center or mean associated with the i-th Gaussian component, and are unconstrained (typically with no nonlinearity at all for these output units). If y is a d-vector, then the network must output an n × d matrix containing all n of these d-dimensional vectors. Learning these means with maximum likelihood is slightly more complicated than learning the means of a distribution with only one output mode. We only want to update the mean for the component that actually produced the observation. In practice, we do not know which component produced each observation. The expression for the
negative log-likelihood naturally weights each example's contribution to the loss for each component by the probability that the component produced the example.

3. Covariances Σ^(i)(x): these specify the covariance matrix for each component i. As when learning a single Gaussian component, we typically use a diagonal matrix to avoid needing to compute determinants. As with learning the means of the mixture, maximum likelihood is complicated by needing to assign partial responsibility for each point to each mixture component. Gradient descent will automatically follow the correct process if given the correct specification of the negative log-likelihood under the mixture model.
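The three constraints above can be sketched in a minimal mixture-density negative log-likelihood. This is an illustrative numpy fragment showing one possible parametrization, not code from the book: mixture weights come from a (log-)softmax, means are unconstrained, raw scale activations are mapped to positive variances through softplus, and the sum over components is stabilized with log-sum-exp:

```python
import numpy as np

def softplus(a):
    return np.maximum(a, 0.0) + np.log1p(np.exp(-np.abs(a)))

def mdn_nll(y, logits, mu, raw_scale):
    """Negative log-likelihood of a diagonal-Gaussian mixture output.
    y: (d,) target; logits: (n,) unnormalized mixture weights;
    mu: (n, d) component means (unconstrained);
    raw_scale: (n, d) raw activations mapped to positive variances."""
    log_pi = logits - np.log(np.sum(np.exp(logits)))   # log softmax weights
    var = softplus(raw_scale)                          # positive variances
    # log N(y; mu_i, diag(var_i)) for each component i
    log_comp = -0.5 * np.sum(
        np.log(2 * np.pi * var) + (y - mu) ** 2 / var, axis=1)
    a = log_pi + log_comp                              # joint log-prob per component
    m = np.max(a)                                      # stabilized log-sum-exp
    return -(m + np.log(np.sum(np.exp(a - m))))
```

Note how each component's squared error enters the loss weighted (inside the log-sum-exp) by that component's probability, which is exactly the soft responsibility assignment described in the text.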
It has been reported that gradient-based optimization of conditional Gaussian mixtures (on the output of neural networks) can be unreliable, in part because one gets divisions (by the variance) which can be numerically unstable (when some variance gets to be small for a particular example, yielding very large gradients). One solution is to clip gradients (see Sec. 10.11.1) while another is to scale the gradients heuristically (Murray and Larochelle, 2014).

Gaussian mixture outputs are particularly effective in generative models of speech (Schuster, 1999) or movements of physical objects (Graves, 2013). The mixture density strategy gives a way for the network to represent multiple output modes and to control the variance of its output, which is crucial for obtaining a high degree of quality in these real-valued domains. An example of a mixture
density network is shown in Fig. 6.4.

In general, we may wish to continue to model larger vectors y containing more variables, and to impose richer and richer structures on these output variables. For example, we may wish for our neural network to output a sequence of characters that forms a sentence. In these cases, we may continue to use the principle of maximum likelihood applied to our model p(y; ω(x)), but the model we use
Figure 6.4: Samples drawn from a neural network with a mixture density output layer. The input x is sampled from a uniform distribution and the output y is sampled from p_model(y | x). The neural network is able to learn nonlinear mappings from the input to the parameters of the output distribution. These parameters include the probabilities governing which of three mixture components will generate the output as well as the parameters for each mixture component. Each mixture component is Gaussian with predicted mean and variance. All of these aspects of the output distribution are able to vary with respect to the input x, and to do so in nonlinear ways.
to describe y becomes complex enough to be beyond the scope of this chapter. Chapter 10 describes how to use recurrent neural networks to define such models over sequences, and Part III describes advanced techniques for modeling arbitrary probability distributions.
6.3 Hidden Units
So far we have focused our discussion on design choices for neural networks that are common to most parametric machine learning models trained with gradient-based optimization. Now we turn to an issue that is unique to feedforward networks: how to choose the type of hidden unit to use in the hidden layers of the model.

The design of hidden units is an extremely active area of research and does not yet have many definitive guiding theoretical principles.

Rectified linear units are an excellent default choice of hidden unit. Many other types of hidden units are available. It can be difficult to determine when to use which kind (though rectified linear units are usually an acceptable choice). We
describe here some of the basic intuitions motivating each type of hidden unit. These intuitions can be used to suggest when to try out each of these units. It is usually impossible to predict in advance which will work best. The design process consists of trial and error, intuiting that a kind of hidden unit may work well, and then training a network with that kind of hidden unit and evaluating its performance on a validation set.

Some of the hidden units included in this list are not actually differentiable at all input points. For example, the rectified linear function g(z) = max{0, z} is not differentiable at z = 0. This may seem like it invalidates g for use with a gradient-based learning algorithm. In practice, gradient descent still performs well enough for these models to be used for machine learning tasks. This is in part because neural network training algorithms do not usually arrive at a local minimum of
the cost function, but instead merely reduce its value significantly, as shown in Fig. 4.3. These ideas will be described further in Chapter 8. Because we do not expect training to actually reach a point where the gradient is 0, it is acceptable for the minima of the cost function to correspond to points with undefined gradient. Hidden units that are not differentiable are usually non-differentiable at only a small number of points. In general, a function g(z) has a left derivative defined by the slope of the function immediately to the left of z and a right derivative defined by the slope of the function immediately to the right of z. A function is differentiable at z only if both the left derivative and the right derivative are defined and equal to each other. The functions used in the context of neural networks usually have defined left derivatives and defined right derivatives.
In the case of g(z) = max{0, z}, the left derivative at z = 0 is 0 and the right derivative is 1. Software implementations of neural network training usually return one of the one-sided derivatives rather than reporting that the derivative is undefined or raising an error. This may be heuristically justified by observing that gradient-based optimization on a digital computer is subject to numerical error anyway. When a function is asked to evaluate g(0), it is very unlikely that the underlying value truly was 0. Instead, it was likely to be some small value that was rounded to 0. In some contexts, more theoretically pleasing justifications are available, but these usually do not apply to neural network training. The important point is that
in practice one can safely disregard the non-differentiability of the hidden unit activation functions described below.

Unless indicated otherwise, most hidden units can be described as accepting a vector of inputs x, computing an affine transformation z = W⊤x + b, and then applying an element-wise nonlinear function g(z). Most hidden units are distinguished from each other only by the choice of the form of the activation function g(z).
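Both points above, the generic affine-plus-nonlinearity recipe and the convention of returning a one-sided derivative at the rectifier's kink, can be sketched as follows. This is an illustrative numpy fragment, not code from the book:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    """Derivative of the rectifier. At z == 0 the true derivative is
    undefined; like most software implementations, we silently return
    one of the one-sided derivatives (here the left derivative, 0)."""
    return np.where(z > 0, 1.0, 0.0)

def hidden_layer(x, W, b, g=relu):
    """Generic hidden unit: affine transformation z = W^T x + b,
    followed by an element-wise nonlinearity g."""
    return g(W.T @ x + b)

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))   # 4 inputs -> 3 hidden units
x = rng.standard_normal(4)
b = np.zeros(3)
h = hidden_layer(x, W, b)
```

Only the choice of g would change from one hidden-unit type to another; the affine part is shared by all of them.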
Rectified linear units use the activation function g(z) = max{0, z}.
Rectified linear units are easy to optimize because they are so similar to linear units. The only difference between a linear unit and a rectified linear unit is that a rectified linear unit outputs zero across half its domain. This makes the derivatives through a rectified linear unit remain large whenever the unit is active. The gradients are not only large but also consistent. The second derivative of the rectifying operation is 0 almost everywhere, and the derivative of the rectifying operation is 1 everywhere that the unit is active. This means that the gradient direction is far more useful for learning than it would be with activation functions that introduce second-order effects.

Rectified linear units are typically used on top of an affine transformation:

h = g(W⊤x + b).    (6.36)
When initializing the parameters of the affine transformation, it can be a good practice to set all elements of b to a small, positive value, such as 0.1. This makes it very likely that the rectified linear units will be initially active for most inputs in the training set and allow the derivatives to pass through.

Several generalizations of rectified linear units exist. Most of these generalizations perform comparably to rectified linear units and occasionally perform better.

One drawback to rectified linear units is that they cannot learn via gradient-based methods on examples for which their activation is zero. A variety of generalizations of rectified linear units guarantee that they receive gradient everywhere.

Three generalizations of rectified linear units are based on using a non-zero slope α_i when z_i < 0: h_i = g(z, α)_i = max(0, z_i) + α_i min(0, z_i). Absolute value rectification fixes α_i = −1 to obtain g(z) = |z|. It is used for object recognition
from images (Jarrett et al., 2009), where it makes sense to seek features that are invariant under a polarity reversal of the input illumination. Other generalizations of rectified linear units are more broadly applicable. A leaky ReLU (Maas et al., 2013) fixes α_i to a small value like 0.01 while a parametric ReLU or PReLU treats α_i as a learnable parameter (He et al., 2015).

Maxout units (Goodfellow et al., 2013a) generalize rectified linear units further. Instead of applying an element-wise function g(z), maxout units divide z into groups of k values. Each maxout unit then outputs the maximum element of one
CHAPTER 6. DEEP FEEDFORWARD NETWORKS
of these groups:

g(z)_i = max_{j ∈ G^(i)} z_j,  (6.37)

where G^(i) is the indices of the inputs for group i, {(i − 1)k + 1, . . . , ik}. This provides a way of learning a piecewise linear function that responds to multiple directions in the input x space.

A maxout unit can learn a piecewise linear, convex function with up to k pieces. Maxout units can thus be seen as learning the activation function itself rather than just the relationship between units. With large enough k, a maxout unit can learn to approximate any convex function with arbitrary fidelity. In particular, a maxout layer with two pieces can learn to implement the same function of the input x as a traditional layer using the rectified linear activation function, absolute value rectification function, or the leaky or parametric ReLU, or can learn to implement a totally different function altogether. The maxout layer will of course be parametrized differently from any of these other layer types, so the learning dynamics will be different even in the cases where maxout learns to implement the same function of x as one of the other layer types.

Each maxout unit is now parametrized by k weight vectors instead of just one, so maxout units typically need more regularization than rectified linear units. They can work well without regularization if the training set is large and the number of pieces per unit is kept low (Cai et al., 2013).

Maxout units have a few other benefits. In some cases, one can gain some statistical and computational advantages by requiring fewer parameters. Specifically, if the features captured by n different linear filters can be summarized without losing information by taking the max over each group of k features, then the next layer can get by with k times fewer weights.

Because each maxout unit is driven by multiple filters, maxout units have some redundancy that helps them to resist a phenomenon called catastrophic forgetting in which neural networks forget how to perform tasks that they were trained on in the past (Goodfellow et al., 2014a).

Rectified linear units and all of these generalizations of them are based on the principle that models are easier to optimize if their behavior is closer to linear. This same general principle of using linear behavior to obtain easier optimization also applies in other contexts besides deep linear networks. Recurrent networks can learn from sequences and produce a sequence of states and outputs. When training them, one needs to propagate information through several time steps, which is much easier when some linear computations (with some directional derivatives being of magnitude near 1) are involved.
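To make the earlier formulas concrete, here is a small NumPy sketch (our own illustration; the function names and test values are not from the text) of the slope-based rectifier generalizations and of a maxout unit as in Eq. 6.37:

```python
import numpy as np

# h_i = max(0, z_i) + alpha_i * min(0, z_i): alpha = 0 gives the rectifier,
# alpha = 0.01 a leaky ReLU, alpha = -1 absolute value rectification, and a
# learned alpha is a parametric ReLU (PReLU).
def generalized_relu(z, alpha):
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

# A maxout unit (Eq. 6.37): split the pre-activations z into groups of k
# values and output the maximum element of each group.
def maxout(z, k):
    return np.asarray(z, dtype=float).reshape(-1, k).max(axis=1)

z = np.array([-2.0, -0.5, 0.0, 3.0])
print(generalized_relu(z, -1.0))   # absolute value rectification: |z|
print(maxout(z, k=2))              # groups [-2, -0.5] and [0, 3] -> [-0.5, 3.0]
```

With α fixed per unit these are ordinary element-wise activations; the maxout version instead consumes k pre-activations per output unit, which is why it needs k weight vectors per unit.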
One of the best-performing recurrent network
architectures, the LSTM, propagates information through time via summation—a particular straightforward kind of such linear activation. This is discussed further in Sec. 10.10.

Prior to the introduction of rectified linear units, most neural networks used the logistic sigmoid activation function

g(z) = σ(z)  (6.38)

or the hyperbolic tangent activation function

g(z) = tanh(z).  (6.39)

These activation functions are closely related because tanh(z) = 2σ(2z) − 1.
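The identity relating the two activations can be checked numerically; this short sketch (our own, not from the text) confirms tanh(z) = 2σ(2z) − 1 on a grid of inputs:

```python
import numpy as np

def sigmoid(z):
    # the logistic sigmoid sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 101)
# largest deviation between tanh(z) and 2*sigma(2z) - 1 over the grid
max_gap = np.max(np.abs(np.tanh(z) - (2.0 * sigmoid(2.0 * z) - 1.0)))
print(max_gap)  # numerically zero (floating-point rounding only)
```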
We have already seen sigmoid units as output units, used to predict the probability that a binary variable is 1. Unlike piecewise linear units, sigmoidal units saturate across most of their domain—they saturate to a high value when z is very positive, saturate to a low value when z is very negative, and are only strongly sensitive to their input when z is near 0. The widespread saturation of sigmoidal units can make gradient-based learning very difficult. For this reason, their use as hidden units in feedforward networks is now discouraged. Their use as output units is compatible with the use of gradient-based learning when an appropriate cost function can undo the saturation of the sigmoid in the output layer.

When a sigmoidal activation function must be used, the hyperbolic tangent activation function typically performs better than the logistic sigmoid. It resembles the identity function more closely, in the sense that tanh(0) = 0 while σ(0) = 1/2. Because tanh is similar to identity near 0, training a deep neural network ŷ = w⊤ tanh(U⊤ tanh(V⊤x)) resembles training a linear model ŷ = w⊤U⊤V⊤x so long as the activations of the network can be kept small. This makes training the tanh network easier.

Sigmoidal activation functions are more common in settings other than feedforward networks. Recurrent networks, many probabilistic models, and some autoencoders have additional requirements that rule out the use of piecewise linear activation functions and make sigmoidal units more appealing despite the drawbacks of saturation.
Many other types of hidden units are possible, but are used less frequently. In general, a wide variety of differentiable functions perform perfectly well. Many unpublished activation functions perform just as well as the popular ones. To provide a concrete example, the authors tested a feedforward network using h = cos(Wx + b) on the MNIST dataset and obtained an error rate of less than 1%, which is competitive with results obtained using more conventional activation functions. During research and development of new techniques, it is common to test many different activation functions and find that several variations on standard practice perform comparably. This means that usually new hidden unit types are published only if they are clearly demonstrated to provide a significant improvement. New hidden unit types that perform roughly comparably to known types are so common as to be uninteresting.
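As an illustrative sketch only (this is our own toy construction, not the authors' MNIST experiment, and the layer sizes are arbitrary), an unconventional cosine hidden layer is as easy to write as a conventional one:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))   # 3 inputs, 4 hidden units
b = np.zeros(4)
x = rng.standard_normal(3)

# h = cos(Wx + b), applied element-wise, following the h = g(W'x + b) pattern
h = np.cos(W.T @ x + b)
print(h.shape)  # (4,)
```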
It would be impractical to list all of the hidden unit types that have appeared in the literature. We highlight a few especially useful and distinctive ones.

One possibility is to not have an activation g(z) at all. One can also think of this as using the identity function as the activation function. We have already seen that a linear unit can be useful as the output of a neural network. It may also be used as a hidden unit. If every layer of the neural network consists of only linear transformations, then the network as a whole will be linear. However, it is acceptable for some layers of the neural network to be purely linear. Consider a neural network layer with n inputs and p outputs, h = g(W⊤x + b). We may replace this with two layers, with one layer using weight matrix U and the other using weight matrix V. If the first layer has no activation function, then we have essentially factored the weight matrix W of the original layer. The factored approach is to compute h = g(V⊤U⊤x + b). If U produces q outputs, then U and V together contain only (n + p)q parameters, while W contains np parameters. For small q, this can be a considerable saving in parameters. It comes at the cost of constraining the linear transformation to be low-rank, but these low-rank relationships are often sufficient. Linear hidden units thus offer an effective way of reducing the number of parameters in a network.

Softmax units are another kind of unit that is usually used as an output (as described in Sec. 6.2.2.3) but may sometimes be used as a hidden unit. Softmax units naturally represent a probability distribution over a discrete variable with k possible values, so they may be used as a kind of switch.
These kinds of hidden units are usually only used in more advanced architectures that explicitly learn to manipulate memory, described in Sec. 10.12.
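The parameter saving from the low-rank factoring described above is easy to verify numerically. In this sketch (the sizes n, p, q are our own choices, not from the text), the factored pair U, V replaces W with a tenth of the weights while composing to a single rank-q linear map:

```python
import numpy as np

n, p, q = 1000, 1000, 50
full_params = n * p              # a single n x p weight matrix W
factored_params = (n + p) * q    # U is n x q, V is q x p

rng = np.random.default_rng(0)
U = rng.standard_normal((n, q))
V = rng.standard_normal((q, p))
x = rng.standard_normal(n)

# two linear layers with no activation in between compose to one linear map
h = (x @ U) @ V                  # identical (up to rounding) to x @ (U @ V)
print(full_params, factored_params)  # 1000000 100000
```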
A few other reasonably common hidden unit types include:

• Radial basis function or RBF unit: h_i = exp(−‖W_{:,i} − x‖² / σ_i²). This function becomes more active as x approaches a template W_{:,i}. Because it saturates to 0 for most x, it can be difficult to optimize.

• Softplus: g(a) = ζ(a) = log(1 + e^a). This is a smooth version of the rectifier, introduced by Dugas et al. (2001) for function approximation and by Nair and Hinton (2010) for the conditional distributions of undirected probabilistic models. Glorot et al. (2011a) compared the softplus and rectifier and found better results with the latter. The use of the softplus is generally discouraged. The softplus demonstrates that the performance of hidden unit types can be very counterintuitive—one might expect it to have an advantage over the rectifier due to being differentiable everywhere or due to saturating less completely, but empirically it does not.

• Hard tanh: this is shaped similarly to the tanh and the rectifier but unlike the latter, it is bounded, g(a) = max(−1, min(1, a)). It was introduced by Collobert (2004).

Hidden unit design remains an active area of research and many useful hidden unit types remain to be discovered.
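Minimal sketches (our own, with illustrative parameter choices) of the three units listed above:

```python
import numpy as np

def softplus(a):
    # zeta(a) = log(1 + exp(a)), a smooth version of the rectifier
    return np.log1p(np.exp(a))

def hard_tanh(a):
    # bounded, piecewise linear: g(a) = max(-1, min(1, a))
    return np.maximum(-1.0, np.minimum(1.0, a))

def rbf_unit(x, template, sigma=1.0):
    # most active when x is near the template; saturates to 0 elsewhere
    d = np.asarray(x, dtype=float) - np.asarray(template, dtype=float)
    return np.exp(-np.dot(d, d) / sigma ** 2)

print(softplus(0.0))                           # log 2, about 0.693
print(hard_tanh(np.array([-3.0, 0.3, 5.0])))   # [-1.   0.3  1. ]
print(rbf_unit([1.0, 2.0], [1.0, 2.0]))        # 1.0 exactly at the template
```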
6.4
Architecture Design
Another key design consideration for neural networks is determining the architecture. The word architecture refers to the overall structure of the network: how many units it should have and how these units should be connected to each other.

Most neural networks are organized into groups of units called layers. Most neural network architectures arrange these layers in a chain structure, with each layer being a function of the layer that preceded it. In this structure, the first layer is given by

h^(1) = g^(1)(W^(1)⊤ x + b^(1)),  (6.40)

the second layer is given by

h^(2) = g^(2)(W^(2)⊤ h^(1) + b^(2)),  (6.41)

and so on.
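Eqs. 6.40 and 6.41 amount to only a few lines of code. This sketch (the layer sizes and the choice of rectifier activations are our own) computes one forward pass through a two-layer chain:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((3, 4)), np.zeros(4)   # first layer: 3 inputs -> 4 units
W2, b2 = rng.standard_normal((4, 2)), np.zeros(2)   # second layer: 4 units -> 2 units

x = rng.standard_normal(3)
h1 = relu(W1.T @ x + b1)    # h(1) = g(1)(W(1)' x + b(1)), Eq. 6.40
h2 = relu(W2.T @ h1 + b2)   # h(2) = g(2)(W(2)' h(1) + b(2)), Eq. 6.41
print(h1.shape, h2.shape)   # (4,) (2,)
```

Each layer consumes only the previous layer's output, which is exactly the chain structure the text describes.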
In these chain-based architectures, the main architectural considerations are to choose the depth of the network and the width of each layer. As we will see, a network with even one hidden layer is sufficient to fit the training set. Deeper networks often are able to use far fewer units per layer and far fewer parameters and often generalize to the test set, but are also often harder to optimize. The ideal network architecture for a task must be found via experimentation guided by monitoring the validation set error.

A linear model, mapping from features to outputs via matrix multiplication, can by definition represent only linear functions. It has the advantage of being easy to train because many loss functions result in convex optimization problems when applied to linear models. Unfortunately, we often want to learn nonlinear functions.

At first glance, we might presume that learning a nonlinear function requires
designing a specialized model family for the kind of nonlinearity we want to learn. Fortunately, feedforward networks with hidden layers provide a universal approximation framework. Specifically, the universal approximation theorem (Hornik et al., 1989; Cybenko, 1989) states that a feedforward network with a linear output layer and at least one hidden layer with any "squashing" activation function (such as the logistic sigmoid activation function) can approximate any Borel measurable function from one finite-dimensional space to another with any desired non-zero amount of error, provided that the network is given enough hidden units. The derivatives of the feedforward network can also approximate the derivatives of the function arbitrarily well (Hornik et al., 1990).
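The theorem is purely existential, but its flavor can be illustrated by hand (a sketch of our own, not the actual proof): a single hidden layer of steep logistic "squashing" units plus a linear output layer builds a staircase that tracks a target function, here f(x) = x² on [−1, 1]:

```python
import numpy as np

def sigmoid(z):
    z = np.clip(z, -50.0, 50.0)       # avoid overflow in exp for steep units
    return 1.0 / (1.0 + np.exp(-z))

f = lambda x: x ** 2
centers = np.linspace(-1.0, 1.0, 201)   # one steep hidden unit per step
weights = np.diff(f(centers))           # output-layer weights = step increments

def one_hidden_layer_net(x, steepness=1000.0):
    # each steep sigmoid is nearly a step function turning on at its center,
    # so the linear output layer sums increments of f up to x: a staircase
    hidden = sigmoid(steepness * (x[:, None] - centers[1:][None, :]))
    return f(centers[0]) + hidden @ weights

xs = np.linspace(-0.95, 0.95, 97)
err = np.max(np.abs(one_hidden_layer_net(xs) - f(xs)))
print(err)  # small; shrinks further as more, steeper units are added
```

Adding hidden units (finer steps) drives the error toward zero, which is the qualitative content of the theorem for this one-dimensional case.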
The concept of Borel measurability is beyond the scope of this book; for our purposes it suffices to say that any continuous function on a closed and bounded subset of R^n is Borel measurable and therefore may be approximated by a neural network. A neural network may also approximate any function mapping from any finite dimensional discrete space to another. While the original theorems were first stated in terms of units with activation functions that saturate both for very negative and for very positive arguments, universal approximation theorems have also been proven for a wider class of activation functions, which includes the now commonly used rectified linear unit (Leshno et al., 1993).

The universal approximation theorem means that regardless of what function we are trying to learn, we know that a large MLP will be able to represent this
How However, ever,approxim we are not guaran guaranteed teed means that thethat training algorithm willfunction b e able w learn, we knoifwthe that a large MLP will b e able this toe are trying that to function. Even MLP is able to represent theto function, learning function. How ever, we are not guaran teed that the training algorithm will b e able can fail for tw twoo differen differentt reasons. First, the optimization algorithm used for training to that function. Even if the MLP is able to represent the function, learning 197 optimization algorithm used for training can fail for two different reasons. First, the
may not be able to find the value of the parameters that corresponds to the desired function. Second, the training algorithm might choose the wrong function due to overfitting. Recall from Sec. 5.2.1 that the "no free lunch" theorem shows that there is no universally superior machine learning algorithm. Feedforward networks provide a universal system for representing functions, in the sense that, given a function, there exists a feedforward network that approximates the function. There is no universal procedure for examining a training set of specific examples and choosing a function that will generalize to points not in the training set.

The universal approximation theorem says that there exists a network large enough to achieve any degree of accuracy we desire, but the theorem does not say how large this network will be.
Barron (1993) provides some bounds on the size of a single-layer network needed to approximate a broad class of functions. Unfortunately, in the worst case, an exponential number of hidden units (possibly with one hidden unit corresponding to each input configuration that needs to be distinguished) may be required. This is easiest to see in the binary case: the number of possible binary functions on vectors v ∈ {0, 1}^n is 2^(2^n) and selecting one such function requires 2^n bits, which will in general require O(2^n) degrees of freedom.

In summary, a feedforward network with a single layer is sufficient to represent any function, but the layer may be infeasibly large and may fail to learn and generalize correctly. In many circumstances, using deeper models can reduce the number of units required to represent the desired function and can reduce the amount of generalization error.

There exist families of functions which can be approximated efficiently by an architecture with depth greater than some value d, but which require a much larger model if depth is restricted to be less than or equal to d. In many cases, the number of hidden units required by the shallow model is exponential in n. Such results were first proven for models that do not resemble the continuous, differentiable neural networks used for machine learning, but have since been extended to these models. The first results were for circuits of logic gates (Håstad, 1986). Later work extended these results to linear threshold units with non-negative weights (Håstad and Goldmann, 1991; Hajnal et al., 1993), and then to networks with continuous-valued activations (Maass, 1992; Maass et al., 1994). Many modern neural networks use rectified linear units. Leshno et al. (1993) demonstrated that shallow networks with a broad family of non-polynomial activation functions, including rectified linear units, have universal approximation properties, but these results do not address the questions of depth or efficiency—they specify only that a sufficiently wide rectifier network could represent any function. Pascanu et al.
CHAPTER 6. DEEP FEEDFORWARD NETWORKS
(2013b) and Montufar et al. (2014) showed that functions representable with a deep rectifier net can require an exponential number of hidden units with a shallow (one hidden layer) network. More precisely, they showed that piecewise linear networks (which can be obtained from rectifier nonlinearities or maxout units) can represent functions with a number of regions that is exponential in the depth of the network. Fig. 6.5 illustrates how a network with absolute value rectification creates mirror images of the function computed on top of some hidden unit, with respect to the input of that hidden unit. Each hidden unit specifies where to fold the input space in order to create mirror responses (on both sides of the absolute value nonlinearity). By composing these folding operations, we obtain an exponentially large number of piecewise linear regions which can capture all kinds of regular (e.g., repeating) patterns.
Figure 6.5: An intuitive, geometric explanation of the exponential advantage of deeper rectifier networks formally shown by Pascanu et al. (2014a) and by Montufar et al. (2014). (Left) An absolute value rectification unit has the same output for every pair of mirror points in its input. The mirror axis of symmetry is given by the hyperplane defined by the weights and bias of the unit. A function computed on top of that unit (the green decision surface) will be a mirror image of a simpler pattern across that axis of symmetry. (Center) The function can be obtained by folding the space around the axis of symmetry. (Right) Another repeating pattern can be folded on top of the first (by another downstream unit) to obtain another symmetry (which is now repeated four times, with two hidden layers).

More precisely, the main theorem in Montufar et al. (2014) states that the number of linear regions carved out by a deep rectifier network with d inputs, depth l, and n units per hidden layer, is

\[ O\!\left( \binom{n}{d}^{d(l-1)} n^{d} \right), \tag{6.42} \]

i.e., exponential in the depth l. In the case of maxout networks with k filters per unit, the number of linear regions is

\[ O\!\left( k^{(l-1)+d} \right). \tag{6.43} \]
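As a concrete illustration, the dominant terms of these two bounds can be evaluated numerically. The sketch below (the function names are my own, not from the text) computes them for small settings of d, l, n, and k, showing the exponential growth with depth:

```python
from math import comb

def rectifier_region_bound(n, d, l):
    """Dominant term of Eq. 6.42: linear regions of a deep rectifier
    network with d inputs, depth l, and n units per hidden layer."""
    return comb(n, d) ** (d * (l - 1)) * n ** d

def maxout_region_bound(k, d, l):
    """Dominant term of Eq. 6.43: maxout network with k filters per unit."""
    return k ** ((l - 1) + d)

# Both bounds grow exponentially with depth l.
print(rectifier_region_bound(n=4, d=2, l=2))  # comb(4,2)**2 * 4**2 = 576
print(maxout_region_bound(k=2, d=3, l=4))     # 2**((4-1)+3) = 64
```

Doubling l multiplies the exponent, not the count, which is the source of the exponential advantage of depth over width in this analysis.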
Of course, there is no guarantee that the kinds of functions we want to learn in applications of machine learning (and in particular for AI) share such a property.

We may also want to choose a deep model for statistical reasons. Any time we choose a specific machine learning algorithm, we are implicitly stating some set of prior beliefs we have about what kind of function the algorithm should learn. Choosing a deep model encodes a very general belief that the function we want to learn should involve composition of several simpler functions. This can be interpreted from a representation learning point of view as saying that we believe the learning problem consists of discovering a set of underlying factors of variation that can in turn be described in terms of other, simpler underlying factors of variation. Alternately, we can interpret the use of a deep architecture as expressing a belief that the function we want to learn is a computer program consisting of multiple steps, where each step makes use of the previous step's output. These intermediate outputs are not necessarily factors of variation, but can instead be analogous to counters or pointers that the network uses to organize its internal processing. Empirically, greater depth does seem to result in better generalization for a wide variety of tasks (Bengio et al., 2007; Erhan et al., 2009; Bengio, 2009; Mesnil et al., 2011; Ciresan et al., 2012; Krizhevsky et al., 2012; Sermanet et al., 2013; Farabet et al., 2013; Couprie et al., 2013; Kahou et al., 2013; Goodfellow et al., 2014d; Szegedy et al., 2014a). See Fig. 6.6 and Fig. 6.7 for examples of some of these empirical results. This suggests that using deep architectures does indeed express a useful prior over the space of functions the model learns.

So far we have described neural networks as being simple chains of layers, with the main considerations being the depth of the network and the width of each layer. In practice, neural networks show considerably more diversity.

Many neural network architectures have been developed for specific tasks. Specialized architectures for computer vision called convolutional networks are described in Chapter 9. Feedforward networks may also be generalized to the recurrent neural networks for sequence processing, described in Chapter 10, which have their own architectural considerations.

In general, the layers need not be connected in a chain, even though this is the most common practice. Many architectures build a main chain but then add extra architectural features to it, such as skip connections going from layer i to layer i + 2 or higher. These skip connections make it easier for the gradient to flow from output layers to layers nearer the input.
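A minimal sketch of this idea, with made-up layer sizes and a plain NumPy forward pass (not the book's notation): a skip connection simply adds an earlier layer's output back in downstream, giving the later layer a shortcut path around the intermediate one.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(a):
    return np.maximum(0.0, a)

d = 8  # illustrative layer width
W1, W2, W3 = (0.1 * rng.standard_normal((d, d)) for _ in range(3))

def forward_chain(x):
    # Plain chain: each layer sees only the previous layer's output.
    h1 = relu(W1 @ x)
    h2 = relu(W2 @ h1)
    return W3 @ h2

def forward_with_skip(x):
    # Skip connection from layer 1 to layer 3: h1 bypasses layer 2,
    # shortening the path gradients travel back to early layers.
    h1 = relu(W1 @ x)
    h2 = relu(W2 @ h1)
    return W3 @ (h2 + h1)
```

During back-propagation, the addition node routes part of the gradient directly from the output back to h1, which is why such connections ease gradient flow.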
Figure 6.6: Empirical results showing that deeper networks generalize better when used to transcribe multi-digit numbers from photographs of addresses. Data from Goodfellow et al. (2014d). The test set accuracy consistently increases with increasing depth. See Fig. 6.7 for a control experiment demonstrating that other increases to the model size do not yield the same effect.
Figure 6.7: Deeper models tend to perform better. This is not merely because the model is larger. This experiment from Goodfellow et al. (2014d) shows that increasing the number of parameters in layers of convolutional networks without increasing their depth is not nearly as effective at increasing test set performance. The legend indicates the depth of network used to make each curve and whether the curve represents variation in the size of the convolutional or the fully connected layers. We observe that shallow models in this context overfit at around 20 million parameters while deep ones can benefit from having over 60 million. This suggests that using a deep model expresses a useful preference over the space of functions the model can learn. Specifically, it expresses a belief that the function should consist of many simpler functions composed together. This could result either in learning a representation that is composed in turn of simpler representations (e.g., corners defined in terms of edges) or in learning a program with sequentially dependent steps (e.g., first locate a set of objects, then segment them from each other, then recognize them).
Another key consideration of architecture design is exactly how to connect a pair of layers to each other. In the default neural network layer described by a linear transformation via a matrix W, every input unit is connected to every output unit. Many specialized networks in the chapters ahead have fewer connections, so that each unit in the input layer is connected to only a small subset of units in the output layer. These strategies for reducing the number of connections reduce the number of parameters and the amount of computation required to evaluate the network, but are often highly problem-dependent. For example, convolutional networks, described in Chapter 9, use specialized patterns of sparse connections that are very effective for computer vision problems. In this chapter, it is difficult to give much more specific advice concerning the architecture of a generic neural network. Subsequent chapters develop the particular architectural strategies that have been found to work well for different application domains.
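To make the parameter-savings argument concrete, here is a back-of-the-envelope comparison (the sizes are invented for illustration) between a fully connected layer and a sparsely connected one in which each output unit sees only a small window of inputs:

```python
# Illustrative sizes, not from the text.
n_in, n_out, window = 1000, 1000, 9

dense_params = n_in * n_out      # every input connects to every output
sparse_params = n_out * window   # each output sees only `window` inputs

print(dense_params)   # 1000000
print(sparse_params)  # 9000
```

The sparse scheme here uses roughly 100x fewer parameters, which is the kind of saving the convolutional connectivity patterns of Chapter 9 exploit.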
6.5 Back-Propagation and Other Differentiation Algorithms

When we use a feedforward neural network to accept an input x and produce an output ŷ, information flows forward through the network. The inputs x provide the initial information that then propagates up to the hidden units at each layer and finally produces ŷ. This is called forward propagation. During training, forward propagation can continue onward until it produces a scalar cost J(θ). The back-propagation algorithm (Rumelhart et al., 1986a), often simply called backprop, allows the information from the cost to then flow backwards through the network, in order to compute the gradient.

Computing an analytical expression for the gradient is straightforward, but numerically evaluating such an expression can be computationally expensive. The back-propagation algorithm does so using a simple and inexpensive procedure.

The term back-propagation is often misunderstood as meaning the whole learning algorithm for multi-layer neural networks. Actually, back-propagation refers only to the method for computing the gradient, while another algorithm, such as stochastic gradient descent, is used to perform learning using this gradient. Furthermore, back-propagation is often misunderstood as being specific to multi-layer neural networks, but in principle it can compute derivatives of any function (for some functions, the correct response is to report that the derivative of the function is undefined). Specifically, we will describe how to compute the gradient ∇_x f(x, y) for an arbitrary function f, where x is a set of variables whose derivatives are desired, and y is an additional set of variables that are inputs to the function
but whose derivatives are not required. In learning algorithms, the gradient we most often require is the gradient of the cost function with respect to the parameters, ∇_θ J(θ). Many machine learning tasks involve computing other derivatives, either as part of the learning process, or to analyze the learned model. The back-propagation algorithm can be applied to these tasks as well, and is not restricted to computing the gradient of the cost function with respect to the parameters. The idea of computing derivatives by propagating information through a network is very general, and can be used to compute values such as the Jacobian of a function f with multiple outputs. We restrict our description here to the most commonly used case where f has a single output.

So far we have discussed neural networks with a relatively informal graph language. To describe the back-propagation algorithm more precisely, it is helpful to have a more precise computational graph language.

Many ways of formalizing computation as graphs are possible. Here, we use each node in the graph to indicate a variable. The variable may be a scalar, vector, matrix, tensor, or even a variable of another type.

To formalize our graphs, we also need to introduce the idea of an operation. An operation is a simple function of one or more variables. Our graph language is accompanied by a set of allowable operations. Functions more complicated than the operations in this set may be described by composing many operations together.

Without loss of generality, we define an operation to return only a single output variable. This does not lose generality because the output variable can have multiple entries, such as a vector. Software implementations of back-propagation usually support operations with multiple outputs, but we avoid this case in our description because it introduces many extra details that are not important to conceptual understanding.

If a variable y is computed by applying an operation to a variable x, then we draw a directed edge from x to y. We sometimes annotate the output node with the name of the operation applied, and other times omit this label when the operation is clear from context.

Examples of computational graphs are shown in Fig. 6.8.
Figure 6.8: Examples of computational graphs. (a) The graph using the × operation to compute z = xy. (b) The graph for the logistic regression prediction ŷ = σ(x⊤w + b). Some of the intermediate expressions do not have names in the algebraic expression but need names in the graph. We simply name the i-th such variable u^(i). (c) The computational graph for the expression H = max{0, XW + b}, which computes a design matrix of rectified linear unit activations H given a design matrix containing a minibatch of inputs X. (d) Examples a–c applied at most one operation to each variable, but it is possible to apply more than one operation. Here we show a computation graph that applies more than one operation to the weights w of a linear regression model. The weights are used to make both the prediction ŷ and the weight decay penalty λ∑_i w_i^2.
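The graphs in panels (b) and (d) can be traced in code. The following sketch (toy values, variable names of my choosing) evaluates each node in order, which is exactly what forward propagation does:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Panel (b): y_hat = sigma(x^T w + b), two operations chained.
x = np.array([1.0, 2.0])
w = np.array([0.5, -0.3])
b = 0.1

u1 = x @ w + b        # intermediate node (affine transformation)
y_hat = sigmoid(u1)   # output node (logistic sigmoid)

# Panel (d): the variable w feeds two operations, the prediction
# above and the weight decay penalty lambda * sum_i w_i**2.
lam = 0.01
penalty = lam * np.sum(w ** 2)
```

Note that w appears as the input to two different operation nodes, so during back-propagation its gradient accumulates contributions from both paths.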
The chain rule of calculus (not to b e confused with the chain rule of probability) is used to compute the deriv derivativ ativ atives es of functions formed by comp composing osing other functions The c hain rule of calculus (not to b e confused with the chain rule of probability) is whose deriv derivativ ativ atives es are kno known. wn. Bac Back-propagation k-propagation is an algorithm that computes the compute derivativ es of compefficien osing other functions cused haintorule, with athe sp specific ecific order of functions op operations erationsformed that isbyhighly efficient. t. whose derivatives are known. Back-propagation is an algorithm that computes the Let x b e a real num numb b er, and let f and g b oth b e functions mapping from a real chain rule, with a sp ecific order of op erations that is highly efficient. number to a real num numb b er. Supp Suppose ose that y = g (x) and z = f (g (x)) = f (y ). Then x f and g b oth b e functions mapping from a real Let b e a real num b er, and let the chain rule states that y =dyg (x) and z = f (g (x)) = f (y ). Then number to a real numb er. Supp osedzthatdz = . (6.44) the chain rule states that dx dy dx dz dz dy = . (6.44) We can generalize this b ey eyond ond dx the scalar Suppose ose that x ∈ R m, y ∈ Rn , dy dxcase. Supp g maps from Rm to Rn , and f maps from R n to R. If y = g (x ) and z = R f (y ), then R We can generalize this b eyond the scalar case. Supp ose that x ,y , R R R R X ∂ z from ∂ zto∂ y j. If y = g (x ) and ∈ g maps from z = f (y ),∈then to , and f maps . (6.45) = ∂xi ∂ y j ∂ xi j ∂z ∂y ∂z . (6.45) = ∂x ∂y ∂x In vector notation, this may b e equiv equivalently alently written as > In vector notation, this may b e equivalently ∂ y written as (6.46) ∇xz = X ∇y z , ∂x ∂y z= z, (6.46) ∂y x g. matrix∂of where ∂x is the n × m Jacobian ∇ ∇ x can b e obtained by multiplying F rom this we see that the gradient of a v ariable is the n ∂ym Jacobian matrix of g . 
where a Jacobian matrix ∂x by a gradient ∇yz. The algorithm consists back-propagation × that the gradient of a variable x F rom this we see can b e obtained ultiplying of p erforming suc such h a Jacobian-gradient pro product duct for each op operation eration by in m the graph. z. The back-propagation algorithm consists a Jacobian matrix by a gradient wesuc doh not apply the bac back-propagation k-propagation algorithm merely to vectors, of pUsually erforming a Jacobian-gradient ∇ pro duct for each op eration in the graph. but rather to tensors of arbitrary dimensionalit dimensionality y. Conceptually Conceptually,, this is exactly the Usually w e do not apply the bac k-propagation algorithmis merely vectors, same as back-propagation with vectors. The only difference ho how w thetonum numb b ers but rather to tensors of arbitrary dimensionalit y . Conceptually , this is exactly the are arranged in a grid to form a tensor. We could imagine flattening each tensor same as back-propagation with vectors. The only difference is ho w the num b ers in into to a vector b efore we run back-propagation, computing a vector-v vector-valued alued gradient, are arranged in a gridthe to gradien form a ttensor. We could imagine each tensor and then reshaping gradient back into a tensor. In flattening this rearranged view, in to a v ector b efore we run back-propagation, computing a vector-v alued gradient, bac back-propagation k-propagation is still just multiplying Jacobians by gradien gradients. ts. and then reshaping the gradient back into a tensor. In this rearranged view, To denote the gradient of a value z with resp respect ect to a tensor , we write ∇ z , back-propagation is still just multiplying Jacobians by gradients. just as if were a vector. The indices into no now w ha have ve multiple co coordinates—for ordinates—for z with zy, To denote gradient of a value resp ect to W a tensor , we write example, a 3-Dthe tensor is indexed by three co coordinates. ordinates. 
We can abstract this away by using a single variable i to represent the complete tuple of indices. For all possible index tuples i, (∇_X z)_i gives ∂z/∂X_i. This is exactly the same as how for all
CHAPTER 6. DEEP FEEDFORWARD NETWORKS
possible integer indices i into a vector, (∇_x z)_i gives ∂z/∂x_i. Using this notation, we can write the chain rule as it applies to tensors. If Y = g(X) and z = f(Y), then

∇_X z = Σ_j (∇_X Y_j) ∂z/∂Y_j.    (6.47)
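Eq. 6.47 can be exercised directly with numpy. In this sketch the choices of g (elementwise squaring) and f (summation) are illustrative assumptions; the loop runs over index tuples j exactly as in the equation:

```python
import numpy as np

# Illustrative choice: Y = g(X) squares X elementwise, z = f(Y) sums the entries.
X = np.array([[1.0, -2.0], [0.5, 3.0]])
Y = X ** 2
z = Y.sum()

# dz/dY_j = 1 for every index tuple j, since z just sums Y.
dz_dY = np.ones_like(Y)

# grad_X Y_j: for elementwise squaring, Y_j depends only on X_j,
# so grad_X Y_j is a one-hot tensor holding 2*X_j at position j.
grad = np.zeros_like(X)
for j in np.ndindex(Y.shape):          # iterate over index tuples j
    grad_X_Yj = np.zeros_like(X)
    grad_X_Yj[j] = 2.0 * X[j]
    grad += grad_X_Yj * dz_dY[j]       # Eq. 6.47: sum over j

assert np.allclose(grad, 2.0 * X)      # matches the closed form d(sum X^2)/dX
```

In practice no library materializes the one-hot tensors; each operation's bprop computes the summed product directly.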
Using the chain rule, it is straightforward to write down an algebraic expression for the gradient of a scalar with respect to any node in the computational graph that produced that scalar. However, actually evaluating that expression in a computer introduces some extra considerations.

Specifically, many subexpressions may be repeated several times within the overall expression for the gradient. Any procedure that computes the gradient will need to choose whether to store these subexpressions or to recompute them several times. An example of how these repeated subexpressions arise is given in Fig. 6.9. In some cases, computing the same subexpression twice would simply be wasteful. For complicated graphs, there can be exponentially many of these wasted computations, making a naive implementation of the chain rule infeasible.
In other cases, computing the same subexpression twice could be a valid way to reduce memory consumption at the cost of higher runtime.

We begin with a version of the back-propagation algorithm that specifies the actual gradient computation directly (Algorithm 6.2, along with Algorithm 6.1 for the associated forward computation), in the order it will actually be done, according to the recursive application of the chain rule. One could either directly perform these computations or view the description of the algorithm as a symbolic specification of the computational graph for computing the back-propagation. However, this formulation does not make explicit the manipulation and construction of the symbolic graph that performs the gradient computation. Such a formulation is presented below in Sec. 6.5.6, with Algorithm 6.5, where we also generalize to nodes that contain arbitrary tensors.
First consider a computational graph describing how to compute a single scalar u^(n) (say the loss on a training example). This scalar is the quantity whose gradient we want to obtain, with respect to the n_i input nodes u^(1) to u^(n_i). In other words, we wish to compute ∂u^(n)/∂u^(i) for all i ∈ {1, 2, ..., n_i}. In the application of back-propagation to computing gradients for gradient descent over parameters, u^(n) will be the cost associated with an example or a minibatch, while u^(1) to u^(n_i) correspond to the parameters of the model.
We will assume that the nodes of the graph have been ordered in such a way that we can compute their output one after the other, starting at u^(n_i + 1) and going up to u^(n). As defined in Algorithm 6.1, each node u^(i) is associated with an operation f^(i) and is computed by evaluating the function

u^(i) = f^(i)(A^(i))    (6.48)

where A^(i) is the set of all nodes that are parents of u^(i).

Algorithm 6.1  A procedure that performs the computations mapping n_i inputs u^(1) to u^(n_i) to an output u^(n). This defines a computational graph where each node computes numerical value u^(i) by applying a function f^(i) to the set of arguments A^(i) that comprises the values of previous nodes u^(j), j < i, with j ∈ Pa(u^(i)). The input to the computational graph is the vector x, and is set into the first n_i nodes u^(1) to u^(n_i). The output of the computational graph is read off the last (output) node u^(n).

for i = 1, ..., n_i do
  u^(i) ← x_i
end for
for i = n_i + 1, ..., n do
  A^(i) ← {u^(j) | j ∈ Pa(u^(i))}
  u^(i) ← f^(i)(A^(i))
end for
return u^(n)
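The forward procedure of Algorithm 6.1 might be sketched as follows; the representation of the graph as plain dicts of Python functions and parent lists is an illustrative assumption:

```python
# A minimal sketch of Algorithm 6.1. The dict-based graph encoding is an
# illustrative assumption, not the book's API.
def forward_prop(x, ops, parents, n_i):
    """ops[i] is f^(i); parents[i] lists the indices j in Pa(u^(i))."""
    n = n_i + len(ops)
    u = {}
    for i in range(1, n_i + 1):         # u^(i) <- x_i for the input nodes
        u[i] = x[i - 1]
    for i in range(n_i + 1, n + 1):     # remaining nodes, in topological order
        A = [u[j] for j in parents[i]]  # A^(i) = {u^(j) | j in Pa(u^(i))}
        u[i] = ops[i](A)
    return u[n]                         # output read off the last node

# Example graph: u3 = u1 * u2, u4 = u3 + u1.
ops = {3: lambda A: A[0] * A[1], 4: lambda A: A[0] + A[1]}
parents = {3: [1, 2], 4: [3, 1]}
print(forward_prop([2.0, 5.0], ops, parents, n_i=2))  # 2*5 + 2 = 12.0
```

The only requirement on the node numbering is topological order: every parent index is smaller than its child's index, so each A^(i) is fully available when f^(i) runs.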
That algorithm specifies the forward propagation computation, which we could put in a graph G. In order to perform back-propagation, we can construct a computational graph that depends on G and adds to it an extra set of nodes. These form a subgraph B with one node per node of G. Computation in B proceeds in exactly the reverse of the order of computation in G, and each node of B computes the derivative ∂u^(n)/∂u^(i) associated with the forward graph node u^(i). This is done using the chain rule with respect to the scalar output u^(n):

∂u^(n)/∂u^(j) = Σ_{i : j ∈ Pa(u^(i))} (∂u^(n)/∂u^(i)) (∂u^(i)/∂u^(j))    (6.49)

as specified by Algorithm 6.2. The subgraph B contains exactly one edge for each edge from node u^(j) to node u^(i) of G. The edge from u^(j) to u^(i) is associated with the computation of ∂u^(i)/∂u^(j). In addition, a dot product is performed for each node, between the gradient already computed with respect to nodes u^(i) that are children
of u^(j) and the vector containing the partial derivatives ∂u^(i)/∂u^(j) for the same children nodes u^(i). To summarize, the amount of computation required for performing the back-propagation scales linearly with the number of edges in G, where the computation for each edge corresponds to computing a partial derivative (of one node with respect to one of its parents) as well as performing one multiplication and one addition. Below, we generalize this analysis to tensor-valued nodes, which is just a way to group multiple scalar values in the same node and enable more efficient implementations.

Algorithm 6.2  Simplified version of the back-propagation algorithm for computing the derivatives of u^(n) with respect to the variables in the graph. This example is intended to further understanding by showing a simplified case where all variables are scalars, and we wish to compute the derivatives with respect to u^(1), ..., u^(n_i). This simplified version computes the derivatives of all nodes in the graph. The computational cost of this algorithm is proportional to the number of edges in the graph, assuming that the partial derivative associated with each edge requires a constant time. This is of the same order as the number of computations for the forward propagation. Each ∂u^(i)/∂u^(j) is a function of the parents u^(j) of u^(i), thus linking the nodes of the forward graph to those added for the back-propagation graph.

Run forward propagation (Algorithm 6.1 for this example) to obtain the activations of the network.
Initialize grad_table, a data structure that will store the derivatives that have been computed. The entry grad_table[u^(i)] will store the computed value of ∂u^(n)/∂u^(i).
grad_table[u^(n)] ← 1
for j = n − 1 down to 1 do
  The next line computes ∂u^(n)/∂u^(j) = Σ_{i : j ∈ Pa(u^(i))} (∂u^(n)/∂u^(i)) (∂u^(i)/∂u^(j)) using stored values:
  grad_table[u^(j)] ← Σ_{i : j ∈ Pa(u^(i))} grad_table[u^(i)] ∂u^(i)/∂u^(j)
end for
return {grad_table[u^(i)] | i = 1, ..., n_i}

The back-propagation algorithm is designed to reduce the number of common subexpressions without regard to memory. Specifically, it performs on the order of one Jacobian product per node in the graph. This can be seen from the fact that backprop (Algorithm 6.2) visits each edge from node u^(j) to node u^(i) of the graph exactly once in order to obtain the associated partial derivative ∂u^(i)/∂u^(j). Back-propagation thus avoids the exponential explosion in repeated subexpressions.
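The simplified back-propagation procedure above can be sketched in the same dict-based style; the graph encoding and the table of precomputed edge partials are illustrative assumptions:

```python
# A sketch of the simplified back-propagation of Algorithm 6.2, under
# illustrative assumptions: nodes are numbered 1..n in topological order,
# `parents` maps each non-input node to its parent indices, and
# `partials[(i, j)]` holds du^(i)/du^(j) evaluated after forward propagation.
def backprop(parents, partials, n, n_i):
    grad_table = {n: 1.0}                        # du^(n)/du^(n) = 1
    for j in range(n - 1, 0, -1):
        # Eq. 6.49, using stored values: sum over the children i of node j.
        grad_table[j] = sum(
            grad_table[i] * partials[(i, j)]
            for i in parents
            if j in parents[i]
        )
    return {i: grad_table[i] for i in range(1, n_i + 1)}

# Example graph: u3 = u1 * u2 and u4 = u3 + u1, evaluated at u1 = 2, u2 = 5.
parents = {3: [1, 2], 4: [3, 1]}
partials = {(3, 1): 5.0, (3, 2): 2.0,            # du3/du1 = u2, du3/du2 = u1
            (4, 3): 1.0, (4, 1): 1.0}
print(backprop(parents, partials, n=4, n_i=2))   # {1: 6.0, 2: 2.0}
```

Since u4 = u1·u2 + u1, the hand-computed gradient is (u2 + 1, u1) = (6, 2), which the table-filling loop recovers by visiting each edge exactly once.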
However, other algorithms may be able to avoid more subexpressions by performing simplifications on the computational graph, or may be able to conserve memory by recomputing rather than storing some subexpressions. We will revisit these ideas after describing the back-propagation algorithm itself.

To clarify the above definition of the back-propagation computation, let us consider the specific graph associated with a fully-connected multi-layer MLP. Algorithm 6.3 first shows the forward propagation, which maps parameters to the supervised loss L(ŷ, y) associated with a single (input, target) training example (x, y), with ŷ the output of the neural network when x is provided in input. Algorithm 6.4 then shows the corresponding computation to be done for applying the back-propagation algorithm to this graph.

Algorithm 6.3 and Algorithm 6.4 are demonstrations that are chosen to be simple and straightforward to understand. However, they are specialized to one specific problem.

Modern software implementations are based on the generalized form of back-propagation described in Sec. 6.5.6 below, which can accommodate any computational graph by explicitly manipulating a data structure for representing symbolic computation.

Algebraic expressions and computational graphs both operate on symbols, or variables that do not have specific values. These algebraic and graph-based representations are called symbolic representations. When we actually use or train a neural network, we must assign specific values to these symbols. We replace a symbolic input to the network x with a specific numeric value, such as [1.2, 3.765, −1.8]^⊤.

Some approaches to back-propagation take a computational graph and a set of numerical values for the inputs to the graph, then return a set of numerical values describing the gradient at those input values. We call this approach "symbol-to-number" differentiation. This is the approach used by libraries such as Torch (Collobert et al., 2011b) and Caffe (Jia, 2013).

Another approach is to take a computational graph and add additional nodes to the graph that provide a symbolic description of the desired derivatives. This
Figure 6.9: A computational graph that results in repeated subexpressions when computing the gradient. Let w ∈ R be the input to the graph. We use the same function f : R → R as the operation that we apply at every step of a chain: x = f(w), y = f(x), z = f(y). To compute ∂z/∂w, we apply Eq. 6.44 and obtain:

∂z/∂w    (6.50)
= (∂z/∂y)(∂y/∂x)(∂x/∂w)    (6.51)
= f'(y) f'(x) f'(w)    (6.52)
= f'(f(f(w))) f'(f(w)) f'(w)    (6.53)

Eq. 6.52 suggests an implementation in which we compute the value of f(w) only once and store it in the variable x. This is the approach taken by the back-propagation algorithm. An alternative approach is suggested by Eq. 6.53, where the subexpression f(w) appears more than once. In the alternative approach, f(w) is recomputed each time it is needed. When the memory required to store the value of these expressions is low, the back-propagation approach of Eq. 6.52 is clearly preferable because of its reduced runtime. However, Eq. 6.53 is also a valid implementation of the chain rule, and is useful when memory is limited.
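The trade-off in the caption can be made concrete. The choice f = exp (whose derivative is itself) is an illustrative assumption; both functions below return the same value, but the second recomputes f(w) instead of storing it:

```python
# A sketch contrasting Eq. 6.52 and Eq. 6.53 for z = f(f(f(w))),
# using an arbitrary illustrative f with derivative df.
import math
f, df = math.exp, math.exp             # f'(u) = f(u) for exp

def grad_stored(w):
    # Eq. 6.52: the forward pass stores x = f(w) and y = f(x), each computed once.
    x = f(w)
    y = f(x)
    return df(y) * df(x) * df(w)       # 3 calls to f/df in total per factor

def grad_recomputed(w):
    # Eq. 6.53: f(w) is recomputed wherever it appears; less memory, more time.
    return df(f(f(w))) * df(f(w)) * df(w)

w = 0.3
assert math.isclose(grad_stored(w), grad_recomputed(w))
```

For a chain of length k the recomputing variant performs O(k^2) evaluations of f where the storing variant performs O(k), which is the exponential-blowup argument of Fig. 6.9 in miniature.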
Algorithm 6.3  Forward propagation through a typical deep neural network and the computation of the cost function. The loss L(ŷ, y) depends on the output ŷ and on the target y (see Sec. 6.2.1.1 for examples of loss functions). To obtain the total cost J, the loss may be added to a regularizer Ω(θ), where θ contains all the parameters (weights and biases). Algorithm 6.4 shows how to compute gradients of J with respect to parameters W and b. For simplicity, this demonstration uses only a single input example x. Practical applications should use a minibatch. See Sec. 6.5.7 for a more realistic demonstration.

Require: Network depth, l
Require: W^(i), i ∈ {1, ..., l}, the weight matrices of the model
Require: b^(i), i ∈ {1, ..., l}, the bias parameters of the model
Require: x, the input to process
Require: y, the target output
h^(0) = x
for k = 1, ..., l do
  a^(k) = b^(k) + W^(k) h^(k−1)
  h^(k) = f(a^(k))
end for
ŷ = h^(l)
J = L(ŷ, y) + λΩ(θ)

is the approach taken by Theano (Bergstra et al., 2010; Bastien et al., 2012) and TensorFlow (Abadi et al., 2015). An example of how this approach works is illustrated in Fig. 6.10. The primary advantage of this approach is that the derivatives are described in the same language as the original expression. Because the derivatives are just another computational graph, it is possible to run back-propagation again, differentiating the derivatives in order to obtain higher derivatives. Computation of higher-order derivatives is described in Sec. 6.5.10.

We will use the latter approach and describe the back-propagation algorithm in terms of constructing a computational graph for the derivatives. Any subset of the graph may then be evaluated using specific numerical values at a later time. This allows us to avoid specifying exactly when each operation should be computed. Instead, a generic graph evaluation engine can evaluate every node as soon as its parents' values are available.

The description of the symbol-to-symbol based approach subsumes the symbol-to-number approach. The symbol-to-number approach can be understood as performing exactly the same computations as are done in the graph built by the symbol-to-symbol approach. The key difference is that the symbol-to-number
Algorithm 6.4  Backward computation for the deep neural network of Algorithm 6.3, which uses, in addition to the input x, a target y. This computation yields the gradients on the activations a^(k) for each layer k, starting from the output layer and going backwards to the first hidden layer. From these gradients, which can be interpreted as an indication of how each layer's output should change to reduce error, one can obtain the gradient on the parameters of each layer. The gradients on weights and biases can be immediately used as part of a stochastic gradient update (performing the update right after the gradients have been computed) or used with other gradient-based optimization methods.

After the forward computation, compute the gradient on the output layer:
g ← ∇_ŷ J = ∇_ŷ L(ŷ, y)
for k = l, l − 1, ..., 1 do
  Convert the gradient on the layer's output into a gradient into the pre-nonlinearity activation (element-wise multiplication if f is element-wise):
  g ← ∇_{a^(k)} J = g ⊙ f'(a^(k))
  Compute gradients on weights and biases (including the regularization term, where needed):
  ∇_{b^(k)} J = g + λ ∇_{b^(k)} Ω(θ)
  ∇_{W^(k)} J = g h^{(k−1)⊤} + λ ∇_{W^(k)} Ω(θ)
  Propagate the gradients w.r.t. the next lower-level hidden layer's activations:
  g ← ∇_{h^(k−1)} J = W^{(k)⊤} g
end for
Figure 6.10: An example of the symbol-to-symbol approach to computing derivatives. In this approach, the back-propagation algorithm does not need to ever access any actual specific numeric values. Instead, it adds nodes to a computational graph describing how to compute these derivatives. A generic graph evaluation engine can later compute the derivatives for any specific numeric values. (Left) In this example, we begin with a graph representing z = f(f(f(w))). (Right) We run the back-propagation algorithm, instructing it to construct the graph for the expression corresponding to dz/dw. In this example, we do not explain how the back-propagation algorithm works. The purpose is only to illustrate what the desired result is: a computational graph with a symbolic description of the derivative.
approach does not expose the graph.

The back-propagation algorithm is very simple. To compute the gradient of some scalar z with respect to one of its ancestors x in the graph, we begin by observing that the gradient with respect to z is given by dz/dz = 1. We can then compute the gradient with respect to each parent of z in the graph by multiplying the current gradient by the Jacobian of the operation that produced z. We continue multiplying by Jacobians traveling backwards through the graph in this way until we reach x. For any node that may be reached by going backwards from z through two or more paths, we simply sum the gradients arriving from different paths at that node.

More formally, each node in the graph G corresponds to a variable. To achieve maximum generality, we describe this variable as being a tensor V. Tensors can
in general have any number of dimensions, and subsume scalars, vectors, and matrices.

We assume that each variable V is associated with the following subroutines:

• get_operation(V): This returns the operation that computes V, represented by the edges coming into V in the computational graph. For example, there may be a Python or C++ class representing the matrix multiplication operation, and the function that invokes it. Suppose we have a variable that is created by matrix multiplication, C = AB. Then get_operation(C) returns a pointer to an instance of the corresponding C++ class.

• get_consumers(V, G): This returns the list of variables that are children of V in the computational graph G.

• get_inputs(V, G): This returns the list of variables that are parents of V in the computational graph G.

Each operation is also associated with a bprop operation. This bprop operation can compute a Jacobian-vector product as described by Eq. 6.47.
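On a toy graph representation, these subroutines might look as follows (a hedged sketch; the Node class and list-based graph are assumptions of this example, not part of the text):

```python
class Node:
    """A variable in the computational graph. `op` names the operation that
    computed it (None for leaf variables); `inputs` lists its parents."""
    def __init__(self, op=None, inputs=()):
        self.op = op
        self.inputs = list(inputs)

def get_operation(v):
    # The operation that computes v, i.e. the edges coming into v.
    return v.op

def get_inputs(v, graph):
    # The variables that are parents of v in the graph.
    return v.inputs

def get_consumers(v, graph):
    # The variables that are children of v: every node listing v as an input.
    return [n for n in graph if v in n.inputs]
```

For example, for C = AB one would build `C = Node(op="matmul", inputs=[A, B])`; then `get_operation(C)` identifies the matrix multiplication and `get_consumers(A, graph)` contains C.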
This is how the back-propagation algorithm is able to achieve great generality. Each operation is responsible for knowing how to back-propagate through the edges in the graph that it participates in. For example, we might use a matrix multiplication operation to create a variable C = AB. Suppose that the gradient of a scalar z with respect to C is given by G. The matrix multiplication operation is responsible for defining two back-propagation rules, one for each of its input arguments. If we call
the bprop method to request the gradient with respect to A given that the gradient on the output is G, then the bprop method of the matrix multiplication operation must state that the gradient with respect to A is given by GB⊤. Likewise, if we call the bprop method to request the gradient with respect to B, then the matrix operation is responsible for implementing the bprop method and specifying that the desired gradient is given by A⊤G. The back-propagation algorithm itself does not need to know any differentiation rules. It only needs to call each operation's bprop rules with the right arguments. Formally, op.bprop(inputs, X, G) must return

    ∑_i (∇_X op.f(inputs)_i) G_i,    (6.54)

which is just an implementation of the chain rule as expressed in Eq. 6.47. Here, inputs is a list of inputs that are supplied to the operation, op.f is the mathematical function that the operation implements, X is the input whose gradient we wish to compute, and G is the gradient on the output of the operation.
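For the matrix multiplication operation, the two rules just stated might be packaged like this (a sketch; the class layout is illustrative, but the rules GB⊤ and A⊤G follow the text):

```python
import numpy as np

class MatMul:
    """Matrix multiplication C = AB together with its back-propagation rules."""
    @staticmethod
    def f(A, B):
        return A @ B

    @staticmethod
    def bprop(inputs, X, G):
        # Given the gradient G on the output C, return the gradient on input X.
        A, B = inputs
        if X is A:
            return G @ B.T   # gradient with respect to the first argument
        if X is B:
            return A.T @ G   # gradient with respect to the second argument
        raise ValueError("X is not an input of this operation")
```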
The bprop method should always pretend that all of its inputs are distinct from each other, even if they are not. For example, if the multiplication operator is passed two copies of x to compute x², the bprop method should still return x as the derivative with respect to both inputs. The back-propagation algorithm will later add both of these arguments together to obtain 2x, which is the correct total derivative on x.

Software implementations of back-propagation usually provide both the operations and their bprop methods, so that users of deep learning software libraries are able to back-propagate through graphs built using common operations like matrix multiplication, exponents, logarithms, and so on. Software engineers who build a new implementation of back-propagation or advanced users who need to add their
own operation to an existing library must usually derive the bprop method for any new operations manually.

The back-propagation algorithm is formally described in Algorithm 6.5.

In Sec. 6.5.2, we motivated back-propagation as a strategy for avoiding computing the same subexpression in the chain rule multiple times. The naive algorithm could have exponential runtime due to these repeated subexpressions. Now that we have specified the back-propagation algorithm, we can understand its computational cost. If we assume that each operation evaluation has roughly the same cost, then we may analyze the computational cost in terms of the number of operations executed. Keep in mind here that we refer to an operation as the fundamental unit of our computational graph, which might actually consist of very many arithmetic operations (for example, we might have a graph that treats matrix
Algorithm 6.5 The outermost skeleton of the back-propagation algorithm. This portion does simple setup and cleanup work. Most of the important work happens in the build_grad subroutine of Algorithm 6.6.

Require: T, the target set of variables whose gradients must be computed
Require: G, the computational graph
Require: z, the variable to be differentiated
  Let G′ be G pruned to contain only nodes that are ancestors of z and descendents of nodes in T.
  Initialize grad_table, a data structure associating tensors to their gradients
  grad_table[z] ← 1
  for V in T do
    build_grad(V, G, G′, grad_table)
  end for
  Return grad_table restricted to T

multiplication as a single operation). Computing a gradient in a graph with n nodes will never execute more than O(n²) operations or store the output of more than O(n²) operations. Here we are counting operations in the computational graph, not individual operations executed by the underlying hardware, so it is important to remember that the runtime of each operation may be highly variable.
For example, multiplying two matrices that each contain millions of entries might correspond to a single operation in the graph. We can see that computing the gradient requires at most O(n²) operations because the forward propagation stage will at worst execute all n nodes in the original graph (depending on which values we want to compute, we may not need to execute the entire graph). The back-propagation algorithm adds one Jacobian-vector product, which should be expressed with O(1) nodes, per edge in the original graph. Because the computational graph is a directed acyclic graph it has at most O(n²) edges. For the kinds of graphs that are commonly used in practice, the situation is even better. Most neural network cost functions are
roughly chain-structured, causing back-propagation to have O(n) cost. This is far better than the naive approach, which might need to execute exponentially many nodes. This potentially exponential cost can be seen by expanding and rewriting the recursive chain rule (Eq. 6.49) non-recursively:

    ∂u^(n)/∂u^(j) = ∑_{paths (u^(π_1), u^(π_2), ..., u^(π_t)) from π_1 = j to π_t = n} ∏_{k=2}^{t} ∂u^(π_k)/∂u^(π_{k−1}).    (6.55)

Since the number of paths from node j to node n can grow up to exponentially in the
Algorithm 6.6 The inner loop subroutine build_grad(V, G, G′, grad_table) of the back-propagation algorithm, called by the back-propagation algorithm defined in Algorithm 6.5.

Require: V, the variable whose gradient should be added to G and grad_table
Require: G, the graph to modify
Require: G′, the restriction of G to nodes that participate in the gradient
Require: grad_table, a data structure mapping nodes to their gradients
  if V is in grad_table then
    Return grad_table[V]
  end if
  i ← 1
  for C in get_consumers(V, G′) do
    op ← get_operation(C)
    D ← build_grad(C, G, G′, grad_table)
    G^(i) ← op.bprop(get_inputs(C, G′), V, D)
    i ← i + 1
  end for
  G ← ∑_i G^(i)
  grad_table[V] = G
  Insert G and the operations creating it into G
  Return G

length of these paths, the number of terms in the above sum, which is the number of such paths, can grow exponentially with the depth of the forward propagation graph. This large cost would be incurred because the same computation for ∂u^(i)/∂u^(j) would be redone many times. To avoid such recomputation, we can think of back-propagation as a table-filling algorithm that takes advantage of storing intermediate results ∂u^(n)/∂u^(i). Each node in the graph has a corresponding slot in a table to store the gradient for that node. By filling in these table entries in order, back-propagation avoids repeating many common subexpressions. This table-filling strategy is sometimes called dynamic programming.
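Algorithms 6.5 and 6.6 can be sketched together in a few lines. In this sketch the gradients are numeric scalars, each node stores its own bprop rules, and the pruning of G to G′ is omitted — all assumptions of this example rather than the book's exact interface:

```python
class Var:
    """A graph node: `parents` are its inputs; `bprops[i]` maps the gradient on
    this node to the gradient contribution for parent slot i."""
    def __init__(self, parents=(), bprops=()):
        self.parents = list(parents)
        self.bprops = list(bprops)
        self.consumers = []
        for p in set(self.parents):   # register once even for repeated inputs
            p.consumers.append(self)

def build_grad(v, grad_table):
    """Algorithm 6.6: compute grad_table[v], reusing already-filled entries."""
    if v in grad_table:
        return grad_table[v]
    g = 0.0
    for c in v.consumers:
        d = build_grad(c, grad_table)
        # Sum one bprop contribution per slot in which v feeds c (Eq. 6.54).
        for i, p in enumerate(c.parents):
            if p is v:
                g += c.bprops[i](d)
    grad_table[v] = g
    return g

def backprop(targets, z):
    """Algorithm 6.5: the outer skeleton, seeding dz/dz = 1."""
    grad_table = {z: 1.0}
    return [build_grad(t, grad_table) for t in targets]
```

Running this on z = f(f(w)) with f(u) = 2u fills the table once per node and returns dz/dw = 4; a node feeding two slots (such as x passed twice into a multiply) has its contributions summed, exactly the table-filling behavior described above.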
As an example, we walk through the back-propagation algorithm as it is used to train a multilayer perceptron.

Here we develop a very simple multilayer perceptron with a single hidden layer. To train this model, we will use minibatch stochastic gradient descent.
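The minibatch SGD update itself is a one-line rule applied to every parameter (a generic sketch; the learning rate argument lr is a hyperparameter of this example):

```python
import numpy as np

def sgd_step(params, grads, lr):
    """Update each parameter in place: theta <- theta - lr * g, where g is
    typically the gradient averaged over one minibatch of examples."""
    for theta, g in zip(params, grads):
        theta -= lr * g
    return params
```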
The back-propagation algorithm is used to compute the gradient of the cost on a single minibatch. Specifically, we use a minibatch of examples from the training set formatted as a design matrix X and a vector of associated class labels y. The network computes a layer of hidden features H = max{0, XW^(1)}. To simplify the presentation we do not use biases in this model. We assume that our graph language includes a relu operation that can compute max{0, Z} element-wise. The predictions of the unnormalized log probabilities over classes are then given by HW^(2). We assume that our graph language includes a cross_entropy operation that computes the cross-entropy between the targets y and the probability distribution defined by these unnormalized log probabilities. The resulting cross-entropy defines the cost J_MLE. Minimizing this cross-entropy performs maximum likelihood estimation of the classifier. However, to make this example more realistic, we also include a regularization term. The total cost

    J = J_MLE + λ (∑_{i,j} (W^(1)_{i,j})² + ∑_{i,j} (W^(2)_{i,j})²)    (6.56)
consists of the cross-entropy and a weight decay term with coefficient λ. The computational graph is illustrated in Fig. 6.11.

The computational graph for the gradient of this example is large enough that it would be tedious to draw or to read. This demonstrates one of the benefits of the back-propagation algorithm, which is that it can automatically generate gradients that would be straightforward but tedious for a software engineer to derive manually.

We can roughly trace out the behavior of the back-propagation algorithm by looking at the forward propagation graph in Fig. 6.11. To train, we wish to compute both ∇_{W^(1)} J and ∇_{W^(2)} J. There are two different paths leading backward from J to the weights: one through the cross-entropy cost, and one through the weight decay cost. The weight decay cost is relatively simple; it will always contribute 2λW^(i) to the gradient on W^(i).
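The forward computation of this cost (Eq. 6.56) can be sketched directly (a NumPy sketch; realizing the cross_entropy operation with a numerically stabilized softmax is an assumption of this example):

```python
import numpy as np

def mlp_cost(X, y, W1, W2, lam):
    """Forward pass for the single-hidden-layer MLP without biases.

    X : (m, n_in) design matrix;  y : (m,) integer class labels.
    Returns J = J_MLE + lam * (sum(W1**2) + sum(W2**2)), Eq. 6.56.
    """
    H = np.maximum(0, X @ W1)                    # relu hidden layer, H = max{0, XW1}
    U2 = H @ W2                                  # unnormalized log probabilities
    U2 = U2 - U2.max(axis=1, keepdims=True)      # stabilize the softmax
    log_q = U2 - np.log(np.exp(U2).sum(axis=1, keepdims=True))
    J_mle = -log_q[np.arange(len(y)), y].mean()  # cross-entropy over the minibatch
    return J_mle + lam * ((W1 ** 2).sum() + (W2 ** 2).sum())
```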
The other path through the cross-entropy cost is slightly more complicated. Let G be the gradient on the unnormalized log probabilities U^(2) provided by the cross_entropy operation. The back-propagation algorithm now needs to explore two different branches. On the shorter branch, it adds H⊤G to the gradient on W^(2), using the back-propagation rule for the second argument to the matrix multiplication operation. The other branch corresponds to the longer chain descending further along the network. First, the back-propagation algorithm computes ∇_H J = GW^(2)⊤ using the back-propagation rule for the first argument to the matrix multiplication operation. Next, the relu operation uses its back-
Figure 6.11: The computational graph used to compute the cost used to train our example of a single-layer MLP using the cross-entropy loss and weight decay.
propagation rule to zero out components of the gradient corresponding to entries of U^(1) that were less than 0. Let the result be called G′. The last step of the back-propagation algorithm is to use the back-propagation rule for the second argument of the matmul operation to add X⊤G′ to the gradient on W^(1).

After these gradients have been computed, it is the responsibility of the gradient descent algorithm, or another optimization algorithm, to use these gradients to update the parameters.

For the MLP, the computational cost is dominated by the cost of matrix multiplication. During the forward propagation stage, we multiply by each weight matrix, resulting in O(w) multiply-adds, where w is the number of weights. During the backward propagation stage, we multiply by the transpose of each weight matrix, which has the same computational cost. The main memory cost of the algorithm is that we need to store the input to the nonlinearity of the hidden layer.
This value is stored from the time it is computed until the backward pass has returned to the same point. The memory cost is thus O(mn_h), where m is the number of examples in the minibatch and n_h is the number of hidden units.
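The gradient trace above (H⊤G on the short branch; GW^(2)⊤, the relu zeroing, and X⊤G′ on the long branch; plus the 2λW terms) can be written out explicitly. This sketch assumes a softmax cross-entropy for the cross_entropy operation, which makes the gradient G on the unnormalized log probabilities (q − p)/m for the minibatch mean:

```python
import numpy as np

def mlp_grads(X, y, W1, W2, lam):
    """Gradients of the cost in Eq. 6.56 via the two backward paths in the text."""
    m = len(y)
    U1 = X @ W1
    H = np.maximum(0, U1)
    U2 = H @ W2
    # G: gradient on the unnormalized log probabilities (softmax CE assumed).
    q = np.exp(U2 - U2.max(axis=1, keepdims=True))
    q /= q.sum(axis=1, keepdims=True)
    p = np.zeros_like(q)
    p[np.arange(m), y] = 1.0
    G = (q - p) / m
    # Shorter branch: H^T G onto W2; weight decay contributes 2*lam*W2.
    dW2 = H.T @ G + 2 * lam * W2
    # Longer branch: G W2^T, zeroed where U1 < 0, then X^T G' onto W1.
    Gp = (G @ W2.T) * (U1 > 0)
    dW1 = X.T @ Gp + 2 * lam * W1
    return dW1, dW2
```

A finite-difference check against the forward-pass cost is a convenient way to confirm both branches were combined correctly.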
Our description of the back-propagation algorithm here is simpler than the implementations actually used in practice.

As noted above, we have restricted the definition of an operation to be a function that returns a single tensor. Most software implementations need to support operations that can return more than one tensor. For example, if we wish to compute both the maximum value in a tensor and the index of that value, it is best to compute both in a single pass through memory, so it is most efficient to implement this procedure as a single operation with two outputs.

We have not described how to control the memory consumption of back-propagation. Back-propagation often involves summation of many tensors together.
all of them would b e added in often a second step.summation The naiv naiveeofapproac approach h has an overly In the naive approac h, each of these tensors would b e computed separately , then high memory b ottleneck that can b e avoided by main maintaining taining a single buffer and all of them would b e added in a second step. The naiv e approac h has an o verly adding each value to that buffer as it is computed. high memory b ottleneck that can b e avoided by maintaining a single buffer and Real-w Real-world orld implementations adding each value to that buffer of as back-propagation it is computed. also need to handle various data types, such as 32-bit floating p oint, 64-bit floating p oin oint, t, and integer values. Real-w orld implementations of back-propagation also need handle various The p olicy for handling each of these typ ypes es tak takes es sp special ecial care totodesign. data types, such as 32-bit floating p oint, 64-bit floating p oint, and integer values. Some op operations erations ha have ve undefined gradients, and it is imp important ortant to track these The p olicy for handling each of these typ es takes sp ecial care to design. cases and determine whether the gradient requested by the user is undefined. Some op erations have undefined gradients, and it is imp ortant to track these Various other technicalities make real-world differentiation more complicated. cases and determine whether the gradient requested by the user is undefined. These technicalities are not insurmoun insurmountable, table, and this chapter has describ described ed the key V arious other technicalities make real-world differentiation more complicated. in intellectual tellectual to tools ols needed to compute deriv derivatives, atives, but it is imp important ortant to b e aw aware are These technicalities are not insurmoun table, and this chapter has describ ed the k ey that many more subtleties exist. 
intellectual tools needed to compute derivatives, but it is important to be aware that many more subtleties exist.

The deep learning community has been somewhat isolated from the broader computer science community and has largely developed its own cultural attitudes concerning how to perform differentiation. More generally, the field of automatic differentiation is concerned with how to compute derivatives algorithmically. The back-propagation algorithm described here is only one approach to automatic differentiation. It is a special case of a broader class of techniques called reverse mode accumulation. Other approaches evaluate the subexpressions of the chain rule in different orders. In general, determining the order of evaluation that results in the lowest computational cost is a difficult problem. Finding the optimal sequence of operations to compute the gradient is NP-complete (Naumann, 2008), in the
CHAPTER 6. DEEP FEEDFORWARD NETWORKS
sense that it may require simplifying algebraic expressions into their least expensive form.

For example, suppose we have variables p1, p2, ..., pn representing probabilities and variables z1, z2, ..., zn representing unnormalized log probabilities. Suppose we define

    q_i = exp(z_i) / Σ_i exp(z_i),    (6.57)

where we build the softmax function out of exponentiation, summation and division operations, and construct a cross-entropy loss J = −Σ_i p_i log q_i. A human mathematician can observe that the derivative of J with respect to z_i takes a very simple form: q_i − p_i. The back-propagation algorithm is not capable of simplifying the gradient this way, and will instead explicitly propagate gradients through all of the logarithm and exponentiation operations in the original graph. Some software libraries such as Theano (Bergstra et al., 2010; Bastien et al., 2012) are able to
perform some kinds of algebraic substitution to improve over the graph proposed by the pure back-propagation algorithm.

When the forward graph G has a single output node and each partial derivative ∂u^(i)/∂u^(j) can be computed with a constant amount of computation, back-propagation guarantees that the number of computations for the gradient computation is of the same order as the number of computations for the forward computation: this can be seen in Algorithm 6.2 because each local partial derivative ∂u^(i)/∂u^(j) needs to be computed only once along with an associated multiplication and addition for the recursive chain-rule formulation (Eq. 6.49). The overall computation is therefore O(# edges). However, it can potentially be reduced by simplifying the computational graph constructed by back-propagation, and this is an NP-complete task.
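The claim above that the cross-entropy-of-softmax gradient simplifies to q_i − p_i is easy to verify numerically. The sketch below (assuming NumPy; the function names are illustrative, not from any particular library) compares the simplified form against central finite differences taken through the composed exp/sum/log graph:

```python
import numpy as np

def softmax(z):
    # shift by the max for numerical stability
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(p, z):
    # J = -sum_i p_i log q_i, with q = softmax(z) built from exp, sum and divide
    return -np.sum(p * np.log(softmax(z)))

def simplified_grad(p, z):
    # the form a human mathematician derives (valid when p sums to 1):
    # dJ/dz_i = q_i - p_i
    return softmax(z) - p

def numeric_grad(p, z, eps=1e-6):
    # central finite differences, standing in for explicit propagation
    # through every log and exp node of the original graph
    g = np.zeros_like(z)
    for i in range(z.size):
        d = np.zeros_like(z)
        d[i] = eps
        g[i] = (cross_entropy(p, z + d) - cross_entropy(p, z - d)) / (2 * eps)
    return g

z = np.array([1.0, -0.5, 2.0])
p = np.array([0.2, 0.3, 0.5])   # a valid probability vector (sums to 1)
```

Both paths compute the same gradient; the simplified form simply avoids visiting every intermediate node, which is exactly the kind of rewrite a library must discover by algebraic substitution.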
Implementations such as Theano and TensorFlow use heuristics based on matching known simplification patterns in order to iteratively attempt to simplify the graph.

We defined back-propagation only for the computation of a gradient of a scalar output, but back-propagation can be extended to compute a Jacobian (either of k different scalar nodes in the graph, or of a tensor-valued node containing k values). A naive implementation may then need k times more computation: for each scalar internal node in the original forward graph, the naive implementation computes k gradients instead of a single gradient. When the number of outputs
of the graph is larger than the number of inputs, it is sometimes preferable to use another form of automatic differentiation called forward mode accumulation. Forward mode computation has been proposed for obtaining real-time computation of gradients in recurrent networks, for example (Williams and Zipser, 1989). This also avoids the need to store the values and gradients for the whole graph, trading off computational efficiency for memory. The relationship between forward mode
and backward mode is analogous to the relationship between left-multiplying versus right-multiplying a sequence of matrices, such as

    ABCD,    (6.58)

where the matrices can be thought of as Jacobian matrices. For example, if D is a column vector while A has many rows, this corresponds to a graph with a single output and many inputs, and starting the multiplications from the end and going backwards only requires matrix-vector products. This corresponds to the backward mode. Instead, starting to multiply from the left would involve a series of matrix-matrix products, which makes the whole computation much more expensive. However, if A has fewer rows than D has columns, it is cheaper to run the multiplications left-to-right, corresponding to the forward mode.
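The cost asymmetry between the two association orders can be made concrete with a small operation-counting sketch (the helper functions are hypothetical, written for illustration). With A, B, C taken as n × n Jacobians and D an n × 1 column vector, reducing the chain from the end needs only three matrix-vector products, while reducing it from the left pays for two full matrix-matrix products first:

```python
def cost_left_to_right(shapes):
    """Multiply-add count for reducing a Jacobian chain front-to-back
    (forward mode). shapes is a list of (rows, cols) pairs."""
    total = 0
    rows, cols = shapes[0]
    for r, c in shapes[1:]:
        assert cols == r, "inner dimensions must match"
        total += rows * r * c    # cost of (rows x r) @ (r x c)
        cols = c
    return total

def cost_right_to_left(shapes):
    """Multiply-add count for reducing the same chain back-to-front
    (backward mode, as in back-propagation)."""
    total = 0
    rows, cols = shapes[-1]
    for r, c in reversed(shapes[:-1]):
        assert c == rows, "inner dimensions must match"
        total += r * c * cols    # cost of (r x c) @ (c x cols)
        rows = r
    return total

# A, B, C are n x n Jacobians; D is an n x 1 column vector,
# matching the ABCD example in the text
n = 1000
shapes = [(n, n), (n, n), (n, n), (n, 1)]
```

For n = 1000 the right-to-left order costs 3n² multiply-adds versus 2n³ + n² for left-to-right, a factor of roughly n in this sketch.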
In many communities outside of machine learning, it is more common to implement differentiation software that acts directly on traditional programming language code, such as Python or C code, and automatically generates programs that differentiate functions written in these languages. In the deep learning community, computational graphs are usually represented by explicit data structures created by specialized libraries. The specialized approach has the drawback of requiring the library developer to define the methods for every operation and limiting the user of the library to only those operations that have been defined. However, the specialized approach also has the benefit of allowing customized back-propagation rules to be developed for each operation, allowing the developer to improve speed or stability in non-obvious ways that an automatic procedure would presumably be unable to replicate.
Back-propagation is therefore not the only way or the optimal way of computing the gradient, but it is a very practical method that continues to serve the deep learning community very well. In the future, differentiation technology for deep networks may improve as deep learning practitioners become more aware of advances in the broader field of automatic differentiation.

Some software frameworks support the use of higher-order derivatives. Among the deep learning software frameworks, this includes at least Theano and TensorFlow. These libraries use the same kind of data structure to describe the expressions for derivatives as they use to describe the original function being differentiated. This
means that the symbolic differentiation machinery can be applied to derivatives.

In the context of deep learning, it is rare to compute a single second derivative of a scalar function. Instead, we are usually interested in properties of the Hessian
matrix. If we have a function f : R^n → R, then the Hessian matrix is of size n × n. In typical deep learning applications, n will be the number of parameters in the model, which could easily number in the billions. The entire Hessian matrix is thus infeasible to even represent.

Instead of explicitly computing the Hessian, the typical deep learning approach is to use Krylov methods. Krylov methods are a set of iterative techniques for performing various operations like approximately inverting a matrix or finding approximations to its eigenvectors or eigenvalues, without using any operation other than matrix-vector products.

In order to use Krylov methods on the Hessian, we only need to be able to compute the product between the Hessian matrix H and an arbitrary vector v. A straightforward technique (Christianson, 1992) for doing so is to compute

    Hv = ∇_x [ (∇_x f(x))^⊤ v ].    (6.59)
Both of the gradient computations in this expression may be computed automatically by the appropriate software library. Note that the outer gradient expression takes the gradient of a function of the inner gradient expression.

If v is itself a vector produced by a computational graph, it is important to specify that the automatic differentiation software should not differentiate through the graph that produced v.

While computing the Hessian is usually not advisable, it is possible to do with Hessian-vector products. One simply computes He^(i) for all i = 1, ..., n, where e^(i) is the one-hot vector with e_i^(i) = 1 and all other entries equal to 0.
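Equation 6.59 is normally implemented with two automatic-differentiation passes. As a dependency-light sketch (assuming NumPy; the quadratic f and the finite-difference outer gradient are illustrative stand-ins for the library's autodiff machinery), we can check the identity on f(x) = ½ xᵀAx, whose Hessian is known in closed form, and also recover one full Hessian column with a one-hot vector as described above:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))
H = 0.5 * (A + A.T)            # exact Hessian of f(x) = 0.5 * x^T A x

def grad_f(x):
    # analytic gradient of f; in practice this inner gradient would come
    # from the library's first differentiation pass
    return 0.5 * (A + A.T) @ x

def hessian_vector_product(x, v, eps=1e-6):
    # Hv = grad_x[(grad_x f(x))^T v]; the outer gradient is approximated
    # here by central differences instead of a second autodiff pass
    return (grad_f(x + eps * v) - grad_f(x - eps * v)) / (2 * eps)

x = rng.standard_normal(n)
v = rng.standard_normal(n)

# recovering the first Hessian column with the one-hot vector e^(1)
e0 = np.zeros(n)
e0[0] = 1.0
```

The key property is that only matrix-vector products with H are ever formed; the n × n Hessian itself never needs to be materialized, which is what makes Krylov methods applicable at deep learning scale.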
6.6 Historical Notes
Feedforward networks can be seen as efficient nonlinear function approximators based on using gradient descent to minimize the error in a function approximation. From this point of view, the modern feedforward network is the culmination of centuries of progress on the general function approximation task.

The chain rule that underlies the back-propagation algorithm was invented in the 17th century (Leibniz, 1676; L’Hôpital, 1696). Calculus and algebra have long been used to solve optimization problems in closed form, but gradient descent was not introduced as a technique for iteratively approximating the solution to optimization problems until the 19th century (Cauchy, 1847).

Beginning in the 1940s, these function approximation techniques were used to motivate machine learning models such as the perceptron. However, the earliest
models were based on linear models. Critics including Marvin Minsky pointed out several of the flaws of the linear model family, such as its inability to learn the XOR function, which led to a backlash against the entire neural network approach.

Learning nonlinear functions required the development of a multilayer perceptron and a means of computing the gradient through such a model. Efficient applications of the chain rule based on dynamic programming began to appear in the 1960s and 1970s, mostly for control applications (Kelley, 1960; Bryson and Denham, 1961; Dreyfus, 1962; Bryson and Ho, 1969; Dreyfus, 1973) but also for sensitivity analysis (Linnainmaa, 1976). Werbos (1981) proposed applying these techniques to training artificial neural networks. The idea was finally developed in practice after being independently rediscovered in different ways (LeCun, 1985; Parker, 1985; Rumelhart et al., 1986a).
The book Parallel Distributed Processing presented the results of some of the first successful experiments with back-propagation in a chapter (Rumelhart et al., 1986b) that contributed greatly to the popularization of back-propagation and initiated a very active period of research in multilayer neural networks. However, the ideas put forward by the authors of that book, and in particular by Rumelhart and Hinton, go much beyond back-propagation. They include crucial ideas about the possible computational implementation of several central aspects of cognition and learning, which came under the name of “connectionism” because of the importance given the connections between neurons as the locus of learning and memory.
In particular, these ideas include the notion of distributed representation (Hinton et al., 1986).

Following the success of back-propagation, neural network research gained popularity and reached a peak in the early 1990s. Afterwards, other machine learning techniques became more popular until the modern deep learning renaissance that began in 2006.

The core ideas behind modern feedforward networks have not changed substantially since the 1980s. The same back-propagation algorithm and the same approaches to gradient descent are still in use. Most of the improvement in neural network performance from 1986 to 2015 can be attributed to two factors. First, larger datasets have reduced the degree to which statistical generalization is a challenge for neural networks. Second, neural networks have become much larger, due to more powerful
computers and better software infrastructure. However, a small number of algorithmic changes have improved the performance of neural networks noticeably.

One of these algorithmic changes was the replacement of mean squared error with the cross-entropy family of loss functions. Mean squared error was popular in the 1980s and 1990s, but was gradually replaced by cross-entropy losses and the
principle of maximum likelihood as ideas spread between the statistics community and the machine learning community. The use of cross-entropy losses greatly improved the performance of models with sigmoid and softmax outputs, which had previously suffered from saturation and slow learning when using the mean squared error loss.

The other major algorithmic change that has greatly improved the performance of feedforward networks was the replacement of sigmoid hidden units with piecewise linear hidden units, such as rectified linear units. Rectification using the max{0, z} function was introduced in early neural network models and dates back at least as far as the Cognitron and Neocognitron (Fukushima, 1975, 1980). These early models did not use rectified linear units, but instead applied rectification to nonlinear functions.
Despite the early popularity of rectification, rectification was largely replaced by sigmoids in the 1980s, perhaps because sigmoids perform better when neural networks are very small. As of the early 2000s, rectified linear units were avoided due to a somewhat superstitious belief that activation functions with non-differentiable points must be avoided. This began to change in about 2009. Jarrett et al. (2009) observed that “using a rectifying nonlinearity is the single most important factor in improving the performance of a recognition system” among several different factors of neural network architecture design.

For small datasets, Jarrett et al. (2009) observed that using rectifying nonlinearities is even more important than learning the weights of the hidden layers. Random weights are sufficient to propagate useful information through a rectified
linear network, allowing the classifier layer at the top to learn how to map different feature vectors to class identities.

When more data is available, learning begins to extract enough useful knowledge to exceed the performance of randomly chosen parameters. Glorot et al. (2011a) showed that learning is far easier in deep rectified linear networks than in deep networks that have curvature or two-sided saturation in their activation functions.

Rectified linear units are also of historical interest because they show that neuroscience has continued to have an influence on the development of deep learning algorithms. Glorot et al. (2011a) motivate rectified linear units from biological considerations. The half-rectifying nonlinearity was intended to capture these properties of biological neurons: 1) For some inputs, biological neurons are completely inactive.
2) For some inputs, a biological neuron’s output is proportional to its input. 3) Most of the time, biological neurons operate in the regime where they are inactive (i.e., they should have sparse activations).

When the modern resurgence of deep learning began in 2006, feedforward networks continued to have a bad reputation. From about 2006-2012, it was widely
believed that feedforward networks would not perform well unless they were assisted by other models, such as probabilistic models. Today, it is now known that with the right resources and engineering practices, feedforward networks perform very well. Today, gradient-based learning in feedforward networks is used as a tool to develop probabilistic models, such as the variational autoencoder and generative adversarial networks, described in Chapter 20. Rather than being viewed as an unreliable technology that must be supported by other techniques, gradient-based learning in feedforward networks has been viewed since 2012 as a powerful technology that may be applied to many other machine learning tasks. In 2006, the community used unsupervised learning to support supervised learning, and now, ironically, it
community is to many use sup supervised ervised to supp support ort unsup unsupervised ervised used unsup ervised learning to supp ort sup ervised learning, and now, ironically, it Feedforward netw networks orks contin continue ue to ha have ve unfulfilled p otential. In the future, we is more common to use sup ervised learning to supp ort unsup ervised learning. exp expect ect they will b e applied to man many y more tasks, and that adv advances ances in optimization Feedforward netw contin ue improv to havee unfulfilled p otential.even In the future, we algorithms and mo model delorks design will improve their p erformance further. This ect they b e applied to man more tasks, andork that advances in dels. optimization cexp hapter has will primarily describ described ed ythe neural netw network family of mo models. In the algorithms and mo del design will improv e their p erformance even further. subsequen subsequentt chapters, we turn to how to use these mo models—ho dels—ho dels—how w to regularizeThis and ctrain hapter has primarily describ ed the neural netw ork family of mo dels. In the them. subsequent chapters, we turn to how to use these mo dels—how to regularize and train them.
Chapter 7
Regularization for Deep Learning

A central problem in machine learning is how to make an algorithm that will perform well not just on the training data, but also on new inputs. Many strategies used in machine learning are explicitly designed to reduce the test error, possibly at the expense of increased training error. These strategies are known collectively as regularization. As we will see, there are a great many forms of regularization available to the deep learning practitioner. In fact, developing more effective regularization strategies has been one of the major research efforts in the field.

Chapter 5 introduced the basic concepts of generalization, underfitting, overfitting, bias, variance and regularization. If you are not already familiar with these notions, please refer to that chapter before continuing with this one.

In this chapter, we describe regularization in more detail, focusing on regularization strategies for deep models or models that may be used as building blocks to form deep models.

Some sections of this chapter deal with standard concepts in machine learning. If you are already familiar with these concepts, feel free to skip the relevant sections. However, most of this chapter is concerned with the extension of these basic concepts to the particular case of neural networks.

In Sec. 5.2.2, we defined regularization as "any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error." There are many regularization strategies. Some put extra constraints on a machine learning model, such as adding restrictions on the parameter values. Some add extra terms in the objective function that can be thought of as corresponding to a soft constraint on the parameter values. If chosen carefully, these extra constraints and penalties can lead to improved performance
on the test set. Sometimes these constraints and penalties are designed to encode specific kinds of prior knowledge. Other times, these constraints and penalties are designed to express a generic preference for a simpler model class in order to promote generalization. Sometimes penalties and constraints are necessary to make an underdetermined problem determined. Other forms of regularization, known as ensemble methods, combine multiple hypotheses that explain the training data.

In the context of deep learning, most regularization strategies are based on regularizing estimators. Regularization of an estimator works by trading increased bias for reduced variance. An effective regularizer is one that makes a profitable trade, reducing variance significantly while not overly increasing the bias. When we discussed generalization and overfitting in Chapter 5, we focused on three situations, where the model family being trained either (1) excluded the true data generating process—corresponding to underfitting and inducing bias, or (2) matched the true data generating process, or (3) included the generating process but also many other possible generating processes—the overfitting regime where variance rather than bias dominates the estimation error. The goal of regularization is to take a model from the third regime into the second regime.

In practice, an overly complex model family does not necessarily include the target function or the true data generating process, or even a close approximation of either. We almost never have access to the true data generating process so we can never know for sure if the model family being estimated includes the generating process or not. However, most applications of deep learning algorithms are to domains where the true data generating process is almost certainly outside the model family. Deep learning algorithms are typically applied to extremely complicated domains such as images, audio sequences and text, for which the true generation process essentially involves simulating the entire universe. To some extent, we are always trying to fit a square peg (the data generating process) into a round hole (our model family).

What this means is that controlling the complexity of the model is not a simple matter of finding the model of the right size, with the right number of parameters. Instead, we might find—and indeed in practical deep learning scenarios, we almost always do find—that the best fitting model (in the sense of minimizing generalization error) is a large model that has been regularized appropriately.

We now review several strategies for how to create such a large, deep, regularized model.
7.1 Parameter Norm Penalties
Regularization has been used for decades prior to the advent of deep learning. Linear models such as linear regression and logistic regression allow simple, straightforward, and effective regularization strategies.

Many regularization approaches are based on limiting the capacity of models, such as neural networks, linear regression, or logistic regression, by adding a parameter norm penalty Ω(θ) to the objective function J. We denote the regularized objective function by J̃:

    J̃(θ; X, y) = J(θ; X, y) + αΩ(θ)    (7.1)

where α ∈ [0, ∞) is a hyperparameter that weights the relative contribution of the norm penalty term, Ω, relative to the standard objective function J(x; θ). Setting α to 0 results in no regularization. Larger values of α correspond to more regularization.

When our training algorithm minimizes the regularized objective function J̃ it will decrease both the original objective J on the training data and some measure of the size of the parameters θ (or some subset of the parameters). Different choices for the parameter norm Ω can result in different solutions being preferred. In this section, we discuss the effects of the various norms when used as penalties on the model parameters.

Before delving into the regularization behavior of different norms, we note that for neural networks, we typically choose to use a parameter norm penalty Ω that penalizes only the weights of the affine transformation at each layer and leaves the biases unregularized. The biases typically require less data to fit accurately than the weights. Each weight specifies how two variables interact. Fitting the weight well requires observing both variables in a variety of conditions. Each bias controls only a single variable. This means that we do not induce too much variance by leaving the biases unregularized. Also, regularizing the bias parameters can introduce a significant amount of underfitting. We therefore use the vector w to indicate all of the weights that should be affected by a norm penalty, while the vector θ denotes all of the parameters, including both w and the unregularized parameters.

In the context of neural networks, it is sometimes desirable to use a separate penalty with a different α coefficient for each layer of the network. Because it can be expensive to search for the correct value of multiple hyperparameters, it is still reasonable to use the same weight decay at all layers just to reduce the search space.
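The convention of penalizing only the weights can be made concrete in a small sketch. This is an illustrative fragment of our own (the model, data, and function names are not from the book): a linear model with squared error whose penalty Ω is applied to the weight vector w but not to the bias b.

```python
import numpy as np

# Illustrative sketch: regularized objective J~ = J + alpha * Omega(theta),
# where Omega penalizes only the weights w, leaving the bias b unregularized.

def regularized_objective(w, b, X, y, alpha):
    residual = X @ w + b - y
    J = 0.5 * np.mean(residual ** 2)   # original objective J
    omega = 0.5 * np.dot(w, w)         # Omega(theta) = (1/2)||w||^2, bias excluded
    return J + alpha * omega           # J~ = J + alpha * Omega

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

w, b = np.zeros(3), 0.0
print(regularized_objective(w, b, X, y, alpha=0.1))
```

Setting alpha to 0 recovers the unregularized objective J, matching Eq. 7.1.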
7.1.1 L2 Parameter Regularization

We have already seen, in Sec. 5.2.2, one of the simplest and most common kinds
of parameter norm penalty: the L2 parameter norm penalty commonly known as weight decay.¹ This regularization strategy drives the weights closer to the origin by adding a regularization term Ω(θ) = (1/2)||w||²₂ to the objective function. In other academic communities, L2 regularization is also known as ridge regression or Tikhonov regularization.

We can gain some insight into the behavior of weight decay regularization by studying the gradient of the regularized objective function. To simplify the presentation, we assume no bias parameter, so θ is just w. Such a model has the following total objective function:

    J̃(w; X, y) = (α/2)wᵀw + J(w; X, y),    (7.2)

with the corresponding parameter gradient

    ∇_w J̃(w; X, y) = αw + ∇_w J(w; X, y).    (7.3)

To take a single gradient step (with learning rate ε) to update the weights, we perform this update:

    w ← w − ε(αw + ∇_w J(w; X, y)).    (7.4)

Written another way, the update is:

    w ← (1 − εα)w − ε∇_w J(w; X, y).    (7.5)
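This single-step weight decay update can be iterated numerically. The following is an illustrative sketch with toy values of our own choosing (not code from the book), applied to a simple quadratic objective:

```python
import numpy as np

# Illustrative sketch of the weight-decay gradient step:
# w <- (1 - eps * alpha) * w - eps * grad_J(w)

def weight_decay_step(w, grad_J, eps, alpha):
    return (1.0 - eps * alpha) * w - eps * grad_J(w)

# Toy quadratic objective J(w) = 0.5 * (w - w_star)^T H (w - w_star)
H = np.diag([10.0, 0.1])
w_star = np.array([1.0, 1.0])
grad_J = lambda w: H @ (w - w_star)

w = np.zeros(2)
for _ in range(10000):
    w = weight_decay_step(w, grad_J, eps=0.05, alpha=0.1)
print(w)  # approx [0.990, 0.5]: the low-curvature component is shrunk strongly
```

At convergence, each component of w* is shrunk toward zero by an amount that depends on the curvature of J along that axis.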
We can see that the addition of the weight decay term has modified the learning rule to multiplicatively shrink the weight vector by a constant factor on each step, just before performing the usual gradient update. This describes what happens in a single step. But what happens over the entire course of training?

We will further simplify the analysis by making a quadratic approximation to the objective function in the neighborhood of the value of the weights that obtains minimal unregularized training cost, w* = arg min_w J(w). If the objective function is truly quadratic, as in the case of fitting a linear regression model with mean squared error, then the approximation is perfect.

    Ĵ(θ) = J(w*) + (1/2)(w − w*)ᵀH(w − w*)    (7.6)

¹ More generally, we could regularize the parameters to be near any specific point in space and, surprisingly, still get a regularization effect, but better results will be obtained for a value closer to the true one, with zero being a default value that makes sense when we do not know if the correct value should be positive or negative. Since it is far more common to regularize the model parameters towards zero, we will focus on this special case in our exposition.
where H is the Hessian matrix of J with respect to w evaluated at w*. There is no first-order term in this quadratic approximation, because w* is defined to be a minimum, where the gradient vanishes. Likewise, because w* is the location of a minimum of J, we can conclude that H is positive semidefinite.

The minimum of Ĵ occurs where its gradient

    ∇_w Ĵ(w) = H(w − w*)    (7.7)

is equal to 0.
To study the effect of weight decay, we modify Eq. 7.7 by adding the weight decay gradient. We can now solve for the minimum of the regularized version of Ĵ. We use the variable w̃ to represent the location of the minimum.

    αw̃ + H(w̃ − w*) = 0    (7.8)
    (H + αI)w̃ = Hw*    (7.9)
    w̃ = (H + αI)⁻¹Hw*.    (7.10)
As α approaches 0, the regularized solution w̃ approaches w*. But what happens as α grows? Because H is real and symmetric, we can decompose it into a diagonal matrix Λ and an orthonormal basis of eigenvectors, Q, such that H = QΛQᵀ. Applying the decomposition to Eq. 7.10, we obtain:

    w̃ = (QΛQᵀ + αI)⁻¹QΛQᵀw*    (7.11)
      = [Q(Λ + αI)Qᵀ]⁻¹QΛQᵀw*    (7.12)
      = Q(Λ + αI)⁻¹ΛQᵀw*.    (7.13)

We see that the effect of weight decay is to rescale w* along the axes defined by the eigenvectors of H. Specifically, the component of w* that is aligned with the i-th eigenvector of H is rescaled by a factor of λ_i/(λ_i + α). (You may wish to review how this kind of scaling works, first explained in Fig. 2.3.)

Along the directions where the eigenvalues of H are relatively large, for example, where λ_i ≫ α, the effect of regularization is relatively small. However, components with λ_i ≪ α will be shrunk to have nearly zero magnitude. This effect is illustrated in Fig. 7.1.

Only directions along which the parameters contribute significantly to reducing the objective function are preserved relatively intact. In directions that do not contribute to reducing the objective function, a small eigenvalue of the Hessian tells us that movement in this direction will not significantly increase the gradient.
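This rescaling by λ_i/(λ_i + α) is easy to check numerically. The following sketch (an illustrative fragment of our own, not from the book) verifies that the eigendecomposition form of Eq. 7.13 agrees with the direct solution of Eq. 7.10:

```python
import numpy as np

# Illustrative check: for a symmetric positive definite Hessian H,
# w~ = (H + alpha*I)^{-1} H w*  equals  Q (Lambda + alpha*I)^{-1} Lambda Q^T w*,
# i.e. each eigencomponent of w* is rescaled by lambda_i / (lambda_i + alpha).

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
H = A @ A.T + np.eye(4)          # symmetric positive definite Hessian
w_star = rng.normal(size=4)
alpha = 0.5

# Direct solution (Eq. 7.10)
w_tilde = np.linalg.solve(H + alpha * np.eye(4), H @ w_star)

# Eigendecomposition form (Eq. 7.13)
lam, Q = np.linalg.eigh(H)
w_tilde_eig = Q @ ((lam / (lam + alpha)) * (Q.T @ w_star))

print(np.allclose(w_tilde, w_tilde_eig))  # True
```

Each rescaling factor lam / (lam + alpha) lies strictly between 0 and 1, so every eigencomponent is shrunk, and components with small eigenvalues are shrunk the most.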
[Figure 7.1: contour plot in the (w1, w2) plane showing the unregularized optimum w* and the regularized solution w̃.]

Figure 7.1: An illustration of the effect of L2 (or weight decay) regularization on the value of the optimal w. The solid ellipses represent contours of equal value of the unregularized objective. The dotted circles represent contours of equal value of the L2 regularizer. At the point w̃, these competing objectives reach an equilibrium. In the first dimension, the eigenvalue of the Hessian of J is small. The objective function does not increase much when moving horizontally away from w*. Because the objective function does not express a strong preference along this direction, the regularizer has a strong effect on this axis. The regularizer pulls w1 close to zero. In the second dimension, the objective function is very sensitive to movements away from w*. The corresponding eigenvalue is large, indicating high curvature. As a result, weight decay affects the position of w2 relatively little.
Components of the weight vector corresponding to such unimportant directions are decayed away through the use of the regularization throughout training.

So far we have discussed weight decay in terms of its effect on the optimization of an abstract, general, quadratic cost function. How do these effects relate to machine learning in particular? We can find out by studying linear regression, a model for which the true cost function is quadratic and therefore amenable to the same kind of analysis we have used so far. Applying the analysis again, we will be able to obtain a special case of the same results, but with the solution now phrased in terms of the training data. For linear regression, the cost function is the sum of squared errors:

    (Xw − y)ᵀ(Xw − y).    (7.14)

When we add L2 regularization, the objective function changes to

    (Xw − y)ᵀ(Xw − y) + (1/2)αwᵀw.    (7.15)

This changes the normal equations for the solution from

    w = (XᵀX)⁻¹Xᵀy    (7.16)

to

    w = (XᵀX + αI)⁻¹Xᵀy.    (7.17)

The matrix XᵀX in Eq. 7.16 is proportional to the covariance matrix (1/m)XᵀX. Using L2 regularization replaces this matrix with (XᵀX + αI)⁻¹ in Eq. 7.17. The new matrix is the same as the original one, but with the addition of α to the diagonal. The diagonal entries of this matrix correspond to the variance of each input feature. We can see that L2 regularization causes the learning algorithm to "perceive" the input X as having higher variance, which makes it shrink the weights on features whose covariance with the output target is low compared to this added variance.
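The closed-form solutions in Eqs. 7.16 and 7.17 can be sketched directly in NumPy. This is an illustrative fragment with toy data of our own (not from the book):

```python
import numpy as np

# Illustrative comparison of the ordinary least squares solution (Eq. 7.16)
# with the weight-decay / ridge solution (Eq. 7.17) on toy data.

rng = np.random.default_rng(0)
m, n = 50, 3
X = rng.normal(size=(m, n))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=m)

def ols(X, y):
    # w = (X^T X)^{-1} X^T y
    return np.linalg.solve(X.T @ X, X.T @ y)

def ridge(X, y, alpha):
    # w = (X^T X + alpha I)^{-1} X^T y
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n), X.T @ y)

w_ols = ols(X, y)
w_ridge = ridge(X, y, alpha=10.0)
print(np.linalg.norm(w_ridge) < np.linalg.norm(w_ols))  # regularization shrinks w
```

Because each eigencomponent of the solution is scaled by a factor strictly less than one, the norm of the regularized solution is always smaller than the norm of the unregularized one.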
7.1.2 L1 Regularization

While L2 weight decay is the most common form of weight decay, there are other ways to penalize the size of the model parameters. Another option is to use L1 regularization.

Formally, L1 regularization on the model parameter w is defined as:

    Ω(θ) = ||w||_1 = Σ_i |w_i|,    (7.18)
that is, as the sum of absolute values of the individual parameters.² We will now discuss the effect of L¹ regularization on the simple linear regression model, with no bias parameter, that we studied in our analysis of L² regularization. In particular, we are interested in delineating the differences between the L¹ and L² forms of regularization. As with L² weight decay, L¹ weight decay controls the strength of the regularization by scaling the penalty Ω using a positive hyperparameter α. Thus, the regularized objective function J̃(w; X, y) is given by

J̃(w; X, y) = α||w||_1 + J(w; X, y),   (7.19)

with the corresponding gradient (actually, sub-gradient):

∇_w J̃(w; X, y) = α sign(w) + ∇_w J(X, y; w),   (7.20)

where sign(w) is simply the sign of w applied element-wise.

By inspecting Eq. 7.20, we can see immediately that the effect of L¹ regularization is quite different from that of L² regularization. Specifically, we can see that the regularization contribution to the gradient no longer scales linearly with each w_i; instead it is a constant factor with a sign equal to sign(w_i). One consequence of this form of the gradient is that we will not necessarily see clean algebraic solutions to quadratic approximations of J(X, y; w) as we did for L² regularization.

Our simple linear model has a quadratic cost function that we can represent via its Taylor series. Alternatively, we could imagine that this is a truncated Taylor series approximating the cost function of a more sophisticated model.
The gradient in this setting is given by

∇_w Ĵ(w) = H(w − w*),   (7.21)

where, again, H is the Hessian matrix of J with respect to w evaluated at w*.

Because the L¹ penalty does not admit clean algebraic expressions in the case of a fully general Hessian, we will also make the further simplifying assumption that the Hessian is diagonal, H = diag([H_{1,1}, . . . , H_{n,n}]), where each H_{i,i} > 0. This assumption holds if the data for the linear regression problem has been preprocessed to remove all correlation between the input features, which may be accomplished using PCA.

² As with L² regularization, we could regularize the parameters towards a value that is not zero, but instead towards some parameter value w^(o). In that case the L¹ regularization would introduce the term Ω(θ) = ||w − w^(o)||_1 = ∑_i |w_i − w_i^(o)|.
Our quadratic approximation of the L¹ regularized objective function decomposes into a sum over the parameters:

Ĵ(w; X, y) = J(w*; X, y) + ∑_i [ (1/2) H_{i,i} (w_i − w*_i)² + α|w_i| ].   (7.22)

The problem of minimizing this approximate cost function has an analytical solution (for each dimension i), with the following form:

w_i = sign(w*_i) max( |w*_i| − α/H_{i,i}, 0 ).   (7.23)

Consider the situation where w*_i > 0 for all i. There are two possible outcomes:

1. w*_i ≤ α/H_{i,i}. Here the optimal value of w_i under the regularized objective is simply w_i = 0. This occurs because the contribution of J(w; X, y) to the regularized objective J̃(w; X, y) is overwhelmed, in direction i, by the L¹ regularization, which pushes the value of w_i to zero.

2. w*_i > α/H_{i,i}. Here the regularization does not move the optimal value of w_i to zero, but instead just shifts it in that direction by a distance equal to α/H_{i,i}.
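Both outcomes above follow from Eq. 7.23, which is the familiar soft-thresholding operation. A minimal NumPy sketch, with made-up example values for w*, the diagonal Hessian entries, and α:

```python
import numpy as np

def l1_quadratic_minimizer(w_star, h_diag, alpha):
    # Per-dimension solution of Eq. 7.23:
    # w_i = sign(w*_i) * max(|w*_i| - alpha / H_ii, 0)
    return np.sign(w_star) * np.maximum(np.abs(w_star) - alpha / h_diag, 0.0)

# Hypothetical example: unregularized optimum and a diagonal Hessian.
w_star = np.array([3.0, 0.5, 0.2, -2.0])
h_diag = np.array([1.0, 1.0, 1.0, 1.0])
alpha = 1.0

w = l1_quadratic_minimizer(w_star, h_diag, alpha)
print(w)  # [ 2.  0.  0. -1.] -- coordinates with |w*_i| <= alpha/H_ii are zeroed
```

The middle two coordinates fall under outcome 1 and are clipped exactly to zero; the outer two fall under outcome 2 and are shifted toward zero by α/H_{i,i}.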
A similar process happens when w*_i < 0, but with the L¹ penalty making w_i less negative by α/H_{i,i}, or 0.

In comparison to L² regularization, L¹ regularization results in a solution that is more sparse. Sparsity in this context refers to the fact that some parameters have an optimal value of zero. The sparsity of L¹ regularization is a qualitatively different behavior than arises with L² regularization. Eq. 7.13 gave the solution w̃ for L² regularization. If we revisit that equation using the assumption of a diagonal Hessian H that we introduced for our analysis of L¹ regularization, we find that w̃_i = (H_{i,i} / (H_{i,i} + α)) w*_i. If w*_i was nonzero, then w̃_i remains nonzero. This demonstrates that L² regularization does not cause the parameters to become sparse, while L¹ regularization may do so for large enough α.

The sparsity property induced by L¹ regularization has been used extensively as a feature selection mechanism.
Feature selection simplifies a machine learning problem by choosing which subset of the available features should be used. In particular, the well known LASSO (Tibshirani, 1995) (least absolute shrinkage and selection operator) model integrates an L¹ penalty with a linear model and a least squares cost function. The L¹ penalty causes a subset of the weights to become zero, suggesting that the corresponding features may safely be discarded.
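A LASSO-style fit can be sketched with proximal gradient (ISTA) steps: a gradient step on the least squares cost, followed by the soft thresholding that the L¹ penalty induces. The dataset, step size, and α below are made-up example values in which only the first two features carry signal:

```python
import numpy as np

def lasso_ista(X, y, alpha, lr=0.005, n_steps=3000):
    # Minimize 0.5*||Xw - y||^2 + alpha*||w||_1 by proximal gradient descent:
    # a gradient step on the smooth least-squares term, then soft thresholding.
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        w = w - lr * X.T @ (X @ w - y)                          # gradient step
        w = np.sign(w) * np.maximum(np.abs(w) - lr * alpha, 0)  # soft threshold
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, -3.0, 0.0, 0.0, 0.0])  # only two features matter
w = lasso_ista(X, y, alpha=5.0)
print(w)  # the three irrelevant weights are driven exactly to zero
```

Because the final operation of each step is a hard threshold, irrelevant weights end up exactly zero rather than merely small, which is the feature-selection behavior described above.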
In Sec. 5.6.1, we saw that many regularization strategies can be interpreted as MAP Bayesian inference, and that in particular, L² regularization is equivalent to MAP Bayesian inference with a Gaussian prior on the weights. For L¹ regularization, the penalty αΩ(w) = α ∑_i |w_i| used to regularize a cost function is equivalent to the log-prior term that is maximized by MAP Bayesian inference when the prior is an isotropic Laplace distribution (Eq. 3.26) over w:

log p(w) = ∑_i log Laplace(w_i; 0, 1/α) = −α ∑_i |w_i| + n log α − n log 2.   (7.24)

From the point of view of learning via minimization with respect to w, we can ignore the log α − log 2 terms because they do not depend on w.

7.2 Norm Penalties as Constrained Optimization

Consider the cost function regularized by a parameter norm penalty:

J̃(θ; X, y) = J(θ; X, y) + αΩ(θ).   (7.25)
Recall from Sec. 4.4 that we can minimize a function subject to constraints by constructing a generalized Lagrange function, consisting of the original objective function plus a set of penalties. Each penalty is a product between a coefficient, called a Karush–Kuhn–Tucker (KKT) multiplier, and a function representing whether the constraint is satisfied. If we wanted to constrain Ω(θ) to be less than some constant k, we could construct a generalized Lagrange function

L(θ, α; X, y) = J(θ; X, y) + α(Ω(θ) − k).   (7.26)

The solution to the constrained problem is given by

θ* = arg min_θ max_{α, α≥0} L(θ, α).   (7.27)
As described in Sec. 4.4, solving this problem requires modifying both θ and α. Sec. 4.5 provides a worked example of linear regression with an L² constraint. Many different procedures are possible; some may use gradient descent, while others may use analytical solutions for where the gradient is zero. In all procedures, α must increase whenever Ω(θ) > k and decrease whenever Ω(θ) < k. All positive α encourage Ω(θ) to shrink. The optimal value α* will encourage Ω(θ) to shrink, but not so strongly as to make Ω(θ) become less than k.
To gain some insight into the effect of the constraint, we can fix α* and view the problem as just a function of θ:

θ* = arg min_θ L(θ, α*) = arg min_θ J(θ; X, y) + α*Ω(θ).   (7.28)

This is exactly the same as the regularized training problem of minimizing J̃. We can thus think of a parameter norm penalty as imposing a constraint on the weights. If Ω is the L² norm, then the weights are constrained to lie in an L² ball. If Ω is the L¹ norm, then the weights are constrained to lie in a region of limited L¹ norm. Usually we do not know the size of the constraint region that we impose by using weight decay with coefficient α* because the value of α* does not directly tell us the value of k. In principle, one can solve for k, but the relationship between k and α* depends on the form of J. While we do not know the exact size of the constraint region, we can control it roughly by increasing or decreasing α in order to grow or shrink the constraint region.
Larger α will result in a smaller constraint region. Smaller α will result in a larger constraint region.

Sometimes we may wish to use explicit constraints rather than penalties. As described in Sec. 4.4, we can modify algorithms such as stochastic gradient descent to take a step downhill on J(θ) and then project θ back to the nearest point that satisfies Ω(θ) < k. This can be useful if we have an idea of what value of k is appropriate and do not want to spend time searching for the value of α that corresponds to this k.
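The step-then-project scheme just described can be sketched for an L² norm ball constraint. The quadratic objective and the constraint radius k here are made-up example values:

```python
import numpy as np

def project_l2_ball(theta, k):
    # Project theta back to the nearest point with ||theta||_2 <= k.
    norm = np.linalg.norm(theta)
    return theta if norm <= k else theta * (k / norm)

def constrained_gd(grad_J, theta0, k, lr=0.1, n_steps=100):
    # A gradient step downhill on J, followed by reprojection onto the
    # constraint region, repeated until convergence.
    theta = theta0
    for _ in range(n_steps):
        theta = project_l2_ball(theta - lr * grad_J(theta), k)
    return theta

# Example: J(theta) = 0.5*||theta - c||^2, whose unconstrained minimum c
# lies outside the unit ball (||c|| = 5).
c = np.array([3.0, 4.0])
theta = constrained_gd(lambda t: t - c, np.zeros(2), k=1.0)
print(theta, np.linalg.norm(theta))  # converges to c/||c||, norm 1
```

The iterates are pulled toward c but are repeatedly projected back, so the solution lands on the boundary of the constraint region in the direction of the unconstrained optimum.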
When using a high learning rate, it is possible to enter a positive feedback loop in which large weights induce large gradients, which in turn induce large updates that consistently increase the size of the weights; θ then rapidly moves away from the origin until numerical overflow occurs. Explicit constraints with reprojection allow us to terminate this feedback loop after the weights have reached a certain magnitude. Hinton et al. (2012c) recommend using constraints combined with a high learning rate to allow rapid exploration of parameter space while maintaining some stability.

In particular, Hinton et al. (2012c) recommend constraining the norm of each column of the weight matrix of a neural net layer, rather than constraining the Frobenius norm of the entire weight matrix, a strategy introduced by Srebro and Shraibman (2005). Constraining the norm of each column separately prevents any one hidden unit from having very large weights. If we converted this constraint into a penalty in a Lagrange function, it would be similar to L² weight decay but with a separate KKT multiplier for the weights of each hidden unit. Each of these KKT multipliers would be dynamically updated separately to make each hidden unit obey the constraint. In practice, column norm limitation is always implemented as an explicit constraint with reprojection.
7.3
Regularization and Under-Constrained Problems
In cases, regularization and is necessary for machine learning Problems problems to be 7.3someRegularization Under-Constrained prop properly erly defined. Man Many y linear mo models dels in machine learning, including linear reIn some cases, regularization is necessary machine learning to be X >X gression and PCA, dep depend end on in inv verting theformatrix . This problems is not possible properly y linear dels incan machine learning, including linear reX >X is Man whenev whenever er defined. singular. Thismo matrix be singular whenev whenever er the data truly X gression and PCA, depend on inverting thethere matrix This is not possible X) has no variance in some direction, or when areX few fewer er .examples (ro (rows ws of ofX X X whenev er is singular. This matrix can b e singular whenev er the data truly than input features (columns of X). In this case, many forms of regularization has no ond variance someXdirection, > X + αI or when there are fewer examples (rows of X ) corresp correspond to in inv vin erting instead. This regularized matrix is guaran guaranteed teed than features (columns of X). In this case, many forms of regularization to be input in invertible. vertible. correspond to inverting X X + αI instead. This regularized matrix is guaranteed These linear problems ha hav ve closed form solutions when the relev relevant ant matrix to be invertible. is in invertible. vertible. It is also possible for a problem with no closed form solution to be These linear problems have isclosed form solutionsapplied when to thea relev ant matrix underdetermined. An example logistic regression problem where is inclasses vertible. is also possible for Ifa aproblem solution to be the areItlinearly separable. weigh weightt with vectornowclosed is ableform to achiev achieve e perfect underdetermined. example ishiev logistic regression appliedand to ahigher problem where w will classification, then 2An also ac achiev hieve e perfect classification likelihoo likelihood. 
d. w the classes are linearly separable. If a weigh t vector is able to achiev e p erfect An iterativ iterativee optimization pro procedure cedure lik likee sto stochastic chastic gradient descent will con contin tin tinually ually w classification, then 2 will also ac hiev e p erfect classification and higher likelihoo d. increase the magnitude of w and, in theory theory,, will never halt. In practice, a numerical An iterativtation e optimization pro ceduret lik e sto chastic descenttly will contin ually implemen implementation of gradien gradient t descen descent will even eventually tuallygradient reach sufficien sufficiently large weigh weights ts w increase the magnitude of and, in theory , will never halt. In practice, a n umerical to cause numerical ov overflow, erflow, at which point its behavior will dep depend end on ho how w the implemen tation of gradien t descen t will even tually reach sufficien tly large weigh ts programmer has decided to handle values that are not real num umb bers. to cause numerical overflow, at which point its behavior will depend on how the Most forms regularization are vable guarantee convergence vergence iterativee programmer hasofdecided to handle aluestothat are notthe realcon num bers. of iterativ 239to guarantee the convergence of iterative Most forms of regularization are able
Most forms of regularization are able to guarantee the convergence of iterative methods applied to underdetermined problems. For example, weight decay will cause gradient descent to quit increasing the magnitude of the weights when the slope of the likelihood is equal to the weight decay coefficient.

The idea of using regularization to solve underdetermined problems extends beyond machine learning. The same idea is useful for several basic linear algebra problems.

As we saw in Sec. 2.9, we can solve underdetermined linear equations using the Moore–Penrose pseudoinverse. Recall that one definition of the pseudoinverse X⁺ of a matrix X is

X⁺ = lim_{α→0⁺} (X⊤X + αI)⁻¹ X⊤.   (7.29)

We can now recognize Eq. 7.29 as performing linear regression with weight decay. Specifically, Eq. 7.29 is the limit of Eq. 7.17 as the regularization coefficient shrinks to zero. We can thus interpret the pseudoinverse as stabilizing underdetermined problems using regularization.
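Eq. 7.29 can be checked numerically. For a small made-up wide matrix (more columns than rows, so X⊤X is singular), the regularized inverse approaches NumPy's pseudoinverse as α shrinks:

```python
import numpy as np

def ridge_pseudoinverse(X, alpha):
    # (X^T X + alpha*I)^{-1} X^T, the expression inside the limit of Eq. 7.29.
    n = X.shape[1]
    return np.linalg.inv(X.T @ X + alpha * np.eye(n)) @ X.T

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])  # 2x3: underdetermined, X^T X not invertible

for alpha in (1.0, 1e-4, 1e-8):
    err = np.abs(ridge_pseudoinverse(X, alpha) - np.linalg.pinv(X)).max()
    print(alpha, err)  # the error shrinks as alpha approaches zero
```

For any α > 0 the matrix being inverted is well conditioned enough to compute, which is exactly the stabilizing effect of regularization described above.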
7.4
Dataset Augmentation
The best way to make a machine learning model generalize better is to train it on more data. Of course, in practice, the amount of data we have is limited. One way to get around this problem is to create fake data and add it to the training set. For some machine learning tasks, it is reasonably straightforward to create new fake data.

This approach is easiest for classification. A classifier needs to take a complicated, high dimensional input x and summarize it with a single category identity y. This means that the main task facing a classifier is to be invariant to a wide variety of transformations. We can generate new (x, y) pairs easily just by transforming the x inputs in our training set.

This approach is not as readily applicable to many other tasks. For example, it is difficult to generate new fake data for a density estimation task unless we have already solved the density estimation problem.

Dataset augmentation has been a particularly effective technique for a specific classification problem: object recognition. Images are high dimensional and include an enormous variety of factors of variation, many of which can be easily simulated. Operations like translating the training images a few pixels in each direction can often greatly improve generalization, even if the model has already been designed to be partially translation invariant by using the convolution and pooling techniques
described in Chapter 9. Many other operations such as rotating the image or scaling the image have also proven quite effective.

One must be careful not to apply transformations that would change the correct class. For example, optical character recognition tasks require recognizing the difference between 'b' and 'd' and the difference between '6' and '9', so horizontal flips and 180° rotations are not appropriate ways of augmenting datasets for these tasks.

There are also transformations that we would like our classifiers to be invariant to, but which are not easy to perform. For example, out-of-plane rotation cannot be implemented as a simple geometric operation on the input pixels.

Dataset augmentation is effective for speech recognition tasks as well (Jaitly and Hinton, 2013).

Injecting noise in the input to a neural network (Sietsma and Dow, 1991) can also be seen as a form of data augmentation. For many classification and even some regression tasks, the task should still be possible to solve even if small random noise is added to the input. Neural networks prove not to be very robust to noise, however (Tang and Eliasmith, 2010). One way to improve the robustness of neural networks is simply to train them with random noise applied to their inputs. Input noise injection is part of some unsupervised learning algorithms such as the denoising autoencoder (Vincent et al., 2008). Noise injection also works when the noise is applied to the hidden units, which can be seen as doing dataset augmentation at multiple levels of abstraction. Poole et al. (2014) recently showed that this approach can be highly effective provided that the magnitude of the noise is carefully tuned. Dropout, a powerful regularization strategy that will be described in Sec. 7.12, can be seen as a process of constructing new inputs by multiplying by noise.

When comparing machine learning benchmark results, it is important to take the effect of dataset augmentation into account. Often, hand-designed dataset augmentation schemes can dramatically reduce the generalization error of a machine learning technique. To compare the performance of one machine learning algorithm to another, it is necessary to perform controlled experiments. When comparing machine learning algorithm A and machine learning algorithm B, it is necessary to make sure that both algorithms were evaluated using the same hand-designed dataset augmentation schemes. Suppose that algorithm A performs poorly with no dataset augmentation and algorithm B performs well when combined with numerous synthetic transformations of the input. In such a case it is likely the synthetic transformations caused the improved performance, rather than the use of machine learning algorithm B. Sometimes deciding whether an experiment
CHAPTER 7. REGULARIZATION FOR DEEP LEARNING
has been properly controlled requires subjective judgment. For example, machine learning algorithms that inject noise into the input are performing a form of dataset augmentation. Usually, operations that are generally applicable (such as adding Gaussian noise to the input) are considered part of the machine learning algorithm, while operations that are specific to one application domain (such as randomly cropping an image) are considered to be separate pre-processing steps.
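As a concrete illustration of input noise injection, the sketch below (NumPy; the model, noise scale `sigma`, learning rate, and epoch count are illustrative choices, not from the text) trains a linear least-squares model while drawing fresh Gaussian noise for every presentation of the inputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_with_input_noise(X, y, sigma=0.1, lr=0.05, epochs=300):
    """Train a linear least-squares model with Gaussian input noise.

    A new noise sample is drawn for every presentation of the data,
    so the model never sees exactly the same inputs twice.
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        X_noisy = X + sigma * rng.normal(size=X.shape)  # fresh noise each pass
        grad = 2 * X_noisy.T @ (X_noisy @ w - y) / len(y)
        w -= lr * grad
    return w

# Synthetic regression problem with known true weights.
X = rng.normal(size=(64, 3))
y = X @ np.array([1.0, -2.0, 0.5])
w = train_with_input_noise(X, y)
```

With a small `sigma`, the learned weights stay close to the noiseless least-squares solution while gaining some robustness to input perturbations; the noise acts much like a small ridge penalty on the weights.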
7.5 Noise Robustness
Sec. 7.4 has motivated the use of noise applied to the inputs as a dataset augmentation strategy. For some models, the addition of noise with infinitesimal variance at the input of the model is equivalent to imposing a penalty on the norm of the weights (Bishop, 1995a,b). In the general case, it is important to remember that noise injection can be much more powerful than simply shrinking the parameters, especially when the noise is added to the hidden units. Noise applied to the hidden units is such an important topic as to merit its own separate discussion; the dropout algorithm described in Sec. 7.12 is the main development of that approach.

Another way that noise has been used in the service of regularizing models is by adding it to the weights. This technique has been used primarily in the context of recurrent neural networks (Jim et al., 1996; Graves, 2011).
This can be interpreted as a stochastic implementation of a Bayesian inference over the weights. The Bayesian treatment of learning would consider the model weights to be uncertain and representable via a probability distribution that reflects this uncertainty. Adding noise to the weights is a practical, stochastic way to reflect this uncertainty (Graves, 2011).

This can also be interpreted as equivalent (under some assumptions) to a more traditional form of regularization. Adding noise to the weights has been shown to be an effective regularization strategy in the context of recurrent neural networks (Jim et al., 1996; Graves, 2011). In the following, we will present an analysis of the effect of weight noise on a standard feedforward neural network (as introduced in Chapter 6).
We study the regression setting, where we wish to train a function $\hat{y}(x)$ that maps a set of features $x$ to a scalar using the least-squares cost function between the model predictions $\hat{y}(x)$ and the true values $y$:

$$J = \mathbb{E}_{p(x,y)}\left[(\hat{y}(x) - y)^2\right]. \tag{7.30}$$

The training set consists of $m$ labeled examples $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$.
We now assume that with each input presentation we also include a random perturbation $\epsilon_W \sim \mathcal{N}(\epsilon; 0, \eta I)$ of the network weights. Let us imagine that we have a standard $l$-layer MLP. We denote the perturbed model as $\hat{y}_{\epsilon_W}(x)$. Despite the injection of noise, we are still interested in minimizing the squared error of the output of the network. The objective function thus becomes:

$$\tilde{J}_W = \mathbb{E}_{p(x,y,\epsilon_W)}\left[(\hat{y}_{\epsilon_W}(x) - y)^2\right] \tag{7.31}$$
$$= \mathbb{E}_{p(x,y,\epsilon_W)}\left[\hat{y}_{\epsilon_W}^2(x) - 2y\,\hat{y}_{\epsilon_W}(x) + y^2\right]. \tag{7.32}$$

For small $\eta$, the minimization of $J$ with added weight noise (with covariance $\eta I$) is equivalent to minimization of $J$ with an additional regularization term: $\eta\,\mathbb{E}_{p(x,y)}\left[\|\nabla_W \hat{y}(x)\|^2\right]$. This form of regularization encourages the parameters to go to regions of parameter space where small perturbations of the weights have a relatively small influence on the output.
In other words, it pushes the model into regions where the model is relatively insensitive to small variations in the weights, finding points that are not merely minima, but minima surrounded by flat regions (Hochreiter and Schmidhuber, 1995). In the simplified case of linear regression (where, for instance, $\hat{y}(x) = w^\top x + b$), this regularization term collapses into $\eta\,\mathbb{E}_{p(x)}\left[\|x\|^2\right]$, which is not a function of parameters and therefore does not contribute to the gradient of $\tilde{J}_W$ with respect to the model parameters.
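For a linear model this equivalence can be checked numerically. The sketch below (NumPy; all values are synthetic) compares a Monte Carlo estimate of the squared error under weight noise $\epsilon \sim \mathcal{N}(0, \eta I)$ against the noiseless error plus the $\eta\,\|\nabla_w \hat{y}(x)\|^2$ penalty, which for $\hat{y} = w^\top x$ is exactly $\eta\,\|x\|^2$:

```python
import numpy as np

rng = np.random.default_rng(0)

# For a linear model yhat = w @ x, weight noise eps ~ N(0, eta*I) adds
# exactly eta * ||x||^2 to the expected squared error, matching the
# eta * E[||grad_w yhat||^2] penalty (here grad_w yhat = x).
eta = 0.01
x = rng.normal(size=5)
y = 1.5
w = rng.normal(size=5)

noiseless = (w @ x - y) ** 2

# Monte Carlo estimate of the expected error under weight perturbations.
eps = rng.normal(scale=np.sqrt(eta), size=(200_000, 5))
noisy = np.mean(((w + eps) @ x - y) ** 2)

penalty = eta * np.sum(x ** 2)  # eta * ||grad_w yhat||^2
```

Up to Monte Carlo error, `noisy` agrees with `noiseless + penalty`, illustrating the decomposition in the text for a single fixed $(x, y)$ pair.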
7.5.1 Injecting Noise at the Output Targets
Most datasets have some amount of mistakes in the $y$ labels. It can be harmful to maximize $\log p(y \mid x)$ when $y$ is a mistake. One way to prevent this is to explicitly model the noise on the labels. For example, we can assume that for some small constant $\epsilon$, the training set label $y$ is correct with probability $1 - \epsilon$, and otherwise any of the other possible labels might be correct. This assumption is easy to incorporate into the cost function analytically, rather than by explicitly drawing noise samples. For example, label smoothing regularizes a model based on a softmax with $k$ output values by replacing the hard 0 and 1 classification targets with targets of $\epsilon / (k-1)$ and $1 - \epsilon$, respectively. The standard cross-entropy loss may then be used with these soft targets. Maximum likelihood learning with a softmax classifier and hard targets may actually never converge: the softmax can
never predict a probability of exactly 0 or exactly 1, so it will continue to learn larger and larger weights, making more extreme predictions forever. It is possible to prevent this scenario using other regularization strategies like weight decay. Label smoothing has the advantage of preventing the pursuit of hard probabilities without discouraging correct classification. This strategy has been used since
the 1980s and continues to be featured prominently in modern neural networks (Szegedy et al., 2015).
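A minimal sketch of constructing these soft targets (NumPy; the function name and the value of $\epsilon$ are illustrative):

```python
import numpy as np

def smooth_labels(labels, k, eps=0.1):
    """Replace hard one-hot targets with 1 - eps for the correct class
    and eps / (k - 1) for each of the other k - 1 classes."""
    targets = np.full((len(labels), k), eps / (k - 1))
    targets[np.arange(len(labels)), labels] = 1.0 - eps
    return targets

# Two examples with labels 0 and 2, over k = 3 classes.
soft = smooth_labels(np.array([0, 2]), k=3, eps=0.1)
```

Each row of `soft` still sums to 1, so the smoothed targets remain a valid distribution and can be fed directly into the standard cross-entropy loss.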
7.6 Semi-Supervised Learning
In the paradigm of semi-supervised learning, both unlabeled examples from $P(x)$ and labeled examples from $P(x, y)$ are used to estimate $P(y \mid x)$ or predict $y$ from $x$.

In the context of deep learning, semi-supervised learning usually refers to learning a representation $h = f(x)$. The goal is to learn a representation so that examples from the same class have similar representations. Unsupervised learning can provide useful cues for how to group examples in representation space. Examples that cluster tightly in the input space should be mapped to similar representations. A linear classifier in the new space may achieve better generalization in many cases (Belkin and Niyogi, 2002; Chapelle et al., 2003). A long-standing variant of this approach is the application of principal components analysis as a pre-processing step before applying a classifier (on the projected data).
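The PCA pre-processing variant can be sketched in a few lines (NumPy; the data are synthetic and the number of components is arbitrary): fit the principal components on the pooled labeled and unlabeled inputs, then project the labeled examples before training a classifier on them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the data: many unlabeled inputs, few labeled.
X_unlabeled = rng.normal(size=(500, 10))
X_labeled = rng.normal(size=(40, 10))

# Fit PCA on all available inputs, labeled and unlabeled alike.
X_all = np.vstack([X_unlabeled, X_labeled])
mean = X_all.mean(axis=0)
# Principal directions come from the SVD of the centered pooled data.
_, _, Vt = np.linalg.svd(X_all - mean, full_matrices=False)
components = Vt[:2]  # keep the top 2 directions

# Project the labeled examples; these become features for a classifier.
Z_labeled = (X_labeled - mean) @ components.T
```

The unlabeled data influence only the learned projection (an estimate of the structure of $P(x)$), while the downstream classifier is trained on the projected labeled examples alone.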
Instead of having separate unsupervised and supervised components in the model, one can construct models in which a generative model of either $P(x)$ or $P(x, y)$ shares parameters with a discriminative model of $P(y \mid x)$. One can then trade off the supervised criterion $-\log P(y \mid x)$ with the unsupervised or generative one (such as $-\log P(x)$ or $-\log P(x, y)$). The generative criterion then expresses a particular form of prior belief about the solution to the supervised learning problem (Lasserre et al., 2006), namely that the structure of $P(x)$ is connected to the structure of $P(y \mid x)$ in a way that is captured by the shared parametrization. By controlling how much of the generative criterion is included in the total criterion, one can find a better trade-off than with a purely generative
or a purely discriminative training criterion (Lasserre et al., 2006; Larochelle and Bengio, 2008).

In the context of scarcity of labeled data (and abundance of unlabeled data), deep architectures have shown promise as well. Salakhutdinov and Hinton (2008) describe a method for learning the kernel function of a kernel machine used for regression, in which the usage of unlabeled examples for modeling $P(x)$ improves $P(y \mid x)$ quite significantly.

See Chapelle et al. (2006) for more information about semi-supervised learning.
7.7 Multi-Task Learning
Multi-task learning (Caruana, 1993) is a way to improve generalization by pooling the examples (which can be seen as soft constraints imposed on the parameters) arising out of several tasks. In the same way that additional training examples put more pressure on the parameters of the model towards values that generalize well, when part of a model is shared across tasks, that part of the model is more constrained towards good values (assuming the sharing is justified), often yielding better generalization.
Figure 7.2: Multi-task learning can be cast in several ways in deep learning frameworks and this figure illustrates the common situation where the tasks share a common input but involve different target random variables. The lower layers of a deep network (whether it is supervised and feedforward or includes a generative component with downward arrows) can be shared across such tasks, while task-specific parameters (associated respectively with the weights into and from $h^{(1)}$ and $h^{(2)}$) can be learned on top of those yielding a shared representation $h^{(\text{shared})}$. The underlying assumption is that there exists a common pool of factors that explain the variations in the input $x$, while each task is associated with a subset of these factors. In this example, it is additionally assumed that top-level hidden units $h^{(1)}$ and $h^{(2)}$ are specialized to each task (respectively predicting $y^{(1)}$ and $y^{(2)}$) while some intermediate-level representation $h^{(\text{shared})}$ is shared across all tasks. In the unsupervised learning context, it makes sense for some of the top-level factors to be associated with none of the output tasks ($h^{(3)}$): these are the factors that explain some of the input variations but are not relevant for predicting $y^{(1)}$ or $y^{(2)}$.

Fig. 7.2 illustrates a very common form of multi-task learning, in which different supervised tasks (predicting $y^{(i)}$ given $x$) share the same input $x$, as well as some intermediate-level representation $h^{(\text{shared})}$ capturing a common pool of factors. The
model can generally be divided into two kinds of parts and associated parameters:

1. Task-specific parameters (which only benefit from the examples of their task to achieve good generalization). These are the upper layers of the neural network in Fig. 7.2.

2. Generic parameters, shared across all the tasks (which benefit from the pooled data of all the tasks). These are the lower layers of the neural network in Fig. 7.2.

Improved generalization and generalization error bounds (Baxter, 1995) can be achieved because of the shared parameters, for which statistical strength can be greatly improved (in proportion with the increased number of examples for the shared parameters, compared to the scenario of single-task models). Of course this will happen
only if some assumptions about the statistical relationship between the different tasks are valid, meaning that there is something shared across some of the tasks.

From the point of view of deep learning, the underlying prior belief is the following: among the factors that explain the variations observed in the data associated with the different tasks, some are shared across two or more tasks.
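The architecture of Fig. 7.2 can be sketched as a forward pass with one shared weight matrix (generic parameters) and a small head per task (task-specific parameters). All layer sizes and task names below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Generic parameters: a shared lower layer that pools statistical
# strength across tasks. Task-specific parameters: one head per task.
W_shared = rng.normal(size=(8, 4))
heads = {"task1": rng.normal(size=8), "task2": rng.normal(size=8)}

def predict(x, task):
    h_shared = np.maximum(0.0, W_shared @ x)  # shared representation
    return heads[task] @ h_shared             # task-specific output

x = rng.normal(size=4)
y1 = predict(x, "task1")
y2 = predict(x, "task2")
```

During training, gradients from every task's examples would update `W_shared`, while each head is updated only by its own task's examples, which is where the pooling benefit described above comes from.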
7.8 Early Stopping
When training large models with sufficient representational capacity to overfit the task, we often observe that training error decreases steadily over time, but validation set error begins to rise again. See Fig. 7.3 for an example of this behavior. This behavior occurs very reliably.

This means we can obtain a model with better validation set error (and thus, hopefully better test set error) by returning to the parameter setting at the point in time with the lowest validation set error. Instead of running our optimization algorithm until we reach a (local) minimum of validation error, we run it until the error on the validation set has not improved for some amount of time. Every time the error on the validation set improves, we store a copy of the model parameters. When the training algorithm terminates, we return these parameters, rather than
the latest parameters. This procedure is specified more formally in Algorithm 7.1.

This strategy is known as early stopping. It is probably the most commonly used form of regularization in deep learning. Its popularity is due both to its effectiveness and its simplicity.
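The procedure just described (store a copy of the parameters whenever validation error improves, stop after it fails to improve for a set number of evaluations) can be sketched as follows. This is a simplified stand-in for Algorithm 7.1, demonstrated on a toy one-parameter problem where training keeps pushing the parameter past the validation optimum:

```python
import numpy as np

def early_stopping(train_step, validation_error, params, patience=5):
    """Keep a copy of the best parameters seen so far; stop after
    `patience` consecutive evaluations without improvement."""
    best_params = params.copy()
    best_error = validation_error(params)
    strikes = 0
    while strikes < patience:
        params = train_step(params)
        err = validation_error(params)
        if err < best_error:
            best_error, best_params, strikes = err, params.copy(), 0
        else:
            strikes += 1
    return best_params, best_error

# Toy problem: validation error is minimized at w = 2, but each
# training step moves w upward by 0.1, eventually overshooting.
w_best, e_best = early_stopping(
    train_step=lambda w: w + 0.1,
    validation_error=lambda w: (w[0] - 2.0) ** 2,
    params=np.array([0.0]),
    patience=5,
)
```

The loop returns the parameters from the point with the lowest validation error (here, $w \approx 2$), not the parameters at termination, which have already drifted past the optimum.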
Figure 7.3: Learning curves showing how the negative log-likelihood loss changes over time (indicated as number of training iterations over the dataset, or epochs). In this example, we train a maxout network on MNIST. Observe that the training objective decreases consistently over time, but the validation set average loss eventually begins to increase again, forming an asymmetric U-shaped curve.
One way to think of early stopping is as a very efficient hyperparameter selection algorithm. In this view, the number of training steps is just another hyperparameter. We can see in Fig. 7.3 that this hyperparameter has a U-shaped validation set performance curve. Most hyperparameters that control model capacity have such a U-shaped validation set performance curve, as illustrated in Fig. 5.3. In the case of early stopping, we are controlling the effective capacity of the model by determining how many steps it can take to fit the training set. Most hyperparameters must be chosen using an expensive guess and check process, where we set a hyperparameter at the start of training, then run training for several steps to see its effect. The "training time" hyperparameter is unique in that by definition a single run of training tries out many values of the hyperparameter.
The only significant cost to choosing this hyperparameter automatically via early stopping is running the validation set evaluation periodically during training. Ideally, this is done in parallel to the training process on a separate machine, separate CPU, or separate GPU from the main training process. If such resources are not available, then the cost of these periodic evaluations may be reduced by using a validation set that is small compared to the training set or by evaluating the validation set error less frequently and obtaining a lower resolution estimate of the optimal training time.

An additional cost to early stopping is the need to maintain a copy of the best parameters. This cost is generally negligible, because it is acceptable to store these parameters in a slower and larger form of memory (for example, training in
CHAPTER 7. REGULARIZATION FOR DEEP LEARNING
GPU memory, but storing the optimal parameters in host memory or on a disk drive). Since the best parameters are written to infrequently and never read during training, these occasional slow writes have little effect on the total training time.

Early stopping is a very unobtrusive form of regularization, in that it requires almost no change in the underlying training procedure, the objective function, or the set of allowable parameter values. This means that it is easy to use early stopping without damaging the learning dynamics. This is in contrast to weight decay, where one must be careful not to use too much weight decay and trap the network in a bad local minimum corresponding to a solution with pathologically small weights.

Early stopping may be used either alone or in conjunction with other regularization strategies. Even when using regularization strategies that modify the objective function to encourage better generalization, it is rare for the best generalization to occur at a local minimum of the training objective.

Early stopping requires a validation set, which means some training data is not fed to the model. To best exploit this extra data, one can perform extra training after the initial training with early stopping has completed. In the second, extra training step, all of the training data is included. There are two basic strategies one can use for this second training procedure.

One strategy (Algorithm 7.2) is to initialize the model again and retrain on all of the data. In this second training pass, we train for the same number of steps as the early stopping procedure determined was optimal in the first pass.
There are some subtleties associated with this procedure. For example, there is not a good way of knowing whether to retrain for the same number of parameter updates or the same number of passes through the dataset. On the second round of training, each pass through the dataset will require more parameter updates because the training set is bigger.

Another strategy for using all of the data is to keep the parameters obtained from the first round of training and then continue training but now using all of the data. At this stage, we no longer have a guide for when to stop in terms of a number of steps. Instead, we can monitor the average loss function on the validation set, and continue training until it falls below the value of the training set objective at which the early stopping procedure halted. This strategy avoids the high cost of retraining the model from scratch, but is not as well-behaved.
For example, there is not any guarantee that the objective on the validation set will ever reach the target value, so this strategy is not even guaranteed to terminate. This procedure is presented more formally in Algorithm 7.3.

Early stopping is also useful because it reduces the computational cost of the
Figure 7.4: An illustration of the effect of early stopping. (Left) The solid contour lines indicate the contours of the negative log-likelihood. The dashed line indicates the trajectory taken by SGD beginning from the origin. Rather than stopping at the point w∗ that minimizes the cost, early stopping results in the trajectory stopping at an earlier point w̃. (Right) An illustration of the effect of L2 regularization for comparison. The dashed circles indicate the contours of the L2 penalty, which causes the minimum of the total cost to lie nearer the origin than the minimum of the unregularized cost.
training procedure. Besides the obvious reduction in cost due to limiting the number of training iterations, it also has the benefit of providing regularization without requiring the addition of penalty terms to the cost function or the computation of the gradients of such additional terms.

How early stopping acts as a regularizer: So far we have stated that early stopping is a regularization strategy, but we have supported this claim only by showing learning curves where the validation set error has a U-shaped curve. What is the actual mechanism by which early stopping regularizes the model? Bishop (1995a) and Sjöberg and Ljung (1995) argued that early stopping has the effect of restricting the optimization procedure to a relatively small volume of parameter space in the neighborhood of the initial parameter value θ_o. More specifically, imagine taking τ optimization steps (corresponding to τ training iterations) with learning rate ε.
We can view the product ετ as a measure of effective capacity. Assuming the gradient is bounded, restricting both the number of iterations and the learning rate limits the volume of parameter space reachable from θ_o. In this sense, ετ behaves as if it were the reciprocal of the coefficient used for weight decay.

Indeed, in the case of a simple linear model with a quadratic error function and simple gradient descent, we can show that early stopping is equivalent to L2
regularization.

In order to compare with classical L2 regularization, we examine a simple setting where the only parameters are linear weights (θ = w). We can model the cost function J with a quadratic approximation in the neighborhood of the empirically optimal value of the weights w∗:

    Ĵ(θ) = J(w∗) + ½ (w − w∗)^⊤ H (w − w∗),        (7.33)

where H is the Hessian matrix of J with respect to w evaluated at w∗. Given the assumption that w∗ is a minimum of J(w), we know that H is positive semidefinite. Under a local Taylor series approximation, the gradient is given by:

    ∇_w Ĵ(w) = H (w − w∗).        (7.34)

We are going to study the trajectory followed by the parameter vector during training. For simplicity, let us set the initial parameter vector to the origin,³ that is w^(0) = 0. Let us suppose that we update the parameters via gradient descent:
    w^(τ) = w^(τ−1) − ε ∇_w Ĵ(w^(τ−1))        (7.35)
          = w^(τ−1) − ε H (w^(τ−1) − w∗)        (7.36)
    w^(τ) − w∗ = (I − εH)(w^(τ−1) − w∗)        (7.37)

Let us now rewrite this expression in the space of the eigenvectors of H, exploiting the eigendecomposition of H: H = QΛQ^⊤, where Λ is a diagonal matrix and Q is an orthonormal basis of eigenvectors:

    w^(τ) − w∗ = (I − εQΛQ^⊤)(w^(τ−1) − w∗)        (7.38)
    Q^⊤(w^(τ) − w∗) = (I − εΛ) Q^⊤(w^(τ−1) − w∗)        (7.39)

Assuming that w^(0) = 0 and that ε is chosen to be small enough to guarantee |1 − ελ_i| < 1, the parameter trajectory during training after τ parameter updates is as follows:

    Q^⊤ w^(τ) = [I − (I − εΛ)^τ] Q^⊤ w∗.        (7.40)

Now, the expression for Q^⊤ w̃ in Eq. 7.13 for L2 regularization can be rearranged as:

    Q^⊤ w̃ = (Λ + αI)^(−1) Λ Q^⊤ w∗        (7.41)
³ For neural networks, to obtain symmetry breaking between hidden units, we cannot initialize all the parameters to 0, as discussed in Sec. 6.2. However, the argument holds for any other initial value w^(0).
    Q^⊤ w̃ = [I − (Λ + αI)^(−1) α] Q^⊤ w∗        (7.42)

Comparing Eq. 7.40 and Eq. 7.42, we see that if the hyperparameters ε, α, and τ are chosen such that

    (I − εΛ)^τ = (Λ + αI)^(−1) α,        (7.43)

then L2 regularization and early stopping can be seen to be equivalent (at least under the quadratic approximation of the objective function). Going even further, by taking logarithms and using the series expansion for log(1 + x), we can conclude that if all λ_i are small (that is, ελ_i ≪ 1 and λ_i/α ≪ 1) then

    τ ≈ 1/(εα),        (7.44)
    α ≈ 1/(τε).        (7.45)

That is, under these assumptions, the number of training iterations τ plays a role inversely proportional to the L2 regularization parameter, and the inverse of τε plays the role of the weight decay coefficient.

Parameter values corresponding to directions of significant curvature (of the objective function) are regularized less than directions of less curvature. Of course, in the context of early stopping, this really means that parameters that correspond
to directions of significant curvature tend to learn early relative to parameters corresponding to directions of less curvature.

The derivations in this section have shown that a trajectory of length τ ends at a point that corresponds to a minimum of the L2-regularized objective. Early stopping is of course more than the mere restriction of the trajectory length; instead, early stopping typically involves monitoring the validation set error in order to stop the trajectory at a particularly good point in space. Early stopping therefore has the advantage over weight decay that early stopping automatically determines the correct amount of regularization while weight decay requires many training experiments with different values of its hyperparameter.
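The equivalence derived above can be checked numerically. The sketch below, with hypothetical values chosen so that ελ_i and λ_i/α are small, runs τ steps of gradient descent on a quadratic objective and compares the result to the closed-form L2-regularized minimizer w̃ = (H + αI)^(−1) H w∗ with α = 1/(τε):

```python
import numpy as np

# Hypothetical quadratic cost J(w) = 0.5 (w - w*)^T H (w - w*)
H = np.diag([0.05, 0.02])          # small eigenvalues, so eps * lambda_i << 1
w_star = np.array([1.0, -2.0])

eps, tau = 0.01, 100               # learning rate and number of training steps
w = np.zeros(2)                    # start at the origin, as in the derivation
for _ in range(tau):
    w = w - eps * H @ (w - w_star)     # gradient descent step, Eq. 7.35

# L2-regularized minimizer with alpha = 1/(tau * eps), per Eq. 7.44
alpha = 1.0 / (tau * eps)
w_tilde = np.linalg.solve(H + alpha * np.eye(2), H @ w_star)

print(w)        # early-stopped solution
print(w_tilde)  # weight-decayed solution; close to the early-stopped one
```

Because H is diagonal here, the trajectory can also be checked against Eq. 7.40 coordinate-wise: after τ steps, each coordinate equals [1 − (1 − ελ_i)^τ] w∗_i.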
7.9 Parameter Tying and Parameter Sharing
Thus far, in this chapter, when we have discussed adding constraints or penalties to the parameters, we have always done so with respect to a fixed region or point. For example, L2 regularization (or weight decay) penalizes model parameters for deviating from the fixed value of zero. However, sometimes we may need other ways to express our prior knowledge about suitable values of the model parameters.
Sometimes we might not know precisely what values the parameters should take but we know, from knowledge of the domain and model architecture, that there should be some dependencies between the model parameters.

A common type of dependency that we often want to express is that certain parameters should be close to one another. Consider the following scenario: we have two models performing the same classification task (with the same set of classes) but with somewhat different input distributions. Formally, we have model A with parameters w^(A) and model B with parameters w^(B). The two models map the input to two different, but related outputs: ŷ^(A) = f(w^(A), x) and ŷ^(B) = g(w^(B), x).

Let us imagine that the tasks are similar enough (perhaps with similar input and output distributions) that we believe the model parameters should be close to each other: ∀i, w_i^(A) should be close to w_i^(B). We can leverage this information through regularization. Specifically, we can use a parameter norm penalty of the form: Ω(w^(A), w^(B)) = ‖w^(A) − w^(B)‖²₂. Here we used an L2 penalty, but other choices are also possible.

This kind of approach was proposed by Lasserre et al. (2006), who regularized the parameters of one model, trained as a classifier in a supervised paradigm, to be close to the parameters of another model, trained in an unsupervised paradigm (to capture the distribution of the observed input data). The architectures were constructed such that many of the parameters in the classifier model could be paired to corresponding parameters in the unsupervised model.

While a parameter norm penalty is one way to regularize parameters to be close to one another, the more popular way is to use constraints: to force sets of parameters to be equal. This method of regularization is often referred to as parameter sharing, where we interpret the various models or model components as sharing a unique set of parameters. A significant advantage of parameter sharing over regularizing the parameters to be close (via a norm penalty) is that only a subset of the parameters (the unique set) need to be stored in memory. In certain models, such as the convolutional neural network, this can lead to a significant reduction in the memory footprint of the model.

Convolutional Neural Networks   By far the most popular and extensive use of parameter sharing occurs in convolutional neural networks (CNNs) applied to computer vision.

Natural images have many statistical properties that are invariant to translation.
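As a concrete sketch of the parameter-tying penalty from the two-model scenario above, the function below (names are illustrative, not from the book) computes Ω(w^(A), w^(B)) = α‖w^(A) − w^(B)‖²₂ along with the gradient terms each model would add to its loss gradient:

```python
import numpy as np

def tying_penalty(w_a, w_b, alpha):
    """Parameter-tying penalty alpha * ||w_a - w_b||_2^2 and its gradients.
    A sketch of the norm penalty described above; names are illustrative."""
    diff = w_a - w_b
    omega = alpha * np.sum(diff ** 2)   # the penalty added to the total loss
    grad_a = 2 * alpha * diff           # contribution to model A's gradient
    grad_b = -2 * alpha * diff          # contribution to model B's gradient
    return omega, grad_a, grad_b

w_a = np.array([1.0, 0.5, -2.0])
w_b = np.array([0.8, 0.5, -1.0])
omega, ga, gb = tying_penalty(w_a, w_b, alpha=0.1)
print(omega)   # approximately 0.104
```

The two gradient contributions are equal and opposite, pulling the two parameter vectors toward each other during training.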
For example, a photo of a cat remains a photo of a cat if it is translated one pixel
to the right. CNNs take this property into account by sharing parameters across multiple image locations. The same feature (a hidden unit with the same weights) is computed over different locations in the input. This means that we can find a cat with the same cat detector whether the cat appears at column i or column i + 1 in the image.

Parameter sharing has allowed CNNs to dramatically lower the number of unique model parameters and to significantly increase network sizes without requiring a corresponding increase in training data. It remains one of the best examples of how to effectively incorporate domain knowledge into the network architecture. CNNs will be discussed in more detail in Chapter 9.
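The translation property that motivates parameter sharing can be seen with a tiny 1-D sketch (a hypothetical two-weight "edge detector", not an example from the book): the same kernel is applied at every position, so the detector's response simply shifts along with the input.

```python
import numpy as np

def correlate1d(x, k):
    """Apply one shared kernel k at every valid position of x.
    This is the parameter sharing of a convolutional layer, sketched in 1-D."""
    n = len(x) - len(k) + 1
    return np.array([np.dot(x[i:i + len(k)], k) for i in range(n)])

edge_kernel = np.array([-1.0, 1.0])                   # a single shared 2-weight detector
signal_a = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 0.0])
signal_b = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 0.0])   # same edge, shifted by one position

resp_a = correlate1d(signal_a, edge_kernel)
resp_b = correlate1d(signal_b, edge_kernel)
print(resp_a)  # the detector fires (+1) at the rising edge, wherever it occurs
print(resp_b)
```

The same two weights detect the edge at either position; a fully connected layer would need a separate set of weights for every location.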
7.10 Sparse Representations
Weight decay acts by placing a penalty directly on the model parameters. Another strategy is to place a penalty on the activations of the units in a neural network, encouraging their activations to be sparse. This indirectly imposes a complicated penalty on the model parameters.

We have already discussed (in Sec. 7.1.2) how L1 penalization induces a sparse parametrization, meaning that many of the parameters become zero (or close to zero). Representational sparsity, on the other hand, describes a representation where many of the elements of the representation are zero (or close to zero). A simplified view of this distinction can be illustrated in the context of linear
regression:

    ⎡ 18 ⎤   ⎡ 4  0  0 −2  0  0 ⎤ ⎡  2 ⎤
    ⎢  5 ⎥   ⎢ 0  0 −1  0  3  0 ⎥ ⎢  3 ⎥
    ⎢ 15 ⎥ = ⎢ 0  5  0  0  0  0 ⎥ ⎢ −2 ⎥        (7.46)
    ⎢ −9 ⎥   ⎢ 1  0  0 −1  0 −4 ⎥ ⎢ −5 ⎥
    ⎣ −3 ⎦   ⎣ 1  0  0  0 −5  0 ⎦ ⎢  1 ⎥
                                   ⎣  4 ⎦
     y ∈ ℝ^m      A ∈ ℝ^(m×n)       x ∈ ℝ^n

    ⎡ −14 ⎤   ⎡  3 −1  2 −5  4  1 ⎤ ⎡  0 ⎤
    ⎢   1 ⎥   ⎢  4  2 −3 −1  1  3 ⎥ ⎢  2 ⎥
    ⎢  19 ⎥ = ⎢ −1  5  4  2 −3 −1 ⎥ ⎢  0 ⎥        (7.47)
    ⎢   2 ⎥   ⎢  3  1  2 −3  0 −3 ⎥ ⎢  0 ⎥
    ⎣  23 ⎦   ⎣ −5  4 −2  2 −5 −1 ⎦ ⎢ −3 ⎥
                                     ⎣  0 ⎦
     y ∈ ℝ^m      B ∈ ℝ^(m×n)        h ∈ ℝ^n
In the first expression, we have an example of a sparsely parametrized linear regression model. In the second, we have linear regression with a sparse representation h of the data x. That is, h is a function of x that, in some sense, represents the information present in x, but does so with a sparse vector.

Representational regularization is accomplished by the same sorts of mechanisms that we have used in parameter regularization.

Norm penalty regularization of representations is performed by adding to the loss function J a norm penalty on the representation, denoted Ω(h). As before, we denote the regularized loss function by J̃:

    J̃(θ; X, y) = J(θ; X, y) + αΩ(h)        (7.48)

where α ∈ [0, ∞) weights the relative contribution of the norm penalty term, with larger values of α corresponding to more regularization.

Just as an L1 penalty on the parameters induces parameter sparsity, an L1
penalty on the elements of the representation induces representational sparsity: Ω(h) = ‖h‖₁ = Σ_i |h_i|. Of course, the L1 penalty is only one choice of penalty that can result in a sparse representation. Others include the penalty derived from a Student-t prior on the representation (Olshausen and Field, 1996; Bergstra, 2011) and KL divergence penalties (Larochelle and Bengio, 2008) that are especially useful for representations with elements constrained to lie on the unit interval. Lee et al. (2008) and Goodfellow et al. (2009) both provide examples of strategies based on regularizing the average activation across several examples, (1/m) Σ_i h^(i), to be near some target value, such as a vector with .01 for each entry.
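A minimal sketch of Eq. 7.48 with the L1 representational penalty, for a hypothetical two-layer linear model h = Wx, ŷ = Vh (all names illustrative):

```python
import numpy as np

def regularized_loss(x, y, W, V, alpha=0.1):
    """J~ = J + alpha * Omega(h), with Omega(h) = ||h||_1, for a toy
    linear model h = W x, y_hat = V h. A sketch, not a library API."""
    h = W @ x                              # the representation being sparsified
    y_hat = V @ h
    data_loss = np.sum((y_hat - y) ** 2)   # ordinary squared-error term J
    penalty = alpha * np.sum(np.abs(h))    # L1 penalty on the representation
    return data_loss + penalty, h
```

Note that the penalty is applied to the activations h rather than to the weights W and V, which is what distinguishes representational sparsity from a sparse parametrization.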
For example, ortho orthogonal gonal matching pursuit (Pati et al., Other approac hes obtain represen tational sparsit with a es hard t on h ythat 1993 1993)) enco encodes des an input x with the representation solv solves theconstrain constrained the activationproblem values. For example, orthogonal matching pursuit (Pati et al., optimization P 1993) encodes an input x witharg the representation kx − W hk2 , h that solves the constrained min (7.49) optimization problem h,khk0
CHAPTER 7. REGULARIZATION FOR DEEP LEARNING
7.11 Bagging and Other Ensemble Methods
Bagging for bootstr otstrap ap aggr aggre egatingEnsemble ) is a techniqueMetho for reducing 7.11 (short Bagging and Other ds generalization error by combining several mo models dels (Breiman, 1994). The idea is to train several Bagging (short for b o otstr ap aggr gating ) isofathe technique generalization differen differentt mo models dels separately separately,, theneha have ve all models for votereducing on the output for test error b y combining several mo dels ( Breiman , 1994 ). The idea is to train examples. This is an example of a general strategy in machine learning calledseveral mo model del differen t mo separately , then ha ve all of the are models voteason the output fords test aver averaging aging aging. . Tdels echniques employing this strategy known ensemble metho methods ds. . examples. This is an example of a general strategy in machine learning called model that mo model del averaging is are thatknown different mo models dels will usually averThe agingreason . Techniques employing this works strategy as ensemble metho ds. not mak makee all the same errors on the test set. The reason that model averaging works is that different models will usually Consider for example a set of k regression models. Supp Suppose ose that eac each h model not make all the same errors on the test set. mak makes es an error i on each example, with the errors drawn from a zero-mean k regression Consider for example a set of models. oseariances that eacEh[model [2i ] = vSupp multiv ultivariate ariate normal distribution with variances E and cov covariances i j ] = mak es an error on each example, with the errors drawn from a zero-mean Then the error made by the av average erage prediction ensemble ble mo models c. P E of all the ensem E dels is m 1 ultivariate normal distribution with variances [ ] = v and covariances [ ] = expected ected squared error of the ensem ensemble ble predictor is i i . 
The expected squared error of the ensemble predictor is

    E[ ( (1/k) Σᵢ εᵢ )² ] = (1/k²) E[ Σᵢ ( εᵢ² + Σ_{j≠i} εᵢεⱼ ) ]    (7.50)
    = (1/k) v + ((k−1)/k) c.    (7.51)

In the case where the errors are perfectly correlated and c = v, the mean squared error reduces to v, so the model averaging does not help at all. In the case where the errors are perfectly uncorrelated and c = 0, the expected squared error of the ensemble is only (1/k) v. This means that the expected squared error of the ensemble decreases linearly with the ensemble size. In other words, on average, the ensemble will perform at least as well as any of its members, and if the members make independent errors, the ensemble will perform significantly better than its members.

Different ensemble methods construct the ensemble of models in different ways. For example, each member of the ensemble could be formed by training a completely
different kind of model using a different algorithm or objective function. Bagging is a method that allows the same kind of model, training algorithm and objective function to be reused several times.

Specifically, bagging involves constructing k different datasets. Each dataset has the same number of examples as the original dataset, but each dataset is constructed by sampling with replacement from the original dataset. This means that, with high probability, each dataset is missing some of the examples from the original dataset and also contains several duplicate examples (on average around 2/3 of the examples from the original dataset are found in the resulting training
[Figure 7.5 panels: the original dataset; a first resampled dataset with its ensemble member's '8' detector; a second resampled dataset with its ensemble member's '8' detector.]
Figure 7.5: A cartoon depiction of how bagging works. Suppose we train an '8' detector on the dataset depicted above, containing an '8', a '6' and a '9'. Suppose we make two different resampled datasets. The bagging training procedure is to construct each of these datasets by sampling with replacement. The first dataset omits the '9' and repeats the '8'. On this dataset, the detector learns that a loop on top of the digit corresponds to an '8'. On the second dataset, we repeat the '9' and omit the '6'. In this case, the detector learns that a loop on the bottom of the digit corresponds to an '8'. Each of these individual classification rules is brittle, but if we average their output then the detector is robust, achieving maximal confidence only when both loops of the '8' are present.
set, if it has the same size as the original). Model i is then trained on dataset i. The differences between which examples are included in each dataset result in differences between the trained models. See Fig. 7.5 for an example.

Neural networks reach a wide enough variety of solution points that they can often benefit from model averaging even if all of the models are trained on the same dataset. Differences in random initialization, random selection of minibatches, differences in hyperparameters, or different outcomes of non-deterministic implementations of neural networks are often enough to cause different members of the ensemble to make partially independent errors.

Model averaging is an extremely powerful and reliable method for reducing generalization error. Its use is usually discouraged when benchmarking algorithms for scientific papers, because any machine learning algorithm can benefit substantially from model averaging at the price of increased computation and memory. For this reason, benchmark comparisons are usually made using a single model.

Machine learning contests are usually won by methods using model averaging over dozens of models. A recent prominent example is the Netflix Grand Prize (Koren, 2009).
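The dataset construction described above can be sketched as sampling index arrays with replacement (sizes below are illustrative); the fraction of distinct examples retained concentrates around 1 − 1/e, which matches the roughly 2/3 figure quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 10_000, 5                                  # dataset size, number of bootstrap datasets
bootstrap_indices = [rng.integers(0, m, size=m) for _ in range(k)]

# Fraction of distinct original examples present in each resampled dataset.
unique_fractions = [len(np.unique(idx)) / m for idx in bootstrap_indices]
```

Each fraction lands close to 1 − 1/e ≈ 0.632 for a dataset of this size.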
Not all techniques for constructing ensembles are designed to make the ensemble more regularized than the individual models. For example, a technique called boosting (Freund and Schapire, 1996b,a) constructs an ensemble with higher capacity than the individual models. Boosting has been applied to build ensembles of neural networks (Schwenk and Bengio, 1998) by incrementally adding neural networks to the ensemble. Boosting has also been applied interpreting an individual neural network as an ensemble (Bengio et al., 2006a), incrementally adding hidden units to the neural network.
7.12 Dropout
Dropout (Srivastava et al., 2014) provides a computationally inexpensive but powerful method of regularizing a broad family of models. To a first approximation, dropout can be thought of as a method of making bagging practical for ensembles of very many large neural networks. Bagging involves training multiple models, and evaluating multiple models on each test example. This seems impractical when each model is a large neural network, since training and evaluating such networks is costly in terms of runtime and memory. It is common to use ensembles of five to ten neural networks (Szegedy et al. (2014a) used six to win the ILSVRC), but more than this rapidly becomes unwieldy. Dropout provides an inexpensive approximation to training and evaluating a bagged ensemble of exponentially many neural networks.
Specifically, dropout trains the ensemble consisting of all sub-networks that can be formed by removing non-output units from an underlying base network, as illustrated in Fig. 7.6. In most modern neural networks, based on a series of affine transformations and nonlinearities, we can effectively remove a unit from a network by multiplying its output value by zero. This procedure requires some slight modification for models such as radial basis function networks, which take the difference between the unit's state and some reference value. Here, we present the dropout algorithm in terms of multiplication by zero for simplicity, but it can be trivially modified to work with other operations that remove a unit from the network.

Recall that to learn with bagging, we define k different models, construct k
different datasets by sampling from the training set with replacement, and then train model i on dataset i. Dropout aims to approximate this process, but with an exponentially large number of neural networks. Specifically, to train with dropout, we use a minibatch-based learning algorithm that makes small steps, such as stochastic gradient descent. Each time we load an example into a minibatch, we
[Figure 7.6 diagram: the base network, with input units x1, x2, hidden units h1, h2, and output y, alongside the ensemble of sixteen sub-networks formed by removing subsets of units.]
Figure 7.6: Dropout trains an ensemble consisting of all sub-networks that can be constructed by removing non-output units from an underlying base network. Here, we begin with a base network with two visible units and two hidden units. There are sixteen possible subsets of these four units. We show all sixteen subnetworks that may be formed by dropping out different subsets of units from the original network. In this small example, a large proportion of the resulting networks have no input units or no path connecting the input to the output. This problem becomes insignificant for networks with wider layers, where the probability of dropping all possible paths from inputs to outputs becomes smaller.
[Figure 7.7 diagram: (top) the base feedforward network with units x1, x2, h1, h2, y; (bottom) the same network with each input and hidden unit multiplied by its corresponding mask entry µ.]
Figure 7.7: An example of forward propagation through a feedforward network using dropout. (Top) In this example, we use a feedforward network with two input units, one hidden layer with two hidden units, and one output unit. (Bottom) To perform forward propagation with dropout, we randomly sample a vector µ with one entry for each input or hidden unit in the network. The entries of µ are binary and are sampled independently from each other. The probability of each entry being 1 is a hyperparameter, usually 0.5 for the hidden layers and 0.8 for the input. Each unit in the network is multiplied by the corresponding mask, and then forward propagation continues through the rest of the network as usual. This is equivalent to randomly selecting one of the sub-networks from Fig. 7.6 and running forward propagation through it.
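The forward pass of Fig. 7.7 can be sketched as follows. The parameter values and the ReLU nonlinearity are illustrative assumptions; the layer sizes and inclusion probabilities follow the caption's typical values.

```python
import numpy as np

def dropout_forward(x, W1, b1, W2, b2, rng, p_input=0.8, p_hidden=0.5):
    """One stochastic forward pass: sample a binary mask entry for every input
    and hidden unit, and remove a unit by multiplying its value by zero."""
    mu_x = (rng.random(x.shape) < p_input).astype(float)   # input mask, P(entry = 1) = 0.8
    h = np.maximum(0.0, W1 @ (x * mu_x) + b1)              # hidden layer (ReLU assumed)
    mu_h = (rng.random(h.shape) < p_hidden).astype(float)  # hidden mask, P(entry = 1) = 0.5
    return W2 @ (h * mu_h) + b2

rng = np.random.default_rng(0)
x = rng.normal(size=2)                          # two input units, as in the figure
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)   # two hidden units
W2, b2 = rng.normal(size=(1, 2)), np.zeros(1)   # one output unit
y_out = dropout_forward(x, W1, b1, W2, b2, rng) # a fresh random sub-network per call
```

Each call samples a new mask, so repeated calls propagate through different sub-networks of Fig. 7.6.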
randomly sample a different binary mask to apply to all of the input and hidden units in the network. The mask for each unit is sampled independently from all of the others. The probability of sampling a mask value of one (causing a unit to be included) is a hyperparameter fixed before training begins. It is not a function of the current value of the model parameters or the input example. Typically, an input unit is included with probability 0.8 and a hidden unit is included with probability 0.5. We then run forward propagation, back-propagation, and the learning update as usual. Fig. 7.7 illustrates how to run forward propagation with dropout.

More formally, suppose that a mask vector µ specifies which units to include, and J(θ, µ) defines the cost of the model defined by parameters θ and mask µ. Then dropout training consists in minimizing E_µ J(θ, µ). The expectation contains
exponentially many terms, but we can obtain an unbiased estimate of its gradient by sampling values of µ.

Dropout training is not quite the same as bagging training. In the case of bagging, the models are all independent. In the case of dropout, the models share parameters, with each model inheriting a different subset of parameters from the parent neural network. This parameter sharing makes it possible to represent an exponential number of models with a tractable amount of memory. In the case of bagging, each model is trained to convergence on its respective training set. In the case of dropout, typically most models are not explicitly trained at all; usually, the model is large enough that it would be infeasible to sample all possible sub-networks within the lifetime of the universe. Instead, a tiny fraction of the possible
a tin tiny ytrained fractionatof all—usually the possible, the model is large enough thatforit awsingle ould bstep, e infeasible sample allsharing possible subsub-net sub-netw works are eac each h trained and thetoparameter causes net w orks within the lifetime of the univ erse. Instead, a tin y fraction of the p ossible the remaining sub-net sub-networks works to arrive at go goo od settings of the parameters. These sub-net orksdifferences. are each trained a single step, follows and thethe parameter sharing causes are the w only Bey Beyond ondforthese, dropout bagging algorithm. For the remaining sub-net works to arrive at go o d settings of the parameters. These example, the training set encountered by eac each h sub-netw sub-network ork is indeed a subset of are the only differences. Bey ond these, dropout follows the bagging algorithm. For the original training set sampled with replacemen replacement. t. example, the training set encountered by each sub-network is indeed a subset of To mak makee a prediction, a bagged ensemble must accum accumulate ulate votes from all of the original training set sampled with replacement. its members. We refer to this pro process cess as infer inferenc enc encee in this con context. text. So far, our T o mak e a prediction, a bagged ensemble must accum ulate votesbefrom all of description of bagging and dropout has not required that the model explicitly its members. W ew,refer this pro cess infer encrole e inisthis context.a probability So far, our probabilistic. No Now, we to assume that theasmo model’s del’s to output description of bagging and dropout has not required that the model b e explicitly distribution. In the case of bagging, each mo model del i pro produces duces a probability distribution w, we assume the mo is toarithmetic output a mean probability (i)( y | x pprobabilistic. ). TheNo prediction of thethat ensemble is del’s giv given enrole by the of all i distribution. 
In the case of bagging, each mo del pro duces a probability distribution of these distributions, p ( y x). The prediction of the ensemble is given by the arithmetic mean of all k 1 X (i) of these distributions, p (y | x). (7.52) | k i=1 1 p (y x). (7.52) In the case of drop dropout, out, each sub-mo sub-model k del defined by mask vector µ defines a prob| 260 In the case of dropout, each sub-model defined by mask vector µ defines a probX
CHAPTER 7. REGULARIZATION FOR DEEP LEARNING
abilit ability y distribution p (y | x, µ ). The arithmetic mean over all masks is giv given en by ability distribution p (y x, µ )X . The arithmetic mean over all masks is given p(µ)p(y | x, µ) (7.53) by | µ p(µ)p(y x, µ) (7.53) where p( µ ) is the probabilit probability y distribution that was used to sample µ at training | time. where p( µ ) is the probability distribution that was used to sample µ at training Because this sum includes an exp exponential onential num numb ber of terms, it is intractable time. X to ev evaluate aluate except in cases where the structure of the mo model del permits some form Because this sum includes an exp onential num b er of terms, it is intractable of simplification. So far, deep neural nets are not kno known wn to permit any tractable to ev aluate except in cases where the structure of the mo del p ermits some simplification. Instead, we can appro approximate ximate the inference with sampling,form by simplification. Sothe far,output deep neural not kno wn to permit any are tractable aofveraging together from nets manyare masks. Even 10-20 masks often simplification. Instead, can approximate the inference with sampling, by sufficien sufficientt to obtain goo good d pwe erformance. averaging together the output from many masks. Even 10-20 masks are often Ho How wev ever, er, there is an even better approac approach, h, that allows us to obtain a go goo od sufficient to obtain good performance. appro approximation ximation to the predictions of the entire ensemble, at the cost of only one Ho er, there is Tan even approac h, that us to obtain a go od forw forward ardwev propagation. o do so, bwetter e change to using the allows geometric mean rather than appro ximation mean to theofpredictions of the entire the cost of only one the arithmetic the ensem ensemble ble mem memb bers’ensemble, predicted at distributions. Wardeforw ard propagation. T o do so, w e c hange to using the geometric mean rather than Farley et al. 
(2014) present argumen arguments ts and empirical evidence that the geometric the arithmetic mean of the ensem ble members’ predicted Wardemean performs comparably to the arithmetic mean in this distributions. context. Farley et al. (2014) present arguments and empirical evidence that the geometric The geometric mean of multiple probability distributions is not guaranteed to be mean performs comparably to the arithmetic mean in this context. a probability distribution. To guarantee that the result is a probabilit probability y distribution, The geometric mean of m ultiple probability distributions is not guaranteed bye we impose the requirement that none of the sub-models assigns probability 0 totoan any aeven probability To guarantee that the result is aunnormalized probability distribution, ev en ent, t, and wedistribution. renormalize the resulting distribution. The probabilit probability y w e impose thedefined requirement that of the sub-models 0 to any distribution directly by none the geometric mean is assigns giv given en bprobability y event, and we renormalize the resulting distribution. The unnormalized probability sY distribution defined directly by the geometric mean is given by p˜ensemble(y | x) = 2d p(y | x, µ) (7.54) µ
p˜ (y x) = p(y x, µ) (7.54) where d is the num numb ber of units that may b e dropp dropped. ed. Here we use a uniform | | distribution over µ to simplify the presentation, but non-uniform distributions are where ber of units that be dropped. the Hereensemble: we use a uniform d is theTnum also possible. o mak make e predictions we may must sre-normalize Y but non-uniform distributions are distribution over µ to simplify the presentation, p˜ensemble (y | x) the ensemble: also possible. To make predictions we must re-normalize p ensemble(y | x) = P . (7.55) ˜ensemble (y0 | x) y0 p p˜ (y x) (y x) = p . (7.55) p˜ (|y x) A key insight (Hinton et al., |2012c) in involv volv volved ed in dropout is that we can appro approxixi| p p ( y | x mate ensemble by ev evaluating aluating ) in one model: the mo model del with all units, but keywinsight (Hinton , 2012c ) involvedbyinthe dropout is that e can appro xii multiplied withAthe eigh eights ts going outetofal. unit probabilit probability y ofwincluding unit p(dification y x) in one by ev aluating thethe morigh del twith all units, imate . Thep motiv motivation ation for this mo modification is tomodel: capture right expected valuebut of P i with the w eigh ts going out of unit m ultiplied b y the probabilit y of including unit | the output from that unit. We call this approac approach h the weight sc scaling aling infer inferenc enc encee rule rule.. i. The motivation for this modification is to capture the right expected value of the output from that unit. We call this261 approach the weight scaling inference rule.
There is not yet any theoretical argument for the accuracy of this approximate inference rule in deep nonlinear net networks, works, but empirically it performs very well. There is not yet any theoretical argument for the accuracy of this approximate Because we usually use an inclusion probability of 12 , the weight scaling rule inference rule in deep nonlinear networks, but empirically it performs very well. usually amounts to dividing the weights by 2 at the end of training, and then using e usually use an winclusion probability of ,result the wis eight scaling rule the Because mo model del aswusual. Another ay to achiev achieve e the same to multiply the usuallyofamounts tobdividing thetraining. weights bEither y 2 at the training, andethen states the units y 2 during wayend , theofgoal is to mak make sureusing that the expected model astotal usual. Another wayattotest achiev same the result is to the the input to a unit timee isthe roughly same as multiply the exp expected ected statesinput of thetounits y 2 during Either wayhalf , thethe goal is toatmak e sure total that bunit at traintraining. time, ev even en though units train timethat are the expected total input to a unit at test time is roughly the same as the exp ected missing on average. total input to that unit at train time, even though half the units at train time are For many classes of mo models dels that do not ha hav ve nonlinear hidden units, the weigh weightt missing on average. scaling inference rule is exact. For a simple example, consider a softmax regression For many of vmo dels that do not habvye the nonlinear units, the weight classifier withclasses n input ariables represented vectorhidden v: scaling inference rule is exact. For a simple example, consider a softmax regression > classifier with n inputPv(ariables thevvector + b v. : y = y | vrepresented ) = softmaxbyW (7.56) y
We can index into the family of sub-models by element-wise multiplication of the input with a binary vector d:

\[
P(\mathrm{y} = y \mid \mathbf{v}; \mathbf{d}) = \operatorname{softmax}\big(\mathbf{W}^\top (\mathbf{d} \odot \mathbf{v}) + \mathbf{b}\big)_y.
\tag{7.57}
\]

The ensemble predictor is defined by re-normalizing the geometric mean over all ensemble members' predictions:

\[
P_{\text{ensemble}}(\mathrm{y} = y \mid \mathbf{v}) = \frac{\tilde{P}_{\text{ensemble}}(\mathrm{y} = y \mid \mathbf{v})}{\sum_{y'} \tilde{P}_{\text{ensemble}}(\mathrm{y} = y' \mid \mathbf{v})}
\tag{7.58}
\]

where

\[
\tilde{P}_{\text{ensemble}}(\mathrm{y} = y \mid \mathbf{v}) = \sqrt[2^n]{\prod_{\mathbf{d} \in \{0,1\}^n} P(\mathrm{y} = y \mid \mathbf{v}; \mathbf{d})}.
\tag{7.59}
\]

To see that the weight scaling rule is exact, we can simplify \(\tilde{P}_{\text{ensemble}}\):

\[
\tilde{P}_{\text{ensemble}}(\mathrm{y} = y \mid \mathbf{v}) = \sqrt[2^n]{\prod_{\mathbf{d} \in \{0,1\}^n} P(\mathrm{y} = y \mid \mathbf{v}; \mathbf{d})}
\tag{7.60}
\]
\[
= \sqrt[2^n]{\prod_{\mathbf{d} \in \{0,1\}^n} \operatorname{softmax}\big(\mathbf{W}^\top(\mathbf{d} \odot \mathbf{v}) + \mathbf{b}\big)_y}
\tag{7.61}
\]
\[
= \sqrt[2^n]{\prod_{\mathbf{d} \in \{0,1\}^n} \frac{\exp\big(\mathbf{W}_{y,:}^\top(\mathbf{d} \odot \mathbf{v}) + b_y\big)}{\sum_{y'} \exp\big(\mathbf{W}_{y',:}^\top(\mathbf{d} \odot \mathbf{v}) + b_{y'}\big)}}
\tag{7.62}
\]
\[
= \frac{\sqrt[2^n]{\prod_{\mathbf{d} \in \{0,1\}^n} \exp\big(\mathbf{W}_{y,:}^\top(\mathbf{d} \odot \mathbf{v}) + b_y\big)}}{\sqrt[2^n]{\prod_{\mathbf{d} \in \{0,1\}^n} \sum_{y'} \exp\big(\mathbf{W}_{y',:}^\top(\mathbf{d} \odot \mathbf{v}) + b_{y'}\big)}}
\tag{7.63}
\]

Because \(\tilde{P}\) will be normalized, we can safely ignore multiplication by factors that are constant with respect to y:

\[
\tilde{P}_{\text{ensemble}}(\mathrm{y} = y \mid \mathbf{v}) \propto \sqrt[2^n]{\prod_{\mathbf{d} \in \{0,1\}^n} \exp\big(\mathbf{W}_{y,:}^\top(\mathbf{d} \odot \mathbf{v}) + b_y\big)}
\tag{7.64}
\]
\[
= \exp\left(\frac{1}{2^n} \sum_{\mathbf{d} \in \{0,1\}^n} \mathbf{W}_{y,:}^\top(\mathbf{d} \odot \mathbf{v}) + b_y\right)
\tag{7.65}
\]
\[
= \exp\left(\frac{1}{2} \mathbf{W}_{y,:}^\top \mathbf{v} + b_y\right)
\tag{7.66}
\]

Substituting this back into Eq. 7.58 we obtain a softmax classifier with weights (1/2)W.

The weight scaling rule is also exact in other settings, including regression networks with conditionally normal outputs, and deep networks that have hidden layers without nonlinearities. However, the weight scaling rule is only an approximation for deep models that have nonlinearities. Though the approximation has not been theoretically characterized, it often works well empirically. Goodfellow et al. (2013a) found experimentally that the weight scaling approximation can work better (in terms of classification accuracy) than Monte Carlo approximations to the ensemble predictor. This held true even when the Monte Carlo approximation was allowed to sample up to 1,000 sub-networks. Gal and Ghahramani (2015) found that some models obtain better classification accuracy using twenty samples and the Monte Carlo approximation. It appears that the optimal choice of inference approximation is problem-dependent.

Srivastava et al. (2014) showed that dropout is more effective than other standard computationally inexpensive regularizers, such as weight decay, filter norm constraints and sparse activity regularization. Dropout may also be combined with other forms of regularization to yield a further improvement.

One advantage of dropout is that it is very computationally cheap. Using dropout during training requires only O(n) computation per example per update, to generate n random binary numbers and multiply them by the state. Depending on the implementation, it may also require O(n) memory to store these binary numbers until the back-propagation stage. Running inference in the trained model
has the same cost per-example as if dropout were not used, though we must pay the cost of dividing the weights by 2 once before beginning to run inference on examples.

Another significant advantage of dropout is that it does not significantly limit the type of model or training procedure that can be used. It works well with nearly any model that uses a distributed representation and can be trained with stochastic gradient descent. This includes feedforward neural networks, probabilistic models such as restricted Boltzmann machines (Srivastava et al., 2014), and recurrent neural networks (Bayer and Osendorfer, 2014; Pascanu et al., 2014a). Many other regularization strategies of comparable power impose more severe restrictions on the architecture of the model.

Though the cost per-step of applying dropout to a specific model is negligible,
the cost of using dropout in a complete system can be significant. Because dropout is a regularization technique, it reduces the effective capacity of a model. To offset this effect, we must increase the size of the model. Typically the optimal validation set error is much lower when using dropout, but this comes at the cost of a much larger model and many more iterations of the training algorithm. For very large datasets, regularization confers little reduction in generalization error. In these cases, the computational cost of using dropout and larger models may outweigh the benefit of regularization.

When extremely few labeled training examples are available, dropout is less effective. Bayesian neural networks (Neal, 1996) outperform dropout on the Alternative Splicing Dataset (Xiong et al., 2011) where fewer than 5,000 examples are available (Srivastava et al., 2014). When additional unlabeled data is available,
unsupervised feature learning can gain an advantage over dropout.

Wager et al. (2013) showed that, when applied to linear regression, dropout is equivalent to L² weight decay, with a different weight decay coefficient for each input feature. The magnitude of each feature's weight decay coefficient is determined by its variance. Similar results hold for other linear models. For deep models, dropout is not equivalent to weight decay.

The stochasticity used while training with dropout is not necessary for the approach's success. It is just a means of approximating the sum over all sub-models. Wang and Manning (2013) derived analytical approximations to this marginalization. Their approximation, known as fast dropout, resulted in faster
convergence time due to the reduced stochasticity in the computation of the gradient. This method can also be applied at test time, as a more principled (but also more computationally expensive) approximation to the average over all sub-networks than the weight scaling approximation. Fast dropout has been used
to nearly match the performance of standard dropout on small neural network problems, but has not yet yielded a significant improvement or been applied to a large problem.

Just as stochasticity is not necessary to achieve the regularizing effect of dropout, it is also not sufficient. To demonstrate this, Warde-Farley et al. (2014) designed control experiments using a method called dropout boosting, designed to use exactly the same mask noise as traditional dropout but to lack its regularizing effect. Dropout boosting trains the entire ensemble to jointly maximize the log-likelihood on the training set. In the same sense that traditional dropout is analogous to bagging, this approach is analogous to boosting. As intended, experiments with dropout boosting show almost no regularization effect compared to training the entire network as a single model.
This demonstrates that the interpretation of dropout as bagging has value beyond the interpretation of dropout as robustness to noise. The regularization effect of the bagged ensemble is only achieved when the stochastically sampled ensemble members are trained to perform well independently of each other.

Dropout has inspired other stochastic approaches to training exponentially large ensembles of models that share weights. DropConnect is a special case of dropout where each product between a single scalar weight and a single hidden unit state is considered a unit that can be dropped (Wan et al., 2013). Stochastic pooling is a form of randomized pooling (see Sec. 9.3) for building ensembles of convolutional networks with each convolutional network attending to different spatial locations of each feature map.
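To make the contrast between the two masking schemes concrete, here is a minimal NumPy sketch (not the reference implementation of Wan et al.; the shapes, seed, and names are invented) of how a DropConnect mask differs from a dropout mask for a single linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)

h = rng.standard_normal(8)        # hidden unit activations (toy values)
W = rng.standard_normal((8, 3))   # weights of the next layer
p = 0.5                           # inclusion probability

# Dropout: one binary mask entry per *unit*; dropping a unit zeroes out
# its entire row of outgoing weight products at once.
unit_mask = rng.random(8) < p
out_dropout = (h * unit_mask) @ W

# DropConnect: one binary mask entry per *scalar weight*; each product
# w_ij * h_i is treated as a unit that can be dropped independently.
weight_mask = rng.random((8, 3)) < p
out_dropconnect = h @ (W * weight_mask)

print(out_dropout.shape, out_dropconnect.shape)
```

Dropout thus samples from 2^8 sub-networks here, while DropConnect samples from the much larger family of 2^24 weight-level sub-networks.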
So far, dropout remains the most widely used implicit ensemble method.

One of the key insights of dropout is that training a network with stochastic behavior and making predictions by averaging over multiple stochastic decisions implements a form of bagging with parameter sharing. Earlier, we described dropout as bagging an ensemble of models formed by including or excluding units. However, there is no need for this model averaging strategy to be based on inclusion and exclusion. In principle, any kind of random modification is admissible. In practice, we must choose modification families that neural networks are able to learn to resist. Ideally, we should also use model families that allow a fast approximate inference rule. We can think of any form of modification parametrized by a vector µ as training an ensemble consisting of p(y | x, µ) for all possible
values of µ. There is no requirement that µ have a finite number of values. For example, µ can be real-valued. Srivastava et al. (2014) showed that multiplying the weights by µ ∼ N(1, I) can outperform dropout based on binary masks. Because E[µ] = 1, the standard network automatically implements approximate inference
in the ensemble, without needing any weight scaling.

So far we have described dropout purely as a means of performing efficient, approximate bagging. However, there is another view of dropout that goes further than this. Dropout trains not just a bagged ensemble of models, but an ensemble of models that share hidden units. This means each hidden unit must be able to perform well regardless of which other hidden units are in the model. Hidden units must be prepared to be swapped and interchanged between models. Hinton et al. (2012c) were inspired by an idea from biology: sexual reproduction, which involves swapping genes between two different organisms, creates evolutionary pressure for genes to become not just good, but readily swapped between different organisms.
Such genes and such features are very robust to changes in their environment because they are not able to incorrectly adapt to unusual features of any one organism or model. Dropout thus regularizes each hidden unit to be not merely a good feature but a feature that is good in many contexts. Warde-Farley et al. (2014) compared dropout training to training of large ensembles and concluded that dropout offers additional improvements to generalization error beyond those obtained by ensembles of independent models.

It is important to understand that a large portion of the power of dropout arises from the fact that the masking noise is applied to the hidden units. This can be seen as a form of highly intelligent, adaptive destruction of the information
content of the input rather than destruction of the raw values of the input. For example, if the model learns a hidden unit h_i that detects a face by finding the nose, then dropping h_i corresponds to erasing the information that there is a nose in the image. The model must learn another h_i, either one that redundantly encodes the presence of a nose, or one that detects the face by another feature, such as the mouth. Traditional noise injection techniques that add unstructured noise at the input are not able to randomly erase the information about a nose from an image of a face unless the magnitude of the noise is so great that nearly all of the information in the image is removed. Destroying extracted features rather than original values allows the destruction process to make use of all of the knowledge about the input distribution that the model has acquired so far.
Another important aspect of dropout is that the noise is multiplicative. If the noise were additive with fixed scale, then a rectified linear hidden unit h_i with added noise could simply learn to have h_i become very large in order to make the added noise insignificant by comparison. Multiplicative noise does not allow such a pathological solution to the noise robustness problem.

Another deep learning algorithm, batch normalization, reparametrizes the model in a way that introduces both additive and multiplicative noise on the
[Figure: x ("panda", 57.7% confidence) + .007 × sign(∇_x J(θ, x, y)) ("nematode", 8.2% confidence) = x + .007 sign(∇_x J(θ, x, y)) ("gibbon", 99.3% confidence)]

Figure 7.8: A demonstration of adversarial example generation applied to GoogLeNet (Szegedy et al., 2014a) on ImageNet. By adding an imperceptibly small vector whose elements are equal to the sign of the elements of the gradient of the cost function with respect to the input, we can change GoogLeNet's classification of the image. Reproduced with permission from Goodfellow et al. (2014b).
hidden units at training time. The primary purpose of batch normalization is to improve optimization, but the noise can have a regularizing effect, and sometimes makes dropout unnecessary. Batch normalization is described further in Sec. 8.7.1.
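The advantage of multiplicative over additive noise discussed above can be demonstrated numerically. In this toy NumPy sketch (the activation values and noise scale are invented), scaling the activations up makes fixed-scale additive noise negligible in relative terms, while the relative effect of multiplicative noise is unchanged, so a unit cannot escape it by growing:

```python
import numpy as np

rng = np.random.default_rng(0)
noise = 0.1 * rng.standard_normal(3)    # fixed-scale noise sample
h = np.array([1.0, 2.0, 3.0])           # hidden activations (toy values)

rel_err_add = []
rel_err_mul = []
for scale in (1.0, 1000.0):
    h_big = scale * h                       # the unit learns to grow its activation
    additive = h_big + noise                # additive noise becomes negligible...
    multiplicative = h_big * (1.0 + noise)  # ...but multiplicative noise does not
    rel_err_add.append(np.abs(additive / h_big - 1.0).max())
    rel_err_mul.append(np.abs(multiplicative / h_big - 1.0).max())

# The additive relative error collapses with scale; the multiplicative
# relative error is the same at both scales.
print(rel_err_add, rel_err_mul)
```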
7.13
Adversarial Training

In many cases, neural networks have begun to reach human performance when evaluated on an i.i.d. test set. It is natural therefore to wonder whether these models have obtained a true human-level understanding of these tasks. In order to probe the level of understanding a network has of the underlying task, we can search for examples that the model misclassifies. Szegedy et al. (2014b) found that even neural networks that perform at human level accuracy have a nearly 100% error rate on examples that are intentionally constructed by using an optimization procedure to search for an input x′ near a data point x such that the model output is very different at x′. In many cases, x′ can be so similar to x that a human observer cannot tell the difference between the original example and the adversarial example, but the network can make highly different predictions. See Fig. 7.8 for an example.
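The perturbation shown in Fig. 7.8 is built from the sign of the gradient of the cost with respect to the input. A minimal NumPy sketch of this construction for a linear softmax classifier (the model, data, and label here are toy stand-ins, not the actual GoogLeNet experiment):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy linear softmax classifier standing in for a deep model's locally
# linear behavior; W, b, x, and y are all invented for illustration.
W = rng.standard_normal((10, 3))
b = rng.standard_normal(3)
x = rng.standard_normal(10)
y = 0            # assumed true label
epsilon = 0.007  # the perturbation size used in Fig. 7.8

def grad_J_wrt_x(x, y):
    # Gradient of the cross-entropy cost J(theta, x, y) with respect to x:
    # for softmax cross-entropy this is W @ (p - onehot(y)).
    p = softmax(W.T @ x + b)
    p[y] -= 1.0
    return W @ p

# The adversarial input moves each component of x by epsilon in the
# direction that increases the cost: x' = x + epsilon * sign(grad_x J).
x_adv = x + epsilon * np.sign(grad_J_wrt_x(x, y))
```

Each component of x′ differs from x by at most ε, yet for high-dimensional inputs the cost can change substantially, which is the linearity argument developed below.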
Adversarial examples have many implications, for example, in computer security, that are beyond the scope of this chapter. However, they are interesting in the context of regularization because one can reduce the error rate on the original i.i.d. test set via adversarial training: training on adversarially perturbed examples
from the training set (Szegedy et al., 2014b; Go Goo odfello dfellow w et al. al.,, 2014b). Go Goo odfello dfellow w et al. (2014b) sho showed wed that one of the primary causes of these from the training set (Szegedy et al., 2014b; Goodfellow et al., 2014b). adv adversarial ersarial examples is excessiv excessivee linearit linearity y. Neural net networks works are built out of Go o dfello w et al. ( 2014b ) sho wed that one of the primary of these primarily linear building blo blocks. cks. In some exp experiments eriments the ov overall erallcauses function they adv ersarial examples is excessiv e linearit y . Neural net works are built out of implemen implementt pro proves ves to be highly linear as a result. These linear functions are easy primarily linear building blo cks.value In some experiments erall function they to optimize. Unfortunately Unfortunately, , the of a linear functionthe canovchange very rapidly implemen proves to be highly as aeac result. These are easy if it has ntumerous inputs. If wlinear e change each h input by linear , then functions a linear function to optimize. , the aluemuc of cancan change w can change with weights Unfortunately by vas uch ha linear as ||wfunction be avery veryrapidly large || 1, which if it has n umerous inputs. If w e c hange eac h input b y , then a linear function amoun amountt if w is high-dimensional. Adv Adversarial ersarial training discourages this highly w linear with weights can change bybas much as the , which can e cally a very large w netw sensitiv sensitive e lo locally cally beha ehavior vior y encouraging network ork to beblo locally constant amoun t if wborho is high-dimensional. ersarial training discourages this highly || be seen in the neigh neighb orhoo od of the training Adv data. This||can as a way of explicitly sensitiv e locally linear behaviorprior by encouraging the netw orknets. 
to be locally constant in the neighborhood of the training data. This can be seen as a way of explicitly introducing a local constancy prior into supervised neural nets.

Adversarial training helps to illustrate the power of using a large function family in combination with aggressive regularization. Purely linear models, like logistic regression, are not able to resist adversarial examples because they are forced to be linear. Neural networks are able to represent functions that can range from nearly linear to nearly locally constant and thus have the flexibility to capture linear trends in the training data while still learning to resist local perturbation.

Adversarial examples also provide a means of accomplishing semi-supervised learning. At a point x that is not associated with a label in the dataset, the model itself assigns some label ŷ. The model's label ŷ may not be the true label, but if the model is high quality, then ŷ has a high probability of providing the true label. We can seek an adversarial example x′ that causes the classifier to output a label y′ with y′ ≠ ŷ. Adversarial examples generated using not the true label but a label provided by a trained model are called virtual adversarial examples (Miyato et al., 2015). The classifier may then be trained to assign the same label to x and x′. This encourages the classifier to learn a function that is robust to small changes anywhere along the manifold where the unlabeled data lies. The assumption motivating this approach is that different classes usually lie on disconnected manifolds, and a small perturbation should not be able to jump from one class manifold to another class manifold.
7.14 Tangent Distance, Tangent Prop, and Manifold Tangent Classifier

Many machine learning algorithms aim to overcome the curse of dimensionality by assuming that the data lies near a low-dimensional manifold, as described in
Sec. 5.11.3.

One of the early attempts to take advantage of the manifold hypothesis is the tangent distance algorithm (Simard et al., 1993, 1998). It is a non-parametric nearest-neighbor algorithm in which the metric used is not the generic Euclidean distance but one that is derived from knowledge of the manifolds near which probability concentrates. It is assumed that we are trying to classify examples and that examples on the same manifold share the same category. Since the classifier should be invariant to the local factors of variation that correspond to movement on the manifold, it would make sense to use as nearest-neighbor distance between points x_1 and x_2 the distance between the manifolds M_1 and M_2 to which they respectively belong.
Although that may be computationally difficult (it would require solving an optimization problem, to find the nearest pair of points on M_1 and M_2), a cheap alternative that makes sense locally is to approximate M_i by its tangent plane at x_i and measure the distance between the two tangents, or between a tangent plane and a point. That can be achieved by solving a low-dimensional linear system (in the dimension of the manifolds). Of course, this algorithm requires one to specify the tangent vectors.

In a related spirit, the tangent prop algorithm (Simard et al., 1992) (Fig. 7.9) trains a neural net classifier with an extra penalty to make each output f(x) of the neural net locally invariant to known factors of variation. These factors of variation correspond to movement along the manifold near which examples of the same class concentrate.
Local invariance is achieved by requiring ∇_x f(x) to be orthogonal to the known manifold tangent vectors v^(i) at x, or equivalently that the directional derivative of f at x in the directions v^(i) be small by adding a regularization penalty Ω:

Ω(f) = Σ_i ((∇_x f(x))^T v^(i))^2.    (7.67)

This regularizer can of course be scaled by an appropriate hyperparameter, and, for most neural networks, we would need to sum over many outputs rather than the lone output f(x) described here for simplicity. As with the tangent distance algorithm, the tangent vectors are derived a priori, usually from the formal knowledge of the effect of transformations such as translation, rotation, and scaling in images. Tangent prop has been used not just for supervised learning (Simard et al., 1992) but also in the context of reinforcement learning (Thrun, 1995).

Tangent propagation is closely related to dataset augmentation. In both cases, the user of the algorithm encodes his or her prior knowledge of the task by specifying a set of transformations that should not alter the output of the
network. The difference is that in the case of dataset augmentation, the network is explicitly trained to correctly classify distinct inputs that were created by applying more than an infinitesimal amount of these transformations. Tangent propagation does not require explicitly visiting a new input point. Instead, it analytically regularizes the model to resist perturbation in the directions corresponding to the specified transformation. While this analytical approach is intellectually elegant, it has two major drawbacks. First, it only regularizes the model to resist infinitesimal perturbation. Explicit dataset augmentation confers resistance to larger perturbations. Second, the infinitesimal approach poses difficulties for models based on rectified linear units.
These models can only shrink their derivatives by turning units off or shrinking their weights. They are not able to shrink their derivatives by saturating at a high value with large weights, as sigmoid or tanh units can. Dataset augmentation works well with rectified linear units because different subsets of rectified units can activate for different transformed versions of each original input.

Tangent propagation is also related to double backprop (Drucker and LeCun, 1992) and adversarial training (Szegedy et al., 2014b; Goodfellow et al., 2014b). Double backprop regularizes the Jacobian to be small, while adversarial training finds inputs near the original inputs and trains the model to produce the same output on these as on the original inputs. Tangent propagation and dataset augmentation using manually specified transformations both require that the model should be invariant to certain specified directions of change in the input.
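The penalty of Eq. 7.67 is straightforward to evaluate when the input gradient of the model is available: it is a sum of squared dot products between ∇_x f(x) and the known tangent vectors. A minimal NumPy sketch (the linear toy model and the tangent vectors here are invented for illustration):

```python
import numpy as np

def tangent_prop_penalty(grad_f_x, tangent_vectors):
    """Omega(f) = sum_i ((grad_x f(x))^T v^(i))^2   (Eq. 7.67).

    grad_f_x        : gradient of the scalar output f at x, shape (n,)
    tangent_vectors : rows are the known tangent vectors v^(i), shape (k, n)
    """
    projections = tangent_vectors @ grad_f_x    # directional derivatives
    return float(np.sum(projections ** 2))

# Toy check: f(x) = w . x has gradient w everywhere.
w = np.array([1.0, 2.0, 0.0])
v1 = np.array([0.0, 0.0, 1.0])  # f is already invariant along v1: no penalty
v2 = np.array([1.0, 0.0, 0.0])  # f changes along v2: penalized
print(tangent_prop_penalty(w, np.stack([v1, v2])))  # prints 1.0
```

In a real training loop this scalar would simply be added, with a hyperparameter weight, to the classification loss.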
Double backprop and adversarial training both require that the model should be invariant to all directions of change in the input so long as the change is small. Just as dataset augmentation is the non-infinitesimal version of tangent propagation, adversarial training is the non-infinitesimal version of double backprop.

The manifold tangent classifier (Rifai et al., 2011c) eliminates the need to know the tangent vectors a priori. As we will see in Chapter 14, autoencoders can estimate the manifold tangent vectors. The manifold tangent classifier makes use of this technique to avoid needing user-specified tangent vectors. As illustrated in Fig. 14.10, these estimated tangent vectors go beyond the classical invariants
that arise out of the geometry of images (such as translation, rotation and scaling) and include factors that must be learned because they are object-specific (such as moving body parts). The algorithm proposed with the manifold tangent classifier is therefore simple: (1) use an autoencoder to learn the manifold structure by unsupervised learning, and (2) use these tangents to regularize a neural net classifier as in tangent prop (Eq. 7.67).

This chapter has described most of the general strategies used to regularize
Figure 7.9: Illustration of the main idea of the tangent prop algorithm (Simard et al., 1992) and manifold tangent classifier (Rifai et al., 2011c), which both regularize the classifier output function f(x). Each curve represents the manifold for a different class, illustrated here as a one-dimensional manifold embedded in a two-dimensional space. On one curve, we have chosen a single point and drawn a vector that is tangent to the class manifold (parallel to and touching the manifold) and a vector that is normal to the class manifold (orthogonal to the manifold). In multiple dimensions there may be many tangent directions and many normal directions. We expect the classification function to change rapidly as it moves in the direction normal to the manifold, and not to change as it moves along the class manifold. Both tangent propagation and the manifold tangent classifier regularize f(x) to not change very much as x moves along the manifold. Tangent
user to manually sp specify ecify functions that the tangent tangent (xecifying ) to notthat classifier regularize change very much as x of moimages ves along the manifold. Tangent directions (such as fsp specifying small translations remain in the same class propagation requires the usertangent to manually sp ecify functions that compute the tangent manifold) while the manifold classifier estimates the manifold tangent directions directions as spncoder ecifying small translations of images remain inders theto same class b y training(such an autoe autoencoder to that fit the training data. The use of auto autoenco enco encoders estimate manifold) while the manifold tangent classifier estimates the manifold tangent directions manifolds will be describ described ed in Chapter 14. by training an autoe ncoder to fit the training data. The use of auto enco ders to estimate manifolds will be describ ed in Chapter 14.
neural networks. Regularization is a central theme of machine learning and as such will be revisited periodically by most of the remaining chapters. Another central theme of machine learning is optimization, described next.
Algorithm 7.1 The early stopping meta-algorithm for determining the best amount of time to train. This meta-algorithm is a general strategy that works well with a variety of training algorithms and ways of quantifying error on the validation set.

Let n be the number of steps between evaluations.
Let p be the "patience," the number of times to observe worsening validation set error before giving up.
Let θ_0 be the initial parameters.
θ ← θ_0
i ← 0
j ← 0
v ← ∞
θ* ← θ
i* ← i
while j < p do
    Update θ by running the training algorithm for n steps.
    i ← i + n
    v′ ← ValidationSetError(θ)
    if v′ < v then
        j ← 0
        θ* ← θ
        i* ← i
        v ← v′
    else
        j ← j + 1
    end if
end while
Best parameters are θ*, best number of training steps is i*.
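Algorithm 7.1 translates almost line for line into code. In this sketch the `train_n_steps` and `validation_error` callables stand in for whatever training procedure and validation metric are in use; they are placeholders, not a fixed API:

```python
import copy

def early_stopping(theta0, train_n_steps, validation_error, n=1, patience=2):
    """Return the parameters and step count with the lowest validation
    error, stopping after `patience` evaluations in a row fail to
    improve on the best error seen so far (Algorithm 7.1)."""
    theta = theta0
    i = j = 0
    best_error = float("inf")
    best_theta, best_i = copy.deepcopy(theta), i
    while j < patience:
        theta = train_n_steps(theta, n)  # run the training algorithm for n steps
        i += n
        v = validation_error(theta)
        if v < best_error:               # improvement: record it, reset patience
            j = 0
            best_error = v
            best_theta, best_i = copy.deepcopy(theta), i
        else:                            # no improvement: spend one unit of patience
            j += 1
    return best_theta, best_i

# Toy check: each "training step" decrements theta by 1; validation
# error is minimized at theta = 3.
theta_star, i_star = early_stopping(
    theta0=10.0,
    train_n_steps=lambda th, n: th - n,
    validation_error=lambda th: (th - 3.0) ** 2,
    n=1, patience=2)
print(theta_star, i_star)  # prints 3.0 7
```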
Algorithm 7.2 A meta-algorithm for using early stopping to determine how long to train, then retraining on all the data.

Let X^(train) and y^(train) be the training set.
Split X^(train) and y^(train) into (X^(subtrain), X^(valid)) and (y^(subtrain), y^(valid)) respectively.
Run early stopping (Algorithm 7.1) starting from random θ using X^(subtrain) and y^(subtrain) for training data and X^(valid) and y^(valid) for validation data. This returns i*, the optimal number of steps.
Set θ to random values again.
Train on X^(train) and y^(train) for i* steps.
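A sketch of Algorithm 7.2 in Python; `split`, `run_early_stopping` (e.g. the procedure of Algorithm 7.1), `init_params`, and `train_n_steps` are hypothetical callables supplied by the surrounding training code, not a fixed API:

```python
def retrain_with_early_stopping(X_train, y_train, split, run_early_stopping,
                                init_params, train_n_steps):
    """Algorithm 7.2: use a subtrain/valid split only to find the
    optimal number of steps i*, then retrain from a fresh random
    initialization on ALL the training data for i* steps."""
    (X_sub, y_sub), (X_val, y_val) = split(X_train, y_train)
    _, i_star = run_early_stopping(init_params(), X_sub, y_sub, X_val, y_val)
    theta = init_params()              # set theta to random values again
    return train_n_steps(theta, X_train, y_train, i_star)
```

The key design point is that the second training run sees the validation examples too, at the cost of trusting that i* transfers from the smaller subtrain set to the full set.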
Algorithm 7.3 Meta-algorithm using early stopping to determine at what objective value we start to overfit, then continue training until that value is reached.

Let X^(train) and y^(train) be the training set.
Split X^(train) and y^(train) into (X^(subtrain), X^(valid)) and (y^(subtrain), y^(valid)) respectively.
Run early stopping (Algorithm 7.1) starting from random θ using X^(subtrain) and y^(subtrain) for training data and X^(valid) and y^(valid) for validation data. This updates θ.
ε ← J(θ, X^(subtrain), y^(subtrain))
while J(θ, X^(valid), y^(valid)) > ε do
    Train on X^(train) and y^(train) for n steps.
end while
Chapter 8
Optimization for Training Deep Models
Deep learning algorithms involve optimization in many contexts. For example, performing inference in models such as PCA involves solving an optimization problem. We often use analytical optimization to write proofs or design algorithms. Of all of the many optimization problems involved in deep learning, the most difficult is neural network training. It is quite common to invest days to months of time on hundreds of machines in order to solve even a single instance of the neural network training problem. Because this problem is so important and so expensive, a specialized set of optimization techniques have been developed for solving it. This chapter presents these optimization techniques for neural network training.

If you are unfamiliar with the basic principles of gradient-based optimization, we suggest reviewing Chapter 4.
That chapter includes a brief overview of numerical optimization in general.

This chapter focuses on one particular case of optimization: finding the parameters θ of a neural network that significantly reduce a cost function J(θ), which typically includes a performance measure evaluated on the entire training set as well as additional regularization terms.

We begin with a description of how optimization used as a training algorithm for a machine learning task differs from pure optimization. Next, we present several of the concrete challenges that make optimization of neural networks difficult. We then define several practical algorithms, including both optimization algorithms themselves and strategies for initializing the parameters. More advanced algorithms adapt their learning rates during training or leverage information contained in
the second derivatives of the cost function. Finally, we conclude with a review of several optimization strategies that are formed by combining simple optimization algorithms into higher-level procedures.
8.1 How Learning Differs from Pure Optimization

Optimization algorithms used for training of deep models differ from traditional optimization algorithms in several ways. Machine learning usually acts indirectly. In most machine learning scenarios, we care about some performance measure P, that is defined with respect to the test set and may also be intractable. We therefore optimize P only indirectly. We reduce a different cost function J(θ) in the hope that doing so will improve P. This is in contrast to pure optimization, where minimizing J is a goal in and of itself. Optimization algorithms for training deep models also typically include some specialization on the specific structure of machine learning objective functions.

Typically, the cost function can be written as an average over the training set,
such as

J(θ) = E_{(x,y)~p̂_data} L(f(x; θ), y),    (8.1)

where L is the per-example loss function, f(x; θ) is the predicted output when the input is x, and p̂_data is the empirical distribution. In the supervised learning case, y is the target output. Throughout this chapter, we develop the unregularized supervised case, where the arguments to L are f(x; θ) and y. However, it is trivial to extend this development, for example, to include θ or x as arguments, or to exclude y as an argument, in order to develop various forms of regularization or unsupervised learning.

Eq. 8.1 defines an objective function with respect to the training set. We would usually prefer to minimize the corresponding objective function where the expectation is taken across the data generating distribution p_data rather than just over the finite training set:

J*(θ) = E_{(x,y)~p_data} L(f(x; θ), y).    (8.2)
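Eq. 8.1 says the training objective is just the average of a per-example loss over the training set. A NumPy sketch with a linear model and squared-error loss (both chosen purely for illustration):

```python
import numpy as np

def cost(theta, X, y, per_example_loss):
    """J(theta) = E_{(x,y) ~ p_hat_data} L(f(x; theta), y): the mean of
    the per-example loss over the empirical distribution (Eq. 8.1)."""
    predictions = X @ theta                 # f(x; theta) for every example
    return float(np.mean(per_example_loss(predictions, y)))

squared_error = lambda y_hat, y: (y_hat - y) ** 2

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.array([1.0, 2.0])                # fits all three examples exactly
print(cost(theta, X, y, squared_error))     # prints 0.0
```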
8.1.1 Empirical Risk Minimization

The goal of a machine learning algorithm is to reduce the expected generalization error given by Eq. 8.2. This quantity is known as the risk. We emphasize here that the expectation is taken over the true underlying distribution p_data. If we knew the true distribution p_data(x, y), risk minimization would be an optimization task
CHAPTER 8. OPTIMIZATION FOR TRAINING DEEP MODELS
solvable by an optimization algorithm. However, when we do not know p_data(x, y) but only have a training set of samples, we have a machine learning problem.

The simplest way to convert a machine learning problem back into an optimization problem is to minimize the expected loss on the training set. This means replacing the true distribution p(x, y) with the empirical distribution p̂(x, y) defined by the training set. We now minimize the empirical risk

    E_{x,y∼p̂_data(x,y)}[L(f(x; θ), y)] = (1/m) Σ_{i=1}^{m} L(f(x^(i); θ), y^(i)),    (8.3)

where m is the number of training examples.

The training process based on minimizing this average training error is known as empirical risk minimization. In this setting, machine learning is still very similar to straightforward optimization. Rather than optimizing the risk directly, we optimize the empirical risk, and hope that the risk decreases significantly as well. A variety of theoretical results establish conditions under which the true risk can be expected to decrease by various amounts.

However, empirical risk minimization is prone to overfitting. Models with high capacity can simply memorize the training set. In many cases, empirical risk minimization is not really feasible. The most effective modern optimization algorithms are based on gradient descent, but many useful loss functions, such as 0-1 loss, have no useful derivatives (the derivative is either zero or undefined everywhere). These two problems mean that, in the context of deep learning, we rarely use empirical risk minimization. Instead, we must use a slightly different approach, in which the quantity that we actually optimize is even more different from the quantity that we truly want to optimize.
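The claim that the 0-1 loss has no useful derivatives can be checked numerically. In this sketch (a hypothetical linear classifier on synthetic data, used only for illustration), a tiny parameter perturbation leaves the 0-1 loss unchanged, so its finite-difference slope is zero almost everywhere, while a smooth surrogate such as the logistic loss responds:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # labels in {0, 1}
w = np.array([0.5, -0.2])                   # hypothetical weight vector

def zero_one_loss(w):
    preds = (X @ w > 0).astype(float)
    return np.mean(preds != y)

def logistic_loss(w):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    eps = 1e-12  # guard against log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

h = 1e-6
# A step this small almost never flips any prediction, so the
# finite-difference "gradient" of the 0-1 loss is exactly zero...
g01 = (zero_one_loss(w + np.array([h, 0.0])) - zero_one_loss(w)) / h
# ...while the smooth surrogate provides a usable slope to descend.
glog = (logistic_loss(w + np.array([h, 0.0])) - logistic_loss(w)) / h
```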
8.1.2
Surrogate Loss Functions and Early Stopping
Sometimes, the loss function we actually care about (say, classification error) is not one that can be optimized efficiently. For example, exactly minimizing expected 0-1 loss is typically intractable (exponential in the input dimension), even for a linear classifier (Marcotte and Savard, 1992). In such situations, one typically optimizes a surrogate loss function instead, which acts as a proxy but has advantages. For example, the negative log-likelihood of the correct class is typically used as a surrogate for the 0-1 loss. The negative log-likelihood allows the model to estimate the conditional probability of the classes, given the input, and if the model can do that well, then it can pick the classes that yield the least classification error in expectation.
In some cases, a surrogate loss function actually results in being able to learn more. For example, the test set 0-1 loss often continues to decrease for a long time after the training set 0-1 loss has reached zero, when training using the log-likelihood surrogate. This is because even when the expected 0-1 loss is zero, one can improve the robustness of the classifier by further pushing the classes apart from each other, obtaining a more confident and reliable classifier, thus extracting more information from the training data than would have been possible by simply minimizing the average 0-1 loss on the training set.

A very important difference between optimization in general and optimization as we use it for training algorithms is that training algorithms do not usually halt at a local minimum. Instead, a machine learning algorithm usually minimizes a surrogate loss function but halts when a convergence criterion based on early stopping (Sec. 7.8) is satisfied. Typically the early stopping criterion is based on the true underlying loss function, such as 0-1 loss measured on a validation set, and is designed to cause the algorithm to halt whenever overfitting begins to occur. Training often halts while the surrogate loss function still has large derivatives, which is very different from the pure optimization setting, where an optimization algorithm is considered to have converged when the gradient becomes very small.
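This halting rule can be sketched as a short training loop. Everything here (the logistic model, learning rate, and patience threshold) is a hypothetical illustration: the loop descends a smooth surrogate but stops when validation 0-1 error stops improving, regardless of the gradient norm.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical, roughly linearly separable binary data.
X = rng.normal(size=(300, 2))
y = (X[:, 0] - X[:, 1] > 0).astype(float)
Xtr, ytr, Xva, yva = X[:200], y[:200], X[200:], y[200:]

def val_error(w):
    """The true underlying loss: 0-1 error on the validation set."""
    return np.mean(((Xva @ w) > 0).astype(float) != yva)

w = np.zeros(2)
lr, patience = 0.1, 5          # hypothetical hyperparameters
best_err, best_w, bad_steps = np.inf, w.copy(), 0
for step in range(1000):
    p = 1.0 / (1.0 + np.exp(-(Xtr @ w)))   # gradient of the logistic surrogate
    grad = Xtr.T @ (p - ytr) / len(ytr)
    w -= lr * grad
    err = val_error(w)
    if err < best_err:
        best_err, best_w, bad_steps = err, w.copy(), 0
    else:
        bad_steps += 1
        if bad_steps >= patience:          # halt on the overfitting signal,
            break                          # not on a small gradient norm
```

Note that the loop keeps the parameters with the best validation error seen so far, which is the usual early-stopping convention.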
8.1.3
Batch and Minibatch Algorithms
One aspect of machine learning algorithms that separates them from general optimization algorithms is that the objective function usually decomposes as a sum over the training examples. Optimization algorithms for machine learning typically compute each update to the parameters based on an expected value of the cost function estimated using only a subset of the terms of the full cost function.

For example, maximum likelihood estimation problems, when viewed in log space, decompose into a sum over each example:

    θ_ML = argmax_θ Σ_{i=1}^{m} log p_model(x^(i), y^(i); θ).    (8.4)

Maximizing this sum is equivalent to maximizing the expectation over the empirical distribution defined by the training set:

    J(θ) = E_{x,y∼p̂_data} log p_model(x, y; θ).    (8.5)

Most of the properties of the objective function J used by most of our optimization algorithms are also expectations over the training set. For example, the
most commonly used property is the gradient:

    ∇_θ J(θ) = E_{x,y∼p̂_data} ∇_θ log p_model(x, y; θ).    (8.6)

Computing this expectation exactly is very expensive because it requires evaluating the model on every example in the entire dataset. In practice, we can compute these expectations by randomly sampling a small number of examples from the dataset, then taking the average over only those examples.

Recall that the standard error of the mean (Eq. 5.46) estimated from n samples is given by σ/√n, where σ is the true standard deviation of the value of the samples. The denominator √n shows that there are less than linear returns to using more examples to estimate the gradient. Compare two hypothetical estimates of the gradient, one based on 100 examples and another based on 10,000 examples. The latter requires 100 times more computation than the former, but reduces the standard error of the mean only by a factor of 10. Most optimization algorithms converge much faster (in terms of total computation, not in terms of number of updates) if they are allowed to rapidly compute approximate estimates of the gradient rather than slowly computing the exact gradient.

Another consideration motivating statistical estimation of the gradient from a small number of samples is redundancy in the training set. In the worst case, all m samples in the training set could be identical copies of each other. A sampling-based estimate of the gradient could compute the correct gradient with a single sample, using m times less computation than the naive approach. In practice, we are unlikely to truly encounter this worst-case situation, but we may find large numbers of examples that all make very similar contributions to the gradient.

Optimization algorithms that use the entire training set are called batch or deterministic gradient methods, because they process all of the training examples simultaneously in a large batch. This terminology can be somewhat confusing because the word "batch" is also often used to describe the minibatch used by minibatch stochastic gradient descent. Typically the term "batch gradient descent" implies the use of the full training set, while the use of the term "batch" to describe a group of examples does not. For example, it is very common to use the term "batch size" to describe the size of a minibatch.

Optimization algorithms that use only a single example at a time are sometimes called stochastic or sometimes online methods. The term online is usually reserved for the case where the examples are drawn from a stream of continually created examples rather than from a fixed-size training set over which several passes are made.

Most algorithms used for deep learning fall somewhere in between, using more
than one but less than all of the training examples. These were traditionally called minibatch or minibatch stochastic methods and it is now common to simply call them stochastic methods.

The canonical example of a stochastic method is stochastic gradient descent, presented in detail in Sec. 8.3.1.

Minibatch sizes are generally driven by the following factors:

• Larger batches provide a more accurate estimate of the gradient, but with less than linear returns.

• Multicore architectures are usually underutilized by extremely small batches. This motivates using some absolute minimum batch size, below which there is no reduction in the time to process a minibatch.

• If all examples in the batch are to be processed in parallel (as is typically the case), then the amount of memory scales with the batch size. For many hardware setups this is the limiting factor in batch size.

• Some kinds of hardware achieve better runtime with specific sizes of arrays. Especially when using GPUs, it is common for power of 2 batch sizes to offer better runtime. Typical power of 2 batch sizes range from 32 to 256, with 16 sometimes being attempted for large models.

• Small batches can offer a regularizing effect (Wilson and Martinez, 2003), perhaps due to the noise they add to the learning process. Generalization error is often best for a batch size of 1. Training with such a small batch size might require a small learning rate to maintain stability due to the high variance in the estimate of the gradient. The total runtime can be very high due to the need to make more steps, both because of the reduced learning rate and because it takes more steps to observe the entire training set.

Different kinds of algorithms use different kinds of information from the minibatch in different ways.
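The earlier point about less-than-linear returns (the σ/√n standard error) is easy to verify empirically. This sketch uses a made-up population of scalar per-example "gradients" (a stand-in for one coordinate of g, purely for illustration); growing the batch from 100 to 10,000 examples costs 100× more computation but shrinks the standard error only about 10×:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical per-example gradient values with true mean 0 and std 1.
population = rng.normal(loc=0.0, scale=1.0, size=1_000_000)

def gradient_estimate(n):
    """Average of n sampled per-example gradients."""
    idx = rng.integers(0, len(population), size=n)
    return population[idx].mean()

def standard_error(n, trials=2000):
    """Empirical standard error of the n-sample estimate."""
    return np.std([gradient_estimate(n) for _ in range(trials)])

se_100 = standard_error(100)
se_10000 = standard_error(10_000)
ratio = se_100 / se_10000   # theory predicts about sqrt(10000 / 100) = 10
```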
Some algorithms are more sensitive to sampling error than others, either because they use information that is difficult to estimate accurately with few samples, or because they use information in ways that amplify sampling errors more. Methods that compute updates based only on the gradient g are usually relatively robust and can handle smaller batch sizes like 100. Second-order methods, which also use the Hessian matrix H and compute updates such as H⁻¹g, typically require much larger batch sizes like 10,000. These large batch sizes are required to minimize fluctuations in the estimates of H⁻¹g. Suppose that H is estimated perfectly but has a poor condition number. Multiplication by
H or its inverse amplifies pre-existing errors, in this case, estimation errors in g. Very small changes in the estimate of g can thus cause large changes in the update H⁻¹g, even if H were estimated perfectly. Of course, H will be estimated only approximately, so the update H⁻¹g will contain even more error than we would predict from applying a poorly conditioned operation to the estimate of g.

It is also crucial that the minibatches be selected randomly. Computing an unbiased estimate of the expected gradient from a set of samples requires that those samples be independent. We also wish for two subsequent gradient estimates to be independent from each other, so two subsequent minibatches of examples should also be independent from each other. Many datasets are most naturally arranged in a way where successive examples are highly correlated. For example, we might have a dataset of medical data with a long list of blood sample test results. This list might be arranged so that first we have five blood samples taken at different times from the first patient, then we have three blood samples taken from the second patient, then the blood samples from the third patient, and so on. If we were to draw examples in order from this list, then each of our minibatches would be extremely biased, because it would represent primarily one patient out of the many patients in the dataset. In cases such as these where the order of the dataset holds some significance, it is necessary to shuffle the examples before selecting minibatches. For very large datasets, for example datasets containing billions of examples in a data center, it can be impractical to sample examples truly uniformly at random every time we want to construct a minibatch. Fortunately, in practice it is usually sufficient to shuffle the order of the dataset once and then store it in shuffled fashion. This will impose a fixed set of possible minibatches of consecutive examples that all models trained thereafter will use, and each individual model will be forced to reuse this ordering every time it passes through the training data. However, this deviation from true random selection does not seem to have a significant detrimental effect. Failing to ever shuffle the examples in any way can seriously reduce the effectiveness of the algorithm.

Many optimization problems in machine learning decompose over examples well enough that we can compute entire separate updates over different examples in parallel. In other words, we can compute the update that minimizes J(X) for one minibatch of examples X at the same time that we compute the update for several other minibatches. Such asynchronous parallel distributed approaches are discussed further in Sec. 12.1.3.

An interesting motivation for minibatch stochastic gradient descent is that it follows the gradient of the true generalization error (Eq. 8.2) so long as no examples are repeated. Most implementations of minibatch stochastic gradient
descent shuffle the dataset once and then pass through it multiple times. On the first pass, each minibatch is used to compute an unbiased estimate of the true generalization error. On the second pass, the estimate becomes biased because it is formed by re-sampling values that have already been used, rather than obtaining new fair samples from the data generating distribution.

The fact that stochastic gradient descent minimizes generalization error is easiest to see in the online learning case, where examples or minibatches are drawn from a stream of data. In other words, instead of receiving a fixed-size training set, the learner is similar to a living being who sees a new example at each instant, with every example (x, y) coming from the data generating distribution p_data(x, y). In this scenario, examples are never repeated; every experience is a fair sample from p_data.

The equivalence is easiest to derive when both x and y are discrete. In this case, the generalization error (Eq. 8.2) can be written as a sum

    J*(θ) = Σ_x Σ_y p_data(x, y) L(f(x; θ), y),    (8.7)

with the exact gradient
XX with the exact gradient p data(x, y)∇x L(f (x; θ), y). (8.8) g = ∇θ J ∗ (θ) = XxXy J (θ) = p (8.8) g= (x, y) L(f (x; θ), y). We hav havee already seen log-likelihood d in Eq. 8.5 ∇ the same fact demonstrated∇for the log-likelihoo and Eq. 8.6; we observ observee no now w that this holds for other functions L besides the Weeliho havo e d. already seen result the same demonstrated theylog-likelihoo d in Eq. 8.5 lik likeliho elihoo A similar canfact be derived when xforand are contin continuous, uous, under and Eq. 8.6; we observ e nopw that this holds for other functions L besides the XX mild assumptions regarding data and L. likelihood. A similar result can be derived when x and y are continuous, under Hence, w wee can obtain an un unbiased biased estimator of the exact gradient of the mild assumptions regarding p and L. generalization error by sampling a minibatc minibatch h of examples {x(1) , . . . x(m)} with corHence, w e can obtain an un biased estimator of the exact of the ( i ) pdata,gradient resp responding onding targets y from the data generating distribution and computing generalization by sampling aect minibatc of examplesforxthat with cor, . .minibatch: .x the gradient oferror the loss with resp respect to thehparameters responding targets y from the data generating distribution , and }computing { p X 1 ect the gradient of the loss with resp to the parameters for that minibatch: ˆ = ∇θ g L(f (x(i) ; θ), y (i)). (8.9) m i 1 ˆ= g L(f (x ; θ), y ). (8.9) Up Updating dating θ in the direction ofmgˆ∇performs SGD on the generalization error. Of course, thisdirection interpretation only applies when examples are error. not reused. Updating θ in the of gˆ performs SGD on the generalization Nonetheless, it is usually best to make sev several eral passes through the training set, X Of course, this interpretation only applies examples are notarereused. unless the training set is extremely large. 
When multiple such epochs are used, only the first epoch follows the unbiased
gradient of the generalization error.
CHAPTER 8. OPTIMIZATION FOR TRAINING DEEP MODELS
The additional epochs, of course, usually provide enough benefit due to decreased
training error to offset the harm they cause by increasing the gap between training
error and test error.

With some datasets growing rapidly in size, faster than computing power, it
is becoming more common for machine learning applications to use each training
example only once or even to make an incomplete pass through the training
set. When using an extremely large training set, overfitting is not an issue, so
underfitting and computational efficiency become the predominant concerns. See
also Bottou and Bousquet (2008) for a discussion of the effect of computational
bottlenecks on generalization error, as the number of training examples grows.
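The unbiasedness of the minibatch estimator in Eq. 8.9 is easy to check numerically. The sketch below uses a hypothetical linear least-squares problem (illustrative data, not an example from this chapter): averaging many independent minibatch gradients recovers the gradient over the full dataset.

```python
import numpy as np

# Illustrative linear least-squares problem (hypothetical data).
rng = np.random.default_rng(0)
n, d, m = 1000, 5, 10            # dataset size, parameter dimension, minibatch size
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
theta = rng.normal(size=d)

def grad(Xb, yb, theta):
    # Gradient of the average loss 0.5 * mean((x^T theta - y)^2) over a batch.
    return Xb.T @ (Xb @ theta - yb) / len(yb)

full_g = grad(X, y, theta)       # exact gradient over the whole training set

# Average many independent minibatch gradient estimates (analogue of Eq. 8.9).
trials = 5000
est = np.zeros(d)
for _ in range(trials):
    idx = rng.choice(n, size=m, replace=False)
    est += grad(X[idx], y[idx], theta)
est /= trials

# The average approaches the exact gradient: each minibatch gradient is unbiased.
print(np.max(np.abs(est - full_g)))
```

Strictly speaking, sampling minibatches from a finite training set gives an unbiased estimate of the training-set gradient; sampling fresh examples from p_data gives an unbiased estimate of the generalization-error gradient, as in the text.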
8.2 Challenges in Neural Network Optimization
Optimization in general is an extremely difficult task. Traditionally, machine
learning has avoided the difficulty of general optimization by carefully designing
the objective function and constraints to ensure that the optimization problem is
convex. When training neural networks, we must confront the general non-convex
case. Even convex optimization is not without its complications. In this section,
we summarize several of the most prominent challenges involved in optimization
for training deep models.
8.2.1 Ill-Conditioning
Some challenges arise even when optimizing convex functions. Of these, the most
prominent is ill-conditioning of the Hessian matrix H. This is a very general
problem in most numerical optimization, convex or otherwise, and is described in
more detail in Sec. 4.3.1.

The ill-conditioning problem is generally believed to be present in neural
network training problems. Ill-conditioning can manifest by causing SGD to get
"stuck" in the sense that even very small steps increase the cost function.

Recall from Eq. 4.9 that a second-order Taylor series expansion of the cost
function predicts that a gradient descent step of -\epsilon g will add

    \frac{1}{2} \epsilon^2 g^\top H g - \epsilon g^\top g    (8.10)

to the cost. Ill-conditioning of the gradient becomes a problem when \frac{1}{2} \epsilon^2 g^\top H g
exceeds \epsilon g^\top g. To determine whether ill-conditioning is detrimental to a neural
network training task, one can monitor the squared gradient norm g^\top g and the
Figure 8.1: Gradient descent often does not arrive at a critical point of any kind. In
this example, the gradient norm increases throughout training of a convolutional network
used for object detection. (Left) A scatterplot showing how the norms of individual
gradient evaluations are distributed over time. To improve legibility, only one gradient
norm is plotted per epoch. The running average of all gradient norms is plotted as a solid
curve. The gradient norm clearly increases over time, rather than decreasing as we would
expect if the training process converged to a critical point. (Right) Despite the increasing
gradient, the training process is reasonably successful. The validation set classification
error decreases to a low level.
g^\top H g term. In many cases, the gradient norm does not shrink significantly
throughout learning, but the g^\top H g term grows by more than an order of magnitude.
The result is that learning becomes very slow despite the presence of a strong
gradient because the learning rate must be shrunk to compensate for even stronger
curvature. Fig. 8.1 shows an example of the gradient increasing significantly during
the successful training of a neural network.

Though ill-conditioning is present in other settings besides neural network
training, some of the techniques used to combat it in other contexts are less
applicable to neural networks. For example, Newton's method is an excellent tool
for minimizing convex functions with poorly conditioned Hessian matrices, but in
the subsequent sections we will argue that Newton's method requires significant
modification before it can be applied to neural networks.
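The competition between the two terms of Eq. 8.10 can be checked on a small quadratic. The sketch below assumes an illustrative ill-conditioned Hessian H = diag(1, 1000); because the cost is exactly quadratic, the Taylor prediction matches the true change in cost, and the curvature term overtakes the descent term as the step size grows.

```python
import numpy as np

H = np.diag([1.0, 1000.0])   # illustrative ill-conditioned Hessian (condition number 1000)
x = np.array([1.0, 0.1])     # current parameters
g = H @ x                    # gradient of f(x) = 0.5 * x^T H x

def cost(x):
    return 0.5 * x @ H @ x

def predicted_change(eps):
    # Second-order Taylor prediction (Eq. 8.10) for a step of -eps * g.
    return 0.5 * eps**2 * (g @ H @ g) - eps * (g @ g)

for eps in (0.0005, 0.002, 0.005):
    actual = cost(x - eps * g) - cost(x)
    # The prediction is exact here because the cost is quadratic. For the
    # largest step, the curvature term dominates and the cost increases.
    print(eps, predicted_change(eps), actual)
```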
8.2.2 Local Minima
One of the most prominent features of a convex optimization problem is that it
can be reduced to the problem of finding a local minimum. Any local minimum is
guaranteed to be a global minimum. Some convex functions have a flat region at
the bottom rather than a single global minimum point, but any point within such
a flat region is an acceptable solution. When optimizing a convex function, we
know that we have reached a good solution if we find a critical point of any kind.

With non-convex functions, such as neural nets, it is possible to have many
local minima. Indeed, nearly any deep model is essentially guaranteed to have
an extremely large number of local minima. However, as we will see, this is not
necessarily a major problem.

Neural networks and any models with multiple equivalently parametrized latent
variables all have multiple local minima because of the model identifiability problem.
A model is said to be identifiable if a sufficiently large training set can rule out all
but one setting of the model's parameters.
Models with latent variables are often
not identifiable because we can obtain equivalent models by exchanging latent
variables with each other. For example, we could take a neural network and modify
layer 1 by swapping the incoming weight vector for unit i with the incoming weight
vector for unit j, then doing the same for the outgoing weight vectors. If we have
m layers with n units each, then there are n!^m ways of arranging the hidden units.
This kind of non-identifiability is known as weight space symmetry.

In addition to weight space symmetry, many kinds of neural networks have
additional causes of non-identifiability. For example, in any rectified linear or
maxout network, we can scale all of the incoming weights and biases of a unit by
α if we also scale all of its outgoing weights by \frac{1}{α}.
This means that, if the cost
function does not include terms such as weight decay that depend directly on the
weights rather than the models' outputs, every local minimum of a rectified linear
or maxout network lies on an (m × n)-dimensional hyperbola of equivalent local
minima.

These model identifiability issues mean that there can be an extremely large
or even uncountably infinite amount of local minima in a neural network cost
function. However, all of these local minima arising from non-identifiability are
equivalent to each other in cost function value. As a result, these local minima are
not a problematic form of non-convexity.

Local minima can be problematic if they have high cost in comparison to the
global minimum.
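The rescaling symmetry described above is easy to verify numerically. The following sketch (an illustrative two-layer ReLU network with made-up dimensions) scales the incoming weights and bias of every hidden unit by α and the outgoing weights by 1/α, leaving the network's output unchanged, since relu(αz) = α relu(z) for α > 0.

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 3))   # incoming weights of the hidden layer (4 units, 3 inputs)
b1 = rng.normal(size=4)
W2 = rng.normal(size=(2, 4))   # outgoing weights

def relu_net(x, W1, b1, W2):
    return W2 @ np.maximum(0.0, W1 @ x + b1)

x = rng.normal(size=3)
alpha = 3.7

# Scale incoming weights and bias by alpha, outgoing weights by 1/alpha:
# the function computed by the network is unchanged.
y_orig = relu_net(x, W1, b1, W2)
y_scaled = relu_net(x, alpha * W1, alpha * b1, W2 / alpha)

print(np.allclose(y_orig, y_scaled))  # True
```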
One can construct small neural networks, even without hidden
units, that have local minima with higher cost than the global minimum (Sontag
and Sussman, 1989; Brady et al., 1989; Gori and Tesi, 1992). If local minima
with high cost are common, this could pose a serious problem for gradient-based
optimization algorithms.

It remains an open question whether there are many local minima of high cost
for networks of practical interest and whether optimization algorithms encounter
them. For many years, most practitioners believed that local minima were a
common problem plaguing neural network optimization. Today, that does not
appear to be the case. The problem remains an active area of research, but experts
now suspect that, for sufficiently large neural networks, most local minima have a
low cost function value, and that it is not important to find a true global minimum
rather than to find a point in parameter space that has low but not minimal cost
(Saxe et al., 2013; Dauphin et al., 2014; Goodfellow et al., 2015; Choromanska
et al., 2014).

Many practitioners attribute nearly all difficulty with neural network optimiza-
tion to local minima. We encourage practitioners to carefully test for specific
problems. A test that can rule out local minima as the problem is to plot the
norm of the gradient over time.
If the norm of the gradient does not shrink to
insignificant size, the problem is neither local minima nor any other kind of critical
point. This kind of negative test can rule out local minima. In high dimensional
spaces, it can be very difficult to positively establish that local minima are the
problem. Many structures other than local minima also have small gradients.
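This diagnostic costs only one norm computation per step. A minimal sketch on a toy convex problem (illustrative, not from the text): record the gradient norm at every gradient descent step. Here the norm shrinks, which is consistent with approaching a critical point; a norm that stayed large would rule critical points out.

```python
import numpy as np

# Toy convex problem (illustrative): minimize 0.5 * ||A theta - b||^2 by
# gradient descent, recording the gradient norm at every step.
rng = np.random.default_rng(2)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)
theta = rng.normal(size=5)

grad_norms = []
for step in range(500):
    g = A.T @ (A @ theta - b)
    grad_norms.append(np.linalg.norm(g))
    theta -= 0.01 * g

# The norm shrinks toward zero here. If the recorded norms stayed large,
# critical points (local minima included) could be ruled out as the obstacle.
print(grad_norms[0], grad_norms[-1])
```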
8.2.3 Plateaus, Saddle Points and Other Flat Regions
For many high-dimensional non-convex functions, local minima (and maxima)
are in fact rare compared to another kind of point with zero gradient: a saddle
point. Some points around a saddle point have greater cost than the saddle point,
while others have a lower cost. At a saddle point, the Hessian matrix has both
positive and negative eigenvalues. Points lying along eigenvectors associated with
positive eigenvalues have greater cost than the saddle point, while points lying
along eigenvectors associated with negative eigenvalues have lower value. We can
think of a saddle point as being a local minimum along one cross-section of the
cost function and a local maximum along another cross-section. See Fig. 4.5 for
an illustration.

Many classes of random functions exhibit the following behavior: in low-
dimensional spaces, local minima are common. In higher dimensional spaces, local
or a function of dimensional spaces, lo cal minima are common. In higher dimensional spaces, lo cal this type, the exp expected ected ratio of the num numb ber of saddle poin oints ts to lo local cal minima grows R R f : minima are rare and saddle p oints are more common. F or a function of exp exponen onen onentially tially with n. To understand the intuition behind this beha ehavior, vior, observe this type, the exp ected ratio of the num b er of saddle p oin ts to lo cal minima grows → The that the Hessian matrix at a lo local cal minimum has only positiv ositivee eigen eigenv values. exp onen tially with . T o understand the intuition b ehind this b eha vior, observe n Hessian matrix at a saddle poin ointt has a mixture of positive and negativ negativee eigenv eigenvalues. alues. that the that Hessian matrix athaeigenv local alue minimum has only ositive aeigen The Imagine the sign of eac each eigenvalue is generated by p flipping coin.values. In a single Hessian matrix a saddle poinat lo has mixture of ositive and negativ e eigenvheads alues. dimension, it is at easy to obtain local calaminimum byptossing a coin and getting Imagine that the sign of eac h eigenv alue is generated by flipping a coin. In a single once. In n-dimensional space, it is exp exponentially onentially unlikely that all n coin tosses will dimension, it is easy to obtain a local minimum by tossing a coin and getting heads 285 once. In n-dimensional space, it is exponentially unlikely that all n coin tosses will
be heads. See Dauphin et al. (2014) for a review of the relevant theoretical work.

An amazing property of many random functions is that the eigenvalues of the
Hessian become more likely to be positive as we reach regions of lower cost. In
our coin tossing analogy, this means we are more likely to have our coin come up
heads n times if we are at a critical point with low cost. This means that local
minima are much more likely to have low cost than high cost. Critical points with
high cost are far more likely to be saddle points. Critical points with extremely
high cost are more likely to be local maxima.

This happens for many classes of random functions. Does it happen for neural
networks?
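The coin-tossing argument above can be simulated directly. Under the illustrative assumption that the sign of each of the n Hessian eigenvalues at a critical point is an independent fair coin flip, the fraction of critical points that are minima decays as (1/2)^n:

```python
import numpy as np

rng = np.random.default_rng(3)
trials = 200000

frac_minima = {}
for n in (1, 2, 5, 10):
    # Sign of each of the n Hessian eigenvalues at a hypothetical critical
    # point, modeled as an independent fair coin flip.
    signs = rng.choice([-1, 1], size=(trials, n))
    frac_minima[n] = np.mean(np.all(signs > 0, axis=1))
    print(n, frac_minima[n])   # close to (1/2)**n: minima become rare as n grows
```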
Baldi and Hornik (1989) showed theoretically that shallow autoencoders
(feedforward networks trained to copy their input to their output, described in
Chapter 14) with no nonlinearities have global minima and saddle points but no
local minima with higher cost than the global minimum. They observed without
proof that these results extend to deeper networks without nonlinearities. The
output of such networks is a linear function of their input, but they are useful
to study as a model of nonlinear neural networks because their loss function is
a non-convex function of their parameters. Such networks are essentially just
multiple matrices composed together. Saxe et al. (2013) provided exact solutions
to the complete learning dynamics in such networks and showed that learning in
these models captures many of the qualitative features observed in the training of
deep models with nonlinear activation functions.
Dauphin et al. (2014) showed
experimentally that real neural networks also have loss functions that contain very
many high-cost saddle points. Choromanska et al. (2014) provided additional
theoretical arguments, showing that another class of high-dimensional random
functions related to neural networks does so as well.

What are the implications of the proliferation of saddle points for training algo-
rithms? For first-order optimization algorithms that use only gradient information,
the situation is unclear. The gradient can often become very small near a saddle
point. On the other hand, gradient descent empirically seems to be able to escape
saddle points in many cases. Goodfellow et al. (2015) provided visualizations of
several learning trajectories of state-of-the-art neural networks, with an example
given in Fig.
8.2. These visualizations show a flattening of the cost function near
a prominent saddle point where the weights are all zero, but they also show the
gradient descent trajectory rapidly escaping this region. Goodfellow et al. (2015)
also argue that continuous-time gradient descent may be shown analytically to be
repelled from, rather than attracted to, a nearby saddle point, but the situation
may be different for more realistic uses of gradient descent.

For Newton's method, it is clear that saddle points constitute a problem.
Figure 8.2: A visualization of the cost function of a neural network, plotting J(θ) over two
projections of θ. Image adapted with permission from Goodfellow et al. (2015). These
visualizations appear similar for feedforward neural networks, convolutional networks,
and recurrent networks applied to real object recognition and natural language processing
tasks. Surprisingly, these visualizations usually do not show many conspicuous obstacles.
Prior to the success of stochastic gradient descent for training very large models beginning
in roughly 2012, neural net cost function surfaces were generally believed to have much
more non-convex structure than is revealed by these projections. The primary obstacle
revealed by this projection is a saddle point of high cost near where the parameters are
initialized, but, as indicated by the blue path, the SGD training trajectory escapes this
saddle point readily. Most of training time is spent traversing the relatively flat valley of
the cost function, which may be due to high noise in the gradient, poor conditioning of
the Hessian matrix in this region, or simply the need to circumnavigate the tall "mountain"
visible in the figure via an indirect arcing path.
Gradient descent is designed to move "downhill" and is not explicitly designed
to seek a critical point. Newton's method, however, is designed to solve for a
point where the gradient is zero. Without appropriate modification, it can jump
to a saddle point. The proliferation of saddle points in high dimensional spaces
presumably explains why second-order methods have not succeeded in replacing
gradient descent for neural network training. Dauphin et al. (2014) introduced
a saddle-free Newton method for second-order optimization and showed that it
improves significantly over the traditional version. Second-order methods remain
difficult to scale to large neural networks, but this saddle-free approach holds
promise if it could be scaled.
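Newton's attraction to saddle points is visible on the simplest saddle, f(x, y) = x² − y² (a toy example, not the saddle-free method itself). Because f is quadratic, a single Newton step lands exactly on the saddle at the origin, while gradient descent moves away from it along the y axis:

```python
import numpy as np

def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])   # gradient of f(x, y) = x^2 - y^2

H = np.diag([2.0, -2.0])               # Hessian (constant for this quadratic)

p = np.array([0.5, 0.3])

# One Newton step lands exactly on the saddle point at the origin.
newton = p - np.linalg.solve(H, grad(p))

# Gradient descent instead escapes: the y coordinate grows each step.
gd = p.copy()
for _ in range(50):
    gd = gd - 0.1 * grad(gd)

print(newton)        # [0. 0.]
print(abs(gd[1]))    # large: the trajectory has moved far from the saddle
```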
There are also maxima, whic which h are muc much h lik likee saddle poin oints ts from the There are other kinds of p oints with zero gradient besides minima saddle persp erspectiv ectiv ectivee of optimization—many algorithms are not attracted to and them, but p oints.dified There are alsomethod maxima, h are bmuc h lik e saddle poinrare ts from the unmo unmodified Newton’s is. whic Maxima ecome exp exponentially onentially in high perspective of optimization—many algorithms are not attracted to them, but dimensional space, just like minima do. unmodified Newton’s method is. Maxima become exponentially rare in high There may also be wide, flat regions of constant value. In these lo locations, cations, the dimensional space, just like minima do. gradien gradientt and also the Hessian are all zero. Suc Such h degenerate lo locations cations pose ma major jor There for mayallalso be wide,optimization flat regions algorithms. of constant vIn alue. In ex these locations, the problems numerical a conv convex problem, a wide, gradien t and alsoconsist the Hessian areofall zero. minima, Such degenerate locationsoptimization pose ma jor flat region must en entirely tirely global but in a general problems such for alla numerical optimization In aofconv problem, a wide, problem, region could corresp correspond ond algorithms. to a high value theexob objectiv jectiv jectivee function. flat region must consist entirely of global minima, but in a general optimization problem, such a region could correspond to a high value of the ob jective function.
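To make the contrast concrete, here is a small illustrative sketch (not from the book) using f(x, y) = x² − y², whose only critical point is a saddle at the origin: an unmodified Newton step solves for the zero-gradient point and jumps directly onto the saddle, while gradient descent moves downhill and escapes along the y direction.

```python
import numpy as np

# f(x, y) = x^2 - y^2 has a single critical point at the origin: a saddle.
grad = lambda p: np.array([2 * p[0], -2 * p[1]])
hess = np.array([[2.0, 0.0], [0.0, -2.0]])

p_newton = np.array([1.0, 1e-3])
p_gd = p_newton.copy()

for _ in range(10):
    p_newton = p_newton - np.linalg.solve(hess, grad(p_newton))  # Newton step
    p_gd = p_gd - 0.1 * grad(p_gd)                               # gradient descent step

print(p_newton)  # [0. 0.] -- Newton jumps straight to the saddle and stays there
print(p_gd)      # the y-coordinate has grown: gradient descent escapes downhill
```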
8.2.4
Cliffs and Exploding Gradients
Neural networks with many layers often have extremely steep regions resembling cliffs, as illustrated in Fig. 8.3. These result from the multiplication of several large weights together. On the face of an extremely steep cliff structure, the gradient update step can move the parameters extremely far, usually jumping off of the cliff structure altogether.
Figure 8.3: The objective function for highly nonlinear deep neural networks or for recurrent neural networks often contains sharp nonlinearities in parameter space resulting from the multiplication of several parameters. These nonlinearities give rise to very high derivatives in some places. When the parameters get close to such a cliff region, a gradient descent update can catapult the parameters very far, possibly losing most of the optimization work that had been done. Figure adapted with permission from Pascanu et al. (2013a).
The cliff can be dangerous whether we approach it from above or from below, but fortunately its most serious consequences can be avoided using the gradient clipping heuristic described in Sec. 10.11.1. The basic idea is to recall that the gradient does not specify the optimal step size, but only the optimal direction within an infinitesimal region. When the traditional gradient descent algorithm proposes to make a very large step, the gradient clipping heuristic intervenes to reduce the step size to be small enough that it is less likely to go outside the region where the gradient indicates the direction of approximately steepest descent. Cliff structures are most common in the cost functions for recurrent neural networks, because such models involve a multiplication of many factors, with one factor for each time step. Long temporal sequences thus incur an extreme amount of multiplication.
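As a sketch of the idea (the threshold value below is arbitrary, not a recommendation), norm-based clipping rescales an oversized gradient down to a fixed norm while preserving its direction:

```python
import numpy as np

def clip_gradient(g, threshold):
    """Norm clipping: rescale g if its norm exceeds threshold, keeping its direction."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = g * (threshold / norm)
    return g

g = np.array([300.0, -400.0])              # a huge gradient, e.g. from a cliff region
clipped = clip_gradient(g, threshold=5.0)
print(clipped)                             # direction preserved; norm is now 5.0
```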
8.2.5
Long-Term Dependencies
Another difficulty that neural network optimization algorithms must overcome arises when the computational graph becomes extremely deep. Feedforward networks with many layers have such deep computational graphs. So do recurrent networks, described in Chapter 10, which construct very deep computational graphs by
repeatedly applying the same operation at each time step of a long temporal sequence. Repeated application of the same parameters gives rise to especially pronounced difficulties.

For example, suppose that a computational graph contains a path that consists of repeatedly multiplying by a matrix W. After t steps, this is equivalent to multiplying by W^t. Suppose that W has an eigendecomposition W = V diag(λ) V^{−1}. In this simple case, it is straightforward to see that

    W^t = (V diag(λ) V^{−1})^t = V diag(λ)^t V^{−1}.    (8.11)
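Eq. 8.11 is easy to check numerically (a quick sketch with a randomly generated W; a real matrix may have complex eigenvalues, so we compare against the real part of the reconstruction):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
t = 5

# Eigendecomposition W = V diag(lam) V^{-1}
lam, V = np.linalg.eig(W)

Wt_direct = np.linalg.matrix_power(W, t)                 # W multiplied by itself t times
Wt_eig = (V @ np.diag(lam**t) @ np.linalg.inv(V)).real   # V diag(lam)^t V^{-1}

print(np.allclose(Wt_direct, Wt_eig))  # True
```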
Any eigenvalues λ_i that are not near an absolute value of 1 will either explode if they are greater than 1 in magnitude or vanish if they are less than 1 in magnitude. The vanishing and exploding gradient problem refers to the fact that gradients through such a graph are also scaled according to diag(λ)^t. Vanishing gradients make it difficult to know which direction the parameters should move to improve the cost function, while exploding gradients can make learning unstable. The cliff structures described earlier that motivate gradient clipping are an example of the exploding gradient phenomenon.

The repeated multiplication by W at each time step described here is very similar to the power method algorithm used to find the largest eigenvalue of a matrix W and the corresponding eigenvector. From this point of view it is not surprising that x^⊤ W^t will eventually discard all components of x that are orthogonal to the
principal eigenvector of W.

Recurrent networks use the same matrix W at each time step, but feedforward networks do not, so even very deep feedforward networks can largely avoid the vanishing and exploding gradient problem (Sussillo, 2014).

We defer a further discussion of the challenges of training recurrent networks until Sec. 10.7, after recurrent networks have been described in more detail.
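The scaling by diag(λ)^t can be seen directly in a small sketch (the matrix below is constructed for illustration, with eigenvalues 1.1 and 0.9): the component of x along the |λ| > 1 eigenvector explodes, the other vanishes, and the direction of W^t x converges to the principal eigenvector, just as in the power method.

```python
import numpy as np

# Construct W = V diag(1.1, 0.9) V^{-1}; the principal eigenvector is [1, 0].
V = np.array([[1.0, 1.0],
              [0.0, 1.0]])
W = V @ np.diag([1.1, 0.9]) @ np.linalg.inv(V)

x = np.array([0.0, 1.0])   # starts with a component off the principal direction
for t in (10, 50, 100):
    xt = np.linalg.matrix_power(W, t) @ x
    direction = xt / np.linalg.norm(xt)
    print(t, np.linalg.norm(xt), direction)
# The norm grows roughly like 1.1^t, while the normalized direction
# approaches the principal eigenvector [1, 0] (up to sign).
```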
8.2.6
Inexact Gradients
Most optimization algorithms are primarily motivated by the case where we have exact knowledge of the gradient or Hessian matrix. In practice, we usually only have a noisy or even biased estimate of these quantities. Nearly every deep learning algorithm relies on sampling-based estimates at least insofar as using a minibatch of training examples to compute the gradient.

In other cases, the objective function we want to minimize is actually intractable. When the objective function is intractable, typically its gradient is intractable as well. In such cases we can only approximate the gradient. These issues mostly arise
with the more advanced models in Part III. For example, contrastive divergence gives a technique for approximating the gradient of the intractable log-likelihood of a Boltzmann machine.

Various neural network optimization algorithms are designed to account for imperfections in the gradient estimate. One can also avoid the problem by choosing a surrogate loss function that is easier to approximate than the true loss.
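A tiny sketch of the sampling-based case (with made-up data, not an example from the book): a minibatch gradient of a mean-squared-error loss is a noisy estimate of the full-batch gradient, and averaging many independent minibatch estimates recovers it:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10000, 5))
y = rng.standard_normal(10000)
w = np.zeros(5)

def mse_grad(Xb, yb, w):
    # Gradient of (1/n) * ||Xb @ w - yb||^2 with respect to w
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

full = mse_grad(X, y, w)                          # exact gradient on all examples
idx = rng.choice(len(y), size=64, replace=False)
mini = mse_grad(X[idx], y[idx], w)                # one noisy minibatch estimate

# Averaging many independent minibatch gradients shrinks the sampling noise:
avg = np.mean([mse_grad(X[i], y[i], w)
               for i in rng.choice(len(y), size=(200, 64))], axis=0)

print(np.linalg.norm(mini - full))  # noticeable sampling noise
print(np.linalg.norm(avg - full))   # much smaller
```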
8.2.7
Poor Correspondence between Local and Global Structure
Many of the problems we have discussed so far correspond to properties of the loss function at a single point—it can be difficult to make a single step if J(θ) is poorly conditioned at the current point θ, or if θ lies on a cliff, or if θ is a saddle point hiding the opportunity to make progress downhill from the gradient.

It is possible to overcome all of these problems at a single point and still perform poorly if the direction that results in the most improvement locally does not point toward distant regions of much lower cost.

Goodfellow et al. (2015) argue that much of the runtime of training is due to the length of the trajectory needed to arrive at the solution. Fig. 8.2 shows that the learning trajectory spends most of its time tracing out a wide arc around a mountain-shaped structure.

Much of research into the difficulties of optimization has focused on whether
training arrives at a global minimum, a local minimum, or a saddle point, but in practice neural networks do not arrive at a critical point of any kind. Fig. 8.1 shows that neural networks often do not arrive at a region of small gradient. Indeed, such critical points do not even necessarily exist. For example, the loss function −log p(y | x; θ) can lack a global minimum and instead asymptotically approach some value as the model becomes more confident. For a classifier with discrete y and p(y | x) provided by a softmax, the negative log-likelihood can become arbitrarily close to zero if the model is able to correctly classify every example in the training set, but it is impossible to actually reach the value of zero.
Likewise, a model of real values p(y | x) = N(y; f(θ), β^{−1}) can have negative log-likelihood that asymptotes to negative infinity—if f(θ) is able to correctly predict the value of all training set targets, the learning algorithm will increase β without bound. See Fig. 8.4 for an example of a failure of local optimization to find a good cost function value even in the absence of any local minima or saddle points.

Future research will need to develop further understanding of the factors that influence the length of the learning trajectory and better characterize the outcome
Figure 8.4: Optimization based on local downhill moves can fail if the local surface does not point toward the global solution. Here we provide an example of how this can occur, even if there are no saddle points and no local minima. This example cost function contains only asymptotes toward low values, not minima. The main cause of difficulty in this case is being initialized on the wrong side of the “mountain” and not being able to traverse it. In higher dimensional space, learning algorithms can often circumnavigate such mountains but the trajectory associated with doing so may be long and result in excessive training time, as illustrated in Fig. 8.2.
of the process.

Many existing research directions are aimed at finding good initial points for problems that have difficult global structure, rather than developing algorithms that use non-local moves.

Gradient descent and essentially all learning algorithms that are effective for training neural networks are based on making small, local moves. The previous sections have primarily focused on how the correct direction of these local moves can be difficult to compute. We may be able to compute some properties of the objective function, such as its gradient, only approximately, with bias or variance in our estimate of the correct direction. In these cases, local descent may or may not define a reasonably short path to a valid solution, but we are not actually able to follow the local descent path. The objective function may have issues
such as poor conditioning or discontinuous gradients, causing the region where the gradient provides a good model of the objective function to be very small. In these cases, local descent with steps of size ε may define a reasonably short path to the solution, but we are only able to compute the local descent direction with steps of size δ ≪ ε. In these cases, local descent may or may not define a path to the solution, but the path contains many steps, so following the path incurs a
high computational cost. Sometimes local information provides us no guide, when the function has a wide flat region, or if we manage to land exactly on a critical point (usually this latter scenario only happens to methods that explicitly solve for critical points, such as Newton’s method). In these cases, local descent does not define a path to a solution at all. In other cases, local moves can be too greedy and lead us along a path that moves downhill but away from any solution, as in Fig. 8.4, or along an unnecessarily long trajectory to the solution, as in Fig. 8.2.

Currently, we do not understand which of these problems are most relevant to making neural network optimization difficult, and this is an active area of research.
Regardless of which of these problems are most significant, all of them might be avoided if there exists a region of space connected reasonably directly to a solution by a path that local descent can follow, and if we are able to initialize learning within that well-behaved region. This last view suggests research into choosing good initial points for traditional optimization algorithms to use.
8.2.8
Theoretical Limits of Optimization
Several theoretical results show that there are limits on the performance of any optimization algorithm we might design for neural networks (Blum and Rivest, 1992; Judd, 1989; Wolpert and MacReady, 1997). Typically these results have little bearing on the use of neural networks in practice.

Some theoretical results apply only to the case where the units of a neural network output discrete values. However, most neural network units output smoothly increasing values that make optimization via local search feasible. Some theoretical results show that there exist problem classes that are intractable, but it can be difficult to tell whether a particular problem falls into that class. Other results show that finding a solution for a network of a given size is intractable, but
in practice we can find a solution easily by using a larger network for which many more parameter settings correspond to an acceptable solution. Moreover, in the context of neural network training, we usually do not care about finding the exact minimum of a function, but only in reducing its value sufficiently to obtain good generalization error. Theoretical analysis of whether an optimization algorithm can accomplish this goal is extremely difficult. Developing more realistic bounds on the performance of optimization algorithms therefore remains an important goal for machine learning research.
8.3
Basic Algorithms
We have previously introduced the gradient descent (Sec. 4.3) algorithm that follows the gradient of an entire training set downhill. This may be accelerated considerably by using stochastic gradient descent to follow the gradient of randomly selected minibatches downhill, as discussed in Sec. 5.9 and Sec. 8.1.3.
8.3.1
Stochastic Gradient Descent
Stochastic gradient descent (SGD) and its variants are probably the most used optimization algorithms for machine learning in general and for deep learning in particular. As discussed in Sec. 8.1.3, it is possible to obtain an unbiased estimate of the gradient by taking the average gradient on a minibatch of m examples drawn i.i.d. from the data generating distribution.

Algorithm 8.1 shows how to follow this estimate of the gradient downhill.

Algorithm 8.1 Stochastic gradient descent (SGD) update at training iteration k
Require: Learning rate ε_k.
Require: Initial parameter θ
while stopping criterion not met do
    Sample a minibatch of m examples from the training set {x^(1), . . . , x^(m)} with corresponding targets y^(i).
    Compute gradient estimate: ĝ ← (1/m) ∇_θ ∑_i L(f(x^(i); θ), y^(i))
    Apply update: θ ← θ − ε ĝ
end while
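Algorithm 8.1 can be sketched in a few lines of NumPy (the least-squares problem, the fixed learning rate, and all hyperparameter values below are illustrative assumptions, not prescriptions from the text):

```python
import numpy as np

def sgd(grad_fn, theta, data, lr=0.01, batch_size=64, n_steps=1000, seed=0):
    """Algorithm 8.1: repeatedly sample a minibatch, estimate the gradient, step downhill."""
    rng = np.random.default_rng(seed)
    X, y = data
    for _ in range(n_steps):
        idx = rng.choice(len(y), size=batch_size, replace=False)
        g_hat = grad_fn(theta, X[idx], y[idx])   # minibatch gradient estimate
        theta = theta - lr * g_hat               # apply update
    return theta

# Usage: noiseless least-squares regression; gradient of (1/m) * ||X @ theta - y||^2
rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta
grad_fn = lambda th, Xb, yb: 2.0 * Xb.T @ (Xb @ th - yb) / len(yb)

theta = sgd(grad_fn, np.zeros(3), (X, y), lr=0.05, n_steps=2000)
print(theta)  # approximately [1.0, -2.0, 0.5]
```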
A crucial parameter for the SGD algorithm is the learning rate. Previously, we have described SGD as using a fixed learning rate ε. In practice, it is necessary to gradually decrease the learning rate over time, so we now denote the learning rate at iteration k as ε_k.

This is because the SGD gradient estimator introduces a source of noise (the random sampling of m training examples) that does not vanish even when we arrive at a minimum. By comparison, the true gradient of the total cost function becomes small and then 0 when we approach and reach a minimum using batch gradient descent, so batch gradient descent can use a fixed learning rate. A sufficient condition to guarantee convergence of SGD is that

    ∑_{k=1}^{∞} ε_k = ∞,    (8.12)

and
CHAPTER 8. OPTIMIZATION FOR TRAINING DEEP MODELS
∑_{k=1}^∞ ε_k² < ∞.    (8.13)
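As a concrete illustration (ours, not from the text), the classic schedule ε_k = ε_0/k satisfies both conditions: its partial sums grow without bound, while the partial sums of its squares converge. A quick numerical sketch:

```python
# Illustration (not from the text): the schedule eps_k = eps0 / k satisfies
# both conditions: the harmonic series (8.12) diverges, while the sum of
# its squares (8.13) converges (to eps0^2 * pi^2 / 6 for eps0 = 1).
import math

def eps(k, eps0=1.0):
    return eps0 / k

sum_eps = sum(eps(k) for k in range(1, 100001))          # partial sum of eps_k
sum_eps_sq = sum(eps(k) ** 2 for k in range(1, 100001))  # partial sum of eps_k^2

print(sum_eps)     # keeps growing without bound as more terms are added (~12.09 here)
print(sum_eps_sq)  # approaches pi^2 / 6 ≈ 1.6449
```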
In practice, it is common to decay the learning rate linearly until iteration τ:

ε_k = (1 − α) ε_0 + α ε_τ    (8.14)

with α = k/τ. After iteration τ, it is common to leave ε constant.

The learning rate may be chosen by trial and error, but it is usually best to choose it by monitoring learning curves that plot the objective function as a function of time. This is more of an art than a science, and most guidance on this subject should be regarded with some skepticism. When using the linear schedule, the parameters to choose are ε_0, ε_τ, and τ. Usually τ may be set to the number of iterations required to make a few hundred passes through the training set. Usually ε_τ should be set to roughly 1% the value of ε_0. The main question is how to set ε_0. If it is too large, the learning curve will show violent oscillations, with the cost function often increasing significantly.
Gentle oscillations are fine, especially if training with a stochastic cost function such as the cost function arising from the use of dropout. If the learning rate is too low, learning proceeds slowly, and if the initial learning rate is too low, learning may become stuck with a high cost value. Typically, the optimal initial learning rate, in terms of total training time and the final cost value, is higher than the learning rate that yields the best performance after the first 100 iterations or so. Therefore, it is usually best to monitor the first several iterations and use a learning rate that is higher than the best-performing learning rate at this time, but not so high that it causes severe instability.

The most important property of SGD and related minibatch or online gradient-based optimization is that computation time per update does not grow with the number of training examples. This allows convergence even when the number of training examples becomes very large.
For a large enough dataset, SGD may converge to within some fixed tolerance of its final test set error before it has processed the entire training set.

To study the convergence rate of an optimization algorithm it is common to measure the excess error J(θ) − min_θ J(θ), which is the amount that the current cost function exceeds the minimum possible cost. When SGD is applied to a convex problem, the excess error is O(1/√k) after k iterations, while in the strongly convex case it is O(1/k). These bounds cannot be improved unless extra conditions are assumed. Batch gradient descent enjoys better convergence rates than stochastic gradient descent in theory. However, the Cramér-Rao bound (Cramér, 1946; Rao, 1945) states that generalization error cannot decrease faster than O(1/k). Bottou
and Bousquet (2008) argue that it therefore may not be worthwhile to pursue an optimization algorithm that converges faster than O(1/k) for machine learning tasks—faster convergence presumably corresponds to overfitting. Moreover, the asymptotic analysis obscures many advantages that stochastic gradient descent has after a small number of steps. With large datasets, the ability of SGD to make rapid initial progress while evaluating the gradient for only very few examples outweighs its slow asymptotic convergence. Most of the algorithms described in the remainder of this chapter achieve benefits that matter in practice but are lost in the constant factors obscured by the O(1/k) asymptotic analysis. One can also trade off the benefits of both batch and stochastic gradient descent by gradually increasing the minibatch size during the course of learning.
For more information on SGD, see Bottou (1998).
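Putting the pieces of this section together, here is a minimal NumPy sketch of Algorithm 8.1 combined with the linear decay schedule of Eq. 8.14. The linear-regression loss, synthetic data, and hyperparameter values are illustrative choices, not prescriptions from the text:

```python
# A minimal sketch of SGD (Algorithm 8.1) with the linear learning rate
# decay of Eq. 8.14. The linear-regression loss, synthetic data, and
# hyperparameter values are illustrative, not from the text.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_theta = np.arange(1.0, 6.0)
y = X @ true_theta + 0.01 * rng.normal(size=1000)

def lr(k, eps0=0.1, tau=500):
    """Eq. 8.14: decay linearly until iteration tau, then hold eps_tau."""
    eps_tau = 0.01 * eps0          # heuristic from the text: eps_tau ~ 1% of eps0
    alpha = min(k / tau, 1.0)
    return (1 - alpha) * eps0 + alpha * eps_tau

theta = np.zeros(5)
m = 32                                          # minibatch size
for k in range(1, 2001):
    idx = rng.integers(0, len(X), size=m)       # sample a minibatch
    Xb, yb = X[idx], y[idx]
    grad = 2 / m * Xb.T @ (Xb @ theta - yb)     # gradient of mean squared error
    theta -= lr(k) * grad                       # Algorithm 8.1 update

print(np.round(theta, 2))  # close to the true coefficients [1, 2, 3, 4, 5]
```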
8.3.2 Momentum
While stochastic gradient descent remains a very popular optimization strategy, learning with it can sometimes be slow. The method of momentum (Polyak, 1964) is designed to accelerate learning, especially in the face of high curvature, small but consistent gradients, or noisy gradients. The momentum algorithm accumulates an exponentially decaying moving average of past gradients and continues to move in their direction. The effect of momentum is illustrated in Fig. 8.5.

Formally, the momentum algorithm introduces a variable v that plays the role of velocity—it is the direction and speed at which the parameters move through parameter space. The velocity is set to an exponentially decaying average of the negative gradient. The name momentum derives from a physical analogy, in which the negative gradient is a force moving a particle through parameter space, according to Newton's laws of motion.
Momentum in physics is mass times velocity. In the momentum learning algorithm, we assume unit mass, so the velocity vector v may also be regarded as the momentum of the particle. A hyperparameter α ∈ [0, 1) determines how quickly the contributions of previous gradients exponentially decay. The update rule is given by:

v ← αv − ε ∇_θ ( (1/m) ∑_{i=1}^m L(f(x^(i); θ), y^(i)) ),    (8.15)
θ ← θ + v.    (8.16)

The velocity v accumulates the gradient elements ∇_θ ( (1/m) ∑_{i=1}^m L(f(x^(i); θ), y^(i)) ). The larger α is relative to ε, the more previous gradients affect the current direction. The SGD algorithm with momentum is given in Algorithm 8.2.
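The update rule above can be sketched in a few lines of NumPy. The poorly conditioned quadratic loss below is an invented stand-in for a training objective, with full gradients in place of minibatch estimates:

```python
# A sketch of the momentum update (Eqs. 8.15-8.16) on an invented,
# poorly conditioned quadratic like the one in Fig. 8.5; full gradients
# stand in for minibatch gradient estimates.
import numpy as np

A = np.diag([1.0, 100.0])                # ill-conditioned Hessian
grad = lambda theta: A @ theta           # gradient of J(theta) = 0.5 theta^T A theta

eps, alpha = 0.01, 0.9                   # learning rate and momentum parameter
theta = np.array([10.0, 1.0])
v = np.zeros(2)
for _ in range(300):
    v = alpha * v - eps * grad(theta)    # Eq. 8.15
    theta = theta + v                    # Eq. 8.16

print(np.linalg.norm(theta) < 1e-2)  # True: settles near the minimum at the origin
```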
Figure 8.5: Momentum aims primarily to solve two problems: poor conditioning of the Hessian matrix and variance in the stochastic gradient. Here, we illustrate how momentum overcomes the first of these two problems. The contour lines depict a quadratic loss function with a poorly conditioned Hessian matrix. The red path cutting across the contours indicates the path followed by the momentum learning rule as it minimizes this function. At each step along the way, we draw an arrow indicating the step that gradient descent would take at that point. We can see that a poorly conditioned quadratic objective looks like a long, narrow valley or canyon with steep sides. Momentum correctly traverses the canyon lengthwise, while gradient steps waste time moving back and forth across the narrow axis of the canyon. Compare also Fig. 4.6, which shows the behavior of gradient descent without momentum.
Previously, the size of the step was simply the norm of the gradient multiplied by the learning rate. Now, the size of the step depends on how large and how aligned a sequence of gradients are. The step size is largest when many successive gradients point in exactly the same direction. If the momentum algorithm always observes gradient g, then it will accelerate in the direction of −g, until reaching a terminal velocity where the size of each step is

ε‖g‖ / (1 − α).    (8.17)

It is thus helpful to think of the momentum hyperparameter in terms of 1/(1 − α). For example, α = .9 corresponds to multiplying the maximum speed by 10 relative to the gradient descent algorithm.

Common values of α used in practice include .5, .9, and .99. Like the learning rate, α may also be adapted over time. Typically it begins with a small value and is later raised.
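This terminal velocity is easy to check numerically; the constant gradient below is an invented example:

```python
# Numeric check (invented example) of the terminal step size in Eq. 8.17:
# with a constant gradient g, the velocity approaches eps * ||g|| / (1 - alpha).
import numpy as np

eps, alpha = 0.1, 0.9
g = np.array([1.0, 0.0])                 # constant gradient with ||g|| = 1

v = np.zeros(2)
for _ in range(200):
    v = alpha * v - eps * g              # Eq. 8.15 with a fixed gradient

terminal = eps * np.linalg.norm(g) / (1 - alpha)   # Eq. 8.17: 0.1 / 0.1 = 1.0
print(np.linalg.norm(v), terminal)       # both approximately 1.0
```

Note that a plain gradient step here would have size ε‖g‖ = 0.1, so α = .9 indeed multiplies the maximum speed by 10.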
It is less important to adapt α over time than to shrink ε over time.

Algorithm 8.2 Stochastic gradient descent (SGD) with momentum
Require: Learning rate ε, momentum parameter α.
Require: Initial parameter θ, initial velocity v.
  while stopping criterion not met do
    Sample a minibatch of m examples from the training set {x^(1), . . . , x^(m)} with corresponding targets y^(i).
    Compute gradient estimate: g ← (1/m) ∇_θ ∑_i L(f(x^(i); θ), y^(i))
    Compute velocity update: v ← αv − εg
    Apply update: θ ← θ + v
  end while

We can view the momentum algorithm as simulating a particle subject to continuous-time Newtonian dynamics. The physical analogy can help to build intuition for how the momentum and gradient descent algorithms behave.

The position of the particle at any point in time is given by θ(t). The particle experiences net force f(t). This force causes the particle to accelerate:

f(t) = (∂²/∂t²) θ(t).    (8.18)
Rather than viewing this as a second-order differential equation of the position, we can introduce the variable v(t) representing the velocity of the particle at time t and rewrite the Newtonian dynamics as a first-order differential equation:

v(t) = (∂/∂t) θ(t),    (8.19)
f(t) = (∂/∂t) v(t).    (8.20)

The momentum algorithm then consists of solving the differential equations via numerical simulation. A simple numerical method for solving differential equations is Euler's method, which simply consists of simulating the dynamics defined by the equation by taking small, finite steps in the direction of each gradient.

This explains the basic form of the momentum update, but what specifically are the forces? One force is proportional to the negative gradient of the cost function: −∇_θ J(θ). This force pushes the particle downhill along the cost function surface. The gradient descent algorithm would simply take a single step based on each gradient, but the Newtonian scenario used by the momentum algorithm instead uses this force to alter the velocity of the particle.
We can think of the particle as being like a hockey puck sliding down an icy surface. Whenever it descends a steep part of the surface, it gathers speed and continues sliding in that direction until it begins to go uphill again.

One other force is necessary. If the only force is the gradient of the cost function, then the particle might never come to rest. Imagine a hockey puck sliding down one side of a valley and straight up the other side, oscillating back and forth forever, assuming the ice is perfectly frictionless. To resolve this problem, we add one other force, proportional to −v(t). In physics terminology, this force corresponds to viscous drag, as if the particle must push through a resistant medium such as syrup. This causes the particle to gradually lose energy over time and eventually converge to a local minimum.

Why do we use −v(t) and viscous drag in particular? Part of the reason to
How However, ever, other physical systems ha have ve otherPart kinds based ( t v use ) is mathematical con v enience—an integer p ow er of the velocity is easy −ers of the velocity on other in integer teger pow owers velocity.. For example, a particle tra trav veling through to w ork with. How ever, otherdrag, physical have other to kinds drag of based the − air exp experiences eriences turbulent withsystems force prop proportional ortional the of square the on other in teger p ow ers of the velocity . F or example, a particle tra v eling through velo elocit cit city y, while a particle mo moving ving along the ground exp experiences eriences dry friction, with a the air exp eriences turbulent drag, with force prop ortional to theTurbulent square ofdrag, the force of constant magnitude. We can reject each of these options. vprop elo cit y , while a particle mo ving along the ground exp eriences dry friction, with a proportional ortional to the square of the velo elocit cit city y, becomes very weak when the velocity is force ofItconstant magnitude. We can rejectthe each of these options. Turbulent drag, small. is not pow owerful erful enough to force particle to come to rest. A particle proportional to the square of thethat veloexp cityeriences , becomes very weak when the velocity is with a non-zero initial velocity experiences only the force of turbulen turbulent t drag small. It is not powerful enough to force the particle come tofrom rest.the A starting particle will mo move ve aw away ay from its initial position forever, with thetodistance with a non-zero initial velocity that exp eriences only the force of turbulen t drag. poin ointt growing like O(log t). We must therefore use a low lower er pow ower er of the velocity velocity. will ve aw from initial position with the distance fromisthe If wemo use a pay ow ower er ofits zero, represen representing tingforever, dry friction, then the force to too ostarting strong. 
point growing like O(log t). We must therefore use a lower power of the velocity. If we use a power of zero, representing dry friction, then the force is too strong. When the force due to the gradient of the cost function is small but non-zero, the constant force due to friction can cause the particle to come to rest before reaching a local minimum. Viscous drag avoids both of these problems—it is weak enough
that the gradient can continue to cause motion until a minimum is reached, but strong enough to prevent motion if the gradient does not justify moving.
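The continuous-time picture above can be sketched with Euler's method applied to the first-order system of Eqs. 8.19-8.20. The toy cost and the drag coefficient below are invented for illustration:

```python
# A sketch (invented toy problem) of the continuous-time view: Euler-integrate
# the first-order system of Eqs. 8.19-8.20, with the net force equal to the
# negative cost gradient plus viscous drag proportional to -v(t) (unit mass).
grad_J = lambda theta: 2 * theta         # gradient of the toy cost J(theta) = theta^2
drag = 5.0                               # viscous drag coefficient (illustrative)
dt = 0.01                                # Euler step size

theta, v = 3.0, 0.0
for _ in range(5000):
    f = -grad_J(theta) - drag * v        # downhill pull plus viscous drag
    v += dt * f                          # Euler step for dv/dt = f(t)
    theta += dt * v                      # Euler step for dtheta/dt = v(t)

print(abs(theta) < 1e-3)  # True: viscous drag lets the particle settle at the minimum
```

Without the drag term the particle would oscillate around the minimum forever, which is the frictionless-hockey-puck scenario described above.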
8.3.3 Nesterov Momentum
Sutskever et al. (2013) introduced a variant of the momentum algorithm that was inspired by Nesterov's accelerated gradient method (Nesterov, 1983, 2004). The update rules in this case are given by:

v ← αv − ε ∇_θ [ (1/m) ∑_{i=1}^m L(f(x^(i); θ + αv), y^(i)) ],    (8.21)
θ ← θ + v,    (8.22)

where the parameters α and ε play a similar role as in the standard momentum method. The difference between Nesterov momentum and standard momentum is where the gradient is evaluated. With Nesterov momentum the gradient is evaluated after the current velocity is applied. Thus one can interpret Nesterov momentum as attempting to add a correction factor to the standard method of momentum. The complete Nesterov momentum algorithm is presented in Algorithm 8.3.

In the convex batch gradient case, Nesterov momentum brings the rate of convergence of the excess error from O(1/k) (after k steps) to O(1/k²) as shown by Nesterov (1983). Unfortunately, in the stochastic gradient case, Nesterov
Unfortunately Unfortunately, , in Nestero the sto stochastic gradient t case,theNestero Nesterov v con v ergence of the excess error from ) (after steps) to ) as shown O (1 /k k O (1 /k momen momentum tum do does es not improv improvee the rate of conv convergence. ergence. by Nesterov (1983). Unfortunately, in the stochastic gradient case, Nesterov momentum do es Sto notchastic improvgradien e the rate of conv ergence. Algorithm 8.3 Stochastic gradient t descent (SGD) with Nesterov momentum Require: Learning rate , momentum parameter Algorithm 8.3 Stochastic gradient descent (SGD)α.with Nesterov momentum Require: Initial parameter θ, initial velocity v. Require: Learning rate , momentum while stopping criterion not met do parameter α. Require: parameter , initial velocity SampleInitial a minibatch of mθexamples from thev.training set {x(1), . . . , x (m)} with while do stopping criterion not met corresp corresponding onding lab labels els y(i) . Sample a minibatch of mθ˜ examples set x , . . . , x with Apply interim up update: date: ← θ + αvfrom the training P corresponding labels . Compute gradient (atyinterim point): g ← m1 ∇θ˜ i L(f{(x (i); θ˜), y (i)) } ˜ Apply interim update: θ v θ←+ααvv− g Compute velocity up update: date: L(f (x ; θ˜), y ) Compute gradient (at interim Apply up update: date: θ ← θ + v← point): g Compute αv g ← ∇ end while velocity update: v Apply update: θ θ+v ← − end while ← P 300
8.4 Parameter Initialization Strategies
Some optimization algorithms are not iterative by nature and simply solve for a solution point. Other optimization algorithms are iterative by nature but, when applied to the right class of optimization problems, converge to acceptable solutions in an acceptable amount of time regardless of initialization. Deep learning training algorithms usually do not have either of these luxuries. Training algorithms for deep learning models are usually iterative in nature and thus require the user to specify some initial point from which to begin the iterations. Moreover, training deep models is a sufficiently difficult task that most algorithms are strongly affected by the choice of initialization. The initial point can determine whether the algorithm converges at all, with some initial points being so unstable that the algorithm encounters numerical difficulties and fails altogether. When learning does converge,
When learning does conv converge, erge, con v erges at all, with some initial p oints being so unstable that the algorithm the initial poin ointt can determine how quickly learning conv converges erges and whether it encoun ters to numerical learning does converge, con conv verges a pointdifficulties with highand or fails low altogether. cost. Also,When points of comparable cost the initial p oin t can determine how quickly learning conv erges and whether it can hav havee wildly varying generalization error, and the initial point can affect the converges to aaspw oint generalization ell. with high or low cost. Also, points of comparable cost can have wildly varying generalization error, and the initial point can affect the Mo Modern dern initialization strategies are simple and heuristic. Designing improv improved ed generalization as well. initialization strategies is a difficult task because neural netw network ork optimization is strategies are simple and heuristic. Designing improv ed not Mo yetdern well initialization understo understoood. Most initialization strategies are based on achieving some initialization strategies a difficult becauseHow neural orknot optimization nice prop properties erties when theis netw network ork is task initialized. However, ever,netw we do hav havee a go goo ois d not y et well understo o d. Most initialization strategies are based on achieving some understanding of whic which h of these prop properties erties are preserv preserved ed under whic which h circumstances nice prop ertiesbwhen netw ork is However, we do notinitial have apoin go ots d after learning egins the to pro proceed. ceed. A initialized. further difficulty is that some oints understanding of whic h ofthe these prop erties preserved under which circumstances ma may y be beneficial from viewp viewpoint oint of are optimization but detrimental from the after learning b egins to pro ceed. 
A further difficulty is that some initial points viewp viewpoin oin ointt of generalization. Our understanding of how the initial poin ointt affects ma y b e b eneficial from the viewp oint of optimization but detrimental from the generalization is esp especially ecially primitiv primitive, e, offering little to no guidance for how to select viewp oint of generalization. Our understanding of how the initial point affects the initial poin oint. t. generalization is especially primitive, offering little to no guidance for how to select Perhaps the only prop property erty known with complete certaint certainty y is that the initial the initial point. parameters need to “break symmetry” b betw etw etween een differen differentt units. If tw two o hidden P erhaps the only prop erty known with complete certaint y is that the units with the same activ activation ation function are connected to the same inputs,initial then parameters need hav to “break symmetry” between differen t units. If same two hidden these units must have e different initial parameters. If they hav havee the initial units with the same activation function connected to the inputs, then parameters, then a deterministic learning are algorithm applied to asame deterministic cost these units must hav e different initial parameters. If they hav e the same initial and mo model del will constan constantly tly up update date both of these units in the same way. Ev Even en if the parameters, then a deterministic learning algorithm applied to a deterministic costt mo model del or training algorithm is capable of using sto stochasticit chasticit chasticity y to compute differen different anddates model constan tly up date both of ifthese the drop sameout), way. 
itEv if the up updates forwill differen different t units (for example, one units trainsinwith dropout), isenusually mo trainingeach algorithm capable ofa using sto to compute differen b estdeltoorinitialize unit tois compute differen different t chasticit functiony from all of the othert updatesThis for differen t units (fore example, oneinput trainspatterns with drop usually units. may help to mak make sure thatif no areout), lost itinisthe null b est to initialize each unit to compute a differen t function from all of the other space of forward propagation and no gradien gradientt patterns are lost in the null space units. This may helpThe to mak sure that eac nohinput patterns aare lost in function the null of back-propagation. goale of having each unit compute different space of forward propagation and no gradient patterns are lost in the null space 301 each unit compute a different function of back-propagation. The goal of having
CHAPTER 8. OPTIMIZATION FOR TRAINING DEEP MODELS
motivates random initialization of the parameters. We could explicitly search for a large set of basis functions that are all mutually different from each other, but this often incurs a noticeable computational cost. For example, if we have at most as many outputs as inputs, we could use Gram-Schmidt orthogonalization on an initial weight matrix, and be guaranteed that each unit computes a very different function from each other unit. Random initialization from a high-entropy distribution over a high-dimensional space is computationally cheaper and unlikely to assign any units to compute the same function as each other.

Typically, we set the biases for each unit to heuristically chosen constants, and initialize only the weights randomly. Extra parameters, for example, parameters encoding the conditional variance of a prediction, are usually set to heuristically chosen constants much like the biases are.

We almost always initialize all the weights in the model to values drawn randomly from a Gaussian or uniform distribution. The choice of Gaussian or uniform distribution does not seem to matter very much, but has not been exhaustively studied. The scale of the initial distribution, however, does have a large effect on both the outcome of the optimization procedure and on the ability of the network to generalize.

Larger initial weights will yield a stronger symmetry breaking effect, helping to avoid redundant units. They also help to avoid losing signal during forward or back-propagation through the linear component of each layer—larger values in the matrix result in larger outputs of matrix multiplication. Initial weights that are too large may, however, result in exploding values during forward propagation or back-propagation. In recurrent networks, large weights can also result in chaos (such extreme sensitivity to small perturbations of the input that the behavior of the deterministic forward propagation procedure appears random). To some extent, the exploding gradient problem can be mitigated by gradient clipping (thresholding the values of the gradients before performing a gradient descent step). Large weights may also result in extreme values that cause the activation function to saturate, causing complete loss of gradient through saturated units. These competing factors determine the ideal initial scale of the weights.

The perspectives of regularization and optimization can give very different insights into how we should initialize a network. The optimization perspective suggests that the weights should be large enough to propagate information successfully, but some regularization concerns encourage making them smaller. The use of an optimization algorithm such as stochastic gradient descent that makes small incremental changes to the weights and tends to halt in areas that are nearer to the initial parameters (whether due to getting stuck in a region of low gradient, or
due to triggering some early stopping criterion based on overfitting) expresses a prior that the final parameters should be close to the initial parameters. Recall from Sec. 7.8 that gradient descent with early stopping is equivalent to weight decay for some models. In the general case, gradient descent with early stopping is not the same as weight decay, but does provide a loose analogy for thinking about the effect of initialization. We can think of initializing the parameters θ to θ0 as being similar to imposing a Gaussian prior p(θ) with mean θ0. From this point of view, it makes sense to choose θ0 to be near 0. This prior says that it is more likely that units do not interact with each other than that they do interact. Units interact only if the likelihood term of the objective function expresses a strong preference for them to interact. On the other hand, if we initialize θ0 to large values, then our prior specifies which units should interact with each other, and how they should interact.

Some heuristics are available for choosing the initial scale of the weights. One heuristic is to initialize the weights of a fully connected layer with m inputs and n outputs by sampling each weight from U(−1/√m, 1/√m), while Glorot and Bengio (2010) suggest using the normalized initialization

W_{i,j} ∼ U(−√(6/(m+n)), √(6/(m+n))).    (8.23)
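As a concrete sketch (not code from the book), both heuristics fit in a few lines of NumPy; the layer sizes below are illustrative:

```python
import numpy as np

def uniform_sqrt_m(m, n, rng):
    """Common heuristic: sample each weight from U(-1/sqrt(m), 1/sqrt(m))."""
    limit = 1.0 / np.sqrt(m)
    return rng.uniform(-limit, limit, size=(m, n))

def glorot_normalized(m, n, rng):
    """Normalized initialization of Eq. 8.23 (Glorot and Bengio, 2010)."""
    limit = np.sqrt(6.0 / (m + n))
    return rng.uniform(-limit, limit, size=(m, n))

rng = np.random.default_rng(0)
W = glorot_normalized(784, 256, rng)   # illustrative layer sizes
# Var(U(-a, a)) = a^2 / 3 = 2 / (m + n): the variance that balances
# activation and gradient variance under the linear-network assumption.
print(W.var())
```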
This latter heuristic is designed to compromise between the goal of initializing all layers to have the same activation variance and the goal of initializing all layers to have the same gradient variance. The formula is derived using the assumption that the network consists only of a chain of matrix multiplications, with no nonlinearities. Real neural networks obviously violate this assumption, but many strategies designed for the linear model perform reasonably well on its nonlinear counterparts.

Saxe et al. (2013) recommend initializing to random orthogonal matrices, with a carefully chosen scaling or gain factor g that accounts for the nonlinearity applied at each layer. They derive specific values of the scaling factor for different types of nonlinear activation functions. This initialization scheme is also motivated by a model of a deep network as a sequence of matrix multiplies without nonlinearities. Under such a model, this initialization scheme guarantees that the total number of training iterations required to reach convergence is independent of depth.
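A minimal sketch of this kind of scheme, assuming a QR-based sampler; the gain value sqrt(2) below is an illustrative choice, not one prescribed by the book:

```python
import numpy as np

def orthogonal_init(m, n, gain=1.0, rng=None):
    """Random orthogonal initialization in the spirit of Saxe et al. (2013).
    `gain` plays the role of the scaling factor g."""
    if rng is None:
        rng = np.random.default_rng(0)
    a = rng.standard_normal((max(m, n), min(m, n)))
    q, r = np.linalg.qr(a)            # q has orthonormal columns
    q = q * np.sign(np.diag(r))       # sign fix so q is sampled uniformly
    if m < n:
        q = q.T
    return gain * q[:m, :n]

W = orthogonal_init(4, 4, gain=np.sqrt(2.0))
# Columns remain orthogonal, so W.T @ W equals gain**2 times the identity:
print(np.allclose(W.T @ W, 2.0 * np.eye(4)))
```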
Increasing the scaling factor g pushes the network toward the regime where activations increase in norm as they propagate forward through the network and gradients increase in norm as they propagate backward. Sussillo (2014) showed that setting the gain factor correctly is sufficient to train networks as deep as 1,000 layers, without needing to use orthogonal initializations. A key insight of
this approach is that in feedforward networks, activations and gradients can grow or shrink on each step of forward or back-propagation, following a random walk behavior. This is because feedforward networks use a different weight matrix at each layer. If this random walk is tuned to preserve norms, then feedforward networks can mostly avoid the vanishing and exploding gradients problem that arises when the same weight matrix is used at each step, described in Sec. 8.2.5.

Unfortunately, these optimal criteria for initial weights often do not lead to optimal performance. This may be for three different reasons. First, we may be using the wrong criteria—it may not actually be beneficial to preserve the norm of a signal throughout the entire network. Second, the properties imposed at initialization may not persist after learning has begun to proceed. Third, the criteria might succeed at improving the speed of optimization but inadvertently increase generalization error. In practice, we usually need to treat the scale of the weights as a hyperparameter whose optimal value lies somewhere roughly near but not exactly equal to the theoretical predictions.

One drawback to scaling rules that set all of the initial weights to have the same standard deviation, such as 1/√m, is that every individual weight becomes extremely small when the layers become large. Martens (2010) introduced an alternative initialization scheme called sparse initialization in which each unit is initialized to have exactly k non-zero weights. The idea is to keep the total amount of input to the unit independent from the number of inputs m without making the magnitude of individual weight elements shrink with m. Sparse initialization helps to achieve more diversity among the units at initialization time.
However, it also imposes a very strong prior on the weights that are chosen to have large Gaussian values. Because it takes a long time for gradient descent to shrink "incorrect" large values, this initialization scheme can cause problems for units such as maxout units that have several filters that must be carefully coordinated with each other.

When computational resources allow it, it is usually a good idea to treat the initial scale of the weights for each layer as a hyperparameter, and to choose these scales using a hyperparameter search algorithm described in Sec. 11.4.2, such as random search. The choice of whether to use dense or sparse initialization can also be made a hyperparameter.
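The sparse scheme discussed above can be sketched as follows; the value of k and the unit-variance Gaussian are illustrative choices, not values from the book:

```python
import numpy as np

def sparse_init(m, n, k=15, rng=None):
    """Sparse initialization in the spirit of Martens (2010): each of the
    n units receives exactly k non-zero Gaussian incoming weights,
    regardless of the number of inputs m."""
    if rng is None:
        rng = np.random.default_rng(0)
    W = np.zeros((m, n))
    for j in range(n):
        rows = rng.choice(m, size=k, replace=False)  # k distinct inputs
        W[rows, j] = rng.standard_normal(k)
    return W

W = sparse_init(1000, 50, k=15)
# Every column (unit) has exactly k non-zero entries:
print((W != 0).sum(axis=0).min(), (W != 0).sum(axis=0).max())
```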
Alternately, one can manually search for the best initial scales. A good rule of thumb for choosing the initial scales is to look at the range or standard deviation of activations or gradients on a single minibatch of data. If the weights are too small, the range of activations across the minibatch will shrink as the activations propagate forward through the network. By repeatedly identifying the first layer with unacceptably small activations and increasing its weights, it is possible to eventually obtain a network with reasonable
initial activations throughout. If learning is still too slow at this point, it can be useful to look at the range or standard deviation of the gradients as well as the activations. This procedure can in principle be automated and is generally less computationally costly than hyperparameter optimization based on validation set error because it is based on feedback from the behavior of the initial model on a single batch of data, rather than on feedback from a trained model on the validation set. While long used heuristically, this protocol has recently been specified more formally and studied by Mishkin and Matas (2015).

So far we have focused on the initialization of the weights. Fortunately, initialization of other parameters is typically easier.

The approach for setting the biases must be coordinated with the approach for setting the weights. Setting the biases to zero is compatible with most weight initialization schemes.
setting There are few situations we ma may ywith set some biases to for settings the weights. Setting the biases to zero is compatible with most weight non-zero values: initialization schemes. There are a few situations where we may set some biases to non-zero alues: • If a v bias is for an output unit, then it is often beneficial to initialize the bias to obtain the righ rightt marginal statistics of the output. To do this, we assume that If a bias is for antsoutput unit,enough then itthat is often the bias to the initial weigh eights are small the beneficial output of to theinitialize unit is determined obtain the righ t marginal statistics of the output. T o do this, w e assume that • only by the bias. This justifies setting the bias to the inv inverse erse of the activ activation ation the initialapplied weightstoare enough that theofoutput of the in unit determined function thesmall marginal statistics the output theistraining set. only b y the bias. This justifies setting the bias to the inv erse of the activ ation For example, if the output is a distribution ov over er classes and this distribution function applied to the marginal statistics of the probability output in the training set. i giv is a highly sk skew ew ewed ed distribution with the marginal of class given en F or example, if the output is a distribution ov er classes and this distribution by element ci of some vector c, then we can set the bias vector b by solving is a equation highly skew ed distribution with applies the marginal probability of class given ( b) = c . This the softmax softmax( not only to classifiers but ialso to c c b b y element of some vector , then we can set the bias vector b y solving mo models dels we will encounter in Part III, such as auto autoenco enco encoders ders and Boltzmann ( b ) = c the equation softmax . This applies not only to classifiers to mac machines. hines. 
These models have layers whose output should resemble the input data x, and it can be very helpful to initialize the biases of such layers to match the marginal distribution over x.

• Sometimes we may want to choose the bias to avoid causing too much saturation at initialization. For example, we may set the bias of a ReLU hidden unit to 0.1 rather than 0 to avoid saturating the ReLU at initialization. This approach is not compatible with weight initialization schemes that do not expect strong input from the biases though. For example, it is not recommended for use with random walk initialization (Sussillo, 2014).

• Sometimes a unit controls whether other units are able to participate in a function. In such situations, we have a unit with output u and another unit h ∈ [0, 1]; then we can view h as a gate that determines whether uh ≈ u or uh ≈ 0. In these situations, we want to set the bias for h so that h ≈ 1 most
of the time at initialization. Otherwise u does not have a chance to learn. For example, Jozefowicz et al. (2015) advocate setting the bias to 1 for the forget gate of the LSTM model, described in Sec. 10.10.

Another common type of parameter is a variance or precision parameter. For example, we can perform linear regression with a conditional variance estimate using the model

p(y | x) = N(y; w⊤x + b, 1/β)    (8.24)

where β is a precision parameter. We can usually initialize variance or precision parameters to 1 safely. Another approach is to assume the initial weights are close enough to zero that the biases may be set while ignoring the effect of the weights, then set the biases to produce the correct marginal mean of the output, and set the variance parameters to the marginal variance of the output in the training set.
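The two bias recipes above — solving softmax(b) = c for output units, and initializing a precision parameter to 1 — are easy to state in code; the class marginals below are made-up numbers for illustration:

```python
import numpy as np

# Output-unit biases via softmax(b) = c: b = log(c) works because
# softmax is invariant to adding a constant to b.
c = np.array([0.85, 0.10, 0.05])     # illustrative marginal class frequencies
b = np.log(c)
recovered = np.exp(b) / np.exp(b).sum()
print(np.allclose(recovered, c))

# Precision parameter beta of Eq. 8.24: initializing to 1 is usually safe.
beta = 1.0
```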
Besides these simple constant or random methods of initializing model parameters, it is possible to initialize model parameters using machine learning. A common strategy discussed in Part III of this book is to initialize a supervised model with the parameters learned by an unsupervised model trained on the same inputs. One can also perform supervised training on a related task. Even performing supervised training on an unrelated task can sometimes yield an initialization that offers faster convergence than a random initialization. Some of these initialization strategies may yield faster convergence and better generalization because they encode information about the distribution in the initial parameters of the model. Others apparently perform well primarily because they set the parameters to have the right scale or set different units to compute different functions from each other.
8.5
Algorithms with Adaptive Learning Rates
Neural network researchers have long realized that the learning rate was reliably one of the hyperparameters that is the most difficult to set because it has a significant impact on model performance. As we have discussed in Sec. 4.3 and Sec. 8.2, the cost is often highly sensitive to some directions in parameter space and insensitive to others. The momentum algorithm can mitigate these issues somewhat, but does so at the expense of introducing another hyperparameter. In the face of this, it is natural to ask if there is another way. If we believe that the directions of sensitivity are somewhat axis-aligned, it can make sense to use a separate learning rate for each parameter, and automatically adapt these learning rates throughout the course of learning.
CHAPTER 8. OPTIMIZATION FOR TRAINING DEEP MODELS
The delta-bar-delta algorithm (Jacobs, 1988) is an early heuristic approach to adapting individual learning rates for model parameters during training. The approach is based on a simple idea: if the partial derivative of the loss, with respect to a given model parameter, remains the same sign, then the learning rate should increase. If the partial derivative with respect to that parameter changes sign, then the learning rate should decrease. Of course, this kind of rule can only be applied to full batch optimization.

More recently, a number of incremental (or mini-batch-based) methods have been introduced that adapt the learning rates of model parameters. This section will briefly review a few of these algorithms.
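A minimal sketch of the sign heuristic follows. The additive and multiplicative constants are illustrative, and this omits the exponentially smoothed gradient average (the "bar-delta") that the full algorithm compares against:

```python
import numpy as np

def delta_bar_delta_step(lr, grad, prev_grad, kappa=0.01, phi=0.5):
    """Adapt per-parameter learning rates from consecutive full-batch gradients.

    Where a partial derivative keeps its sign, increase that learning rate
    additively; where the sign flips, decrease it multiplicatively.
    """
    same_sign = np.sign(grad) == np.sign(prev_grad)
    return np.where(same_sign, lr + kappa, lr * phi)

# First coordinate keeps its sign (rate grows), second flips (rate shrinks).
lr = delta_bar_delta_step(np.full(2, 0.1),
                          np.array([1.0, -1.0]),
                          np.array([0.5, 1.0]))
```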
8.5.1
AdaGrad
The AdaGrad algorithm, shown in Algorithm 8.4, individually adapts the learning rates of all model parameters by scaling them inversely proportional to the square root of the sum of all of their historical squared values (Duchi et al., 2011). The parameters with the largest partial derivative of the loss have a correspondingly rapid decrease in their learning rate, while parameters with small partial derivatives have a relatively small decrease in their learning rate. The net effect is greater progress in the more gently sloped directions of parameter space.

In the context of convex optimization, the AdaGrad algorithm enjoys some desirable theoretical properties. However, empirically it has been found that, for training deep neural network models, the accumulation of squared gradients from the beginning of training can result in a premature and excessive decrease in the effective learning rate. AdaGrad performs well for some but not all deep learning models.

8.5.2
RMSProp
The RMSProp algorithm (Hinton, 2012) modifies AdaGrad to perform better in the non-convex setting by changing the gradient accumulation into an exponentially weighted moving average. AdaGrad is designed to converge rapidly when applied to a convex function. When applied to a non-convex function to train a neural network, the learning trajectory may pass through many different structures and eventually arrive at a region that is a locally convex bowl. AdaGrad shrinks the learning rate according to the entire history of the squared gradient and may have made the learning rate too small before arriving at such a convex structure. RMSProp uses an exponentially decaying average to discard history from the
Algorithm 8.4 The AdaGrad algorithm
Require: Global learning rate ε
Require: Initial parameter θ
Require: Small constant δ, perhaps 10^-7, for numerical stability
Initialize gradient accumulation variable r = 0
while stopping criterion not met do
  Sample a minibatch of m examples from the training set {x^(1), ..., x^(m)} with corresponding targets y^(i).
  Compute gradient: g ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
  Accumulate squared gradient: r ← r + g ⊙ g
  Compute update: Δθ ← −(ε / (δ + √r)) ⊙ g. (Division and square root applied element-wise)
  Apply update: θ ← θ + Δθ
end while

extreme past so that it can converge rapidly after finding a convex bowl, as if it were an instance of the AdaGrad algorithm initialized within that bowl.

RMSProp is shown in its standard form in Algorithm 8.5 and combined with Nesterov momentum in Algorithm 8.6. Compared to AdaGrad, the use of the moving average introduces a new hyperparameter, ρ, that controls the length scale of the moving average.

Empirically, RMSProp has been shown to be an effective and practical optimization algorithm for deep neural networks. It is currently one of the go-to optimization methods being employed routinely by deep learning practitioners.
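The contrast between the two accumulation rules can be sketched in a few lines of numpy. These single-step updates follow Algorithms 8.4 and 8.5; the quadratic objective and all constants are merely illustrative:

```python
import numpy as np

def adagrad_step(theta, r, grad, eps=0.01, delta=1e-7):
    """Algorithm 8.4: accumulate the full sum of squared gradients."""
    r = r + grad * grad
    theta = theta - eps * grad / (delta + np.sqrt(r))
    return theta, r

def rmsprop_step(theta, r, grad, eps=0.01, rho=0.9, delta=1e-6):
    """Algorithm 8.5: a decaying average discards gradients from the extreme past."""
    r = rho * r + (1.0 - rho) * grad * grad
    theta = theta - eps * grad / np.sqrt(delta + r)
    return theta, r

# Illustrative quadratic bowl with one steep and one shallow direction.
A = np.diag([10.0, 1.0])
theta_a, r_a = np.array([1.0, 1.0]), np.zeros(2)
theta_r, r_r = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(500):
    theta_a, r_a = adagrad_step(theta_a, r_a, A @ theta_a)
    theta_r, r_r = rmsprop_step(theta_r, r_r, A @ theta_r)
```

Because AdaGrad's r only grows, its effective step sizes shrink monotonically, while RMSProp's decaying average lets the step size recover once gradients become small.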
8.5.3
Adam
Adam (Kingma and Ba, 2014) is yet another adaptive learning rate optimization algorithm and is presented in Algorithm 8.7. The name “Adam” derives from the phrase “adaptive moments.” In the context of the earlier algorithms, it is perhaps best seen as a variant on the combination of RMSProp and momentum with a few important distinctions. First, in Adam, momentum is incorporated directly as an estimate of the first-order moment (with exponential weighting) of the gradient. The most straightforward way to add momentum to RMSProp is to apply momentum to the rescaled gradients. The use of momentum in combination with rescaling does not have a clear theoretical motivation. Second, Adam includes bias corrections to the estimates of both the first-order moments (the momentum term) and the (uncentered) second-order moments to account for their initialization
Algorithm 8.5 The RMSProp algorithm
Require: Global learning rate ε, decay rate ρ.
Require: Initial parameter θ
Require: Small constant δ, usually 10^-6, used to stabilize division by small numbers.
Initialize accumulation variable r = 0
while stopping criterion not met do
  Sample a minibatch of m examples from the training set {x^(1), ..., x^(m)} with corresponding targets y^(i).
  Compute gradient: g ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
  Accumulate squared gradient: r ← ρr + (1 − ρ) g ⊙ g
  Compute parameter update: Δθ = −(ε / √(δ + r)) ⊙ g. (1/√(δ + r) applied element-wise)
  Apply update: θ ← θ + Δθ
end while

at the origin (see Algorithm 8.7). RMSProp also incorporates an estimate of the (uncentered) second-order moment; however, it lacks the correction factor. Thus, unlike in Adam, the RMSProp second-order moment estimate may have high bias early in training. Adam is generally regarded as being fairly robust to the choice of hyperparameters, though the learning rate sometimes needs to be changed from the suggested default.
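A sketch of one Adam update following Algorithm 8.7, showing how the bias corrections undo the shrinkage toward zero caused by initializing s and r at the origin:

```python
import numpy as np

def adam_step(theta, s, r, t, grad, eps=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """One Adam update following Algorithm 8.7."""
    t += 1
    s = rho1 * s + (1.0 - rho1) * grad           # biased first moment (momentum)
    r = rho2 * r + (1.0 - rho2) * grad * grad    # biased second moment (uncentered)
    s_hat = s / (1.0 - rho1 ** t)                # corrections for the zero
    r_hat = r / (1.0 - rho2 ** t)                # initialization of s and r
    theta = theta - eps * s_hat / (np.sqrt(r_hat) + delta)
    return theta, s, r, t

# On the very first step the correction makes s_hat equal the raw gradient,
# so the update has the intended scale instead of being damped toward zero.
theta, s, r, t = adam_step(np.zeros(2), np.zeros(2), np.zeros(2), 0,
                           np.array([1.0, -2.0]))
```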
8.5.4
Choosing the Right Optimization Algorithm
In this section, we discussed a series of related algorithms that each seek to address the challenge of optimizing deep models by adapting the learning rate for each model parameter. At this point, a natural question is: which algorithm should one choose?

Unfortunately, there is currently no consensus on this point. Schaul et al. (2014) presented a valuable comparison of a large number of optimization algorithms across a wide range of learning tasks. While the results suggest that the family of algorithms with adaptive learning rates (represented by RMSProp and AdaDelta) performed fairly robustly, no single best algorithm has emerged.

Currently, the most popular optimization algorithms actively in use include SGD, SGD with momentum, RMSProp, RMSProp with momentum, AdaDelta and Adam. The choice of which algorithm to use, at this point, seems to depend largely on the user’s familiarity with the algorithm (for ease of hyperparameter tuning).
Algorithm 8.6 RMSProp algorithm with Nesterov momentum
Require: Global learning rate ε, decay rate ρ, momentum coefficient α.
Require: Initial parameter θ, initial velocity v.
Initialize accumulation variable r = 0
while stopping criterion not met do
  Sample a minibatch of m examples from the training set {x^(1), ..., x^(m)} with corresponding targets y^(i).
  Compute interim update: θ̃ ← θ + αv
  Compute gradient: g ← (1/m) ∇_θ̃ Σ_i L(f(x^(i); θ̃), y^(i))
  Accumulate gradient: r ← ρr + (1 − ρ) g ⊙ g
  Compute velocity update: v ← αv − (ε / √r) ⊙ g. (1/√r applied element-wise)
  Apply update: θ ← θ + v
end while
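A numpy sketch of a single step of Algorithm 8.6, with the gradient evaluated at the look-ahead point θ + αv. A tiny constant is added inside the square root as a practical safeguard that the pseudocode above omits; the one-dimensional quadratic is illustrative:

```python
import numpy as np

def rmsprop_nesterov_step(theta, v, r, grad_fn, eps=0.001, rho=0.9, alpha=0.9):
    """One step of RMSProp with Nesterov momentum (Algorithm 8.6)."""
    g = grad_fn(theta + alpha * v)                # gradient at the interim update
    r = rho * r + (1.0 - rho) * g * g             # decaying squared-gradient average
    v = alpha * v - eps * g / np.sqrt(r + 1e-10)  # element-wise velocity update
    return theta + v, v, r

# Illustrative use on J(theta) = theta**2, whose gradient is 2*theta.
theta, v, r = np.array([1.0]), np.zeros(1), np.zeros(1)
for _ in range(300):
    theta, v, r = rmsprop_nesterov_step(theta, v, r, lambda th: 2.0 * th)
```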
8.6
Approximate Second-Order Methods
In this section we discuss the application of second-order methods to the training of deep networks. See LeCun et al. (1998a) for an earlier treatment of this subject. For simplicity of exposition, the only objective function we examine is the empirical risk:

J(θ) = E_{x,y∼p̂_data(x,y)} [L(f(x; θ), y)] = (1/m) Σ_{i=1}^m L(f(x^(i); θ), y^(i)).   (8.25)

However, the methods we discuss here extend readily to more general objective functions that, for instance, include parameter regularization terms such as those discussed in Chapter 7.
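Eq. 8.25 is simply an average of per-example losses. A minimal sketch, where the linear model and squared-error loss are illustrative stand-ins for f and L:

```python
import numpy as np

def empirical_risk(theta, X, Y, f, loss):
    """Eq. 8.25: average the per-example loss over the m training examples."""
    return np.mean([loss(f(x, theta), y) for x, y in zip(X, Y)])

f = lambda x, theta: x @ theta                 # illustrative model
loss = lambda yhat, y: 0.5 * (yhat - y) ** 2   # illustrative loss

X = np.array([[1.0, 0.0], [0.0, 1.0]])         # two toy examples
Y = np.array([1.0, 2.0])
risk = empirical_risk(np.zeros(2), X, Y, f, loss)
```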
8.6.1
Newton’s Method
In Sec. 4.3, we introduced second-order gradient methods. In contrast to first-order methods, second-order methods make use of second derivatives to improve optimization. The most widely used second-order method is Newton’s method. We now describe Newton’s method in more detail, with emphasis on its application to neural network training.

Newton’s method is an optimization scheme based on using a second-order Taylor series expansion to approximate J(θ) near some point θ0, ignoring derivatives
Algorithm 8.7 The Adam algorithm
Require: Step size ε (Suggested default: 0.001)
Require: Exponential decay rates for moment estimates, ρ1 and ρ2 in [0, 1). (Suggested defaults: 0.9 and 0.999 respectively)
Require: Small constant δ used for numerical stabilization. (Suggested default: 10^-8)
Require: Initial parameters θ
Initialize 1st and 2nd moment variables s = 0, r = 0
Initialize time step t = 0
while stopping criterion not met do
  Sample a minibatch of m examples from the training set {x^(1), ..., x^(m)} with corresponding targets y^(i).
  Compute gradient: g ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
  t ← t + 1
  Update biased first moment estimate: s ← ρ1 s + (1 − ρ1) g
  Update biased second moment estimate: r ← ρ2 r + (1 − ρ2) g ⊙ g
  Correct bias in first moment: ŝ ← s / (1 − ρ1^t)
  Correct bias in second moment: r̂ ← r / (1 − ρ2^t)
  Compute update: Δθ = −ε ŝ / (√r̂ + δ) (operations applied element-wise)
  Apply update: θ ← θ + Δθ
end while

of higher order:

J(θ) ≈ J(θ0) + (θ − θ0)^⊤ ∇_θ J(θ0) + (1/2)(θ − θ0)^⊤ H (θ − θ0),   (8.26)

where H is the Hessian of J with respect to θ evaluated at θ0. If we then solve for the critical point of this function, we obtain the Newton parameter update rule:

θ* = θ0 − H^{-1} ∇_θ J(θ0)   (8.27)

Thus for a locally quadratic function (with positive definite H), by rescaling the gradient by H^{-1}, Newton’s method jumps directly to the minimum. If the objective function is convex but not quadratic (there are higher-order terms), this update can be iterated, yielding the training algorithm associated with Newton’s method, given in Algorithm 8.8.

For surfaces that are not quadratic, as long as the Hessian remains positive definite, Newton’s method can be applied iteratively. This implies a two-step
Algorithm 8.8 Newton’s method with objective J(θ) = (1/m) Σ_{i=1}^m L(f(x^(i); θ), y^(i)).
Require: Initial parameter θ0
Require: Training set of m examples
while stopping criterion not met do
  Compute gradient: g ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
  Compute Hessian: H ← (1/m) ∇²_θ Σ_i L(f(x^(i); θ), y^(i))
  Compute Hessian inverse: H^{-1}
  Compute update: Δθ = −H^{-1} g
  Apply update: θ = θ + Δθ
end while

iterative procedure. First, update or compute the inverse Hessian (i.e. by updating the quadratic approximation). Second, update the parameters according to Eq. 8.27.

In Sec. 8.2.3, we discussed how Newton’s method is appropriate only when the Hessian is positive definite. In deep learning, the surface of the objective function is typically non-convex with many features, such as saddle points, that are problematic for Newton’s method.
If the eigenvalues of the Hessian are not all positive, for example, near a saddle point, then Newton’s method can actually cause updates to move in the wrong direction. This situation can be avoided by regularizing the Hessian. Common regularization strategies include adding a constant, α, along the diagonal of the Hessian. The regularized update becomes

θ* = θ0 − [H(f(θ0)) + αI]^{-1} ∇_θ f(θ0).   (8.28)

This regularization strategy is used in approximations to Newton’s method, such as the Levenberg–Marquardt algorithm (Levenberg, 1944; Marquardt, 1963), and works fairly well as long as the negative eigenvalues of the Hessian are still relatively close to zero. In cases where there are more extreme directions of curvature, the value of α would have to be sufficiently large to offset the negative eigenvalues. However, as α increases in size, the Hessian becomes dominated by the αI diagonal and the direction chosen by Newton’s method converges to the standard gradient divided by α. When strong negative curvature is present, α may need to be so large that Newton’s method would make smaller steps than gradient descent with a properly chosen learning rate.

Beyond the challenges created by certain features of the objective function, such as saddle points, the application of Newton’s method for training large neural networks is limited by the significant computational burden it imposes. The
number of elements in the Hessian is squared in the number of parameters, so with k parameters (and for even very small neural networks the number of parameters k can be in the millions), Newton’s method would require the inversion of a k × k matrix, with computational complexity of O(k³). Also, since the parameters will change with every update, the inverse Hessian has to be computed at every training iteration. As a consequence, only networks with a very small number of parameters can be practically trained via Newton’s method. In the remainder of this section, we will discuss alternatives that attempt to gain some of the advantages of Newton’s method while side-stepping the computational hurdles.
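A sketch of one damped Newton update in the form of Eq. 8.28, using a linear solve rather than an explicit inverse (the same O(k³) cost, but a numerically preferable implementation choice); the quadratic objective is illustrative:

```python
import numpy as np

def newton_step(theta, grad, hessian, alpha=0.0):
    """Damped Newton update (Eq. 8.28): solve (H + alpha*I) d = g
    instead of forming the inverse explicitly."""
    H_reg = hessian + alpha * np.eye(len(theta))
    return theta - np.linalg.solve(H_reg, grad)

# Illustrative quadratic J(theta) = 0.5*theta@A@theta - b@theta with
# positive definite A: a single undamped Newton step lands on the minimizer.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
theta = np.zeros(2)
theta = newton_step(theta, A @ theta - b, A)
```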
8.6.2
Conjugate Gradients
Conjugate gradients is a method to efficiently avoid the calculation of the inverse Hessian by iteratively descending conjugate directions. The inspiration for this approach follows from a careful study of the weakness of the method of steepest descent (see Sec. 4.3 for details), where line searches are applied iteratively in the direction associated with the gradient. Fig. 8.6 illustrates how the method of steepest descent, when applied in a quadratic bowl, progresses in a rather ineffective back-and-forth, zig-zag pattern. This happens because each line search direction, when given by the gradient, is guaranteed to be orthogonal to the previous line search direction.
Since the gradient at this point defines the current search direction, d_t = ∇_θ J(θ) will have no contribution in the direction d_{t−1}. Thus d_t is orthogonal to d_{t−1}. This relationship between d_{t−1} and d_t is illustrated in Fig. 8.6 for multiple iterations of steepest descent. As demonstrated in the figure, the choice of orthogonal directions of descent does not preserve the minimum along the previous search directions. This gives rise to the zig-zag pattern of progress, where by descending to the minimum in the current gradient direction, we must re-minimize the objective in the previous gradient direction. Thus, by following the gradient at the end of each line search we are, in a sense, undoing progress we have already made in the direction of the previous line search. The method of conjugate gradients seeks to address this problem.

In the method of conjugate gradients, we seek to find a search direction that is conjugate to the previous line search direction, i.e.
it will not undo progress made in that direction. At training iteration t, the next search direction d_t takes the form:

    d_t = ∇_θ J(θ) + β_t d_{t−1}   (8.29)
CHAPTER 8. OPTIMIZATION FOR TRAINING DEEP MODELS
Figure 8.6: The method of steepest descent applied to a quadratic cost surface. The method of steepest descent involves jumping to the point of lowest cost along the line defined by the gradient at the initial point on each step. This resolves some of the problems seen with using a fixed learning rate in Fig. 4.6, but even with the optimal step size the algorithm still makes back-and-forth progress toward the optimum. By definition, at the minimum of the objective along a given direction, the gradient at the final point is orthogonal to that direction.
where β_t is a coefficient whose magnitude controls how much of the direction d_{t−1} we should add back to the current search direction. Two directions, d_t and d_{t−1}, are defined as conjugate if

    d_t^⊤ H(J) d_{t−1} = 0,   (8.30)

where H(J) is the Hessian of J.
The straightforward way to impose conjugacy would involve calculation of the eigenvectors of H to choose β_t, which would not satisfy our goal of developing a method that is more computationally viable than Newton's method for large problems. Can we calculate the conjugate directions without resorting to these calculations? Fortunately the answer to that is yes.

Two popular methods for computing the β_t are:

1. Fletcher-Reeves:

    β_t = [∇_θ J(θ_t)^⊤ ∇_θ J(θ_t)] / [∇_θ J(θ_{t−1})^⊤ ∇_θ J(θ_{t−1})]   (8.31)

2. Polak-Ribière:

    β_t = [(∇_θ J(θ_t) − ∇_θ J(θ_{t−1}))^⊤ ∇_θ J(θ_t)] / [∇_θ J(θ_{t−1})^⊤ ∇_θ J(θ_{t−1})]   (8.32)
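Both rules are computed from gradients alone, with no Hessian in sight. A minimal sketch, where the two small vectors are arbitrary stand-ins for ∇_θ J(θ_{t−1}) and ∇_θ J(θ_t):

```python
import numpy as np

def beta_fletcher_reeves(g, g_prev):
    # Eq. 8.31: ratio of squared gradient norms.
    return (g @ g) / (g_prev @ g_prev)

def beta_polak_ribiere(g, g_prev):
    # Eq. 8.32: uses the change in gradient.  On a quadratic cost with
    # exact line searches the two formulas give the same value.
    return ((g - g_prev) @ g) / (g_prev @ g_prev)

g_prev = np.array([1.0, -2.0, 0.5])   # stand-in for the previous gradient
g = np.array([0.5, 1.0, -1.0])        # stand-in for the current gradient
print(beta_fletcher_reeves(g, g_prev))   # 2.25 / 5.25
print(beta_polak_ribiere(g, g_prev))     # 4.25 / 5.25
```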
For a quadratic surface, the conjugate directions ensure that the gradient along the previous direction does not increase in magnitude. We therefore stay at the minimum along the previous directions. As a consequence, in a k-dimensional parameter space, conjugate gradients requires only k line searches to achieve the minimum. The conjugate gradient algorithm is given in Algorithm 8.9.

Algorithm 8.9 Conjugate gradient method
Require: Initial parameters θ_0
Require: Training set of m examples
  Initialize ρ_0 = 0
  Initialize g_0 = 0
  Initialize t = 1
  while stopping criterion not met do
    Initialize the gradient g_t = 0
    Compute gradient: g_t ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
    Compute β_t = [(g_t − g_{t−1})^⊤ g_t] / [g_{t−1}^⊤ g_{t−1}]   (Polak-Ribière)
    (Nonlinear conjugate gradient: optionally reset β_t to zero, for example if t is a multiple of some constant k, such as k = 5)
    Compute search direction: ρ_t = −g_t + β_t ρ_{t−1}
    Perform line search to find: ε* = argmin_ε (1/m) Σ_{i=1}^m L(f(x^(i); θ_t + ε ρ_t), y^(i))
    (On a truly quadratic cost function, analytically solve for ε* rather than explicitly searching for it)
    Apply update: θ_{t+1} = θ_t + ε* ρ_t
    t ← t + 1
  end while
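To make Algorithm 8.9 concrete, here is a minimal NumPy sketch on a small quadratic cost. The particular cost, the closed-form step size, and the restart constant of 5 are illustrative choices for this example, not part of the algorithm:

```python
import numpy as np

# Illustrative quadratic cost f(theta) = 0.5 theta^T A theta - b^T theta,
# with gradient A theta - b and known minimizer A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])

def grad(theta):
    return A @ theta - b

theta = np.array([4.0, -3.0])
g_prev = np.zeros_like(theta)
rho = np.zeros_like(theta)

for t in range(1, 20):
    g = grad(theta)
    if np.linalg.norm(g) < 1e-10:
        break
    # Polak-Ribiere coefficient, reset to zero on the first step and
    # every 5th step (the optional nonlinear-CG restart).
    beta = 0.0 if t == 1 or t % 5 == 0 else ((g - g_prev) @ g) / (g_prev @ g_prev)
    rho = -g + beta * rho                    # search direction
    # On a quadratic cost the optimal step size is available in closed
    # form, standing in for the explicit line search.
    eps = -(g @ rho) / (rho @ A @ rho)
    theta = theta + eps * rho
    g_prev = g

print(theta)  # approaches the minimizer A^{-1} b
```

With exact steps, conjugate gradients reaches the minimum of this 2-dimensional quadratic in two line searches, matching the k-line-searches claim above.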
Nonlinear Conjugate Gradients: So far we have discussed the method of conjugate gradients as it is applied to quadratic objective functions. Of course, our primary interest in this chapter is to explore optimization methods for training neural networks and other related deep learning models where the corresponding objective function is far from quadratic. Perhaps surprisingly, the method of conjugate gradients is still applicable in this setting, though with some modification. Without any assurance that the objective is quadratic, the conjugate directions are no longer assured to remain at the minimum of the objective for previous directions. As a result, the nonlinear conjugate gradients algorithm includes occasional resets where the method of conjugate gradients is restarted with line search along the unaltered gradient.

Practitioners report reasonable results in applications of the nonlinear conjugate
gradients algorithm to training neural networks, though it is often beneficial to initialize the optimization with a few iterations of stochastic gradient descent before commencing nonlinear conjugate gradients. Also, while the (nonlinear) conjugate gradients algorithm has traditionally been cast as a batch method, minibatch versions have been used successfully for the training of neural networks (Le et al., 2011). Adaptations of conjugate gradients specifically for neural networks have been proposed earlier, such as the scaled conjugate gradients algorithm (Moller, 1993).
8.6.3
BFGS
The Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm attempts to bring some of the advantages of Newton's method without the computational burden. In that respect, BFGS is similar to CG. However, BFGS takes a more direct approach to the approximation of Newton's update. Recall that Newton's update is given by

    θ* = θ_0 − H^{−1} ∇_θ J(θ_0),   (8.33)

where H is the Hessian of J with respect to θ evaluated at θ_0. The primary computational difficulty in applying Newton's update is the calculation of the inverse Hessian H^{−1}. The approach adopted by quasi-Newton methods (of which the BFGS algorithm is the most prominent) is to approximate the inverse with a matrix M_t that is iteratively refined by low rank updates to become a better approximation of H^{−1}.

From Newton's update, in Eq. 8.33, we can see that the parameters at learning
steps t and t + 1 are related via the secant condition (also known as the quasi-Newton condition):

    θ_{t+1} − θ_t = H_t^{−1} (∇_θ J(θ_{t+1}) − ∇_θ J(θ_t))   (8.34)

Eq. 8.34 holds precisely in the quadratic case, or approximately otherwise. The approximation to the Hessian inverse used in the BFGS procedure is constructed so as to satisfy this condition, with M in place of H^{−1}. Specifically, M_t is updated according to:

    M_t = M_{t−1} + (1 + (φ^⊤ M_{t−1} φ)/(∆^⊤ φ)) (∆∆^⊤)/(∆^⊤ φ) − (∆φ^⊤ M_{t−1} + M_{t−1} φ∆^⊤)/(∆^⊤ φ),   (8.35)

where g_t = ∇_θ J(θ_t), φ = g_t − g_{t−1}, and ∆ = θ_t − θ_{t−1}. Eq. 8.35 shows that the BFGS procedure iteratively refines the approximation of the inverse of the Hessian with updates of rank one. This means that if θ ∈ R^n, then the computational
complexity of the update is O(n²). The derivation of the BFGS approximation is given in many textbooks on optimization, including Luenberger (1984).

Once the inverse Hessian approximation M_t is updated, the direction of descent ρ_t is determined by ρ_t = −M_t g_t. A line search is performed in this direction to determine the size of the step, ε*, taken in this direction. The final update to the parameters is given by:

    θ_{t+1} = θ_t + ε* ρ_t.   (8.36)

The complete BFGS algorithm is presented in Algorithm 8.10.
Algorithm 8.10 BFGS method
Require: Initial parameters θ_0
  Initialize inverse Hessian M_0 = I
  while stopping criterion not met do
    Compute gradient: g_t = ∇_θ J(θ_t)
    Compute φ = g_t − g_{t−1}, ∆ = θ_t − θ_{t−1}
    Approx H^{−1}: M_t = M_{t−1} + (1 + (φ^⊤ M_{t−1} φ)/(∆^⊤ φ)) (∆∆^⊤)/(∆^⊤ φ) − (∆φ^⊤ M_{t−1} + M_{t−1} φ∆^⊤)/(∆^⊤ φ)
    Compute search direction: ρ_t = −M_t g_t
    Perform line search to find: ε* = argmin_ε J(θ_t + ε ρ_t)
    Apply update: θ_{t+1} = θ_t + ε* ρ_t
  end while

Like the method of conjugate gradients, the BFGS algorithm iterates a series of line searches with the direction incorporating second-order information. However, unlike conjugate gradients, the success of the approach is not heavily dependent on the line search finding a point very close to the true minimum along the line. Thus, relative to conjugate gradients, BFGS has the advantage that it can spend less time refining each line search.
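Algorithm 8.10 can be sketched in NumPy as follows. The quadratic test cost and the simple backtracking line search are illustrative assumptions made for this example; the update of M follows Eq. 8.35:

```python
import numpy as np

# Illustrative quadratic cost J(theta) = 0.5 theta^T A theta - b^T theta,
# chosen so the exact minimizer A^{-1} b is known.
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 0.0, 1.0])

def J(theta):
    return 0.5 * theta @ A @ theta - b @ theta

def grad(theta):
    return A @ theta - b

theta = np.array([2.0, 2.0, 2.0])
M = np.eye(3)                        # inverse Hessian approximation M_0 = I
g = grad(theta)

for _ in range(50):
    if np.linalg.norm(g) < 1e-8:
        break
    rho = -M @ g                     # search direction
    # Backtracking line search: a crude stand-in for the argmin over the
    # step size in Algorithm 8.10.
    eps = 1.0
    while J(theta + eps * rho) > J(theta) + 1e-4 * eps * (g @ rho):
        eps *= 0.5
    theta_new = theta + eps * rho
    g_new = grad(theta_new)
    phi, delta = g_new - g, theta_new - theta
    s = delta @ phi                  # the scalar Delta^T phi of Eq. 8.35
    # Rank-one refinement of the inverse-Hessian approximation (Eq. 8.35);
    # M stays symmetric, so phi^T M equals (M phi)^T.
    M = (M
         + (1 + phi @ M @ phi / s) * np.outer(delta, delta) / s
         - (np.outer(delta, M @ phi) + np.outer(M @ phi, delta)) / s)
    theta, g = theta_new, g_new

print(theta)  # approaches the exact minimizer of J
```

A useful property to check is that the updated M satisfies the secant condition M_t φ = ∆ by construction, which is exactly what Eq. 8.35 is built to guarantee.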
On the other hand, the BFGS algorithm must store the inverse Hessian matrix, M, which requires O(n²) memory, making BFGS impractical for most modern deep learning models that typically have millions of parameters.

Limited Memory BFGS (or L-BFGS) The memory costs of the BFGS algorithm can be significantly decreased by avoiding storing the complete inverse Hessian approximation M. Alternatively, by replacing the M_{t−1} in Eq. 8.35 with an identity matrix, the BFGS search direction update formula becomes:

    ρ_t = −g_t + a∆ + bφ,   (8.37)
where the scalars a and b are given by:

    a = −(1 + (φ^⊤ φ)/(∆^⊤ φ)) (∆^⊤ g_t)/(∆^⊤ φ) + (φ^⊤ g_t)/(∆^⊤ φ)   (8.38)

    b = (∆^⊤ g_t)/(∆^⊤ φ)   (8.39)

with φ and ∆ as defined above. If used with exact line searches, the directions defined by Eq. 8.37 are mutually conjugate. However, unlike the method of conjugate gradients, this procedure remains well behaved when the minimum of the line search is reached only approximately. This strategy can be generalized to include more information about the Hessian by storing previous values of φ and ∆.
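As a consistency check on this memory-free form, the direction −g_t + a∆ + bφ can be compared against −M_t g_t with M_t built from Eq. 8.35 using M_{t−1} = I; the two agree exactly. The vectors below are arbitrary test data, not taken from any particular optimization run:

```python
import numpy as np

g = np.array([1.0, -2.0, 0.5, 3.0, -1.0, 2.0])      # current gradient g_t
phi = np.array([0.5, 1.0, -1.0, 2.0, 0.0, -0.5])    # phi = g_t - g_{t-1}
delta = np.array([1.0, 0.0, 2.0, -1.0, 1.0, 0.5])   # delta = theta_t - theta_{t-1}
s = delta @ phi                                      # the scalar Delta^T phi

# Scalars of Eqs. 8.38-8.39.
a = -(1 + phi @ phi / s) * (delta @ g) / s + (phi @ g) / s
b = (delta @ g) / s

# Direction computed without storing any matrix (Eq. 8.37).
rho_memoryless = -g + a * delta + b * phi

# Same direction via Eq. 8.35 with M_{t-1} = I, at the cost of an
# explicit n x n matrix.
n = len(g)
M = (np.eye(n)
     + (1 + phi @ phi / s) * np.outer(delta, delta) / s
     - (np.outer(delta, phi) + np.outer(phi, delta)) / s)
rho_matrix = -M @ g

print(np.max(np.abs(rho_memoryless - rho_matrix)))  # ~0
```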
8.7
Optimization Strategies and Meta-Algorithms
Many optimization techniques are not exactly algorithms, but rather general templates that can be specialized to yield algorithms, or subroutines that can be incorporated into many different algorithms.
8.7.1
Batch Normalization
Batch normalization (Ioffe and Szegedy, 2015) is one of the most exciting recent innovations in optimizing deep neural networks and it is actually not an optimization algorithm at all. Instead, it is a method of adaptive reparametrization, motivated by the difficulty of training very deep models.

Very deep models involve the composition of several functions or layers. The gradient tells how to update each parameter, under the assumption that the other layers do not change. In practice, we update all of the layers simultaneously. When we make the update, unexpected results can happen because many functions composed together are changed simultaneously, using updates that were computed under the assumption that the other functions remain constant.
As a simple example, suppose we have a deep neural network that has only one unit per layer and does not use an activation function at each hidden layer: ŷ = x w_1 w_2 w_3 … w_l. Here, w_i provides the weight used by layer i. The output of layer i is h_i = h_{i−1} w_i. The output ŷ is a linear function of the input x, but a nonlinear function of the weights w_i. Suppose our cost function has put a gradient of 1 on ŷ, so we wish to decrease ŷ slightly. The back-propagation algorithm can then compute a gradient g = ∇_w ŷ. Consider what happens when we make an update w ← w − εg. The
first-order Taylor series approximation of ŷ predicts that the value of ŷ will decrease by εg^⊤g. If we wanted to decrease ŷ by .1, this first-order information available in the gradient suggests we could set the learning rate ε to .1/(g^⊤g). However, the actual update will include second-order and third-order effects, on up to effects of order l. The new value of ŷ is given by

    x(w_1 − εg_1)(w_2 − εg_2) … (w_l − εg_l).   (8.40)

An example of one second-order term arising from this update is ε²g_1 g_2 ∏_{i=3}^l w_i. This term might be negligible if ∏_{i=3}^l w_i is small, or might be exponentially large if the weights on layers 3 through l are greater than 1. This makes it very hard to choose an appropriate learning rate, because the effects of an update to the parameters for one layer depend so strongly on all of the other layers.
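This interaction can be observed numerically in a small version of the linear example. The depth l = 10, the weight value 1.1, and the two learning rates below are arbitrary illustrative choices: for the small learning rate the first-order prediction εg^⊤g is accurate, while for the larger one it badly overestimates the decrease in ŷ.

```python
import numpy as np

l = 10
x = 1.0
w = np.full(l, 1.1)               # weights of the deep linear "network"

def y_hat(w):
    return x * np.prod(w)

# Gradient of y_hat with respect to each weight w_i: the product of all
# the other factors, i.e. y_hat / w_i here since every w_i is nonzero.
g = np.array([y_hat(w) / w[i] for i in range(l)])

for eps in (1e-4, 0.05):
    predicted_decrease = eps * g @ g                  # first-order Taylor prediction
    actual_decrease = y_hat(w) - y_hat(w - eps * g)   # true effect of the joint update
    print(eps, predicted_decrease, actual_decrease)
```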
Second-order optimization algorithms address this issue by computing an update that takes these second-order interactions into account, but we can see that in very deep networks, even higher-order interactions can be significant. Even second-order optimization algorithms are expensive and usually require numerous approximations that prevent them from truly accounting for all significant second-order interactions. Building an n-th order optimization algorithm for n > 2 thus seems hopeless. What can we do instead?

Batch normalization provides an elegant way of reparametrizing almost any deep network. The reparametrization significantly reduces the problem of coordinating updates across many layers.
Batch normalization can be applied to any input or hidden layer in a network. Let H be a minibatch of activations of the layer to normalize, arranged as a design matrix, with the activations for each example appearing in a row of the matrix. To normalize H, we replace it with

    H′ = (H − µ)/σ,   (8.41)

where µ is a vector containing the mean of each unit and σ is a vector containing the standard deviation of each unit. The arithmetic here is based on broadcasting the vector µ and the vector σ to be applied to every row of the matrix H. Within each row, the arithmetic is element-wise, so H_{i,j} is normalized by subtracting µ_j and dividing by σ_j. The rest of the network then operates on H′ in exactly the same way that the original network operated on H.

At training time,

    µ = (1/m) Σ_i H_{i,:}   (8.42)
and
    σ = √( δ + (1/m) Σ_i (H − µ)²_i ),   (8.43)

where δ is a small positive value such as 10⁻⁸ imposed to avoid encountering the undefined gradient of √z at z = 0. Crucially, we back-propagate through these operations for computing the mean and the standard deviation, and for applying them to normalize H. This means that the gradient will never propose an operation that acts simply to increase the standard deviation or mean of h_i; the normalization operations remove the effect of such an action and zero out its component in the gradient. This was a major innovation of the batch normalization approach. Previous approaches had involved adding penalties to the cost function to encourage units to have normalized activation statistics or involved intervening to renormalize unit statistics after each gradient descent step.
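The train-time computation of Eqs. 8.41 to 8.43 takes only a few lines of NumPy; the minibatch below is random illustrative data:

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.standard_normal((64, 5)) * 3.0 + 7.0   # minibatch: 64 examples, 5 units

delta = 1e-8                                   # small positive constant of Eq. 8.43
mu = H.mean(axis=0)                            # per-unit mean (Eq. 8.42)
sigma = np.sqrt(delta + ((H - mu) ** 2).mean(axis=0))  # per-unit std (Eq. 8.43)
H_prime = (H - mu) / sigma                     # broadcast over rows (Eq. 8.41)

print(H_prime.mean(axis=0))  # ~0 for every unit
print(H_prime.std(axis=0))   # ~1 for every unit
```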
The former approach usually resulted in imperfect normalization and the latter usually resulted in significant wasted time as the learning algorithm repeatedly proposed changing the mean and variance and the normalization step repeatedly undid this change. Batch normalization reparametrizes the model to make some units always be standardized by definition, deftly sidestepping both problems.

At test time, µ and σ may be replaced by running averages that were collected during training time. This allows the model to be evaluated on a single example, without needing to use definitions of µ and σ that depend on an entire minibatch.

Revisiting the ŷ = x w_1 w_2 … w_l example, we see that we can mostly resolve the difficulties in learning this model by normalizing h_{l−1}. Suppose that x is drawn from a unit Gaussian. Then h_{l−1} will also come from a Gaussian, because the transformation from x to h_{l−1} is linear. However, h_{l−1} will no longer have zero mean and unit variance.
However, h will noerties. longer F hav zero mean ˆ h properties. or ealmost any l−1 that restores the zero mean and unit variance prop and unit v ariance. After applying batc h normalization, we obtain the normalized ˆ up update date to the lo low wer lay layers, ers, hl−1 will remain a unit Gaussian. The output yˆ ma may y ˆ h that restores zero linear mean function and unityˆv= ariance erties. in For almost any ˆ l−1prop wl h then b e learned as the a simple . Learning this mo is model del ˆ yˆ ma up date the low er layers, will remain a unit The do output y no now w verytosimple because thehparameters at the lo lower werGaussian. la layers yers simply not hav have e an ˆ = w h . Learning then bin e learned as a simple linear is function in this model In is effect most cases; their output alw alwa aysyˆrenormalized to a unit Gaussian. no w very simple b ecause the parameters at the lo wer la yers simply do not hav e an some corner cases, the low lower er lay layers ers can hav havee an effect. Changing one of the lo low wer effect in most cases; their output is alw a ys renormalized to a unit Gaussian. In la lay yer weigh weights ts to 0 can mak makee the output become degenerate, and changing the sign some lowtsercan layers e an effect.betw Changing of ythe lower h l−1one of onecorner of thecases, low lower erthe weigh weights flipcan thehav relationship etween een ˆ and . These layer weighare ts to 0 can mak e the output become degenerate, andup changing the ha sign situations very rare. Without normalization, nearly every update date would hav ve ˆ h y of one of the low er weigh ts can flip the relationship b etw een and . These an extreme effect on the statistics of hl−1. Batc Batch h normalization has thus made situations are v ery rare. Without normalization, nearly every date havofe this mo model del significan significantly tly easier to learn. In this example, the up ease of would learning an extreme theofstatistics of hlo . 
Batc h normalization thus made course came effect at theon cost making the low wer lay layers ers useless. In our has linear example, this model significantly easier to learn. In this example, the ease of learning of course came at the cost of making the lo wer layers useless. In our linear example, 320 σ=
δ+
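As a concrete sketch of the normalization above (Eq. 8.43), the following code standardizes a minibatch of activations. The function name, the minibatch shape, and the use of numpy are illustrative choices of ours, not taken from any particular library.

```python
import numpy as np

def batch_normalize(H, delta=1e-8):
    """Standardize each unit of a minibatch H (rows are examples).

    delta plays the role of the small constant in Eq. 8.43 that keeps
    the square root differentiable when the variance is zero.
    """
    mu = H.mean(axis=0)
    sigma = np.sqrt(delta + ((H - mu) ** 2).mean(axis=0))
    return (H - mu) / sigma

rng = np.random.default_rng(0)
H = 5.0 + 3.0 * rng.standard_normal((64, 10))  # activations with arbitrary statistics
H_hat = batch_normalize(H)
# After normalization, each unit has approximately zero mean and unit
# standard deviation across the minibatch.
```

In a full implementation, µ and σ would also be accumulated into running averages during training for use at test time, as described above.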
CHAPTER 8. OPTIMIZATION FOR TRAINING DEEP MODELS
In our linear example, the lower layers no longer have any harmful effect, but they also no longer have any beneficial effect. This is because we have normalized out the first and second order statistics, which is all that a linear network can influence. In a deep neural network with nonlinear activation functions, the lower layers can perform nonlinear transformations of the data, so they remain useful. Batch normalization acts to standardize only the mean and variance of each unit in order to stabilize learning, but allows the relationships between units and the nonlinear statistics of a single unit to change.

Because the final layer of the network is able to learn a linear transformation, we may actually wish to remove all linear relationships between units within a layer. Indeed, this is the approach taken by Desjardins et al. (2015), who provided the inspiration for batch normalization. Unfortunately, eliminating all linear interactions is much more expensive than standardizing the mean and standard deviation of each individual unit, and so far batch normalization remains the most practical approach.

Normalizing the mean and standard deviation of a unit can reduce the expressive power of the neural network containing that unit. In order to maintain the expressive power of the network, it is common to replace the batch of hidden unit activations H with γH′ + β rather than simply the normalized H′. The variables γ and β are learned parameters that allow the new variable to have any mean and standard deviation. At first glance, this may seem useless; why did we set the mean to 0, and then introduce a parameter that allows it to be set back to any arbitrary value β? The answer is that the new parametrization can represent the same family of functions of the input as the old parametrization, but the new parametrization has different learning dynamics. In the old parametrization, the mean of H was determined by a complicated interaction between the parameters in the layers below H. In the new parametrization, the mean of γH′ + β is determined solely by β. The new parametrization is much easier to learn with gradient descent.

Most neural network layers take the form φ(XW + b), where φ is some fixed nonlinear activation function such as the rectified linear transformation. It is natural to wonder whether we should apply batch normalization to the input X, or to the transformed value XW + b. Ioffe and Szegedy (2015) recommend the latter. More specifically, XW + b should be replaced by a normalized version of XW. The bias term should be omitted because it becomes redundant with the β parameter applied by the batch normalization reparametrization.
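A small numerical illustration of the γ, β reparametrization described above; the variable names and the particular values of γ and β are chosen arbitrarily for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)
H = rng.standard_normal((256, 4)) * 7.0 - 2.0   # activations with arbitrary statistics

# Standardize, as in batch normalization.
mu = H.mean(axis=0)
sigma = np.sqrt(1e-8 + ((H - mu) ** 2).mean(axis=0))
H_norm = (H - mu) / sigma

# Reintroduce a learnable mean and scale.
gamma, beta = 2.5, -1.0
H_new = gamma * H_norm + beta

# The mean of gamma * H_norm + beta is determined solely by beta,
# regardless of the statistics produced by the layers below.
print(np.allclose(H_new.mean(axis=0), beta, atol=1e-6))  # prints True
```

This also makes concrete why a bias term b would be redundant: any desired shift of the normalized activations is already supplied by β.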
The bias term should b e because it becomes redundant to a lay layer er is usually the output of omitted a nonlinear activ activation ation function such aswith the the β parameter applied by batch normalization reparametrization. rectified linear function in athe previous lay layer. er. The statistics of the inputThe areinput thus to a layer is usually the output of a nonlinear activation function such as the rectified linear function in a previous layer. The statistics of the input are thus 321
more non-Gaussian and less amenable to standardization by linear operations.

In convolutional networks, described in Chapter 9, it is important to apply the same normalizing µ and σ at every spatial location within a feature map, so that the statistics of the feature map remain the same regardless of spatial location.
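For instance, assuming feature maps stored as a (batch, height, width, channels) array (the layout is our assumption; frameworks differ), the statistics are pooled over the batch and both spatial axes, giving one µ and one σ per feature map:

```python
import numpy as np

rng = np.random.default_rng(0)
F = 2.0 + 4.0 * rng.standard_normal((8, 5, 5, 3))  # (batch, height, width, channels)

# One mu and sigma per feature map (channel), shared across all
# spatial locations, so normalized statistics do not depend on location.
mu = F.mean(axis=(0, 1, 2))
sigma = np.sqrt(1e-8 + F.var(axis=(0, 1, 2)))
F_hat = (F - mu) / sigma
```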
8.7.2 Coordinate Descent

In some cases, it may be possible to solve an optimization problem quickly by breaking it into separate pieces. If we minimize f(x) with respect to a single variable x_i, then minimize it with respect to another variable x_j and so on, repeatedly cycling through all variables, we are guaranteed to arrive at a (local) minimum. This practice is known as coordinate descent, because we optimize one coordinate at a time. More generally, block coordinate descent refers to minimizing with respect to a subset of the variables simultaneously. The term "coordinate descent" is often used to refer to block coordinate descent as well as the strictly individual coordinate descent.
Coordinate descent makes the most sense when the different variables in the optimization problem can be clearly separated into groups that play relatively isolated roles, or when optimization with respect to one group of variables is significantly more efficient than optimization with respect to all of the variables. For example, consider the cost function

J(H, W) = Σ_{i,j} |H_{i,j}| + Σ_{i,j} (X − W⊤H)²_{i,j}.    (8.44)

This function describes a learning problem called sparse coding, where the goal is to find a weight matrix W that can linearly decode a matrix of activation values H to reconstruct the training set X. Most applications of sparse coding also involve weight decay or a constraint on the norms of the columns of W, in order to prevent the pathological solution with extremely small H and large W.

The function J is not convex. However, we can divide the inputs to the training algorithm into two sets: the dictionary parameters W and the code representations H. Minimizing the objective function with respect to either one of these sets of variables is a convex problem. Block coordinate descent thus gives us an optimization strategy that allows us to use efficient convex optimization algorithms, by alternating between optimizing W with H fixed, then optimizing H with W fixed.
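This alternation can be sketched in a few lines. To keep each block update in closed form, the toy version below drops the |H| sparsity penalty of Eq. 8.44 and uses an H W layout for the reconstruction term, so it illustrates the block coordinate descent strategy itself rather than a complete sparse coding solver.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 30))   # data to reconstruct
H = rng.standard_normal((20, 5))    # code representations
W = rng.standard_normal((5, 30))    # dictionary

def reconstruction_loss(H, W):
    return np.sum((X - H @ W) ** 2)

losses = [reconstruction_loss(H, W)]
for _ in range(10):
    # Minimize over W with H fixed: a convex least-squares subproblem.
    W = np.linalg.lstsq(H, X, rcond=None)[0]
    # Minimize over H with W fixed: another convex least-squares subproblem.
    H = np.linalg.lstsq(W.T, X.T, rcond=None)[0].T
    losses.append(reconstruction_loss(H, W))

# Each exact block minimization can only decrease the objective.
```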
Coordinate descent is not a very good strategy when the value of one variable strongly influences the optimal value of another variable, as in the function f(x) =
(x₁ − x₂)² + α(x₁² + x₂²), where α is a positive constant. The first term encourages the two variables to have similar value, while the second term encourages them to be near zero. The solution is to set both to zero. Newton's method can solve the problem in a single step because it is a positive definite quadratic problem. However, for small α, coordinate descent will make very slow progress because the first term does not allow a single variable to be changed to a value that differs significantly from the current value of the other variable.
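The slow progress is easy to see numerically. Setting ∂f/∂x₁ = 0 with x₂ fixed gives x₁ = x₂/(1 + α), and symmetrically for x₂, so each exact coordinate update shrinks the iterate by only a factor of 1/(1 + α):

```python
def f(x1, x2, alpha):
    return (x1 - x2) ** 2 + alpha * (x1 ** 2 + x2 ** 2)

alpha = 1e-3
x1 = x2 = 1.0
for _ in range(100):
    x1 = x2 / (1 + alpha)  # exact minimization over x1 with x2 fixed
    x2 = x1 / (1 + alpha)  # exact minimization over x2 with x1 fixed

# After 100 full cycles, the iterate is still far from the optimum (0, 0),
# while Newton's method would jump there in a single step.
print(x1)  # prints about 0.82
```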
8.7.3 Polyak Averaging

Polyak averaging (Polyak and Juditsky, 1992) consists of averaging together several points in the trajectory through parameter space visited by an optimization algorithm. If t iterations of gradient descent visit points θ^(1), …, θ^(t), then the output of the Polyak averaging algorithm is θ̂^(t) = (1/t) Σ_i θ^(i). On some problem classes, such as gradient descent applied to convex problems, this approach has strong convergence guarantees. When applied to neural networks, its justification is more heuristic, but it performs well in practice. The basic idea is that the optimization algorithm may leap back and forth across a valley several times without ever visiting a point near the bottom of the valley. The average of all of the locations on either side should be close to the bottom of the valley though.

In non-convex problems, the path taken by the optimization trajectory can be very complicated and visit many different regions. Including points in parameter space from the distant past that may be separated from the current point by large barriers in the cost function does not seem like a useful behavior. As a result, when applying Polyak averaging to non-convex problems, it is typical to use an exponentially decaying running average:

θ̂^(t) = α θ̂^(t−1) + (1 − α) θ^(t).    (8.45)

The running average approach is used in numerous applications. See Szegedy et al. (2015) for a recent example.
8.7.4 Supervised Pretraining

Sometimes, directly training a model to solve a specific task can be too ambitious if the model is complex and hard to optimize or if the task is very difficult. It is sometimes more effective to train a simpler model to solve the task, then make the model more complex. It can also be more effective to train the model to solve a simpler task, then move on to confront the final task. These strategies that involve
training simple models on simple tasks before confronting the challenge of training the desired model to perform the desired task are collectively known as pretraining.

Greedy algorithms break a problem into many components, then solve for the optimal version of each component in isolation. Unfortunately, combining the individually optimal components is not guaranteed to yield an optimal complete solution. However, greedy algorithms can be computationally much cheaper than algorithms that solve for the best joint solution, and the quality of a greedy solution is often acceptable if not optimal. Greedy algorithms may also be followed by a fine-tuning stage in which a joint optimization algorithm searches for an optimal solution to the full problem. Initializing the joint optimization algorithm with a greedy solution can greatly speed it up and improve the quality of the solution it finds.

Pretraining, and especially greedy pretraining, algorithms are ubiquitous in deep learning. In this section, we describe specifically those pretraining algorithms that break supervised learning problems into other simpler supervised learning problems. This approach is known as greedy supervised pretraining.

In the original (Bengio et al., 2007) version of greedy supervised pretraining, each stage consists of a supervised learning training task involving only a subset of the layers in the final neural network. An example of greedy supervised pretraining is illustrated in Fig. 8.7, in which each added hidden layer is pretrained as part of a shallow supervised MLP, taking as input the output of the previously trained hidden layer. Instead of pretraining one layer at a time, Simonyan and Zisserman (2015) pretrain a deep convolutional network (eleven weight layers) and then use the first four and last three layers from this network to initialize even deeper networks (with up to nineteen layers of weights). The middle layers of the new, very deep network are initialized randomly. The new network is then jointly trained. Another option, explored by Yu et al. (2010), is to use the outputs of the previously trained MLPs, as well as the raw input, as inputs for each added stage.

Why would greedy supervised pretraining help? The hypothesis initially discussed by Bengio et al. (2007) is that it helps to provide better guidance to the intermediate levels of a deep hierarchy. In general, pretraining may help both in terms of optimization and in terms of generalization.

An approach related to supervised pretraining extends the idea to the context of transfer learning: Yosinski et al. (2014) pretrain a deep convolutional net with 8 layers of weights on a set of tasks (a subset of the 1000 ImageNet object categories) and then initialize a same-size network with the first k layers of the first net. All the layers of the second network (with the upper layers initialized randomly) are
Figure 8.7: Illustration of one form of greedy supervised pretraining (Bengio et al., 2007). (a) We start by training a sufficiently shallow architecture. (b) Another drawing of the same architecture. (c) We keep only the input-to-hidden layer of the original network and discard the hidden-to-output layer. We send the output of the first hidden layer as input to another supervised single hidden layer MLP that is trained with the same objective as the first network was, thus adding a second hidden layer. This can be repeated for as many layers as desired. (d) Another drawing of the result, viewed as a feedforward network. To further improve the optimization, we can jointly fine-tune all the layers, either only at the end or at each stage of this process.
then jointly trained to perform a different set of tasks (another subset of the 1000 ImageNet object categories), with fewer training examples than for the first set of tasks. Other approaches to transfer learning with neural networks are discussed in Sec. 15.2.

Another related line of work is the FitNets (Romero et al., 2015) approach. This approach begins by training a network that has low enough depth and great enough width (number of units per layer) to be easy to train. This network then becomes a teacher for a second network, designated the student. The student network is much deeper and thinner (eleven to nineteen layers) and would be difficult to train with SGD under normal circumstances. The training of the student network is made easier by training the student network not only to predict the output for the original task, but also to predict the value of the middle layer of the teacher network. This extra task provides a set of hints about how the hidden layers should be used and can simplify the optimization problem. Additional parameters are introduced to regress the middle layer of the 5-layer teacher network from the middle layer of the deeper student network. However, instead of predicting the final classification target, the objective is to predict the middle hidden layer of the teacher network. The lower layers of the student networks thus have two objectives: to help the outputs of the student network accomplish their task, as well as to predict the intermediate layer of the teacher network. Although a thin and deep network appears to be more difficult to train than a wide and shallow network, the thin and deep network may generalize better and certainly has a lower computational cost if it is thin enough to have far fewer parameters. Without the hints on the hidden layer, the student network performs very poorly in the experiments, both on the training and test set. Hints on middle layers may thus be one of the tools to help train neural networks that otherwise seem difficult to train, but other optimization techniques or changes in the architecture may also solve the problem.
8.7.5
Designing Models to Aid Optimization
To improve optimization, the best strategy is not always to improve the optimization algorithm. Instead, many improvements in the optimization of deep models have come from designing the models to be easier to optimize.

In principle, we could use activation functions that increase and decrease in jagged non-monotonic patterns. However, this would make optimization extremely difficult. In practice, it is more important to choose a model family that is easy to optimize than to use a powerful optimization algorithm. Most of the advances in neural network learning over the past 30 years have been
CHAPTER 8. OPTIMIZATION FOR TRAINING DEEP MODELS
obtained by changing the model family rather than changing the optimization procedure. Stochastic gradient descent with momentum, which was used to train neural networks in the 1980s, remains in use in modern state of the art neural network applications.

Specifically, modern neural networks reflect a design choice to use linear transformations between layers and activation functions that are differentiable almost everywhere and have significant slope in large portions of their domain. In particular, model innovations like the LSTM, rectified linear units and maxout units have all moved toward using more linear functions than previous models like deep networks based on sigmoidal units. These models have nice properties that make optimization easier. The gradient flows through many layers provided that the Jacobian of the linear transformation has reasonable singular values. Moreover, linear functions consistently increase in a single direction, so even if the model's output is very far from correct, it is clear simply from computing the gradient which direction its output should move to reduce the loss function. In other words, modern neural nets have been designed so that their local gradient information corresponds reasonably well to moving toward a distant solution.

Other model design strategies can help to make optimization easier. For example, linear paths or skip connections between layers reduce the length of the shortest path from the lower layer's parameters to the output, and thus mitigate the vanishing gradient problem (Srivastava et al., 2015). A related idea to skip connections is adding extra copies of the output that are attached to the intermediate hidden layers of the network, as in GoogLeNet (Szegedy et al., 2014a) and deeply-supervised nets (Lee et al., 2014). These "auxiliary heads" are trained to perform the same task as the primary output at the top of the network in order to ensure that the lower layers receive a large gradient. When training is complete the auxiliary heads may be discarded. This is an alternative to the pretraining strategies, which were introduced in the previous section. In this way, one can train jointly all the layers in a single phase but change the architecture, so that intermediate layers (especially the lower ones) can get some hints about what they should do, via a shorter path. These hints provide an error signal to lower layers.
8.7.6
Continuation Methods and Curriculum Learning
As argued in Sec. 8.2.7, many of the challenges in optimization arise from the global structure of the cost function and cannot be resolved merely by making better estimates of local update directions. The predominant strategy for overcoming this problem is to attempt to initialize the parameters in a region that is connected to the solution by a short path through parameter space that local descent can
discover.

Continuation methods are a family of strategies that can make optimization easier by choosing initial points to ensure that local optimization spends most of its time in well-behaved regions of space. The idea behind continuation methods is to construct a series of objective functions over the same parameters. In order to minimize a cost function J(θ), we will construct new cost functions {J^{(0)}, . . . , J^{(n)}}. These cost functions are designed to be increasingly difficult, with J^{(0)} being fairly easy to minimize, and J^{(n)}, the most difficult, being J(θ), the true cost function motivating the entire process. When we say that J^{(i)} is easier than J^{(i+1)}, we mean that it is well behaved over more of θ space. A random initialization is more likely to land in the region where local descent can minimize the cost function successfully because this region is larger. The series of cost functions are designed so that a solution to one is a good initial point of the next. We thus begin by solving an easy problem then refine the solution to solve incrementally harder problems until we arrive at a solution to the true underlying problem.

Traditional continuation methods (predating the use of continuation methods for neural network training) are usually based on smoothing the objective function. See Wu (1997) for an example of such a method and a review of some related methods. Continuation methods are also closely related to simulated annealing, which adds noise to the parameters (Kirkpatrick et al., 1983). Continuation methods have been extremely successful in recent years. See Mobahi and Fisher (2015) for an overview of recent literature, especially for AI applications.

Continuation methods traditionally were mostly designed with the goal of overcoming the challenge of local minima. Specifically, they were designed to reach a global minimum despite the presence of many local minima. To do so, these continuation methods would construct easier cost functions by "blurring" the original cost function. This blurring operation can be done by approximating

    J^{(i)}(θ) = E_{θ′ ∼ N(θ′; θ, σ^{(i)2})} J(θ′)        (8.46)

via sampling. The intuition for this approach is that some non-convex functions become approximately convex when blurred. In many cases, this blurring preserves enough information about the location of a global minimum that we can find the global minimum by solving progressively less blurred versions of the problem. This approach can break down in three different ways. First, it might successfully define a series of cost functions where the first is convex and the optimum tracks from one function to the next arriving at the global minimum, but it might require so many incremental cost functions that the cost of the entire procedure remains high. NP-hard optimization problems remain NP-hard, even when continuation methods
are applicable. The other two ways that continuation methods fail both correspond to the method not being applicable. First, the function might not become convex, no matter how much it is blurred. Consider for example the function J(θ) = −θ. Second, the function may become convex as a result of blurring, but the minimum of this blurred function may track to a local rather than a global minimum of the original cost function.

Though continuation methods were mostly originally designed to deal with the problem of local minima, local minima are no longer believed to be the primary problem for neural network optimization. Fortunately, continuation methods can still help. The easier objective functions introduced by the continuation method can eliminate flat regions, decrease variance in gradient estimates, improve conditioning of the Hessian matrix, or do anything else that will either make local updates easier to compute or improve the correspondence between local update directions and progress toward a global solution.

Bengio et al. (2009) observed that an approach called curriculum learning or shaping can be interpreted as a continuation method. Curriculum learning is based on the idea of planning a learning process to begin by learning simple concepts and progress to learning more complex concepts that depend on these simpler concepts. This basic strategy was previously known to accelerate progress in animal training (Skinner, 1958; Peterson, 2004; Krueger and Dayan, 2009) and machine learning (Solomonoff, 1989; Elman, 1993; Sanger, 1994). Bengio et al. (2009) justified this strategy as a continuation method, where earlier J^{(i)} are made easier by increasing the influence of simpler examples (either by assigning their contributions to the cost function larger coefficients, or by sampling them more frequently), and experimentally demonstrated that better results could be obtained by following a curriculum on a large-scale neural language modeling task. Curriculum learning has been successful on a wide range of natural language (Spitkovsky et al., 2010; Collobert et al., 2011a; Mikolov et al., 2011b; Tu and Honavar, 2011) and computer vision (Kumar et al., 2010; Lee and Grauman, 2011; Supancic and Ramanan, 2013) tasks. Curriculum learning was also verified as being consistent with the way in which humans teach (Khan et al., 2011): teachers start by showing easier and more prototypical examples and then help the learner refine the decision surface with the less obvious cases. Curriculum-based strategies are more effective for teaching humans than strategies based on uniform sampling of examples, and can also increase the effectiveness of other teaching strategies (Basu and Christensen, 2013).

Another important contribution to research on curriculum learning arose in the context of training recurrent neural networks to capture long-term dependencies:
Zaremba and Sutskever (2014) found that much better results were obtained with a stochastic curriculum, in which a random mix of easy and difficult examples is always presented to the learner, but where the average proportion of the more difficult examples (here, those with longer-term dependencies) is gradually increased. With a deterministic curriculum, no improvement over the baseline (ordinary training from the full training set) was observed.

We have now described the basic family of neural network models and how to regularize and optimize them. In the chapters ahead, we turn to specializations of the neural network family, that allow neural networks to scale to very large sizes and process input data that has special structure. The optimization methods discussed in this chapter are often directly applicable to these specialized architectures with little or no modification.
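As a closing sketch, a stochastic curriculum in the spirit described above might look as follows. The linear difficulty schedule, function names, and data layout are illustrative assumptions, not the authors' procedure.

```python
import random

def stochastic_curriculum(easy, hard, n_steps, seed=0):
    # Yield one training example per step. Every step is a random mix:
    # easy examples remain possible throughout, but the probability of
    # drawing a difficult example grows over the course of training.
    rng = random.Random(seed)
    for step in range(n_steps):
        p_hard = step / max(n_steps - 1, 1)  # rises linearly from 0 to 1
        pool = hard if rng.random() < p_hard else easy
        yield rng.choice(pool)
```

A deterministic curriculum would instead present only easy examples at first and only hard ones later; the observation above is that the random mixing is what helps.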
Chapter 9
Convolutional Networks

Convolutional networks (LeCun, 1989), also known as convolutional neural networks or CNNs, are a specialized kind of neural network for processing data that has a known, grid-like topology. Examples include time-series data, which can be thought of as a 1D grid taking samples at regular time intervals, and image data, which can be thought of as a 2D grid of pixels. Convolutional networks have been tremendously successful in practical applications. The name "convolutional neural network" indicates that the network employs a mathematical operation called convolution. Convolution is a specialized kind of linear operation. Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.

In this chapter, we will first describe what convolution is. Next, we will explain the motivation behind using convolution in a neural network. We will then describe an operation called pooling, which almost all convolutional networks employ. Usually, the operation used in a convolutional neural network does not correspond precisely to the definition of convolution as used in other fields such as engineering or pure mathematics. We will describe several variants on the convolution function that are widely used in practice for neural networks. We will also show how convolution may be applied to many kinds of data, with different numbers of dimensions. We then discuss means of making convolution more efficient. Convolutional networks stand out as an example of neuroscientific principles influencing deep learning. We will discuss these neuroscientific principles, then conclude with comments about the role convolutional networks have played in the history of deep learning. One topic this chapter does not address is how to choose the architecture of your convolutional network. The goal of this chapter is to describe the kinds of tools that convolutional networks provide, while Chapter 11
CHAPTER 9. CONVOLUTIONAL NETWORKS
describes general guidelines for choosing which tools to use in which circumstances. Research into convolutional network architectures proceeds so rapidly that a new best architecture for a given benchmark is announced every few weeks to months, rendering it impractical to describe the best architecture in print. However, the best architectures have consistently been composed of the building blocks described here.
9.1
The Convolution Operation
In con is an op conv volution operation eration on tw two o functions of a real9.1its most Thegeneral Conform, volution Operation valued argument. To motiv motivate ate the definition of con conv volution, we start with examples In its most general form, con v olution is an op eration on two functions of a realof two functions we migh mightt use. valued argument. To motivate the definition of convolution, we start with examples Supp Suppose ose we are tracking the lo location cation of a spaceship with a laser sensor. Our of two functions we might use. laser sensor provides a single output x(t), the p osition of the spaceship at time Suppxose wet are the i.e., lo cation of get a spaceship with a laser sensor. Our t. Both and are tracking real-v real-valued, alued, we can a different reading from the laser laser sensor a single sensor at an any yprovides instan instantt in time. output x(t), the p osition of the spaceship at time t. Both x and t are real-valued, i.e., we can get a different reading from the laser No Now w that laser sensor is somewhat noisy noisy.. To obtain a less noisy sensor atsuppose any instan t inour time. estimate of the spaceship’s p osition, we would like to av average erage together sev several eral No w suppose that our laser sensor is somewhat noisy . To relev obtain lesswenoisy measuremen measurements. ts. Of course, more recent measuremen measurements ts are more relevan an ant, t,a so will estimate of the spaceship’s p osition, we w ould like to av erage together sev eral wan antt this to b e a weigh eighted ted av average erage that gives more weigh weightt to recent measuremen measurements. ts. measuremen ts. Of course, more recent measuremen ts are more relev an t, so w e will We can do this with a weigh weighting ting function w(a), where a is the age of a measuremen measurement. t. wan tosuc b e ha awweigh eighted erage gives more weighmoment, t to recent ts. 
If wet this apply such weighted tedavav average eragethat op operation eration at every wemeasuremen obtain a new w(of a),the a is the We can dos this with aaweigh ting function where a measurement. function pro providing viding smo smoothed othed estimate position of age the of spaceship: If we apply such a weighted averageZ op eration at every moment, we obtain a new function s providing a smo othed of the spaceship: (9.1) s(t) =estimate x(a)wof(tthe − aposition )da s(t) = x(a)w (t a)da (9.1) This op operation eration is called convolution onvolution.. The conv convolution olution op operation eration is typically − denoted with an asterisk: This op eration is called convolution s(t) = (x. ∗The w )(tconv ) olution op eration is typically (9.2) denoted with an asterisk: Z x wprobability )(t) In our example, w needs to sb(te) a= v(alid density function, or(9.2) the output is not a weigh weighted ted av average. erage. Also, w needs to b e 0 for all negative argumen arguments, ts, ∗ w In our example, needs to b e a v alid probability density function, or the or it will lo look ok into the future, which is presumably b ey eyond ond our capabilities. These w output is not a weigh ted av erage. Also, needs to b e 0 for allcon negative ts, limitations are particular to our example though. In general, conv volutionargumen is defined or itan will lo ok intoforthe future, presumably b eyond capabilities. for any y functions which thewhich ab abov ov ovee isintegral is defined, andour may b e used forThese other limitations are particular to ourted example though. In general, convolution is defined purp purposes oses besides taking weigh weighted av averages. erages. for any functions for which the ab ove integral is defined, and may b e used for other In con conv volutional net netw work terminology terminology,, the first argumen argumentt (in this example, the purp oses besides taking weighted averages. 
In convolutional network terminology, the first argument (in this example, the function x) to the convolution is often referred to as the input and the second
CHAPTER 9. CONVOLUTIONAL NETWORKS
argument (in this example, the function w) as the kernel. The output is sometimes referred to as the feature map.

In our example, the idea of a laser sensor that can provide measurements at every instant in time is not realistic. Usually, when we work with data on a computer, time will be discretized, and our sensor will provide data at regular intervals. In our example, it might be more realistic to assume that our laser provides a measurement once per second. The time index t can then take on only integer values. If we now assume that x and w are defined only on integer t, we can define the discrete convolution:

s(t) = (x ∗ w)(t) = Σ_{a=−∞}^{∞} x(a) w(t − a)    (9.3)
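The discrete convolution above can be sketched directly in code. The following is an illustrative NumPy sketch (the function name and the small arrays are our own, chosen to echo the noisy-sensor example), restricted to the positions where the kernel fully overlaps the input:

```python
import numpy as np

# Hypothetical noisy position readings x and weighting function w;
# w[0] is the weight given to the most recent measurement.
x = np.array([0.0, 1.1, 1.9, 3.2, 3.9, 5.1])
w = np.array([0.5, 0.3, 0.2])

def discrete_conv(x, w):
    # s(t) = sum_a x(a) w(t - a), evaluated only where w fully overlaps x.
    s = np.zeros(len(x) - len(w) + 1)
    for t in range(len(s)):
        for a in range(len(w)):
            # Reversing w implements the "age" indexing of the formula above.
            s[t] += x[t + a] * w[len(w) - 1 - a]
    return s

print(discrete_conv(x, w))
print(np.convolve(x, w, mode="valid"))  # agrees with NumPy's built-in convolution
```

The "valid" mode keeps only the outputs where the summation stays inside the stored, finite portion of x.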
In machine learning applications, the input is usually a multidimensional array of data and the kernel is usually a multidimensional array of parameters that are adapted by the learning algorithm. We will refer to these multidimensional arrays as tensors. Because each element of the input and kernel must be explicitly stored separately, we usually assume that these functions are zero everywhere but the finite set of points for which we store the values. This means that in practice we can implement the infinite summation as a summation over a finite number of array elements.

Finally, we often use convolutions over more than one axis at a time. For example, if we use a two-dimensional image I as our input, we probably also want to use a two-dimensional kernel K:

S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(m, n) K(i − m, j − n).    (9.4)
Convolution is commutative, meaning we can equivalently write:

S(i, j) = (K ∗ I)(i, j) = Σ_m Σ_n I(i − m, j − n) K(m, n).    (9.5)

Usually the latter formula is more straightforward to implement in a machine learning library, because there is less variation in the range of valid values of m and n.

The commutative property of convolution arises because we have flipped the kernel relative to the input, in the sense that as m increases, the index into the input increases, but the index into the kernel decreases. The only reason to flip the kernel is to obtain the commutative property. While the commutative property
is useful for writing proofs, it is not usually an important property of a neural network implementation. Instead, many neural network libraries implement a related function called the cross-correlation, which is the same as convolution but without flipping the kernel:

S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(i + m, j + n) K(m, n).    (9.6)
Many machine learning libraries implement cross-correlation but call it convolution. In this text we will follow this convention of calling both operations convolution, and specify whether we mean to flip the kernel or not in contexts where kernel flipping is relevant. In the context of machine learning, the learning algorithm will learn the appropriate values of the kernel in the appropriate place, so an algorithm based on convolution with kernel flipping will learn a kernel that is flipped relative to the kernel learned by an algorithm without the flipping. It is also rare for convolution to be used alone in machine learning; instead convolution is used simultaneously with other functions, and the combination of these functions does not commute regardless of whether the convolution operation flips its kernel or not.

See Fig. 9.1 for an example of convolution (without kernel flipping) applied to a 2-D tensor.
Discrete convolution can be viewed as multiplication by a matrix. However, the matrix has several entries constrained to be equal to other entries. For example, for univariate discrete convolution, each row of the matrix is constrained to be equal to the row above shifted by one element. This is known as a Toeplitz matrix. In two dimensions, a doubly block circulant matrix corresponds to convolution. In addition to these constraints that several elements be equal to each other, convolution usually corresponds to a very sparse matrix (a matrix whose entries are mostly equal to zero). This is because the kernel is usually much smaller than the input image. Any neural network algorithm that works with matrix multiplication and does not depend on specific properties of the matrix structure should work with convolution, without requiring any further changes to the neural network.
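This matrix view can be sketched for the univariate case. In the illustrative NumPy snippet below (the signal and kernel values are made up), each row of the Toeplitz matrix T carries the flipped kernel, shifted one column per row, and multiplying by T reproduces the convolution:

```python
import numpy as np

x = np.array([1.0, 2.0, 0.0, -1.0, 3.0])
w = np.array([0.5, 0.25])

# Build the Toeplitz matrix for "valid" 1-D convolution: each row is the
# row above shifted right by one element, and most entries are zero.
n_out = len(x) - len(w) + 1
T = np.zeros((n_out, len(x)))
for i in range(n_out):
    T[i, i:i + len(w)] = w[::-1]

print(T @ x)                            # matrix multiplication...
print(np.convolve(x, w, mode="valid"))  # ...matches direct convolution
```

Note that T illustrates both constraints from the text: its entries are mostly zero, and the nonzero entries along each diagonal are tied to be equal.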
Typical convolutional neural networks do make use of further specializations in order to deal with large inputs efficiently, but these are not strictly necessary from a theoretical perspective.
[Figure 9.1 diagram: a 3×4 input grid with entries a–l, a 2×2 kernel with entries w, x, y, z, and the resulting 2×3 output with entries aw+bx+ey+fz, bw+cx+fy+gz, cw+dx+gy+hz, ew+fx+iy+jz, fw+gx+jy+kz, gw+hx+ky+lz.]
Figure 9.1: An example of 2-D convolution without kernel-flipping. In this case we restrict the output to only positions where the kernel lies entirely within the image, called "valid" convolution in some contexts. We draw boxes with arrows to indicate how the upper-left element of the output tensor is formed by applying the kernel to the corresponding upper-left region of the input tensor.
9.2 Motivation
Convolution leverages three important ideas that can help improve a machine learning system: sparse interactions, parameter sharing and equivariant representations. Moreover, convolution provides a means for working with inputs of variable size. We now describe each of these ideas in turn.

Traditional neural network layers use matrix multiplication by a matrix of parameters with a separate parameter describing the interaction between each input unit and each output unit. This means every output unit interacts with every input unit. Convolutional networks, however, typically have sparse interactions (also referred to as sparse connectivity or sparse weights). This is accomplished by making the kernel smaller than the input.
For example, when processing an image, the input image might have thousands or millions of pixels, but we can detect small, meaningful features such as edges with kernels that occupy only tens or hundreds of pixels. This means that we need to store fewer parameters, which both reduces the memory requirements of the model and improves its statistical efficiency. It also means that computing the output requires fewer operations. These improvements in efficiency are usually quite large. If there are m inputs and n outputs, then matrix multiplication requires m × n parameters and the algorithms used in practice have O(m × n) runtime (per example). If we limit the number of connections each output may have to k, then the sparsely connected approach requires only k × n parameters and O(k × n) runtime. For many practical applications, it is possible to obtain good performance on the machine learning task while keeping k several orders of magnitude smaller than m.
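These counts are easy to make concrete. The numbers below are hypothetical, chosen only to illustrate the scaling described above:

```python
# Hypothetical layer sizes: m inputs, n outputs, k connections per output.
m = 1_000_000   # e.g. inputs from a one-megapixel image
n = 1_000_000
k = 9           # e.g. a 3x3 kernel

dense_params = m * n     # fully connected: O(m x n) parameters
sparse_params = k * n    # sparse connectivity: O(k x n) parameters

# The dense layer needs over 100,000 times more parameters here.
print(dense_params // sparse_params)
```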
For graphical demonstrations of sparse connectivity, see Fig. 9.2 and Fig. 9.3. In a deep convolutional network, units in the deeper layers may indirectly interact with a larger portion of the input, as shown in Fig. 9.4. This allows the network to efficiently describe complicated interactions between many variables by constructing such interactions from simple building blocks that each describe only sparse interactions.

Parameter sharing refers to using the same parameter for more than one function in a model. In a traditional neural net, each element of the weight matrix is used exactly once when computing the output of a layer. It is multiplied by one element of the input and then never revisited. As a synonym for parameter sharing, one can say that a network has tied weights, because the value of the weight applied to one input is tied to the value of a weight applied elsewhere. In a convolutional
In a conv convolutional olutional one cannet, sayeach that member a network d weights , b ecause the value ofofthe applied neural of has the tie kernel is used at every p osition theweigh inputt (except toerhaps one input the valuepixels, of a wdep eigh t applied elsewhere. In a convregarding olutional p someisoftied the to b oundary depending ending on the design decisions neural net, each member of the kernel is used at every p osition of the input the b oundary). The parameter sharing used by the con conv volution operation(except means p erhaps some of the b oundary pixels,set dep on thefor design that rather than learning a separate ofending parameters ev every erydecisions lo location, cation,regarding we learn the b oundary). The parameter sharing used by the convolution operation means that rather than learning a separate set336 of parameters for every lo cation, we learn
Figure 9.2: Sparse connectivity, viewed from below: We highlight one input unit, x3, and also highlight the output units in s that are affected by this unit. (Top) When s is formed by convolution with a kernel of width 3, only three outputs are affected by x3. (Bottom) When s is formed by matrix multiplication, connectivity is no longer sparse, so all of the outputs are affected by x3.
Figure 9.3: Sparse connectivity, viewed from above: We highlight one output unit, s3, and also highlight the input units in x that affect this unit. These units are known as the receptive field of s3. (Top) When s is formed by convolution with a kernel of width 3, only three inputs affect s3. (Bottom) When s is formed by matrix multiplication, connectivity is no longer sparse, so all of the inputs affect s3.
Figure 9.4: The receptive field of the units in the deeper layers of a convolutional network is larger than the receptive field of the units in the shallow layers. This effect increases if the network includes architectural features like strided convolution (Fig. 9.12) or pooling (Sec. 9.3). This means that even though direct connections in a convolutional net are very sparse, units in the deeper layers can be indirectly connected to all or most of the input image.
Figure 9.5: Parameter sharing: Black arrows indicate the connections that use a particular parameter in two different models. (Top) The black arrows indicate uses of the central element of a 3-element kernel in a convolutional model. Due to parameter sharing, this single parameter is used at all input locations. (Bottom) The single black arrow indicates the use of the central element of the weight matrix in a fully connected model. This model has no parameter sharing so the parameter is used only once.
only one set. This does not affect the runtime of forward propagation—it is still O(k × n)—but it does further reduce the storage requirements of the model to k parameters. Recall that k is usually several orders of magnitude less than m. Since m and n are usually roughly the same size, k is practically insignificant compared to m × n. Convolution is thus dramatically more efficient than dense matrix multiplication in terms of the memory requirements and statistical efficiency. For a graphical depiction of how parameter sharing works, see Fig. 9.5.

As an example of both of these first two principles in action, Fig. 9.6 shows how sparse connectivity and parameter sharing can dramatically improve the efficiency of a linear function for detecting edges in an image.

In the case of convolution, the particular form of parameter sharing causes the layer to have a property called equivariance to translation.
To say a function is equivariant means that if the input changes, the output changes in the same way. Specifically, a function f(x) is equivariant to a function g if f(g(x)) = g(f(x)). In the case of convolution, if we let g be any function that translates the input, i.e., shifts it, then the convolution function is equivariant to g. For example, let I be a function giving image brightness at integer coordinates. Let g be a function mapping one image function to another image function, such that I′ = g(I) is
the image function with I′(x, y) = I(x − 1, y). This shifts every pixel of I one unit to the right. If we apply this transformation to I, then apply convolution, the result will be the same as if we applied convolution to I, then applied the transformation g to the output. When processing time series data, this means that convolution produces a sort of timeline that shows when different features appear in the input. If we move an event later in time in the input, the exact same representation of it will appear in the output, just later in time. Similarly with images, convolution creates a 2-D map of where certain features appear in the input. If we move the object in the input, its representation will move the same amount in the output. This is useful for when we know that some function of a small number of neighboring pixels is useful when applied to multiple input locations.
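Translation equivariance can be checked numerically. The following is an illustrative NumPy sketch (the signal and kernel values are arbitrary): shifting a 1-D input by prepending a zero and then convolving gives the same result as convolving first and then shifting the output:

```python
import numpy as np

x = np.array([0.0, 1.0, 3.0, 2.0, 0.0])   # a hypothetical 1-D signal
w = np.array([1.0, -1.0])                  # a simple difference kernel

def shift(v):
    # g: translate the signal one step later in time.
    return np.concatenate(([0.0], v))

# f(g(x)) == g(f(x)): convolution commutes with translation.
print(np.allclose(np.convolve(shift(x), w), shift(np.convolve(x, w))))
```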
For example, when processing images, it is useful to detect edges in the first layer of a convolutional network. The same edges appear more or less everywhere in the image, so it is practical to share parameters across the entire image. In some cases, we may not wish to share parameters across the entire image. For example, if we are processing images that are cropped to be centered on an individual's face, we probably want to extract different features at different locations—the part of the network processing the top of the face needs to look for eyebrows, while the part of the network processing the bottom of the face needs to look for a chin.

Convolution is not naturally equivariant to some other transformations, such as changes in the scale or rotation of an image.
Other mechanisms are necessary for handling these kinds of transformations.

Finally, some kinds of data cannot be processed by neural networks defined by matrix multiplication with a fixed-shape matrix. Convolution enables processing of some of these kinds of data. We discuss this further in Sec. 9.7.
9.3 Pooling
A typical layer of a convolutional network consists of three stages (see Fig. 9.7). In the first stage, the layer performs several convolutions in parallel to produce a set of linear activations. In the second stage, each linear activation is run through a nonlinear activation function, such as the rectified linear activation function. This stage is sometimes called the detector stage. In the third stage, we use a pooling function to modify the output of the layer further.

A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby outputs. For example, the max pooling (Zhou and Chellappa, 1988) operation reports the maximum output within a rectangular
Figure 9.6: Efficiency of edge detection. The image on the right was formed by taking each pixel in the original image and subtracting the value of its neighboring pixel on the left. This shows the strength of all of the vertically oriented edges in the input image, which can be a useful operation for object detection. Both images are 280 pixels tall. The input image is 320 pixels wide while the output image is 319 pixels wide. This transformation can be described by a convolution kernel containing two elements, and requires 319 × 280 × 3 = 267,960 floating point operations (two multiplications and one addition per output pixel) to compute using convolution. To describe the same transformation with a matrix multiplication would take 320 × 280 × 319 × 280, or over eight billion, entries in the matrix, making convolution four billion times more efficient for representing this transformation. The straightforward matrix multiplication algorithm performs over sixteen billion floating point operations, making convolution roughly 60,000 times more efficient computationally. Of course, most of the entries of the matrix would be zero. If we stored only the nonzero entries of the matrix, then both matrix multiplication and convolution would require the same number of floating point operations to compute. The matrix would still need to contain 2 × 319 × 280 = 178,640 entries. Convolution is an extremely efficient way of describing transformations that apply the same linear transformation of a small, local region across the entire input. (Photo credit: Paula Goodfellow)
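To make the figure's efficiency claim concrete, here is a small sketch (ours, not from the book) that applies a two-element edge-detection kernel to a 1-D signal both as a sliding-window convolution and as a multiplication by the equivalent, mostly zero matrix; the two outputs agree.

```python
import numpy as np

# Edge-detection kernel in the spirit of the figure: each output is the
# difference between a pixel and its right-hand neighbor.
kernel = np.array([1.0, -1.0])

signal = np.array([3.0, 3.0, 5.0, 5.0, 2.0])   # a tiny 1-D "image row"
n_out = signal.size - kernel.size + 1          # valid convolution: 4 outputs

# 1) As a (cross-correlation style) convolution: slide the kernel across the input.
conv_out = np.array([np.dot(kernel, signal[i:i + kernel.size])
                     for i in range(n_out)])

# 2) As a matrix multiplication: a banded matrix with the kernel on each row.
W = np.zeros((n_out, signal.size))
for i in range(n_out):
    W[i, i:i + kernel.size] = kernel
matmul_out = W @ signal

assert np.allclose(conv_out, matmul_out)
print(conv_out)  # [ 0. -2.  0.  3.]
```

The matrix W stores n_out × signal.size entries to represent the same transformation that the kernel describes with just two numbers, which is the efficiency gap the caption quantifies.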
[Figure 9.7 diagram: two parallel pipelines. Complex layer terminology: input to layer → convolutional layer (convolution stage: affine transform; detector stage: nonlinearity, e.g., rectified linear; pooling stage) → next layer. Simple layer terminology: input to layer → convolution layer (affine transform) → detector layer (nonlinearity, e.g., rectified linear) → pooling layer → next layer.]
Figure 9.7: The components of a typical convolutional neural network layer. There are two commonly used sets of terminology for describing these layers. (Left) In this terminology, the convolutional net is viewed as a small number of relatively complex layers, with each layer having many "stages." In this terminology, there is a one-to-one mapping between kernel tensors and network layers. In this book we generally use this terminology. (Right) In this terminology, the convolutional net is viewed as a larger number of simple layers; every step of processing is regarded as a layer in its own right. This means that not every "layer" has parameters.
neighborhood. Other popular pooling functions include the average of a rectangular neighborhood, the L2 norm of a rectangular neighborhood, or a weighted average based on the distance from the central pixel.

In all cases, pooling helps to make the representation become approximately invariant to small translations of the input. Invariance to translation means that if we translate the input by a small amount, the values of most of the pooled outputs do not change. See Fig. 9.8 for an example of how this works. Invariance to local translation can be a very useful property if we care more about whether some feature is present than exactly where it is. For example, when determining whether an image contains a face, we need not know the location of the eyes with pixel-perfect accuracy, we just need to know that there is an eye on the left side of the face and an eye on the right side of the face.
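As an illustration (ours, not the book's), the summary statistics named above can each be computed over a small 1-D neighborhood of detector outputs:

```python
import numpy as np

detector_outputs = np.array([0.1, 1.0, 0.2])   # a width-3 pooling neighborhood

max_pool = detector_outputs.max()                 # max pooling
avg_pool = detector_outputs.mean()                # average of the neighborhood
l2_pool = np.sqrt(np.sum(detector_outputs ** 2))  # L2 norm of the neighborhood

# Weighted average based on distance from the central pixel.
# These particular weights are an arbitrary choice for illustration.
weights = np.array([0.25, 0.5, 0.25])
weighted_pool = np.dot(weights, detector_outputs)

print(max_pool, avg_pool, l2_pool, weighted_pool)
```

Each statistic summarizes the same neighborhood; which one is preferable depends on the task, as the theoretical work cited later in this section discusses.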
In other contexts, it is more important to preserve the location of a feature. For example, if we want to find a corner defined by two edges meeting at a specific orientation, we need to preserve the location of the edges well enough to test whether they meet.

The use of pooling can be viewed as adding an infinitely strong prior that the function the layer learns must be invariant to small translations. When this assumption is correct, it can greatly improve the statistical efficiency of the network.

Pooling over spatial regions produces invariance to translation, but if we pool over the outputs of separately parametrized convolutions, the features can learn which transformations to become invariant to (see Fig. 9.9).
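A small sketch (ours, under made-up values) of the invariance described above: with stride-1 max pooling over width-3 regions, shifting the detector outputs by one pixel changes every detector value but only some of the pooled values.

```python
import numpy as np

def max_pool_1d(x, width=3):
    """Stride-1 max pooling over a window of `width` values."""
    return np.array([x[i:i + width].max() for i in range(len(x) - width + 1)])

detector = np.array([0.1, 1.0, 0.2, 0.1, 0.0, 0.1])
shifted = np.roll(detector, 1)   # translate the detector outputs by one pixel

pooled = max_pool_1d(detector)
pooled_shifted = max_pool_1d(shifted)

# Every detector value moved, but half of the pooled values are unchanged.
print(pooled)          # [1.  1.  0.2 0.1]
print(pooled_shifted)  # [1.  1.  1.  0.2]
```

The pooled outputs change less than the detector outputs because each max pooling unit only cares about the maximum value in its neighborhood, not its exact location.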
Because pooling summarizes the responses over a whole neighborhood, it is possible to use fewer pooling units than detector units, by reporting summary statistics for pooling regions spaced k pixels apart rather than 1 pixel apart. See Fig. 9.10 for an example. This improves the computational efficiency of the network because the next layer has roughly k times fewer inputs to process. When the number of parameters in the next layer is a function of its input size (such as when the next layer is fully connected and based on matrix multiplication) this reduction in the input size can also result in improved statistical efficiency and reduced memory requirements for storing the parameters.

For many tasks, pooling is essential for handling inputs of varying size. For example, if we want to classify images of variable size, the input to the classification layer must have a fixed size.
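Both ideas can be sketched in a few lines (our illustration, with made-up sizes, not code from the book): pooling regions spaced by a stride shrink the output by roughly that factor, while dividing the input into a fixed number of regions yields a fixed-size output no matter how long the input is.

```python
import numpy as np

def strided_max_pool(x, width, stride):
    """Max pooling over regions spaced `stride` pixels apart."""
    return np.array([x[i:i + width].max()
                     for i in range(0, len(x) - width + 1, stride)])

def fixed_count_max_pool(x, n_pools):
    """Split the input into `n_pools` regions, whatever its length."""
    return np.array([chunk.max() for chunk in np.array_split(x, n_pools)])

x = np.random.rand(12)
print(strided_max_pool(x, width=3, stride=2).shape)  # (5,) roughly half as many
print(fixed_count_max_pool(x, n_pools=4).shape)      # (4,)

y = np.random.rand(31)                               # a different input size...
print(fixed_count_max_pool(y, n_pools=4).shape)      # (4,) same output size
```

The second function mirrors the idea of varying the pooling region size so that a classification layer always receives the same number of summary statistics.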
This is usually accomplished by varying the size of an offset between pooling regions so that the classification layer always receives the same number of summary statistics regardless of the input size. For example, the final pooling layer of the network may be defined to output four sets of summary statistics, one for each quadrant of an image, regardless of the image size.

Some theoretical work gives guidance as to which kinds of pooling one should
Figure 9.8: Max pooling introduces invariance. (Top) A view of the middle of the output of a convolutional layer. The bottom row shows outputs of the nonlinearity. The top row shows the outputs of max pooling, with a stride of one pixel between pooling regions and a pooling region width of three pixels. (Bottom) A view of the same network, after the input has been shifted to the right by one pixel. Every value in the bottom row has changed, but only half of the values in the top row have changed, because the max pooling units are only sensitive to the maximum value in the neighborhood, not its exact location.
Figure 9.9: Example of learned invariances: A pooling unit that pools over multiple features that are learned with separate parameters can learn to be invariant to transformations of the input. Here we show how a set of three learned filters and a max pooling unit can learn to become invariant to rotation. All three filters are intended to detect a hand-written 5. Each filter attempts to match a slightly different orientation of the 5. When a 5 appears in the input, the corresponding filter will match it and cause a large activation in a detector unit. The max pooling unit then has a large activation regardless of which detector unit was activated. We show here how the network processes two different inputs, resulting in two different detector units being activated. The effect on the pooling unit is roughly the same either way. This principle is leveraged by maxout networks (Goodfellow et al., 2013a) and other convolutional networks. Max pooling over spatial positions is naturally invariant to translation; this multi-channel approach is only necessary for learning other transformations.
Figure 9.10: Pooling with downsampling. Here we use max-pooling with a pool width of three and a stride between pools of two. This reduces the representation size by a factor of two, which reduces the computational and statistical burden on the next layer. Note that the rightmost pooling region has a smaller size, but must be included if we do not want to ignore some of the detector units.
use in various situations (Boureau et al., 2010). It is also possible to dynamically pool features together, for example, by running a clustering algorithm on the locations of interesting features (Boureau et al., 2011). This approach yields a different set of pooling regions for each image. Another approach is to learn a single pooling structure that is then applied to all images (Jia et al., 2012).

Pooling can complicate some kinds of neural network architectures that use top-down information, such as Boltzmann machines and autoencoders. These issues will be discussed further when we present these types of networks in Part III. Pooling in convolutional Boltzmann machines is presented in Sec. 20.6. The inverse-like operations on pooling units needed in some differentiable networks will be covered in Sec. 20.10.6.

Some examples of complete convolutional network architectures for classification using convolution and pooling are shown in Fig. 9.11.
9.4 Convolution and Pooling as an Infinitely Strong Prior

Recall the concept of a prior probability distribution from Sec. 5.2. This is a probability distribution over the parameters of a model that encodes our beliefs about what models are reasonable, before we have seen any data.

Priors can be considered weak or strong depending on how concentrated the probability density in the prior is. A weak prior is a prior distribution with high entropy, such as a Gaussian distribution with high variance. Such a prior allows the data to move the parameters more or less freely. A strong prior has very low entropy, such as a Gaussian distribution with low variance. Such a prior plays a more active role in determining where the parameters end up.

An infinitely strong prior places zero probability on some parameters and says that these parameter values are completely forbidden, regardless of how much support the data gives to those values.
We can imagine a convolutional net as being similar to a fully connected net, but with an infinitely strong prior over its weights. This infinitely strong prior says that the weights for one hidden unit must be identical to the weights of its neighbor, but shifted in space. The prior also says that the weights must be zero, except for in the small, spatially contiguous receptive field assigned to that hidden unit. Overall, we can think of the use of convolution as introducing an infinitely strong prior probability distribution over the parameters of a layer. This prior says that the function the layer should learn contains only local interactions and is
[Figure 9.11 diagram: three architectures, each starting from a 256x256x3 input image and stacking convolution+ReLU (256x256x64), pooling with stride 4 (64x64x64), convolution+ReLU (64x64x64), and pooling with stride 4 (16x16x64), then diverging. (Left) reshape to a vector of 16,384 units → matrix multiply to 1,000 units → softmax over 1,000 class probabilities. (Center) pooling to a 3x3 grid (3x3x64) → reshape to a vector of 576 units → matrix multiply to 1,000 units → softmax over 1,000 class probabilities. (Right) convolution to 16x16x1,000 → average pooling to 1x1x1,000 → softmax over 1,000 class probabilities.]
Figure 9.11: Examples of architectures for classification with convolutional networks. The specific strides and depths used in this figure are not advisable for real use; they are designed to be very shallow in order to fit onto the page. Real convolutional networks also often involve significant amounts of branching, unlike the chain structures used here for simplicity. (Left) A convolutional network that processes a fixed image size. After alternating between convolution and pooling for a few layers, the tensor for the convolutional feature map is reshaped to flatten out the spatial dimensions. The rest of the network is an ordinary feedforward network classifier, as described in Chapter 6. (Center) A convolutional network that processes a variable-sized image, but still maintains a fully connected section. This network uses a pooling operation with variably-sized pools but a fixed number of pools, in order to provide a fixed-size vector of 576 units to the fully connected portion of the network. (Right) A convolutional network that does not have any fully connected weight layer. Instead, the last convolutional layer outputs one feature map per class. The model presumably learns a map of how likely each class is to occur at each spatial location. Averaging a feature map down to a single value provides the argument to the softmax classifier at the top.
equivariant to translation. Likewise, the use of pooling is an infinitely strong prior that each unit should be invariant to small translations.

Of course, implementing a convolutional net as a fully connected net with an infinitely strong prior would be extremely computationally wasteful. But thinking of a convolutional net as a fully connected net with an infinitely strong prior can give us some insights into how convolutional nets work.

One key insight is that convolution and pooling can cause underfitting. Like any prior, convolution and pooling are only useful when the assumptions made by the prior are reasonably accurate. If a task relies on preserving precise spatial information, then using pooling on all features can increase the training error. Some convolutional network architectures (Szegedy et al., 2014a) are designed to use pooling on some channels but not on other channels, in order to get both highly invariant features and features that will not underfit when the translation invariance prior is incorrect. When a task involves incorporating information from very distant locations in the input, then the prior imposed by convolution may be inappropriate.

Another key insight from this view is that we should only compare convolutional models to other convolutional models in benchmarks of statistical learning performance. Models that do not use convolution would be able to learn even if we permuted all of the pixels in the image. For many image datasets, there are separate benchmarks for models that are permutation invariant and must discover the concept of topology via learning, and models that have the knowledge of spatial relationships hard-coded into them by their designer.
9.5 Variants of the Basic Convolution Function

When discussing convolution in the context of neural networks, we usually do not refer exactly to the standard discrete convolution operation as it is usually understood in the mathematical literature. The functions used in practice differ slightly. Here we describe these differences in detail, and highlight some useful properties of the functions used in neural networks.

First, when we refer to convolution in the context of neural networks, we usually actually mean an operation that consists of many applications of convolution in parallel. This is because convolution with a single kernel can only extract one kind of feature, albeit at many spatial locations. Usually we want each layer of our network to extract many kinds of features, at many locations.

Additionally, the input is usually not just a grid of real values. Rather, it is a
grid of vector-valued observations. For example, a color image has a red, green and blue intensity at each pixel. In a multilayer convolutional network, the input to the second layer is the output of the first layer, which usually has the output of many different convolutions at each position. When working with images, we usually think of the input and output of the convolution as being 3-D tensors, with one index into the different channels and two indices into the spatial coordinates of each channel. Software implementations usually work in batch mode, so they will actually use 4-D tensors, with the fourth axis indexing different examples in the batch, but we will omit the batch axis in our description here for simplicity.
linear operations they are based on are not guaranteed to be commutative, even if kernel-flipping is used. These multi-channel operations are only commutative if each operation has the same number of output channels as input channels.

Assume we have a 4-D kernel tensor K with element K_{i,j,k,l} giving the connection strength between a unit in channel i of the output and a unit in channel j of the input, with an offset of k rows and l columns between the output unit and the input unit. Assume our input consists of observed data V with element V_{i,j,k} giving the value of the input unit within channel i at row j and column k. Assume our output consists of Z with the same format as V. If Z is produced by convolving K across V without flipping K, then

    Z_{i,j,k} = \sum_{l,m,n} V_{l, j+m-1, k+n-1} K_{i,l,m,n}        (9.7)

where the summation over l, m and n is over all values for which the tensor indexing operations inside the summation are valid.
In linear algebra notation, we index into arrays using a 1 for the first entry. This necessitates the −1 in the above formula. Programming languages such as C and Python index starting from 0, rendering the above expression even simpler.

We may want to skip over some positions of the kernel in order to reduce the computational cost (at the expense of not extracting our features as finely). We can think of this as downsampling the output of the full convolution function. If we want to sample only every s pixels in each direction in the output, then we can define a downsampled convolution function c such that

    Z_{i,j,k} = c(K, V, s)_{i,j,k} = \sum_{l,m,n} V_{l, (j-1) \times s + m, (k-1) \times s + n} K_{i,l,m,n}.        (9.8)

We refer to s as the stride of this downsampled convolution. It is also possible to define a separate stride for each direction of motion. See Fig. 9.12 for an illustration.
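The index arithmetic of Eqs. 9.7 and 9.8 can be sketched as explicit loops. This is a minimal 0-based NumPy illustration of our own (the function name and the restriction to "valid" kernel positions are our assumptions, not the book's definition); setting s = 1 recovers the plain multi-channel convolution of Eq. 9.7:

```python
import numpy as np

def conv2d_strided(K, V, s):
    """Downsampled (strided) multi-channel convolution, as in Eq. 9.8,
    written with 0-based indices so the -1 offsets disappear.

    K: 4-D kernel, K[i, l, m, n] connects output channel i to input
       channel l with row offset m and column offset n (no flipping).
    V: 3-D input, V[l, r, c] is input channel l at row r, column c.
    s: stride; s = 1 gives the plain convolution of Eq. 9.7.
    """
    out_ch, in_ch, kh, kw = K.shape
    in_ch_v, H, W = V.shape
    assert in_ch == in_ch_v
    out_h = (H - kh) // s + 1   # only positions where the kernel fits
    out_w = (W - kw) // s + 1
    Z = np.zeros((out_ch, out_h, out_w))
    for i in range(out_ch):
        for j in range(out_h):
            for k in range(out_w):
                # sum over input channels and kernel offsets l, m, n
                patch = V[:, j * s : j * s + kh, k * s : k * s + kw]
                Z[i, j, k] = np.sum(patch * K[i])
    return Z
```

Real frameworks use vectorized or FFT-based implementations; the loops here only mirror the index arithmetic of the equations.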
[Figure 9.12 diagram: (Top) "Strided convolution" panel mapping inputs x1–x5 directly to outputs s1–s3; (Bottom) "Convolution" panel producing z1–z5, followed by "Downsampling" to s1–s3.]
Figure 9.12: Convolution with a stride. In this example, we use a stride of two. (Top) Convolution with a stride length of two implemented in a single operation. (Bottom) Convolution with a stride greater than one pixel is mathematically equivalent to convolution with unit stride followed by downsampling. Obviously, the two-step approach involving downsampling is computationally wasteful, because it computes many values that are then discarded.
One essential feature of any convolutional network implementation is the ability to implicitly zero-pad the input V in order to make it wider. Without this feature, the width of the representation shrinks by one pixel less than the kernel width at each layer. Zero padding the input allows us to control the kernel width and the size of the output independently. Without zero padding, we are forced to choose between shrinking the spatial extent of the network rapidly and using small kernels—both scenarios that significantly limit the expressive power of the network. See Fig. 9.13 for an example.

Three special cases of the zero-padding setting are worth mentioning. One is the extreme case in which no zero-padding is used whatsoever, and the convolution kernel is only allowed to visit positions where the entire kernel is contained entirely within the image. In MATLAB terminology, this is called valid convolution.
In this case, all pixels in the output are a function of the same number of pixels in the input, so the behavior of an output pixel is somewhat more regular. However, the size of the output shrinks at each layer. If the input image has width m and the kernel has width k, the output will be of width m − k + 1. The rate of this shrinkage can be dramatic if the kernels used are large. Since the shrinkage is greater than 0, it limits the number of convolutional layers that can be included in the network. As layers are added, the spatial dimension of the network will eventually drop to 1 × 1, at which point additional layers cannot meaningfully be considered convolutional. Another special case of the zero-padding setting is when just enough zero-padding is added to keep the size of the output equal to the size of the input. MATLAB calls this same convolution.
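The output-width arithmetic of these padding regimes can be checked for 1-D signals with NumPy's np.convolve, which exposes the same three MATLAB-style modes discussed in this section (a quick illustrative check of our own, using the sixteen-pixel input and width-six kernel of Fig. 9.13):

```python
import numpy as np

# Output widths of the three zero-padding regimes for a 1-D signal
# of width m = 16 and a kernel of width k = 6.
m, k = 16, 6
signal = np.ones(m)
kernel = np.ones(k)

valid = np.convolve(signal, kernel, mode='valid')  # width m - k + 1 = 11
same  = np.convolve(signal, kernel, mode='same')   # width m         = 16
full  = np.convolve(signal, kernel, mode='full')   # width m + k - 1 = 21

print(len(valid), len(same), len(full))  # 11 16 21
```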
In this case, the network can contain as many convolutional layers as the available hardware can support, since the operation of convolution does not modify the architectural possibilities available to the next layer. However, the input pixels near the border influence fewer output pixels than the input pixels near the center. This can make the border pixels somewhat underrepresented in the model. This motivates the other extreme case, which MATLAB refers to as full convolution, in which enough zeroes are added for every pixel to be visited k times in each direction, resulting in an output image of width m + k − 1. In this case, the output pixels near the border are a function of fewer pixels than the output pixels near the center. This can
make it difficult to learn a single kernel that performs well at all positions in the convolutional feature map. Usually the optimal amount of zero padding (in terms of test set classification accuracy) lies somewhere between "valid" and "same" convolution.

In some cases, we do not actually want to use convolution, but rather locally connected layers (LeCun, 1986, 1989). In this case, the adjacency matrix in the graph of our MLP is the same, but every connection has its own weight, specified
Figure 9.13: The effect of zero padding on network size: Consider a convolutional network with a kernel of width six at every layer. In this example, we do not use any pooling, so only the convolution operation itself shrinks the network size. (Top) In this convolutional network, we do not use any implicit zero padding. This causes the representation to shrink by five pixels at each layer. Starting from an input of sixteen pixels, we are only able to have three convolutional layers, and the last layer does not ever move the kernel, so arguably only two of the layers are truly convolutional. The rate of shrinking can be mitigated by using smaller kernels, but smaller kernels are less expressive and some shrinking is inevitable in this kind of architecture. (Bottom) By adding five implicit zeroes to each layer, we prevent the representation from shrinking with depth. This allows us to make an arbitrarily deep convolutional network.
by a 6-D tensor W. The indices into W are respectively: i, the output channel, j, the output row, k, the output column, l, the input channel, m, the row offset within the input, and n, the column offset within the input. The linear part of a locally connected layer is then given by

    Z_{i,j,k} = \sum_{l,m,n} [V_{l, j+m-1, k+n-1} w_{i,j,k,l,m,n}].        (9.9)

This is sometimes also called unshared convolution, because it is a similar operation to discrete convolution with a small kernel, but without sharing parameters across locations. Fig. 9.14 compares local connections, convolution, and full connections.

Locally connected layers are useful when we know that each feature should be a function of a small part of space, but there is no reason to think that the same feature should occur across all of space. For example, if we want to tell if an image
is a picture of a face, we only need to look for the mouth in the bottom half of the image.

It can also be useful to make versions of convolution or locally connected layers in which the connectivity is further restricted, for example to constrain that each output channel i be a function of only a subset of the input channels l. A common way to do this is to make the first m output channels connect to only the first n input channels, the second m output channels connect to only the second n input channels, and so on. See Fig. 9.15 for an example. Modeling interactions between few channels allows the network to have fewer parameters in order to reduce memory consumption and increase statistical efficiency, and also reduces the amount of computation needed to perform forward and back-propagation. It accomplishes these goals without reducing the number of hidden units.

Tiled convolution (Gregor and LeCun, 2010a; Le et al., 2010) offers a compromise between a convolutional layer and a locally connected layer.
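The unshared convolution of Eq. 9.9 can be sketched with the same kind of loops as ordinary convolution, now with one kernel per output position (a 0-based NumPy sketch of our own; the function name is hypothetical):

```python
import numpy as np

def unshared_conv2d(w, V):
    """Locally connected layer, as in Eq. 9.9, with 0-based indices.

    w: 6-D weight tensor, w[i, j, k, l, m, n] -- a separate kernel
       for every output position (j, k), so no parameter sharing.
    V: 3-D input, V[l, r, c].
    """
    out_ch, out_h, out_w, in_ch, kh, kw = w.shape
    Z = np.zeros((out_ch, out_h, out_w))
    for i in range(out_ch):
        for j in range(out_h):
            for k in range(out_w):
                # each (j, k) uses its own weights w[i, j, k]
                patch = V[:, j : j + kh, k : k + kw]
                Z[i, j, k] = np.sum(patch * w[i, j, k])
    return Z
```

Note that w has out_h × out_w times as many parameters as the shared 4-D kernel of Eq. 9.7 for the same connectivity pattern.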
Rather than learning Tile d c onvolution ( Gregor and LeCun , 2010a ; Le et al. , 2010 compromise a separate set of weigh eights ts at every spatial lo location, cation, we learn )a offers set ofakernels that b et een a through convolutional er and a lo cally connected lay er. Rather than learning w ew rotate as wlay e mo e through space. This means that immediately mov v a separate of weigh ts ha atvevery spatial lo cation, learn set of kernels neigh neighb b oringset lo locations cations will hav e different filters, lik likee inwaelo locally cally aconnected lay layer, er,that but w e rotate through as w e mo v e through space. This means that immediately the memory requirements for storing the parameters will increase only by a factor neigh cations have different likesize in aoflothe callyen connected layfeature er, but of theb oring size oflothis set will of kernels, rather filters, than the entire tire output the memory requirements for storing of thelo parameters will increase onlycon byvaolution, factor map. See Fig. 9.16 for a comparison locally cally connected lay layers, ers, tiled conv of the size of this of kernels, rather than the size of the entire output feature and standard con conv vset olution. map. See Fig. 9.16 for a comparison of lo cally connected layers, tiled convolution, o define tiled conv convolution olution algebraically algebraically,, let k b e a 6-D tensor, where two of andTstandard convolution. the dimensions corresp correspond ond to differen differentt lo locations cations in the output map. Rather than k b e amap, T o define tiled conv olution algebraically , let 6-D output tensor,lo where o of ha having ving a separate index for eac each h lo location cation in the output locations cationstwcycle the dimensions ond differen lo cations map.If Rather than t differen t is equal through a set ofcorresp different t cto hoices of kternel stack in in the eachoutput direction. 
[Figure 9.14 diagram: three panels over inputs x1–x5 and outputs s1–s5; edges labeled a–i in the top (locally connected) panel, alternating a and b in the center (convolutional) panel, and unlabeled dense edges in the bottom (fully connected) panel.]
Figure 9.14: Comparison of local connections, convolution, and full connections. (Top) A locally connected layer with a patch size of two pixels. Each edge is labeled with a unique letter to show that each edge is associated with its own weight parameter. (Center) A convolutional layer with a kernel width of two pixels. This model has exactly the same connectivity as the locally connected layer. The difference lies not in which units interact with each other, but in how the parameters are shared. The locally connected layer has no parameter sharing. The convolutional layer uses the same two weights repeatedly across the entire input, as indicated by the repetition of the letters labeling each edge. (Bottom) A fully connected layer resembles a locally connected layer in the sense that each edge has its own parameter (there are too many to label explicitly with letters in this diagram). However, it does not have the restricted connectivity of the locally connected layer.
[Figure 9.15 diagram: an input tensor mapped to an output tensor, with channel coordinates and spatial coordinates indicated.]
Figure 9.15: A convolutional network with the first two output channels connected to only the first two input channels, and the second two output channels connected to only the second two input channels.
[Figure 9.16 diagram: three panels over inputs x1–x5 and outputs s1–s5; edges labeled a–i in the top (locally connected) panel, cycling through a, b, c, d in the center (tiled convolution) panel, and alternating a and b in the bottom (standard convolution) panel.]
Figure 9.16: A comparison of locally connected layers, tiled convolution, and standard convolution. All three have the same sets of connections between units, when the same size of kernel is used. This diagram illustrates the use of a kernel that is two pixels wide. The differences between the methods lie in how they share parameters. (Top) A locally connected layer has no sharing at all. We indicate that each connection has its own weight by labeling each connection with a unique letter. (Center) Tiled convolution has a set of t different kernels. Here we illustrate the case of t = 2. One of these kernels has edges labeled "a" and "b," while the other has edges labeled "c" and "d." Each time we move one pixel to the right in the output, we move on to using a different kernel. This means that, like the locally connected layer, neighboring units in the output have different parameters. Unlike the locally connected layer, after we have gone through all t available kernels, we cycle back to the first kernel. If two output units are separated by a multiple of t steps, then they share parameters. (Bottom) Traditional convolution is equivalent to tiled convolution with t = 1. There is only one kernel and it is applied everywhere, as indicated in the diagram by using the kernel with weights labeled "a" and "b" everywhere.
the output width, this is the same as a locally connected layer.

    Z_{i,j,k} = \sum_{l,m,n} V_{l, j+m-1, k+n-1} K_{i,l,m,n, j%t+1, k%t+1},        (9.10)

where % is the modulo operation, with t % t = 0, (t + 1) % t = 1, etc. It is straightforward to generalize this equation to use a different tiling range for each dimension.

Both locally connected layers and tiled convolutional layers have an interesting interaction with max-pooling: the detector units of these layers are driven by different filters. If these filters learn to detect different transformed versions of the same underlying features, then the max-pooled units become invariant to the learned transformation (see Fig. 9.9). Convolutional layers are hard-coded to be invariant specifically to translation.

Other operations besides convolution are usually necessary to implement a convolutional network.
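The tiled convolution of Eq. 9.10 can be sketched as 0-based loops (our own illustrative code, assuming NumPy and a square tiling range; t = 1 reduces to standard convolution, and t equal to the output width gives a locally connected layer):

```python
import numpy as np

def tiled_conv2d(K, V):
    """Tiled convolution, a 0-based sketch of Eq. 9.10.

    K: 6-D tensor K[i, l, m, n, p, q]; the last two axes select which
       of the t kernel stacks is used, cycling with period t in each
       spatial direction of the output.
    V: 3-D input, V[l, r, c].
    """
    out_ch, in_ch, kh, kw, t, t2 = K.shape
    assert t == t2, "square tiling range assumed in this sketch"
    _, H, W = V.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    Z = np.zeros((out_ch, out_h, out_w))
    for i in range(out_ch):
        for j in range(out_h):
            for k in range(out_w):
                # cycle through the t x t kernels as we move in space
                kern = K[i, :, :, :, j % t, k % t]
                Z[i, j, k] = np.sum(V[:, j : j + kh, k : k + kw] * kern)
    return Z
```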
To perform learning, one must be able to compute the gradient with respect to the kernel, given the gradient with respect to the outputs. In some simple cases, this operation can be performed using the convolution operation, but many cases of interest, including the case of stride greater than 1, do not have this property.

Recall that convolution is a linear operation and can thus be described as a matrix multiplication (if we first reshape the input tensor into a flat vector). The matrix involved is a function of the convolution kernel. The matrix is sparse and each element of the kernel is copied to several elements of the matrix. This view helps us to derive some of the other operations needed to implement a convolutional network.

Multiplication by the transpose of the matrix defined by convolution is one such operation.
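This matrix view can be made concrete in 1-D (a minimal sketch of our own; like the rest of this chapter, it uses convolution without kernel flipping, i.e. cross-correlation):

```python
import numpy as np

def conv_matrix(kernel, m):
    """Build the matrix G such that G @ v equals the 1-D 'valid'
    cross-correlation of a length-m vector v with the kernel.
    Each row is a shifted copy of the kernel, so every kernel
    element is copied to several elements of the matrix.
    """
    k = len(kernel)
    G = np.zeros((m - k + 1, m))
    for row in range(m - k + 1):
        G[row, row : row + k] = kernel
    return G

kernel = np.array([1.0, 2.0, 3.0])
v = np.random.randn(6)
G = conv_matrix(kernel, len(v))

# Forward pass as a matrix product:
z = G @ v
assert np.allclose(z, np.correlate(v, kernel, mode='valid'))

# Back-propagating dL/dz to dL/dv is multiplication by the transpose:
grad_z = np.random.randn(len(z))
grad_v = G.T @ grad_z   # same size as the input v
```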
This is the op operation eration needed to bac back-propagate k-propagate error deriv derivatives atives Multiplication b y the transp ose of the matrix defined by conv olution is one through a con conv volutional lay layer, er, so it is needed to train con conv volutional netw networks orks suc h op eration. This is the op eration needed to bac k-propagate error deriv atives that ha hav ve more than one hidden la lay yer. This same op operation eration is also needed if we through a convolutional layer,units so itfrom is needed to train con volutional netw orks wish to reconstruct the visible the hidden units (Simard et al. , 1992 ). that ha v e more than one hidden la y er. This same op eration is also needed if we Reconstructing the visible units is an op operation eration commonly used in the mo models dels wish to reconstruct the visible units from the hidden units ( Simard et al. , 1992 ). describ described ed in Part I I I of this b ook, such as auto autoenco enco encoders, ders, RBMs, and sparse co coding. ding. Reconstructing the visible units is an op eration commonly used in the mo dels Transp ranspose ose conv convolution olution is necessary to construct conv convolutional olutional versions of those describ in ePart I I of this b ook, op such as auto enco ders,gradient RBMs, and sparse can co ding. mo models. dels.edLik Like the Ikernel gradient operation, eration, this input op operation eration be T ransp ose conv olution is necessary to construct conv olutional versions of those implemen implemented ted using a conv convolution olution in some cases, but in the general case requires mo dels. Lik e the kernel gradient op eration, input eration canthis be a third op operation eration to b e implemented. Carethis must b e gradient tak taken en toopco coordinate ordinate implemen ted using a conv olution in some cases, but in the general case requires transp transpose ose op operation eration with the forw forward ard propagation. 
The size of the output that the transpose operation should return depends on the zero padding policy and stride of
CHAPTER 9. CONVOLUTIONAL NETWORKS
the forward propagation operation, as well as the size of the forward propagation's output map. In some cases, multiple sizes of input to forward propagation can result in the same size of output map, so the transpose operation must be explicitly told what the size of the original input was.

These three operations—convolution, backprop from output to weights, and backprop from output to inputs—are sufficient to compute all of the gradients needed to train any depth of feedforward convolutional network, as well as to train convolutional networks with reconstruction functions based on the transpose of convolution. See Goodfellow (2010) for a full derivation of the equations in the fully general multi-dimensional, multi-example case.
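The matrix view of convolution introduced above can be illustrated in one dimension. The sketch below (illustrative numpy code, not from the book) builds the sparse banded matrix corresponding to a 1-D kernel, checks that multiplying by it reproduces convolution, and shows that multiplying by its transpose maps an output-sized gradient back to the input's shape:

```python
import numpy as np

def conv_matrix(k, n):
    # Build the sparse (banded) matrix C such that C @ v equals the
    # "valid" cross-correlation of a length-n vector v with kernel k.
    # Each kernel element is copied into several entries of C.
    m = n - len(k) + 1
    C = np.zeros((m, n))
    for i in range(m):
        C[i, i:i + len(k)] = k
    return C

k = np.array([1.0, 2.0, -1.0])
v = np.arange(5.0)
C = conv_matrix(k, len(v))
assert np.allclose(C @ v, np.correlate(v, k, mode="valid"))

# Multiplication by C.T maps an output-sized gradient back to the
# input's shape -- the operation used to back-propagate through the layer.
g_out = np.ones(C.shape[0])
g_in = C.T @ g_out
assert g_in.shape == v.shape
```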
To give a sense of how these equations work, we present the two dimensional, single example version here.

Suppose we want to train a convolutional network that incorporates strided convolution of kernel stack K applied to multi-channel image V with stride s as defined by c(K, V, s) as in Eq. 9.8. Suppose we want to minimize some loss function J(V, K). During forward propagation, we will need to use c itself to output Z, which is then propagated through the rest of the network and used to compute the cost function J. During back-propagation, we will receive a tensor G such that

G_{i,j,k} = \frac{\partial}{\partial Z_{i,j,k}} J(V, K).

To train the network, we need to compute the derivatives with respect to the weights in the kernel. To do so, we can use a function

g(G, V, s)_{i,j,k,l} = \frac{\partial}{\partial K_{i,j,k,l}} J(V, K) = \sum_{m,n} G_{i,m,n} V_{j,(m-1)\times s+k,(n-1)\times s+l}.    (9.11)
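Eq. 9.11 can be implemented directly as nested loops. The following is a minimal numpy sketch (the function names c and g follow the text; the 0-indexed access V[j, m*s + k, n*s + l] corresponds to the 1-indexed V_{j,(m-1)×s+k,(n-1)×s+l} above), with a finite-difference check on one kernel entry:

```python
import numpy as np

def c(K, V, s):
    # Strided convolution (cross-correlation convention) as in Eq. 9.8.
    # K: kernel stack (out channels, in channels, kh, kw); V: image (in channels, H, W).
    out_ch, in_ch, kh, kw = K.shape
    _, H, W = V.shape
    Z = np.zeros((out_ch, (H - kh) // s + 1, (W - kw) // s + 1))
    for i in range(Z.shape[0]):
        for m in range(Z.shape[1]):
            for n in range(Z.shape[2]):
                Z[i, m, n] = np.sum(K[i] * V[:, m * s:m * s + kh, n * s:n * s + kw])
    return Z

def g(G, V, s, K_shape):
    # Kernel gradient, Eq. 9.11 (0-indexed: V[j, m*s + k, n*s + l]).
    dK = np.zeros(K_shape)
    out_ch, oh, ow = G.shape
    in_ch, kh, kw = K_shape[1:]
    for i in range(out_ch):
        for j in range(in_ch):
            for k in range(kh):
                for l in range(kw):
                    for m in range(oh):
                        for n in range(ow):
                            dK[i, j, k, l] += G[i, m, n] * V[j, m * s + k, n * s + l]
    return dK

rng = np.random.default_rng(0)
V = rng.standard_normal((2, 6, 6))
K = rng.standard_normal((3, 2, 3, 3))
s = 2
Z = c(K, V, s)
G = np.ones_like(Z)                # gradient of the toy loss J = Z.sum()
dK = g(G, V, s, K.shape)

# Finite-difference check on one kernel entry.
eps = 1e-6
Kp = K.copy()
Kp[0, 0, 1, 1] += eps
assert abs((c(Kp, V, s).sum() - Z.sum()) / eps - dK[0, 0, 1, 1]) < 1e-4
```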
If this layer is not the bottom layer of the network, we will need to compute the gradient with respect to V in order to back-propagate the error farther down. To do so, we can use a function

h(K, G, s)_{i,j,k} = \frac{\partial}{\partial V_{i,j,k}} J(V, K)    (9.12)
    = \sum_{\substack{l,m \;\mathrm{s.t.}\; (l-1)\times s+m=j}} \; \sum_{\substack{n,p \;\mathrm{s.t.}\; (n-1)\times s+p=k}} \; \sum_q K_{q,i,m,p} G_{q,l,n}.    (9.13)

Autoencoder networks, described in Chapter 14, are feedforward networks trained to copy their input to their output. A simple example is the PCA algorithm, that copies its input x to an approximate reconstruction r using the function W^\top W x. It is common for more general autoencoders to use multiplication by the transpose of the weight matrix just as PCA does. To make such models
convolutional, we can use the function h to perform the transpose of the convolution operation. Suppose we have hidden units H in the same format as Z and we define a reconstruction

R = h(K, H, s).    (9.14)

In order to train the autoencoder, we will receive the gradient with respect to R as a tensor E. To train the decoder, we need to obtain the gradient with respect to K. This is given by g(H, E, s). To train the encoder, we need to obtain the gradient with respect to H. This is given by c(K, E, s). It is also possible to differentiate through g using c and h, but these operations are not needed for the back-propagation algorithm on any standard network architectures.

Generally, we do not use only a linear operation in order to transform from the inputs to the outputs in a convolutional layer. We generally also add some bias term to each output before applying the nonlinearity.
This raises the question of how to share parameters among the biases. For locally connected layers it is natural to give each unit its own bias, and for tiled convolution, it is natural to share the biases with the same tiling pattern as the kernels. For convolutional layers, it is typical to have one bias per channel of the output and share it across all locations within each convolution map. However, if the input is of known, fixed size, it is also possible to learn a separate bias at each location of the output map. Separating the biases may slightly reduce the statistical efficiency of the model, but also allows the model to correct for differences in the image statistics at different locations.
For example, when using implicit zero padding, detector units at the edge of the image receive less total input and may need larger biases.
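The bias-sharing options described above can be sketched as follows (illustrative numpy code; the shapes and the ReLU nonlinearity are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((16, 8, 8))   # pre-activations: (channels, rows, cols)

# Standard convolutional layer: one bias per output channel, broadcast
# across all spatial locations of that channel's feature map.
b_channel = rng.standard_normal((16, 1, 1))
A = np.maximum(0.0, Z + b_channel)    # ReLU nonlinearity as an example

# For inputs of known, fixed size: a separate bias at each location,
# e.g. letting edge units (which receive less total input under zero
# padding) learn larger biases.
b_location = rng.standard_normal((16, 8, 8))
A_fixed = np.maximum(0.0, Z + b_location)
assert A.shape == A_fixed.shape == (16, 8, 8)
```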
9.6
Structured Outputs
Convolutional networks can be used to output a high-dimensional, structured object, rather than just predicting a class label for a classification task or a real value for a regression task. Typically this object is just a tensor, emitted by a standard convolutional layer. For example, the model might emit a tensor S, where S_{i,j,k} is the probability that pixel (j, k) of the input to the network belongs to class i. This allows the model to label every pixel in an image and draw precise masks that follow the outlines of individual objects.

One issue that often comes up is that the output plane can be smaller than the input plane, as shown in Fig. 9.13. In the kinds of architectures typically used for classification of a single object in an image, the greatest reduction in the spatial dimensions of the network comes from using pooling layers with large stride. In
CHAPTER 9. CONVOLUTIONAL NETWORKS
[Figure 9.17 diagram: the input X feeds hidden representations H(1), H(2), H(3) through kernels U; each hidden representation produces a label estimate Ŷ(1), Ŷ(2), Ŷ(3) through kernels V; each estimate feeds back into the next hidden representation through kernels W.]
Figure 9.17: An example of a recurrent convolutional network for pixel labeling. The input is an image tensor X, with axes corresponding to image rows, image columns, and channels (red, green, blue). The goal is to output a tensor of labels Ŷ, with a probability distribution over labels for each pixel. This tensor has axes corresponding to image rows, image columns, and the different classes. Rather than outputting Ŷ in a single shot, the recurrent network iteratively refines its estimate Ŷ by using a previous estimate of Ŷ as input for creating a new estimate. The same parameters are used for each updated estimate, and the estimate can be refined as many times as we wish. The tensor of convolution kernels U is used on each step to compute the hidden representation given the input image.
The kernel tensor V is used to produce an estimate of the labels given the hidden values. On all but the first step, the kernels W are convolved over Ŷ to provide input to the hidden layer. On the first time step, this term is replaced by zero. Because the same parameters are used on each step, this is an example of a recurrent network, as described in Chapter 10.
order to produce an output map of similar size as the input, one can avoid pooling altogether (Jain et al., 2007). Another strategy is to simply emit a lower-resolution grid of labels (Pinheiro and Collobert, 2014, 2015). Finally, in principle, one could use a pooling operator with unit stride.

One strategy for pixel-wise labeling of images is to produce an initial guess of the image labels, then refine this initial guess using the interactions between neighboring pixels. Repeating this refinement step several times corresponds to using the same convolutions at each stage, sharing weights between the last layers of the deep net (Jain et al., 2007). This makes the sequence of computations performed by the successive convolutional layers with weights shared across layers a particular kind of recurrent network (Pinheiro and Collobert, 2014, 2015). Fig.
9.17 shows the architecture of such a recurrent convolutional network.

Once a prediction for each pixel is made, various methods can be used to further process these predictions in order to obtain a segmentation of the image into regions (Briggman et al., 2009; Turaga et al., 2010; Farabet et al., 2013).
The general idea is to assume that large groups of contiguous pixels tend to be associated with the same label. Graphical models can describe the probabilistic relationships between neighboring pixels. Alternatively, the convolutional network can be trained to maximize an approximation of the graphical model training objective (Ning et al., 2005; Thompson et al., 2014).
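The iterative refinement scheme of Fig. 9.17 can be sketched as follows (illustrative numpy/scipy code; the function names, layer sizes, and tanh nonlinearity are arbitrary choices, not from the book):

```python
import numpy as np
from scipy.signal import correlate

def conv_same(kernels, x):
    # Multi-channel "same" convolution: kernels (out_ch, in_ch, kh, kw),
    # x (in_ch, H, W) -> output (out_ch, H, W).
    return np.stack([
        sum(correlate(x[c], kernels[o, c], mode="same") for c in range(x.shape[0]))
        for o in range(kernels.shape[0])
    ])

def softmax(a):
    # Per-pixel probability distribution over classes (axis 0).
    e = np.exp(a - a.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def refine(X, U, V, W, steps=3):
    # U reads the image, W reads the previous label estimate (zero on the
    # first step), V maps hidden features to per-pixel class probabilities.
    # The same parameters are reused on every step, making this recurrent.
    Y = None
    for _ in range(steps):
        H = conv_same(U, X) + (conv_same(W, Y) if Y is not None else 0.0)
        Y = softmax(conv_same(V, np.tanh(H)))
    return Y

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 8, 8)) * 0.1          # RGB input image
U = rng.standard_normal((4, 3, 3, 3)) * 0.1       # image -> hidden
W = rng.standard_normal((4, 5, 3, 3)) * 0.1       # previous labels -> hidden
V = rng.standard_normal((5, 4, 3, 3)) * 0.1       # hidden -> 5 classes
Y = refine(X, U, V, W)
assert Y.shape == (5, 8, 8)
assert np.allclose(Y.sum(axis=0), 1.0)
```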
9.7
Data Types
The data used with a convolutional network usually consists of several channels, each channel being the observation of a different quantity at some point in space or time. See Table 9.1 for examples of data types with different dimensionalities and number of channels.

For an example of convolutional networks applied to video, see Chen et al. (2010).

So far we have discussed only the case where every example in the train and test data has the same spatial dimensions. One advantage to convolutional networks is that they can also process inputs with varying spatial extents. These kinds of input simply cannot be represented by traditional, matrix multiplication-based neural networks. This provides a compelling reason to use convolutional networks even when computational cost and overfitting are not significant issues.

For example, consider a collection of images, where each image has a different width and height.
It is unclear how to model such inputs with a weight matrix of fixed size. Convolution is straightforward to apply; the kernel is simply applied a different number of times depending on the size of the input, and the output of the convolution operation scales accordingly. Convolution may be viewed as matrix multiplication; the same convolution kernel induces a different size of doubly block circulant matrix for each size of input. Sometimes the output of the network is allowed to have variable size as well as the input, for example if we want to assign a class label to each pixel of the input. In this case, no further design work is necessary.
In other cases, the network must produce some fixed-size output, for example if we want to assign a single class label to the entire image. In this case we must make some additional design steps, like inserting a pooling layer whose pooling regions scale in size proportional to the size of the input, in order to maintain a fixed number of pooled outputs. Some examples of this kind of strategy are shown in Fig. 9.11.

Note that the use of convolution for processing variable sized inputs only makes sense for inputs that have variable size because they contain varying amounts
Table 9.1: Examples of different formats of data that can be used with convolutional networks.

1-D, single channel. Audio waveform: The axis we convolve over corresponds to time. We discretize time and measure the amplitude of the waveform once per time step.

1-D, multi-channel. Skeleton animation data: Animations of 3-D computer-rendered characters are generated by altering the pose of a "skeleton" over time. At each point in time, the pose of the character is described by a specification of the angles of each of the joints in the character's skeleton. Each channel in the data we feed to the convolutional model represents the angle about one axis of one joint.

2-D, single channel. Audio data that has been preprocessed with a Fourier transform: We can transform the audio waveform into a 2D tensor with different rows corresponding to different frequencies and different columns corresponding to different points in time. Using convolution over the time axis makes the model equivariant to shifts in time. Using convolution across the frequency axis makes the model equivariant to frequency, so that the same melody played in a different octave produces the same representation but at a different height in the network's output.

2-D, multi-channel. Color image data: One channel contains the red pixels, one the green pixels, and one the blue pixels. The convolution kernel moves over both the horizontal and vertical axes of the image, conferring translation equivariance in both directions.

3-D, single channel. Volumetric data: A common source of this kind of data is medical imaging technology, such as CT scans.

3-D, multi-channel. Color video data: One axis corresponds to time, one to the height of the video frame, and one to the width of the video frame.
of observation of the same kind of thing—different lengths of recordings over time, different widths of observations over space, etc. Convolution does not make sense if the input has variable size because it can optionally include different kinds of observations. For example, if we are processing college applications, and our features consist of both grades and standardized test scores, but not every applicant took the standardized test, then it does not make sense to convolve the same weights over both the features corresponding to the grades and the features corresponding to the test scores.
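The point that one kernel applies to inputs of any spatial extent, with the output size scaling accordingly, can be demonstrated directly (illustrative scipy code):

```python
import numpy as np
from scipy.signal import correlate2d

# One 3x3 kernel, images of several different sizes: the kernel is simply
# applied a different number of times, and the output scales accordingly.
k = np.random.default_rng(0).standard_normal((3, 3))
for shape in [(8, 8), (11, 17), (32, 5)]:
    img = np.random.default_rng(1).standard_normal(shape)
    out = correlate2d(img, k, mode="valid")
    assert out.shape == (shape[0] - 2, shape[1] - 2)
```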
9.8
Efficient Convolution Algorithms
Modern convolutional network applications often involve networks containing more than one million units. Powerful implementations exploiting parallel computation resources, as discussed in Sec. 12.1, are essential. However, in many cases it is also possible to speed up convolution by selecting an appropriate convolution algorithm.

Convolution is equivalent to converting both the input and the kernel to the frequency domain using a Fourier transform, performing point-wise multiplication of the two signals, and converting back to the time domain using an inverse Fourier transform. For some problem sizes, this can be faster than the naive implementation of discrete convolution.

When a d-dimensional kernel can be expressed as the outer product of d vectors, one vector per dimension, the kernel is called separable. When the kernel is separable, naive convolution is inefficient.
It is equivalent to compose d one-dimensional convolutions with each of these vectors. The composed approach is significantly faster than performing one d-dimensional convolution with their outer product. The kernel also takes fewer parameters to represent as vectors. If the kernel is w elements wide in each dimension, then naive multidimensional convolution requires O(w^d) runtime and parameter storage space, while separable convolution requires O(w × d) runtime and parameter storage space. Of course, not every convolution can be represented in this way.

Devising faster ways of performing convolution or approximate convolution without harming the accuracy of the model is an active area of research.
Even techniques that improve the efficiency of only forward propagation are useful because in the commercial setting, it is typical to devote more resources to deployment of a network than to its training.
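Both speedups described in this section can be verified numerically. The sketch below (illustrative numpy code, not from the book) checks that FFT-based convolution matches direct convolution, and that a separable 2-D kernel can be applied as two 1-D passes:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
k = rng.standard_normal(8)

# FFT route: pointwise multiplication in the frequency domain, padded to
# the full output length so circular convolution matches linear convolution.
n = len(x) + len(k) - 1
fft_conv = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(k, n), n)
assert np.allclose(fft_conv, np.convolve(x, k, mode="full"))

# Separable route: a 2-D kernel that is an outer product of two vectors can
# be applied as two 1-D passes, O(w x d) work per output instead of O(w^d).
a, b = rng.standard_normal(5), rng.standard_normal(5)
K2 = np.outer(a, b)
img = rng.standard_normal((20, 20))
direct = np.array([[np.sum(img[i:i + 5, j:j + 5] * K2) for j in range(16)]
                   for i in range(16)])
# np.convolve flips its kernel, so flip the vectors to get correlation.
rows = np.array([np.convolve(r, b[::-1], mode="valid") for r in img])
sep = np.array([np.convolve(c, a[::-1], mode="valid") for c in rows.T]).T
assert np.allclose(direct, sep)
```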
CHAPTER 9. CONVOLUTIONAL NETWORKS
9.9 Random or Unsupervised Features

Typically, the most expensive part of convolutional network training is learning the features. The output layer is usually relatively inexpensive due to the small number of features provided as input to this layer after passing through several layers of pooling. When performing supervised training with gradient descent, every gradient step requires a complete run of forward propagation and backward propagation through the entire network. One way to reduce the cost of convolutional network training is to use features that are not trained in a supervised fashion.

There are three basic strategies for obtaining convolution kernels without supervised training. One is to simply initialize them randomly. Another is to design them by hand, for example by setting each kernel to detect edges at a certain orientation or scale.
Finally, one can learn the kernels with an unsupervised criterion. For example, Coates et al. (2011) apply k-means clustering to small image patches, then use each learned centroid as a convolution kernel. Part III describes many more unsupervised learning approaches. Learning the features with an unsupervised criterion allows them to be determined separately from the classifier layer at the top of the architecture. One can then extract the features for the entire training set just once, essentially constructing a new training set for the last layer. Learning the last layer is then typically a convex optimization problem, assuming the last layer is something like logistic regression or an SVM.

Random filters often work surprisingly well in convolutional networks (Jarrett et al., 2009; Saxe et al., 2011; Pinto et al., 2011; Cox and Pinto, 2011). Saxe et al. (2011) showed that layers consisting of convolution followed by pooling naturally become frequency selective and translation invariant when assigned random weights. They argue that this provides an inexpensive way to choose the architecture of a convolutional network: first evaluate the performance of several convolutional network architectures by training only the last layer, then take the best of these architectures and train the entire architecture using a more expensive approach.

An intermediate approach is to learn the features, but using methods that do not require full forward and back-propagation at every gradient step. As with multilayer perceptrons, we use greedy layer-wise pretraining, to train the first layer in isolation, then extract all features from the first layer only once, then train the second layer in isolation given those features, and so on. Chapter 8 has described how to perform supervised greedy layer-wise pretraining, and Part III extends this to greedy layer-wise pretraining using an unsupervised criterion at each layer. The canonical example of greedy layer-wise pretraining of a convolutional model is the convolutional deep belief network (Lee et al., 2009). Convolutional networks offer
us the opportunity to take the pretraining strategy one step further than is possible with multilayer perceptrons. Instead of training an entire convolutional layer at a time, we can train a model of a small patch, as Coates et al. (2011) do with k-means. We can then use the parameters from this patch-based model to define the kernels of a convolutional layer. This means that it is possible to use unsupervised learning to train a convolutional network without ever using convolution during the training process. Using this approach, we can train very large models and incur a high computational cost only at inference time (Ranzato et al., 2007b; Jarrett et al., 2009; Kavukcuoglu et al., 2010; Coates et al., 2013). This approach was popular from roughly 2007–2013, when labeled datasets were small and computational power was more limited. Today, most convolutional networks are trained in a purely supervised fashion, using full forward and back-propagation through the entire network on each training iteration.

As with other approaches to unsupervised pretraining, it remains difficult to tease apart the cause of some of the benefits seen with this approach. Unsupervised pretraining may offer some regularization relative to supervised training, or it may simply allow us to train much larger architectures due to the reduced computational cost of the learning rule.
9.10 The Neuroscientific Basis for Convolutional Networks

Convolutional networks are perhaps the greatest success story of biologically inspired artificial intelligence. Though convolutional networks have been guided by many other fields, some of the key design principles of neural networks were drawn from neuroscience.

The history of convolutional networks begins with neuroscientific experiments long before the relevant computational models were developed. Neurophysiologists David Hubel and Torsten Wiesel collaborated for several years to determine many of the most basic facts about how the mammalian vision system works (Hubel and Wiesel, 1959, 1962, 1968). Their accomplishments were eventually recognized with a Nobel prize. Their findings that have had the greatest influence on contemporary deep learning models were based on recording the activity of individual neurons in cats. They observed how neurons in the cat's brain responded to images projected in precise locations on a screen in front of the cat. Their great discovery was that neurons in the early visual system responded most strongly to very specific patterns of light, such as precisely oriented bars, but responded hardly at all to other patterns.
Their work helped to characterize many aspects of brain function that are beyond the scope of this book. From the point of view of deep learning, we can focus on a simplified, cartoon view of brain function.

In this simplified view, we focus on a part of the brain called V1, also known as the primary visual cortex. V1 is the first area of the brain that begins to perform significantly advanced processing of visual input. In this cartoon view, images are formed by light arriving in the eye and stimulating the retina, the light-sensitive tissue in the back of the eye. The neurons in the retina perform some simple preprocessing of the image but do not substantially alter the way it is represented. The image then passes through the optic nerve and a brain region called the lateral geniculate nucleus.
The main role, as far as we are concerned here, of both of these anatomical regions is primarily just to carry the signal from the eye to V1, which is located at the back of the head.

A convolutional network layer is designed to capture three properties of V1:

1. V1 is arranged in a spatial map. It actually has a two-dimensional structure mirroring the structure of the image in the retina. For example, light arriving at the lower half of the retina affects only the corresponding half of V1. Convolutional networks capture this property by having their features defined in terms of two-dimensional maps.

2. V1 contains many simple cells. A simple cell's activity can to some extent be characterized by a linear function of the image in a small, spatially localized receptive field. The detector units of a convolutional network are designed to emulate these properties of simple cells.

3. V1 also contains many complex cells. These cells respond to features that are similar to those detected by simple cells, but complex cells are invariant
to small shifts in the position of the feature. This inspires the pooling units of convolutional networks. Complex cells are also invariant to some changes in lighting that cannot be captured simply by pooling over spatial locations. These invariances have inspired some of the cross-channel pooling strategies in convolutional networks, such as maxout units (Goodfellow et al., 2013a).

Though we know the most about V1, it is generally believed that the same basic principles apply to other areas of the visual system. In our cartoon view of the visual system, the basic strategy of detection followed by pooling is repeatedly applied as we move deeper into the brain. As we pass through multiple anatomical layers of the brain, we eventually find cells that respond to some specific concept and are invariant to many transformations of the input. These cells have been
nicknamed "grandmother cells"—the idea is that a person could have a neuron that activates when seeing an image of their grandmother, regardless of whether she appears in the left or right side of the image, whether the image is a close-up of her face or a zoomed out shot of her entire body, whether she is brightly lit, or in shadow, etc.

These grandmother cells have been shown to actually exist in the human brain, in a region called the medial temporal lobe (Quiroga et al., 2005). Researchers tested whether individual neurons would respond to photos of famous individuals. They found what has come to be called the "Halle Berry neuron": an individual neuron that is activated by the concept of Halle Berry.
This neuron fires when a person sees a photo of Halle Berry, a drawing of Halle Berry, or even text containing the words "Halle Berry." Of course, this has nothing to do with Halle Berry herself; other neurons responded to the presence of Bill Clinton, Jennifer Aniston, etc.

These medial temporal lobe neurons are somewhat more general than modern convolutional networks, which would not automatically generalize to identifying a person or object when reading its name. The closest analog to a convolutional network's last layer of features is a brain area called the inferotemporal cortex (IT). When viewing an object, information flows from the retina, through the LGN, to V1, then onward to V2, then V4, then IT. This happens within the first 100ms of glimpsing an object. If a person is allowed to continue looking at the object for more time, then information will begin to flow backwards as the brain uses top-down feedback to update the activations in the lower level brain areas.
However, if we interrupt the person's gaze, and observe only the firing rates that result from the first 100ms of mostly feedforward activation, then IT proves to be very similar to a convolutional network. Convolutional networks can predict IT firing rates, and also perform very similarly to (time limited) humans on object recognition tasks (DiCarlo, 2013).

That being said, there are many differences between convolutional networks and the mammalian vision system. Some of these differences are well known to computational neuroscientists, but outside the scope of this book. Some of these differences are not yet known, because many basic questions about how the mammalian vision system works remain unanswered. As a brief list:

• The human eye is mostly very low resolution, except for a tiny patch called the fovea.
The fovea only observes an area about the size of a thumbnail held at arms length. Though we feel as if we can see an entire scene in high resolution, this is an illusion created by the subconscious part of our brain, as it stitches together several glimpses of small areas. Most convolutional networks actually receive large full resolution photographs as input. The human brain makes
several eye movements called saccades to glimpse the most visually salient or task-relevant parts of a scene. Incorporating similar attention mechanisms into deep learning models is an active research direction. In the context of deep learning, attention mechanisms have been most successful for natural language processing, as described in Sec. 12.4.5.1. Several visual models with foveation mechanisms have been developed but so far have not become the dominant approach (Larochelle and Hinton, 2010; Denil et al., 2012).

• The human visual system is integrated with many other senses, such as hearing, and factors like our moods and thoughts. Convolutional networks so far are purely visual.

• The human visual system does much more than just recognize objects. It is able to understand entire scenes including many objects and relationships
Itfor is b et etw ween ob objects, jects, and pro processes cesses rich 3-D geometric information ablebto understand entire scenes including many ob jects and relationships • our odies to interface with the w orld. Conv Convolutional olutional netw networks orks hav havee been b et w een ob jects, and pro cesses rich 3-D geometric information needed for. applied to some of these problems but these applications are in their infancy infancy. our b odies to interface with the world. Convolutional networks have been applied to some of areas theselike problems theseimpacted applications are in their Even en simple brain V1 arebut hea heavily vily by feedback frominfancy higher. • Ev lev levels. els. Feedbac eedback k has b een explored extensiv extensively ely in neural netw network ork mo models dels but Ev en simple brain areas like V1 are hea vily impacted b y feedback from higher has not yet b een sho shown wn to offer a compelling improv improvemen emen ement. t. lev els. F eedbac k has b een explored extensiv ely in neural netw ork mo dels but • • While has notfeedforw yet b een to offer compelling improv emen t. information as feedforward ardsho ITwn firing ratesa capture muc uch h of the same con conv volutional net netw work features, it is not clear how similar the intermediate While feedforw ard firing rates captureuses mucvery h of the same activ information as computations are. IT The brain probably different activation ation and con volutional network it neuron’s is not clear how similar the intermediate • p ooling functions. An features, individual activ activation ation probably is not wellcomputations are. The brain probably uses very different activ ation and characterized by a single linear filter resp response. onse. A recent mo model del of V1 in inv volv olves es p ooling functions. An individual neuron’s activ ation probably is not wellmultiple quadratic filters for eac each h neuron (Rust et al., 2005). 
Indeed our cartoon picture of "simple cells" and "complex cells" might create a nonexistent distinction; simple cells and complex cells might both be the same kind of cell but with their "parameters" enabling a continuum of behaviors ranging from what we call "simple" to what we call "complex."

It is also worth mentioning that neuroscience has told us relatively little about how to train convolutional networks. Model structures with parameter sharing across multiple spatial locations date back to early connectionist models of vision (Marr and Poggio, 1976), but these models did not use the modern back-propagation algorithm and gradient descent. For example, the Neocognitron (Fukushima, 1980) incorporated most of the model architecture design elements of the modern convolutional network but relied on a layer-wise unsupervised clustering algorithm.
Lang and Hinton (1988) introduced the use of back-propagation to train time-delay neural networks (TDNNs). To use contemporary terminology, TDNNs are one-dimensional convolutional networks applied to time series. Back-propagation applied to these models was not inspired by any neuroscientific observation and is considered by some to be biologically implausible. Following the success of back-propagation-based training of TDNNs, LeCun et al. (1989) developed the modern convolutional network by applying the same training algorithm to 2-D convolution applied to images.

So far we have described how simple cells are roughly linear and selective for certain features, complex cells are more nonlinear and become invariant to some transformations of these simple cell features, and stacks of layers that alternate between selectivity and invariance can yield grandmother cells for very specific phenomena. We have not yet described precisely what these individual cells detect.
b etaween selectivity inork, variance canb eyield grandmother cells for sp ecific In deep, nonlinearand netw network, it can difficult to understand thevery function of phenomena. W e ha v e not y et describ ed precisely what these individual cells detect. individual cells. Simple cells in the first lay layer er are easier to analyze, b ecause their In a deep, nonlinear netw ork, it can b e difficult to understand thework, function of resp responses onses are driven by a linear function. In an artificial neural net netw we can individual cells. Simple cells in the first lay er are easier to analyze, b ecause their just display an image of the conv convolution olution kernel to see what the corresp corresponding onding onsesofare driven by a linear neural netwnet ork, we can cresp hannel a con conv volutional la lay yer function. resp responds onds In to.an Inartificial a biological neural netw work, we just display an image of the conv olution k ernel to see what the corresp onding do not hav havee access to the weigh eights ts themselves. Instead, we put an electro electrode de in the cneuron hannelitself, of a con v olutional la y er resp onds to. In a biological neural net w ork, we displa display y sev several eral samples of white noise images in front of the animal’s do not hav e access to the w eigh ts themselves. Instead, w e put an electro de in W thee retina, and record how each of these samples causes the neuron to activ activate. ate. neuron sevdel eraltosamples of white noise images in front the animal’s can thenitself, fit a displa linearymo model these resp responses onses in order to obtain an of approximation retina, and record how these samples causes the neuron to activ ate. Whe of the neuron’s weigh weights. ts.each Thisofapproach is known as reverse corr orrelation elation (Ringac Ringach can then fit a, 2004 linear and Shapley ). mo del to these resp onses in order to obtain an approximation of the neuron’s weights. 
Reverse correlation shows us that most V1 cells have weights that are described by Gabor functions. The Gabor function describes the weight at a 2-D point in the image. We can think of an image as being a function of 2-D coordinates, I(x, y). Likewise, we can think of a simple cell as sampling the image at a set of locations, defined by a set of x coordinates X and a set of y coordinates Y, and applying weights that are also a function of the location, w(x, y). From this point of view, the response of a simple cell to an image is given by

s(I) = Σ_{x∈X} Σ_{y∈Y} w(x, y) I(x, y).    (9.15)
Specifically, w(x, y) takes the form of a Gabor function:

w(x, y; α, βx, βy, f, φ, x0, y0, τ) = α exp(−βx x′² − βy y′²) cos(f x′ + φ),    (9.16)

where

x′ = (x − x0) cos(τ) + (y − y0) sin(τ)    (9.17)
CHAPTER 9. CONVOLUTIONAL NETWORKS
and

y′ = −(x − x0) sin(τ) + (y − y0) cos(τ).    (9.18)

Here, α, βx, βy, f, φ, x0, y0, and τ are parameters that control the properties of the Gabor function. Fig. 9.18 shows some examples of Gabor functions with different settings of these parameters.

The parameters x0, y0, and τ define a coordinate system. We translate and rotate x and y to form x′ and y′. Specifically, the simple cell will respond to image features centered at the point (x0, y0), and it will respond to changes in brightness as we move along a line rotated τ radians from the horizontal.

Viewed as a function of x′ and y′, the function w then responds to changes in brightness as we move along the x′ axis. It has two important factors: one is a Gaussian function and the other is a cosine function.
The Gaussian factor α exp(−βx x′² − βy y′²) can be seen as a gating term that ensures the simple cell will only respond to values near where x′ and y′ are both zero, in other words, near the center of the cell's receptive field. The scaling factor α adjusts the total magnitude of the simple cell's response, while βx and βy control how quickly its receptive field falls off.

The cosine factor cos(f x′ + φ) controls how the simple cell responds to changing brightness along the x′ axis. The parameter f controls the frequency of the cosine and φ controls its phase offset.

Altogether, this cartoon view of simple cells means that a simple cell responds to a specific spatial frequency of brightness in a specific direction at a specific location. Simple cells are most excited when the wave of brightness in the image has the same phase as the weights. This occurs when the image is bright where the weights are positive and dark where the weights are negative.
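As a concrete illustration, the Gabor weighting of Eqs. 9.16–9.18 and the linear simple-cell response of Eq. 9.15 can be sketched in a few lines of NumPy. The function and variable names below are our own, chosen for clarity, not taken from any particular library:

```python
import numpy as np

def gabor_weights(alpha, beta_x, beta_y, f, phi, x0, y0, tau, coords):
    """Evaluate the Gabor function (Eqs. 9.16-9.18) on a grid of points.

    coords: array of shape (H, W, 2) holding the (x, y) coordinate of each pixel.
    """
    x = coords[..., 0]
    y = coords[..., 1]
    # Translate and rotate into the cell's coordinate system (Eqs. 9.17-9.18).
    x_p = (x - x0) * np.cos(tau) + (y - y0) * np.sin(tau)
    y_p = -(x - x0) * np.sin(tau) + (y - y0) * np.cos(tau)
    # Gaussian gating factor times the oscillating cosine factor (Eq. 9.16).
    return alpha * np.exp(-beta_x * x_p**2 - beta_y * y_p**2) * np.cos(f * x_p + phi)

def simple_cell_response(image, weights):
    """Linear simple-cell response s(I) = sum_xy w(x, y) I(x, y)  (Eq. 9.15)."""
    return np.sum(weights * image)
```

Evaluating `gabor_weights` on a coordinate grid (e.g. built with `np.meshgrid`) produces a weight image like those shown in Fig. 9.18; the response of the cell to an image is then just an elementwise product and sum.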
Simple cells are most inhibited when the wave of brightness is fully out of phase with the weights: when the image is dark where the weights are positive and bright where the weights are negative.

The cartoon view of a complex cell is that it computes the L² norm of the 2-D vector containing two simple cells' responses: c(I) = √(s0(I)² + s1(I)²). An important special case occurs when s1 has all of the same parameters as s0 except for φ, and φ is set such that s1 is one quarter cycle out of phase with s0. In this case, s0 and s1 form a quadrature pair.
A complex cell defined in this way responds when the Gaussian reweighted image I(x, y) exp(−βx x′² − βy y′²) contains a high amplitude sinusoidal wave with frequency f in direction τ near (x0, y0), regardless of the phase offset of this wave. In other words, the complex cell is invariant to small translations of the image in direction τ, or to negating the
Figure 9.18: Gabor functions with a variety of parameter settings. White indicates large positive weight, black indicates large negative weight, and the background gray corresponds to zero weight. (Left) Gabor functions with different values of the parameters that control the coordinate system: x0, y0, and τ. Each Gabor function in this grid is assigned a value of x0 and y0 proportional to its position in its grid, and τ is chosen so that each Gabor filter is sensitive to the direction radiating out from the center of the grid. For the other two plots, x0, y0, and τ are fixed to zero. (Center) Gabor functions with different Gaussian scale parameters βx and βy. Gabor functions are arranged in increasing width (decreasing βx) as we move left to right through the grid, and increasing height (decreasing βy) as we move top to bottom. For the other two plots, the β values are fixed to 1.5× the image width. (Right) Gabor functions with different sinusoid parameters f and φ. As we move top to bottom, f increases, and as we move left to right, φ increases. For the other two plots, φ is fixed to 0 and f is fixed to 5× the image width.

image (replacing black with white and vice versa).
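The quadrature-pair energy computation described above can be sketched as follows. This is a minimal illustration that uses a pure sinusoidal pair without the Gaussian envelope, a simplification of our own, and all names are hypothetical:

```python
import numpy as np

def complex_cell_response(image, w0, w1):
    """Cartoon complex cell: the L2 norm of two simple-cell responses,
    c(I) = sqrt(s0(I)^2 + s1(I)^2)."""
    s0 = np.sum(w0 * image)  # linear simple-cell response, as in Eq. 9.15
    s1 = np.sum(w1 * image)
    return np.hypot(s0, s1)

# A quadrature pair: identical parameters except the phase offset phi,
# shifted by one quarter cycle (pi/2), so the cosine becomes a sine.
n = 64
x = 2 * np.pi * np.arange(n) / n
w0 = np.cos(3 * x)
w1 = np.sin(3 * x)

# The response to a grating of the matching frequency is the same for any
# phase of the input: the complex cell is phase invariant.
responses = [complex_cell_response(np.cos(3 * x + phase), w0, w1)
             for phase in (0.0, 0.7, 1.9)]
```

With the Gaussian factor of Eq. 9.16 included, the pair would in addition be localized around (x0, y0), as described in the text.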
Some of the most striking correspondences between neuroscience and machine learning come from visually comparing the features learned by machine learning models with those employed by V1. Olshausen and Field (1996) showed that a simple unsupervised learning algorithm, sparse coding, learns features with receptive fields similar to those of simple cells. Since then, we have found that an extremely wide variety of statistical learning algorithms learn features with Gabor-like functions when applied to natural images. This includes most deep learning algorithms, which learn these features in their first layer. Fig. 9.19 shows some examples. Because so many different learning algorithms learn edge detectors, it is difficult to conclude that any specific learning algorithm is the “right” model of the
brain just based on the features that it learns (though it can certainly be a bad sign if an algorithm does not learn some sort of edge detector when applied to natural images). These features are an important part of the statistical structure of natural images and can be recovered by many different approaches to statistical modeling. See Hyvärinen et al. (2009) for a review of the field of natural image statistics.
Figure 9.19: Many machine learning algorithms learn features that detect edges or specific colors of edges when applied to natural images. These feature detectors are reminiscent of the Gabor functions known to be present in primary visual cortex. (Left) Weights learned by an unsupervised learning algorithm (spike and slab sparse coding) applied to small image patches. (Right) Convolution kernels learned by the first layer of a fully supervised convolutional maxout network. Neighboring pairs of filters drive the same maxout unit.
9.11  Convolutional Networks and the History of Deep Learning

Convolutional networks have played an important role in the history of deep learning. They are a key example of a successful application of insights obtained by studying the brain to machine learning applications. They were also some of the first deep models to perform well, long before arbitrary deep models were considered viable. Convolutional networks were also some of the first neural networks to solve important commercial applications and remain at the forefront of commercial applications of deep learning today. For example, in the 1990s, the neural network research group at AT&T developed a convolutional network for reading checks (LeCun et al., 1998b). By the end of the 1990s, this system deployed by NEC was reading over 10% of all the checks in the US.
Later, several OCR and handwriting recognition systems based on convolutional nets were deployed by Microsoft (Simard et al., 2003). See Chapter 12 for more details on such applications and more modern applications of convolutional networks. See LeCun et al. (2010) for a more in-depth history of convolutional networks up to 2010.

Convolutional networks were also used to win many contests. The current intensity of commercial interest in deep learning began when Krizhevsky et al. (2012) won the ImageNet object recognition challenge, but convolutional networks
had been used to win other machine learning and computer vision contests with less impact for years earlier.

Convolutional nets were some of the first working deep networks trained with back-propagation. It is not entirely clear why convolutional networks succeeded when general back-propagation networks were considered to have failed. It may simply be that convolutional networks were more computationally efficient than fully connected networks, so it was easier to run multiple experiments with them and tune their implementation and hyperparameters. Larger networks also seem to be easier to train. With modern hardware, large fully connected networks appear to perform reasonably on many tasks, even when using datasets that were available and activation functions that were popular during the times when fully connected networks were believed not to work well. It may be that the primary barriers to the success of neural networks were psychological (practitioners did not expect neural networks to work, so they did not make a serious effort to use neural networks). Whatever the case, it is fortunate that convolutional networks performed well decades ago. In many ways, they carried the torch for the rest of deep learning and paved the way to the acceptance of neural networks in general.

Convolutional networks provide a way to specialize neural networks to work with data that has a clear grid-structured topology and to scale such models to very large size. This approach has been the most successful on a two-dimensional, image topology. To process one-dimensional, sequential data, we turn next to another powerful specialization of the neural networks framework: recurrent neural networks.
Chapter 10
Sequence Modeling: Recurrent and Recursive Nets

Recurrent neural networks, or RNNs (Rumelhart et al., 1986a), are a family of neural networks for processing sequential data. Much as a convolutional network is a neural network that is specialized for processing a grid of values X such as an image, a recurrent neural network is a neural network that is specialized for processing a sequence of values x(1), . . . , x(τ). Just as convolutional networks can readily scale to images with large width and height, and some convolutional networks can process images of variable size, recurrent networks can scale to much longer sequences than would be practical for networks without sequence-based specialization. Most recurrent networks can also process sequences of variable length.
To go from multi-layer networks to recurrent networks, we need to take advantage of one of the early ideas found in machine learning and statistical models of the 1980s: sharing parameters across different parts of a model. Parameter sharing makes it possible to extend and apply the model to examples of different forms (different lengths, here) and generalize across them. If we had separate parameters for each value of the time index, we could not generalize to sequence lengths not seen during training, nor share statistical strength across different sequence lengths and across different positions in time. Such sharing is particularly important when a specific piece of information can occur at multiple positions within the sequence. For example, consider the two sentences “I went to Nepal in 2009” and “In 2009,
I went to Nepal.” If we ask a machine learning model to read each sentence and extract the year in which the narrator went to Nepal, we would like it to recognize the year 2009 as the relevant piece of information, whether it appears in the sixth
CHAPTER 10. SEQUENCE MODELING: RECURRENT AND RECURSIVE NETS
word or the second word of the sentence. Suppose that we trained a feedforward network that processes sentences of fixed length. A traditional fully connected feedforward network would have separate parameters for each input feature, so it would need to learn all of the rules of the language separately at each position in the sentence. By comparison, a recurrent neural network shares the same weights across several time steps.

A related idea is the use of convolution across a 1-D temporal sequence. This convolutional approach is the basis for time-delay neural networks (Lang and Hinton, 1988; Waibel et al., 1989; Lang et al., 1990). The convolution operation allows a network to share parameters across time, but is shallow. The output of convolution is a sequence where each member of the output is a function of a small number of neighboring members of the input.
The idea of parameter sharing manifests in the application of the same convolution kernel at each time step. Recurrent networks share parameters in a different way. Each member of the output is a function of the previous members of the output. Each member of the output is produced using the same update rule applied to the previous outputs. This recurrent formulation results in the sharing of parameters through a very deep computational graph.

For the simplicity of exposition, we refer to RNNs as operating on a sequence that contains vectors x(t) with the time step index t ranging from 1 to τ. In practice, recurrent networks usually operate on minibatches of such sequences, with a different sequence length τ for each member of the minibatch. We have
Moreov Moreover, er, the time step index τ with a different sequence length for eac h mem b er of the minibatc Wetohathe ve need not literally refer to the passage of time in the real world, but h. only omitted the minibatch indices to simplify notation. Moreov er, the time step index position in the sequence. RNNs ma may y also be applied in tw two o dimensions across need not literally refer to theand passage timeapplied in the to real world, but only to the the spatial data such as images, even of when data inv involving olving time, p osition in ythe sequence. RNNs also be applied in provided two dimensions net netwo wo work rk ma may ha have ve connections thatma goy bac backwards kwards in time, that the across entire spatial data such as images, and even when applied to data inv olving time, the sequence is observed before it is provided to the netw network. ork. network may have connections that go backwards in time, provided that the entire This chapter extends the idea of a computational graph to include cycles. These sequence is observed before it is provided to the network. cycles represent the influence of the presen presentt value of a variable on its own value This c hapter extends the idea of a computational cycles. Theset at a future time step. Suc Such h computational graphsgraph allo allow wtousinclude to define recurren recurrent cycles represent of the presen t valuet of a variable on its own neural netw networks. orks.the Weinfluence then describ describe e many differen different ways to construct, train,value and at a future time step. Suc h computational graphs allo w us to define recurren t use recurrent neural netw networks. orks. neural networks. We then describe many different ways to construct, train, and For more information on recurrent neural net netw works than is available in this use recurrent neural networks. chapter, we refer the reader to the textb textbo ook of Grav Graves es (2012). 
For more information on recurrent neural networks than is available in this chapter, we refer the reader to the textbook of Graves (2012).
CHAPTER 10. SEQUENCE MODELING: RECURRENT AND RECURSIVE NETS
10.1 Unfolding Computational Graphs
A computational graph is a way to formalize the structure of a set of computations, such as those involved in mapping inputs and parameters to outputs and loss. Please refer to Sec. 6.5.1 for a general introduction. In this section we explain the idea of unfolding a recursive or recurrent computation into a computational graph that has a repetitive structure, typically corresponding to a chain of events. Unfolding this graph results in the sharing of parameters across a deep network structure.

For example, consider the classical form of a dynamical system:

    s^(t) = f(s^(t−1); θ),    (10.1)

where s^(t) is called the state of the system.

Eq. 10.1 is recurrent because the definition of s at time t refers back to the same definition at time t − 1.

For a finite number of time steps τ, the graph can be unfolded by applying the definition τ − 1 times. For example, if we unfold Eq. 10.1 for τ = 3 time steps, we obtain

    s^(3) = f(s^(2); θ)           (10.2)
          = f(f(s^(1); θ); θ).    (10.3)

Unfolding the equation by repeatedly applying the definition in this way has yielded an expression that does not involve recurrence. Such an expression can now be represented by a traditional directed acyclic computational graph. The unfolded computational graph of Eq. 10.1 and Eq. 10.3 is illustrated in Fig. 10.1.
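As a small illustration of this unfolding, consider a toy scalar system. The chapter leaves f abstract, so the concrete transition function below is an arbitrary choice made purely for the example:

```python
import numpy as np

# Toy transition function; f is abstract in the text, so we pick an
# arbitrary concrete choice for illustration: f(s; theta) = tanh(theta * s).
def f(s, theta):
    return np.tanh(theta * s)

def unfold(s1, theta, steps):
    """Repeatedly apply the recurrence s^(t) = f(s^(t-1); theta) of Eq. 10.1."""
    s = s1
    for _ in range(steps):
        s = f(s, theta)
    return s

s1, theta = 2.0, 0.5
# Unfolding for tau = 3 applies f twice, giving the non-recurrent
# nested expression of Eq. 10.3: s^(3) = f(f(s^(1); theta); theta).
s3 = unfold(s1, theta, steps=2)
assert np.isclose(s3, f(f(s1, theta), theta))
```

The loop and the nested expression compute the same value; unfolding simply makes every intermediate state an explicit node of an acyclic graph.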
s^(...) → f → s^(t−1) → f → s^(t) → f → s^(t+1) → f → s^(...)
Figure 10.1: The classical dynamical system described by Eq. 10.1, illustrated as an unfolded computational graph. Each node represents the state at some time t and the function f maps the state at t to the state at t + 1. The same parameters (the same value of θ used to parametrize f) are used for all time steps.

As another example, let us consider a dynamical system driven by an external signal x^(t),

    s^(t) = f(s^(t−1), x^(t); θ),    (10.4)
where we see that the state now contains information about the whole past sequence.

Recurrent neural networks can be built in many different ways. Much as almost any function can be considered a feedforward neural network, essentially any function involving recurrence can be considered a recurrent neural network.

Many recurrent neural networks use Eq. 10.5 or a similar equation to define the values of their hidden units. To indicate that the state is the hidden units of the network, we now rewrite Eq. 10.4 using the variable h to represent the state:

    h^(t) = f(h^(t−1), x^(t); θ),    (10.5)

illustrated in Fig. 10.2. Typical RNNs will add extra architectural features such as output layers that read information out of the state h to make predictions.

When the recurrent network is trained to perform a task that requires predicting the future from the past, the network typically learns to use h^(t) as a kind of lossy summary of the task-relevant aspects of the past sequence of inputs up to t. This summary is in general necessarily lossy, since it maps an arbitrary length sequence (x^(t), x^(t−1), x^(t−2), . . . , x^(2), x^(1)) to a fixed length vector h^(t). Depending on the training criterion, this summary might selectively keep some aspects of the past sequence with more precision than other aspects. For example, if the RNN is used in statistical language modeling, typically to predict the next word given previous words, it may not be necessary to store all of the information in the input sequence up to time t, but rather only enough information to predict the rest of the sentence. The most demanding situation is when we ask h^(t) to be rich enough to allow one to approximately recover the input sequence, as in autoencoder frameworks (Chapter 14).
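The mapping from a variable-length sequence to a fixed-size state can be sketched directly from Eq. 10.5. The tanh transition and all dimensions below are illustrative choices, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 5                                  # illustrative sizes
W = rng.normal(scale=0.1, size=(n_hidden, n_hidden))   # hidden-to-hidden
U = rng.normal(scale=0.1, size=(n_hidden, n_in))       # input-to-hidden

def state_update(h_prev, x):
    """One application of Eq. 10.5, h^(t) = f(h^(t-1), x^(t); theta),
    with f chosen here as a tanh transition."""
    return np.tanh(W @ h_prev + U @ x)

def summarize(xs):
    """Fold an arbitrary-length sequence into the fixed-size final state."""
    h = np.zeros(n_hidden)                             # h^(0)
    for x in xs:
        h = state_update(h, x)
    return h

# Sequences of different lengths yield summaries of the same fixed size,
# which is exactly why the summary must in general be lossy.
h_short = summarize(rng.normal(size=(4, n_in)))
h_long = summarize(rng.normal(size=(50, n_in)))
assert h_short.shape == h_long.shape == (n_hidden,)
```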
Figure 10.2: A recurrent network with no outputs. This recurrent network just processes information from the input x by incorporating it into the state h that is passed forward through time. (Left) Circuit diagram. The black square indicates a delay of 1 time step. (Right) The same network seen as an unfolded computational graph, where each node is now associated with one particular time instance.

Eq. 10.5 can be drawn in two different ways. One way to draw the RNN is with a diagram containing one node for every component that might exist in a
physical implementation of the model, such as a biological neural network. In this view, the network defines a circuit that operates in real time, with physical parts whose current state can influence their future state, as in the left of Fig. 10.2. Throughout this chapter, we use a black square in a circuit diagram to indicate that an interaction takes place with a delay of 1 time step, from the state at time t to the state at time t + 1. The other way to draw the RNN is as an unfolded computational graph, in which each component is represented by many different variables, with one variable per time step, representing the state of the component at that point in time. Each variable for each time step is drawn as a separate node of the computational graph, as in the right of Fig. 10.2. What we call unfolding is the operation that maps a circuit as in the left side of the figure to a computational graph with repeated pieces as in the right side. The unfolded graph now has a size that depends on the sequence length.

We can represent the unfolded recurrence after t steps with a function g^(t):

    h^(t) = g^(t)(x^(t), x^(t−1), x^(t−2), . . . , x^(2), x^(1))    (10.6)
          = f(h^(t−1), x^(t); θ).    (10.7)

The function g^(t) takes the whole past sequence (x^(t), x^(t−1), x^(t−2), . . . , x^(2), x^(1)) as input and produces the current state, but the unfolded recurrent structure allows us to factorize g^(t) into repeated application of a function f. The unfolding process thus introduces two major advantages:

1. Regardless of the sequence length, the learned model always has the same input size, because it is specified in terms of transition from one state to another state, rather than specified in terms of a variable-length history of states.

2. It is possible to use the transition function f with the same parameters at every time step.

These two factors make it possible to learn a single model f that operates on all time steps and all sequence lengths, rather than needing to learn a separate model g^(t) for all possible time steps. Learning a single, shared model allows generalization to sequence lengths that did not appear in the training set, and allows the model to be estimated with far fewer training examples than would be required without parameter sharing.

Both the recurrent graph and the unrolled graph have their uses. The recurrent graph is succinct. The unfolded graph provides an explicit description of which computations to perform. The unfolded graph also helps to illustrate the idea of
information flow forward in time (computing outputs and losses) and backward in time (computing gradients) by explicitly showing the path along which this information flows.
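This forward-then-backward flow on the unrolled graph can be made concrete with a minimal sketch of back-propagation through time. The model below is a tiny invented example (tanh transition, linear readout, squared-error loss); the chapter's full model and loss appear in Sec. 10.2:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_h, tau = 2, 3, 4                      # illustrative sizes
U = rng.normal(scale=0.5, size=(n_h, n_in))   # input-to-hidden
W = rng.normal(scale=0.5, size=(n_h, n_h))    # hidden-to-hidden
v = rng.normal(size=n_h)                      # linear readout
xs = rng.normal(size=(tau, n_in))
ys = rng.normal(size=tau)

def forward(W):
    """Left-to-right pass: compute and store every state h_t, then the loss."""
    hs = [np.zeros(n_h)]                      # h_0
    for x in xs:
        hs.append(np.tanh(W @ hs[-1] + U @ x))
    loss = 0.5 * sum((v @ h - y) ** 2 for h, y in zip(hs[1:], ys))
    return hs, loss

def backward(W, hs):
    """Right-to-left pass: gradients flow backward in time through the
    stored states, so the memory cost is O(tau)."""
    dW = np.zeros_like(W)
    g_h = np.zeros(n_h)                       # gradient w.r.t. h_t from the future
    for t in range(tau, 0, -1):
        g_h = g_h + (v @ hs[t] - ys[t - 1]) * v   # direct loss term at step t
        g_a = g_h * (1.0 - hs[t] ** 2)            # back through tanh
        dW += np.outer(g_a, hs[t - 1])            # same W at every step
        g_h = W.T @ g_a                           # hand gradient to step t-1
    return dW

hs, loss = forward(W)
dW = backward(W, hs)
```

A finite-difference check on any single entry of W confirms that the backward pass matches the gradient of the forward-pass loss.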
10.2 Recurrent Neural Networks
Armed with the graph unrolling and parameter sharing ideas of Sec. 10.1, we can design a wide variety of recurrent neural networks.
Figure 10.3: The computational graph to compute the training loss of a recurrent network that maps an input sequence of x values to a corresponding sequence of output o values. A loss L measures how far each o is from the corresponding training target y. When using softmax outputs, we assume o is the unnormalized log probabilities. The loss L internally computes ŷ = softmax(o) and compares this to the target y. The RNN has input-to-hidden connections parametrized by a weight matrix U, hidden-to-hidden recurrent connections parametrized by a weight matrix W, and hidden-to-output connections parametrized by a weight matrix V. Eq. 10.8 defines forward propagation in this model. (Left) The RNN and its loss drawn with recurrent connections. (Right) The same seen as a time-unfolded computational graph, where each node is now associated with one particular time instance.

Some examples of important design patterns for recurrent neural networks include the following:

• Recurrent networks that produce an output at each time step and have recurrent connections between hidden units, illustrated in Fig. 10.3.

• Recurrent networks that produce an output at each time step and have recurrent connections only from the output at one time step to the hidden units at the next time step, illustrated in Fig. 10.4.

• Recurrent networks with recurrent connections between hidden units, that read an entire sequence and then produce a single output, illustrated in Fig. 10.5.

Fig. 10.3 is a reasonably representative example that we return to throughout most of the chapter.

The recurrent neural network of Fig. 10.3 and Eq. 10.8 is universal in the sense that any function computable by a Turing machine can be computed by such a recurrent network of a finite size.
The output can be read from the RNN after a number of time steps that is asymptotically linear in the number of time steps used by the Turing machine and asymptotically linear in the length of the input (Siegelmann and Sontag, 1991; Siegelmann, 1995; Siegelmann and Sontag, 1995; Hyotyniemi, 1996). The functions computable by a Turing machine are discrete, so these results regard exact implementation of the function, not approximations. The RNN, when used as a Turing machine, takes a binary sequence as input and its outputs must be discretized to provide a binary output. It is possible to compute all functions in this setting using a single specific RNN of finite size (Siegelmann and Sontag (1995) use 886 units). The "input" of the Turing machine is a specification of the function to be computed, so the same network that simulates this Turing machine is sufficient for all problems. The theoretical RNN used for the proof can simulate an unbounded stack by representing its activations and weights with rational numbers of unbounded precision.

We now develop the forward propagation equations for the RNN depicted in Fig. 10.3. The figure does not specify the choice of activation function for the hidden units. Here we assume the hyperbolic tangent activation function. Also, the figure does not specify exactly what form the output and loss function take. Here we assume that the output is discrete, as if the RNN is used to predict words or characters. A natural way to represent discrete variables is to regard the output o as giving the unnormalized log probabilities of each possible value of the discrete variable. We can then apply the softmax operation as a post-processing step to obtain a vector ŷ of normalized probabilities over the output. Forward propagation begins with a specification of the initial state h^(0). Then, for each time step from t = 1 to t = τ, we apply the following update equations:

    a^(t) = b + W h^(t−1) + U x^(t)    (10.8)
Figure 10.4: An RNN whose only recurrence is the feedback connection from the output to the hidden layer. At each time step t, the input is x^(t), the hidden layer activations are h^(t), the outputs are o^(t), the targets are y^(t) and the loss is L^(t). (Left) Circuit diagram. (Right) Unfolded computational graph. Such an RNN is less powerful (can express a smaller set of functions) than those in the family represented by Fig. 10.3. The RNN in Fig. 10.3 can choose to put any information it wants about the past into its hidden representation h and transmit h to the future. The RNN in this figure is trained to put a specific output value into o, and o is the only information it is allowed to send to the future. There are no direct connections from h going forward. The previous h is connected to the present only indirectly, via the predictions it was used to produce. Unless o is very high-dimensional and rich, it will usually lack important information from the past. This makes the RNN in this figure less powerful, but it may be easier to train because each time step can be trained in isolation from the others, allowing greater parallelization during training, as described in Sec. 10.2.1.
    h^(t) = tanh(a^(t))        (10.9)
    o^(t) = c + V h^(t)        (10.10)
    ŷ^(t) = softmax(o^(t))     (10.11)

where the parameters are the bias vectors b and c along with the weight matrices U, V and W, respectively for input-to-hidden, hidden-to-output and hidden-to-hidden connections. This is an example of a recurrent network that maps an input sequence to an output sequence of the same length. The total loss for a given sequence of x values paired with a sequence of y values would then be just the sum of the losses over all the time steps. For example, if L^(t) is the negative log-likelihood of y^(t) given x^(1), . . . , x^(t), then

    L({x^(1), . . . , x^(τ)}, {y^(1), . . . , y^(τ)})                 (10.12)
        = Σ_t L^(t)                                                   (10.13)
        = − Σ_t log p_model( y^(t) | {x^(1), . . . , x^(t)} ),        (10.14)

where p_model( y^(t) | {x^(1), . . . , x^(t)} ) is given by reading the entry for y^(t) from the model's output vector ŷ^(t). Computing the gradient of this loss function with respect to the parameters is an expensive operation. The gradient computation involves performing a forward propagation pass moving left to right through our illustration of the unrolled graph in Fig. 10.3, followed by a backward propagation pass moving right to left through the graph. The runtime is O(τ) and cannot be reduced by parallelization because the forward propagation graph is inherently sequential; each time step may only be computed after the previous one. States computed in the forward pass must be stored until they are reused during the backward pass, so the memory cost is also O(τ). The back-propagation algorithm applied to the unrolled graph with O(τ) cost is called back-propagation through time or BPTT and is discussed further in Sec. 10.2.2. The network with recurrence between hidden units is thus very powerful but also expensive to train. Is there an alternative?

The network with recurrent connections only from the output at one time step to the hidden units at the next time step (shown in Fig. 10.4) is strictly less powerful because it lacks hidden-to-hidden recurrent connections. For example, it cannot simulate a universal Turing machine. Because this network lacks hidden-to-hidden
recurrence, it requires that the output units capture all of the information about the past that the network will use to predict the future. Because the output units are explicitly trained to match the training set targets, they are unlikely to capture the necessary information about the past history of the input, unless the user knows how to describe the full state of the system and provides it as part of the training set targets. The advantage of eliminating hidden-to-hidden recurrence is that, for any loss function based on comparing the prediction at time t to the training target at time t, all the time steps are decoupled. Training can thus be parallelized, with the gradient for each step t computed in isolation. There is no need to compute the output for the previous time step first, because the training set provides the ideal value of that output.
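Putting Eqs. 10.8–10.14 together, the forward pass and loss for the hidden-to-hidden architecture of Fig. 10.3 can be sketched as follows. The dimensions and parameter values are illustrative, not from the text:

```python
import numpy as np

def softmax(o):
    # Numerically stable softmax over the unnormalized log probabilities o.
    e = np.exp(o - o.max())
    return e / e.sum()

def forward(xs, ys, U, V, W, b, c, h0):
    """Forward propagation (Eqs. 10.8-10.11) and total negative
    log-likelihood loss (Eqs. 10.12-10.14) for one sequence.
    xs: (tau, n_in) inputs; ys: (tau,) integer targets."""
    h = h0
    loss = 0.0
    for x, y in zip(xs, ys):
        a = b + W @ h + U @ x          # Eq. 10.8
        h = np.tanh(a)                 # Eq. 10.9
        o = c + V @ h                  # Eq. 10.10
        yhat = softmax(o)              # Eq. 10.11
        loss -= np.log(yhat[y])        # Eq. 10.14, summed over t
    return loss

# Illustrative sizes and random parameters.
rng = np.random.default_rng(0)
n_in, n_hidden, n_out, tau = 4, 6, 3, 5
U = rng.normal(scale=0.1, size=(n_hidden, n_in))
W = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
V = rng.normal(scale=0.1, size=(n_out, n_hidden))
b, c = np.zeros(n_hidden), np.zeros(n_out)
xs = rng.normal(size=(tau, n_in))
ys = rng.integers(0, n_out, size=tau)
loss = forward(xs, ys, U, V, W, b, c, np.zeros(n_hidden))
```

With zero biases and small random weights, each ŷ^(t) is roughly uniform, so the total loss comes out near τ log(n_out).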
Figure 10.5: Time-unfolded recurrent neural network with a single output at the end of the sequence. Such a network can be used to summarize a sequence and produce a fixed-size representation used as input for further processing. There might be a target right at the end (as depicted here), or the gradient on the output o can be obtained by back-propagating from further downstream modules.

Models that have recurrent connections from their outputs leading back into the model may be trained with teacher forcing. Teacher forcing is a procedure that emerges from the maximum likelihood criterion, in which during training the model receives the ground truth output y^(t) as input at time t + 1. We can see this by examining a sequence with two time steps. The conditional maximum likelihood criterion is

\log p\left(y^{(1)}, y^{(2)} \mid x^{(1)}, x^{(2)}\right)    (10.15)
Figure 10.6: Illustration of teacher forcing. Teacher forcing is a training technique that is applicable to RNNs that have connections from their output to their hidden states at the next time step. (Left) At train time, we feed the correct output y^(t) drawn from the train set as input to h^(t+1). (Right) When the model is deployed, the true output is generally not known. In this case, we approximate the correct output y^(t) with the model's output o^(t), and feed the output back into the model.
= \log p\left(y^{(2)} \mid y^{(1)}, x^{(1)}, x^{(2)}\right) + \log p\left(y^{(1)} \mid x^{(1)}, x^{(2)}\right)    (10.16)
In this example, we see that at time t = 2, the model is trained to maximize the conditional probability of y^(2) given both the x sequence so far and the previous y value from the training set. Maximum likelihood thus specifies that during training, rather than feeding the model's own output back into itself, these connections should be fed with the target values specifying what the correct output should be. This is illustrated in Fig. 10.6.

We originally motivated teacher forcing as allowing us to avoid back-propagation through time in models that lack hidden-to-hidden connections. Teacher forcing may still be applied to models that have hidden-to-hidden connections, so long as they have connections from the output at one time step to values computed in the next time step. However, as soon as the hidden units become a function of earlier time steps, the BPTT algorithm is necessary. Some models may thus be trained with both teacher forcing and BPTT.

The disadvantage of strict teacher forcing arises if the network is going to be later used in an open-loop mode, with the network outputs (or samples from the output distribution) fed back as input. In this case, the kind of inputs that the network sees during training could be quite different from the kind of inputs that it will see at test time. One way to mitigate this problem is to train with both teacher-forced inputs and free-running inputs, for example by predicting the correct target a number of steps in the future through the unfolded recurrent output-to-input paths. In this way, the network can learn to take into account input conditions (such as those it generates itself in the free-running mode) not seen during training, and how to map the state back towards one that will make the network generate proper outputs after a few steps. Another approach (Bengio et al., 2015b) to mitigate the gap between the inputs seen at train time and the inputs seen at test time randomly chooses to use generated values or actual data values as input. This approach exploits a curriculum learning strategy to gradually use more of the generated values as input.
Computing the gradient through a recurrent neural network is straightforward. One simply applies the generalized back-propagation algorithm of Sec. 6.5.6 to the unrolled computational graph. No specialized algorithms are necessary. The use of back-propagation on the unrolled graph is called the back-propagation through time (BPTT) algorithm. Gradients obtained by back-propagation may then be used with any general-purpose gradient-based techniques to train an RNN.
To gain some intuition for how the BPTT algorithm behaves, we provide an example of how to compute gradients by BPTT for the RNN equations above (Eq. 10.8 and Eq. 10.12). The nodes of our computational graph include the parameters U, V, W, b and c as well as the sequence of nodes indexed by t for x^(t), h^(t), o^(t) and L^(t). For each node N we need to compute the gradient \nabla_{\mathrm{N}} L recursively, based on the gradient computed at nodes that follow it in the graph. We start the recursion with the nodes immediately preceding the final loss:

\frac{\partial L}{\partial L^{(t)}} = 1.    (10.17)

In this derivation we assume that the outputs o^(t) are used as the argument to the softmax function to obtain the vector \hat{y} of probabilities over the output. We also assume that the loss is the negative log-likelihood of the true target y^(t) given the input so far. The gradient \nabla_{o^{(t)}} L on the outputs at time step t, for all i, t, is as follows:

(\nabla_{o^{(t)}} L)_i = \frac{\partial L}{\partial o_i^{(t)}} = \frac{\partial L}{\partial L^{(t)}} \frac{\partial L^{(t)}}{\partial o_i^{(t)}} = \hat{y}_i^{(t)} - \mathbf{1}_{i,y^{(t)}}.    (10.18)
We work our way backwards, starting from the end of the sequence. At the final time step τ, h^(τ) only has o^(τ) as a descendent, so its gradient is simple:

\nabla_{h^{(\tau)}} L = V^\top \nabla_{o^{(\tau)}} L.    (10.19)

We can then iterate backwards in time to back-propagate gradients through time, from t = τ − 1 down to t = 1, noting that h^(t) (for t < τ) has as descendents both o^(t) and h^(t+1). Its gradient is thus given by

\nabla_{h^{(t)}} L = \left(\frac{\partial h^{(t+1)}}{\partial h^{(t)}}\right)^\top (\nabla_{h^{(t+1)}} L) + \left(\frac{\partial o^{(t)}}{\partial h^{(t)}}\right)^\top (\nabla_{o^{(t)}} L)    (10.20)
= W^\top \mathrm{diag}\left(1 - \left(h^{(t+1)}\right)^2\right) (\nabla_{h^{(t+1)}} L) + V^\top (\nabla_{o^{(t)}} L),    (10.21)

where \mathrm{diag}\left(1 - \left(h^{(t+1)}\right)^2\right) indicates the diagonal matrix containing the elements 1 - (h_i^{(t+1)})^2. This is the Jacobian of the hyperbolic tangent associated with the hidden unit i at time t + 1.
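The backward recursion of Eqs. 10.19-10.21 can be checked numerically. The sketch below assumes a small tanh RNN with a squared-error loss, so that the gradient on the outputs is simply o^(t) − y^(t) (standing in for the softmax case of Eq. 10.18); all dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

# Small hypothetical RNN: h(t) = tanh(W h(t-1) + U x(t) + b), o(t) = V h(t) + c,
# with squared-error loss L = sum_t 0.5 ||o(t) - y(t)||^2.
T, n_in, n_hid, n_out = 4, 3, 5, 2
U = rng.normal(size=(n_hid, n_in))
W = 0.5 * rng.normal(size=(n_hid, n_hid))
V = rng.normal(size=(n_out, n_hid))
b, c = np.zeros(n_hid), np.zeros(n_out)
x = rng.normal(size=(T, n_in))
y = rng.normal(size=(T, n_out))

# Forward pass, storing every hidden state.
h = np.zeros((T, n_hid))
o = np.zeros((T, n_out))
prev = np.zeros(n_hid)
for t in range(T):
    h[t] = np.tanh(W @ prev + U @ x[t] + b)
    o[t] = V @ h[t] + c
    prev = h[t]

grad_o = o - y                          # gradient on outputs for squared error

# Backward recursion of Eqs. 10.19-10.21.
grad_h = np.zeros((T, n_hid))
grad_h[T - 1] = V.T @ grad_o[T - 1]     # Eq. 10.19: final step, only o(tau) follows
for t in range(T - 2, -1, -1):          # Eq. 10.21: both o(t) and h(t+1) follow h(t)
    grad_h[t] = W.T @ ((1 - h[t + 1] ** 2) * grad_h[t + 1]) + V.T @ grad_o[t]
```

A finite-difference check confirms that `grad_h[t]` is the full gradient of the total loss with respect to h^(t), as the ∇ notation in the text intends.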
Once the gradients on the internal nodes of the computational graph are obtained, we can obtain the gradients on the parameter nodes, which have descendents at all the time steps:

\nabla_c L = \sum_t \left(\frac{\partial o^{(t)}}{\partial c}\right)^\top \nabla_{o^{(t)}} L = \sum_t \nabla_{o^{(t)}} L
\nabla_b L = \sum_t \left(\frac{\partial h^{(t)}}{\partial b^{(t)}}\right)^\top \nabla_{h^{(t)}} L = \sum_t \mathrm{diag}\left(1 - \left(h^{(t)}\right)^2\right) \nabla_{h^{(t)}} L

\nabla_V L = \sum_t \left(\nabla_{o^{(t)}} L\right) h^{(t)\top}

\nabla_W L = \sum_t \mathrm{diag}\left(1 - \left(h^{(t)}\right)^2\right) \left(\nabla_{h^{(t)}} L\right) h^{(t-1)\top}

\nabla_U L = \sum_t \mathrm{diag}\left(1 - \left(h^{(t)}\right)^2\right) \left(\nabla_{h^{(t)}} L\right) x^{(t)\top}

We do not need to compute the gradient with respect to x^(t) for training because it does not have any parameters as ancestors in the computational graph defining the loss.

We are abusing notation somewhat in the above equations. We correctly use \nabla_{h^{(t)}} L to indicate the full influence of h^(t) through all paths from h^(t) to L. This is in contrast to our usage of \frac{\partial}{\partial W^{(t)}} or \frac{\partial}{\partial b^{(t)}}, which we use here in an unconventional manner. By \frac{\partial}{\partial W^{(t)}} we refer to the effect of W on h^(t) only via the use of W at time step t. This is not standard calculus notation, because the standard definition of the Jacobian would actually include the complete influence of W on h^(t) via its use in all of the preceding time steps to produce h^(t-1). What we refer to here is in fact the method of Sec. 6.5.6, that computes the contribution of a single edge in the computational graph to the gradient.
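The parameter gradients above can likewise be verified by accumulating each time step's contribution, as in this self-contained sketch (again a tanh RNN with squared-error loss and arbitrary dimensions, so the diag(1 − h²) factors match the equations):

```python
import numpy as np

rng = np.random.default_rng(2)
T, n_in, n_hid, n_out = 4, 3, 5, 2
U = rng.normal(size=(n_hid, n_in))
W = 0.5 * rng.normal(size=(n_hid, n_hid))
V = rng.normal(size=(n_out, n_hid))
b, c = np.zeros(n_hid), np.zeros(n_out)
x = rng.normal(size=(T, n_in))
y = rng.normal(size=(T, n_out))

def forward(W):
    """Forward pass returning hidden states, outputs, and total squared-error loss."""
    h = np.zeros((T, n_hid)); o = np.zeros((T, n_out)); prev = np.zeros(n_hid)
    for t in range(T):
        h[t] = np.tanh(W @ prev + U @ x[t] + b)
        o[t] = V @ h[t] + c
        prev = h[t]
    return h, o, 0.5 * np.sum((o - y) ** 2)

h, o, L = forward(W)
grad_o = o - y

# grad_h by the backward recursion of Eqs. 10.19-10.21.
grad_h = np.zeros((T, n_hid))
grad_h[T - 1] = V.T @ grad_o[T - 1]
for t in range(T - 2, -1, -1):
    grad_h[t] = W.T @ ((1 - h[t + 1] ** 2) * grad_h[t + 1]) + V.T @ grad_o[t]

# Parameter gradients: sum each parameter's per-time-step contribution.
h_prev = np.vstack([np.zeros(n_hid), h[:-1]])
dtanh = (1 - h ** 2) * grad_h         # diag(1 - h(t)^2) grad_h(t), for all t at once
grad_c = grad_o.sum(axis=0)
grad_b = dtanh.sum(axis=0)
grad_V = grad_o.T @ h                 # sum_t (grad_o(t)) h(t)^T
grad_W = dtanh.T @ h_prev             # sum_t diag(1 - h^2) (grad_h(t)) h(t-1)^T
grad_U = dtanh.T @ x                  # sum_t diag(1 - h^2) (grad_h(t)) x(t)^T
```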
In the example recurrent network we have developed so far, the losses L^(t) were cross-entropies between training targets y^(t) and outputs o^(t). As with a feedforward network, it is in principle possible to use almost any loss with a recurrent network. The loss should be chosen based on the task. As with a feedforward network, we usually wish to interpret the output of the RNN as a probability distribution, and we usually use the cross-entropy associated with that distribution to define the loss. Mean squared error is the cross-entropy loss associated with an output distribution that is a unit Gaussian, for example, just as with a feedforward network.

When we use a predictive log-likelihood training objective, such as Eq. 10.12, we train the RNN to estimate the conditional distribution of the next sequence element
This may mean that we maximize the log-likelihoo log-likelihood train the RNN to estimate the conditional distribution of the next sequence element (t) log pmay (y (t) mean ...,x , (10.22) | x(1) ,that y given the past inputs. This we )maximize the log-likelihoo d 387 x
log p(y |
, . . . , x ),
(10.22)
or, if the model includes connections from the output at one time step to the next time step,

\log p\left(y^{(t)} \mid x^{(1)}, \ldots, x^{(t)}, y^{(1)}, \ldots, y^{(t-1)}\right).    (10.23)

Decomposing the joint probability over the sequence of y values as a series of one-step probabilistic predictions is one way to capture the full joint distribution across the whole sequence. When we do not feed past y values as inputs that condition the next step prediction, the directed graphical model contains no edges from any y^(i) in the past to the current y^(t). In this case, the outputs y are conditionally independent given the sequence of x values. When we do feed the actual y values (not their prediction, but the actual observed or generated values) back into the network, the directed graphical model contains edges from all y^(i) values in the past to the current y^(t) value.
Figure 10.7: Fully connected graphical model for a sequence y^(1), y^(2), ..., y^(t), ...: every past observation y^(i) may influence the conditional distribution of some y^(t) (for t > i), given the previous values. Parametrizing the graphical model directly according to this graph (as in Eq. 10.6) might be very inefficient, with an ever growing number of inputs and parameters for each element of the sequence. RNNs obtain the same full connectivity but efficient parametrization, as illustrated in Fig. 10.8.

As a simple example, let us consider the case where the RNN models only a sequence of scalar random variables \mathbb{Y} = \{y^{(1)}, \ldots, y^{(\tau)}\}, with no additional inputs x. The input at time step t is simply the output at time step t − 1. The RNN then defines a directed graphical model over the y^(t) variables. We parametrize the joint distribution of these observations using the chain rule (Eq. 3.6) for conditional probabilities:

P(\mathbb{Y}) = P\left(y^{(1)}, \ldots, y^{(\tau)}\right) = \prod_{t=1}^{\tau} P\left(y^{(t)} \mid y^{(t-1)}, y^{(t-2)}, \ldots, y^{(1)}\right)    (10.24)
where the right-hand side of the bar is empty for t = 1, of course. Hence the negative log-likelihood of a set of values \{y^{(1)}, \ldots, y^{(\tau)}\} according to such a model is

L = \sum_t L^{(t)},    (10.25)

where

L^{(t)} = -\log P\left(\mathrm{y}^{(t)} = y^{(t)} \mid y^{(t-1)}, y^{(t-2)}, \ldots, y^{(1)}\right).    (10.26)
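A toy check of this decomposition: a small autoregressive RNN over a binary alphabet whose per-step terms follow Eqs. 10.25-10.26. Because Eq. 10.24 is a chain-rule factorization, the probabilities of all k^τ sequences must sum to one. The architecture and dimensions are illustrative assumptions.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)

# Toy autoregressive RNN over a binary alphabet: the input at step t is the
# one-hot encoding of the output at step t-1, as in the scalar-sequence example.
k, tau, n_hid = 2, 3, 4
U = rng.normal(size=(n_hid, k)); W = 0.5 * rng.normal(size=(n_hid, n_hid))
V = rng.normal(size=(k, n_hid)); b = np.zeros(n_hid); c = np.zeros(k)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def nll(seq):
    """Negative log-likelihood L = sum_t L(t), with
    L(t) = -log P(y(t) | y(t-1), ..., y(1))  (Eqs. 10.25-10.26)."""
    h = np.zeros(n_hid)
    y_prev = np.zeros(k)              # right-hand side of the bar empty at t = 1
    L = 0.0
    for y_t in seq:
        h = np.tanh(W @ h + U @ y_prev + b)
        p = softmax(V @ h + c)        # conditional distribution at this step
        L += -np.log(p[y_t])
        y_prev = np.eye(k)[y_t]
    return L

# Chain-rule factorization (Eq. 10.24): probabilities of all k^tau sequences
# must sum to exactly one.
total = sum(np.exp(-nll(seq)) for seq in product(range(k), repeat=tau))
```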
Figure 10.8: Introducing the state variable in the graphical model of the RNN, even though it is a deterministic function of its inputs, helps to see how we can obtain a very efficient parametrization, based on Eq. 10.5. Every stage in the sequence (for h^(t) and y^(t)) involves the same structure (the same number of inputs for each node) and can share the same parameters with the other stages.

The edges in a graphical model indicate which variables depend directly on other variables. Many graphical models aim to achieve statistical and computational efficiency by omitting edges that do not correspond to strong interactions. For example, it is common to make the Markov assumption that the graphical model should only contain edges from \{y^{(t-k)}, \ldots, y^{(t-1)}\} to y^(t), rather than containing edges from the entire past history. However, in some cases, we believe that all past inputs should have an influence on the next element of the sequence. RNNs are useful when we believe that the distribution over y^(t) may depend on a value of y^(i) from the distant past in a way that is not captured by the effect of y^(i) on y^(t-1).

One way to interpret an RNN as a graphical model is to view the RNN as defining a graphical model whose structure is the complete graph, able to represent direct dependencies between any pair of y values. The graphical model over the y values with the complete graph structure is shown in Fig. 10.7. The complete graph interpretation of the RNN is based on ignoring the hidden units h^(t) by marginalizing them out of the model.

It is more interesting to consider the graphical model structure of RNNs that results from regarding the hidden units h^(t) as random variables. (The conditional distribution over these variables given their parents is deterministic; this is perfectly legitimate, though it is somewhat rare to design a graphical model with deterministic hidden units.)
Including the hidden units in the graphical model reveals that the RNN provides a very efficient parametrization of the joint distribution over the observations. Suppose that we represented an arbitrary joint distribution over discrete values with a tabular representation: an array containing a separate entry for each possible assignment of values, with the value of that entry giving the probability of that assignment occurring. If y can take on k different values, the tabular representation would have O(k^τ) parameters. By comparison, due to parameter sharing, the number of parameters in the RNN is O(1) as a function of sequence length.
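A back-of-envelope count makes the contrast concrete (the particular k, τ, and hidden size are illustrative, not values from the text):

```python
# Tabular representation vs. RNN parameter sharing.
k, tau = 10, 20          # 10 possible symbol values, sequences of length 20
n_hid = 128              # hidden units in a hypothetical self-looping RNN

tabular_entries = k ** tau                 # one entry per possible sequence
# An RNN as in Eq. 10.5 reuses U, W, V and the biases at every step, so its
# size is independent of tau:
rnn_params = (n_hid * k        # U: input (one-hot symbol) to hidden
              + n_hid * n_hid  # W: hidden to hidden
              + k * n_hid      # V: hidden to output
              + n_hid + k)     # b, c biases
print(tabular_entries, rnn_params)   # 10^20 entries vs. ~2 * 10^4 parameters
```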
The number of parameters in the RNN may be adjusted to control model capacity but is not forced to scale with sequence length. Eq. 10.5 shows that the RNN parametrizes long-term relationships between variables efficiently, using recurrent applications of the same function f and the same parameters θ at each time step. Fig. 10.8 illustrates the graphical model interpretation. Incorporating the h^(t) nodes in the graphical model decouples the past and the future, acting as an intermediate quantity between them. A variable y^(i) in the distant past may influence a variable y^(t) via its effect on h. The structure of this graph shows that the model can be efficiently parametrized by using the same conditional probability distributions at each time step, and that when the variables are all observed, the probability of the joint assignment of all variables can be evaluated efficiently.

Even with the efficient parametrization of the graphical model, some operations remain computationally challenging. For example, it is difficult to predict missing values in the middle of the sequence.

The price recurrent networks pay for their reduced number of parameters is that optimizing the parameters may be difficult.

The parameter sharing used in recurrent networks relies on the assumption that the same parameters can be used for different time steps. Equivalently, the assumption is that the conditional probability distribution over the variables at time t + 1 given the variables at time t is stationary, meaning that the relationship between the previous time step and the next time step does not depend on t. In principle, it would be possible to use t as an extra input at each time step and let the learner discover any time-dependence while sharing as much as it can between different time steps. This would already be much better than using a different conditional probability distribution for each t, but the network would then have to extrapolate when faced with new values of t.
The main op operation eration that we need to perform is To complete our view of an RNN as a graphical model, we must describe how perfectly thoughthe it ismo somewhat raremain to design a graphical model such to draw legitimate, samples from del. The operation that wewith need todeterministic perform is hidden units.
simply to sample from the conditional distribution at each time step. However, there is one additional complication. The RNN must have some mechanism for determining the length of the sequence. This can be achieved in various ways.

In the case when the output is a symbol taken from a vocabulary, one can add a special symbol corresponding to the end of a sequence (Schmidhuber, 2012). When that symbol is generated, the sampling process stops. In the training set, we insert this symbol as an extra member of the sequence, immediately after x^(τ) in each training example.
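A minimal sampling loop using such an end-of-sequence symbol might look as follows; the vocabulary size, architecture, and safety cap are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

# Sampling with a special end-of-sequence symbol. Vocabulary: symbols 0..k-1,
# with index k reserved for <EOS>.
k, n_hid = 4, 8
EOS = k
U = rng.normal(size=(n_hid, k + 1)); W = 0.5 * rng.normal(size=(n_hid, n_hid))
V = rng.normal(size=(k + 1, n_hid)); b = np.zeros(n_hid); c = np.zeros(k + 1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sample(max_len=50):
    """Draw a symbol from the conditional at each step; stop when <EOS> appears."""
    h = np.zeros(n_hid)
    prev = np.zeros(k + 1)
    out = []
    for _ in range(max_len):          # safety cap for the sketch
        h = np.tanh(W @ h + U @ prev + b)
        y = rng.choice(k + 1, p=softmax(V @ h + c))
        if y == EOS:                  # generated the end-of-sequence symbol: stop
            break
        out.append(int(y))
        prev = np.eye(k + 1)[y]
    return out

seq = sample()
```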
For example, it may be applied to an RNN that emits a sequence of real numbers. The new output unit is usually a sigmoid unit trained with the cross-entropy loss. In this approach the sigmoid is trained to maximize the log-probability of the correct prediction as to whether the sequence ends or continues at each time step.

Another way to determine the sequence length τ is to add an extra output to the model that predicts the integer τ itself. The model can sample a value of τ and then sample τ steps worth of data. This approach requires adding an extra input to the recurrent update at each time step so that the recurrent update is aware of whether it is near the end of the generated sequence. This extra input can either consist of the value of τ or can consist of τ − t, the number of remaining time steps. Without this extra input, the RNN might generate sequences that end abruptly, such as a sentence that ends before it is complete.
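As a concrete sketch of this last strategy, the following NumPy fragment (not from the book; all weights, sizes and distributions are illustrative placeholders) first samples τ and then conditions each recurrent update on the countdown τ − t:

```python
import numpy as np

# Minimal sketch: sample the sequence length tau first, then generate tau
# steps of real-valued data, feeding the number of remaining steps (tau - t)
# into the recurrent update as an extra input. Weights are random placeholders.
rng = np.random.default_rng(0)
x_size, hidden_size, max_len = 3, 6, 10

W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden-to-hidden
U = rng.normal(scale=0.1, size=(hidden_size, x_size))       # input-to-hidden
u_count = rng.normal(scale=0.1, size=hidden_size)           # weight on tau - t
V = rng.normal(scale=0.1, size=(x_size, hidden_size))       # hidden-to-output

def sample_sequence():
    # Step 1: sample tau itself. Here a placeholder uniform distribution is
    # used; in the text tau would come from an extra output of the model.
    tau = int(rng.integers(1, max_len + 1))
    h = np.zeros(hidden_size)
    x = np.zeros(x_size)
    xs = []
    # Step 2: sample tau steps worth of data.
    for t in range(1, tau + 1):
        remaining = tau - t                              # countdown input
        h = np.tanh(W @ h + U @ x + u_count * remaining)
        x = V @ h + rng.normal(scale=0.1, size=x_size)   # real-valued sample
        xs.append(x)
    return tau, np.stack(xs)

tau, xs = sample_sequence()
```

Because the update sees `remaining`, the hidden state can, in principle, learn to wind generation down rather than stopping abruptly.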
This approach is based on the decomposition

P(x^(1), ..., x^(τ)) = P(τ) P(x^(1), ..., x^(τ) | τ).   (10.27)

The strategy of predicting τ directly is used for example by Goodfellow et al. (2014d).

In the previous section we described how an RNN could correspond to a directed graphical model over a sequence of random variables y^(t) with no inputs x. Of course, our development of RNNs as in Eq. 10.8 included a sequence of inputs x^(1), x^(2), ..., x^(τ). In general, RNNs allow the extension of the graphical model view to represent not only a joint distribution over the y variables but also a
conditional distribution over y given x. As discussed in the context of feedforward networks in Sec. 6.2.1.1, any model representing a variable P(y; θ) can be reinterpreted as a model representing a conditional distribution P(y | ω) with ω = θ. We can extend such a model to represent a distribution P(y | x) by using the same P(y | ω) as before, but making ω a function of x. In the case of an RNN, this can be achieved in different ways. We review here the most common and obvious choices.

Previously, we have discussed RNNs that take a sequence of vectors x^(t) for t = 1, ..., τ as input. Another option is to take only a single vector x as input. When x is a fixed-size vector, we can simply make it an extra input of the RNN that generates the y sequence. Some common ways of providing an extra input to an RNN are:

1. as an extra input at each time step, or
2. as the initial state h^(0), or
3. both.
The first and most common approach is illustrated in Fig. 10.9. The interaction between the input x and each hidden unit vector h^(t) is parametrized by a newly introduced weight matrix R that was absent from the model of only the sequence of y values. The same product x^T R is added as additional input to the hidden units at every time step. We can think of the choice of x as determining the value of x^T R that is effectively a new bias parameter used for each of the hidden units. The weights remain independent of the input. We can think of this model as taking the parameters θ of the non-conditional model and turning them into ω, where the bias parameters within ω are now a function of the input.
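A rough sketch of this conditioning scheme (not from the book; shapes and weights are illustrative placeholders) shows how the product x^T R acts as a fixed extra bias reused at every time step:

```python
import numpy as np

# Conditioning an RNN on a single fixed-size vector x, as in Fig. 10.9:
# R.T applied to x yields a vector that is added to the hidden pre-activation
# at every time step, acting as an input-dependent bias. Weights are random
# placeholders; sizes are illustrative.
rng = np.random.default_rng(1)
x_size, hidden_size, T = 4, 6, 3

W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden-to-hidden
R = rng.normal(scale=0.1, size=(x_size, hidden_size))       # new matrix R
b = np.zeros(hidden_size)                                   # ordinary bias

def run(x):
    extra_bias = x @ R            # x^T R: computed once, reused at every step
    h = np.zeros(hidden_size)
    states = []
    for _ in range(T):
        h = np.tanh(W @ h + extra_bias + b)
        states.append(h.copy())
    return np.stack(states)

states = run(rng.normal(size=x_size))
```

Note that `extra_bias` is computed once: changing x changes the effective bias of every hidden unit, which is exactly the "parameters become a function of the input" view described above.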
Rather than receiving only a single vector x as input, the RNN may receive a sequence of vectors x^(t) as input. The RNN described in Eq. 10.8 corresponds to a conditional distribution P(y^(1), ..., y^(τ) | x^(1), ..., x^(τ)) that makes a conditional independence assumption that this distribution factorizes as

∏_t P(y^(t) | x^(1), ..., x^(t)).   (10.28)

To remove the conditional independence assumption, we can add connections from the output at time t to the hidden unit at time t + 1, as shown in Fig. 10.10. The model can then represent arbitrary probability distributions over the y sequence. This kind of model representing a distribution over a sequence given another sequence still has one restriction, which is that the length of both sequences must be the same. We describe how to remove this restriction in Sec. 10.4.
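A minimal sketch of these output-to-hidden feedback connections (not from the book; all weights and sizes are illustrative placeholders) feeds the previously sampled output back into the next hidden update:

```python
import numpy as np

# Sketch of the architecture of Fig. 10.10: the output sampled at time t - 1
# feeds back into the hidden state at time t, so the y values are no longer
# conditionally independent given x. Weights are random placeholders.
rng = np.random.default_rng(2)
x_size, hidden_size, y_size, T = 3, 5, 4, 6

W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))   # h -> h
U = rng.normal(scale=0.1, size=(hidden_size, x_size))        # x -> h
Wy = rng.normal(scale=0.1, size=(hidden_size, y_size))       # previous y -> h
V = rng.normal(scale=0.1, size=(y_size, hidden_size))        # h -> y logits

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def sample(xs):
    h = np.zeros(hidden_size)
    y_prev = np.zeros(y_size)
    ys = []
    for x in xs:
        h = np.tanh(W @ h + U @ x + Wy @ y_prev)  # feedback from last output
        p = softmax(V @ h)
        s = int(rng.choice(y_size, p=p))
        y_prev = np.eye(y_size)[s]                # one-hot of sampled output
        ys.append(s)
    return ys

ys = sample(rng.normal(size=(T, x_size)))
```

The `Wy @ y_prev` term is the only addition relative to the plain conditional RNN, yet it is what allows the model to represent arbitrary distributions over the y sequence.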
Figure 10.9: An RNN that maps a fixed-length vector x into a distribution over sequences Y. This RNN is appropriate for tasks such as image captioning, where a single image is used as input to a model that then produces a sequence of words describing the image. Each element y^(t) of the observed output sequence serves both as input (for the current time step) and, during training, as target (for the previous time step).
Figure 10.10: A conditional recurrent neural network mapping a variable-length sequence of x values into a distribution over sequences of y values of the same length. Compared to Fig. 10.3, this RNN contains connections from the previous output to the current state. These connections allow this RNN to model an arbitrary distribution over sequences of y given sequences of x of the same length. The RNN of Fig. 10.3 is only able to represent distributions in which the y values are conditionally independent from each other given the x values.
Figure 10.11: Computation of a typical bidirectional recurrent neural network, meant to learn to map input sequences x to target sequences y, with loss L^(t) at each step t. The h recurrence propagates information forward in time (towards the right) while the g recurrence propagates information backward in time (towards the left). Thus at each point t, the output units o^(t) can benefit from a relevant summary of the past in its h^(t) input and from a relevant summary of the future in its g^(t) input.
10.3
Bidirectional RNNs
All of the recurrent networks we have considered up to now have a "causal" structure, meaning that the state at time t only captures information from the past, x^(1), ..., x^(t−1), and the present input x^(t). Some of the models we have discussed also allow information from past y values to affect the current state when the y values are available.

However, in many applications we want to output a prediction of y^(t) which may depend on the whole input sequence. For example, in speech recognition, the correct interpretation of the current sound as a phoneme may depend on the next few phonemes because of co-articulation and potentially may even depend on the next few words because of the linguistic dependencies between nearby words: if there are two interpretations of the current word that are both acoustically plausible, we may have to look far into the future (and the past) to disambiguate them.
This is also true of handwriting recognition and many other sequence-to-sequence learning tasks, described in the next section.

Bidirectional recurrent neural networks (or bidirectional RNNs) were invented to address that need (Schuster and Paliwal, 1997). They have been extremely successful (Graves, 2012) in applications where that need arises, such as handwriting recognition (Graves et al., 2008; Graves and Schmidhuber, 2009), speech recognition (Graves and Schmidhuber, 2005; Graves et al., 2013) and bioinformatics (Baldi et al., 1999).

As the name suggests, bidirectional RNNs combine an RNN that moves forward through time beginning from the start of the sequence with another RNN that moves backward through time beginning from the end of the sequence. Fig. 10.11 illustrates the typical bidirectional RNN, with h^(t) standing for the state of the sub-RNN that moves forward through time and g^(t) standing for the state of the sub-RNN that moves backward through time. This allows the output units o^(t) to compute a representation that depends on both the past and the future but is most sensitive to the input values around time t, without having to specify a fixed-size window around t (as one would have to do with a feedforward network, a convolutional network, or a regular RNN with a fixed-size look-ahead buffer).

This idea can be naturally extended to 2-dimensional input, such as images, by having four RNNs, each one going in one of the four directions: up, down, left, right. At each point (i, j) of a 2-D grid, an output O_{i,j} could then compute a representation that would capture mostly local information but could also depend on long-range inputs, if the RNN is able to learn to carry that information.

Compared to a convolutional network, RNNs applied to images are typically more
expensive but allow for long-range lateral interactions between features in the same feature map (Visin et al., 2015; Kalchbrenner et al., 2015). Indeed, the forward propagation equations for such RNNs may be written in a form that shows they use a convolution that computes the bottom-up input to each layer, prior to the recurrent propagation across the feature map that incorporates the lateral interactions.
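The bidirectional construction described above can be sketched in a few lines of NumPy (not from the book; weights and sizes are illustrative placeholders): one cell is run forward over the sequence, another backward, and the per-step states are concatenated so the output at step t sees both directions:

```python
import numpy as np

# Minimal bidirectional-RNN sketch (cf. Fig. 10.11): a forward pass produces
# h(t), a backward pass over the reversed input produces g(t), and the output
# representation at step t is [h(t); g(t)]. Weights are random placeholders.
rng = np.random.default_rng(3)
x_size, hidden_size, T = 4, 5, 7

def make_cell():
    W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
    U = rng.normal(scale=0.1, size=(hidden_size, x_size))
    return W, U

(Wf, Uf), (Wb, Ub) = make_cell(), make_cell()

def run(xs):
    h = np.zeros(hidden_size)
    hs = []
    for x in xs:                       # forward recurrence: past -> future
        h = np.tanh(Wf @ h + Uf @ x)
        hs.append(h)
    g = np.zeros(hidden_size)
    gs = []
    for x in reversed(xs):             # backward recurrence: future -> past
        g = np.tanh(Wb @ g + Ub @ x)
        gs.append(g)
    gs.reverse()                       # realign g(t) with time step t
    return np.stack([np.concatenate([h_t, g_t]) for h_t, g_t in zip(hs, gs)])

out = run(list(rng.normal(size=(T, x_size))))
```

Each row of `out` has twice the hidden size, reflecting that the representation at step t summarizes both the past (through h) and the future (through g).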
10.4
Encoder-Decoder Sequence-to-Sequence Architectures

We have seen in Fig. 10.5 how an RNN can map an input sequence to a fixed-size vector. We have seen in Fig. 10.9 how an RNN can map a fixed-size vector to a sequence. We have seen in Fig. 10.3, Fig. 10.4, Fig. 10.10 and Fig. 10.11 how an RNN can map an input sequence to an output sequence of the same length.

Here we discuss how an RNN can be trained to map an input sequence to an output sequence which is not necessarily of the same length. This comes up in many applications, such as speech recognition, machine translation or question answering, where the input and output sequences in the training set are generally not of the same length (although their lengths might be related).

We often call the input to the RNN the "context." We want to produce a representation of this context, C.
The context C might be a vector or sequence of vectors that summarize the input sequence X = (x^(1), ..., x^(n_x)).

The simplest RNN architecture for mapping a variable-length sequence to another variable-length sequence was first proposed by Cho et al. (2014a) and shortly after by Sutskever et al. (2014), who independently developed that architecture and were the first to obtain state-of-the-art translation using this approach. The former system is based on scoring proposals generated by another machine translation system, while the latter uses a standalone recurrent network to generate the translations. These authors respectively called this architecture, illustrated in Fig. 10.12, the encoder-decoder or sequence-to-sequence architecture. The idea is very simple: (1) an encoder or reader or input RNN processes the input sequence. The encoder emits the context C, usually as a simple function of its final hidden state. (2) a decoder or writer or output RNN is conditioned on that fixed-length vector (just like in Fig. 10.9) to generate the output sequence Y = (y^(1), ..., y^(n_y)). The innovation of this kind of architecture over those presented in earlier sections of this chapter is that the lengths n_x and n_y can vary from each other, while previous architectures constrained n_x = n_y = τ. In a sequence-to-sequence architecture, the two RNNs
Figure 10.12: Example of an encoder-decoder or sequence-to-sequence RNN architecture, for learning to generate an output sequence (y^(1), ..., y^(n_y)) given an input sequence (x^(1), ..., x^(n_x)). It is composed of an encoder RNN that reads the input sequence and a decoder RNN that generates the output sequence (or computes the probability of a given output sequence). The final hidden state of the encoder RNN is used to compute a generally fixed-size context variable C which represents a semantic summary of the input sequence and is given as input to the decoder RNN.
are trained jointly to maximize the average of log P(y^(1), ..., y^(n_y) | x^(1), ..., x^(n_x)) over all the pairs of x and y sequences in the training set. The last state h_{n_x} of the encoder RNN is typically used as a representation C of the input sequence that is provided as input to the decoder RNN.

If the context C is a vector, then the decoder RNN is simply a vector-to-sequence RNN as described in Sec. 10.2.4. As we have seen, there are at least two ways for a vector-to-sequence RNN to receive input. The input can be provided as the initial state of the RNN, or the input can be connected to the hidden units at each time step. These two ways can also be combined.

There is no constraint that the encoder must have the same size of hidden layer as the decoder.

One clear limitation of this architecture is when the context C output by the encoder RNN has a dimension that is too small to properly summarize a long sequence. This phenomenon was observed by Bahdanau et al. (2015) in the context of machine translation. They proposed to make C a variable-length sequence rather than a fixed-size vector. Additionally, they introduced an attention mechanism that learns to associate elements of the sequence C to elements of the output sequence. See Sec. 12.4.5.1 for more details.
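The encoder-decoder scheme described above can be sketched as follows (not from the book; all weights, sizes and the exact way C enters the decoder are illustrative assumptions):

```python
import numpy as np

# Encoder-decoder sketch (cf. Fig. 10.12): the encoder reads a length-n_x
# input sequence and summarizes it as a context C (its final hidden state);
# the decoder, conditioned on C at every step, emits n_y outputs, where n_y
# is free to differ from n_x. Weights are random placeholders.
rng = np.random.default_rng(4)
x_size, hidden_size, y_size = 3, 6, 4

We = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # encoder h -> h
Ue = rng.normal(scale=0.1, size=(hidden_size, x_size))       # encoder x -> h
Wd = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # decoder h -> h
Uc = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # context -> h
V = rng.normal(scale=0.1, size=(y_size, hidden_size))        # decoder h -> y

def encode(xs):
    h = np.zeros(hidden_size)
    for x in xs:
        h = np.tanh(We @ h + Ue @ x)
    return h                           # context C = final encoder state

def decode(C, n_y):
    h = np.tanh(Uc @ C)                # C initializes the decoder state
    ys = []
    for _ in range(n_y):
        h = np.tanh(Wd @ h + Uc @ C)   # C is also fed at every step
        ys.append(V @ h)
    return np.stack(ys)

C = encode(list(rng.normal(size=(5, x_size))))   # n_x = 5
ys = decode(C, n_y=8)                            # n_y = 8: lengths differ
```

Feeding C both as the initial state and at every step combines the two conditioning options discussed in the text; either one alone would also be a valid choice.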
10.5
Deep Recurrent Networks
The computation in most RNNs can be decomposed into three blocks of parameters and associated transformations:

1. from the input to the hidden state,
2. from the previous hidden state to the next hidden state, and
3. from the hidden state to the output.

With the RNN architecture of Fig. 10.3, each of these three blocks is associated with a single weight matrix. In other words, when the network is unfolded, each of these corresponds to a shallow transformation. By a shallow transformation, we mean a transformation that would be represented by a single layer within a deep MLP. Typically this is a transformation represented by a learned affine transformation followed by a fixed nonlinearity.

Would it be advantageous to introduce depth in each of these operations? Experimental evidence (Graves et al., 2013; Pascanu et al., 2014a) strongly suggests so. The experimental evidence is in agreement with the idea that we need enough
Figure 10.13: A recurrent neural network can be made deep in many ways (Pascanu et al., 2014a). (a) The hidden recurrent state can be broken down into groups organized hierarchically. (b) Deeper computation (e.g., an MLP) can be introduced in the input-to-hidden, hidden-to-hidden and hidden-to-output parts. This may lengthen the shortest path linking different time steps. (c) The path-lengthening effect can be mitigated by introducing skip connections.
depth in order to perform the required mappings. See also Schmidh Schmidhub ub uber er (1992), El Hihi and Bengio (1996), or Jaeger (2007a) for earlier work on deep RNNs. depth in order to perform the required mappings. See also Schmidhuber (1992), Gra Graves ves et al. (2013) were the first to show a significant benefit of decomp decomposing osing El Hihi and Bengio (1996), or Jaeger (2007a) for earlier work on deep RNNs. the state of an RNN in into to multiple la lay yers as in Fig. 10.13 (left). We can think Gra ves et al. ( 2013 ) w ere the first to show ainsignificant ofying decomp osing of the low lower er la lay yers in the hierarc hierarch hy depicted Fig. 10.13baenefit as pla playing a role in the state of an to multiple layers as in Fig. 10.13 (left).appropriate, We can think transforming theRNN ra raw w in input in into to a representation that is more at of the lowerlev laels yersofinthe thehidden hierarcstate. hy depicted in Fig. a as plaaying role in the higher levels Pascan ascanu u et al.10.13 (2014a ) go stepa further transforming raewa separate input into a representation that more appropriate, at and prop propose ose tothe hav have MLP (p (possibly ossibly deep) for is eac each h of the three blo blocks cks the higher lev els of the hidden state. P ascan u et al. ( 2014a ) go a step further en enumerated umerated ab abov ov ove, e, as illustrated in Fig. 10.13b. Considerations of representational and prop ose to hav a cate separate MLP (possibly deep) eacthree h of the three cks capacit capacity y suggest to eallo allocate enough capacity in each of for these steps, butblo doing enumerated ove, as illustrated in Fig.by10.13 b. Considerations of representational so by addingab depth may hurt learning making optimization difficult. 
In general, it is easier to optimize shallower architectures, and adding the extra depth of Fig. 10.13b makes the shortest path from a variable in time step t to a variable in time step t + 1 become longer. For example, if an MLP with a single hidden layer is used for the state-to-state transition, we have doubled the length of the shortest path between variables in any two different time steps, compared with the ordinary RNN of Fig. 10.3. However, as argued by Pascanu et al. (2014a), this can be mitigated by introducing skip connections in the hidden-to-hidden path, as illustrated in Fig. 10.13c.
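As a concrete illustration, here is a minimal NumPy sketch of one time step of such an architecture. All names, sizes, and the single-hidden-layer choice are illustrative assumptions, not the parameterization of Pascanu et al.: the state-to-state transition is an MLP, and a skip connection keeps a length-one path between consecutive hidden states.

```python
import numpy as np

rng = np.random.default_rng(0)
n_h, n_x = 16, 8

# hypothetical parameters for a deep-transition RNN step (cf. Fig. 10.13b/c)
W_in   = rng.normal(0.0, 0.1, (n_h, n_x))  # input-to-hidden
W_mid  = rng.normal(0.0, 0.1, (n_h, n_h))  # hidden layer of the transition MLP
W_out  = rng.normal(0.0, 0.1, (n_h, n_h))  # MLP output back to the state
W_skip = rng.normal(0.0, 0.1, (n_h, n_h))  # skip connection (Fig. 10.13c)

def step(h, x):
    # deep (MLP) state-to-state transition: lengthens the shortest path
    # between time steps from 1 to 2
    z = np.tanh(W_mid @ h + W_in @ x)
    # the skip connection restores a length-1 path between time steps
    return np.tanh(W_out @ z + W_skip @ h)

h = np.zeros(n_h)
for x in rng.normal(size=(5, n_x)):
    h = step(h, x)
```

Without the `W_skip @ h` term, every path from h at one time step to the next would pass through the intermediate layer z, which is the path-lengthening effect discussed above.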
10.6 Recursive Neural Networks
Recursive neural networks represent yet another generalization of recurrent networks, with a different kind of computational graph, which is structured as a deep tree, rather than the chain-like structure of RNNs. The typical computational graph for a recursive network is illustrated in Fig. 10.14. Recursive neural networks were introduced by Pollack (1990) and their potential use for learning to reason was described by Bottou (2011). Recursive networks have been successfully applied to processing data structures as input to neural nets (Frasconi et al., 1997, 1998), in natural language processing (Socher et al., 2011a,c, 2013a) as well as in computer vision (Socher et al., 2011b).
One clear advantage of recursive nets over recurrent nets is that for a sequence of the same length τ, the depth (measured as the number of compositions of nonlinear operations) can be drastically reduced from τ to O(log τ), which might help deal with long-term dependencies. An open question is how to best structure the tree. One option is to have a tree structure which does not depend on the data, such as a balanced binary tree. (We suggest not abbreviating "recursive neural network" as "RNN," to avoid confusion with "recurrent neural network.")
Figure 10.14: A recursive network has a computational graph that generalizes that of the recurrent network from a chain to a tree. A variable-size sequence x(1), x(2), . . . , x(t) can be mapped to a fixed-size representation (the output o), with a fixed set of parameters (the weight matrices U, V, W). The figure illustrates a supervised learning case in which some target y is provided which is associated with the whole sequence.
In some application domains, external methods can suggest the appropriate tree structure. For example, when processing natural language sentences, the tree structure for the recursive network can be fixed to the structure of the parse tree of the sentence provided by a natural language parser (Socher et al., 2011a, 2013a). Ideally, one would like the learner itself to discover and infer the tree structure that is appropriate for any given input, as suggested by Bottou (2011).

Many variants of the recursive net idea are possible. For example, Frasconi et al. (1997) and Frasconi et al. (1998) associate the data with a tree structure, and associate the inputs and targets with individual nodes of the tree. The computation performed by each node does not have to be the traditional artificial neuron computation (affine transformation of all inputs followed by a monotone nonlinearity). For example, Socher et al.
(2013a) propose using tensor operations and bilinear forms, which have previously been found useful to model relationships between concepts (Weston et al., 2010; Bordes et al., 2012) when the concepts are represented by continuous vectors (embeddings).
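The depth reduction from τ to O(log τ) is easy to see in code. The following sketch is illustrative only; the weight shapes, the single shared composition function, and the restriction to power-of-two sequence lengths are all assumptions. It reduces a sequence over a balanced binary tree:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
U = rng.normal(0.0, 0.5, (d, 2 * d))  # shared weights applied at every tree node

def compose(left, right):
    # one tree node: affine transformation of both children, then a nonlinearity
    return np.tanh(U @ np.concatenate([left, right]))

def recursive_reduce(xs):
    """Reduce a length-2^k sequence over a balanced binary tree."""
    depth = 0
    while len(xs) > 1:
        xs = [compose(xs[i], xs[i + 1]) for i in range(0, len(xs), 2)]
        depth += 1
    return xs[0], depth

seq = [rng.normal(size=d) for _ in range(8)]  # tau = 8
root, depth = recursive_reduce(seq)
print(depth)  # 3 compositions deep, versus 8 for a chain-structured RNN
```

Here the root representation is only log2(8) = 3 nonlinear compositions away from any leaf, whereas a chain-structured RNN would place the first input τ = 8 compositions away from the final state.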
10.7 The Challenge of Long-Term Dependencies
The mathematical challenge of learning long-term dependencies in recurrent networks was introduced in Sec. 8.2.5. The basic problem is that gradients propagated over many stages tend to either vanish (most of the time) or explode (rarely, but with much damage to the optimization). Even if we assume that the parameters are such that the recurrent network is stable (can store memories, with gradients not exploding), the difficulty with long-term dependencies arises from the exponentially smaller weights given to long-term interactions (involving the multiplication of many Jacobians) compared to short-term ones. Many other sources provide a deeper treatment (Hochreiter, 1991; Doya, 1993; Bengio et al., 1994; Pascanu et al., 2013a). In this section, we describe the problem in more detail. The remaining sections describe approaches to overcoming the problem.
Recurrent networks involve the composition of the same function multiple times, once per time step. These compositions can result in extremely nonlinear behavior, as illustrated in Fig. 10.15.

In particular, the function composition employed by recurrent neural networks somewhat resembles matrix multiplication. We can think of the recurrence relation

    h(t) = W⊤ h(t−1)                                    (10.29)

as a very simple recurrent neural network lacking a nonlinear activation function,
Figure 10.15: When composing many nonlinear functions (like the linear-tanh layer shown here), the result is highly nonlinear, typically with most of the values associated with a tiny derivative, some values with a large derivative, and many alternations between increasing and decreasing. In this plot, we plot a linear projection of a 100-dimensional hidden state down to a single dimension, plotted on the y-axis. The x-axis is the coordinate of the initial state along a random direction in the 100-dimensional space. We can thus view this plot as a linear cross-section of a high-dimensional function. The plots show the function after each time step, or equivalently, after each number of times the transition function has been composed.
and lacking inputs x. As described in Sec. 8.2.5, this recurrence relation essentially describes the power method. It may be simplified to

    h(t) = (W^t)⊤ h(0),                                 (10.30)

and if W admits an eigendecomposition

    W = Q Λ Q⊤,                                         (10.31)

with orthogonal Q, the recurrence may be simplified further to

    h(t) = Q Λ^t Q⊤ h(0).                               (10.32)
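The closed form above can be checked numerically. The sketch below is a toy 3-dimensional example with made-up eigenvalues: it builds W from an orthogonal Q and a diagonal Λ, iterates the recurrence of Eq. 10.29, and compares against the closed form; the component of h(0) along the eigenvalue-0.5 eigenvector is quickly crushed, while the component along the eigenvalue-1.1 eigenvector grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# W = Q Lambda Q^T with an orthogonal Q and chosen (made-up) eigenvalues
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
lam = np.array([1.1, 0.9, 0.5])
W = Q @ np.diag(lam) @ Q.T

h0 = rng.normal(size=3)
t = 50

# iterate the recurrence h(t) = W^T h(t-1)
h = h0.copy()
for _ in range(t):
    h = W.T @ h

# closed form h(t) = Q Lambda^t Q^T h(0)
h_closed = Q @ np.diag(lam ** t) @ Q.T @ h0
assert np.allclose(h, h_closed)

# coordinates of h(t) in the eigenbasis: components with |lambda| < 1
# have decayed toward zero after 50 steps
coords = Q.T @ h
```

Note that with Λ diagonal, W here is symmetric, which is what makes the transpose in Eq. 10.30 harmless in this example.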
The eigenvalues are raised to the power of t, causing eigenvalues with magnitude less than one to decay to zero and eigenvalues with magnitude greater than one to explode. Any component of h(0) that is not aligned with the largest eigenvector will eventually be discarded.

This problem is particular to recurrent networks. In the scalar case, imagine multiplying a weight w by itself many times. The product w^t will either vanish or explode depending on the magnitude of w. However, if we make a non-recurrent network that has a different weight w(t) at each time step, the situation is different. If the initial state is given by 1, then the state at time t is given by the product of the w(t). Suppose that the w(t) values are generated randomly, independently from one another, with zero mean and variance v.
The variance of the product is O(v^n). To obtain some desired variance v∗ we may choose the individual weights with variance v = (v∗)^(1/n). Very deep feedforward networks with carefully chosen scaling can thus avoid the vanishing and exploding gradient problem, as argued by Sussillo (2014).

The vanishing and exploding gradient problem for RNNs was independently discovered by separate researchers (Hochreiter, 1991; Bengio et al., 1993, 1994). One may hope that the problem can be avoided simply by staying in a region of parameter space where the gradients do not vanish or explode. Unfortunately, in order to store memories in a way that is robust to small perturbations, the RNN must enter a region of parameter space where gradients vanish (Bengio et al., 1993, 1994). Specifically, whenever the model is able to represent long term dependencies, the gradient of a long term interaction has exponentially smaller magnitude than the gradient of a short term interaction.
It does not mean that it dep is impossible thelearn, gradien t of a long termtake interaction has exp onen magnitude than to but that it might a very long time to tially learn smaller long-term dep dependencies, endencies, the gradient of a short term in teraction. It does not mean that it is impossible because the signal ab about out these dep dependencies endencies will tend to be hidden by the smallest to learn, but arising that it from mightshort-term take a verydep long time to learn long-term endencies, fluctuations dependencies. endencies. In practice, thedep exp experiments eriments b ecause the signal ab out these dep endencies will tend to b e hidden by the smallest in Bengio et al. (1994) show that as we increase the span of the dependencies that fluctuations arising from short-term dependencies. In practice, the experiments in Bengio et al. (1994) show that as we 405 increase the span of the dependencies that
need to be captured, gradient-based optimization becomes increasingly difficult, with the probability of successful training of a traditional RNN via SGD rapidly reaching 0 for sequences of only length 10 or 20.

For a deeper treatment of the dynamical systems view of recurrent networks, see Doya (1993), Bengio et al. (1994) and Siegelmann and Sontag (1995), with a review in Pascanu et al. (2013a). The remaining sections of this chapter discuss various approaches that have been proposed to reduce the difficulty of learning long-term dependencies (in some cases allowing an RNN to learn dependencies across hundreds of steps), but the problem of learning long-term dependencies remains one of the main challenges in deep learning.
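The vanishing effect is easy to reproduce in a scalar toy model; everything below, including the weight value 0.9, is an illustrative assumption. The derivative of the final state with respect to the initial state is a product of per-step factors w · (1 − h_t²), each at most |w| in magnitude, so it shrinks at least as fast as |w|^span:

```python
import numpy as np

def grad_wrt_initial_state(span, w=0.9, seed=0):
    """|d h_T / d h_0| for a scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    rng = np.random.default_rng(seed)
    h, g = 0.0, 1.0
    for x in rng.normal(size=span):
        h = np.tanh(w * h + x)
        g *= w * (1.0 - h ** 2)  # chain rule through one time step
    return abs(g)

for span in (5, 10, 20, 50):
    print(span, grad_wrt_initial_state(span))
# the gradient magnitudes shrink (at least) exponentially with the span
```

With |w| > 1 and inputs small enough to keep tanh near its linear regime, the same product can instead explode, which is the other half of the difficulty described above.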
10.8 Echo State Networks
The recurrent weights mapping from h(t−1) to h(t) and the input weights mapping from x(t) to h(t) are some of the most difficult parameters to learn in a recurrent network. One proposed (Jaeger, 2003; Maass et al., 2002; Jaeger and Haas, 2004; Jaeger, 2007b) approach to avoiding this difficulty is to set the recurrent weights such that the recurrent hidden units do a good job of capturing the history of past inputs, and learn only the output weights. This is the idea that was independently proposed for echo state networks or ESNs (Jaeger and Haas, 2004; Jaeger, 2007b) and liquid state machines (Maass et al., 2002). The latter is similar, except that it uses spiking neurons (with binary outputs) instead of the continuous-valued hidden units used for ESNs. Both ESNs and liquid state machines are termed reservoir computing (Lukoševičius and Jaeger, 2009) to denote the fact
that the hidden units form a reservoir of temporal features which may capture different aspects of the history of inputs.

One way to think about these reservoir computing recurrent networks is that they are similar to kernel machines: they map an arbitrary length sequence (the history of inputs up to time t) into a fixed-length vector (the recurrent state h(t)), on which a linear predictor (typically a linear regression) can be applied to solve the problem of interest. The training criterion may then be easily designed to be convex as a function of the output weights. For example, if the output consists of linear regression from the hidden units to the output targets, and the training criterion is mean squared error, then it is convex and may be solved reliably with simple learning algorithms (Jaeger, 2003).
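A minimal sketch of this recipe follows; all sizes, the spectral radius of 0.9, and the next-step-prediction task are assumptions chosen for illustration. The input and recurrent weights are fixed at random, the reservoir is run over the sequence, and only a linear readout is fit, here by ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
n_h, T = 100, 500

# fixed random weights: only the readout below is learned
W_in = rng.normal(0.0, 0.1, (n_h, 1))
W = rng.normal(size=(n_h, n_h))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))  # rescale to spectral radius 0.9

x = np.sin(0.2 * np.arange(T))[:, None]  # toy input sequence
target = np.roll(x[:, 0], -1)            # task: predict the next input value

# run the reservoir to collect hidden states
H = np.zeros((T, n_h))
h = np.zeros(n_h)
for t in range(T):
    h = np.tanh(W @ h + W_in @ x[t])
    H[t] = h

# convex readout: least-squares regression from states to targets
w_out, *_ = np.linalg.lstsq(H[:-1], target[:-1], rcond=None)
mse = np.mean((H[:-1] @ w_out - target[:-1]) ** 2)
```

Because only `w_out` is trained and the criterion is mean squared error, the learning problem solved here is exactly the convex one described in the paragraph above.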
The important question is therefore: how do we set the input and recurrent weights so that a rich set of histories can be represented in the recurrent neural network state? The answer proposed in the reservoir computing literature is to
view the recurrent net as a dynamical system, and set the input and recurrent weights such that the dynamical system is near the edge of stability.

The original idea was to make the eigenvalues of the Jacobian of the state-to-state transition function be close to 1. As explained in Sec. 8.2.5, an important characteristic of a recurrent network is the eigenvalue spectrum of the Jacobians J(t) = ∂s(t)/∂s(t−1). Of particular importance is the spectral radius of J(t), defined to be the maximum of the absolute values of its eigenvalues.

To understand the effect of the spectral radius, consider the simple case of back-propagation with a Jacobian matrix J that does not change with t. This case happens, for example, when the network is purely linear. Suppose that J has
Consider what ensafter as we g, and n vector g , then after one step of back-propagation, we will hav have e J happ propagate a gradien t vector backw ards through time. If we b egin with a gradient n steps we will hav havee J g. No Now w consider what happ happens ens if we instead back-propagate g , then J g,step, n vaector after one back-propagation, willafter haveone and w after perturb perturbed ed version of g.step If wof e begin with g + δv,we then e will J g. nNosteps, steps w consider what happ back-propagate + δhav v). eAfter g +ifδwe v ). instead ha have ve Jw(egwill we will hav have e J n(ens From this we can see g g + δ v a perturb ed v ersion of . If w e begin with , then after one step, ge+will δv that bac back-propagation k-propagation starting from g and bac back-propagation k-propagation starting from w J ( g + δ v n J ( g + δ v ha ve ) . After steps, w e will hav e ) . F rom this w e can see n div diverge erge by δJ v after n steps of bac back-propagation. k-propagation. If v is chosen to be a unit g + δv that bac k-propagation starting from andmultiplication back-propagation starting from simply eigen eigenvector vector of J with eigen eigenv value by the Jacobian λ, gthen δJ v afteratneac divergethe bydifference steps of bac k-propagation. If vofisbac chosen to be a unit scales each h step. The two executions back-propagation k-propagation are eigen vector of with eigen v alue , then multiplication by the Jacobian simply J λ n separated by a distance of δ|λ| . When v corresp corresponds onds to the largest value of |λ| , scales the difference at eac h step. The t wo executions of an bacinitial k-propagation are this perturbation achiev achieves es the widest possible separation of perturbation separated of size δ . by a distance of δ λ . 
When v corresponds to the largest value of |λ|, this perturbation achieves the widest possible separation of an initial perturbation of size δ.

When |λ| > 1, the deviation size δ|λ|^n grows exponentially large. When |λ| < 1, the deviation size becomes exponentially small.

Of course, this example assumed that the Jacobian was the same at every time step, corresponding to a recurrent network with no nonlinearity. When a nonlinearity is present, the derivative of the nonlinearity will approach zero on many time steps, and help to prevent the explosion resulting from a large spectral radius. Indeed, the most recent work on echo state networks advocates using a spectral radius much larger than unity (Yildiz et al., 2012; Jaeger, 2012).

Everything we have said about back-propagation via repeated matrix multiplication applies equally to forward propagation in a network with no nonlinearity, where the state h(t+1)⊤ = h(t)⊤ W.

When a linear map W⊤ always shrinks h as measured by the L2 norm, then we say that the map is contractive. When the spectral radius is less than one, the mapping from h(t) to h(t+1) is contractive, so a small change becomes smaller after each time step. This necessarily makes the network forget information about the
past when we use a finite level of precision (such as 32 bit integers) to store the state vector.

The Jacobian matrix tells us how a small change of h(t) propagates one step forward, or equivalently, how the gradient on h(t+1) propagates one step backward, during back-propagation. Note that neither W nor J need to be symmetric (although they are square and real), so they can have complex-valued eigenvalues and eigenvectors, with imaginary components corresponding to potentially oscillatory behavior (if the same Jacobian was applied iteratively). Even though h(t) or a small variation of h(t) of interest in back-propagation are real-valued, they can be expressed in such a complex-valued basis. What matters is what happens to the magnitude (complex absolute value) of these possibly complex-valued basis coefficients, when we multiply the matrix by the vector. An eigenvalue with
eigenvalue with magnitude greater than one corresponds to magnification (exponential growth, if applied iteratively) or shrinking (exponential decay, if applied iteratively).

With a nonlinear map, the Jacobian is free to change at each step. The dynamics therefore become more complicated. However, it remains true that a small initial variation can turn into a large variation after several steps. One difference between the purely linear case and the nonlinear case is that the use of a squashing nonlinearity such as tanh can cause the recurrent dynamics to become bounded. Note that it is possible for back-propagation to retain unbounded dynamics even when forward propagation has bounded dynamics, for example, when a sequence of tanh units are all in the middle of their linear regime and are connected by weight matrices with spectral radius greater than 1.
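To make the linear case concrete, here is a small NumPy sketch (sizes and seed are illustrative) that rescales a random weight matrix to spectral radius 0.5 and iterates the purely linear map h^{(t+1)} = W h^{(t)}; the L2 norm of the state decays toward zero, so information about the initial state is eventually lost at any finite precision:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random 4x4 recurrent weight matrix, rescaled so its spectral radius is 0.5.
W = rng.standard_normal((4, 4))
W *= 0.5 / max(abs(np.linalg.eigvals(W)))

h = rng.standard_normal(4)
norms = [np.linalg.norm(h)]
for _ in range(20):
    h = W @ h                      # purely linear map: h^(t+1) = W h^(t)
    norms.append(np.linalg.norm(h))

# Spectral radius < 1: the state norm decays, asymptotically like 0.5^t,
# so any finite-precision representation eventually forgets the initial state.
print(norms[0], norms[-1])
```

Note that for a non-normal W a single step need not shrink the norm (that is governed by the largest singular value), but a spectral radius below one guarantees the asymptotic decay shown here.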
However, it is rare for all of the tanh units to simultaneously lie at their linear activation point.

The strategy of echo state networks is simply to fix the weights to have some spectral radius such as 3, where information is carried forward through time but does not explode due to the stabilizing effect of saturating nonlinearities like tanh.

More recently, it has been shown that the techniques used to set the weights in ESNs could be used to initialize the weights in a fully trainable recurrent network (with the hidden-to-hidden recurrent weights trained using back-propagation through time), helping to learn long-term dependencies (Sutskever, 2012; Sutskever et al., 2013). In this setting, an initial spectral radius of 1.2 performs well, combined with the sparse initialization scheme described in Sec. 8.4.
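A sketch of this weight-scaling recipe (the helper name and the 10% density are illustrative, not from the text):

```python
import numpy as np

def scale_to_spectral_radius(W, rho):
    """Rescale a recurrent weight matrix so its spectral radius equals rho."""
    return W * (rho / max(abs(np.linalg.eigvals(W))))

rng = np.random.default_rng(1)

# Sparse random recurrent weights (roughly 10% nonzero), in the spirit of
# the sparse initialization scheme of Sec. 8.4.
W = rng.standard_normal((100, 100)) * (rng.random((100, 100)) < 0.1)

W_esn = scale_to_spectral_radius(W, 3.0)   # fixed weights, echo state style
W_init = scale_to_spectral_radius(W, 1.2)  # initial weights for a trainable RNN
```

In the ESN case W_esn would stay fixed and only the output weights would be trained; W_init would instead serve as the starting point for back-propagation through time.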
10.9 Leaky Units and Other Strategies for Multiple Time Scales

One way to deal with long-term dependencies is to design a model that operates at multiple time scales, so that some parts of the model operate at fine-grained time scales and can handle small details, while other parts operate at coarse time scales and transfer information from the distant past to the present more efficiently. Various strategies for building both fine and coarse time scales are possible. These include the addition of skip connections across time, "leaky units" that integrate signals with different time constants, and the removal of some of the connections used to model fine-grained time scales.

One way to obtain coarse time scales is to add direct connections from variables in the distant past to variables in the present.
The idea of using such skip connections dates back to Lin et al. (1996) and follows from the idea of incorporating delays in feedforward neural networks (Lang and Hinton, 1988). In an ordinary recurrent network, a recurrent connection goes from a unit at time t to a unit at time t + 1. It is possible to construct recurrent networks with longer delays (Bengio, 1991).

As we have seen in Sec. 8.2.5, gradients may vanish or explode exponentially with respect to the number of time steps. Lin et al. (1996) introduced recurrent connections with a time-delay of d to mitigate this problem. Gradients now diminish exponentially as a function of τ/d rather than τ. Since there are both delayed and single step connections, gradients may still explode exponentially in τ. This allows the learning algorithm to capture longer dependencies, although not all long-term dependencies may be represented well in this way.
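A minimal sketch of a recurrent state update with an added time-delayed (skip) connection; the matrices, sizes, and delay d are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, T = 8, 5, 30                       # state size, skip delay, sequence length

W1 = 0.1 * rng.standard_normal((n, n))   # ordinary connection from h^(t-1)
Wd = 0.1 * rng.standard_normal((n, n))   # skip connection from h^(t-d)
U = rng.standard_normal((n, 3))
x = rng.standard_normal((T, 3))

h = [np.zeros(n)]                        # h[t] holds the state at time t
for t in range(1, T + 1):
    h_skip = h[t - d] if t >= d else np.zeros(n)   # zero before h^(t-d) exists
    h.append(np.tanh(W1 @ h[t - 1] + Wd @ h_skip + U @ x[t - 1]))
```

Gradients can hop backward through the Wd edges in jumps of d steps, which is what makes them decay as a function of τ/d rather than τ.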
Another way to obtain paths on which the product of derivatives is close to one is to have units with self-connections and a weight near one on these connections.

When we accumulate a running average µ^{(t)} of some value v^{(t)} by applying the update µ^{(t)} ← αµ^{(t−1)} + (1 − α)v^{(t)}, the α parameter is an example of a linear self-connection from µ^{(t−1)} to µ^{(t)}. When α is near one, the running average remembers information about the past for a long time, and when α is near zero, information about the past is rapidly discarded. Hidden units with linear self-connections can behave similarly to such running averages. Such hidden units are called leaky units.
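The running-average behavior of a leaky unit can be seen directly (the α values and the toy signal are illustrative):

```python
import numpy as np

def leaky_unit(alpha, v):
    """Apply mu^(t) <- alpha * mu^(t-1) + (1 - alpha) * v^(t) over a sequence."""
    mu, trace = 0.0, []
    for vt in v:
        mu = alpha * mu + (1 - alpha) * vt
        trace.append(mu)
    return np.array(trace)

v = np.concatenate([np.ones(50), np.zeros(50)])  # evidence present, then absent

slow = leaky_unit(0.99, v)   # alpha near one: remembers the past for a long time
fast = leaky_unit(0.10, v)   # alpha near zero: discards the past rapidly
```

After 50 steps without input, the slow unit still retains a sizable fraction of the accumulated evidence, while the fast unit has forgotten it almost entirely.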
Skip connections through d time steps are a way of ensuring that a unit can always learn to be influenced by a value from d time steps earlier. The use of a linear self-connection with a weight near one is a different way of ensuring that the unit can access values from the past. The linear self-connection approach allows this effect to be adapted more smoothly and flexibly by adjusting the real-valued α rather than by adjusting the integer-valued skip length.

These ideas were proposed by Mozer (1992) and by El Hihi and Bengio (1996). Leaky units were also found to be useful in the context of echo state networks (Jaeger et al., 2007).

There are two basic strategies for setting the time constants used by leaky units. One strategy is to manually fix them to values that remain constant, for example by sampling their values from some distribution once at initialization time. Another strategy is to make the time constants free parameters and learn them.
Having such leaky units at different time scales appears to help with long-term dependencies (Mozer, 1992; Pascanu et al., 2013a).

Another approach to handle long-term dependencies is the idea of organizing the state of the RNN at multiple time-scales (El Hihi and Bengio, 1996), with information flowing more easily through long distances at the slower time scales.

This idea differs from the skip connections through time discussed earlier because it involves actively removing length-one connections and replacing them with longer connections. Units modified in such a way are forced to operate on a long time scale. Skip connections through time add edges. Units receiving such new connections may learn to operate on a long time scale but may also choose to focus on their other short-term connections.

There are different ways in which a group of recurrent units can be forced to operate at different time scales.
One option is to make the recurrent units leaky, but to have different groups of units associated with different fixed time scales. This was the proposal in Mozer (1992) and has been successfully used in Pascanu et al. (2013a). Another option is to have explicit and discrete updates taking place at different times, with a different frequency for different groups of units. This is the approach of El Hihi and Bengio (1996) and Koutnik et al. (2014). It worked well on a number of benchmark datasets.
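A sketch of the second strategy: two groups of units where the slow group receives explicit discrete updates only every few steps (the period, weights, and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
T, n, period = 16, 4, 4
W_fast = 0.5 * np.eye(n)
W_slow = 0.5 * np.eye(n)
U = rng.standard_normal((n, n))
x = rng.standard_normal((T, n))

fast = np.zeros(n)
slow = np.zeros(n)
for t in range(T):
    fast = np.tanh(W_fast @ fast + U @ x[t])   # updated at every time step
    if t % period == 0:                        # slow group: coarser time scale
        slow = np.tanh(W_slow @ slow + U @ x[t])
```

Because the slow group changes state only once per period, gradients propagated through it cross the same time span in fewer multiplicative steps.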
10.10 The Long Short-Term Memory and Other Gated RNNs

As of this writing, the most effective sequence models used in practical applications are called gated RNNs. These include the long short-term memory and networks based on the gated recurrent unit.

Like leaky units, gated RNNs are based on the idea of creating paths through time that have derivatives that neither vanish nor explode. Leaky units did this with connection weights that were either manually chosen constants or were parameters. Gated RNNs generalize this to connection weights that may change at each time step.

Leaky units allow the network to accumulate information (such as evidence for a particular feature or category) over a long duration. However, once that information has been used, it might be useful for the neural network to forget the old state.
For example, if a sequence is made of sub-sequences and we want a leaky unit to accumulate evidence inside each sub-subsequence, we need a mechanism to forget the old state by setting it to zero. Instead of manually deciding when to clear the state, we want the neural network to learn to decide when to do it. This is what gated RNNs do.

The clever idea of introducing self-loops to produce paths where the gradient can flow for long durations is a core contribution of the initial long short-term memory (LSTM) model (Hochreiter and Schmidhuber, 1997). A crucial addition has been to make the weight on this self-loop conditioned on the context, rather than fixed (Gers et al., 2000). By making the weight of this self-loop gated (controlled by another hidden unit), the time scale of integration can be changed dynamically. In this case, we mean that even for an LSTM with fixed parameters, the time scale of integration can change based on the input sequence, because the time constants
LSTM hasfixed beenparameters, found extremely successful integration can change such basedasonunconstrained the input sequence, becauserecognition the time constants in man many y applications, handwriting (Gra Grav ves are output by the mo del itself. The LSTM has b een found extremely successful et al. al.,, 2009), speech recognition (Gra Grav ves et al. al.,, 2013; Grav Graves es and Jaitly, 2014), in many applications, as unconstrained handwriting recognition (Gra ves), handwriting generation (such Gra , 2013 ), mac translation ( Sutsk Graves ves machine hine Sutskev ev ever er et al. al., , 2014 et al., captioning 2009), speech recognition (Gra;vVin es et al.et , 2013 ; Grav; es Jaitly , 2014 image (Kiros et al., 2014b Vinyals yals al. al.,, 2014b Xuand et al. al., , 2015 ) and), handwriting generation Graves parsing (Viny Vinyals als et al. al.,, (2014a ). , 2013), machine translation (Sutskever et al., 2014), image captioning (Kiros et al., 2014b; Vinyals et al., 2014b; Xu et al., 2015) and The (LSTM block corresponding onding parsing Vinyalsblo et ck al.,diagram 2014a). is illustrated in Fig. 10.16. The corresp forw propagation equations are giv b elow, in the case of a shallo recurrent forward ard given en shallow w The LSTM block diagram is illustrated in Fig. 10.16. The corresponding forward propagation equations are given 411b elow, in the case of a shallow recurrent
Figure 10.16: Block diagram of the LSTM recurrent network "cell." Cells are connected recurrently to each other, replacing the usual hidden units of ordinary recurrent networks. An input feature is computed with a regular artificial neuron unit. Its value can be accumulated into the state if the sigmoidal input gate allows it. The state unit has a linear self-loop whose weight is controlled by the forget gate. The output of the cell can be shut off by the output gate. All the gating units have a sigmoid nonlinearity, while the input unit can have any squashing nonlinearity. The state unit can also be used as an extra input to the gating units. The black square indicates a delay of 1 time unit.
network architecture. Deeper architectures have also been successfully used (Graves et al., 2013; Pascanu et al., 2014a). Instead of a unit that simply applies an element-wise nonlinearity to the affine transformation of inputs and recurrent units, LSTM recurrent networks have "LSTM cells" that have an internal recurrence (a self-loop), in addition to the outer recurrence of the RNN. Each cell has the same inputs and outputs as an ordinary recurrent network, but has more parameters and a system of gating units that controls the flow of information. The most important component is the state unit s_i^{(t)} that has a linear self-loop similar to the leaky units described in the previous section.
However, here, the self-loop weight (or the associated time constant) is controlled by a forget gate unit f_i^{(t)} (for time step t and cell i), that sets this weight to a value between 0 and 1 via a sigmoid unit:

    f_i^{(t)} = \sigma\Big( b_i^f + \sum_j U_{i,j}^f x_j^{(t)} + \sum_j W_{i,j}^f h_j^{(t-1)} \Big),    (10.33)

where x^{(t)} is the current input vector and h^{(t)} is the current hidden layer vector, containing the outputs of all the LSTM cells, and b^f, U^f, W^f are respectively biases, input weights and recurrent weights for the forget gates. The LSTM cell internal state is thus updated as follows, but with a conditional self-loop weight f_i^{(t)}:

    s_i^{(t)} = f_i^{(t)} s_i^{(t-1)} + g_i^{(t)} \sigma\Big( b_i + \sum_j U_{i,j} x_j^{(t)} + \sum_j W_{i,j} h_j^{(t-1)} \Big),    (10.34)

where b, U and W respectively denote the biases, input weights and recurrent weights into the LSTM cell. The external input gate unit g_i^{(t)} is computed similarly to the forget gate (with a sigmoid unit to obtain a gating value between 0 and 1), but with its own parameters:

    g_i^{(t)} = \sigma\Big( b_i^g + \sum_j U_{i,j}^g x_j^{(t)} + \sum_j W_{i,j}^g h_j^{(t-1)} \Big).    (10.35)

The output h_i^{(t)} of the LSTM cell can also be shut off, via the output gate q_i^{(t)}, which also uses a sigmoid unit for gating:

    h_i^{(t)} = \tanh\big( s_i^{(t)} \big) q_i^{(t)}    (10.36)

    q_i^{(t)} = \sigma\Big( b_i^o + \sum_j U_{i,j}^o x_j^{(t)} + \sum_j W_{i,j}^o h_j^{(t-1)} \Big)    (10.37)
which has parameters b^o, U^o, W^o for its biases, input weights and recurrent weights, respectively. Among the variants, one can choose to use the cell state s_i^{(t)} as an extra input (with its weight) into the three gates of the i-th unit, as shown in Fig. 10.16. This would require three additional parameters.

LSTM networks have been shown to learn long-term dependencies more easily than the simple recurrent architectures, first on artificial data sets designed for testing the ability to learn long-term dependencies (Bengio et al., 1994; Hochreiter and Schmidhuber, 1997; Hochreiter et al., 2000), then on challenging sequence processing tasks where state-of-the-art performance was obtained (Graves, 2012; Graves et al., 2013; Sutskever et al., 2014). Variants and alternatives to the LSTM have been studied and used and are discussed next.

Which pieces of the LSTM architecture are actually necessary?
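As a concrete reference point, a minimal NumPy sketch of one forward step of the LSTM cell defined by Eqs. 10.33-10.37 (parameter layout and names are illustrative; following the equations, the cell input uses a sigmoid, although any squashing nonlinearity could be substituted):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, s_prev, p):
    """One LSTM cell step. p maps a name to its (b, U, W) parameter triple."""
    def affine(name):
        b, U, W = p[name]
        return b + U @ x + W @ h_prev

    f = sigmoid(affine("f"))      # forget gate, Eq. 10.33
    g = sigmoid(affine("g"))      # external input gate, Eq. 10.35
    q = sigmoid(affine("o"))      # output gate, Eq. 10.37
    s = f * s_prev + g * sigmoid(affine("cell"))  # state update, Eq. 10.34
    h = np.tanh(s) * q            # cell output, Eq. 10.36
    return h, s

rng = np.random.default_rng(3)
n_in, n_h = 3, 4
p = {name: (rng.standard_normal(n_h),
            rng.standard_normal((n_h, n_in)),
            rng.standard_normal((n_h, n_h)))
     for name in ("f", "g", "o", "cell")}

h = s = np.zeros(n_h)
for x in rng.standard_normal((10, n_in)):
    h, s = lstm_step(x, h, s, p)
```

The linear term f * s_prev is the gated self-loop: when f saturates near one, the state is carried forward almost unchanged, giving the long-duration gradient paths discussed above.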
What other successful architectures could be designed that allow the network to dynamically control the time scale and forgetting behavior of different units?

Some answers to these questions are given with the recent work on gated RNNs, whose units are also known as gated recurrent units or GRUs (Cho et al., 2014b; Chung et al., 2014, 2015a; Jozefowicz et al., 2015; Chrupala et al., 2015). The main difference with the LSTM is that a single gating unit simultaneously controls the forgetting factor and the decision to update the state unit. The update equations are the following:

    h_i^{(t)} = u_i^{(t-1)} h_i^{(t-1)} + \big(1 - u_i^{(t-1)}\big) \sigma\Big( b_i + \sum_j U_{i,j} x_j^{(t-1)} + \sum_j W_{i,j} r_j^{(t-1)} h_j^{(t-1)} \Big),    (10.38)

where u stands for the "update" gate and r for the "reset" gate. Their value is defined as usual:

    u_i^{(t)} = \sigma\Big( b_i^u + \sum_j U_{i,j}^u x_j^{(t)} + \sum_j W_{i,j}^u h_j^{(t)} \Big)    (10.39)

and

    r_i^{(t)} = \sigma\Big( b_i^r + \sum_j U_{i,j}^r x_j^{(t)} + \sum_j W_{i,j}^r h_j^{(t)} \Big).    (10.40)

The reset and update gates can individually "ignore" parts of the state vector. The update gates act like conditional leaky integrators that can linearly gate any
dimension, thus choosing to copy it (at one extreme of the sigmoid) or completely ignore it (at the other extreme) by replacing it by the new "target state" value (towards which the leaky integrator wants to converge). The reset gates control which parts of the state get used to compute the next target state, introducing an additional nonlinear effect in the relationship between past state and future state.

Many more variants around this theme can be designed. For example the reset gate (or forget gate) output could be shared across multiple hidden units. Alternately, the product of a global gate (covering a whole group of units, such as an entire layer) and a local gate (per unit) could be used to combine global control and local control. However, several investigations over architectural variations of the LSTM and GRU found no variant that would clearly beat both of these across a wide range of tasks (Greff et al., 2015; Jozefowicz et al., 2015).
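The GRU update of Eqs. 10.38-10.40 can be sketched similarly (parameter names are illustrative; as in Eq. 10.38, the candidate "target state" here uses a sigmoid):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, p):
    """One GRU step. p maps a name to its (b, U, W) parameter triple."""
    u = sigmoid(p["u"][0] + p["u"][1] @ x + p["u"][2] @ h_prev)  # update gate
    r = sigmoid(p["r"][0] + p["r"][1] @ x + p["r"][2] @ h_prev)  # reset gate
    # Reset gate chooses which parts of the state feed the target state.
    target = sigmoid(p["h"][0] + p["h"][1] @ x + p["h"][2] @ (r * h_prev))
    # Update gate acts as a conditional leaky integrator, per dimension.
    return u * h_prev + (1 - u) * target

rng = np.random.default_rng(6)
n_in, n_h = 3, 4
p = {name: (rng.standard_normal(n_h),
            rng.standard_normal((n_h, n_in)),
            rng.standard_normal((n_h, n_h)))
     for name in ("u", "r", "h")}

h = np.zeros(n_h)
for x in rng.standard_normal((10, n_in)):
    h = gru_step(x, h, p)
```

Per dimension, u near one copies the old state and u near zero replaces it with the new target, which is exactly the copy-versus-ignore behavior described above.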
Greff et al. (2015) found that a crucial ingredient is the forget gate, while Jozefowicz et al. (2015) found that adding a bias of 1 to the LSTM forget gate, a practice advocated by Gers et al. (2000), makes the LSTM as strong as the best of the explored architectural variants.
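Numerically, the bias of 1 shifts the forget gate's initial operating point from about 0.5 to about 0.73, so the state decays much more slowly before any learning takes place (a small illustrative check, assuming near-zero pre-activations apart from the bias):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forget gate value at initialization, with only the bias contributing.
f_zero_bias = sigmoid(0.0)   # ~0.5: the state roughly halves at every step
f_unit_bias = sigmoid(1.0)   # ~0.73: the state persists noticeably longer

# Fraction of the state remaining after 10 steps under each gate value.
remaining_zero = f_zero_bias ** 10
remaining_unit = f_unit_bias ** 10
```

With the unit bias, an order of magnitude more of the state survives 10 steps, which is why this initialization helps gradients reach long-range dependencies early in training.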
10.11
Optimization for Long-Term Dependencies
Sec. 8.2.5 and Sec. 10.7 have described the vanishing and exploding gradient problems that occur when optimizing RNNs over many time steps.

An interesting idea proposed by Martens and Sutskever (2011) is that second derivatives may vanish at the same time that first derivatives vanish. Second-order optimization algorithms may roughly be understood as dividing the first derivative by the second derivative (in higher dimension, multiplying the gradient by the inverse Hessian). If the second derivative shrinks at a similar rate to the first derivative, then the ratio of first and second derivatives may remain relatively constant. Unfortunately, second-order methods have many drawbacks, including high computational cost, the need for a large minibatch, and a tendency to be attracted to saddle points. Martens and Sutskever (2011) found promising results using second-order methods. Later, Sutskever et al. (2013) found that simpler methods such as Nesterov momentum with careful initialization could achieve similar results. See Sutskever (2012) for more detail. Both of these approaches have largely been replaced by simply using SGD (even without momentum) applied to LSTMs. This is part of a continuing theme in machine learning that it is often much easier to design a model that is easy to optimize than it is to design a more powerful optimization algorithm.
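For reference, the Nesterov momentum update mentioned above can be sketched as follows (a toy illustration on a quadratic bowl; the learning rate and momentum values are arbitrary assumptions, and this is not the careful initialization scheme of Sutskever et al.):

```python
import numpy as np

def nesterov_step(theta, velocity, grad_fn, lr=0.01, momentum=0.9):
    """One step of Nesterov momentum: evaluate the gradient at the
    look-ahead point theta + momentum * velocity, then update."""
    lookahead = theta + momentum * velocity
    velocity = momentum * velocity - lr * grad_fn(lookahead)
    return theta + velocity, velocity

# Toy usage on f(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta = np.array([1.0, -2.0])
velocity = np.zeros_like(theta)
for _ in range(200):
    theta, velocity = nesterov_step(theta, velocity, grad_fn=lambda t: t)
# theta converges toward the minimum at the origin.
```

The only difference from standard momentum is where the gradient is evaluated: at the look-ahead point rather than at the current parameters.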
CHAPTER 10. SEQUENCE MODELING: RECURRENT AND RECURSIVE NETS
As discussed in Sec. 8.2.4, strongly nonlinear functions such as those computed by a recurrent net over many time steps tend to have derivatives that can be either very large or very small in magnitude. This is illustrated in Fig. 8.3 and Fig. 10.17, in which we see that the objective function (as a function of the parameters) has a "landscape" in which one finds "cliffs": wide and rather flat regions separated by tiny regions where the objective function changes quickly, forming a kind of cliff.

The difficulty that arises is that when the parameter gradient is very large, a gradient descent parameter update could throw the parameters very far, into a region where the objective function is larger, undoing much of the work that had been done to reach the current solution. The gradient tells us the direction that corresponds to the steepest descent within an infinitesimal region surrounding the current parameters. Outside of this infinitesimal region, the cost function may begin to curve back upwards. The update must be chosen to be small enough to avoid traversing too much upward curvature. We typically use learning rates that decay slowly enough that consecutive steps have approximately the same learning rate. A step size that is appropriate for a relatively linear part of the landscape is often inappropriate and causes uphill motion if we enter a more curved part of the landscape on the next step.
Figure 10.17: Example of the effect of gradient clipping in a recurrent network with two parameters w and b. Gradient clipping can make gradient descent perform more reasonably in the vicinity of extremely steep cliffs. These steep cliffs commonly occur in recurrent networks near where a recurrent network behaves approximately linearly. The cliff is exponentially steep in the number of time steps because the weight matrix is multiplied by itself once for each time step. Gradient descent without gradient clipping overshoots the bottom of this small ravine, then receives a very large gradient from the cliff face. The large gradient catastrophically propels the parameters outside the axes of the plot. Gradient descent with gradient clipping has a more moderate reaction to the cliff. While it does ascend the cliff face, the step size is restricted so that it cannot be propelled away from the steep region near the solution. Figure adapted with permission from Pascanu et al. (2013a).
A simple type of solution has been in use by practitioners for many years: clipping the gradient. There are different instances of this idea (Mikolov, 2012; Pascanu et al., 2013a). One option is to clip the parameter gradient from a minibatch element-wise (Mikolov, 2012) just before the parameter update. Another is to clip the norm ||g|| of the gradient g (Pascanu et al., 2013a) just before the parameter update:

    if ||g|| > v:                       (10.41)
        g ← g v / ||g||                 (10.42)

where v is the norm threshold and g is used to update parameters. Because the gradient of all the parameters (including different groups of parameters, such as weights and biases) is renormalized jointly with a single scaling factor, the latter method has the advantage that it guarantees that each step is still in the gradient direction, but experiments suggest that both forms work similarly.
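Both clipping rules can be sketched in a few lines of NumPy; this is an illustrative sketch, not code from the book:

```python
import numpy as np

def clip_grad_norm(g, v):
    """Norm clipping (Eq. 10.41-10.42): if ||g|| > v, rescale g to norm v.
    All parameters are clipped jointly, so the direction is preserved."""
    norm = np.linalg.norm(g)
    if norm > v:
        g = g * (v / norm)
    return g

def clip_grad_elementwise(g, v):
    """Element-wise clipping (Mikolov, 2012): clamp each component to
    [-v, v]. The direction may change, but the result is still a
    descent direction."""
    return np.clip(g, -v, v)

g = np.array([3.0, 4.0])                 # ||g|| = 5
g_norm = clip_grad_norm(g, v=1.0)        # same direction, norm rescaled to 1
g_elem = clip_grad_elementwise(g, v=1.0) # each component clamped to 1
```

Note that `g_norm` still points along (3, 4) while `g_elem` points along (1, 1): the two rules bound the update differently even on this tiny example.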
Although the parameter update has the same direction as the true gradient, with gradient norm clipping, the parameter update vector norm is now bounded. This bounded gradient avoids performing a detrimental step when the gradient explodes. In fact, even simply taking a random step when the gradient magnitude is above a threshold tends to work almost as well. If the explosion is so severe that the gradient is numerically Inf or Nan (considered infinite or not-a-number), then a random step of size v can be taken and will typically move away from the numerically unstable configuration. Clipping the gradient norm per-minibatch will not change the direction of the gradient for an individual minibatch. However, taking the average of the norm-clipped gradient from many minibatches is not equivalent to clipping the norm of the true gradient (the gradient formed from using all examples). Examples that have large gradient norm, as well as examples that appear in the same minibatch as such examples, will have their contribution to the final direction diminished. This stands in contrast to traditional minibatch gradient descent, where the true gradient direction is equal to the average over all minibatch gradients. Put another way, traditional stochastic gradient descent uses an unbiased estimate of the gradient, while gradient descent with norm clipping introduces a heuristic bias that we know empirically to be useful. With element-wise clipping, the direction of the update is not aligned with the true gradient or the minibatch gradient, but it is still a descent direction. It has also been proposed (Graves, 2013) to clip the back-propagated gradient (with respect to hidden units) but no comparison has been published between these variants; we conjecture that all these methods behave similarly.

Gradient clipping helps to deal with exploding gradients, but it does not help with vanishing gradients. To address vanishing gradients and better capture long-term dependencies, we discussed the idea of creating paths in the computational graph of the unfolded recurrent architecture along which the product of gradients associated with arcs is near 1. One approach to achieve this is with LSTMs and other self-loops and gating mechanisms, described above in Sec. 10.10. Another idea is to regularize or constrain the parameters so as to encourage "information flow." In particular, we would like the gradient vector ∇_{h^(t)} L being back-propagated to maintain its magnitude, even if the loss function only penalizes the output at the end of the sequence. Formally, we want

    (∇_{h^(t)} L) (∂h^(t) / ∂h^(t−1))                    (10.43)
to be as large as

    ∇_{h^(t)} L.                                         (10.44)

With this objective, Pascanu et al. (2013a) propose the following regularizer:

    Ω = Σ_t ( ||(∇_{h^(t)} L) (∂h^(t)/∂h^(t−1))|| / ||∇_{h^(t)} L|| − 1 )²    (10.45)

Computing the gradient of this regularizer may appear difficult, but Pascanu et al. (2013a) propose an approximation in which we consider the back-propagated vectors ∇_{h^(t)} L as if they were constants (for the purpose of this regularizer, so that there is no need to back-propagate through them). The experiments with this regularizer suggest that, if combined with the norm clipping heuristic (which handles gradient explosion), the regularizer can considerably increase the span of the dependencies that an RNN can learn. Because it keeps the RNN dynamics on the edge of explosive gradients, the gradient clipping is particularly important. Without gradient clipping, gradient explosion prevents learning from succeeding. A key weakness of this approach is that it is not as effective as the LSTM for tasks where data is abundant, such as language modeling.
10.12
Explicit Memory
Intelligence requires knowledge, and acquiring knowledge can be done via learning, which has motivated the development of large-scale deep architectures. However, there are different kinds of knowledge. Some knowledge can be implicit, subconscious, and difficult to verbalize—such as how to walk, or how a dog looks different from a cat. Other knowledge can be explicit, declarative, and relatively straightforward to put into words—everyday commonsense knowledge, like "a cat is a kind of animal," or very specific facts that you need to know to accomplish your current goals, like "the meeting with the sales team is at 3:00 PM in room 141."

Neural networks excel at storing implicit knowledge. However, they struggle to memorize facts. Stochastic gradient descent requires many presentations of the same input before it can be stored in neural network parameters, and even then, that input will not be stored especially precisely. Graves et al. (2014b) hypothesized that this is because neural networks lack the equivalent of the working memory system that allows human beings to explicitly hold and manipulate pieces of information that are relevant to achieving some goal. Such explicit memory
Figure 10.18: A schematic of an example of a network with an explicit memory, capturing some of the key design elements of the neural Turing machine. In this diagram we distinguish the "representation" part of the model (the "task network," here a recurrent net in the bottom) from the "memory" part of the model (the set of cells), which can store facts. The task network learns to "control" the memory, deciding where to read from and where to write to within the memory (through the reading and writing mechanisms, indicated by bold arrows pointing at the reading and writing addresses).
components would allow our systems not only to rapidly and "intentionally" store and retrieve specific facts but also to sequentially reason with them. The need for neural networks that can process information in a sequence of steps, changing the way the input is fed into the network at each step, has long been recognized as important for the ability to reason rather than to make automatic, intuitive responses to the input (Hinton, 1990).

To resolve this difficulty, Weston et al. (2014) introduced memory networks that include a set of memory cells that can be accessed via an addressing mechanism. Memory networks originally required a supervision signal instructing them how to use their memory cells. Graves et al. (2014b) introduced the neural Turing machine, which is able to learn to read from and write arbitrary content to memory cells without explicit supervision about which actions to undertake, and allowed end-to-end training without this supervision signal, via the use of a content-based soft attention mechanism (see Bahdanau et al. (2015) and Sec. 12.4.5.1). This soft addressing mechanism has become standard with other related architectures emulating algorithmic mechanisms in a way that still allows gradient-based optimization (Sukhbaatar et al., 2015; Joulin and Mikolov, 2015; Kumar et al., 2015; Vinyals et al., 2015a; Grefenstette et al., 2015).

Each memory cell can be thought of as an extension of the memory cells in LSTMs and GRUs. The difference is that the network outputs an internal state that chooses which cell to read from or write to, just as memory accesses in a digital computer read from or write to a specific address.

It is difficult to optimize functions that produce exact, integer addresses. To alleviate this problem, NTMs actually read to or write from many memory cells simultaneously. To read, they take a weighted average of many cells. To write, they modify multiple cells by different amounts. The coefficients for these operations are chosen to be focused on a small number of cells, for example, by producing them via a softmax function. Using these weights with non-zero derivatives allows the functions controlling access to the memory to be optimized using gradient descent. The gradient on these coefficients indicates whether each of them should be increased or decreased, but the gradient will typically be large only for those memory addresses receiving a large coefficient.
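The soft read and write operations described above can be sketched as follows. This is an illustrative sketch with made-up shapes, using an erase/add decomposition for the write; it is not the exact NTM formulation:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def soft_read(memory, weights):
    """Read a weighted average of all cells: r = sum_i w_i * M_i."""
    return weights @ memory

def soft_write(memory, weights, erase, add):
    """Modify every cell by an amount proportional to its weight:
    first erase part of each cell's content, then add new content."""
    memory = memory * (1.0 - np.outer(weights, erase))
    return memory + np.outer(weights, add)

n_cells, cell_size = 8, 5
memory = np.zeros((n_cells, cell_size))
# Addressing weights come from a softmax, so they are positive, sum to 1,
# and have non-zero derivatives, allowing training by gradient descent.
# A large logit on cell 0 focuses most of the weight there.
weights = softmax(np.array([5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]))
memory = soft_write(memory, weights, erase=np.zeros(cell_size),
                    add=np.ones(cell_size))
r = soft_read(memory, weights)
```

Because the write touches every cell a little, the read does not return exactly the added vector; the weights concentrate, but do not binarize, the access.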
There are tw two o reasons These memory cells are t ypically augmen ted to con tain a vector, rather to increase the size of the memory cell. One reason is that we ha have ve increasedthan the the single scalar stored by an LSTM or GRU memory cell. There are tw o reasons cost of accessing a memory cell. We pay the computational cost of pro producing ducing a to increase the size of the memory cell. One reason is that w e ha ve increased the co coefficien efficien efficientt for many cells, but we exp expect ect these co coefficients efficients to cluster around a small cost of accessing We pay the rather computational cost ofvalue, producing a n umber of cells. aBymemory readingcell. a vector value, than a scalar we can coefficient for many cells, but we exp ect these coefficients to cluster around a small number of cells. By reading a vector 421 value, rather than a scalar value, we can
offset some of this cost. Another reason to use vector-valued memory cells is that they allow for content-based addressing, where the weight used to read to or write from a cell is a function of that cell. Vector-valued cells allow us to retrieve a complete vector-valued memory if we are able to produce a pattern that matches some but not all of its elements. This is analogous to the way that people can recall the lyrics of a song based on a few words. We can think of a content-based read instruction as saying, "Retrieve the lyrics of the song that has the chorus 'We all live in a yellow submarine.'" Content-based addressing is more useful when we make the objects to be retrieved large—if every letter of the song was stored in a separate memory cell, we would not be able to find them this way. By comparison, location-based addressing is not allowed to refer to the content of the memory. We can think of a location-based read instruction as saying "Retrieve the lyrics of the song in slot 347." Location-based addressing can often be a perfectly sensible mechanism even when the memory cells are small.

If the content of a memory cell is copied (not forgotten) at most time steps, then the information it contains can be propagated forward in time and the gradients propagated backward in time without either vanishing or exploding.
RNNs cannot One reason thisfrom adv advan an antage may because information and Explicit memory seems to allo w mo dels to learn tasks that ordinary RNNs or LSTM gradien gradients ts can be propagated (forward in time or backw backwards ards in time, resp respectively) ectively) RNNs cannot learn. One reason for this adv an tage may b e b ecause information and for very long durations. gradients can be propagated (forward in time or backwards in time, respectively) As anlong alternative to back-propagation through weigh weighted ted av averages erages of memory for very durations. cells, we can interpret the memory addressing co coefficients efficients as probabilities and As an alternative to back-propagation through weigh ted).avOptimizing erages of memory sto stocchastically read just one cell (Zaremba and Sutskev Sutskever er, 2015 mo models dels cells, w e can interpret the memory addressing co efficients as probabilities and that mak makee discrete decisions requires sp specialized ecialized optimization algorithms, describ described ed stoSec. chastically just training one cell (these Zaremba and Sutskev er, 2015). that Optimizing models in 20.9.1.read So far, sto stochastic chastic architectures make discrete that makeremains discrete harder decisions requires specialized optimization algorithms, describ ed decisions than training deterministic algorithms that make soft in Sec. 20.9.1. So far, training these stochastic architectures that make discrete decisions. decisions remains harder than training deterministic algorithms that make soft Whether it is soft (allo (allowing wing back-propagation) or sto stochastic chastic and hard, the mechdecisions. 
anism for choosing an address is in its form iden identical tical to the attention me mechanism chanism Whether it is soft (allo wing back-propagation) or sto chastic and hard, the whic which h had been previously introduced in the context of machine translation (mechBahanism for choosing an address is in its form iden tical to the attention me chanism danau et al. al.,, 2015) and discussed in Sec. 12.4.5.1. The idea of atten attention tion mechanisms whic h had b een previously introduced in the context of machine translation (Bahfor neural netw networks orks was in intro tro troduced duced even earlier, in the context of handwriting danau et al.(,Grav 2015es ) and discussed 12.4.5.1 . The idea that of atten mechanisms generation Graves , 2013 ), with in anSec. attention mechanism wastion constrained to for neural networks was introduced even earlier, in the context of handwriting generation (Graves, 2013), with an attention mechanism that was constrained to 422
CHAPTER 10. SEQUENCE MODELING: RECURRENT AND RECURSIVE NETS
move only forward in time through the sequence. In the case of machine translation and memory networks, at each step, the focus of attention can move to a completely different place, compared to the previous step.

Recurrent neural networks provide a way to extend deep learning to sequential data. They are the last major tool in our deep learning toolbox. Our discussion now moves to how to choose and use these tools and how to apply them to real-world tasks.
Chapter 11
Practical Methodology

Successfully applying deep learning techniques requires more than just a good knowledge of what algorithms exist and the principles that explain how they work. A good machine learning practitioner also needs to know how to choose an algorithm for a particular application and how to monitor and respond to feedback obtained from experiments in order to improve a machine learning system. During day to day development of machine learning systems, practitioners need to decide whether to gather more data, increase or decrease model capacity, add or remove regularizing features, improve the optimization of a model, improve approximate inference in a model, or debug the software implementation of the model. All of these operations are at the very least time-consuming to try out, so it is important to be able to determine the right course of action rather than blindly guessing.

Most of this book is about different machine learning models, training algorithms, and objective functions. This may give the impression that the most important ingredient to being a machine learning expert is knowing a wide variety of machine learning techniques and being good at different kinds of math. In practice, one can usually do much better with a correct application of a commonplace algorithm than by sloppily applying an obscure algorithm. Correct application of an algorithm depends on mastering some fairly simple methodology. Many of the recommendations in this chapter are adapted from Ng (2015).

We recommend the following practical design process:

• Determine your goals—what error metric to use, and your target value for this error metric. These goals and error metrics should be driven by the problem that the application is intended to solve.

• Establish a working end-to-end pipeline as soon as possible, including the
CHAPTER 11. PRACTICAL METHODOLOGY
estimation of the appropriate performance metrics.

• Instrument the system well to determine bottlenecks in performance. Diagnose which components are performing worse than expected and whether it is due to overfitting, underfitting, or a defect in the data or software.

• Repeatedly make incremental changes such as gathering new data, adjusting hyperparameters, or changing algorithms, based on specific findings from your instrumentation.

As a running example, we will use the Street View address number transcription system (Goodfellow et al., 2014d). The purpose of this application is to add buildings to Google Maps. Street View cars photograph the buildings and record the GPS coordinates associated with each photograph. A convolutional network recognizes the address number in each photograph, allowing the Google Maps database to add that address in the correct location. The story of how this commercial application was developed gives an example of how to follow the design methodology we advocate.

We now describe each of the steps in this process.
11.1
Performance Metrics
Determining your goals, in terms of which error metric to use, is a necessary first step because your error metric will guide all of your future actions. You should also have an idea of what level of performance you desire.

Keep in mind that for most applications, it is impossible to achieve absolute zero error. The Bayes error defines the minimum error rate that you can hope to achieve, even if you have infinite training data and can recover the true probability distribution. This is because your input features may not contain complete information about the output variable, or because the system might be intrinsically stochastic. You will also be limited by having a finite amount of training data.

The amount of training data can be limited for a variety of reasons. When your goal is to build the best possible real-world product or service, you can typically collect more data but must determine the value of reducing error further and weigh this against the cost of collecting more data. Data collection can require time, money, or human suffering (for example, if your data collection process involves performing invasive medical tests). When your goal is to answer a scientific question about which algorithm performs better on a fixed benchmark, the benchmark
specification usually determines the training set and you are not allowed to collect more data.

How can one determine a reasonable level of performance to expect? Typically, in the academic setting, we have some estimate of the error rate that is attainable based on previously published benchmark results. In the real-world setting, we have some idea of the error rate that is necessary for an application to be safe, cost-effective, or appealing to consumers. Once you have determined your realistic desired error rate, your design decisions will be guided by reaching this error rate.

Another important consideration besides the target value of the performance metric is the choice of which metric to use. Several different performance metrics may be used to measure the effectiveness of a complete application that includes machine learning components. These performance metrics are usually different from the cost function used to train the model. As described in Sec. 5.1.2, it is common to measure the accuracy, or equivalently, the error rate, of a system. However, many applications require more advanced metrics.

Sometimes it is much more costly to make one kind of a mistake than another. For example, an e-mail spam detection system can make two kinds of mistakes: incorrectly classifying a legitimate message as spam, and incorrectly allowing a spam message to appear in the inbox. It is much worse to block a legitimate message than to allow a questionable message to pass through. Rather than measuring the error rate of a spam classifier, we may wish to measure some form of total cost, where the cost of blocking legitimate messages is higher than the cost of allowing spam messages.

Sometimes we wish to train a binary classifier that is intended to detect some rare event. For example, we might design a medical test for a rare disease. Suppose
Supp Suppose ose Sometimes we wish to train a binary classifier that is intended to detect some that only one in every million people has this disease. We can easily achiev achievee rare even t. F or example, we migh t design a medical test for a rare disease. Supp ose 99.9999% accuracy on the detection task, by simply hard-co hard-coding ding the classifier that only one in every million p eople has this disease. W e can easily achiev to alwa always ys rep report ort that the disease is absen absent. t. Clearly Clearly,, accuracy is a poor wa way y toe 99.9999% on the detection byOne simply the classifier characterizeaccuracy the p erformance of suc system. wa to solveding this problem is to such h a task, way y hard-co to alwa ys rep ort that the disease is absen t. Clearly , accuracy is a p o or wa y to instead measure pr preecision and recal alll. Precision is the fraction of detections rep reported orted characterize thethat p erformance of sucwhile h a system. One y to solve is to b y the mo model del were correct, recall is thewafraction of this trueproblem ev even en ents ts that instead measure pr e cision and r e c al l . Precision is the fraction of detections rep orted were detected. A detector that says no one has the disease would ac achiev hiev hievee perfect b y the mo del that were correct, while recall is the fraction of true eventswould that precision, but zero recall. A detector that sa says ys ev every ery eryone one has the disease were detected. detector says no onetohas disease would achievwho e perfect ac achiev hiev hieve e perfect A recall, but that precision equal thethe percentage of people ha have ve precision, but zero recall. A detector that sa ys ev ery one has the disease would the disease (0.0001% in our example of a disease that only one people in a million ac hiev perfectusing recall, but precision equalittois the percentage eople who, ha ve ha hav ve). 
eWhen precision and recall, common to plotofapPR curve with the diseaseon(0.0001% in and our example a disease one pgenerates eople in a amillion x-axis. that precision the y-axis recall onofthe The only classifier score have).is When recall, itoccurred. is common plot a PR curve, with that higher using if the precision even eventt to band e detected F For ortoexample, a feedforw feedforward ard precision on the y-axis and recall on the x-axis. The classifier generates a score that is higher if the event to be detected 426 o ccurred. For example, a feedforward
CHAPTER 11. PRACTICAL METHODOLOGY
network designed to detect a disease outputs ŷ = P(y = 1 | x), estimating the probability that a person whose medical results are described by features x has the disease. We choose to report a detection whenever this score exceeds some threshold. By varying the threshold, we can trade precision for recall. In many cases, we wish to summarize the performance of the classifier with a single number rather than a curve. To do so, we can convert precision p and recall r into an F-score given by

F = 2pr / (p + r).     (11.1)

Another option is to report the total area lying beneath the PR curve.

In some applications, it is possible for the machine learning system to refuse to make a decision. This is useful when the machine learning algorithm can estimate how confident it should be about a decision, especially if a wrong decision can be harmful and if a human operator is able to occasionally take over. The Street View transcription system provides an example of this situation. The task is to transcribe the address number from a photograph in order to associate the location where the photo was taken with the correct address in a map. Because the value of the map degrades considerably if the map is inaccurate, it is important to add an address only if the transcription is correct. If the machine learning system thinks that it is less likely than a human being to obtain the correct transcription, then the best course of action is to allow a human to transcribe the photo instead. Of course, the machine learning system is only useful if it is able to dramatically reduce the amount of photos that the human operators must process. A natural performance metric to use in this situation is coverage. Coverage is the fraction of examples for which the machine learning system is able to produce a response. It is possible to trade coverage for accuracy. One can always obtain 100% accuracy by refusing to process any example, but this reduces the coverage to 0%. For the Street View task, the goal for the project was to reach human-level transcription accuracy while maintaining 95% coverage. Human-level performance on this task is 98% accuracy.

Many other metrics are possible. We can, for example, measure click-through rates, collect user satisfaction surveys, and so on. Many specialized application areas have application-specific criteria as well.

What is important is to determine which performance metric to improve ahead of time, then concentrate on improving this metric. Without clearly defined goals, it can be difficult to tell whether changes to a machine learning system make progress or not.
11.2
Default Baseline Models
After choosing performance metrics and goals, the next step in any practical application is to establish a reasonable end-to-end system as soon as possible. In this section, we provide recommendations for which algorithms to use as the first baseline approach in various situations. Keep in mind that deep learning research progresses quickly, so better default algorithms are likely to become available soon after this writing.

Depending on the complexity of your problem, you may even want to begin without using deep learning. If your problem has a chance of being solved by just choosing a few linear weights correctly, you may want to begin with a simple statistical model like logistic regression.

If you know that your problem falls into an "AI-complete" category like object recognition, speech recognition, machine translation, and so on, then you are likely to do well by beginning with an appropriate deep learning model.
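To make the simplest kind of baseline concrete, a logistic regression model of the sort mentioned above takes only a few lines to fit. This is a minimal NumPy sketch on synthetic data; the learning rate, iteration count, and data are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.RandomState(0)
# Synthetic two-class problem: the label depends only on the first feature.
X = rng.randn(200, 2)
y = (X[:, 0] > 0).astype(float)

# Logistic regression fit by batch gradient descent on the log-loss.
w, b, lr = np.zeros(2), 0.0, 0.5
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid predictions
    w -= lr * (X.T @ (p - y)) / len(y)       # gradient of the mean log-loss w.r.t. w
    b -= lr * np.mean(p - y)

p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
train_accuracy = np.mean((p >= 0.5) == (y == 1))
```

In practice one would of course use a library implementation and judge the model on held-out data, but a baseline this cheap is often worth running before reaching for a deep model.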
First, choose the general category of model based on the structure of your data. If you want to perform supervised learning with fixed-size vectors as input, use a feedforward network with fully connected layers. If the input has known topological structure (for example, if the input is an image), use a convolutional network. In these cases, you should begin by using some kind of piecewise linear unit (ReLUs or their generalizations like Leaky ReLUs, PReLUs, and maxout). If your input or output is a sequence, use a gated recurrent net (LSTM or GRU).

A reasonable choice of optimization algorithm is SGD with momentum with a
decaying learning rate (popular decay schemes that perform better or worse on different problems include decaying linearly until reaching a fixed minimum learning rate, decaying exponentially, or decreasing the learning rate by a factor of 2-10 each time validation error plateaus). Another very reasonable alternative is Adam.

Batch normalization can have a dramatic effect on optimization performance, especially for convolutional networks and networks with sigmoidal nonlinearities. While it is reasonable to omit batch normalization from the very first baseline, it should be introduced quickly if optimization appears to be problematic.

Unless your training set contains tens of millions of examples or more, you should include some mild forms of regularization from the start. Early stopping should be used almost universally. Dropout is an excellent regularizer that is easy to implement and compatible with many models and training algorithms. Batch normalization also sometimes reduces generalization error and allows dropout to be omitted, due to the noise in the estimate of the statistics used to normalize each variable.
If your task is similar to another task that has been studied extensively extensively,, you will probably do well by first copying the mo model del and algorithm that is already If your task is bsimilar to another taskstudied that has been extensively youy kno known wn to perform est on the previously task. Youstudied may even wan wantt to ,cop copy probably do from well that by first copying the model algorithm that already awill trained mo model del task. For example, it isand common to use theis features kno wn to p erform b est on the previously studied task. Y ou may even wan t copy from a conv convolutional olutional netw network ork trained on ImageNet to solve other computertovision a trained mo deletfrom that).task. For example, it is common to use the features tasks (Girshick al., 2015 from a convolutional network trained on ImageNet to solve other computer vision A common question is whether to begin by using unsup unsupervised ervised learning, detasks (Girshick et al., 2015). scrib scribed ed further in Part III. This is somewhat domain sp specific. ecific. Some domains, suc such h A common question is whether to b egin by using unsup ervised learning, deas natural language pro processing, cessing, are known to benefit tremendously from unsup unsupererscribed furthertechniques in Part IIIsuch . Thisasislearning somewhat domain spword ecific.embeddings. Some domains, such vised learning unsupervised In other as naturalsuch language processing, are known benefit tremendously from unsupdo erdomains, as computer vision, currenttounsup unsupervised ervised learning techniques visedbring learning techniques such learning unsupervised word embeddings. Inber other not a benefit, except inasthe semi-sup semi-supervised ervised setting, when the num number of domains, such as computer vision, current unsup ervised learning techniques do lab labeled eled examples is very small (Kingma et al., 2014; Rasm Rasmus us et al. al.,, 2015). 
If your application is in a context where unsupervised learning is known to be important, then include it in your first end-to-end baseline. Otherwise, only use unsupervised learning in your first attempt if the task you want to solve is unsupervised. You can always try adding unsupervised learning later if you observe that your initial baseline overfits.
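The feature-reuse strategy mentioned above can be sketched in a framework-agnostic way. The "pretrained" extractor below is a toy stand-in (a fixed random ReLU projection), not an actual ImageNet network; only the new linear head is fit on the new task:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained feature extractor: a fixed, frozen random
# projection with a ReLU nonlinearity. In a real application this would be,
# e.g., the convolutional layers of a network trained on ImageNet.
W_frozen = rng.normal(size=(20, 64))

def features(x):
    """Frozen 'pretrained' features; never updated on the new task."""
    return np.maximum(0.0, x @ W_frozen)

# Toy data for the new task.
X = rng.normal(size=(200, 20))
y = np.sign(X[:, 0] + X[:, 1])          # labels in {-1, +1}

# Train only a new linear head on top of the frozen features
# (a ridge-regression classifier, for simplicity).
F = features(X)
alpha = 1e-2
w = np.linalg.solve(F.T @ F + alpha * np.eye(F.shape[1]), F.T @ y)

acc = np.mean(np.sign(F @ w) == y)
print(f"train accuracy of linear head on frozen features: {acc:.2f}")
```

The point of the sketch is the division of labor: the extractor's parameters are never touched, so only a small, cheap model must be fit on the new task.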
11.3 Determining Whether to Gather More Data
After first end-to-end system is established, it is timeMore to measure the perfor11.3 theDetermining Whether to Gather Data mance of the algorithm and determine how to impro improv ve it. Man Many y machine learning After end-to-end system isemen established, it isout time to measure the perforno novices vicesthe arefirst tempted to mak make e improv improvemen ements ts by trying many different algorithms. mance of the and determine howmore to impro e it. Man y machine Ho How wev ever, er, it is algorithm often muc much h better to gather datavthan to improv improve e the learning learning no vices are tempted to mak e improv emen ts by trying out many different algorithms. algorithm. However, it is often much better to gather more data than to improve the learning Ho How w do does es one decide whether to gather more data? First, determine whether algorithm. the performance on the training set is acceptable. If performance on the training doesthe one decide algorithm whether toisgather morethe data? First,data determine set Ho is pwoor, learning not using training that is whether already the p erformance on the training set is acceptable. If p erformance on the training available, so there is no reason to gather more data. Instead, try increasing the set isofpthe oor,mo the algorithm training that is already size model dellearning by adding more la lay yisersnot or using addingthe more hiddendata units to each la lay yer. a v ailable, so there is no reason to gather more data. Instead, try increasing the Also, try improving the learning algorithm, for example by tuning the learning rate size of the model bIfy large adding more layers or adding moreoptimization hidden unitsalgorithms to each laydo er. h yp yperparameter. erparameter. mo models dels and carefully tuned Also, try improving the learning algorithm, for example b y tuning the learning rate not work well, then the problem migh mightt be the qualit quality y of the training data. The hyperparameter. 
If large models and carefully tuned optimization algorithms do not work well, then the problem might be the quality of the training data. The data may be too noisy or may not include the right inputs needed to predict the desired outputs. This suggests starting over, collecting cleaner data or collecting a richer set of features.

If the performance on the training set is acceptable, then measure the performance on a test set. If the performance on the test set is also acceptable, then there is nothing left to be done. If test set performance is much worse than training set performance, then gathering more data is one of the most effective solutions. The key considerations are the cost and feasibility of gathering more data, the cost and feasibility of reducing the test error by other means, and the amount of data that is expected to be necessary to improve test set performance significantly. At large internet companies with millions or billions of users, it is feasible to gather large datasets, and the expense of doing so can be considerably less than the other alternatives, so the answer is almost always to gather more training data. For example, the development of large labeled datasets was one of the most important factors in solving object recognition. In other contexts, such as medical applications, it may be costly or infeasible to gather more data.
A simple alternative to gathering more data is to reduce the size of the model or improve regularization, by adjusting hyperparameters such as weight decay coefficients, or by adding regularization strategies such as dropout. If you find that the gap between train and test performance is still unacceptable even after tuning the regularization hyperparameters, then gathering more data is advisable.

When deciding whether to gather more data, it is also necessary to decide how much to gather. It is helpful to plot curves showing the relationship between training set size and generalization error, like in Fig. 5.4. By extrapolating such curves, one can predict how much additional training data would be needed to achieve a certain level of performance.
Usually, adding a small fraction of the total number of examples will not have a noticeable impact on generalization error. It is therefore recommended to experiment with training set sizes on a logarithmic scale, for example doubling the number of examples between consecutive experiments.

If gathering much more data is not feasible, the only other way to improve generalization error is to improve the learning algorithm itself. This becomes the domain of research and not the domain of advice for applied practitioners.
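The curve-extrapolation idea can be sketched by measuring error at doubling training-set sizes and fitting a power law in log-log space; the power-law form is a common empirical choice, not something guaranteed to hold, and the error values below are hypothetical:

```python
import numpy as np

# Hypothetical measurements: generalization error at doubling dataset sizes.
sizes = np.array([1000, 2000, 4000, 8000, 16000], dtype=float)
errors = np.array([0.20, 0.16, 0.13, 0.105, 0.085])

# Fit error ~ a * n**b, i.e. a straight line in log-log space.
b, log_a = np.polyfit(np.log(sizes), np.log(errors), 1)
a = np.exp(log_a)

def predicted_error(n):
    """Extrapolated error at dataset size n (fitted exponent b is negative)."""
    return a * n ** b

# Invert the fit: how much data would it suggest for 5% error?
target = 0.05
n_needed = (target / a) ** (1.0 / b)
print(f"fitted exponent: {b:.3f}")
print(f"estimated examples for {target:.0%} error: {n_needed:,.0f}")
```

Such an estimate is only as good as the fitted trend, but it is often enough to tell whether the required dataset is ten times or a thousand times the current one.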
11.4 Selecting Hyperparameters
Most deep learning algorithms come with many hyperparameters that control many aspects of the algorithm's behavior. Some of these hyperparameters affect the time and memory cost of running the algorithm. Some of these hyperparameters affect the quality of the model recovered by the training process and its ability to infer correct results when deployed on new inputs.

There are two basic approaches to choosing these hyperparameters: choosing them manually and choosing them automatically. Choosing the hyperparameters manually requires understanding what the hyperparameters do and how machine learning models achieve good generalization. Automatic hyperparameter selection algorithms greatly reduce the need to understand these ideas, but they are often much more computationally costly.
11.4.1 Manual Hyperparameter Tuning
To set hyperparameters manually, one must understand the relationship between hyperparameters, training error, generalization error and computational resources (memory and runtime). This means establishing a solid foundation on the fundamental ideas concerning the effective capacity of a learning algorithm from Chapter 5.

The goal of manual hyperparameter search is usually to find the lowest generalization error subject to some runtime and memory budget. We do not discuss how to determine the runtime and memory impact of various hyperparameters here because this is highly platform-dependent.

The primary goal of manual hyperparameter search is to adjust the effective capacity of the model to match the complexity of the task.
Effective capacity abilit ability y of the learning algorithm to successfully minimize the cost function used to is constrained three representational capacit of the mo del, the train the mo model, del,byand the factors: degree tothe whic which h the cost function andy training pro procedure cedure ability of the to more successfully minimize the costunits function used to regularize the learning mo model. del. Aalgorithm mo model del with la lay yers and more hidden per la lay yer has train the mo del, and the degree to whic h the cost function and training pro cedure higher representational capacit capacity—it y—it is capable of representing more complicated regularize the mo del. A mo del with more laylearn ers and hidden units perthough, layer has functions. It can not necessarily actually allmore of these functions if higher representational capacity—it of representing the training algorithm cannot disco discov visercapable that certain functions more do a complicated go goo od job of functions. It not necessarily actually learnterms all of suc these though, if minimizing thecan training cost, or if regularization such h asfunctions weigh weightt deca decay y forbid the training some of thesealgorithm functions.cannot discover that certain functions do a good job of minimizing the training cost, or if regularization terms such as weight decay forbid The generalization error typically follows a U-shap U-shaped ed curve when plotted as some of these functions. a function of one of the hyp yperparameters, erparameters, as in Fig. 5.3. At one extreme, the The generalization error typically follows a yU-shap ed curve whenerror plotted as hyp yperparameter erparameter value corresp corresponds onds to low capacit capacity , and generalization is high a ecause function of one of the hyperparameters, as in Fig. 5.3. A t the one other extreme, the b training error is high. This is the underfitting regime. 
At extreme, h yp erparameter v alue corresp onds to low capacit y , and generalization error is high the hyperparameter value corresp corresponds onds to high capacity capacity,, and the generalization because training error the is high. This is the underfitting regime. other extreme, error is high because gap b et etw ween training and test errorAt is the high. Somewhere thethe hyperparameter alue corresp onds to high capacity , and the lo generalization in middle lies thevoptimal mo model del capacit capacity y, which achiev achieves es the low west possible error is high b ecause the gap b et w een training and test error is high. Somewhere generalization error, by adding a medium generalization gap to a medium amount in the middle lies the optimal mo del capacit y , which achiev es the lo w est p ossible of training error. generalization error, by adding a medium generalization gap to a medium amount For some hyp yperparameters, erparameters, overfitting occurs when the value of the hyperof training error. parameter is large. The num umb ber of hidden units in a lay layer er is one such example, For some hyperparameters, overfitting occurs when the value of the hyperparameter is large. The number of hidden 431 units in a layer is one such example,
CHAPTER 11. PRACTICAL METHODOLOGY
because increasing the num number ber of hidden units increases the capacit capacity y of the mo model. del. For some hyperparameters, ov overfitting erfitting occurs when the value of the hyp yperparameerparamebecause increasing the numthe ber smallest of hiddenallow unitsable increases capacit y of the of mozero del. ter is small. For example, allowable weigh eighttthe decay coefficient For some hyperparameters, erfittingcapacit occursy when value algorithm. of the hyperparamecorresp corresponds onds to the greatest ov effective capacity of thethe learning ter is small. For example, the smallest allowable weight decay coefficient of zero Not every hyperparameter will be able to explore the entire U-shaped curv curve. e. corresponds to the greatest effective capacity of the learning algorithm. Man Many y hyperparameters are discrete, such as the num numb ber of units in a lay layer er or the Not every hyperparameter will bunit, e able the entire U-shaped e. num umb ber of linear pieces in a maxout so to it isexplore only possible to visit a few pcurv oin oints ts Many the hyperparameters areerparameters discrete, such the num ber of these units hinyperparameters a layer or the along curve. Some hyp yperparameters areasbinary binary. . Usually n um b er of linear pieces in a maxout unit, so it is only p ossible to visit a few poinof ts are switc switches hes that specify whether or not to use some optional comp component onent along the curve. Some hyp erparameters arecessing binary.step Usually hyperparameters the learning algorithm, such as a prepro preprocessing thatthese normalizes the input are switc hes that specify whether or not to use some optional comp onent of features by subtracting their mean and dividing by their standard deviation. These the learning algorithm, as atwprepro cessing that normalizes the input h yp yperparameters erparameters can onlysuch explore o points on thestep curv curve. e. 
Other hyp yperparameters erparameters features by subtracting their mean and dividing by their standard deviation. ha hav ve some minim minimum um or maximum value that preven prevents ts them from exploringThese some hyperparameters explore o points on thetcurv e.y Other hyperparameters part of the curv curve. e. can Foronly example, thetwminimum weigh weight deca decay coefficient is zero. This have some or is maximum value thatweigh preven ts them fromwexploring means that minim if the um model underfitting when weight t decay is zero, e can not some en enter ter part of the curv e. F or example, the minimum weigh t deca y coefficient is zero. This the ov overfitting erfitting region by mo modifying difying the weigh weightt deca decay y co coefficient. efficient. In other words, means that if the model is underfitting when weigh t decay is zero, we can not enter some hyperparameters can only subtract capacit capacity y. the overfitting region by modifying the weight decay coefficient. In other words, The learning rate is perhaps the most imp important ortant hyperparameter. If you some hyperparameters can only subtract capacity. ha hav ve time to tune only one hyperparameter, tune the learning rate. It conis perhaps most ortant hyperparameter. you trolsThe thelearning effectiverate capacity of the the mo model del in aimp more complicated wa way y thanIfother ha v e time to tune only one hyperparameter, tune the learning rate. It conhyp yperparameters—the erparameters—the effective capacity of the mo model del is highest when the learning trols isthe effective of the moproblem, del in a not more complicated wayrate thanis other rate forcapacity the optimization when the learning esp espeecorrect h yp erparameters—the effective capacity of the mo del is highest when the learning cially large or esp especially ecially small. 
The learning rate has a U-shaped curve for training error, illustrated in Fig. 11.1. When the learning rate is too large, gradient descent can inadvertently increase rather than decrease the training error. In the idealized quadratic case, this occurs if the learning rate is at least twice as large as its optimal value (LeCun et al., 1998a). When the learning rate is too small, training is not only slower, but may become permanently stuck with a high training error. This effect is poorly understood (it would not happen for a convex loss function).

Tuning the parameters other than the learning rate requires monitoring both training and test error to diagnose whether your model is overfitting or underfitting, then adjusting its capacity appropriately.

If your error on the training set is higher than your target error rate, you have no choice but to increase capacity.
If you are not using regularization and you are confident that your optimization algorithm is performing correctly, then you must add more layers to your network or add more hidden units. Unfortunately, this increases the computational costs associated with the model.

If your error on the test set is higher than your target error rate, you can
[Figure 11.1 here: training error (vertical axis, 0 to 8) plotted against learning rate on a logarithmic scale (horizontal axis, 10^-2 to 10^0).]
Figure 11.1: Typical relationship between the learning rate and the training error. Notice the sharp rise in error when the learning rate is above an optimal value. This is for a fixed training time, as a smaller learning rate may sometimes only slow down training by a factor proportional to the learning rate reduction. Generalization error can follow this curve or be complicated by regularization effects arising out of having too large or too small learning rates, since poor optimization can, to some degree, reduce or prevent overfitting, and even points with equivalent training error can have different generalization error.
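The idealized quadratic case is easy to verify numerically: for the loss L(w) = lam * w**2 / 2, each gradient step multiplies w by (1 - lr * lam), so training diverges exactly when the learning rate exceeds twice the optimal value 1/lam. A minimal sketch:

```python
# Gradient descent on the 1-D quadratic loss L(w) = 0.5 * lam * w**2.
# The optimal fixed learning rate is 1/lam; divergence begins at 2/lam.
lam = 4.0

def final_loss(lr, steps=50, w0=1.0):
    """Loss after running gradient descent at a fixed learning rate."""
    w = w0
    for _ in range(steps):
        w -= lr * lam * w          # gradient of L is lam * w
    return 0.5 * lam * w * w

for lr in [0.5 / lam, 1.0 / lam, 1.9 / lam, 2.5 / lam]:
    print(f"lr = {lr:.3f}: final loss = {final_loss(lr):.3g}")
```

At lr = 1/lam the optimum is reached in one step; at 1.9/lam training still converges, only oscillating; at 2.5/lam the error grows by a factor of 1.5 per step, reproducing the sharp rise on the right of the figure.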
now take two kinds of actions. The test error is the sum of the training error and the gap between training and test error. The optimal test error is found by trading off these quantities. Neural networks typically perform best when the training error is very low (and thus, when capacity is high) and the test error is primarily driven by the gap between train and test error. Your goal is to reduce this gap without increasing training error faster than the gap decreases. To reduce the gap, change regularization hyperparameters to reduce effective model capacity, such as by adding dropout or weight decay. Usually the best performance comes from a large model that is regularized well, for example by using dropout.

Most hyperparameters can be set by reasoning about whether they increase or decrease model capacity. Some examples are included in Table 11.1.

While manually tuning hyperparameters, do not lose sight of your end goal:
good performance on the test set. Adding regularization is only one way to achieve this goal. As long as you have low training error, you can always reduce generalization error by collecting more training data. The brute force way to practically guarantee success is to continually increase model capacity and training set size until the task is solved. This approach does of course increase the computational cost of training and inference, so it is only feasible given appropriate resources. In
Number of hidden units
  Increases capacity when: increased.
  Reason: Increasing the number of hidden units increases the representational capacity of the model.
  Caveats: Increasing the number of hidden units increases both the time and memory cost of essentially every operation on the model.

Learning rate
  Increases capacity when: tuned optimally.
  Reason: An improper learning rate, whether too high or too low, results in a model with low effective capacity due to optimization failure.

Convolution kernel width
  Increases capacity when: increased.
  Reason: Increasing the kernel width increases the number of parameters in the model.
  Caveats: A wider kernel results in a narrower output dimension, reducing model capacity unless you use implicit zero padding to reduce this effect. Wider kernels require more memory for parameter storage and increase runtime, but a narrower output reduces memory cost.

Implicit zero padding
  Increases capacity when: increased.
  Reason: Adding implicit zeros before convolution keeps the representation size large.
  Caveats: Increased time and memory cost of most operations.

Weight decay coefficient
  Increases capacity when: decreased.
  Reason: Decreasing the weight decay coefficient frees the model parameters to become larger.

Dropout rate
  Increases capacity when: decreased.
  Reason: Dropping units less often gives the units more opportunities to "conspire" with each other to fit the training set.

Table 11.1: The effect of various hyperparameters on model capacity.
principle, this approach could fail due to optimization difficulties, but for many problems optimization does not seem to be a significant barrier, provided that the model is chosen appropriately.
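Dropout, recommended above as the regularizer of choice for large models, can be sketched at the level of a single layer. This is the standard "inverted dropout" formulation, which rescales surviving activations at training time so that nothing needs to change at test time:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p, train=True):
    """Inverted dropout: zero each unit with probability p during training,
    scale survivors by 1/(1-p) so the expected activation is unchanged."""
    if not train or p == 0.0:
        return h
    mask = rng.random(h.shape) >= p
    return h * mask / (1.0 - p)

h = np.ones((4, 1000))                              # toy activations
out = dropout(h, p=0.5)
print("mean activation with dropout:", out.mean())  # close to 1.0
print("mean activation at test time:", dropout(h, p=0.5, train=False).mean())
```

Because the expected activation is preserved, the same weights can be used at training and test time, which is what makes it cheap to regularize a very large model this way.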
11.4.2 Automatic Hyperparameter Optimization Algorithms

The ideal learning algorithm just takes a dataset and outputs a function, without requiring hand-tuning of hyperparameters. The popularity of several learning algorithms such as logistic regression and SVMs stems in part from their ability to perform well with only one or two tuned hyperparameters. Neural networks can sometimes perform well with only a small number of tuned hyperparameters, but often benefit significantly from tuning of forty or more hyperparameters. Manual hyperparameter tuning can work very well when the user has a good starting point, such as one determined by others having worked on the same type of application and architecture, or when the user has months or years of experience in exploring hyperparameter values for neural networks applied to similar tasks. However, for many applications, these starting points are not available. In these cases, automated algorithms can find useful values of the hyperparameters.
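As a sketch of such an automated algorithm, the loop below runs a simple random search, sampling hyperparameters on a logarithmic scale and keeping the best. Here `train_and_score` is a hypothetical stand-in for training a model and returning its validation error:

```python
import math
import random

random.seed(0)

def train_and_score(lr, weight_decay):
    """Hypothetical stand-in for 'train a model, return validation error'.
    For illustration it is a smooth surface minimized near lr = 1e-2 and
    weight_decay = 1e-4."""
    return (math.log10(lr) + 2) ** 2 + 0.1 * (math.log10(weight_decay) + 4) ** 2

# Random search: sample each hyperparameter log-uniformly, keep the best.
best = None
for _ in range(100):
    lr = 10 ** random.uniform(-5, 0)
    wd = 10 ** random.uniform(-7, -1)
    err = train_and_score(lr, wd)
    if best is None or err < best[0]:
        best = (err, lr, wd)

err, lr, wd = best
print(f"best validation error {err:.4f} at lr={lr:.2e}, weight_decay={wd:.2e}")
```

Note that the search itself has hyperparameters, the sampling ranges, which is exactly the point made in the next paragraph; they are usually much easier to choose than the values they govern.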
If we think about the way in which the user of a learning algorithm searches for good values of the hyperparameters, we realize that an optimization is taking place: we are trying to find a value of the hyperparameters that optimizes an objective function, such as validation error, sometimes under constraints (such as a budget for training time, memory or recognition time). It is therefore possible, in principle, to develop hyperparameter optimization algorithms that wrap a learning algorithm and choose its hyperparameters, thus hiding the hyperparameters of the learning algorithm from the user. Unfortunately, hyperparameter optimization algorithms often have their own hyperparameters, such as the range of values that should be explored for each of the learning algorithm's hyperparameters.
However, these secondary hyperparameters are usually easier to choose, in the sense that acceptable performance may be achieved on a wide range of tasks using the same secondary hyperparameters for all tasks.

11.4.3 Grid Search

When there are three or fewer hyperparameters, the common practice is to perform grid search. For each hyperparameter, the user selects a small finite set of values to explore. The grid search algorithm then trains a model for every joint specification of hyperparameter values in the Cartesian product of the set of values for each individual hyperparameter. The experiment that yields the best validation set
CHAPTER 11. PRACTICAL METHODOLOGY
[Figure 11.2: two panels, a grid layout (left) and a random layout (right), each plotting an important parameter on the horizontal axis against an unimportant parameter on the vertical axis.]

Figure 11.2: Comparison of grid search and random search. For illustration purposes we display two hyperparameters but we are typically interested in having many more. (Left) To perform grid search, we provide a set of values for each hyperparameter. The search algorithm runs training for every joint hyperparameter setting in the cross product of these sets. (Right) To perform random search, we provide a probability distribution over joint hyperparameter configurations. Usually most of these hyperparameters are independent from each other. Common choices for the distribution over a single hyperparameter include uniform and log-uniform (to sample from a log-uniform distribution, take the exp of a sample from a uniform distribution). The search algorithm then randomly samples joint hyperparameter configurations and runs training with each of them. Both grid search and random search evaluate the validation set error and return the best configuration. The figure illustrates the typical case where only some hyperparameters have a significant influence on the result. In this illustration, only the hyperparameter on the horizontal axis has a significant effect. Grid search wastes an amount of computation that is exponential in the number of non-influential hyperparameters, while random search tests a unique value of every influential hyperparameter on nearly every trial.
error is then chosen as having found the best hyperparameters. See the left of Fig. 11.2 for an illustration of a grid of hyperparameter values.

How should the lists of values to search over be chosen? In the case of numerical (ordered) hyperparameters, the smallest and largest element of each list is chosen conservatively, based on prior experience with similar experiments, to make sure that the optimal value is very likely to be in the selected range. Typically, a grid search involves picking values approximately on a logarithmic scale, e.g., a learning rate taken within the set {0.1, 0.01, 10^-3, 10^-4, 10^-5}, or a number of hidden units taken within the set {50, 100, 200, 500, 1000, 2000}.

Grid search usually performs best when it is performed repeatedly. For example, suppose that we ran a grid search over a hyperparameter α using values of {-1, 0, 1}. If the best value found is 1, then we underestimated the range in which the best α
lies and we should shift the grid and run another search with α in, for example, {1, 2, 3}. If we find that the best value of α is 0, then we may wish to refine our estimate by zooming in and running a grid search over {-0.1, 0, 0.1}.

The obvious problem with grid search is that its computational cost grows exponentially with the number of hyperparameters. If there are m hyperparameters, each taking at most n values, then the number of training and evaluation trials required grows as O(n^m). The trials may be run in parallel and exploit loose parallelism (with almost no need for communication between different machines carrying out the search). Unfortunately, due to the exponential cost of grid search, even parallelization may not provide a satisfactory size of search.
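As a concrete sketch of this procedure, grid search is a loop over a Cartesian product. Here `train_and_eval` is a hypothetical placeholder for training a model with the given hyperparameters and returning its validation set error:

```python
import itertools

def grid_search(train_and_eval, param_grid):
    """Train a model for every joint hyperparameter specification in the
    Cartesian product of the per-hyperparameter value lists, and return
    the configuration with the lowest validation set error."""
    names = sorted(param_grid)
    best_error, best_params = float("inf"), None
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        error = train_and_eval(**params)
        if error < best_error:
            best_error, best_params = error, params
    return best_params, best_error

# Values chosen approximately on a logarithmic scale, as suggested above.
param_grid = {
    "learning_rate": [0.1, 0.01, 1e-3, 1e-4, 1e-5],
    "n_hidden": [50, 100, 200, 500, 1000, 2000],
}
```

The number of trials is the product of the list lengths (here 5 × 6 = 30), which is what makes the cost of grid search exponential in the number of hyperparameters.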
11.4.4 Random Search

Fortunately, there is an alternative to grid search that is as simple to program, more convenient to use, and converges much faster to good values of the hyperparameters: random search (Bergstra and Bengio, 2012).

A random search proceeds as follows. First we define a marginal distribution for each hyperparameter, e.g., a Bernoulli or multinoulli for binary or discrete hyperparameters, or a uniform distribution on a log-scale for positive real-valued hyperparameters. For example,

    log_learning_rate ~ u(-1, -5)    (11.2)
    learning_rate = 10^log_learning_rate    (11.3)

where u(a, b) indicates a sample of the uniform distribution in the interval (a, b). Similarly, the log_number_of_hidden_units may be sampled from u(log(50), log(2000)).
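Eqs. 11.2 and 11.3 amount to sampling in log space and then exponentiating. A minimal sketch, using the example ranges above (the function name is illustrative, and the Bernoulli/multinoulli cases are omitted for brevity):

```python
import math
import random

def sample_configuration(rng):
    """Draw one joint hyperparameter configuration from independent
    marginal distributions defined on a log scale."""
    # Eq. 11.2: log_learning_rate ~ u(-1, -5)
    log_learning_rate = rng.uniform(-5, -1)
    # Eq. 11.3: learning_rate = 10 ** log_learning_rate
    learning_rate = 10.0 ** log_learning_rate
    # log_number_of_hidden_units ~ u(log(50), log(2000)); exponentiate
    # and round to get an integer number of hidden units.
    n_hidden = int(round(math.exp(rng.uniform(math.log(50), math.log(2000)))))
    return {"learning_rate": learning_rate, "n_hidden": n_hidden}

rng = random.Random(0)
config = sample_configuration(rng)
```

Sampling on a log scale means that, for example, learning rates near 10^-5 and near 10^-1 are equally likely, which is usually what we want for hyperparameters whose natural scale is multiplicative.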
Unlike in the case of a grid search, one should not discretize or bin the values of the hyperparameters. This allows one to explore a larger set of values, and does not incur additional computational cost. In fact, as illustrated in Fig. 11.2, a random search can be exponentially more efficient than a grid search, when there are several hyperparameters that do not strongly affect the performance measure. This is studied at length in Bergstra and Bengio (2012), who found that random search reduces the validation set error much faster than grid search, in terms of the number of trials run by each method.

As with grid search, one may often want to run repeated versions of random search, to refine the search based on the results of the first run.
The main reason why random search finds good solutions faster than grid search is that there are no wasted experimental runs, unlike in the case of grid search, when two values of a hyperparameter (given values of the other hyperparameters) would give the same result. In the case of grid search, the other hyperparameters would have the same values for these two runs, whereas with random search, they would usually have different values. Hence, if the change between these two values does not make much difference in terms of validation set error, grid search will unnecessarily repeat two equivalent experiments while random search will still give two independent explorations of the other hyperparameters.
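Putting this together, the whole random search procedure is a short loop. In the sketch below, a made-up one-dimensional "validation error" stands in for real training; note that every trial draws a fresh value of every hyperparameter, so no two trials repeat a value of an influential one:

```python
import random

def random_search(train_and_eval, sample_configuration, n_trials, seed=0):
    """Sample a fresh joint configuration for every trial, train with it,
    and return the configuration with the lowest validation set error."""
    rng = random.Random(seed)
    best_error, best_config = float("inf"), None
    for _ in range(n_trials):
        config = sample_configuration(rng)
        error = train_and_eval(**config)
        if error < best_error:
            best_error, best_config = error, config
    return best_config, best_error

# Toy illustration: a fictitious "validation error" minimized at
# learning_rate = 0.01, with the learning rate sampled log-uniformly.
sample = lambda rng: {"learning_rate": 10.0 ** rng.uniform(-5, -1)}
toy_error = lambda learning_rate: (learning_rate - 0.01) ** 2
best, err = random_search(toy_error, sample, n_trials=100)
```

With 100 trials, some sampled learning rate almost surely lands close to the optimum, even though no value is ever tried twice.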
11.4.5 Model-Based Hyperparameter Optimization

The search for good hyperparameters can be cast as an optimization problem. The decision variables are the hyperparameters. The cost to be optimized is the validation set error that results from training using these hyperparameters. In simplified settings where it is feasible to compute the gradient of some differentiable error measure on the validation set with respect to the hyperparameters, we can simply follow this gradient (Bengio et al., 1999; Bengio, 2000; Maclaurin et al., 2015). Unfortunately, in most practical settings, this gradient is unavailable, either due to its high computation and memory cost, or due to hyperparameters having intrinsically non-differentiable interactions with the validation set error, as in the case of discrete-valued hyperparameters.
To compensate for this lack of a gradient, we can build a model of the validation set error, then propose new hyperparameter guesses by performing optimization within this model. Most model-based algorithms for hyperparameter search use a Bayesian regression model to estimate both the expected value of the validation set error for each hyperparameter and the uncertainty around this expectation. Optimization thus involves a tradeoff between exploration (proposing hyperparameters
for which there is high uncertainty, which may lead to a large improvement but may also perform poorly) and exploitation (proposing hyperparameters which the model is confident will perform as well as any hyperparameters it has seen so far, usually hyperparameters that are very similar to ones it has seen before). Contemporary approaches to hyperparameter optimization include Spearmint (Snoek et al., 2012), TPE (Bergstra et al., 2011) and SMAC (Hutter et al., 2011).

Currently, we cannot unambiguously recommend Bayesian hyperparameter optimization as an established tool for achieving better deep learning results or for obtaining those results with less effort. Bayesian hyperparameter optimization sometimes performs comparably to human experts, sometimes better, but fails catastrophically on other problems.
It may be worth trying on a particular problem to see if it works, but it is not yet sufficiently mature or reliable. That being said, hyperparameter optimization is an important field of research that, while often driven primarily by the needs of deep learning, holds the potential to benefit not only the entire field of machine learning but the discipline of engineering in general.

One drawback common to most hyperparameter optimization algorithms with more sophistication than random search is that they require a training experiment to run to completion before they are able to extract any information from the experiment. This is much less efficient, in the sense of how much information can be gleaned early in an experiment, than manual search by a human practitioner, since one can usually tell early on if some set of hyperparameters is completely pathological. Swersky et al. (2014) have introduced an early version
of an algorithm that maintains a set of multiple experiments. At various time points, the hyperparameter optimization algorithm can choose to begin a new experiment, to "freeze" a running experiment that is not promising, or to "thaw" and resume an experiment that was earlier frozen but now appears promising given more information.
11.5 Debugging Strategies

When a machine learning system performs poorly, it is usually difficult to tell whether the poor performance is intrinsic to the algorithm itself or whether there is a bug in the implementation of the algorithm. Machine learning systems are difficult to debug for a variety of reasons.

In most cases, we do not know a priori what the intended behavior of the algorithm is. In fact, the entire point of using machine learning is that it will discover useful behavior that we were not able to specify ourselves. If we train a
neural network on a new classification task and it achieves 5% test error, we have no straightforward way of knowing if this is the expected behavior or sub-optimal behavior.

A further difficulty is that most machine learning models have multiple parts that are each adaptive. If one part is broken, the other parts can adapt and still achieve roughly acceptable performance. For example, suppose that we are training a neural net with several layers parametrized by weights W and biases b. Suppose further that we have manually implemented the gradient descent rule for each parameter separately, and that we made an error in the update for the biases:

    b ← b - α    (11.4)

where α is the learning rate. This erroneous update does not use the gradient at all. It causes the biases to constantly become negative throughout learning, which is clearly not a correct implementation of any reasonable learning algorithm.
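The erroneous rule in Eq. 11.4 is easy to observe in isolation: because the gradient never enters the update, the bias decreases by the same fixed amount at every step, regardless of the data. A minimal sketch:

```python
alpha = 0.1  # learning rate
b = 0.0      # bias parameter, initialized to zero

for step in range(100):
    # The buggy rule of Eq. 11.4: the gradient never appears, so the
    # update is identical at every step no matter what the data says.
    b = b - alpha
    # A correct rule would instead be: b = b - alpha * grad_loss_wrt_b

# b is now -10.0: it has drifted steadily negative throughout "training".
```

In a full network this drift can be partially masked, since, depending on the input distribution, the weights may adapt to compensate for the increasingly negative biases, which is why the bug may not be visible in the model's outputs.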
The bug may not be apparent just from examining the output of the model, though. Depending on the distribution of the input, the weights may be able to adapt to compensate for the negative biases.

Most debugging strategies for neural nets are designed to get around one or both of these two difficulties. Either we design a case that is so simple that the correct behavior actually can be predicted, or we design a test that exercises one part of the neural net implementation in isolation.

Some important debugging tests include:

Visualize the model in action: When training a model to detect objects in images, view some images with the detections proposed by the model displayed superimposed on the image. When training a generative model of speech, listen to some of the speech samples it produces. This may seem obvious, but it is easy to
fall into the practice of only looking at quantitative performance measurements like accuracy or log-likelihood. Directly observing the machine learning model performing its task will help you to determine whether the quantitative performance numbers it achieves seem reasonable. Evaluation bugs can be some of the most devastating bugs because they can mislead you into believing your system is performing well when it is not.

Visualize the worst mistakes: Most models are able to output some sort of confidence measure for the task they perform. For example, classifiers based on a softmax output layer assign a probability to each class. The probability assigned to the most likely class thus gives an estimate of the confidence the model has in its classification decision. Typically, maximum likelihood training results in these values being overestimates rather than accurate probabilities of correct prediction,
but they are somewhat useful in the sense that examples that are actually less likely to be correctly labeled receive smaller probabilities under the model. By viewing the training set examples that are the hardest to model correctly, one can often discover problems with the way the data has been preprocessed or labeled. For example, the Street View transcription system originally had a problem where the address number detection system would crop the image too tightly and omit some of the digits. The transcription network then assigned very low probability to the correct answer on these images. Sorting the images to identify the most confident mistakes showed that there was a systematic problem with the cropping.
performance of the overall system, even though the transcription network needed to be able to process greater variation in the position and scale of the address numbers.

Reasoning about software using train and test error: It is often difficult to determine whether the underlying software is correctly implemented. Some clues can be obtained from the train and test error. If training error is low but test error is high, then it is likely that the training procedure works correctly, and the model is overfitting for fundamental algorithmic reasons. An alternative possibility is that the test error is measured incorrectly due to a problem with saving the model after training then reloading it for test set evaluation, or if the test data was prepared differently from the training data. If both train and test error are high, then it is difficult to determine whether there is a software defect or whether
the model is underfitting due to fundamental algorithmic reasons. This scenario requires further tests, described next.

Fit a tiny dataset: If you have high error on the training set, determine whether it is due to genuine underfitting or due to a software defect. Usually even small models can be guaranteed to be able to fit a sufficiently small dataset. For example, a classification dataset with only one example can be fit just by setting the biases of the output layer correctly. Usually if you cannot train a classifier to correctly label a single example, an autoencoder to successfully reproduce a single example with high fidelity, or a generative model to consistently emit samples resembling a single example, there is a software defect preventing successful optimization on the training set. This test can be extended to a small dataset with few examples.
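A minimal sketch of this test, using a hypothetical bias-only softmax classifier in numpy (the number of classes, the learning rate, and the iteration count are illustrative choices, not from the text): gradient descent on a single labeled example should drive the training loss to nearly zero, and failure to do so suggests a defect.

```python
import numpy as np

# Fit one labeled example with a bias-only softmax classifier.
def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

n_classes, label = 4, 2
bias = np.zeros(n_classes)

# Gradient descent on the negative log-likelihood of the one example;
# the gradient of -log p[label] with respect to the logits is p - onehot.
for _ in range(500):
    p = softmax(bias)
    grad = p.copy()
    grad[label] -= 1.0
    bias -= 0.5 * grad

loss = -np.log(softmax(bias)[label])
assert loss < 0.05   # near-zero training loss; otherwise suspect a defect
```

If this assertion fails for a model that should trivially memorize one example, the optimization code, not the model capacity, is the likely culprit.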
Compare back-propagated derivatives to numerical derivatives: If you are using a software framework that requires you to implement your own gradient computations, or if you are adding a new operation to a differentiation library and must define its bprop method, then a common source of error is implementing this gradient expression incorrectly. One way to verify that these derivatives are correct
is to compare the derivatives computed by your implementation of automatic differentiation to the derivatives computed by finite differences. Because

    f'(x) = lim_{ε→0} (f(x + ε) − f(x)) / ε,                          (11.5)

we can approximate the derivative by using a small, finite ε:

    f'(x) ≈ (f(x + ε) − f(x)) / ε.                                    (11.6)

We can improve the accuracy of the approximation by using the centered difference:

    f'(x) ≈ (f(x + ½ε) − f(x − ½ε)) / ε.                              (11.7)

The perturbation size ε must be chosen to be large enough to ensure that the perturbation is not rounded down too much by finite-precision numerical computations.

Usually, we will want to test the gradient or Jacobian of a vector-valued function g : ℝ^m → ℝ^n. Unfortunately, finite differencing only allows us to take a single derivative at a time. We can either run finite differencing mn times to evaluate all of the partial derivatives of g, or we can apply the test to a new function that uses random projections at both the input and output of g.
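Eq. 11.6 and Eq. 11.7 translate directly into a debugging routine. A small sketch, with a hypothetical test function f standing in for the forward computation whose gradient implementation is under test:

```python
def centered_difference(f, x, eps=1e-5):
    """Approximate f'(x) via Eq. 11.7: (f(x + eps/2) - f(x - eps/2)) / eps."""
    return (f(x + eps / 2) - f(x - eps / 2)) / eps

# Hypothetical example: f(x) = x**3 has analytic derivative 3*x**2,
# standing in for a back-propagated gradient to be verified.
f = lambda x: x ** 3
df = lambda x: 3 * x ** 2

x = 0.7
assert abs(centered_difference(f, x) - df(x)) < 1e-8
```

The centered form has O(ε²) truncation error versus O(ε) for Eq. 11.6, which is why it tolerates a larger, rounding-safe ε.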
For example, we can apply our test of the implementation of the derivatives to f(x) where f(x) = u^T g(vx), where u and v are randomly chosen vectors. Computing f'(x) correctly requires being able to back-propagate through g correctly, yet it is efficient to do with finite differences because f has only a single input and a single output. It is usually a good idea to repeat this test for more than one value of u and v to reduce the chance that the test overlooks mistakes that are orthogonal to the random projection.

If one has access to numerical computation on complex numbers, then there is a very efficient way to numerically estimate the gradient by using complex numbers as input to the function (Squire and Trapp, 1998). The method is based on the observation that
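A sketch of this projection trick, with a hypothetical vector-valued g and its claimed Jacobian standing in for an implementation under test (d/dx of u^T g(vx) is u^T J_g(vx) v, which a single scalar finite difference can check):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical function g: R^2 -> R^3 and the Jacobian we wish to verify.
def g(x):
    return np.array([np.sin(x[0]) + x[1], x[0] * x[1], x[1] ** 2])

def jacobian_g(x):
    return np.array([[np.cos(x[0]), 1.0],
                     [x[1],         x[0]],
                     [0.0,          2 * x[1]]])

n, m = 3, 2
u = rng.standard_normal(n)     # random projection at the output
v = rng.standard_normal(m)     # random projection at the input

# f(x) = u^T g(v x) is scalar-in, scalar-out: one finite difference suffices.
f = lambda x: u @ g(v * x)
fprime_analytic = lambda x: u @ (jacobian_g(v * x) @ v)

eps, x0 = 1e-5, 0.3
approx = (f(x0 + eps / 2) - f(x0 - eps / 2)) / eps
assert abs(approx - fprime_analytic(x0)) < 1e-6
```

Re-running with fresh u and v, as the text advises, guards against errors that happen to lie in the null space of one particular projection.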
    f(x + iε) = f(x) + iεf'(x) + O(ε²),                               (11.8)
    real(f(x + iε)) = f(x) + O(ε²),   imag(f(x + iε)/ε) = f'(x) + O(ε²),   (11.9)

where i = √−1. Unlike in the real-valued case above, there is no cancellation effect due to taking the difference between the value of f at different points. This allows the use of tiny values of ε like ε = 10⁻¹⁵⁰, which make the O(ε²) error insignificant for all practical purposes.
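A minimal sketch of the complex-step estimate of Eq. 11.9 (the test function here is hypothetical; the method requires that f be implemented with operations that remain analytic when given complex inputs):

```python
import numpy as np

def complex_step_derivative(f, x, eps=1e-150):
    """Estimate f'(x) as imag(f(x + i*eps)) / eps (Squire and Trapp, 1998).

    No subtraction of nearby function values occurs, so there is no
    cancellation and eps can be made extremely small."""
    return np.imag(f(x + 1j * eps)) / eps

f = lambda x: np.exp(np.sin(x))           # f'(x) = cos(x) * exp(sin(x))
x = 1.3
exact = np.cos(x) * np.exp(np.sin(x))
estimate = complex_step_derivative(f, x)
assert abs(estimate - exact) < 1e-12
```

Code containing non-analytic operations (e.g. abs, comparisons used for branching on the real part) needs care before this trick applies.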
Monitor histograms of activations and gradients: It is often useful to visualize statistics of neural network activations and gradients, collected over a large number of training iterations (maybe one epoch). The pre-activation value of hidden units can tell us if the units saturate, or how often they do. For example, for rectifiers, how often are they off? Are there units that are always off? For tanh units, the average of the absolute value of the pre-activations tells us how saturated the unit is. In a deep network where the propagated gradients quickly grow or quickly vanish, optimization may be hampered. Finally, it is useful to compare the magnitude of parameter gradients to the magnitude of the parameters themselves. As suggested by Bottou (2015), we would like the magnitude of parameter updates over a minibatch to represent something like 1% of the magnitude of the parameter, not 50% or 0.001% (which would make the parameters move too slowly). It may be that some groups of parameters are moving at a good pace while others are stalled. When the data is sparse (like in natural language), some parameters may be very rarely updated, and this should be kept in mind when monitoring their evolution.

Finally, many deep learning algorithms provide some sort of guarantee about the results produced at each step. For example, in Part III, we will see some approximate inference algorithms that work by using algebraic solutions to optimization problems. Typically these can be debugged by testing each of their guarantees. Some guarantees that some optimization algorithms offer include that the objective function will never increase after one step of the algorithm, that
the gradient with respect to some subset of variables will be zero after each step of the algorithm, and that the gradient with respect to all variables will be zero at convergence. Usually due to rounding error, these conditions will not hold exactly in a digital computer, so the debugging test should include some tolerance parameter.
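As an illustration of such a test, the sketch below checks the non-increase guarantee of plain gradient descent on a convex quadratic (a stand-in for whatever algorithm is being debugged), with a tolerance for rounding error:

```python
import numpy as np

# Test the "objective never increases" guarantee of a monotone algorithm.
def objective(x):
    return 0.5 * np.dot(x, x)

def step(x, lr=0.1):
    return x - lr * x            # gradient of the objective is x itself

tolerance = 1e-10
x = np.array([3.0, -4.0])
prev = objective(x)
for _ in range(200):
    x = step(x)
    cur = objective(x)
    assert cur <= prev + tolerance, "monotonicity guarantee violated"
    prev = cur

# A second guarantee: the gradient should be (near) zero at convergence.
assert np.linalg.norm(x) < 1e-6
```

The tolerance is essential: bitwise comparisons of floating-point objectives will produce spurious failures even in correct implementations.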
11.6 Example: Multi-Digit Number Recognition
To provide an end-to-end description of how to apply our design methodology in practice, we present a brief account of the Street View transcription system, from the point of view of designing the deep learning components. Obviously, many other components of the complete system, such as the Street View cars, the database infrastructure, and so on, were of paramount importance.

From the point of view of the machine learning task, the process began with data collection. The cars collected the raw data and human operators provided labels. The transcription task was preceded by a significant amount of dataset
CHAPTER 11. PRACTICAL METHODOLOGY
curation, including using other machine learning techniques to detect the house numbers prior to transcribing them.

The transcription project began with a choice of performance metrics and desired values for these metrics. An important general principle is to tailor the choice of metric to the business goals for the project. Because maps are only useful if they have high accuracy, it was important to set a high accuracy requirement for this project. Specifically, the goal was to obtain human-level, 98% accuracy. This level of accuracy may not always be feasible to obtain. In order to reach this level of accuracy, the Street View transcription system sacrifices coverage. Coverage thus became the main performance metric optimized during the project, with accuracy held at 98%. As the convolutional network improved, it became possible to reduce the confidence threshold below which the network refuses to
transcribe the input, eventually exceeding the goal of 95% coverage.

After choosing quantitative goals, the next step in our recommended methodology is to rapidly establish a sensible baseline system. For vision tasks, this means a convolutional network with rectified linear units. The transcription project began with such a model. At the time, it was not common for a convolutional network to output a sequence of predictions. In order to begin with the simplest possible baseline, the first implementation of the output layer of the model consisted of n different softmax units to predict a sequence of n characters. These softmax units were trained exactly the same as if the task were classification, with each softmax unit trained independently.

Our recommended methodology is to iteratively refine the baseline and test whether each change makes an improvement. The first change to the Street View transcription system was motivated by a theoretical understanding of the coverage metric and the structure of the data. Specifically, the network refuses to classify an input x whenever the probability of the output sequence p(y | x) < t for some threshold t. Initially, the definition of p(y | x) was ad-hoc, based on simply multiplying all of the softmax outputs together. This motivated the development of a specialized output layer and cost function that actually computed a principled log-likelihood. This approach allowed the example rejection mechanism to function much more effectively.

At this point, coverage was still below 90%, yet there were no obvious theoretical problems with the approach. Our methodology therefore suggests instrumenting
Our metho dology therefore suggests to instrumen is underfitting or overfitti verfitting. ng. In this case, train and test set error were nearlyt the train and test set p erformance in order to determine the problem iden identical. tical. Indeed, the main reason this pro project ject pro proceeded ceeded whether so smo smoothly othly was the underfitting overfitti ng. tens In this case, train and set error were nearly aisvailability of aor dataset with of millions of lab labeled eledtest examples. Because train identical. Indeed, the main reason this pro ject proceeded so smoothly was the availability of a dataset with tens of millions of labeled examples. Because train 444
and test set error were so similar, this suggested that the problem was either due to underfitting or due to a problem with the training data. One of the debugging strategies we recommend is to visualize the model's worst errors. In this case, that meant visualizing the incorrect training set transcriptions to which the model gave the highest confidence. These proved to mostly consist of examples where the input image had been cropped too tightly, with some of the digits of the address being removed by the cropping operation. For example, a photo of an address "1849" might be cropped too tightly, with only the "849" remaining visible. This problem could have been resolved by spending weeks improving the accuracy of the address number detection system responsible for determining the cropping regions. Instead, the team took a much more practical decision, to simply expand the width of the
crop region to be systematically wider than the address number detection system predicted. This single change added ten percentage points to the transcription system's coverage.

Finally, the last few percentage points of performance came from adjusting hyperparameters. This mostly consisted of making the model larger while maintaining some restrictions on its computational cost. Because train and test error remained roughly equal, it was always clear that any performance deficits were due to underfitting, as well as to a few remaining problems with the dataset itself.

Overall, the transcription project was a great success, and allowed hundreds of millions of addresses to be transcribed both faster and at lower cost than would have been possible via human effort.

We hope that the design principles described in this chapter will lead to many other similar successes.
Chapter 12
Applications

In this chapter, we describe how to use deep learning to solve applications in computer vision, speech recognition, natural language processing, and other application areas of commercial interest. We begin by discussing the large scale neural network implementations required for most serious AI applications. Next, we review several specific application areas that deep learning has been used to solve. While one goal of deep learning is to design algorithms that are capable of solving a broad variety of tasks, so far some degree of specialization is needed. For example, vision tasks require processing a large number of input features (pixels) per example. Language tasks require modeling a large number of possible values (words in the vocabulary) per input feature.
12.1 Large Scale Deep Learning
Deep learning is based on the philosophy of connectionism: while an individual biological neuron or an individual feature in a machine learning model is not intelligent, a large population of these neurons or features acting together can exhibit intelligent behavior. It truly is important to emphasize the fact that the number of neurons must be large. One of the key factors responsible for the improvement in neural networks' accuracy and the improvement of the complexity of tasks they can solve between the 1980s and today is the dramatic increase in the size of the networks we use. As we saw in Sec. 1.2.3, network sizes have grown exponentially for the past three decades, yet artificial neural networks are only as large as the nervous systems of insects.

Because the size of neural networks is of paramount importance, deep learning
CHAPTER 12. APPLICATIONS
requires high performance hardware and software infrastructure.
12.1.1 Fast CPU Implementations
Traditionally, neural networks were trained using the CPU of a single machine. Today, this approach is generally considered insufficient. We now mostly use GPU computing or the CPUs of many machines networked together. Before moving to these expensive setups, researchers worked hard to demonstrate that CPUs could not manage the high computational workload required by neural networks.

A description of how to implement efficient numerical CPU code is beyond the scope of this book, but we emphasize here that careful implementation for specific CPU families can yield large improvements. For example, in 2011, the best CPUs available could run neural network workloads faster when using fixed-point arithmetic rather than floating-point arithmetic. By creating a carefully tuned fixed-point implementation, Vanhoucke et al. (2011) obtained a 3× speedup over
Each new model of CPU has different performance characteristics, so sometimes floating-point implementations can be faster too. The important principle is that careful specialization of numerical computation routines can yield a large payoff. Other strategies, besides choosing whether to use fixed or floating point, include optimizing data structures to avoid cache misses and using vector instructions. Many machine learning researchers neglect these implementation details, but when the performance of an implementation restricts the size of the model, the accuracy of the model suffers.
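To make the fixed-point idea concrete, here is a minimal Python sketch (the Q8 scaling choice and all names are ours, not from the cited work): values are scaled to integers, the multiply-accumulate runs entirely in integer arithmetic, and the result is rescaled once at the end. The speedup itself comes from hardware integer and SIMD units, which this illustration does not capture.

```python
# Illustrative fixed-point dot product (the Q8 format and all names here
# are arbitrary choices for this sketch, not from the cited work).

SCALE = 1 << 8  # Q8 fixed point: 8 fractional bits

def to_fixed(x):
    """Convert a float to a scaled integer."""
    return int(round(x * SCALE))

def fixed_dot(weights, inputs):
    """Dot product carried out entirely in integer arithmetic."""
    w = [to_fixed(v) for v in weights]
    x = [to_fixed(v) for v in inputs]
    acc = sum(wi * xi for wi, xi in zip(w, x))  # integer multiply-accumulate
    return acc / (SCALE * SCALE)                # rescale once at the end

print(fixed_dot([0.5, -1.25, 2.0], [1.0, 0.5, 0.25]))  # 0.375, matching float math
```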
12.1.2
GPU Implementations
Most modern neural network implementations are based on graphics processing units. Graphics processing units (GPUs) are specialized hardware components that were originally developed for graphics applications. The consumer market for video gaming systems spurred development of graphics processing hardware. The performance characteristics needed for good video gaming systems turn out to be beneficial for neural networks as well.

Video game rendering requires performing many operations in parallel quickly. Models of characters and environments are specified in terms of lists of 3-D coordinates of vertices. Graphics cards must perform matrix multiplication and division on many vertices in parallel to convert these 3-D coordinates into 2-D on-screen coordinates.
The graphics card must then perform many computations at each pixel in parallel to determine the color of each pixel. In both cases, the
CHAPTER 12. APPLICATIONS
computations are fairly simple and do not involve much branching compared to the computational workload that a CPU usually encounters. For example, each vertex in the same rigid object will be multiplied by the same matrix; there is no need to evaluate an if statement per vertex to determine which matrix to multiply by. The computations are also entirely independent of each other, and thus may be parallelized easily. The computations also involve processing massive buffers of memory, containing bitmaps describing the texture (color pattern) of each object to be rendered. Together, this results in graphics cards having been designed to have a high degree of parallelism and high memory bandwidth, at the cost of having a lower clock speed and less branching capability relative to traditional CPUs.

Neural network algorithms require the same performance characteristics as the real-time graphics algorithms described above.
Neural networks usually involve large and numerous buffers of parameters, activation values, and gradient values, each of which must be completely updated during every step of training. These buffers are large enough to fall outside the cache of a traditional desktop computer, so the memory bandwidth of the system often becomes the rate limiting factor. GPUs offer a compelling advantage over CPUs due to their high memory bandwidth. Neural network training algorithms typically do not involve much branching or sophisticated control, so they are appropriate for GPU hardware. Since neural networks can be divided into multiple individual "neurons" that can be processed
independently from the other neurons in the same layer, neural networks easily benefit from the parallelism of GPU computing.

GPU hardware was originally so specialized that it could only be used for graphics tasks. Over time, GPU hardware became more flexible, allowing custom subroutines to be used to transform the coordinates of vertices or assign colors to pixels. In principle, there was no requirement that these pixel values actually be based on a rendering task. These GPUs could be used for scientific computing by writing the output of a computation to a buffer of pixel values. Steinkraus et al. (2005) implemented a two-layer fully connected neural network on a GPU and reported a 3× speedup over their CPU-based baseline. Shortly thereafter, Chellapilla et al. (2006) demonstrated that the same technique could be used to accelerate supervised convolutional networks.

The popularity of graphics cards for neural network training exploded after the advent of general purpose GPUs.
These GP-GPUs could execute arbitrary code, not just rendering subroutines. NVIDIA's CUDA programming language provided a way to write this arbitrary code in a C-like language. With their relatively convenient programming model, massive parallelism, and high memory bandwidth,
GP-GPUs now offer an ideal platform for neural network programming. This platform was rapidly adopted by deep learning researchers soon after it became available (Raina et al., 2009; Ciresan et al., 2010).

Writing efficient code for GP-GPUs remains a difficult task best left to specialists. The techniques required to obtain good performance on GPU are very different from those used on CPU. For example, good CPU-based code is usually designed to read information from the cache as much as possible. On GPU, most writable memory locations are not cached, so it can actually be faster to compute the same value twice, rather than compute it once and read it back from memory. GPU code is also inherently multi-threaded and the different threads must be coordinated with each other carefully.
For example, memory operations are faster if they can be coalesced. Coalesced reads or writes occur when several threads can each read or write a value that they need simultaneously, as part of a single memory transaction. Different models of GPUs are able to coalesce different kinds of read or write patterns. Typically, memory operations are easier to coalesce if among n threads, thread i accesses byte i + j of memory, and j is a multiple of some power of 2. The exact specifications differ between models of GPU. Another common consideration for GPUs is making sure that each thread in a group executes the same instruction simultaneously. This means that branching can be difficult on GPU. Threads are divided into small groups called warps. Each thread in a warp executes the same instruction during each cycle, so if different threads within the same warp need to execute different code paths, these different code paths must be traversed sequentially rather than in parallel.
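A toy model of coalescing may help. In the sketch below, a warp of 32 threads each issues one byte address, and we count how many aligned segments those addresses touch; the 128-byte transaction size is an assumed value for illustration, since the real rules vary by GPU generation.

```python
# Toy model of coalescing: count how many aligned memory segments one
# warp's addresses touch. SEGMENT = 128 bytes is an assumed transaction
# size for illustration; real coalescing rules differ across GPUs.

WARP_SIZE = 32
SEGMENT = 128

def transactions(addresses):
    """Number of aligned segments (memory transactions) a warp touches."""
    return len({addr // SEGMENT for addr in addresses})

# Thread i accesses byte i + j with j a multiple of a power of 2: coalesced.
aligned = [i + 128 for i in range(WARP_SIZE)]
print(transactions(aligned))   # 1: the whole warp shares one transaction

# A large stride scatters the accesses across many segments.
strided = [i * 256 for i in range(WARP_SIZE)]
print(transactions(strided))   # 32: one transaction per thread
```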
Due to the difficulty of writing high performance GPU code, researchers should structure their workflow to avoid needing to write new GPU code in order to test new models or algorithms. Typically, one can do this by building a software library of high performance operations like convolution and matrix multiplication, then specifying models in terms of calls to this library of operations. For example, the machine learning library Pylearn2 (Goodfellow et al., 2013c) specifies all of its machine learning algorithms in terms of calls to Theano (Bergstra et al., 2010; Bastien et al., 2012) and cuda-convnet (Krizhevsky, 2010), which provide these high-performance operations. This factored approach can also ease support for multiple kinds of hardware. For example, the same Theano program can run on either CPU or GPU, without needing to change any of the calls to Theano itself.
Other libraries like TensorFlow (Abadi et al., 2015) and Torch (Collobert et al., 2011b) provide similar features.
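The factored approach can be sketched in miniature: the model code below is written purely against a small library of operations, so a different backend (say, a GPU one) could be substituted without touching the model. All class and function names here are hypothetical.

```python
# The model is expressed purely as calls to a small operation library,
# so the backend could be swapped (e.g., for a GPU implementation)
# without changing the model code. All names here are hypothetical.

class CPUBackend:
    def matmul(self, a, b):
        rows, inner, cols = len(a), len(b), len(b[0])
        return [[sum(a[i][k] * b[k][j] for k in range(inner))
                 for j in range(cols)] for i in range(rows)]

    def relu(self, m):
        return [[max(0.0, v) for v in row] for row in m]

def forward(backend, x, w):
    """Model code: nothing here but calls into the backend library."""
    return backend.relu(backend.matmul(x, w))

x = [[1.0, -2.0]]
w = [[1.0, 0.0], [0.0, 1.0]]
print(forward(CPUBackend(), x, w))  # [[1.0, 0.0]]
```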
12.1.3
Large Scale Distributed Implementations
In many cases, the computational resources available on a single machine are insufficient. We therefore want to distribute the workload of training and inference across many machines.

Distributing inference is simple, because each input example we want to process can be run by a separate machine. This is known as data parallelism.

It is also possible to get model parallelism, where multiple machines work together on a single datapoint, with each machine running a different part of the model. This is feasible for both inference and training.

Data parallelism during training is somewhat harder. We can increase the size of the minibatch used for a single SGD step, but usually we get less than linear returns in terms of optimization performance. It would be better to allow multiple machines to compute multiple gradient descent steps in parallel. Unfortunately,
ould be better to allowalgorithm: multiple the standard definition of gradien gradient descen descentt is asItawcompletely sequential macgradient hines to at compute multiple gradient stepspro in duced parallel. , the step t is a function of thedescent parameters produced by Unfortunately step t − 1. the standard definition of gradient descent is as a completely sequential algorithm: can at bestep solv solved asynchr asynchronous onous sto stochastic chasticpro gr gradient adientby desc descent entt (Bengio the This gradient tedis using a function of the parameters duced step 1. et al. al.,, 2001; Rec Recht ht et al. al.,, 2011). In this approach, several pro processor cessor cores share − can representing be solved using asynchronous stochastic gradient descentwithout (Bengio the This memory the parameters. Each core reads parameters a et al. , 2001 ; Rec ht et al. , 2011 ). In this approach, several pro cessor cores share lo locck, then computes a gradient, then increments the parameters without a lo lock. ck. the memory representing the parameters. Each core reads parameters without a This reduces the av average erage amount of improv improvement ement that each gradien gradientt descen descentt step lock, then computes a gradient, then increments the parameters without a lock. yields, because some of the cores ov overwrite erwrite eac each h other’s progress, but the increased This of reduces the avof erage ement that each step rate pro production duction stepsamount causes of theimprov learning pro process cess to be gradien faster otvdescen erall. tDean yields, because some ofthe themulti-mac cores overwrite each other’s progress, but the approach increased et al. (2012 ) pioneered multi-machine hine implementation of this lo lock-free ck-free rate of pro duction of steps causes the learning pro cess to b e faster o v erall. Dean to gradient descent, where the parameters are managed by a par arameter ameter server et al. 
(2012 pioneered the multi-mac hine implementation of this logradient ck-free approach rather than) stored in shared memory memory. . Distributed async asynchronous hronous descent to gradient descent, where the parameters are managed by a p ar ameter server remains the primary strategy for training large deep net netw works and is used by rather than stored in shared memory . Distributed async hronous gradient descent most ma major jor deep learning groups in industry (Chilimbi et al., 2014; Wu et al. al.,, remains the primary strategy for training large deep net w orks and is used by 2015 2015). ). Academic deep learning researchers typically cannot afford the same scale most major deep learning groups industry (Chilimbi et al., on 2014 ; wWto u et al., of distributed learning systems butinsome research has fo focused cused ho how build 2015). Academic deepwith learning researchers typically cannot afford scale distributed netw networks orks relatively low-cost hardw hardware are av available ailable in the the same universit university y of distributed learning systems but some research has fo cused on ho w to build setting (Coates et al., 2013). distributed networks with relatively low-cost hardware available in the university setting (Coates et al., 2013).
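A minimal single-machine sketch of the lock-free scheme, with Python threads standing in for processor cores and a one-dimensional quadratic standing in for a real loss: each worker reads the shared parameter, computes a gradient, and writes back, all without locking, so some updates clobber others, yet the iterate still converges.

```python
# Lock-free (Hogwild!-style) asynchronous SGD on a toy objective
# (w - 3)^2; threads stand in for processor cores. No locks are taken,
# so workers occasionally overwrite each other's updates.
import threading

params = [0.0]   # shared parameter, read and written without a lock
LR = 0.01

def worker(steps):
    for _ in range(steps):
        w = params[0]              # lock-free read
        grad = 2.0 * (w - 3.0)     # gradient of (w - 3)^2
        params[0] = w - LR * grad  # lock-free write; may clobber a peer

threads = [threading.Thread(target=worker, args=(500,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(params[0])  # approaches the optimum at w = 3
```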
12.1.4
Model Compression
12.1.4 Model Compression In many commercial applications, it is muc much h more imp importan ortan ortantt that the time and memory cost of running inference in a machine learning mo model del be low than that In many commercial applications, it is muc h more imp ortan thatdo thenot time and the time and memory cost of training be low. For applicationstthat require memory cost of running inference in a machine learning model be low than that the time and memory cost of training b450 e low. For applications that do not require
personalization, it is possible to train a model once, then deploy it to be used by billions of users. In many cases, the end user is more resource-constrained than the developer. For example, one might train a speech recognition network with a powerful computer cluster, then deploy it on mobile phones.

A key strategy for reducing the cost of inference is model compression (Buciluǎ et al., 2006). The basic idea of model compression is to replace the original, expensive model with a smaller model that requires less memory and runtime to store and evaluate.

Model compression is applicable when the size of the original model is driven primarily by a need to prevent overfitting. In most cases, the model with the lowest generalization error is an ensemble of several independently trained models. Evaluating all n ensemble members is expensive. Sometimes, even a single model
generalizes better if it is large (for example, if it is regularized with dropout).

These large models learn some function f(x), but do so using many more parameters than are necessary for the task. Their size is necessary only due to the limited number of training examples. As soon as we have fit this function f(x), we can generate a training set containing infinitely many examples, simply by applying f to randomly sampled points x. We then train the new, smaller model to match f(x) on these points. In order to most efficiently use the capacity of the new, small model, it is best to sample the new x points from a distribution resembling the actual test inputs that will be supplied to the model later. This can be done by corrupting training examples or by drawing points from a generative model trained on the original training set.

Alternatively, one can train the smaller model only on the original training
points, but train it to copy other features of the model, such as its posterior distribution over the incorrect classes (Hinton et al., 2014, 2015).
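The synthetic-data recipe above can be sketched with toy stand-ins: an arbitrary fixed function plays the role of the large teacher model, and a one-parameter linear student is fit by SGD to labels the teacher produces on freshly sampled inputs.

```python
# Model compression with synthetic data: an arbitrary fixed function
# stands in for the expensive teacher; the student is a one-parameter
# linear model trained by SGD on teacher-labeled samples.
import random

def teacher(x):
    """Placeholder for a large, expensive model."""
    return 2.0 * x

# Unlimited training data: sample inputs, label them with the teacher.
data = [(x, teacher(x)) for x in (random.uniform(-1.0, 1.0) for _ in range(1000))]

w = 0.0  # student parameter
for x, y in data:
    w -= 0.1 * 2.0 * (w * x - y) * x   # SGD on squared error

print(w)  # the student recovers roughly 2.0, matching the teacher
```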
12.1.5
Dynamic Structure
One strategy for accelerating data processing systems in general is to build systems that have dynamic structure in the graph describing the computation needed to process an input. Data processing systems can dynamically determine which subset of many neural networks should be run on a given input. Individual neural networks can also exhibit dynamic structure internally by determining which subset of features (hidden units) to compute given information from the input. This form of dynamic structure inside neural networks is sometimes called conditional computation (Bengio, 2013; Bengio et al., 2013b). Since many components of the architecture may be relevant only for a small amount of possible inputs, the system
can run faster by computing these features only when they are needed.

Dynamic structure of computations is a basic computer science principle applied generally throughout the software engineering discipline. The simplest versions of dynamic structure applied to neural networks are based on determining which subset of some group of neural networks (or other machine learning models) should be applied to a particular input.

A venerable strategy for accelerating inference in a classifier is to use a cascade of classifiers. The cascade strategy may be applied when the goal is to detect the presence of a rare object (or event). To know for sure that the object is present, we must use a sophisticated classifier with high capacity, that is expensive to run. However, because the object is rare, we can usually use much less computation to reject inputs as not containing the object. In these situations, we can train a sequence of classifiers.
The first classifiers in the sequence have low capacity, and are trained to have high recall. In other words, they are trained to make sure we do not wrongly reject an input when the object is present. The final classifier is trained to have high precision. At test time, we run inference by running the classifiers in a sequence, abandoning any example as soon as any one element in the cascade rejects it. Overall, this allows us to verify the presence of objects with high confidence, using a high capacity model, but does not force us to pay the cost of full inference for every example. There are two different ways that the cascade can achieve high capacity. One way is to make the later members of the cascade individually have high capacity.
In this case, the system as a whole obviously has high capacity, because some of its individual members do. It is also possible to make a cascade in which every individual model has low capacity but the system as a whole has high capacity due to the combination of many small models. Viola and Jones (2001) used a cascade of boosted decision trees to implement a fast and robust face detector suitable for use in handheld digital cameras. Their classifier localizes a face using essentially a sliding window approach in which many windows are examined and rejected if they do not contain faces. Another version of cascades uses the earlier models to implement a sort of hard attention mechanism: the early members of the cascade localize an object and later members of the cascade perform further processing given the location of the object.
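The test-time cascade loop can be sketched as follows; the two stage functions are hypothetical placeholders for a cheap, high-recall filter and an expensive, high-precision classifier.

```python
# A two-stage cascade: a cheap, high-recall filter rejects most inputs,
# and only survivors pay for the expensive, high-precision classifier.
# Both stage functions are hypothetical placeholders.

def cheap_stage(x):
    """Fast filter: keeps anything that might be positive."""
    return x > 0

def expensive_stage(x):
    """Slow, high-precision classifier, run only on survivors."""
    return x > 10

def cascade(x, stages):
    for stage in stages:
        if not stage(x):
            return False   # abandon the example at the first rejection
    return True

inputs = [-5, 3, 12, 40]
print([cascade(x, [cheap_stage, expensive_stage]) for x in inputs])
# [False, False, True, True]
```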
For example, Google transcribes address numbers from Street View imagery using a two-step cascade that first locates the address number with one machine learning model and then transcribes it with another (Goodfellow et al., 2014d).

Decision trees themselves are an example of dynamic structure, because each node in the tree determines which of its subtrees should be evaluated for each input. A simple way to accomplish the union of deep learning and dynamic structure
CHAPTER 12. APPLICATIONS
is to train a decision tree in which each node uses a neural network to make the splitting decision (Guo and Gelfand, 1992), though this has typically not been done with the primary goal of accelerating inference computations.

In the same spirit, one can use a neural network, called the gater, to select which one out of several expert networks will be used to compute the output, given the current input. The first version of this idea is called the mixture of experts (Nowlan, 1990; Jacobs et al., 1991), in which the gater outputs a set of probabilities or weights (obtained via a softmax nonlinearity), one per expert, and the final output is obtained by the weighted combination of the output of the experts.
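A minimal sketch of this gating scheme is below. The `gater` and `experts` callables are hypothetical stand-ins for trained networks, not the architecture of any cited paper; the soft variant computes the weighted combination, while the hard variant evaluates only the selected expert.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mixture_of_experts(x, gater, experts, hard=False):
    """Soft vs. hard mixture of experts for a single input x.

    `gater` maps x to one logit per expert; each entry of `experts`
    maps x to an output."""
    weights = softmax(gater(x))                  # one probability per expert
    if hard:
        # hard mixture: run only the chosen expert (saves computation)
        return experts[int(np.argmax(weights))](x)
    # soft mixture: weighted combination of all expert outputs
    outputs = np.stack([expert(x) for expert in experts])
    return weights @ outputs
```

The soft mixture must evaluate every expert, which is why, as the text notes next, it offers no reduction in computational cost.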
In that single exp expert ert is chosen by the gater for eac each h example, we obtain the har hard d mixtur mixturee case, the use of the gater do es not offer a reduction in computational cost, but if a of exp experts erts (Collob Collobert ert et al., 2001, 2002), which can considerably accelerate training single exp ert is chosen by the gater for eac h example, we obtain the har d mixtur and inference time. This strategy works well when the num numb ber of gating decisions ise of experts (Collob al.,binatorial. 2001, 2002), which accelerate small because it isert notetcom combinatorial. But whencan weconsiderably wan wantt to select differen differentttraining subsets and inference time. This strategy w orks w ell when the num b er of gating decisions is of units or parameters, it is not possible to use a “soft switch” because it requires small because (and it is not combinatorial. we wan t to select differentTsubsets en enumerating umerating computing outputsBut for)when all the gater configurations. o deal of units or parameters, it is not p ossible to use a “soft switch” b ecause it requires with this problem, sev several eral approaches hav havee been explored to train combinatorial enumerating (and computing outputs for)with all the gater configurations. To deal gaters. Bengio et al. (2013b) exp experiment eriment several estimators of the gradient with problem, several approaches e b(een explored to train combinatorial on thethis gating probabilities, while Baconhav et al. 2015 ) and Bengio et al. (2015a) use gaters. Bengio et al. ( 2013b ) exp eriment with several estimators of the gradient reinforcemen reinforcementt learning techniques (p (policy olicy gradient) to learn a form of conditional on the gating probabilities, while Bacon et al. ( 2015 ) and Bengio et al. ( 2015a ) use drop dropout out on blo bloccks of hidden units and get an actual reduction in computational reinforcemen learning techniques gradient) a ximation. 
Another kind of dynamic structure is a switch, where a hidden unit can receive input from different units depending on the context. This dynamic routing approach can be interpreted as an attention mechanism (Olshausen et al., 1993). So far, the use of a hard switch has not proven effective on large-scale applications. Contemporary approaches instead use a weighted average over many possible inputs, and thus do not achieve all of the possible computational benefits of dynamic structure. Contemporary attention mechanisms are described in Sec. 12.4.5.1.

One major obstacle to using dynamically structured systems is the decreased degree of parallelism that results from the system following different code branches for different inputs.
This means that few operations in the network can be described as matrix multiplication or batch convolution on a minibatch of examples. We can write more specialized sub-routines that convolve each example with different kernels or multiply each row of a design matrix by a different set of columns of weights. Unfortunately, these more specialized subroutines are difficult to implement efficiently. CPU implementations will be slow due to the lack of cache coherence and GPU implementations will be slow due to the lack of coalesced
memory transactions and the need to serialize warps when members of a warp take different branches. In some cases, these issues can be mitigated by partitioning the examples into groups that all take the same branch, and processing these groups of examples simultaneously. This can be an acceptable strategy for minimizing the time required to process a fixed amount of examples in an offline setting. In a real-time setting where examples must be processed continuously, partitioning the workload can result in load-balancing issues. For example, if we assign one machine to process the first step in a cascade and another machine to process the last step in a cascade, then the first will tend to be overloaded and the last will tend to be underloaded. Similar issues arise if each machine is assigned to implement different nodes of a neural decision tree.
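The grouping strategy just described, partitioning a minibatch by branch so that each group can be processed with one batched operation, can be sketched as follows. The `router` and branch functions are hypothetical stand-ins for whatever dynamically structured model is in use.

```python
import numpy as np

def batched_dynamic_forward(x_batch, router, branches):
    """Mitigate the parallelism loss of dynamic structure by grouping
    examples that take the same branch and running each group as one
    batched operation.

    `router` assigns a branch index to every example; `branches` is a
    list of batch-capable functions."""
    branch_ids = router(x_batch)                 # one branch index per example
    out = np.empty(len(x_batch), dtype=float)
    for b, fn in enumerate(branches):
        idx = np.nonzero(branch_ids == b)[0]
        if idx.size:                             # one big vectorized op per group
            out[idx] = fn(x_batch[idx])
    return out
```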
12.1.6 Specialized Hardware Implementations of Deep Networks
Since the early days of neural networks research, hardware designers have worked on specialized hardware implementations that could speed up training and/or inference of neural network algorithms. See early and more recent reviews of specialized hardware for deep networks (Lindsey and Lindblad, 1994; Beiu et al., 2003; Misra and Saha, 2010).

Different forms of specialized hardware (Graf and Jackel, 1989; Mead and Ismail, 2012; Kim et al., 2009; Pham et al., 2012; Chen et al., 2014a,b) have been developed over the last decades, with ASICs (application-specific integrated circuits) that are either digital (based on binary representations of numbers), analog (Graf and Jackel, 1989; Mead and Ismail, 2012) (based on physical implementations of continuous values as voltages or currents), or hybrid implementations (combining digital and analog components). In recent years more flexible FPGA (field programmable gate array) implementations (where the particulars of the circuit can be written on the chip after it has been built) have been developed.

Though software implementations on general-purpose processing units (CPUs and GPUs) typically use 32 or 64 bits of precision to represent floating point numbers, it has long been known that it is possible to use less precision, at least at inference time (Holt and Baker, 1991; Holi and Hwang, 1993; Presley and Haggard, 1994; Simard and Graf, 1994; Wawrzynek et al., 1996; Savich et al., 2007). This has become a more pressing issue in recent years as deep learning has gained in popularity in industrial products, and as the great impact of faster hardware was demonstrated with GPUs. Another factor that motivates current research on specialized hardware for deep networks is that the rate of progress of a single CPU or GPU core has slowed down, and most recent improvements in computing speed have come from parallelization across cores (either in CPUs or
GPUs). This is very different from the situation of the 1990s (the previous neural network era), when the hardware implementations of neural networks (which might take two years from inception to availability of a chip) could not keep up with the rapid progress and low prices of general-purpose CPUs. Building specialized hardware is thus a way to push the envelope further, at a time when new hardware designs are being developed for low-power devices such as phones, aiming for general-public applications of deep learning (e.g., with speech, computer vision or natural language).

Recent work on low-precision implementations of backprop-based neural nets (Vanhoucke et al., 2011; Courbariaux et al., 2015; Gupta et al., 2015) suggests that between 8 and 16 bits of precision can suffice for using or training deep neural networks with back-propagation. What is clear is that more precision is required during training than at inference time, and that some forms of dynamic fixed point representation of numbers can be used to reduce how many bits are required per number. Traditional fixed point numbers are restricted to a fixed range (which corresponds to a given exponent in a floating point representation). Dynamic fixed point representations share that range among a set of numbers (such as all the weights in one layer). Using fixed point rather than floating point representations and using fewer bits per number reduces the hardware surface area, power requirements and computing time needed for performing multiplications, and multiplications are the most demanding of the operations needed to use or train a modern deep network with backprop.
12.2 Computer Vision
Computer vision has traditionally been one of the most active research areas for deep learning applications, because vision is a task that is effortless for humans and many animals but challenging for computers (Ballard et al., 1983). Many of the most popular standard benchmark tasks for deep learning algorithms are forms of object recognition or optical character recognition.

Computer vision is a very broad field encompassing a wide variety of ways of processing images, and an amazing diversity of applications. Applications of computer vision range from reproducing human visual abilities, such as recognizing faces, to creating entirely new categories of visual abilities. As an example of the latter category, one recent computer vision application is to recognize sound waves from the vibrations they induce in objects visible in a video (Davis et al., 2014). Most deep learning research on computer vision has not focused on such exotic applications that expand the realm of what is possible with imagery but
rather on a small core of AI goals aimed at replicating human abilities. Most deep learning for computer vision is used for object recognition or detection of some form, whether this means reporting which object is present in an image, annotating an image with bounding boxes around each object, transcribing a sequence of symbols from an image, or labeling each pixel in an image with the identity of the object it belongs to. Because generative modeling has been a guiding principle of deep learning research, there is also a large body of work on image synthesis using deep models. While image synthesis ex nihilo is usually not considered a computer vision endeavor, models capable of image synthesis are usually useful for image restoration, a computer vision task involving repairing defects in images or removing objects from images.
12.2.1 Preprocessing
Many application areas require sophisticated preprocessing because the original input comes in a form that is difficult for many deep learning architectures to represent. Computer vision usually requires relatively little of this kind of preprocessing. The images should be standardized so that their pixels all lie in the same, reasonable range, like [0,1] or [-1, 1]. Mixing images that lie in [0,1] with images that lie in [0, 255] will usually result in failure. Formatting images to have the same scale is the only kind of preprocessing that is strictly necessary. Many computer vision architectures require images of a standard size, so images must be cropped or scaled to fit that size. However, even this rescaling is not always strictly necessary. Some convolutional models accept variably-sized inputs and dynamically adjust the size of their pooling regions to keep the output size constant (Waibel et al., 1989).
Other convolutional models have variable-sized output that automatically scales in size with the input, such as models that denoise or label each pixel in an image (Hadsell et al., 2007).

Dataset augmentation may be seen as a way of preprocessing the training set only. Dataset augmentation is an excellent way to reduce the generalization error of most computer vision models. A related idea applicable at test time is to show the model many different versions of the same input (for example, the same image cropped at slightly different locations) and have the different instantiations of the model vote to determine the output. This latter idea can be interpreted as an ensemble approach, and helps to reduce generalization error.
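The test-time voting idea can be sketched as follows, with a hypothetical `model` callable that maps one crop to a vector of class probabilities; the averaging over crops is the ensemble step.

```python
import numpy as np

def multi_crop_predict(model, image, crop_size, n_classes, offsets):
    """Test-time ensemble over crops of one image: run the model on
    several shifted crops and average the predicted class probabilities.

    `offsets` is a list of (row, col) top-left corners for the crops."""
    probs = np.zeros(n_classes)
    for (r, c) in offsets:
        crop = image[r:r + crop_size, c:c + crop_size]
        probs += model(crop)              # accumulate each instantiation's vote
    probs /= len(offsets)                 # average of the per-crop predictions
    return probs.argmax(), probs
```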
Other kinds of preprocessing are applied to both the train and the test set with the goal of putting each example into a more canonical form in order to reduce the amount of variation that the model needs to account for. Reducing the amount of variation in the data can both reduce generalization error and reduce the size of
the model needed to fit the training set. Simpler tasks may be solved by smaller models, and simpler solutions are more likely to generalize well. Preprocessing of this kind is usually designed to remove some kind of variability in the input data that is easy for a human designer to describe and that the human designer is confident has no relevance to the task. When training with large datasets and large models, this kind of preprocessing is often unnecessary, and it is best to just let the model learn which kinds of variability it should become invariant to. For example, the AlexNet system for classifying ImageNet only has one preprocessing step: subtracting the mean across training examples of each pixel (Krizhevsky et al., 2012).

12.2.1.1 Contrast Normalization

One of the most obvious sources of variation that can be safely removed for many tasks is the amount of contrast in the image. Contrast simply refers to the magnitude of the difference between the bright and the dark pixels in an image. There are many ways of quantifying the contrast of an image. In the context of deep learning, contrast usually refers to the standard deviation of the pixels in an image or region of an image. Suppose we have an image represented by a tensor X \in \mathbb{R}^{r \times c \times 3}, with X_{i,j,1} being the red intensity at row i and column j, X_{i,j,2} giving the green intensity and X_{i,j,3} giving the blue intensity. Then the contrast of the entire image is given by

    \sqrt{ \frac{1}{3rc} \sum_{i=1}^{r} \sum_{j=1}^{c} \sum_{k=1}^{3} \left( X_{i,j,k} - \bar{X} \right)^2 }    (12.1)

where \bar{X} is the mean intensity of the entire image:

    \bar{X} = \frac{1}{3rc} \sum_{i=1}^{r} \sum_{j=1}^{c} \sum_{k=1}^{3} X_{i,j,k}.    (12.2)

Global contrast normalization (GCN) aims to prevent images from having varying amounts of contrast by subtracting the mean from each image, then rescaling it so that the standard deviation across its pixels is equal to some constant s. This approach is complicated by the fact that no scaling factor can change the contrast of a zero-contrast image (one whose pixels all have equal intensity). Images with very low but non-zero contrast often have little information content. Dividing by the true standard deviation usually accomplishes nothing more than amplifying sensor noise or compression artifacts in such cases. This
motivates introducing a small, positive regularization parameter \lambda to bias the estimate of the standard deviation. Alternately, one can constrain the denominator to be at least \epsilon. Given an input image X, GCN produces an output image X', defined such that

    X'_{i,j,k} = s \, \frac{ X_{i,j,k} - \bar{X} }{ \max\left\{ \epsilon, \; \sqrt{ \lambda + \frac{1}{3rc} \sum_{i=1}^{r} \sum_{j=1}^{c} \sum_{k=1}^{3} \left( X_{i,j,k} - \bar{X} \right)^2 } \right\} }.    (12.3)

Datasets consisting of large images cropped to interesting objects are unlikely to contain any images with nearly constant intensity. In these cases, it is safe to practically ignore the small denominator problem by setting \lambda = 0 and avoid division by 0 in extremely rare cases by setting \epsilon to an extremely low value like 10^{-8}. This is the approach used by Goodfellow et al. (2013a) on the CIFAR-10 dataset. Small images cropped randomly are more likely to have nearly constant intensity, making aggressive regularization more useful.
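Eq. 12.3 translates directly into code. This sketch assumes the image is stored as a floating-point NumPy array; the function name and default parameter values are illustrative.

```python
import numpy as np

def global_contrast_normalize(X, s=1.0, lam=0.0, eps=1e-8):
    """Global contrast normalization of one image, following Eq. 12.3:
    subtract the image's mean intensity, then rescale so the standard
    deviation across all pixels (and channels) equals s.

    lam regularizes the standard deviation estimate and eps bounds the
    denominator away from zero."""
    X = X - X.mean()                              # remove mean intensity
    contrast = np.sqrt(lam + (X ** 2).mean())     # sqrt(lam + (1/3rc) * sum of squares)
    return s * X / max(contrast, eps)
```

With lam = 0 the output has zero mean and standard deviation exactly s (whenever the input is not constant).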
The scale parameter s can usually be set to 1, as done by Coates et al. (2011), or chosen to make each individual pixel have standard deviation across examples close to 1, as done by Goodfellow et al. (2013a).

The standard deviation in Eq. 12.3 is just a rescaling of the L^2 norm of the image (assuming the mean of the image has already been removed). It is preferable to define GCN in terms of standard deviation rather than L^2 norm because the standard deviation includes division by the number of pixels, so GCN based on standard deviation allows the same s to be used regardless of image size. However, the observation that the L^2 norm is proportional to the standard deviation can help build a useful intuition. One can understand GCN as mapping examples to a spherical shell. See Fig. 12.1 for an illustration. This can be a useful property because neural networks are often better at responding to directions in space rather than exact locations.
Responding to multiple distances in the same direction requires hidden units with collinear weight vectors but different biases. Such coordination can be difficult for the learning algorithm to discover. Additionally, many shallow graphical models have problems with representing multiple separated modes along the same line. GCN avoids these problems by reducing each example to a direction rather than a direction and a distance.

Counterintuitively, there is a preprocessing operation known as sphering and it is not the same operation as GCN. Sphering does not refer to making the data lie on a spherical shell, but rather to rescaling the principal components to have equal variance, so that the multivariate normal distribution used by PCA has spherical contours. Sphering is more commonly known as whitening.
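To make the contrast with GCN concrete, here is a minimal sketch of sphering (PCA whitening) in NumPy. The function name and the small eigenvalue regularizer are illustrative assumptions:

```python
import numpy as np

def sphere(X, epsilon=1e-8):
    """Whiten a design matrix X (examples in rows): rescale the principal
    components to unit variance so the covariance becomes the identity."""
    X = X - X.mean(axis=0)                    # center each feature
    cov = X.T @ X / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)    # principal directions and variances
    return (X @ eigvecs) / np.sqrt(eigvals + epsilon)
```

Unlike GCN, which rescales each example by its own contrast, sphering applies one dataset-wide linear transform estimated from all examples.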
[Figure 12.1: three scatter plots over axes x0 (−1.5 to 1.5) and x1 (−1.5 to 1.5), with panels titled "Raw input", "GCN, λ = 0", and "GCN, λ = 10^{-2}".]
Figure 12.1: GCN maps examples onto a sphere. (Left) Raw input data may have any norm. (Center) GCN with λ = 0 maps all non-zero examples perfectly onto a sphere. Here we use s = 1 and ε = 10^{-8}. Because we use GCN based on normalizing the standard deviation rather than the L^2 norm, the resulting sphere is not the unit sphere. (Right) Regularized GCN, with λ > 0, draws examples toward the sphere but does not completely discard the variation in their norm. We leave s and ε the same as before.
Global contrast normalization will often fail to highlight image features we would like to stand out, such as edges and corners. If we have a scene with a large dark area and a large bright area (such as a city square with half the image in the shadow of a building) then global contrast normalization will ensure there is a large difference between the brightness of the dark area and the brightness of the light area. It will not, however, ensure that edges within the dark region stand out.

This motivates local contrast normalization. Local contrast normalization ensures that the contrast is normalized across each small window, rather than over the image as a whole. See Fig. 12.2 for a comparison of global and local contrast normalization.

Various definitions of local contrast normalization are possible. In all cases, one modifies each pixel by subtracting a mean of nearby pixels and dividing by a standard deviation of nearby pixels. In some cases, this is literally the mean and standard deviation of all pixels in a rectangular window centered on the pixel to be modified (Pinto et al., 2008). In other cases, this is a weighted mean and weighted standard deviation using Gaussian weights centered on the pixel to be modified. In the case of color images, some strategies process different color channels separately while others combine information from different channels to normalize each pixel (Sermanet et al., 2012).
[Figure 12.2: rows of example images shown in three columns titled "Input image", "GCN", and "LCN".]
Figure 12.2: A comparison of global and local contrast normalization. Visually, the effects of global contrast normalization are subtle. It places all images on roughly the same scale, which reduces the burden on the learning algorithm to handle multiple scales. Local contrast normalization modifies the image much more, discarding all regions of constant intensity. This allows the model to focus on just the edges. Regions of fine texture, such as the houses in the second row, may lose some detail due to the bandwidth of the normalization kernel being too high.
Local contrast normalization can usually be implemented efficiently by using separable convolution (see Sec. 9.8) to compute feature maps of local means and local standard deviations, then using element-wise subtraction and element-wise division on different feature maps.

Local contrast normalization is a differentiable operation and can also be used as a nonlinearity applied to the hidden layers of a network, as well as a preprocessing operation applied to the input.

As with global contrast normalization, we typically need to regularize local contrast normalization to avoid division by zero. In fact, because local contrast normalization typically acts on smaller windows, it is even more important to regularize. Smaller windows are more likely to contain values that are all nearly the same as each other, and thus more likely to have zero standard deviation.
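A sketch of this recipe in NumPy: the hypothetical implementation below uses a separable box filter (two 1-D convolutions) for the local means and standard deviations, rather than the Gaussian weights some definitions use, and the window size and ε are illustrative choices:

```python
import numpy as np

def lcn(image, size=9, epsilon=1e-4):
    """Local contrast normalization of a 2-D grayscale image."""
    image = np.asarray(image, dtype=np.float64)
    kernel = np.ones(size) / size

    def box_blur(x):
        # separable convolution: filter along rows, then along columns
        x = np.apply_along_axis(lambda v: np.convolve(v, kernel, mode="same"), 1, x)
        return np.apply_along_axis(lambda v: np.convolve(v, kernel, mode="same"), 0, x)

    weight = box_blur(np.ones_like(image))         # corrects for border effects
    centered = image - box_blur(image) / weight    # subtract local mean
    local_std = np.sqrt(box_blur(centered ** 2) / weight)
    return centered / np.maximum(local_std, epsilon)
```

Note the `np.maximum(local_std, epsilon)` in the denominator: this is the regularization discussed above, which matters most when a window is nearly constant.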
12.2.1.2 Dataset Augmentation

As described in Sec. 7.4, it is easy to improve the generalization of a classifier by increasing the size of the training set by adding extra copies of the training examples that have been modified with transformations that do not change the
class. Object recognition is a classification task that is especially amenable to this form of dataset augmentation because the class is invariant to so many transformations and the input can be easily transformed with many geometric operations. As described before, classifiers can benefit from random translations, rotations, and in some cases, flips of the input to augment the dataset. In specialized computer vision applications, more advanced transformations are commonly used for dataset augmentation. These schemes include random perturbation of the colors in an image (Krizhevsky et al., 2012) and nonlinear geometric distortions of the input (LeCun et al., 1998b).
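A label-preserving augmentation step for images might look like the following sketch. The flip probability and shift range are arbitrary illustrative choices, and np.roll's wrap-around is used only for brevity (a real implementation would pad rather than wrap):

```python
import numpy as np

def augment(image, rng):
    """Return a randomly flipped and translated copy of a 2-D image array."""
    if rng.rand() < 0.5:
        image = image[:, ::-1]           # random horizontal flip
    dy, dx = rng.randint(-2, 3, size=2)  # translate by up to 2 pixels each way
    return np.roll(image, (dy, dx), axis=(0, 1))
```

Applying a fresh random transform each time an example is visited effectively multiplies the size of the training set without storing extra copies.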
12.3
Speech Recognition
The task of speech recognition is to map an acoustic signal containing a spoken natural language utterance into the corresponding sequence of words intended by the speaker. Let X = (x^{(1)}, x^{(2)}, ..., x^{(T)}) denote the sequence of acoustic input vectors (traditionally produced by splitting the audio into 20ms frames). Most speech recognition systems preprocess the input using specialized hand-designed features, but some (Jaitly and Hinton, 2011) deep learning systems learn features from raw input. Let y = (y_1, y_2, ..., y_N) denote the target output sequence (usually a sequence of words or characters). The automatic speech recognition (ASR) task consists of creating a function f*_{ASR} that computes the most probable linguistic sequence y given the acoustic sequence X:

\[
f^{*}_{\mathrm{ASR}}(X) = \arg\max_{y} P^{*}(y \mid X = X) \tag{12.4}
\]
where P* is the true conditional distribution relating the inputs X to the targets y.

Since the 1980s and until about 2009–2012, state-of-the-art speech recognition systems primarily combined hidden Markov models (HMMs) and Gaussian mixture models (GMMs). GMMs modeled the association between acoustic features and phonemes (Bahl et al., 1987), while HMMs modeled the sequence of phonemes. The GMM-HMM model family treats acoustic waveforms as being generated by the following process: first an HMM generates a sequence of phonemes and discrete sub-phonemic states (such as the beginning, middle, and end of each phoneme), then a GMM transforms each discrete symbol into a brief segment of audio waveform. Although GMM-HMM systems dominated ASR until recently, speech recognition was actually one of the first areas where neural networks were applied, and numerous ASR systems from the late 1980s and early 1990s used
neural nets (Bourlard and Wellekens, 1989; Waibel et al., 1989; Robinson and Fallside, 1991; Bengio et al., 1991, 1992; Konig et al., 1996). At the time, the performance of ASR based on neural nets approximately matched the performance of GMM-HMM systems. For example, Robinson and Fallside (1991) achieved 26% phoneme error rate on the TIMIT (Garofolo et al., 1993) corpus (with 39 phonemes to discriminate between), which was better than or comparable to HMM-based systems. Since then, TIMIT has been a benchmark for phoneme recognition, playing a role similar to the role MNIST plays for object recognition. However, because of the complex engineering involved in software systems for speech recognition and the effort that had been invested in building these systems on the basis of GMM-HMMs, the industry did not see a compelling argument
for switching to neural networks. As a consequence, until the late 2000s, both academic and industrial research in using neural nets for speech recognition mostly focused on using neural nets to learn extra features for GMM-HMM systems.

Later, with much larger datasets and much larger and deeper models, recognition accuracy was dramatically improved by using neural networks to replace GMMs for the task of associating acoustic features to phonemes (or sub-phonemic states). Starting in 2009, speech researchers applied a form of deep learning based on unsupervised learning to speech recognition. This approach to deep learning was based on training undirected probabilistic models called restricted Boltzmann machines (RBMs) to model the input data. RBMs will be described in Part III. To solve speech recognition tasks, unsupervised pretraining
was used to build deep feedforward networks whose layers were each initialized by training an RBM. These networks take spectral acoustic representations in a fixed-size input window (around a center frame) and predict the conditional probabilities of HMM states for that center frame. Training such deep networks helped to significantly improve the recognition rate on TIMIT (Mohamed et al., 2009, 2012a), bringing down the phoneme error rate from about 26% to 20.7%. See Mohamed et al. (2012b) for an analysis of reasons for the success of these models. Extensions to the basic phone recognition pipeline included the addition of speaker-adaptive features (Mohamed et al., 2011) that further reduced the error rate. This was quickly followed up by work to expand the architecture from phoneme recognition (which is what TIMIT is focused on) to large-vocabulary
speech recognition (Dahl et al., 2012), which involves not just recognizing phonemes but also recognizing sequences of words from a large vocabulary. Deep networks for speech recognition eventually shifted from being based on pretraining and Boltzmann machines to being based on techniques such as rectified linear units and dropout (Zeiler et al., 2013; Dahl et al., 2013). By that time, several of the major speech groups in industry had started exploring deep learning in collaboration with
academic researchers. Hinton et al. (2012a) describe the breakthroughs achieved by these collaborators, which are now deployed in products such as mobile phones.

Later, as these groups explored larger and larger labeled datasets and incorporated some of the methods for initializing, training, and setting up the architecture of deep nets, they realized that the unsupervised pretraining phase was either unnecessary or did not bring any significant improvement.

These breakthroughs in recognition performance for word error rate in speech recognition were unprecedented (around 30% improvement) and followed a long period of about ten years during which error rates did not improve much with the traditional GMM-HMM technology, in spite of the continuously growing size of training sets (see Fig. 2.4 of Deng and Yu (2014)). This created a rapid shift in the speech recognition community towards deep learning. In a matter of roughly
two years, most of the industrial products for speech recognition incorporated deep neural networks and this success spurred a new wave of research into deep learning algorithms and architectures for ASR, which is still ongoing today.

One of these innovations was the use of convolutional networks (Sainath et al., 2013) that replicate weights across time and frequency, improving over the earlier time-delay neural networks that replicated weights only across time. The new two-dimensional convolutional models regard the input spectrogram not as one long vector but as an image, with one axis corresponding to time and the other to frequency of spectral components.

Another important push, still ongoing, has been towards end-to-end deep learning speech recognition systems that completely remove the HMM.
The first major breakthrough in this direction came from Graves et al. (2013) who trained a deep LSTM RNN (see Sec. 10.10), using MAP inference over the frame-to-phoneme alignment, as in LeCun et al. (1998b) and in the CTC framework (Graves et al., 2006; Graves, 2012). A deep RNN (Graves et al., 2013) has state variables from several layers at each time step, giving the unfolded graph two kinds of depth: ordinary depth due to a stack of layers, and depth due to time unfolding. This work brought the phoneme error rate on TIMIT to a record low of 17.7%. See Pascanu et al. (2014a) and Chung et al. (2014) for other variants of deep RNNs, applied in other settings.

Another contemporary step toward end-to-end deep learning ASR is to let the system learn how to "align" the acoustic-level information with the phonetic-level information (Chorowski et al., 2014; Lu et al., 2015).
12.4
Natural Language Processing
Natural language processing (NLP) is the use of human languages, such as English or French, by a computer. Computer programs typically read and emit specialized languages designed to allow efficient and unambiguous parsing by simple programs. More naturally occurring languages are often ambiguous and defy formal description. Natural language processing includes applications such as machine translation, in which the learner must read a sentence in one human language and emit an equivalent sentence in another human language. Many NLP applications are based on language models that define a probability distribution over sequences of words, characters or bytes in a natural language.

As with the other applications discussed in this chapter, very generic neural network techniques can be successfully applied to natural language processing.
some domain-specific strategies become important. To build an efficient model of natural language, we must usually use techniques that are specialized for processing sequential data. In many cases, we choose to regard natural language as a sequence of words, rather than a sequence of individual characters or bytes. Because the total number of possible words is so large, word-based language models must operate on an extremely high-dimensional and sparse discrete space. Several strategies have been developed to make models of such a space efficient, both in a computational and in a statistical sense.
12.4.1 n-grams

A language model defines a probability distribution over sequences of tokens in
a natural language. Depending on how the model is designed, a token may be a word, a character, or even a byte. Tokens are always discrete entities. The earliest successful language models were based on models of fixed-length sequences of tokens called n-grams. An n-gram is a sequence of n tokens.

Models based on n-grams define the conditional probability of the n-th token given the preceding n − 1 tokens. The model uses products of these conditional distributions to define the probability distribution over longer sequences:

    P(x_1, ..., x_τ) = P(x_1, ..., x_{n−1}) ∏_{t=n}^{τ} P(x_t | x_{t−n+1}, ..., x_{t−1}).    (12.5)
This decomposition is justified by the chain rule of probability. The probability distribution over the initial sequence P(x_1, ..., x_{n−1}) may be modeled by a different model with a smaller value of n.
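The decomposition in Eq. 12.5 can be made concrete with a small sketch. The bigram (n = 2) probability tables below are invented purely for illustration; they are not drawn from any real corpus.

```python
# Sketch of Eq. 12.5 for n = 2 (a bigram model). The toy probability
# tables below are invented for illustration, not estimated from data.

# P(x_1): marginal distribution over the first token.
p_initial = {"THE": 0.6, "A": 0.4}

# P(x_t | x_{t-1}): conditional distribution over the next token.
p_cond = {
    "THE": {"DOG": 0.5, "CAT": 0.5},
    "A":   {"DOG": 0.7, "CAT": 0.3},
    "DOG": {"RAN": 1.0},
    "CAT": {"RAN": 1.0},
}

def sequence_probability(tokens):
    """P(x_1, ..., x_tau) = P(x_1) * prod_t P(x_t | x_{t-1})."""
    prob = p_initial.get(tokens[0], 0.0)
    for prev, cur in zip(tokens, tokens[1:]):
        prob *= p_cond.get(prev, {}).get(cur, 0.0)
    return prob

print(sequence_probability(["THE", "DOG", "RAN"]))  # 0.6 * 0.5 * 1.0 = 0.3
```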
CHAPTER 12. APPLICATIONS
Training n-gram models is straightforward because the maximum likelihood estimate can be computed simply by counting how many times each possible n-gram occurs in the training set. Models based on n-grams have been the core building block of statistical language modeling for many decades (Jelinek and Mercer, 1980; Katz, 1987; Chen and Goodman, 1999).

For small values of n, models have particular names: unigram for n=1, bigram for n=2, and trigram for n=3. These names derive from the Latin prefixes for the corresponding numbers and the Greek suffix "-gram" denoting something that is written.

Usually we train both an n-gram model and an n−1 gram model simultaneously. This makes it easy to compute

    P(x_t | x_{t−n+1}, ..., x_{t−1}) = P_n(x_{t−n+1}, ..., x_t) / P_{n−1}(x_{t−n+1}, ..., x_{t−1})    (12.6)

simply by looking up two stored probabilities.
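A minimal sketch of this counting procedure and of the lookup in Eq. 12.6, on an invented toy corpus. As the text notes, the (n−1)-gram model is trained with the final token of each sequence omitted, so both models share the same normalizer and it cancels in the ratio.

```python
from collections import Counter

# Maximum likelihood trigram estimation by counting, on a tiny invented
# corpus, followed by the lookup of Eq. 12.6. P_2 is trained with the
# final token omitted so the normalizers of P_3 and P_2 cancel.
corpus = "THE DOG RAN AWAY . THE DOG RAN HOME . THE CAT RAN AWAY .".split()

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tri = ngram_counts(corpus, 3)       # counts for P_3
bi = ngram_counts(corpus[:-1], 2)   # counts for P_2, final token omitted
norm = len(corpus) - 2              # same number of trigrams and bigrams

def p3(*w):  # stored trigram probability
    return tri[w] / norm

def p2(*w):  # stored bigram probability
    return bi[w] / norm

# Eq. 12.6: P(AWAY | DOG RAN) = P_3(DOG RAN AWAY) / P_2(DOG RAN)
print(p3("DOG", "RAN", "AWAY") / p2("DOG", "RAN"))  # 0.5
```

"DOG RAN" occurs twice in the toy corpus and is followed by AWAY once, so the ratio of stored probabilities recovers the conditional probability 0.5.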
For this to exactly reproduce inference in P_n, we must omit the final character from each sequence when we train P_{n−1}.

As an example, we demonstrate how a trigram model computes the probability of the sentence "THE DOG RAN AWAY." The first words of the sentence cannot be handled by the default formula based on conditional probability because there is no context at the beginning of the sentence. Instead, we must use the marginal probability over words at the start of the sentence. We thus evaluate P_3(THE DOG RAN). Finally, the last word may be predicted using the typical case of using the conditional distribution P(AWAY | DOG RAN). Putting this together with Eq. 12.6, we obtain:

    P(THE DOG RAN AWAY) = P_3(THE DOG RAN) P_3(DOG RAN AWAY) / P_2(DOG RAN).    (12.7)

A fundamental limitation of maximum likelihood for n-gram models is that P_n as estimated from training set
counts is very likely to be zero in many cases, even though the tuple (x_{t−n+1}, ..., x_t) may appear in the test set. This can cause two different kinds of catastrophic outcomes. When P_{n−1} is zero, the ratio is undefined, so the model does not even produce a sensible output. When P_{n−1} is non-zero but P_n is zero, the test log-likelihood is −∞. To avoid such catastrophic outcomes, most n-gram models employ some form of smoothing. Smoothing techniques shift probability mass from the observed tuples to unobserved ones that are similar. See Chen and Goodman (1999) for a review and empirical comparisons. One basic technique consists of adding non-zero probability mass to all of the possible next symbol values. This method can be justified as Bayesian inference with a uniform
or Dirichlet prior over the count parameters. Another very popular idea is to form a mixture model containing higher-order and lower-order n-gram models, with the higher-order models providing more capacity and the lower-order models being more likely to avoid counts of zero. Back-off methods look up the lower-order n-grams if the frequency of the context x_{t−1}, ..., x_{t−n+1} is too small to use the higher-order model. More formally, they estimate the distribution over x_t by using contexts x_{t−n+k}, ..., x_{t−1}, for increasing k, until a sufficiently reliable estimate is found.

Classical n-gram models are particularly vulnerable to the curse of dimensionality. There are |V|^n possible n-grams and |V| is often very large. Even with a massive training set and modest n, most n-grams will not occur in the training set. One way to view a classical n-gram model is that it is performing nearest-neighbor
lookup. In other words, it can be viewed as a local non-parametric predictor, similar to k-nearest neighbors. The statistical problems facing these extremely local predictors are described in Sec. 5.11.2. The problem for a language model is even more severe than usual, because any two different words have the same distance from each other in one-hot vector space. It is thus difficult to leverage much information from any "neighbors": only training examples that literally repeat the same context are useful for local generalization. To overcome these problems, a language model must be able to share knowledge between one word and other semantically similar words.

To improve the statistical efficiency of n-gram models, class-based language models (Brown et al., 1992; Ney and Kneser, 1993; Niesler et al., 1998) introduce
the notion of word categories and then share statistical strength between words that are in the same category. The idea is to use a clustering algorithm to partition the set of words into clusters or classes, based on their co-occurrence frequencies with other words. The model can then use word class IDs rather than individual word IDs to represent the context on the right side of the conditioning bar. Composite models combining word-based and class-based models via mixing or back-off are also possible. Although word classes provide a way to generalize between sequences in which some word is replaced by another of the same class, much information is lost in this representation.
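The simplest smoothing technique mentioned above, adding probability mass to every possible next-symbol value (add-one, or Laplace, smoothing), can be sketched in a few lines. The corpus and function name here are hypothetical toys, not from any real system.

```python
from collections import Counter

# Add-one (Laplace) smoothing for a bigram model: a minimal sketch of
# the "uniform prior" technique from the text, on an invented toy corpus.
corpus = "THE DOG RAN AWAY . THE CAT SAT .".split()
vocab = sorted(set(corpus))
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p_smoothed(nxt, prev, alpha=1.0):
    """P(nxt | prev) with pseudo-count alpha added to every next-symbol value."""
    return (bigrams[(prev, nxt)] + alpha) / (unigrams[prev] + alpha * len(vocab))

# An unseen pair still gets non-zero mass, avoiding undefined ratios and
# a test log-likelihood of negative infinity.
print(p_smoothed("SAT", "DOG") > 0)                                 # True
print(abs(sum(p_smoothed(w, "THE") for w in vocab) - 1.0) < 1e-9)   # True: normalized
```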
12.4.2 Neural Language Models

Neural language models or NLMs are a class of language model designed to overcome the curse of dimensionality problem for modeling natural language sequences by using a distributed representation of words (Bengio et al., 2001). Unlike class-based n-gram models, neural language models are able to recognize that two words
are similar without losing the ability to encode each word as distinct from the other. Neural language models share statistical strength between one word (and its context) and other similar words and contexts. The distributed representation the model learns for each word enables this sharing by allowing the model to treat words that have features in common similarly. For example, if the word dog and the word cat map to representations that share many attributes, then sentences that contain the word cat can inform the predictions that will be made by the model for sentences that contain the word dog, and vice-versa. Because there are many such attributes, there are many ways in which generalization can happen,
transferring information from each training sentence to an exponentially large number of semantically related sentences. The curse of dimensionality requires the model to generalize to a number of sentences that is exponential in the sentence length. The model counters this curse by relating each training sentence to an exponential number of similar sentences.

We sometimes call these word representations word embeddings. In this interpretation, we view the raw symbols as points in a space of dimension equal to the vocabulary size. The word representations embed those points in a feature space of lower dimension. In the original space, every word is represented by a one-hot vector, so every pair of words is at Euclidean distance √2 from each other. In the embedding space, words that frequently appear in similar contexts (or any pair of words sharing some "features" learned by the model) are close to each other. This often results in words with similar meanings being neighbors. Fig. 12.3 zooms
Fig.(or 12.3 zo zooms oms of on words sharing some learned by the space model) close each tically other. in sp specific ecific areas of a “features” learned word em emb bedding to are show ho how wtoseman semantically This often results withtations similarthat meanings being ors. Fig. 12.3 zooms similar words mapintowords represen representations are close to neigh eac each h bother. in on specific areas of a learned word embedding space to show how semantically Neural net netw works in other domains also define embeddings. For example, a similar words map to representations that are close to each other. hidden lay layer er of a con conv volutional netw network ork provides an “image embedding.” Usually Neural net w orks in other domains also define For example, NLP practitioners are muc much h more in interested terested in thisembeddings. idea of em emb beddings becausea hidden lay er of a do con network provides an “image embedding.” natural language does esvolutional not originally lie in a real-v real-valued alued vector space. TheUsually hidden NLP practitioners are muc h more in terested in this idea of em b eddings b ecause la lay yer has pro provided vided a more qualitativ qualitatively ely dramatic change in the wa way y the data is natural language does not originally lie in a real-valued vector space. The hidden represen represented. ted. layer has provided a more qualitatively dramatic change in the way the data is The basic idea of using distributed represen representations tations to impro improv ve mo models dels for represented. natural language pro processing cessing is not restricted to neural net netw works. It may also be The basic idea of using distributed represen tations to improv models used with graphical mo models dels that hav havee distributed represen representations tations ine the formfor of natural processing is not to neural networks. It may also be m ultiplelanguage latent variables (Mnih andrestricted Hin Hinton ton, 2007 ). 
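The contrast between one-hot vectors and learned embeddings can be seen numerically. The embedding coordinates below are made up for illustration; a trained model would learn such clustering from data rather than have it assigned by hand.

```python
import numpy as np

# One-hot vectors place every pair of distinct words at Euclidean
# distance sqrt(2), so they carry no similarity information; learned
# embeddings can place related words near each other.
vocab = ["cat", "dog", "france", "england"]
one_hot = np.eye(len(vocab))

d = np.linalg.norm(one_hot[0] - one_hot[1])
print(d)  # sqrt(2), identical for every pair of distinct words

# A made-up 2-D embedding where animals cluster together and countries do too.
emb = np.array([[0.9, 0.1],   # cat
                [0.8, 0.2],   # dog
                [0.1, 0.9],   # france
                [0.2, 0.8]])  # england

def euclid(i, j):
    return np.linalg.norm(emb[i] - emb[j])

print(euclid(0, 1) < euclid(0, 2))  # True: cat is nearer dog than france
```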
Figure 12.3: Two-dimensional visualizations of word embeddings obtained from a neural machine translation model (Bahdanau et al., 2015), zooming in on specific areas where semantically related words have embedding vectors that are close to each other. Countries appear on the left and numbers on the right. Keep in mind that these embeddings are 2-D for the purpose of visualization. In real applications, embeddings typically have higher dimensionality and can simultaneously capture many kinds of similarity between words.
12.4.3 High-Dimensional Outputs

In many natural language applications, we often want our models to produce words (rather than characters) as the fundamental unit of the output. For large vocabularies, it can be very computationally expensive to represent an output distribution over the choice of a word, because the vocabulary size is large. In many applications, V contains hundreds of thousands of words. The naive approach to representing such a distribution is to apply an affine transformation from a hidden representation to the output space, then apply the softmax function. Suppose we have a vocabulary V with size |V|. The weight matrix describing the linear component of this affine transformation is very large, because its output dimension is |V|. This imposes a high memory cost to represent the matrix, and a high computational cost to multiply by it.
Because the softmax is normalized across all |V| outputs, it is necessary to perform the full matrix multiplication at training time as well as test time: we cannot calculate only the dot product with the weight vector for the correct output. The high computational costs of the output layer thus arise both at training time (to compute the likelihood and its gradient) and at test time (to compute probabilities for all or selected words). For specialized loss functions, the gradient can be computed efficiently (Vincent et al., 2015), but the standard cross-entropy loss applied to a traditional softmax output layer poses
many difficulties. Suppose that h is the top hidden layer used to predict the output probabilities ŷ. If we parametrize the transformation from h to ŷ with learned weights W and learned biases b, then the affine-softmax output layer performs the following computations:

    a_i = b_i + Σ_j W_{ij} h_j,    ∀i ∈ {1, ..., |V|},    (12.8)

    ŷ_i = e^{a_i} / Σ_{i'=1}^{|V|} e^{a_{i'}}.    (12.9)

If h contains n_h elements then the above operation is O(|V| n_h). With n_h in the thousands and |V| in the hundreds of thousands, this operation dominates the computation of most neural language models.
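Eqs. 12.8 and 12.9 can be written out directly. The sizes below are toy values standing in for realistic ones; with |V| in the hundreds of thousands, the |V| × n_h matrix W would dominate memory and compute.

```python
import numpy as np

# The affine-softmax output layer of Eqs. 12.8-12.9, with invented toy sizes.
rng = np.random.default_rng(0)
n_h, vocab_size = 4, 10

h = rng.standard_normal(n_h)                 # top hidden layer
W = rng.standard_normal((vocab_size, n_h))   # |V| x n_h weight matrix
b = rng.standard_normal(vocab_size)          # |V| biases

a = b + W @ h                        # Eq. 12.8: O(|V| * n_h) multiply-adds
a -= a.max()                         # standard stabilization before exponentiation
y_hat = np.exp(a) / np.exp(a).sum()  # Eq. 12.9: softmax over all |V| outputs

print(abs(y_hat.sum() - 1.0) < 1e-9)  # True: a valid distribution over the vocabulary
```

Note that computing any single ŷ_i still requires the full matrix product, because the softmax denominator sums over every output.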
12.4.3.1 Use of a Short List

The first neural language models (Bengio et al., 2001, 2003) dealt with the high cost of using a softmax over a large number of output words by limiting the vocabulary size to 10,000 or 20,000 words. Schwenk and Gauvain (2002) and Schwenk (2007) built upon this approach by splitting the vocabulary V into a shortlist L of most frequent words (handled by the neural net) and a tail T = V \ L of more rare words (handled by an n-gram model). To be able to combine the two predictions, the neural net also has to predict the probability that a word appearing after context C belongs to the tail list. This may be achieved by adding an extra sigmoid output unit to provide an estimate of P(i ∈ T | C). The extra output can then be used to
achieve an estimate of the probability distribution over all words in V as follows:

    P(y = i | C) = 1_{i∈L} P(y = i | C, i ∈ L)(1 − P(i ∈ T | C))    (12.10)
                 + 1_{i∈T} P(y = i | C, i ∈ T) P(i ∈ T | C),    (12.11)

where P(y = i | C, i ∈ L) is provided by the neural language model and P(y = i | C, i ∈ T) is provided by the n-gram model. With slight modification, this approach can also work using an extra output value in the neural language model's softmax layer, rather than a separate sigmoid unit.

An obvious disadvantage of the short list approach is that the potential generalization advantage of the neural language models is limited to the most frequent words, where, arguably, it is the least useful. This disadvantage has stimulated the exploration of alternative methods to deal with high-dimensional outputs, described below.
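A sketch of how Eqs. 12.10 and 12.11 combine the two predictors. All probability values below are invented placeholders for the outputs of a trained neural net and n-gram model; only the combination rule is the point.

```python
# Combining a neural-net distribution over a shortlist L with an n-gram
# distribution over the tail T, per Eqs. 12.10-12.11. Numbers are invented.
shortlist = {"the": 0.5, "dog": 0.3, "ran": 0.2}   # P(y=i | C, i in L), neural net
tail = {"aardvark": 0.6, "zyzzyva": 0.4}           # P(y=i | C, i in T), n-gram model
p_tail = 0.1                                       # sigmoid estimate of P(i in T | C)

def p_word(word):
    if word in shortlist:
        return shortlist[word] * (1.0 - p_tail)    # Eq. 12.10
    return tail.get(word, 0.0) * p_tail            # Eq. 12.11

total = sum(p_word(w) for w in list(shortlist) + list(tail))
print(abs(total - 1.0) < 1e-9)  # True: the combined model is normalized
```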
12.4.3.2 Hierarchical Softmax

A classical approach (Goodman, 2001) to reducing the computational burden of high-dimensional output layers over large vocabulary sets V is to decompose probabilities hierarchically. Instead of necessitating a number of computations proportional to |V| (and also proportional to the number of hidden units, n_h), the |V| factor can be reduced to as low as log |V|. Bengio (2002) and Morin and Bengio (2005) introduced this factorized approach to the context of neural language models.

One can think of this hierarchy as building categories of words, then categories of categories of words, then categories of categories of categories of words, etc. These nested categories form a tree, with words at the leaves. In a balanced tree, the tree has depth O(log |V|). The probability of choosing a word is given by the product of the probabilities of choosing the branch leading to that word at every
node on a path from the root of the tree to the leaf containing the word. Fig. 12.4 illustrates a simple example. Mnih and Hinton (2009) also describe how to use multiple paths to identify a single word in order to better model words that have multiple meanings. Computing the probability of a word then involves summation over all of the paths that lead to that word.

To predict the conditional probabilities required at each node of the tree, we typically use a logistic regression model at each node of the tree, and provide the same context C as input to all of these models. Because the correct output is encoded in the training set, we can use supervised learning to train the logistic regression models.
This is typically done using a standard cross-entropy loss, corresponding to maximizing the log-likelihood of the correct sequence of decisions.

Because the output log-likelihood can be computed efficiently (as low as log |V| rather than |V|), its gradients may also be computed efficiently. This includes not only the gradient with respect to the output parameters but also the gradients with respect to the hidden layer activations.

It is possible but usually not practical to optimize the tree structure to minimize the expected number of computations. Tools from information theory specify how to choose the optimal binary code given the relative frequencies of the words. To do so, we could structure the tree so that the number of bits associated with a word is approximately equal to the logarithm of the frequency of that word.
However, in practice, the computational savings are typically not worth the effort because the computation of the output probabilities is only one part of the total computation in the neural language model. For example, suppose there are l fully connected hidden layers of width n_h. Let n_b be the weighted average of the number of bits
[Figure 12.4 appears here: a binary tree over eight words w0, ..., w7, with internal nodes labeled by bit-prefixes (0), (1), (0,0), ..., and leaves labeled (0,0,0) through (1,1,1).]

Figure 12.4: Illustration of a simple hierarchy of word categories, with 8 words w0, ..., w7 organized into a three-level hierarchy. The leaves of the tree represent actual specific words. Internal nodes represent groups of words. Any node can be indexed by the sequence of binary decisions (0=left, 1=right) to reach the node from the root. Super-class (0) contains the classes (0, 0) and (0, 1), which respectively contain the sets of words {w0, w1} and {w2, w3}, and similarly super-class (1) contains the classes (1, 0) and (1, 1), which respectively contain the words {w4, w5} and {w6, w7}. If the tree is sufficiently balanced, the maximum depth (number of binary decisions) is on the order of the logarithm of the number of words |V|: the choice of one out of |V| words can be obtained by doing O(log |V|) operations (one for each of the nodes on the path from the root). In this example, computing the probability of a word y can be done by multiplying three probabilities, associated with the binary decisions to move left or right at each node on the path from the root to a node y. Let b_i(y) be the i-th binary decision when traversing the tree towards the value y. The probability of sampling an output y decomposes into a product of conditional probabilities, using the chain rule for conditional probabilities, with each node indexed by the prefix of these bits. For example, node (1, 0) corresponds to the prefix (b0(w4) = 1, b1(w4) = 0), and the probability of w4 can be decomposed as follows:

    P(y = w4) = P(b0 = 1, b1 = 0, b2 = 0)                                  (12.12)
              = P(b0 = 1) P(b1 = 0 | b0 = 1) P(b2 = 0 | b0 = 1, b1 = 0).   (12.13)
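The decomposition in Eqs. 12.12–12.13 can be sketched for the 8-word tree of Fig. 12.4; the per-node logistic-regression weights and the context vector below are random stand-ins, not trained parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical logistic-regression weights for the 7 internal nodes of the
# 8-word tree in Fig. 12.4, indexed by the bit-prefix of the node.
rng = np.random.default_rng(0)
ctx_dim = 5
node_w = {prefix: rng.standard_normal(ctx_dim)
          for prefix in [(), (0,), (1,), (0, 0), (0, 1), (1, 0), (1, 1)]}

def word_prob(bits, context):
    """P(y = word) as a product over the root-to-leaf path (Eqs. 12.12-12.13)."""
    p, prefix = 1.0, ()
    for b in bits:
        p_right = sigmoid(node_w[prefix] @ context)  # P(next bit = 1 | C, prefix)
        p *= p_right if b == 1 else (1.0 - p_right)
        prefix = prefix + (b,)
    return p

context = rng.standard_normal(ctx_dim)
# w4 is the leaf with binary code (1, 0, 0), as in Eq. 12.13.
p_w4 = word_prob((1, 0, 0), context)
# The probabilities over all 8 leaves sum to 1 by construction.
leaves = [(b0, b1, b2) for b0 in (0, 1) for b1 in (0, 1) for b2 in (0, 1)]
total = sum(word_prob(bits, context) for bits in leaves)
```

Only three sigmoid evaluations are needed per word, rather than an 8-way softmax; this is the log |V| versus |V| saving the text describes.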
required to identify a word, with the weighting given by the frequency of these words. In this example, the number of operations needed to compute the hidden activations grows as O(l n_h^2) while the output computations grow as O(n_b n_h). As long as n_b ≤ l n_h, we can reduce computation more by shrinking n_h than by shrinking n_b. Indeed, n_b is often small. Because the size of the vocabulary rarely exceeds a million words and log_2(10^6) ≈ 20, it is possible to reduce n_b to about 20, but n_h is often much larger, around 10^3 or more. Rather than carefully optimizing a tree with a branching factor of 2, one can instead define a tree with depth two and a branching factor of √|V|. Such a tree corresponds to simply defining a set of mutually exclusive word classes. The simple approach based on a tree of depth two captures most of the computational benefit of the hierarchical strategy.

One question that remains somewhat open is how to best define these word
classes, or how to define the word hierarchy in general. Early work used existing hierarchies (Morin and Bengio, 2005) but the hierarchy can also be learned, ideally jointly with the neural language model. Learning the hierarchy is difficult. An exact optimization of the log-likelihood appears intractable because the choice of a word hierarchy is a discrete one, not amenable to gradient-based optimization. However, one could use discrete optimization to approximately optimize the partition of words into word classes.

An important advantage of the hierarchical softmax is that it brings computational benefits both at training time and at test time, if at test time we want to compute the probability of specific words.

Of course, computing the probability of all |V| words will remain expensive even with the hierarchical softmax.
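The depth-two variant discussed above, where the tree reduces to a set of mutually exclusive word classes, can be sketched as a class-then-word factorization; the class assignments and scores here are hypothetical stand-ins:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Depth-two tree: P(y | C) = P(class(y) | C) * P(y | class(y), C).
rng = np.random.default_rng(0)
V, n_classes = 16, 4                       # branching factor sqrt(|V|) = 4
words_per_class = V // n_classes
word_class = np.repeat(np.arange(n_classes), words_per_class)

p_class = softmax(rng.standard_normal(n_classes))            # P(class | C)
p_word_in_class = [softmax(rng.standard_normal(words_per_class))
                   for _ in range(n_classes)]                # P(word | class, C)

def word_prob(i):
    c = word_class[i]
    pos = i - c * words_per_class          # index of word i within its class
    return p_class[c] * p_word_in_class[c][pos]

total = sum(word_prob(i) for i in range(V))
```

Each word probability requires one softmax over n_classes entries and one over words_per_class entries, i.e. about 2√|V| scores instead of |V|.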
Another important operation is selecting the most likely word in a given context. Unfortunately the tree structure does not provide an efficient and exact solution to this problem.

A disadvantage is that in practice the hierarchical softmax tends to give worse test results than the sampling-based methods we will describe next. This may be due to a poor choice of word classes.

12.4.3.3  Importance Sampling

One way to speed up the training of neural language models is to avoid explicitly computing the contribution of the gradient from all of the words that do not appear in the next position. Every incorrect word should have low probability under the model. It can be computationally costly to enumerate all of these words. Instead, it is possible to sample only a subset of the words. Using the notation introduced
in Eq. 12.8, the gradient can be written as follows:

    ∂ log P(y | C) / ∂θ = ∂ log softmax_y(a) / ∂θ                        (12.14)
                        = ∂/∂θ log [ e^{a_y} / Σ_i e^{a_i} ]             (12.15)
                        = ∂/∂θ ( a_y − log Σ_i e^{a_i} )                 (12.16)
                        = ∂a_y/∂θ − Σ_i P(y = i | C) ∂a_i/∂θ,            (12.17)

where a is the vector of pre-softmax activations (or scores), with one element per word. The first term is the positive phase term (pushing a_y up) while the second term is the negative phase term (pushing a_i down for all i, with weight P(i | C)). Since the negative phase term is an expectation, we can estimate it with a Monte Carlo sample. However, that would require sampling from the model itself. Sampling from the model requires computing P(i | C) for all i in the vocabulary, which is precisely what we are trying to avoid.
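The identity in Eq. 12.17 can be checked numerically: the gradient of log softmax_y(a) with respect to a_i is 1_{i=y} − P(y = i | C), which a finite-difference test confirms:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

rng = np.random.default_rng(0)
a = rng.standard_normal(6)     # pre-softmax scores, one per word
y = 2                          # index of the observed word

# Analytic gradient from Eq. 12.17: one-hot positive phase minus P(i | C).
analytic = -softmax(a)
analytic[y] += 1.0

# Central finite differences of d/da_i log softmax_y(a).
eps = 1e-6
numeric = np.empty_like(a)
for i in range(len(a)):
    ap, am = a.copy(), a.copy()
    ap[i] += eps
    am[i] -= eps
    numeric[i] = (np.log(softmax(ap)[y]) - np.log(softmax(am)[y])) / (2 * eps)
```

The two gradients agree to finite-difference precision; note also that the entries of the analytic gradient sum to zero, since the positive and negative phases carry equal total mass.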
Instead of sampling from the model, one can sample from another distribution, called the proposal distribution (denoted q), and use appropriate weights to correct for the bias introduced by sampling from the wrong distribution (Bengio and Sénécal, 2003; Bengio and Sénécal, 2008). This is an application of a more general technique called importance sampling, which will be described in more detail in Sec. 17.2. Unfortunately, even exact importance sampling is not efficient because it requires computing weights p_i / q_i, where p_i = P(i | C), which can only be computed if all the scores a_i are computed. The solution adopted for this application is called biased importance sampling, where the importance weights are normalized to sum to 1. When negative word n_i is sampled, the associated gradient is weighted by

    w_i = (p_{n_i} / q_{n_i}) / Σ_{j=1}^{N} (p_{n_j} / q_{n_j}).         (12.18)
These weights are used to give the appropriate importance to the m negative samples from q used to form the estimated negative phase contribution to the gradient:

    Σ_{i=1}^{|V|} P(i | C) ∂a_i/∂θ ≈ (1/m) Σ_{i=1}^{m} w_i ∂a_{n_i}/∂θ.  (12.19)

A unigram or a bigram distribution works well as the proposal distribution q. It is easy to estimate the parameters of such a distribution from data. After estimating the parameters, it is also possible to sample from such a distribution very efficiently.
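A sketch of the biased (self-normalized) importance-sampling estimate of the negative phase, in the spirit of Eqs. 12.18–12.19; the model and proposal distributions below are random stand-ins, and because the weights are normalized to sum to 1 they are applied directly to the sampled gradient terms:

```python
import numpy as np

rng = np.random.default_rng(0)
V = 1000
p = rng.random(V)
p /= p.sum()                               # stand-in for model probabilities P(i | C)
q = 0.5 + rng.random(V)
q /= q.sum()                               # unigram-style proposal, bounded away from 0
grad_a = rng.standard_normal(V)            # stand-in for da_i/dtheta, one scalar per word

exact = p @ grad_a                         # negative phase: sum_i P(i | C) da_i/dtheta

m = 100_000
n = rng.choice(V, size=m, p=q)             # m negative words sampled from q
w = p[n] / q[n]
w /= w.sum()                               # normalized importance weights (Eq. 12.18)
estimate = w @ grad_a[n]                   # self-normalized estimate of the exact sum
```

In a real model one would only need the scores a_{n_i} of the sampled words, not the full softmax; here `p` is known exactly only so the estimate can be compared against the true negative-phase sum.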
Importance sampling is not only useful for speeding up models with large softmax outputs. More generally, it is useful for accelerating training with large sparse output layers, where the output is a sparse vector rather than a 1-of-n choice. An example is a bag of words. A bag of words is a sparse vector v where v_i indicates the presence or absence of word i from the vocabulary in the document. Alternately, v_i can indicate the number of times that word i appears. Machine learning models that emit such sparse vectors can be expensive to train for a variety of reasons. Early in learning, the model may not actually choose to make the output truly sparse. Moreover, the loss function we use for training might most naturally be described in terms of comparing every element of the output to every element of the target.
This means that it is not always clear that there is a computational benefit to using sparse outputs, because the model may choose to make the majority of the output non-zero, and all of these non-zero values need to be compared to the corresponding training target, even if the training target is zero. Dauphin et al. (2011) demonstrated that such models can be accelerated using importance sampling. The efficient algorithm minimizes the reconstruction loss for the "positive words" (those that are non-zero in the target) and an equal number of "negative words." The negative words are chosen randomly, using a heuristic to sample words that are more likely to be mistaken. The bias introduced by this heuristic oversampling can then be corrected using importance weights.

In all of these cases, the computational complexity of gradient estimation for the output layer is reduced to be proportional to the number of negative samples rather than proportional to the size of the output vector.
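The positive/negative word scheme described above can be sketched as follows; the uniform negative sampling here is a placeholder for the heuristic (and its importance-weight correction) used by Dauphin et al. (2011):

```python
import numpy as np

rng = np.random.default_rng(0)
V = 20
target = np.zeros(V)
target[[2, 5, 11]] = 1.0                  # bag-of-words target with 3 "positive" words

positives = np.flatnonzero(target)        # words present in the document
# Sample an equal number of "negative" words from the rest of the vocabulary;
# a real heuristic would oversample words likely to be confused with positives.
candidates = np.setdiff1d(np.arange(V), positives)
negatives = rng.choice(candidates, size=len(positives), replace=False)

output = rng.random(V)                    # stand-in for the model's reconstruction
active = np.concatenate([positives, negatives])
# Reconstruction loss evaluated on 6 entries instead of all 20.
loss = np.mean((output[active] - target[active]) ** 2)
```

The gradient of this loss touches only the rows of the output layer indexed by `active`, which is what makes the cost proportional to the number of sampled words.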
12.4.3.4  Noise-Contrastive Estimation and Ranking Loss

Other approaches based on sampling have been proposed to reduce the computational cost of training neural language models with large vocabularies. An early example is the ranking loss proposed by Collobert and Weston (2008a), which views the output of the neural language model for each word as a score and tries to make the score of the correct word a_y be ranked high in comparison to the other scores a_i. The ranking loss proposed then is

    L = Σ_i max(0, 1 − a_y + a_i).                                       (12.20)
The gradient is zero for the i-th term if the score of the observed word, a_y, is greater than the score of the negative word a_i by a margin of 1. One issue with this criterion is that it does not provide estimated conditional probabilities, which
are useful in some applications, including speech recognition and text generation (including conditional text generation tasks such as translation).

A more recently used training objective for neural language models is noise-contrastive estimation, which is introduced in Sec. 18.6. This approach has been successfully applied to neural language models (Mnih and Teh, 2012; Mnih and Kavukcuoglu, 2013).
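The ranking loss of Eq. 12.20 can be sketched directly; the scores below are hypothetical, and the constant i = y term of the sum is dropped since it does not affect the gradient:

```python
import numpy as np

def ranking_loss(a, y):
    """Eq. 12.20: sum over i != y of max(0, 1 - a_y + a_i)."""
    margins = 1.0 - a[y] + np.delete(a, y)   # one margin per negative word
    return np.maximum(0.0, margins).sum()

a = np.array([2.5, 0.1, -0.3, 1.8])   # hypothetical scores, one per word
# With y = 0, only word 3 violates the margin: 1 - 2.5 + 1.8 = 0.3
loss = ranking_loss(a, y=0)
```

When the correct word beats every other score by more than the margin of 1, the loss (and hence the gradient) is exactly zero, matching the description in the text.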
12.4.4  Combining Neural Language Models with n-grams

A major advantage of n-gram models over neural networks is that n-gram models
achieve high model capacity (by storing the frequencies of very many tuples) while requiring very little computation to process an example (by looking up only a few tuples that match the current context). If we use hash tables or trees to access the counts, the computation used for n-grams is almost independent of capacity. In comparison, doubling a neural network's number of parameters typically also roughly doubles its computation time. Exceptions include models that avoid using all parameters on each pass. Embedding layers index only a single embedding in each pass, so we can increase the vocabulary size without increasing the computation time per example. Some other models, such as tiled convolutional networks, can add parameters while reducing the degree of parameter sharing in order to maintain the same amount of computation.
However, typical neural network layers based on matrix multiplication use an amount of computation proportional to the number of parameters.

One easy way to add capacity is thus to combine both approaches in an ensemble consisting of a neural language model and an n-gram language model (Bengio et al., 2001, 2003). As with any ensemble, this technique can reduce test error if the ensemble members make independent mistakes. The field of ensemble learning provides many ways of combining the ensemble members' predictions, including uniform weighting and weights chosen on a validation set. Mikolov et al. (2011a) extended the ensemble to include not just two models but a large array of models. It is also possible to pair a neural network with a maximum entropy model and train both jointly (Mikolov et al., 2011b). This approach can be viewed as training
This approac approach h can be viewed is also net possible to pair neural ork with a maxim um entrop y model aIt neural netw work with anaextra setnetw of inputs that are connected directly to and the train b oth join tly ( Mikolo v et al. , 2011b ). This approac h can b e viewed as training output, and not connected to an any y other part of the mo model. del. The extra inputs are a neural net w ork with an extra set of inputs that are connected directly the indicators for the presence of particular n-grams in the input con context, text, so to these to any and other part of theThe moincrease del. Theinextra inputs are voutput, ariablesand are vnot ery connected high-dimensional very sparse. mo model del capacity n-grams indicators fornew theppresence of particular in up thetoinput text, so these |sV |n con is huge—the ortion of the architecture contains parameters—but vthe ariables are of very high-dimensional verytosparse. The model capacity amount added computation and needed pro process cess an increase input isin minimal because sV is h uge—the new p ortion of the architecture contains up to parameters—but the extra inputs are very sparse. the amount of added computation needed to process an input | |is minimal because 475 the extra inputs are very sparse.
12.4.5
Neural Machine Translation
Machine translation is the task of reading a sentence in one natural language and emitting a sentence with the equivalent meaning in another language. Machine translation systems often involve many components. At a high level, there is often one component that proposes many candidate translations. Many of these translations will not be grammatical due to differences between the languages.
For directly they yield phrases such as “apple red.” The prop proposal osal mec mechanism hanism suggests example, manyoflanguages put adjectiv es after nouns, so when“red translated man many y variants the suggested translation, ideally including apple.” toAEnglish second directly they phrases suchsystem, as “apple red.” Themo prop mechanism suggests comp componen onen onent t ofyield the translation a language model, del,osal ev evaluates aluates the prop proposed osed man y v ariants of the suggested translation, ideally including “red apple.” A second translations, and can score “red apple” as better than “apple red.” component of the translation system, a language model, evaluates the proposed The earliest of score neural“red net netw works for mac machine hine translation was to upgrade the translations, anduse can apple” as b etter than “apple red.” language mo of a translation system by using a neural language mo model del model del (Sch Schwenk wenk The earliest use of neural net w orks for mac hine translation w as to upgrade had the et al. al.,, 2006; Sc Schw hw hwenk enk, 2010). Previously Previously,, most machine translation systems language del ofmo a del translation by using a neural models del (used Schwenk n-gramlanguage used an nmo -gram model for this system comp componen onen onent. t. The based mo models for et al. , 2006 ; Sc hw enk , 2010 ). Previously , most machine translation systems had mac machine hine translation include not just traditional back-off n-gram mo models dels (Jelinek n n used an -gram mo del for this comp onen t. The -gram based mo dels used for and Mercer, 1980; Katz, 1987; Chen and Goo Goodman dman, 1999) but also maximum macopy hine language translation include not just traditional back-off -gram models (Jelinek entr entropy mo models dels (Berger et al. 
, 1996), in whic which h nan affine-softmax lay layer er and Mercer , 1980 ; Katz , 1987 ; Chen and Goo dman , 1999 ) but also maximum predicts the next word giv given en the presence of frequent n-grams in the context. entropy language models (Berger et al., 1996), in which an affine-softmax layer Traditional language mo models dels rep report ort the probability of in a natural language predicts the next word giv en thesimply presence of frequent n-grams the context. sen sentence. tence. Because mac machine hine translation inv involves olves pro producing ducing an output sen sentence tence giv given en Traditional language moes dels simply report the of a natural language an input sentence, it mak makes sense to extend theprobability natural language mo model del to be sentence. Because machine translation inv,olves ducing anard output sentence givdel en conditional. As describ described ed in Sec. 6.2.1.1 it is pro straightforw straightforward to extend a mo model an input sentence, it mak es sense to extend the natural language mo del to b that defines a marginal distribution ov over er some variable to define a conditionale conditional. oAs ed in Sec. , it isC,straightforw ardt btoe aextend model distribution ver describ that variable given6.2.1.1 a con context text where C migh might single avariable that defines a marginal distribution variable to define a conditional or a list of variables. Devlin et al. (2014ov ) bereatsome the state-of-the-art in some statistical distribution o v er that v ariable given a con text , where migh t b e a single C C mac machine hine translation benchmarks by using an MLP to score a phrase t 1, t2v,ariable . . . , tk or a list of v ariables. Devlin et al. ( 2014 ) b eat the state-of-the-art in some statistical in the target language given a phrase s 1, s2 , . . . , sn in the source language. The machine translation by using an MLP to score a phrase t , t , . . . , t P (t1,btenchmarks MLP estimates 2 , . . . , tk | s1, s2 , . 
. . , s n). The estimate formed by this MLP in the target languageprovided given a bphrase s , s , . .n.-gram , s in mo thedels. source language. The replaces the estimate y conditional models. MLP estimates P (t , t , . . . , t s , s , . . . , s ). The estimate formed by this MLP A dra drawbac wbac wback k of the provided MLP-based approac approach h is that it requires the sequences to be replaces the estimate |by conditional n-gram models. prepro preprocessed cessed to be of fixed length. To mak makee the translation more flexible, we would A dra wbac k of the MLP-based approac h isvariable that it requires the sequences to be lik likee to use a mo model del that can accommo accommodate date length inputs and variable preprocessed to bAn e ofRNN fixedpro length. o mak e the. translation flexible, we would length outputs. provides videsTthis ability ability. Sec. 10.2.4 more describ describes es several wa ways ys lik e to use a mo del that can accommo date v ariable length inputs and v ariable of constructing an RNN that represents a conditional distribution ov over er a sequence length outputs. An RNN pro vides this ability . Sec. 10.2.4 describ es several ways giv given en some input, and Sec. 10.4 describ describes es ho how w to accomplish this conditioning of constructing ana RNN that represents conditional distribution er a sequence sequence when the input is sequence. In all cases,a one mo model del first reads the ov input givenemits someainput, and Sec. that 10.4 describ es how to input accomplish this conditioning and data structure summarizes the sequence. We call this when the input is a sequence. In all cases, one model first reads the input sequence 476 and emits a data structure that summarizes the input sequence. We call this
CHAPTER 12. APPLICATIONS
[Figure 12.5 diagram: Source object (French sentence or image) → Encoder → Intermediate, semantic representation → Decoder → Output object (English sentence)]
Figure 12.5: The encoder-decoder architecture to map back and forth between a surface representation (such as a sequence of words or an image) and a semantic representation. By using the output of an encoder of data from one modality (such as the encoder mapping from French sentences to hidden representations capturing the meaning of sentences) as the input to a decoder for another modality (such as the decoder mapping from hidden representations capturing the meaning of sentences to English), we can train systems that translate from one modality to another. This idea has been applied successfully not just to machine translation but also to caption generation from images.
summary the “context” C. The context C may be a list of vectors, or it may be a vector or tensor. The model that reads the input to produce C may be an RNN (Cho et al., 2014a; Sutskever et al., 2014; Jean et al., 2014) or a convolutional network (Kalchbrenner and Blunsom, 2013). A second model, usually an RNN, then reads the context C and generates a sentence in the target language. This general idea of an encoder-decoder framework for machine translation is illustrated in Fig. 12.5.

In order to generate an entire sentence conditioned on the source sentence, the model must have a way to represent the entire source sentence. Earlier models were only able to represent individual words or phrases. From a representation learning point of view, it can be useful to learn a representation in which sentences that have the same meaning have similar representations regardless of whether they were written in the source language or the target language. This strategy was explored first using a combination of convolutions and RNNs (Kalchbrenner and Blunsom, 2013). Later work introduced the use of an RNN for scoring proposed translations (Cho et al., 2014a) and for generating translated sentences (Sutskever et al., 2014). Jean et al. (2014) scaled these models to larger vocabularies.
12.4.5.1 Using an Attention Mechanism and Aligning Pieces of Data
[Figure 12.6 diagram: the context c is the weighted sum α(t−1) h(t−1) + α(t) h(t) + α(t+1) h(t+1)]
Figure 12.6: A modern attention mechanism, as introduced by Bahdanau et al. (2015), is essentially a weighted average. A context vector c is formed by taking a weighted average of feature vectors h(t) with weights α(t). In some applications, the feature vectors h(t) are hidden units of a neural network, but they may also be raw input to the model. The weights α(t) are produced by the model itself. They are usually values in the interval [0, 1] and are intended to concentrate around just one h(t) so that the weighted average approximates reading that one specific time step precisely. The weights α(t) are usually produced by applying a softmax function to relevance scores emitted by another portion of the model. The attention mechanism is more expensive computationally than directly indexing the desired h(t), but direct indexing cannot be trained with gradient descent. The attention mechanism based on weighted averages is a smooth, differentiable approximation that can be trained with existing optimization algorithms.

Using a fixed-size representation to capture all the semantic details of a very long sentence of say 60 words is very difficult. It can be achieved by training a sufficiently large RNN well enough and for long enough, as demonstrated by Cho et al. (2014a) and Sutskever et al. (2014). However, a more efficient approach is to read the whole sentence or paragraph (to get the context and the gist of what is being expressed), then produce the translated words one at a time, each time focusing on a different part of the input sentence in order to gather the semantic details that are required to produce the next output word. That is exactly the idea that Bahdanau et al. (2015) first introduced. The attention mechanism used to focus on specific parts of the input sequence at each time step is illustrated in Fig. 12.6.

We can think of an attention-based system as having three components:
1. A process that “reads” raw data (such as source words in a source sentence), and converts them into distributed representations, with one feature vector associated with each word position.

2. A list of feature vectors storing the output of the reader. This can be understood as a “memory” containing a sequence of facts, which can be retrieved later, not necessarily in the same order, without having to visit all of them.

3. A process that “exploits” the content of the memory to sequentially perform a task, at each time step having the ability to put attention on the content of one memory element (or a few, with a different weight).

The third component generates the translated sentence.

When words in a sentence written in one language are aligned with corresponding words in a translated sentence in another language, it becomes possible to relate the corresponding word embeddings. Earlier work showed that one could learn a kind of translation matrix relating the word embeddings in one language with the word embeddings in another (Kočiský et al., 2014), yielding lower alignment error rates than traditional approaches based on the frequency counts in the phrase table. There is even earlier work on learning cross-lingual word vectors (Klementiev et al., 2012). Many extensions to this approach are possible. For example, more efficient cross-lingual alignment (Gouws et al., 2014) allows training on larger datasets.
12.4.6 Historical Perspective
The idea of distributed representations for symbols was introduced by Rumelhart et al. (1986a) in one of the first explorations of back-propagation, with symbols corresponding to the identity of family members and the neural network capturing the relationships between family members, with training examples forming triplets such as (Colin, Mother, Victoria). The first layer of the neural network learned a representation of each family member. For example, the features for Colin might represent which family tree Colin was in, what branch of that tree he was in, what generation he was from, etc. One can think of the neural network as computing learned rules relating these attributes together in order to obtain the desired predictions. The model can then make predictions such as inferring who is the mother of Colin.

The idea of forming an embedding for a symbol was extended to the idea of an embedding for a word by Deerwester et al. (1990).
These embeddings were learned using the SVD. Later, embeddings would be learned by neural networks.
The history of natural language processing is marked by transitions in the popularity of different ways of representing the input to the model. Following this early work on symbols or words, some of the earliest applications of neural networks to NLP (Miikkulainen and Dyer, 1991; Schmidhuber, 1996) represented the input as a sequence of characters.

Bengio et al. (2001) returned the focus to modeling words and introduced neural language models, which produce interpretable word embeddings. These neural models have scaled up from defining representations of a small set of symbols in the 1980s to millions of words (including proper nouns and misspellings) in modern applications. This computational scaling effort led to the invention of the techniques described above in Sec. 12.4.3.

Initially, the use of words as the fundamental units of language models yielded
improved language modeling performance (Bengio et al., 2001). To this day, new techniques continually push both character-based models (Sutskever et al., 2011) and word-based models forward, with recent work (Gillick et al., 2015) even modeling individual bytes of Unicode characters.

The ideas behind neural language models have been extended into several natural language processing applications, such as parsing (Henderson, 2003, 2004; Collobert, 2011), part-of-speech tagging, semantic role labeling, chunking, etc, sometimes using a single multi-task learning architecture (Collobert and Weston, 2008a; Collobert et al., 2011a) in which the word embeddings are shared across tasks.

Two-dimensional visualizations of embeddings became a popular tool for analyzing language models following the development of the t-SNE dimensionality reduction algorithm (van der Maaten and Hinton, 2008) and its high-profile application to visualizing word embeddings by Joseph Turian in 2009.
12.5 Other Applications
In this section we cover a few other types of applications of deep learning that are different from the standard object recognition, speech recognition and natural language processing tasks discussed above. Part III of this book will expand that scope even further to include tasks requiring the ability to generate rich high-dimensional samples (unlike “the next word” in language models).
12.5.1 Recommender Systems
One of the major families of applications of machine learning in the information technology sector is the ability to make recommendations of items to potential users or customers. Two major types of applications can be distinguished: online advertising and item recommendations (often these recommendations are still for the purpose of selling a product). Both rely on predicting the association between a user and an item, either to predict the probability of some action (the user buying the product, or some proxy for this action) or the expected gain (which may depend on the value of the product) if an ad is shown or a recommendation is made regarding that product to that user. The internet is currently financed in great part by various forms of online advertising. There are major parts of the economy that rely on online shopping. Companies including Amazon and eBay use machine learning, including deep learning, for their product recommendations.
Sometimes, the items are not products that are actually for sale. Examples include selecting posts to display on social network news feeds, recommending movies to watch, recommending jokes, recommending advice from experts, matching players for video games, or matching people in dating services.

Often, this association problem is handled like a supervised learning problem: given some information about the item and about the user, predict the proxy of interest (user clicks on ad, user enters a rating, user clicks on a “like” button, user buys product, user spends some amount of money on the product, user spends time visiting a page for the product, etc). This often ends up being either a regression problem (predicting some conditional expected value) or a probabilistic classification problem (predicting the conditional probability of some discrete event).

The early work on recommender systems relied on minimal information as inputs for these predictions: the user ID and the item ID. In this context, the only way to generalize is to rely on the similarity between the patterns of values of the target variable for different users or for different items. Suppose that user 1 and user 2 both like items A, B and C. From this, we may infer that user 1 and user 2 have similar tastes. If user 1 likes item D, then this should be a strong cue that user 2 will also like D. Algorithms based on this principle come under the name of collaborative filtering. Both non-parametric approaches (such as nearest-neighbor methods based on the estimated similarity between patterns of preferences) and parametric methods are possible.
Parametric methods often rely on learning a distributed representation (also called an embedding) for each user and for each item. Bilinear prediction of the target variable (such as a rating) is a simple parametric method that is highly successful and often found as a component of
state-of-the-art systems. The prediction is obtained by the dot product between the user embedding and the item embedding (possibly corrected by constants that depend only on either the user ID or the item ID). Let R̂ be the matrix containing our predictions, A a matrix with user embeddings in its rows and B a matrix with item embeddings in its columns. Let b and c be vectors that contain respectively a kind of bias for each user (representing how grumpy or positive that user is in general) and for each item (representing its general popularity). The bilinear prediction is thus obtained as follows:

    R̂_{u,i} = b_u + c_i + Σ_j A_{u,j} B_{j,i}.        (12.21)

Typically one wants to minimize the squared error between predicted ratings R̂_{u,i} and actual ratings R_{u,i}. User embeddings and item embeddings can then be conveniently visualized when they are first reduced to a low dimension (two or three), or they can be used to compare users or items against each other, just like word embeddings. One way to obtain these embeddings is by performing a singular value decomposition of the matrix R of actual targets (such as ratings). This corresponds to factorizing R = UDV′ (or a normalized variant) into the product of two factors, the lower rank matrices A = UD and B = V′. One problem with the SVD is that it treats the missing entries in an arbitrary way, as if they corresponded to a target value of 0. Instead we would like to avoid paying any cost for the predictions made on missing entries.
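A minimal NumPy sketch of Eq. 12.21 trained by gradient descent on the observed entries only. The matrix sizes, learning rate, iteration count, and the convention of storing missing ratings as zero while masking them out of the loss are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 6, 5, 2             # illustrative sizes

# Hypothetical ratings in {1,...,5}; roughly 70% of entries are observed.
R = rng.integers(1, 6, size=(n_users, n_items)).astype(float)
mask = rng.random((n_users, n_items)) < 0.7
R = R * mask                              # unobserved entries stored as 0

A = 0.1 * rng.normal(size=(n_users, k))   # user embeddings (rows of A)
B = 0.1 * rng.normal(size=(k, n_items))   # item embeddings (columns of B)
b = np.zeros(n_users)                     # per-user bias
c = np.zeros(n_items)                     # per-item bias

def sq_loss():
    R_hat = b[:, None] + c[None, :] + A @ B          # Eq. 12.21
    return ((mask * (R_hat - R)) ** 2).sum()         # observed entries only

init_loss = sq_loss()
lr = 0.05
for _ in range(500):
    err = mask * (b[:, None] + c[None, :] + A @ B - R)  # zero on missing entries
    gA, gB = err @ B.T, A.T @ err
    A -= lr * gA
    B -= lr * gB
    b -= lr * err.sum(axis=1)
    c -= lr * err.sum(axis=0)
final_loss = sq_loss()
```

Because `err` is zeroed wherever `mask` is false, the missing entries contribute neither to the loss nor to any gradient, which is exactly the property the plain SVD lacks.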
Instead we would like to avoid paying any cost for the predictions made on missing entries. Fortunately, the sum of squared errors on the observed ratings can also be easily minimized by gradient-based optimization. The SVD and the bilinear prediction of Eq. 12.21 both performed very well in the competition for the Netflix prize (Bennett and Lanning, 2007), aiming at predicting ratings for films, based on previous ratings by a large set of anonymous users. Many machine learning experts participated in this competition, which took place between 2006 and 2009. It raised the level of research in recommender systems using advanced machine learning and yielded improvements in recommender systems. Even though it did not win by itself, the simple bilinear prediction or SVD was a component of the ensemble models
presented by most of the competitors, including the winners (Töscher et al., 2009; Koren, 2009).

Beyond these bilinear models with distributed representations, one of the first uses of neural networks for collaborative filtering is based on the RBM undirected probabilistic model (Salakhutdinov et al., 2007). RBMs were an important element of the ensemble of methods that won the Netflix competition (Töscher et al., 2009; Koren, 2009). More advanced variants on the idea of factorizing the ratings matrix have also been explored in the neural networks community (Salakhutdinov and
CHAPTER 12. APPLICATIONS
Mnih, 2008).

However, there is a basic limitation of collaborative filtering systems: when a new item or a new user is introduced, its lack of rating history means that there is no way to evaluate its similarity with other items or users (respectively), or the degree of association between, say, that new user and existing items. This is called the problem of cold-start recommendations. A general way of solving the cold-start recommendation problem is to introduce extra information about the individual users and items. For example, this extra information could be user profile information or features of each item. Systems that use such information are called content-based recommender systems. The mapping from a rich set of user
features or item features to an embedding can be learned through a deep learning architecture (Huang et al., 2013; Elkahky et al., 2015).

Specialized deep learning architectures such as convolutional networks have also been applied to learn to extract features from rich content such as from musical audio tracks, for music recommendation (van den Oörd et al., 2013). In that work, the convolutional net takes acoustic features as input and computes an embedding for the associated song. The dot product between this song embedding and the embedding for a user is then used to predict whether a user will listen to the song.

12.5.1.1 Exploration Versus Exploitation

When making recommendations to users, an issue arises that goes beyond ordinary supervised learning and into the realm of reinforcement learning. Many recommendation problems are most accurately described theoretically as contextual bandits (Langford and Zhang, 2008; Lu et al., 2010).
The issue is that when we use the recommendation system to collect data, we get a biased and incomplete view of the preferences of users: we only see the responses of users to the items they were recommended and not to the other items. In addition, in some cases we may not get any information on users for whom no recommendation has been made (for example, with ad auctions, it may be that the price proposed for an ad was below a minimum price threshold, or does not win the auction, so the ad is not shown at all). More importantly, we get no information about what outcome would have resulted from recommending any of the other items.
This would be like training a classifier by picking one class ŷ for each training example x (typically the class with the highest probability according to the model) and then only getting as feedback whether this was the correct class or not. Clearly, each example conveys less information than in the supervised case where the true label is directly accessible, so more examples are necessary. Worse, if we are not careful, we could end up with a system that continues picking the wrong decisions even as more
and more data is collected, because the correct decision initially had a very low probability: until the learner picks that correct decision, it does not learn about the correct decision. This is similar to the situation in reinforcement learning where only the reward for the selected action is observed. In general, reinforcement learning can involve a sequence of many actions and many rewards. The bandits scenario is a special case of reinforcement learning, in which the learner takes only a single action and receives a single reward. The bandit problem is easier in the sense that the learner knows which reward is associated with which action. In the general reinforcement learning scenario, a high reward or a low reward might have been caused by a recent action or by an action in the distant past. The term contextual bandits refers to the case where the action is taken in the context of some input variable that can inform the decision.
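One of the simplest ways to act in a contextual bandit while still gathering data is epsilon-greedy action selection: with a small probability take a random action, otherwise take the action the current reward estimates favor for the observed context. The sketch below is purely illustrative; the contexts, items, reward values and epsilon are all made up:

```python
import random

random.seed(0)

# Made-up deterministic rewards per (context, action); unknown to the learner.
true_reward = {("user_a", "item_1"): 1.0, ("user_a", "item_2"): 2.0,
               ("user_b", "item_1"): 2.0, ("user_b", "item_2"): 0.0}
actions = ["item_1", "item_2"]
epsilon = 0.1

# Running average of observed reward for each (context, action) pair.
estimates = {k: 0.0 for k in true_reward}
counts = {k: 0 for k in true_reward}

for _ in range(2000):
    context = random.choice(["user_a", "user_b"])
    if random.random() < epsilon:
        action = random.choice(actions)                               # explore
    else:
        action = max(actions, key=lambda a: estimates[(context, a)])  # exploit
    r = true_reward[(context, action)]
    key = (context, action)
    counts[key] += 1
    estimates[key] += (r - estimates[key]) / counts[key]  # incremental mean

# After enough exploration, the greedy policy picks the better item per context.
policy = {c: max(actions, key=lambda a: estimates[(c, a)])
          for c in ["user_a", "user_b"]}
assert policy == {"user_a": "item_2", "user_b": "item_1"}
```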
For example, we at least know the user identity, and we want to pick an item. The mapping from context to action is also called a policy. The feedback loop between the learner and the data distribution (which now depends on the actions of the learner) is a central research issue in the reinforcement learning and bandits literature.

Reinforcement learning requires choosing a tradeoff between exploration and exploitation. Exploitation refers to taking actions that come from the current, best version of the learned policy: actions that we know will achieve a high reward. Exploration refers to taking actions specifically in order to obtain more training data. If we know that given context x, action a gives us a reward of 1, we do not know whether that is the best possible reward. We may want to exploit our current
policy and continue taking action a in order to be relatively sure of obtaining a reward of 1. However, we may also want to explore by trying action a′. We do not know what will happen if we try action a′. We hope to get a reward of 2, but we run the risk of getting a reward of 0. Either way, we at least gain some knowledge. Exploration can be implemented in many ways, ranging from occasionally taking random actions intended to cover the entire space of possible actions, to model-based approaches that compute a choice of action based on its expected reward and the model's amount of uncertainty about that reward.

Many factors determine the extent to which we prefer exploration or exploitation. One of the most prominent factors is the time scale we are interested in. If the
agent has only a short amount of time to accrue reward, then we prefer more exploitation. If the agent has a long time to accrue reward, then we begin with more exploration so that future actions can be planned more effectively with more knowledge. As time progresses and our learned policy improves, we move toward more exploitation.

Supervised learning has no tradeoff between exploration and exploitation
because the supervision signal always specifies which output is correct for each input. There is no need to try out different outputs to determine if one is better than the model's current output: we always know that the label is the best output.

Another difficulty arising in the context of reinforcement learning, besides the exploration-exploitation trade-off, is the difficulty of evaluating and comparing different policies. Reinforcement learning involves interaction between the learner and the environment. This feedback loop means that it is not straightforward to evaluate the learner's performance using a fixed set of test set input values. The policy itself determines which inputs will be seen. Dudik et al. (2011) present techniques for evaluating contextual bandits.
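One standard idea in this off-policy evaluation literature (a simpler building block than the specific techniques cited above) is inverse propensity scoring: reweight each logged reward by the probability with which the logging policy chose the logged action. The sketch below evaluates a hypothetical deterministic target policy on synthetic logs; the contexts, actions, rewards and propensities are all invented for illustration:

```python
import random

random.seed(1)

actions = ["a1", "a2"]

# Made-up expected rewards for each (context, action) pair.
reward_table = {("x1", "a1"): 0.2, ("x1", "a2"): 0.8,
                ("x2", "a1"): 0.6, ("x2", "a2"): 0.1}

# Logging policy: uniform over the two actions, so every propensity is 0.5.
logs = []
for _ in range(20000):
    x = random.choice(["x1", "x2"])
    a = random.choice(actions)
    logs.append((x, a, 0.5, reward_table[(x, a)]))  # (context, action, p, r)

# Deterministic target policy we wish to evaluate offline.
def target_policy(x):
    return "a2" if x == "x1" else "a1"

# Inverse propensity scoring estimate of the target policy's value:
# average of r / p over log entries where the target agrees with the log.
ips = sum(r / p for (x, a, p, r) in logs if target_policy(x) == a) / len(logs)

# True value is 0.5 * 0.8 + 0.5 * 0.6 = 0.7 under uniform contexts.
assert abs(ips - 0.7) < 0.05
```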
12.5.2 Knowledge Representation, Reasoning and Question Answering

Deep learning approaches have been very successful in language modeling, machine
translation and natural language processing due to the use of embeddings for symbols (Rumelhart et al., 1986a) and words (Deerwester et al., 1990; Bengio et al., 2001). These embeddings represent semantic knowledge about individual words and concepts. A research frontier is to develop embeddings for phrases and for relations between words and facts. Search engines already use machine learning for this purpose but much more remains to be done to improve these more advanced representations.

12.5.2.1 Knowledge, Relations and Question Answering

One interesting research direction is determining how distributed representations can be trained to capture the relations between two entities. These relations allow us to formalize facts about objects and how objects interact with each other.

In mathematics, a binary relation is a set of ordered pairs of objects. Pairs
that are in the set are said to have the relation while those that are not in the set do not. For example, we can define the relation "is less than" on the set of entities {1, 2, 3} by defining the set of ordered pairs S = {(1, 2), (1, 3), (2, 3)}. Once this relation is defined, we can use it like a verb. Because (1, 2) ∈ S, we say that 1 is less than 2. Because (2, 1) ∉ S, we cannot say that 2 is less than 1. Of course, the entities that are related to one another need not be numbers. We could define a relation is_a_type_of containing tuples like (dog, mammal).

In the context of AI, we think of a relation as a sentence in a syntactically
simple and highly structured language. The relation plays the role of a verb, while two arguments to the relation play the role of its subject and object. These sentences take the form of a triplet of tokens

(subject, verb, object)    (12.22)

with values

(entity_i, relation_j, entity_k).    (12.23)

We can also define an attribute, a concept analogous to a relation, but taking only one argument:

(entity_i, attribute_j).    (12.24)

For example, we could define the has_fur attribute, and apply it to entities like dog.

Many applications require representing relations and reasoning about them. How should we best do this within the context of neural networks?

Machine learning models of course require training data. We can infer relations between entities from training datasets consisting of unstructured natural language. There are also structured databases that identify relations explicitly. A common structure for these databases is the relational database, which stores this same kind of information, albeit not formatted as three token sentences. When a
When a exp expert ertofkno knowledge wledge ab about out an application areaastothree an artificial in intelligence telligence system, database is in tended to convey commonsense knowledge about everyda life or w e call the database a know knowle le ledge dge base ase.. Knowledge bases range from ygeneral expert kno wledge ab out an application area to an artificial intelligence system, Freebase Freebase, OpenCyc, Wikibase Wikibase, ones like , OpenCyc , WordNet, or ,1 etc. to more sp specialized ecialized w e call the database a know le dge b ase . Knowledge bases range from general GeneOntology.. 2 Represen kno knowledge wledge bases, like GeneOntology Representations tations for entities and relations Freebase , OpenCyc, WordNet, or Wikibase, etc. to more specialized ones like can be learned by considering eac triplet in a kno each h knowledge wledge base as a training example GeneOntology kno wledge bases, like . Represen tations and relations and maximizing a training ob objectiv jectiv jectivee that captures their for jointentities distribution (Bordes can b e learned by considering eac h triplet in a kno wledge base as a training example et al. al.,, 2013a). and maximizing a training ob jective that captures their joint distribution (Bordes In addition to training data, we also need to define a mo model del family to train. et al., 2013a). A common approach is to extend neural language mo models dels to mo model del en entities tities and In addition to training data, we also need to define a mo del family to train. relations. Neu Neural ral language mo models dels learn a vector that provides a distributed A common approach extend language dels to model eneen tities and represen representation tation of eac each hiswto ord. Theyneural also learn ab about outmo interactions betw etween words, relations. 
Neu ral language mo dels learn a vector that provides a distributed suc such h as which word is likely to come after a sequence of words, by learning functions represen tation ofWeac h wextend ord. They learn to aben out interactions betwby eenlearning words, of these vectors. e can this also approach entities tities and relations suchem asbwhich ord is likely to hcome after aIn sequence of w ords, byblearning functions an emb eddingwv ector for eac each relation. fact, the parallel etw etween een mo modeling deling of these vectors. We can extend this approach to entities and relations by learning 1 cyc.com/opencyc, wordnet. from web sites:Infreebase.com, an Respectively embedding available vector for eacthese h relation. fact, the parallel between mo deling princeton.edu, wikiba.se 2 geneontology.org
language and modeling knowledge encoded as relations is so close that researchers have trained representations of such entities by using both knowledge bases and natural language sentences (Bordes et al., 2011, 2012; Wang et al., 2014a) or combining data from multiple relational databases (Bordes et al., 2013b). Many possibilities exist for the particular parametrization associated with such a model. Early work on learning about relations between entities (Paccanaro and Hinton, 2000) posited highly constrained parametric forms ("linear relational embeddings"), often using a different form of representation for the relation than for the entities. For example, Paccanaro and Hinton (2000) and Bordes et al. (2011) used vectors for entities and matrices for relations, with the idea that a relation acts like an operator on entities. Alternatively, relations can be considered as any other entity (Bordes
al.,, 2012 allowingfor us relations, to make statemen statements ab about out flexibility y is on en tities. Alternatively , relations can b e considered as any other en tit y ( Bordes put in the machinery that combines them in order to mo model del their joint distribution. et al., 2012), allowing us to make statements about relations, but more flexibility is A practical short-term application of suc such h mo models dels is link pr preediction diction:: predicting put in the machinery that combines them in order to model their joint distribution. missing arcs in the kno knowledge wledge graph. This is a form of generalization to new A practical short-term application such models is link ediction: exist predicting facts, based on old facts. Most of the of knowledge bases thatprcurrently hav havee missing arcs in the kno wledge graph. This is a form of generalization to new been constructed through manual lab labor, or, whic which h tends to leav leavee many and probably facts, based Mostabsen of the knowledge bases that currently hav e the ma majorit jorit jority yon of old truefacts. relations absent t from the kno knowledge wledge base. See Wexist ang et al. een constructed manual labor, whiceth al. tends to )leav many andofprobably (b2014b ), Lin et al.through (2015) and Garcia-Duran (2015 foreexamples suc such h an the ma jorit y of true relations absen t from the kno wledge base. See W ang et al. application. (2014b), Lin et al. (2015) and Garcia-Duran et al. (2015) for examples of such an Ev Evaluating aluating the performance of a mo model del on a link prediction task is difficult application. because we hav havee only a dataset of positiv ositivee examples (facts that are kno known wn to Ev aluating the p erformance of a mo del on a link prediction task is difficult be true). 
If the model proposes a fact that is not in the dataset, we are unsure whether the model has made a mistake or discovered a new, previously unknown fact. The metrics are thus somewhat imprecise and are based on testing how the model ranks a held-out set of known true positive facts compared to other facts that are less likely to be true. A common way to construct interesting examples that are probably negative (facts that are probably false) is to begin with a true fact and create corrupted versions of that fact, for example by replacing one entity in the relation with a different entity selected at random. The popular precision at 10% metric counts how many times the model ranks a "correct" fact among the top 10% of all corrupted versions of that fact.

Another application of knowledge bases and distributed representations for them is word-sense disambiguation (Navigli and Velardi, 2005; Bordes et
al.,, 2012), Another application of knowledge bases and distributed represen tations for whic which h is the task of deciding whic which h of the senses of a word is the appropriate one, them is wor d-sense disambiguation (Navigli and Velardi, 2005; Bordes et al., 2012), in some context. which is the task of deciding which of the senses of a word is the appropriate one, Ev Even en entually tually tually,, knowledge of relations com combined bined with a reasoning process and in some context. understanding of natural language could allo allow w us to build a general question Eventually, knowledge of relations combined with a reasoning process and understanding of natural language could 487 allow us to build a general question
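The corrupted-fact ranking evaluation described above can be sketched in a few lines. Everything below is an illustrative assumption rather than the chapter's model: the entity and relation embeddings are random, the TransE-style distance score is just one possible plausibility function, and a "true" fact is planted so that it should rank well among its corruptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_entities, dim = 50, 8

# Hypothetical learned embeddings (random here, for illustration only)
E = rng.normal(size=(n_entities, dim))   # entity embeddings
r = rng.normal(size=dim)                 # one relation embedding

def score(head, tail):
    # TransE-style plausibility: higher (less negative) is more plausible
    return -np.linalg.norm(E[head] + r - E[tail])

# Plant a true fact (head=0, tail=1) that fits the relation exactly
E[1] = E[0] + r
true_head, true_tail = 0, 1

# Corrupt the fact by replacing the tail with every other entity
corrupted_tails = [t for t in range(n_entities) if t != true_tail]
scores = np.array([score(true_head, t) for t in corrupted_tails])
true_score = score(true_head, true_tail)

# Rank of the true fact among all corrupted versions (1 = best),
# and whether it lands in the top 10%
rank = 1 + np.sum(scores > true_score)
in_top_10pct = rank <= 0.1 * (1 + len(corrupted_tails))
print(rank, in_top_10pct)
```

Because the true fact is planted to fit the relation exactly, it should rank first among its corruptions; for a learned model one would average this top-10% indicator over many held-out facts.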
CHAPTER 12. APPLICATIONS
answering system. A general question answering system must be able to process input information and remember important facts, organized in a way that enables it to retrieve and reason about them later. This remains a difficult open problem which can only be solved in restricted “toy” environments. Currently, the best approach to remembering and retrieving specific declarative facts is to use an explicit memory mechanism, as described in Sec. 10.12. Memory networks were first proposed to solve a toy question answering task (Weston et al., 2014). Kumar et al. (2015) have proposed an extension that uses GRU recurrent nets to read the input into the memory and to produce the answer given the contents of the memory.

Deep learning has been applied to many other applications besides the ones described here, and will surely be applied to even more after this writing. It would be impossible to describe anything remotely resembling a comprehensive coverage of such a topic. This survey provides a representative sample of what is possible as of this writing.

This concludes Part II, which has described modern practices involving deep networks, comprising all of the most successful methods. Generally speaking, these methods involve using the gradient of a cost function to find the parameters of a model that approximates some desired function. With enough training data, this approach is extremely powerful. We now turn to Part III, in which we step into the territory of research: methods that are designed to work with less training data or to perform a greater variety of tasks, where the challenges are more difficult and not as close to being solved as the situations we have described so far.
Part III
Deep Learning Research
This part of the book describes the more ambitious and advanced approaches to deep learning, currently pursued by the research community.

In the previous parts of the book, we have shown how to solve supervised learning problems: how to learn to map one vector to another, given enough examples of the mapping.

Not all problems we might want to solve fall into this category. We may wish to generate new examples, or determine how likely some point is, or handle missing values and take advantage of a large set of unlabeled examples or examples from related tasks. A shortcoming of the current state of the art for industrial applications is that our learning algorithms require large amounts of supervised data to achieve good accuracy. In this part of the book, we discuss some of the speculative approaches to reducing the amount of labeled data necessary for existing models to work well and be applicable across a broader range of tasks. Accomplishing these goals usually requires some form of unsupervised or semi-supervised learning.

Many deep learning algorithms have been designed to tackle unsupervised learning problems, but none have truly solved the problem in the same way that deep learning has largely solved the supervised learning problem for a wide variety of tasks. In this part of the book, we describe the existing approaches to unsupervised learning and some of the popular thought about how we can make progress in this field.

A central cause of the difficulties with unsupervised learning is the high dimensionality of the random variables being modeled. This brings two distinct challenges: a statistical challenge and a computational challenge. The statistical challenge regards generalization: the number of configurations we may want to distinguish can grow exponentially with the number of dimensions of interest, and this quickly becomes much larger than the number of examples one can possibly have (or use with bounded computational resources). The computational challenge associated with high-dimensional distributions arises because many algorithms for learning or using a trained model (especially those based on estimating an explicit probability function) involve intractable computations that grow exponentially with the number of dimensions.

With probabilistic models, this computational challenge arises from the need to perform intractable inference or simply from the need to normalize the distribution.

• Intractable inference: inference is discussed mostly in Chapter 19. It regards the question of guessing the probable values of some variables a, given other variables b, with respect to a model that captures the joint distribution between a, b and c. In order to even compute such conditional probabilities one needs to sum over the values of the variables c, as well as compute a normalization constant which sums over the values of a and c.

• Intractable normalization constants (the partition function): the partition function is discussed mostly in Chapter 18. Normalizing constants of probability functions come up in inference (above) as well as in learning. Many probabilistic models involve such a normalizing constant. Unfortunately, learning such a model often requires computing the gradient of the logarithm of the partition function with respect to the model parameters. That computation is generally as intractable as computing the partition function itself. Monte Carlo Markov chain (MCMC) methods (Chapter 17) are often used to deal with the partition function (computing it or its gradient). Unfortunately, MCMC methods suffer when the modes of the model distribution are numerous and well-separated, especially in high-dimensional spaces (Sec. 17.5).

One way to confront these intractable computations is to approximate them, and many approaches have been proposed as discussed in this third part of the book. Another interesting way, also discussed here, would be to avoid these intractable computations altogether by design, and methods that do not require such computations are thus very appealing. Several generative models have been proposed in recent years, with that motivation. A wide variety of contemporary approaches to generative modeling are discussed in Chapter 20.

Part III is the most important for a researcher, someone who wants to understand the breadth of perspectives that have been brought to the field of deep learning, and push the field forward towards true artificial intelligence.
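As a concrete toy illustration of the inference challenge above: even for binary variables, computing p(a | b) by brute force requires summing the joint table over every configuration of the nuisance variables c, and the table itself already has 2^(2+n) entries. The joint distribution below is a random, hypothetical one, used only to show where the exponential sums appear.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discrete joint over binary a, b, and n_c nuisance variables c.
# Storing the table costs 2**(2 + n_c) entries -- already exponential in n_c.
n_c = 10
joint = rng.random(size=(2, 2) + (2,) * n_c)
joint /= joint.sum()                    # normalize so it is a distribution

def conditional_a_given_b(b):
    # p(a | b) needs a sum over all 2**n_c configurations of c for each
    # value of a, plus a normalization over a -- both scale with 2**n_c.
    p_ab = joint[:, b].reshape(2, -1).sum(axis=1)
    return p_ab / p_ab.sum()

p = conditional_a_given_b(1)
print(p, "terms summed per value of a:", 2 ** n_c)
```

With only 10 nuisance variables each conditional costs about a thousand additions; real models have hundreds or thousands of variables, which is why approximate inference is needed.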
Chapter 13
Linear Factor Models

Many of the research frontiers in deep learning involve building a probabilistic model of the input, p_model(x). Such a model can, in principle, use probabilistic inference to predict any of the variables in its environment given any of the other variables. Many of these models also have latent variables h, with p_model(x) = E_h p_model(x | h). These latent variables provide another means of representing the data. Distributed representations based on latent variables can obtain all of the advantages of representation learning that we have seen with deep feedforward and recurrent networks.

In this chapter, we describe some of the simplest probabilistic models with latent variables: linear factor models. These models are sometimes used as building blocks of mixture models (Hinton et al., 1995a; Ghahramani and Hinton, 1996; Roweis et al., 2002) or larger, deep probabilistic models (Tang et al., 2012). They also show many of the basic approaches necessary to build generative models that the more advanced deep models will extend further.

A linear factor model is defined by the use of a stochastic, linear decoder function that generates x by adding noise to a linear transformation of h.

These models are interesting because they allow us to discover explanatory factors that have a simple joint distribution. The simplicity of using a linear decoder made these models some of the first latent variable models to be extensively studied.

A linear factor model describes the data generation process as follows. First, we sample the explanatory factors h from a distribution

h ∼ p(h),    (13.1)

where p(h) is a factorial distribution, with p(h) = ∏_i p(h_i), so that it is easy to
CHAPTER 13. LINEAR FACTOR MODELS
sample from. Next we sample the real-valued observable variables given the factors:

x = W h + b + noise,    (13.2)

where the noise is typically Gaussian and diagonal (independent across dimensions). This is illustrated in Fig. 13.1.
Figure 13.1: The directed graphical model describing the linear factor model family, in which we assume that an observed data vector x is obtained by a linear combination of independent latent factors h, plus some noise. Different models, such as probabilistic PCA, factor analysis or ICA, make different choices about the form of the noise and of the prior p(h).
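The two-step ancestral sampling process of Eqs. 13.1 and 13.2 can be sketched directly. The particular choices below (a unit Gaussian factorial prior, random weights, isotropic noise) are arbitrary illustrative assumptions, not a specific model from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_factors, n_obs = 3, 5

# Illustrative, arbitrary parameters: weights W, offset b, noise scale
W = rng.normal(size=(n_obs, n_factors))
b = rng.normal(size=n_obs)
noise_std = 0.1

def sample_x():
    h = rng.normal(size=n_factors)          # h ~ p(h) = prod_i p(h_i)  (13.1)
    noise = noise_std * rng.normal(size=n_obs)
    return W @ h + b + noise                # x = W h + b + noise       (13.2)

x = sample_x()
print(x)
```

Each call first draws the explanatory factors, then decodes them linearly and adds independent noise; averaging many samples recovers the mean b.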
13.1 Probabilistic PCA and Factor Analysis
Probabilistic PCA (principal components analysis), factor analysis and other linear factor models are special cases of the above equations (13.1 and 13.2) and only differ in the choices made for the model's prior over latent variables h before observing x, and in the noise distributions.

In factor analysis (Bartholomew, 1987; Basilevsky, 1994), the latent variable prior is just the unit variance Gaussian

h ∼ N(h; 0, I),    (13.3)

while the observed variables x_i are assumed to be conditionally independent, given h. Specifically, the noise is assumed to be drawn from a diagonal covariance Gaussian distribution, with covariance matrix ψ = diag(σ²), with σ² = [σ²_1, σ²_2, ..., σ²_n]^T a vector of per-variable variances.

The role of the latent variables is thus to capture the dependencies between the different observed variables x_i. Indeed, it can easily be shown that x is just a multivariate normal random variable, with

x ∼ N(x; b, W W^T + ψ).    (13.4)
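Eq. 13.4 can be checked with a quick Monte Carlo sketch (the particular W, b and per-variable noise variances below are arbitrary): the sample covariance of ancestrally generated data should approach W W^T + ψ.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 500_000, 4, 2

W = rng.normal(size=(d, k))
b = rng.normal(size=d)
sigma2 = np.array([0.1, 0.2, 0.3, 0.4])   # per-variable noise variances

# Ancestral sampling: h ~ N(0, I), then x = W h + b + diagonal Gaussian noise
H = rng.normal(size=(n, k))
X = H @ W.T + b + rng.normal(size=(n, d)) * np.sqrt(sigma2)

emp_cov = np.cov(X, rowvar=False)
pred_cov = W @ W.T + np.diag(sigma2)      # Eq. 13.4: Cov[x] = W W^T + psi
print(np.max(np.abs(emp_cov - pred_cov)))
```

The maximum entry-wise discrepancy shrinks as the number of samples grows, confirming that the latent variables account for all off-diagonal covariance.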
In order to cast PCA in a probabilistic framework, we can make a slight modification to the factor analysis model, making the conditional variances σ²_i equal to each other. In that case the covariance of x is just W W^T + σ²I, where σ² is now a scalar. This yields the conditional distribution

x ∼ N(x; b, W W^T + σ²I),    (13.5)

or equivalently

x = W h + b + σz,    (13.6)

where z ∼ N(z; 0, I) is Gaussian noise. Tipping and Bishop (1999) then show an iterative EM algorithm for estimating the parameters W and σ².

This probabilistic PCA model takes advantage of the observation that most variations in the data can be captured by the latent variables h, up to some small residual reconstruction error σ². As shown by Tipping and Bishop (1999), probabilistic PCA becomes PCA as σ → 0. In that case, the conditional expected value of h given x becomes an orthogonal projection of x − b onto the space spanned by the d columns of W, like in PCA.

As σ → 0, the density model defined by probabilistic PCA becomes very sharp around these d dimensions spanned by the columns of W. This can make the model assign very low likelihood to the data if the data does not actually cluster near a hyperplane.
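The limiting behavior described above can be verified numerically. The posterior mean E[h | x] = (W^T W + σ²I)^{-1} W^T (x − b) is the standard probabilistic PCA result from Tipping and Bishop (1999), stated but not derived in this chapter; as σ² shrinks, mapping it back through W approaches the orthogonal projection of x − b onto the column space of W. The matrices below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 2
W = rng.normal(size=(d, k))
b = rng.normal(size=d)
x = rng.normal(size=d)

def reconstruct(sigma2):
    # Posterior mean of h given x, mapped back to data space
    M = W.T @ W + sigma2 * np.eye(k)
    h_mean = np.linalg.solve(M, W.T @ (x - b))
    return W @ h_mean + b

# Orthogonal projection of x - b onto the column space of W, plus b
P = W @ np.linalg.solve(W.T @ W, W.T)
proj = P @ (x - b) + b

for sigma2 in [1.0, 1e-2, 1e-6]:
    print(sigma2, np.linalg.norm(reconstruct(sigma2) - proj))
```

The gap between the posterior-mean reconstruction and the PCA projection shrinks toward zero as σ² → 0, while a large σ² shrinks the reconstruction toward b.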
13.2 Independent Component Analysis (ICA)
Independent component analysis (ICA) is among the oldest representation learning algorithms (Herault and Ans, 1984; Jutten and Herault, 1991; Comon, 1994; Hyvärinen, 1999; Hyvärinen et al., 2001a; Hinton et al., 2001; Teh et al., 2003). It is an approach to modeling linear factors that seeks to separate an observed signal into many underlying signals that are scaled and added together to form the observed data. These signals are intended to be fully independent, rather than merely decorrelated from each other.¹

¹See Sec. 3.8 for a discussion of the difference between uncorrelated variables and independent variables.

Many different specific methodologies are referred to as ICA. The variant that is most similar to the other generative models we have described here is a variant (Pham et al., 1992) that trains a fully parametric generative model. The prior distribution over the underlying factors, p(h), must be fixed ahead of time by the user. The model then deterministically generates x = W h. We can then perform
a nonlinear change of variables (using Eq. 3.47) to determine p (x ). Learning the mo model del then pro proceeds ceeds as usual, using maximum likelihoo likelihood. d. a nonlinear change of variables (using Eq. 3.47) to determine p (x ). Learning the The motiv motivation ation for this approach is that by cho hoosing osing ) to be indep independen enden endent, t, model then proceeds as usual, using maximum likelihood.p (h we can reco recover ver underlying factors that are as close as possible to independent. The motiv ationused, for this is that by choabstract osing p (hcausal ) to befactors, independen t, This is commonly notapproach to capture high-level but to w e can verelunderlying factors as close as possible to setting, independent. reco recov ver reco lo low-lev w-lev w-level signals that hav havee that beenare mixed together. In this each This is commonly used, not to capture high-level abstract causal factors, but to training example is one moment in time, each x i is one sensor’s observ observation ation of recomixed ver low-lev el signals that e been mixedof together. this setting, h i hav the signals, and each is one estimate one of the In original signals.each For x training example is one moment in time, each is one sensor’s observ ation oft example, we migh mightt hav havee n people sp speaking eaking simulta simultaneously neously neously.. If we hav havee n differen different the mixed signals, each h lo iscations, one estimate of detect one ofthe thechanges originalinsignals. For microphones placedand in different locations, ICA can the volume example, we migh t havas e nheard people eaking simultaneously If we hav e nsignals differensot b et etw ween each sp speaker eaker byspeach microphone, and .separate the microphones in different ICA can clearly detect .the changes in the volume that each hi placed con contains tains only onelopcations, erson sp speaking eaking clearly. 
between each speaker as heard by each microphone, and separate the signals so that each h contains only one person speaking clearly. This is commonly used in neuroscience for electroencephalography, a technology for recording electrical signals originating in the brain. Many electrode sensors placed on the subject's head are used to measure many electrical signals coming from the body. The experimenter is typically only interested in signals from the brain, but signals from the subject's heart and eyes are strong enough to confound measurements taken at the subject's scalp. The signals arrive at the electrodes mixed together, so ICA is necessary to separate the electrical signature of the heart from the signals originating in the brain, and to separate signals in different brain regions from each other.

As mentioned before, many variants of ICA are possible. Some add some noise in the generation of x rather than using a deterministic decoder. Most do not use the maximum likelihood criterion, but instead aim to make the elements of h = W^-1 x independent from each other. Many criteria that accomplish this goal are possible. Eq. 3.47 requires taking the determinant of W, which can be an expensive and numerically unstable operation. Some variants of ICA avoid this problematic operation by constraining W to be orthonormal.

All variants of ICA require that p(h) be non-Gaussian. This is because if p(h) is an independent prior with Gaussian components, then W is not identifiable: we can obtain the same distribution over p(x) for many values of W. This is very different from other linear factor models like probabilistic PCA and factor analysis, that often require p(h) to be Gaussian in order to make many operations on the model have closed form solutions. In the maximum likelihood approach where the user explicitly specifies the distribution, a typical choice is to use p(h_i) = (d/dh_i) σ(h_i). Typical choices of these non-Gaussian distributions have larger peaks near 0 than does the Gaussian distribution, so we can also see most implementations of ICA as learning sparse features.
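The non-identifiability under a Gaussian prior is easy to check numerically. The NumPy sketch below (an illustration, not from the book) shows that replacing W with WQ for a random orthogonal Q leaves the covariance of x = Wh unchanged, and hence leaves the entire Gaussian distribution p(x) unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)

# A mixing matrix W and a random orthogonal matrix Q.
W = rng.normal(size=(3, 3))
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))

# With a Gaussian prior p(h) = N(0, I), x = W h is zero-mean Gaussian with
# covariance W W^T, which fully determines p(x). Replacing W by W Q leaves
# this covariance unchanged, so W cannot be identified from the data.
cov_W = W @ W.T
cov_WQ = (W @ Q) @ (W @ Q).T
print(np.allclose(cov_W, cov_WQ))  # True
```

Any orthogonal Q works here, since (WQ)(WQ)^T = W Q Q^T W^T = W W^T; this is exactly why a non-Gaussian p(h) is needed to pin down W.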
CHAPTER 13. LINEAR FACTOR MODELS
Many variants of ICA are not generative models in the sense that we use the phrase. In this book, a generative model either represents p(x) or can draw samples from it. Many variants of ICA only know how to transform between x and h, but do not have any way of representing p(h), and thus do not impose a distribution over p(x). For example, many ICA variants aim to increase the sample kurtosis of h = W^-1 x, because high kurtosis indicates that p(h) is non-Gaussian, but this is accomplished without explicitly representing p(h). This is because ICA is more often used as an analysis tool for separating signals, rather than for generating data or estimating its density.

Just as PCA can be generalized to the nonlinear autoencoders described in Chapter 14, ICA can be generalized to a nonlinear generative model, in which we use a nonlinear function f to generate the observed data. See Hyvärinen and Pajunen (1999) for the initial work on nonlinear ICA and its successful use with ensemble learning by Roberts and Everson (2001) and Lappalainen et al. (2000). Another nonlinear extension of ICA is the approach of nonlinear independent components estimation, or NICE (Dinh et al., 2014), which stacks a series of invertible transformations (encoder stages) that have the property that the determinant of the Jacobian of each transformation can be computed efficiently. This makes it possible to compute the likelihood exactly and, like ICA, NICE attempts to transform the data into a space where it has a factorized marginal distribution, but is more likely to succeed thanks to the nonlinear encoder. Because the encoder is associated with a decoder that is its perfect inverse, it is straightforward to generate samples from the model (by first sampling from p(h) and then applying the decoder).

Another generalization of ICA is to learn groups of features, with statistical dependence allowed within a group but discouraged between groups (Hyvärinen and Hoyer, 1999; Hyvärinen et al., 2001b). When the groups of related units are chosen to be non-overlapping, this is called independent subspace analysis. It is also possible to assign spatial coordinates to each hidden unit and form overlapping groups of spatially neighboring units. This encourages nearby units to learn similar features. When applied to natural images, this topographic ICA approach learns Gabor filters, such that neighboring features have similar orientation, location or frequency. Many different phase offsets of similar Gabor functions occur within each region, so that pooling over small regions yields translation invariance.
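The structural trick behind NICE can be illustrated with a single additive coupling stage. The sketch below is a simplified illustration (not the full architecture of Dinh et al.): half of the input passes through unchanged, and the other half is shifted by an arbitrary function of the first half. The Jacobian is triangular with unit diagonal, so its determinant is exactly 1, and the inverse is obtained by subtracting the same shift:

```python
import numpy as np

def coupling_forward(x, shift_fn):
    """One NICE-style additive coupling stage (sketch): split x into two
    halves, leave the first half unchanged, and shift the second half by a
    function of the first. The Jacobian is triangular with unit diagonal,
    so its determinant is exactly 1 -- trivially cheap to compute."""
    x1, x2 = np.split(x, 2)
    return np.concatenate([x1, x2 + shift_fn(x1)])

def coupling_inverse(y, shift_fn):
    """Exact inverse of the coupling stage: subtract the same shift."""
    y1, y2 = np.split(y, 2)
    return np.concatenate([y1, y2 - shift_fn(y1)])

shift = lambda a: np.tanh(3.0 * a)  # any function works; it need not be invertible
x = np.array([0.5, -1.2, 2.0, 0.3])
y = coupling_forward(x, shift)
print(np.allclose(coupling_inverse(y, shift), x))  # True: the decoder is the exact inverse
```

Note that the shift function itself never needs to be inverted, which is what lets these stages use arbitrarily complex nonlinearities while keeping both the likelihood and sampling tractable.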
13.3 Slow Feature Analysis
Slow feature analysis (SFA) is a linear factor model that uses information from
time signals to learn invariant features (Wiskott and Sejnowski, 2002).

Slow feature analysis is motivated by a general principle called the slowness principle. The idea is that the important characteristics of scenes change very slowly compared to the individual measurements that make up a description of a scene. For example, in computer vision, individual pixel values can change very rapidly. If a zebra moves from left to right across the image, an individual pixel will rapidly change from black to white and back again as the zebra's stripes pass over the pixel. By comparison, the feature indicating whether a zebra is in the image will not change at all, and the feature describing the zebra's position will change slowly. We therefore may wish to regularize our model to learn features that change slowly over time.

The slowness principle predates slow feature analysis and has been applied to a wide variety of models (Hinton, 1989; Földiák, 1989; Mobahi et al., 2009; Bergstra and Bengio, 2009). In general, we can apply the slowness principle to any differentiable model trained with gradient descent. The slowness principle may be introduced by adding a term to the cost function of the form

    λ Σ_t L(f(x^(t+1)), f(x^(t)))     (13.7)

where λ is a hyperparameter determining the strength of the slowness regularization term, t is the index into a time sequence of examples, f is the feature extractor to be regularized, and L is a loss function measuring the distance between f(x^(t)) and f(x^(t+1)). A common choice for L is the mean squared difference.

Slow feature analysis is a particularly efficient application of the slowness principle. It is efficient because it is applied to a linear feature extractor, and can thus be trained in closed form. Like some variants of ICA, SFA is not quite a generative model per se, in the sense that it defines a linear map between input space and feature space but does not define a prior over feature space and thus does not impose a distribution p(x) on input space.

The SFA algorithm (Wiskott and Sejnowski, 2002) consists of defining f(x; θ) to be a linear transformation, and solving the optimization problem

    min_θ E_t [(f(x^(t+1))_i − f(x^(t))_i)^2]     (13.8)

subject to the constraints

    E_t [f(x^(t))_i] = 0     (13.9)

and

    E_t [f(x^(t))_i^2] = 1.     (13.10)
The constraint that the learned feature have zero mean is necessary to make the problem have a unique solution; otherwise we could add a constant to all feature values and obtain a different solution with equal value of the slowness objective. The constraint that the features have unit variance is necessary to prevent the pathological solution where all features collapse to 0. Like PCA, the SFA features are ordered, with the first feature being the slowest. To learn multiple features, we must also add the constraint

    ∀i < j,  E_t [f(x^(t))_i f(x^(t))_j] = 0.     (13.11)

This specifies that the learned features must be linearly decorrelated from each other. Without this constraint, all of the learned features would simply capture the one slowest signal. One could imagine using other mechanisms, such as minimizing reconstruction error, to force the features to diversify, but this decorrelation mechanism admits a simple solution due to the linearity of SFA features.
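A minimal NumPy sketch of one way to carry this out (an illustration under simplifying assumptions, not the exact published algorithm): whitening the data enforces the zero-mean, unit-variance and decorrelation constraints of Eqs. 13.9-13.11, after which minimizing the slowness objective of Eq. 13.8 reduces to an eigendecomposition of the covariance of the temporal differences:

```python
import numpy as np

def linear_sfa(X, n_features=1):
    """Minimal linear SFA sketch. X has shape (T, d) and is assumed to have
    a full-rank covariance. Whitening makes the outputs zero-mean, unit-
    variance and decorrelated; the eigenvectors of the difference covariance
    with the smallest eigenvalues then give the slowest features."""
    X = X - X.mean(axis=0)
    d, E = np.linalg.eigh(np.cov(X.T))
    Z = X @ E @ np.diag(1.0 / np.sqrt(d))   # whitened signal, identity covariance
    dz = np.diff(Z, axis=0)                 # temporal differences
    _, W = np.linalg.eigh(np.cov(dz.T))     # ascending: slowest directions first
    return Z @ W[:, :n_features]

# Toy data: a slow sine mixed with fast noise; SFA recovers the slow source.
t = np.linspace(0, 4 * np.pi, 500)
slow = np.sin(t)
X = np.column_stack([
    slow + 0.1 * np.random.default_rng(0).normal(size=t.size),
    np.random.default_rng(1).normal(size=t.size),
])
y = linear_sfa(X, n_features=1)[:, 0]
print(abs(np.corrcoef(y, slow)[0, 1]))  # close to 1: the slow source is recovered
```

The sign of each recovered feature is arbitrary (eigenvectors are defined up to sign), which is why the correlation is compared in absolute value.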
The SFA problem may be solved in closed form by a linear algebra package.

SFA is typically used to learn nonlinear features by applying a nonlinear basis expansion to x before running SFA. For example, it is common to replace x by the quadratic basis expansion, a vector containing elements x_i x_j for all i and j. Linear SFA modules may then be composed to learn deep nonlinear slow feature extractors by repeatedly learning a linear SFA feature extractor, applying a nonlinear basis expansion to its output, and then learning another linear SFA feature extractor on top of that expansion.

When trained on small spatial patches of videos of natural scenes, SFA with quadratic basis expansions learns features that share many characteristics with those of complex cells in V1 cortex (Berkes and Wiskott, 2005). When trained on videos of random motion within 3-D computer rendered environments, deep SFA learns features that share many characteristics with the features represented by neurons in rat brains that are used for navigation (Franzius et al., 2007). SFA thus seems to be a reasonably biologically plausible model.

A major advantage of SFA is that it is possible to theoretically predict which features SFA will learn, even in the deep, nonlinear setting. To make such theoretical predictions, one must know about the dynamics of the environment in terms of configuration space (e.g., in the case of random motion in the 3-D rendered environment, the theoretical analysis proceeds from knowledge of the probability distribution over position and velocity of the camera). Given the knowledge of how the underlying factors actually change, it is possible to analytically solve for the optimal functions expressing these factors. In practice, experiments with deep SFA applied to simulated data seem to recover the theoretically predicted functions.
This is in comparison to other learning algorithms where the cost function depends highly on specific pixel values, making it much more difficult to determine what features the model will learn.

Deep SFA has also been used to learn features for object recognition and pose estimation (Franzius et al., 2008). So far, the slowness principle has not become the basis for any state of the art applications. It is unclear what factor has limited its performance. We speculate that perhaps the slowness prior is too strong, and that, rather than imposing a prior that features should be approximately constant, it would be better to impose a prior that features should be easy to predict from one time step to the next. The position of an object is a useful feature regardless of whether the object's velocity is high or low, but the slowness principle encourages the model to ignore the position of objects that have high velocity.
13.4 Sparse Coding
Sparse coding (Olshausen and Field, 1996) is a linear factor model that has been heavily studied as an unsupervised feature learning and feature extraction mechanism. Strictly speaking, the term "sparse coding" refers to the process of inferring the value of h in this model, while "sparse modeling" refers to the process of designing and learning the model, but the term "sparse coding" is often used to refer to both.

Like most other linear factor models, it uses a linear decoder plus noise to obtain reconstructions of x, as specified in Eq. 13.2. More specifically, sparse coding models typically assume that the linear factors have Gaussian noise with isotropic precision β:

    p(x | h) = N(x; Wh + b, (1/β) I).     (13.12)

The distribution p(h) is chosen to be one with sharp peaks near 0 (Olshausen and Field, 1996). Common choices include factorized Laplace, Cauchy or factorized Student-t distributions. For example, the Laplace prior parametrized in terms of the sparsity penalty coefficient λ is given by

    p(h_i) = Laplace(h_i; 0, 2/λ) = (λ/4) e^(−(1/2) λ |h_i|)     (13.13)

and the Student-t prior by

    p(h_i) ∝ 1 / (1 + h_i^2 / ν)^((ν+1)/2).     (13.14)
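As a quick numerical sanity check (an illustration, not from the book), the Laplace prior of Eq. 13.13 is a properly normalized density, and it is more sharply peaked at 0 than a Gaussian of the same variance, which is what makes it sparsity-inducing:

```python
import numpy as np

lam = 2.0
h, dh = np.linspace(-40, 40, 400_001, retstep=True)
laplace = (lam / 4) * np.exp(-(lam / 2) * np.abs(h))    # the prior of Eq. 13.13
print(round(float((laplace * dh).sum()), 3))            # 1.0: properly normalized
var = float((h**2 * laplace * dh).sum())                # variance of Laplace(0, 2/lam) is 8/lam^2
gauss_peak = 1.0 / np.sqrt(2 * np.pi * var)             # equal-variance Gaussian, density at 0
print(laplace[h.size // 2] > gauss_peak)                # True: sharper peak near zero
```

The same comparison holds for the Student-t prior; heavier tails and a sharper central peak are the qualitative signature of all the priors used in sparse coding.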
Training sparse coding with maximum likelihood is intractable. Instead, the training alternates between encoding the data and training the decoder to better reconstruct the data given the encoding. This approach will be justified further as a principled approximation to maximum likelihood later, in Sec. 19.3.

For models such as PCA, we have seen the use of a parametric encoder function that predicts h and consists only of multiplication by a weight matrix. The encoder that we use with sparse coding is not a parametric encoder. Instead, the encoder is an optimization algorithm that solves an optimization problem in which we seek the single most likely code value:

    h* = f(x) = arg max_h p(h | x).     (13.15)

When combined with Eq. 13.13 and Eq. 13.12, this yields the following optimization problem:

    arg max_h p(h | x)     (13.16)
    = arg max_h log p(h | x)     (13.17)
    = arg min_h λ ||h||_1 + β ||x − Wh||_2^2,     (13.18)

where we have dropped terms not depending on h and divided by positive scaling factors to simplify the equation.

Due to the imposition of an L^1 norm on h, this procedure will yield a sparse h* (see Sec. 7.1.2).

To train the model rather than just perform inference, we alternate between minimization with respect to h and minimization with respect to W. In this presentation, we treat β as a hyperparameter. Typically it is set to 1 because its role in this optimization problem is shared with λ and there is no need for both hyperparameters. In principle, we could also treat β as a parameter of the model and learn it. Our presentation here has discarded some terms that do not depend on h but do depend on β. To learn β, these terms must be included, or β will collapse to 0.

Not all approaches to sparse coding explicitly build a p(h) and a p(x | h).
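The encoding problem of Eq. 13.18 is convex and can be solved by any iterative sparse solver. The NumPy sketch below uses ISTA (proximal gradient descent), one common choice; the book does not prescribe a particular algorithm, and the dictionary and coefficients here are made-up toy values:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the L1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code(x, W, lam=0.1, beta=1.0, n_steps=200):
    """Minimize  lam * ||h||_1 + beta * ||x - W h||_2^2  over h (Eq. 13.18)
    with ISTA: a gradient step on the smooth reconstruction term followed
    by soft-thresholding for the L1 term."""
    h = np.zeros(W.shape[1])
    step = 1.0 / (2 * beta * np.linalg.norm(W, 2) ** 2)  # 1 / Lipschitz constant
    for _ in range(n_steps):
        grad = -2 * beta * W.T @ (x - W @ h)
        h = soft_threshold(h - step * grad, step * lam)
    return h

rng = np.random.default_rng(0)
W = rng.normal(size=(20, 50))            # overcomplete toy dictionary
h_true = np.zeros(50)
h_true[[3, 17, 41]] = [1.5, -2.0, 1.0]   # a sparse ground-truth code
x = W @ h_true
h = sparse_code(x, W)
print(np.count_nonzero(np.abs(h) > 1e-3))  # number of active code elements
```

Because the problem is convex, running the iteration longer only brings h closer to the unique optimum; this is the "encoder as optimization algorithm" described above.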
Often we are just interested in learning a dictionary of features with activation values that will often be zero when extracted using this inference procedure.

If we sample h from a Laplace prior, it is in fact a zero probability event for an element of h to actually be zero. The generative model itself is not especially sparse, only the feature extractor is. Goodfellow et al. (2013d) describe approximate
inference in a different model family, the spike and slab sparse coding model, for which samples from the prior usually contain true zeros.

The sparse coding approach combined with the use of the non-parametric encoder can in principle minimize the combination of reconstruction error and log-prior better than any specific parametric encoder. Another advantage is that there is no generalization error to the encoder. A parametric encoder must learn how to map x to h in a way that generalizes. For unusual x that do not resemble the training data, a learned, parametric encoder may fail to find an h that results in accurate reconstruction or a sparse code. For the vast majority of formulations of sparse coding models, where the inference problem is convex, the optimization procedure will always find the optimal code (unless degenerate cases such as replicated weight vectors occur). Obviously, the sparsity and reconstruction costs can still rise on unfamiliar points, but this is due to generalization error in the decoder weights, rather than generalization error in the encoder. The lack of generalization error in sparse coding's optimization-based encoding process may result in better generalization when sparse coding is used as a feature extractor for a classifier than when a parametric function is used to predict the code. Coates and Ng (2011) demonstrated that sparse coding features generalize better for object recognition tasks than the features of a related model based on a parametric encoder, the linear-sigmoid autoencoder. Inspired by their work, Goodfellow et al. (2013d) showed that a variant of sparse coding generalizes better than other feature extractors in the regime where extremely few labels are available (twenty or fewer labels per class).

The primary disadvantage of the non-parametric encoder is that it requires greater time to compute h given x because the non-parametric approach requires running an iterative algorithm. The parametric autoencoder approach, developed in Chapter 14, uses only a fixed number of layers, often only one. Another disadvantage is that it is not straightforward to back-propagate through the non-parametric encoder, which makes it difficult to pretrain a sparse coding model with an unsupervised criterion and then fine-tune it using a supervised criterion. Modified versions of sparse coding that permit approximate derivatives do exist but are not widely used (Bagnell and Bradley, 2009).

Sparse coding, like other linear factor models, often produces poor samples, as shown in Fig. 13.2. This happens even when the model is able to reconstruct the data well and provide useful features for a classifier. The reason is that each individual feature may be learned well, but the factorial prior on the hidden code results in the model including random subsets of all of the features in each generated sample. This motivates the development of deeper models that can impose a non-
CHAPTER 13. LINEAR FACTOR MODELS
Figure 13.2: Example samples and weights from a spike and slab sparse coding model trained on the MNIST dataset. (Left) The samples from the model do not resemble the training examples. At first glance, one might assume the model is poorly fit. (Right) The weight vectors of the model have learned to represent penstrokes and sometimes complete digits. The model has thus learned useful features. The problem is that the factorial prior over features results in random subsets of features being combined. Few such subsets are appropriate to form a recognizable MNIST digit. This motivates the development of generative models that have more powerful distributions over their latent codes. Figure reproduced with permission from Goodfellow et al. (2013d).
factorial distribution on the deepest code layer, as well as the development of more sophisticated shallow models.
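The optimization-based, non-parametric encoding discussed in this section can be made concrete with a short iterative loop. Below is a minimal ISTA-style sketch in NumPy; the dictionary, penalty weight, step count, and toy data are illustrative assumptions, not values from the text.

```python
import numpy as np

def soft_threshold(v, t):
    # Elementwise proximal operator of t * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_encode(W, x, lam=0.05, n_steps=1000):
    """Minimize 0.5 * ||x - W h||^2 + lam * ||h||_1 over the code h.

    Unlike a parametric encoder, this runs an optimization loop for
    every input x, which is why inference is comparatively slow."""
    h = np.zeros(W.shape[1])
    step = 1.0 / np.linalg.norm(W, 2) ** 2   # 1 / Lipschitz constant of the gradient
    for _ in range(n_steps):
        grad = W.T @ (W @ h - x)             # gradient of the reconstruction term
        h = soft_threshold(h - step * grad, step * lam)
    return h

rng = np.random.default_rng(0)
W = rng.standard_normal((20, 50))            # toy overcomplete dictionary
h_true = np.zeros(50)
h_true[[3, 17]] = [1.0, -2.0]                # a genuinely sparse code
x = W @ h_true
h = ista_encode(W, x)
print("reconstruction error:", np.linalg.norm(x - W @ h))
print("nonzero code entries:", int(np.sum(np.abs(h) > 1e-3)))
```

The loop above also illustrates why back-propagating through such an encoder is awkward: the code is the fixed point of an iteration, not the output of a fixed computational graph.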
13.5
Manifold Interpretation of PCA
Linear factor models including PCA and factor analysis can be interpreted as learning a manifold (Hinton et al., 1997). We can view probabilistic PCA as defining a thin pancake-shaped region of high probability: a Gaussian distribution that is very narrow along some axes, just as a pancake is very flat along its vertical axis, but is elongated along other axes, just as a pancake is wide along its horizontal axes. This is illustrated in Fig. 13.3. PCA can be interpreted as aligning this pancake with a linear manifold in a higher-dimensional space. This interpretation applies not just to traditional PCA but also to any linear autoencoder that learns matrices W and V with the goal of making the reconstruction of x lie as close to x as possible.

Let the encoder be

h = f(x) = W^T (x − µ).    (13.19)
The encoder computes a low-dimensional representation h. With the autoencoder view, we have a decoder computing the reconstruction

x̂ = g(h) = b + V h.    (13.20)
Figure 13.3: Flat Gaussian capturing probability concentration near a low-dimensional manifold. The figure shows the upper half of the "pancake" above the "manifold plane" which goes through its middle. The variance in the direction orthogonal to the manifold is very small (arrow pointing out of plane) and can be considered like "noise," while the other variances are large (arrows in the plane) and correspond to "signal," and a coordinate system for the reduced-dimension data.

The choices of linear encoder and decoder that minimize reconstruction error
E[||x − x̂||²]    (13.21)

correspond to V = W, µ = b = E[x], and the columns of W form an orthonormal basis which spans the same subspace as the principal eigenvectors of the covariance matrix

C = E[(x − µ)(x − µ)^T].    (13.22)

In the case of PCA, the columns of W are these eigenvectors, ordered by the magnitude of the corresponding eigenvalues (which are all real and non-negative).

One can also show that eigenvalue λ_i of C corresponds to the variance of x in the direction of eigenvector v^(i). If x ∈ R^D and h ∈ R^d with d < D, then the
optimal reconstruction error (choosing µ, b, V and W as above) is

min E[||x − x̂||²] = Σ_{i=d+1}^{D} λ_i.    (13.23)

Hence, if the covariance has rank d, the eigenvalues λ_{d+1} to λ_D are 0 and reconstruction error is 0.

Furthermore, one can also show that the above solution can be obtained by maximizing the variances of the elements of h, under orthonormal W, instead of minimizing reconstruction error.

Linear factor models are some of the simplest generative models and some of the simplest models that learn a representation of data. Much as linear classifiers and linear regression models may be extended to deep feedforward networks, these linear factor models may be extended to autoencoder networks and deep probabilistic models that perform the same tasks but with a much more powerful and flexible model family.
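The identities in Eqs. 13.19 through 13.23 are easy to check numerically. Below is a small sketch; the synthetic Gaussian data and the dimensions are illustrative assumptions. The mean reconstruction error of the top-d principal subspace should match the sum of the discarded eigenvalues of the empirical covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, n = 5, 2, 10000

# Synthetic data with unequal variances along random directions (illustrative).
A = rng.standard_normal((D, D))
X = rng.standard_normal((n, D)) @ A.T

mu = X.mean(axis=0)
C = np.cov(X.T, bias=True)                    # empirical covariance, cf. Eq. 13.22
eigvals, eigvecs = np.linalg.eigh(C)          # ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

W = eigvecs[:, :d]                            # top-d principal eigenvectors
h = (X - mu) @ W                              # encoder: h = W^T (x - mu), Eq. 13.19
X_hat = mu + h @ W.T                          # decoder with V = W, b = mu, Eq. 13.20

mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print("mean reconstruction error:  ", mse)
print("sum of discarded eigenvalues:", eigvals[d:].sum())   # Eq. 13.23
```

With the empirical mean and covariance used on both sides, the two printed quantities agree up to floating-point error.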
Chapter 14
Autoencoders

An autoencoder is a neural network that is trained to attempt to copy its input to its output. Internally, it has a hidden layer h that describes a code used to represent the input. The network may be viewed as consisting of two parts: an encoder function h = f(x) and a decoder that produces a reconstruction r = g(h). This architecture is presented in Fig. 14.1. If an autoencoder succeeds in simply learning to set g(f(x)) = x everywhere, then it is not especially useful. Instead, autoencoders are designed to be unable to learn to copy perfectly. Usually they are restricted in ways that allow them to copy only approximately, and to copy only input that resembles the training data. Because the model is forced to prioritize which aspects of the input should be copied, it often learns useful properties of the data.
which aspects of the input should be copied, it often learns useful properties of the Mo Modern dern auto autoenco enco encoders ders hav havee generalized the idea of an enco encoder der and a dedata. co coder der beyond deterministic functions to sto stochastic chastic mappings pencoder (h | x) and Modern pdecoder (x | hauto ). encoders have generalized the idea of an encoder and a de(h x) and coder beyond deterministic functions to stochastic mappings p The idea of auto autoenco enco encoders ders has b een part of the historical landscape of neural p (x h). | net networks works for decades (LeCun, 1987; Bourlard and Kamp, 1988; Hinton and Zemel, | of autoencoders has been part of the historical landscape of neural The idea 1994 1994). ). T raditionally raditionally, , auto autoenco enco encoders ders w were ere used for dimensionalit dimensionality y reduction or net works for decades ( LeCun , 1987 ; Bourlard and Kamp , 1988 ; Hinton and Zemel feature learning. Recen Recently tly tly,, theoretical connections betw between een auto autoenco enco encoders ders and, 1994 ). Traditionally , auto ders auto wereenco usedders fortodimensionalit reduction or laten latent t variable models hav haveeenco brought autoenco encoders the forefronty of generative feature learning. Recen tlyChapter , theoretical connections autoenco and mo modeling, deling, as we will see in 20. Auto Autoenco enco encoders ders betw ma may y een be thought ofders as being laten t v ariable models hav e brought auto enco ders to the forefront of generative a special case of feedforward netw networks, orks, and ma may y be trained with all of the same mo deling, as w e will see in Chapter 20 . Auto enco may begradients thought of as being tec techniques, hniques, typically minibatc minibatch h gradient descentders following computed ayspecial case of feedforward orks,feedforw and maard y benetw trained all of ders the same b bac back-propagation. k-propagation. 
Unlike general feedforward networks, autoencoders may also be trained using recirculation (Hinton and McClelland, 1988), a learning algorithm based on comparing the activations of the network on the original input
to the activations on the reconstructed input. Recirculation is regarded as more biologically plausible than back-propagation, but is rarely used for machine learning applications.
Figure 14.1: The general structure of an autoencoder, mapping an input x to an output (called reconstruction) r through an internal representation or code h. The autoencoder has two components: the encoder f (mapping x to h) and the decoder g (mapping h to r).
14.1
Undercomplete Autoencoders
Copying the input to the output may sound useless, but we are typically not interested in the output of the decoder. Instead, we hope that training the autoencoder to perform the input copying task will result in h taking on useful properties.

One way to obtain useful features from the autoencoder is to constrain h to have smaller dimension than x. An autoencoder whose code dimension is less than the input dimension is called undercomplete. Learning an undercomplete representation forces the autoencoder to capture the most salient features of the training data.

The learning process is described simply as minimizing a loss function

L(x, g(f(x)))    (14.1)

where L is a loss function penalizing g(f(x)) for being dissimilar from x, such as the mean squared error.

When the decoder is linear and L is the mean squared error, an undercomplete autoencoder learns to span the same subspace as PCA.
In this case, an autoencoder trained to perform the copying task has learned the principal subspace of the training data as a side-effect.

Autoencoders with nonlinear encoder functions f and nonlinear decoder functions g can thus learn a more powerful nonlinear generalization of PCA. Unfortu-
nately, if the encoder and decoder are allowed too much capacity, the autoencoder can learn to perform the copying task without extracting useful information about the distribution of the data. Theoretically, one could imagine that an autoencoder with a one-dimensional code but a very powerful nonlinear encoder could learn to represent each training example x^(i) with the code i. The decoder could learn to map these integer indices back to the values of specific training examples. This specific scenario does not occur in practice, but it illustrates clearly that an autoencoder trained to perform the copying task can fail to learn anything useful about the dataset if the capacity of the autoencoder is allowed to become too great.
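As a concrete illustration of minimizing Eq. 14.1 with a linear decoder and mean squared error, the sketch below trains a tiny undercomplete linear autoencoder by gradient descent. The sizes, learning rate, step count, and synthetic near-low-rank data are all illustrative assumptions; the point is only that minimizing the reconstruction loss drives the code to capture the dominant subspace of the data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, d = 500, 8, 3

# Data concentrated near a 3-dimensional linear subspace (illustrative).
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, D))
X += 0.01 * rng.standard_normal((n, D))

W = 0.1 * rng.standard_normal((D, d))   # encoder weights: h = f(x) = W^T x
V = 0.1 * rng.standard_normal((D, d))   # decoder weights: x_hat = g(h) = V h

lr = 0.01
for step in range(5000):
    H = X @ W                            # encode the whole batch
    X_hat = H @ V.T                      # decode
    E = X_hat - X                        # residual of L(x, g(f(x))), Eq. 14.1
    grad_V = E.T @ H / n                 # gradient of the mean squared error
    grad_W = X.T @ (E @ V) / n
    V -= lr * grad_V
    W -= lr * grad_W

mse = np.mean(np.sum((X @ W @ V.T - X) ** 2, axis=1))
print("final reconstruction error:", mse)
```

Because d < D, the trained W and V span (approximately) the principal subspace of the data, consistent with the PCA connection noted above.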
14.2
Regularized Autoencoders
Undercomplete autoencoders, with code dimension less than the input dimension, can learn the most salient features of the data distribution. We have seen that these autoencoders fail to learn anything useful if the encoder and decoder are given too much capacity.

A similar problem occurs if the hidden code is allowed to have dimension equal to the input, and in the overcomplete case in which the hidden code has dimension greater than the input. In these cases, even a linear encoder and linear decoder can learn to copy the input to the output without learning anything useful about the data distribution.

Ideally, one could train any architecture of autoencoder successfully, choosing the code dimension and the capacity of the encoder and decoder based on the complexity of distribution to be modeled. Regularized autoencoders provide the ability to do so.
Rather than limiting the model capacity by keeping the encoder and decoder shallow and the code size small, regularized autoencoders use a loss function that encourages the model to have other properties besides the ability to copy its input to its output. These other properties include sparsity of the representation, smallness of the derivative of the representation, and robustness to noise or to missing inputs. A regularized autoencoder can be nonlinear and overcomplete but still learn something useful about the data distribution even if the model capacity is great enough to learn a trivial identity function.

In addition to the methods described here which are most naturally interpreted
as regularized autoencoders, nearly any generative model with latent variables and equipped with an inference procedure (for computing latent representations given input) may be viewed as a particular form of autoencoder. Two generative modeling approaches that emphasize this connection with autoencoders are the descendants of the Helmholtz machine (Hinton et al., 1995b), such as the variational
autoencoder (Sec. 20.10.3) and the generative stochastic networks (Sec. 20.12). These models naturally learn high-capacity, overcomplete encodings of the input and do not require regularization for these encodings to be useful. Their encodings are naturally useful because the models were trained to approximately maximize the probability of the training data rather than to copy the input to the output.
14.2.1
Sparse Autoencoders
A sparse autoencoder is simply an autoencoder whose training criterion involves a sparsity penalty Ω(h) on the code layer h, in addition to the reconstruction error:

L(x, g(f(x))) + Ω(h)    (14.2)

where g(h) is the decoder output and typically we have h = f(x), the encoder output.

Sparse autoencoders are typically used to learn features for another task such as classification. An autoencoder that has been regularized to be sparse must respond to unique statistical features of the dataset it has been trained on, rather than simply acting as an identity function. In this way, training to perform the copying task with a sparsity penalty can yield a model that has learned useful features as a byproduct.
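The criterion in Eq. 14.2 can be sketched as follows, using squared error for L and an L1 penalty for Ω (the absolute value form discussed later in this section). The tied-weight ReLU architecture, the sizes, and the training constants here are illustrative assumptions, not the book's prescription; the sketch only demonstrates that a larger penalty weight leaves fewer code units active.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, d = 200, 6, 12                    # overcomplete code (illustrative sizes)
X = rng.standard_normal((n, D))

def train(lam, steps=3000, lr=0.01, seed=1):
    """Minimize the batch mean of L(x, g(f(x))) + Omega(h), Eq. 14.2, with
    h = relu(W^T x), g(h) = W h (tied weights), Omega(h) = lam * ||h||_1."""
    r = np.random.default_rng(seed)
    W = 0.1 * r.standard_normal((D, d))
    for _ in range(steps):
        H = np.maximum(X @ W, 0.0)      # encoder outputs for the whole batch
        E = H @ W.T - X                 # reconstruction residuals
        dH = (E @ W + lam * np.sign(H)) * (H > 0)   # backprop through relu
        W -= lr * (E.T @ H + X.T @ dH) / n
    H = np.maximum(X @ W, 0.0)
    return float(np.mean(H > 1e-3))     # fraction of active code units

print("active fraction, lam=0.0:", train(0.0))
print("active fraction, lam=0.5:", train(0.5))
```

With the penalty switched on, units whose activity is not worth their sparsity cost are driven to zero, illustrating how the penalized copying task yields sparse features as a byproduct.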
We can think of the penalty Ω(h) simply as a regularizer term added to a feedforward network whose primary task is to copy the input to the output (unsupervised learning objective) and possibly also perform some supervised task (with a supervised learning objective) that depends on these sparse features. Unlike other regularizers such as weight decay, there is not a straightforward Bayesian interpretation to this regularizer. As described in Sec. 5.6.1, training with weight decay and other regularization penalties can be interpreted as a MAP approximation to Bayesian inference, with the added regularizing penalty corresponding to a prior probability distribution over the model parameters. In this view, regularized maximum likelihood corresponds to maximizing p(θ | x), which is equivalent to maximizing log p(x | θ) + log p(θ).
The log p(x | θ) term is the usual data log-likelihood term and the log p(θ) term, the log-prior over parameters, incorporates the preference over particular values of θ. This view was described in Sec. 5.6. Regularized autoencoders defy such an interpretation because the regularizer depends on the data and is therefore by definition not a prior in the formal sense of the word. We can still think of these regularization terms as implicitly expressing a preference over functions.

Rather than thinking of the sparsity penalty as a regularizer for the copying task, we can think of the entire sparse autoencoder framework as approximating
maximum likelihood training of a generative model that has latent variables. Suppose we have a model with visible variables x and latent variables h, with an explicit joint distribution p_model(x, h) = p_model(h) p_model(x | h). We refer to p_model(h) as the model's prior distribution over the latent variables, representing the model's beliefs prior to seeing x. This is different from the way we have previously used the word "prior," to refer to the distribution p(θ) encoding our beliefs about the model's parameters before we have seen the training data. The log-likelihood can be decomposed as

log p_model(x) = log Σ_h p_model(h, x).    (14.3)
We can think of the autoencoder as approximating this sum with a point estimate for just one highly likely value for h. This is similar to the sparse coding generative model (Sec. 13.4), but with h being the output of the parametric encoder rather than the result of an optimization that infers the most likely h. From this point of view, with this chosen h, we are maximizing

log p_model(h, x) = log p_model(h) + log p_model(x | h).    (14.4)
(h,bxe) sparsity-inducing. p = log p (h) + log (x hthe ). Laplace (14.4) The log p model (hlog ) term can Forpexample, prior, | λ −λ|hiF| or example, the Laplace prior, The log p (h) term can be sparsity-inducing. pmodel(hi ) = e , (14.5) 2 λ p ( h ) = e corresp corresponds onds to an absolute value sparsity p2enalty enalty. . ,Expressing the log-prior(14.5) as an absolute value penalty enalty,, we obtain corresponds to an absolute value sparsity penalty. Expressing the log-prior as an X absolute value penalty, Ω( wehobtain )=λ |h i | (14.6) i
Ω(h) = X λ h (14.6) λ − log pmodel (h) = λ | h | − log = Ω( h ) + const (14.7) i | | 2 i λ log p λh log (h) = = Ω(h) + const (14.7) 2not h. We typically treat λ as a where the constan constant t term dep depends ends only on and λ X | |− − hyp yperparameter erparameter and discard the constan constantt term since it do does es not affect the parameter where the constan t term dep ends only on and not . We induce typically treat .λFas a λ h learning. Other priors suc such h as the Studen Studenttt-tt prior can also sparsity sparsity. rom hyp erparameter discard as theresulting constantfrom termthe since it do not affect parameter X pmodel (h) onthe this poin ointt of viewand of sparsity effect ofes approximate learning. priors such as the the sparsity Student-pt enalty prior can alsoa induce sparsityterm . From maxim maximum umOther likelihoo likelihood d learning, is not regularization at p ( h this p oin t of view of sparsity as resulting from the effect of ) on approximate all. It is just a consequence of the model’s distribution over its laten latentt variables. maxim um likelihoo d learning, the sparsity p enalty is not a regularization This view pro provides vides a different motiv motivation ation for training an auto autoenco enco encoder: der: it isterm a waat y all. It is just a consequence of the model’s distribution o ver its laten t v ariables. of approximately training a generative mo model. del. It also provides a different reason for This view provides a different motivation for training an autoencoder: it is a way of approximately training a generative mo del. It also provides a different reason for 509
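The equivalence in Eq. 14.7 between the Laplace log-prior and the absolute value penalty is easy to check numerically. A minimal sketch (the helper names are ours, not from the text):

```python
import numpy as np

def laplace_neg_log_prior(h, lam):
    """-log p_model(h) for a factorial Laplace prior
    p_model(h_i) = (lam / 2) * exp(-lam * |h_i|)  (Eq. 14.5)."""
    return np.sum(lam * np.abs(h) - np.log(lam / 2.0))

def sparsity_penalty(h, lam):
    """Omega(h) = lam * sum_i |h_i|  (Eq. 14.6)."""
    return lam * np.sum(np.abs(h))

h = np.array([0.5, -1.2, 0.0, 2.0])
lam = 0.1
const = -h.size * np.log(lam / 2.0)  # depends only on lam, not on h (Eq. 14.7)
assert np.isclose(laplace_neg_log_prior(h, lam), sparsity_penalty(h, lam) + const)
```

Because the constant does not involve h, discarding it leaves the gradients with respect to the code (and hence the learned parameters) unchanged.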
CHAPTER 14. AUTOENCODERS
why the features learned by the autoencoder are useful: they describe the latent variables that explain the input.

Early work on sparse autoencoders (Ranzato et al., 2007a, 2008) explored various forms of sparsity and proposed a connection between the sparsity penalty and the log Z term that arises when applying maximum likelihood to an undirected probabilistic model p(x) = (1/Z) p̃(x). The idea is that minimizing log Z prevents a probabilistic model from having high probability everywhere, and imposing sparsity on an autoencoder prevents the autoencoder from having low reconstruction error everywhere. In this case, the connection is on the level of an intuitive understanding of a general mechanism rather than a mathematical correspondence. The interpretation of the sparsity penalty as corresponding to log p_model(h) in a directed model p_model(h) p_model(x | h) is more mathematically straightforward.

One way to achieve actual zeros in h for sparse (and denoising) autoencoders was introduced in Glorot et al. (2011b). The idea is to use rectified linear units to produce the code layer. With a prior that actually pushes the representations to zero (like the absolute value penalty), one can thus indirectly control the average number of zeros in the representation.
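The effect of the rectifier is worth making explicit: any code unit whose pre-activation is negative emits a value that is exactly zero, not merely small. A minimal sketch:

```python
import numpy as np

def relu_code(pre_activation):
    """Rectified linear code layer: negative pre-activations become exact zeros."""
    return np.maximum(0.0, pre_activation)

# Hypothetical pre-activations of a five-unit code layer.
pre = np.array([-1.3, 0.7, -0.2, 2.1, -0.5])
h = relu_code(pre)
# Three of the five code units are exactly 0.0. A penalty that pushes
# pre-activations below zero therefore directly increases the number of zeros.
```

This is why an absolute value penalty combined with rectified linear units yields genuinely sparse codes, rather than codes with many small but nonzero entries.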
14.2.2
Denoising Autoencoders
Rather than adding a penalty Ω to the cost function, we can obtain an autoencoder that learns something useful by changing the reconstruction error term of the cost function.

Traditionally, autoencoders minimize some function

L(x, g(f(x)))    (14.8)

where L is a loss function penalizing g(f(x)) for being dissimilar from x, such as the L² norm of their difference. This encourages g ∘ f to learn to be merely an identity function if they have the capacity to do so.

A denoising autoencoder or DAE instead minimizes

L(x, g(f(x̃))),    (14.9)

where x̃ is a copy of x that has been corrupted by some form of noise. Denoising autoencoders must therefore undo this corruption rather than simply copying their input.

Denoising training forces f and g to implicitly learn the structure of p_data(x), as shown by Alain and Bengio (2013) and Bengio et al. (2013c).
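The contrast between Eq. 14.8 and Eq. 14.9 can be made concrete. In the sketch below (additive Gaussian noise is one common corruption choice; the function names are ours), an identity encoder-decoder pair achieves zero plain reconstruction loss but cannot achieve zero denoising loss:

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, sigma=0.2):
    """One common corruption process C(x~ | x): additive isotropic Gaussian noise."""
    return x + sigma * rng.standard_normal(x.shape)

def reconstruction_loss(x, f, g):
    """Plain autoencoder loss L(x, g(f(x))) with squared-error L (Eq. 14.8)."""
    return np.sum((g(f(x)) - x) ** 2)

def denoising_loss(x, f, g):
    """DAE loss L(x, g(f(x~))): reconstruct the CLEAN x from a corrupted input (Eq. 14.9)."""
    return np.sum((g(f(corrupt(x))) - x) ** 2)

identity = lambda v: v
x = np.ones(4)
assert reconstruction_loss(x, identity, identity) == 0.0  # identity is optimal here
assert denoising_loss(x, identity, identity) > 0.0        # but not here
```

Because copying the (corrupted) input is no longer optimal, the DAE must exploit statistical structure in the data to map x̃ back toward x.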
Denoising autoencoders thus provide yet another example of how useful properties can emerge as a byproduct of minimizing reconstruction error. They are also an example of how overcomplete, high-capacity models may be used as autoencoders so long as care is taken to prevent them from learning the identity function. Denoising autoencoders are presented in more detail in Sec. 14.5.
14.2.3
Regularizing by Penalizing Derivatives
Another strategy for regularizing an autoencoder is to use a penalty Ω as in sparse autoencoders,

L(x, g(f(x))) + Ω(h, x),    (14.10)

but with a different form of Ω:

Ω(h, x) = λ Σ_i ||∇_x h_i||².    (14.11)

This forces the model to learn a function that does not change much when x changes slightly. Because this penalty is applied only at training examples, it forces the autoencoder to learn features that capture information about the training distribution.

An autoencoder regularized in this way is called a contractive autoencoder or CAE. This approach has theoretical connections to denoising autoencoders, manifold learning and probabilistic modeling. The CAE is described in more detail in Sec. 14.7.
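For a one-layer sigmoid encoder h = sigmoid(Wx + b), the penalty in Eq. 14.11 is λ times the squared Frobenius norm of the encoder Jacobian, which is available in closed form. A minimal sketch, assuming this particular encoder parametrization:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_penalty(W, b, x, lam):
    """Omega(h, x) = lam * sum_i ||grad_x h_i||^2 (Eq. 14.11) for h = sigmoid(W x + b).
    Row i of the encoder Jacobian J is grad_x h_i = h_i * (1 - h_i) * W[i, :]."""
    h = sigmoid(W @ x + b)
    J = (h * (1.0 - h))[:, None] * W  # shape (code_dim, input_dim)
    return lam * np.sum(J ** 2)
```

Note that the penalty differentiates h with respect to the input x, not the parameters; during training the penalty itself must then be differentiated with respect to W and b, which is cheap for one layer but grows costly for deep encoders.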
14.3
Representational Power, Layer Size and Depth
Autoencoders are often trained with only a single layer encoder and a single layer decoder. However, this is not a requirement. In fact, using deep encoders and decoders offers many advantages.

Recall from Sec. 6.4.1 that there are many advantages to depth in a feedforward network. Because autoencoders are feedforward networks, these advantages also apply to autoencoders. Moreover, the encoder is itself a feedforward network as is the decoder, so each of these components of the autoencoder can individually benefit from depth.

One major advantage of non-trivial depth is that the universal approximator theorem guarantees that a feedforward neural network with at least one hidden layer can represent an approximation of any function (within a broad class) to an
arbitrary degree of accuracy, provided that it has enough hidden units. This means that an autoencoder with a single hidden layer is able to represent the identity function along the domain of the data arbitrarily well. However, the mapping from input to code is shallow. This means that we are not able to enforce arbitrary constraints, such as that the code should be sparse. A deep autoencoder, with at least one additional hidden layer inside the encoder itself, can approximate any mapping from input to code arbitrarily well, given enough hidden units.

Depth can exponentially reduce the computational cost of representing some functions. Depth can also exponentially decrease the amount of training data needed to learn some functions. See Sec. 6.4.1 for a review of the advantages of depth in feedforward networks.

Experimentally, deep autoencoders yield much better compression than corresponding shallow or linear autoencoders (Hinton and Salakhutdinov, 2006).

A common strategy for training a deep autoencoder is to greedily pretrain the deep architecture by training a stack of shallow autoencoders, so we often encounter shallow autoencoders, even when the ultimate goal is to train a deep autoencoder.
14.4
Stochastic Encoders and Decoders
Autoencoders are just feedforward networks. The same loss functions and output unit types that can be used for traditional feedforward networks are also used for autoencoders.

As described in Sec. 6.2.2.4, a general strategy for designing the output units and the loss function of a feedforward network is to define an output distribution p(y | x) and minimize the negative log-likelihood −log p(y | x). In that setting, y was a vector of targets, such as class labels.

In the case of an autoencoder, x is now the target as well as the input. However, we can still apply the same machinery as before. Given a hidden code h, we may think of the decoder as providing a conditional distribution p_decoder(x | h). We may then train the autoencoder by minimizing −log p_decoder(x | h). The exact form of this loss function will change depending on the form of p_decoder. As with traditional feedforward networks, we usually use linear output units to parametrize the mean of a Gaussian distribution if x is real-valued. In that case, the negative log-likelihood yields a mean squared error criterion. Similarly, binary x values correspond to a Bernoulli distribution whose parameters are given by a sigmoid output unit, discrete x values correspond to a softmax distribution, and so on.
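These correspondences can be written out directly: for a unit-variance Gaussian the negative log-likelihood reduces to half the squared error plus a constant, and for a factorial Bernoulli it is the familiar cross-entropy. A minimal sketch (the function names are ours):

```python
import numpy as np

def gaussian_nll(x, mean):
    """-log N(x; mean, I), dropping the constant (d/2) * log(2 * pi):
    this reduces to half the squared reconstruction error."""
    return 0.5 * np.sum((x - mean) ** 2)

def bernoulli_nll(x, logits):
    """-log p(x) for a factorial Bernoulli whose parameters are
    produced by a sigmoid output unit applied to the logits."""
    p = 1.0 / (1.0 + np.exp(-logits))
    return -np.sum(x * np.log(p) + (1.0 - x) * np.log(1.0 - p))
```

Minimizing gaussian_nll over the mean is thus identical (up to scale and an additive constant) to minimizing mean squared reconstruction error, while bernoulli_nll is the usual sigmoid cross-entropy loss for binary data.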
Typically, the output variables are treated as being conditionally independent given h so that this probability distribution is inexpensive to evaluate, but some techniques such as mixture density outputs allow tractable modeling of outputs with correlations.
Figure 14.2: The structure of a stochastic autoencoder, in which both the encoder and the decoder are not simple functions but instead involve some noise injection, meaning that their output can be seen as sampled from a distribution, p_encoder(h | x) for the encoder and p_decoder(x | h) for the decoder.

To make a more radical departure from the feedforward networks we have seen previously, we can also generalize the notion of an encoding function f(x) to an encoding distribution p_encoder(h | x), as illustrated in Fig. 14.2.

Any latent variable model p_model(h, x) defines a stochastic encoder

p_encoder(h | x) = p_model(h | x)    (14.12)

and a stochastic decoder

p_decoder(x | h) = p_model(x | h).    (14.13)

In general, the encoder and decoder distributions are not necessarily conditional distributions compatible with a unique joint distribution p_model(x, h). Alain et al. (2015) showed that training the encoder and decoder as a denoising autoencoder will tend to make them compatible asymptotically (with enough capacity and examples).
14.5
Denoising Autoencoders
The denoising autoencoder (DAE) is an autoencoder that receives a corrupted data point as input and is trained to predict the original, uncorrupted data point as its output.

The DAE training procedure is illustrated in Fig. 14.3. We introduce a corruption process C(x̃ | x) which represents a conditional distribution over
Figure 14.3: The computational graph of the cost function for a denoising autoencoder, which is trained to reconstruct the clean data point x from its corrupted version x̃. This is accomplished by minimizing the loss L = −log p_decoder(x | h = f(x̃)), where x̃ is a corrupted version of the data example x, obtained through a given corruption process C(x̃ | x). Typically the distribution p_decoder is a factorial distribution whose mean parameters are emitted by a feedforward network g.
corrupted samples x̃, given a data sample x. The autoencoder then learns a reconstruction distribution p_reconstruct(x | x̃) estimated from training pairs (x, x̃), as follows:

1. Sample a training example x from the training data.

2. Sample a corrupted version x̃ from C(x̃ | x = x).

3. Use (x, x̃) as a training example for estimating the autoencoder reconstruction distribution p_reconstruct(x | x̃) = p_decoder(x | h) with h the output of encoder f(x̃) and p_decoder typically defined by a decoder g(h).

Typically we can simply perform gradient-based approximate minimization (such as minibatch gradient descent) on the negative log-likelihood −log p_decoder(x | h). So long as the encoder is deterministic, the denoising autoencoder is a feedforward network and may be trained with exactly the same techniques as any other feedforward network.

We can therefore view the DAE as performing stochastic gradient descent on the following expectation:

−E_{x∼p̂_data(x)} E_{x̃∼C(x̃|x)} log p_decoder(x | h = f(x̃))    (14.14)

where p̂_data(x) is the training distribution.
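The three steps above can be sketched end to end with a deliberately tiny linear autoencoder (the sizes, learning rate, number of epochs and Gaussian corruption level below are illustrative choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, sigma, lr = 6, 3, 0.1, 0.01          # illustrative sizes and noise level
W_e = 0.1 * rng.standard_normal((k, d))    # encoder f(x) = W_e @ x
W_d = 0.1 * rng.standard_normal((d, k))    # decoder g(h) = W_d @ h
X = rng.standard_normal((100, d))          # stand-in training data

for _ in range(50):
    for x in X:                                       # step 1: draw a clean example
        x_tilde = x + sigma * rng.standard_normal(d)  # step 2: corrupt it via C(x~ | x)
        h = W_e @ x_tilde                             # encode the corrupted input
        err = W_d @ h - x                             # step 3: compare output to the CLEAN x
        # gradient step on the per-example loss 0.5 * ||g(f(x~)) - x||^2
        g_Wd = np.outer(err, h)
        g_We = np.outer(W_d.T @ err, x_tilde)
        W_d -= lr * g_Wd
        W_e -= lr * g_We
```

With minibatch sampling of both x and x̃, this loop is exactly stochastic gradient descent on Eq. 14.14 for a Gaussian p_decoder. After training, the reconstruction error on clean inputs drops well below its initial value, though the undercomplete code (k < d) prevents it from reaching zero.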
Figure 14.4: A denoising autoencoder is trained to map a corrupted data point x̃ back to the original data point x. We illustrate training examples x as red crosses lying near a low-dimensional manifold illustrated with the bold black line. We illustrate the corruption process C(x̃ | x) with a gray circle of equiprobable corruptions. A gray arrow demonstrates how one training example is transformed into one sample from this corruption process. When the denoising autoencoder is trained to minimize the average of squared errors ||g(f(x̃)) − x||², the reconstruction g(f(x̃)) estimates E_{x,x̃∼p_data(x)C(x̃|x)}[x | x̃]. The vector g(f(x̃)) − x̃ points approximately towards the nearest point on the manifold, since g(f(x̃)) estimates the center of mass of the clean points x which could have given rise to x̃. The autoencoder thus learns a vector field g(f(x)) − x indicated by the green arrows. This vector field estimates the score ∇_x log p_data(x) up to a multiplicative factor that is the average root mean square reconstruction error.
14.5.1
Estimating the Score
Score matching (Hyvärinen, 2005) is an alternative to maximum likelihood. It provides a consistent estimator of probability distributions based on encouraging the model to have the same score as the data distribution at every training point x. In this context, the score is a particular gradient field:

∇_x log p(x).    (14.15)

Score matching is discussed further in Sec. 18.4. For the present discussion regarding autoencoders, it is sufficient to understand that learning the gradient field of log p_data is one way to learn the structure of p_data itself.

A very important property of DAEs is that their training criterion (with conditionally Gaussian p(x | h)) makes the autoencoder learn a vector field (g(f(x)) − x) that estimates the score of the data distribution. This is illustrated in Fig. 14.4.

Denoising training of a specific kind of autoencoder (sigmoidal hidden units, linear reconstruction units) using Gaussian noise and mean squared error as the reconstruction cost is equivalent (Vincent, 2011) to training a specific kind of undirected probabilistic model called an RBM with Gaussian visible units. This kind of model will be described in detail in Sec. 20.5.1; for the present discussion it suffices to know that it is a model that provides an explicit p_model(x; θ). When the RBM is trained using denoising score matching (Kingma and LeCun, 2010), its learning algorithm is equivalent to denoising training in the corresponding autoencoder. With a fixed noise level, regularized score matching is not a consistent estimator; it instead recovers a blurred version of the distribution. However, if the noise level is chosen to approach 0 when the number of examples approaches infinity, then consistency is recovered. Denoising score matching is discussed in more detail in Sec. 18.5.

Other connections between autoencoders and RBMs exist. Score matching applied to RBMs yields a cost function that is identical to reconstruction error combined with a regularization term similar to the contractive penalty of the CAE (Swersky et al., 2011a). Bengio and Delalleau (2009) showed that an autoencoder gradient provides an approximation to contrastive divergence training of RBMs.

For continuous-valued x, the denoising criterion with Gaussian corruption and reconstruction distribution yields an estimator of the score that is applicable to general encoder and decoder parametrizations (Alain and Bengio, 2013). This means a generic encoder-decoder architecture may be made to estimate the score
CHAPTER 14. AUTOENCODERS
by training with the squared error criterion

||g(f(x̃)) − x||²  (14.16)

and corruption

C(x̃ = x̃ | x) = N(x̃; µ = x, Σ = σ²I)  (14.17)

with noise variance σ². See Fig. 14.5 for an illustration of how this works.
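The corruption and cost above can be sketched in a few lines of numpy. This is a minimal illustration rather than the book's implementation; the tied-weight sigmoidal encoder, linear decoder, and layer sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A tiny autoencoder: sigmoidal encoder f, linear decoder g (tied weights,
# an illustrative choice).
n_in, n_hid = 8, 4
W = rng.normal(scale=0.1, size=(n_hid, n_in))
b = np.zeros(n_hid)
c = np.zeros(n_in)

def f(x):          # encoder
    return sigmoid(W @ x + b)

def g(h):          # linear decoder
    return W.T @ h + c

def denoising_loss(x, sigma=0.1):
    # Corruption C(x_tilde | x) = N(x_tilde; mu = x, Sigma = sigma^2 I)  (Eq. 14.17)
    x_tilde = x + sigma * rng.normal(size=x.shape)
    # Squared error criterion ||g(f(x_tilde)) - x||^2                    (Eq. 14.16)
    return np.sum((g(f(x_tilde)) - x) ** 2)

x = rng.normal(size=n_in)
print(denoising_loss(x))
```

The key point is that the target of the squared error is the clean input x, while the encoder only ever sees the corrupted x̃.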
Figure 14.5: Vector field learned by a denoising autoencoder around a 1-D curved manifold near which the data concentrates in a 2-D space. Each arrow is proportional to the reconstruction minus input vector of the autoencoder and points towards higher probability according to the implicitly estimated probability distribution. The vector field has zeros at both maxima of the estimated density function (on the data manifolds) and at minima of that density function. For example, the spiral arm forms a one-dimensional manifold of local maxima that are connected to each other. Local minima appear near the middle of the gap between two arms. When the norm of reconstruction error (shown by the length of the arrows) is large, it means that probability can be significantly increased by moving in the direction of the arrow, and that is mostly the case in places of low probability. The autoencoder maps these low probability points to higher probability reconstructions. Where probability is maximal, the arrows shrink because the reconstruction becomes more accurate.

In general, there is no guarantee that the reconstruction g(f(x)) minus the input x corresponds to the gradient of any function, let alone to the score. That is
why the early results (Vincent, 2011) are specialized to particular parametrizations where g(f(x)) − x may be obtained by taking the derivative of another function. Kamyshanska and Memisevic (2015) generalized the results of Vincent (2011) by identifying a family of shallow autoencoders such that g(f(x)) − x corresponds to a score for all members of the family.

So far we have described only how the denoising autoencoder learns to represent a probability distribution. More generally, one may want to use the autoencoder as a generative model and draw samples from this distribution. This will be described later, in Sec. 20.11.

14.5.1.1 Historical Perspective

The idea of using MLPs for denoising dates back to the work of LeCun (1987) and Gallinari et al. (1987). Behnke (2001) also used recurrent networks to denoise images. Denoising autoencoders are, in some sense, just MLPs trained to denoise.
However, the name "denoising autoencoder" refers to a model that is intended not merely to learn to denoise its input but to learn a good internal representation as a side effect of learning to denoise. This idea came much later (Vincent et al., 2008, 2010). The learned representation may then be used to pretrain a deeper unsupervised network or a supervised network. Like sparse autoencoders, sparse coding, contractive autoencoders and other regularized autoencoders, the motivation for DAEs was to allow the learning of a very high-capacity encoder while preventing the encoder and decoder from learning a useless identity function.

Prior to the introduction of the modern DAE, Inayoshi and Kurita (2005) explored some of the same goals with some of the same methods.
Their approach minimizes reconstruction error in addition to a supervised objective while injecting noise in the hidden layer of a supervised MLP, with the objective to improve generalization by introducing the reconstruction error and the injected noise. However, their method was based on a linear encoder and could not learn function families as powerful as can the modern DAE.
14.6 Learning Manifolds with Autoencoders
Like many other machine learning algorithms, autoencoders exploit the idea that data concentrates around a low-dimensional manifold or a small set of such manifolds, as described in Sec. 5.11.3. Some machine learning algorithms exploit this idea only insofar as they learn a function that behaves correctly on the manifold but may have unusual behavior if given an input that is off the manifold.
Autoencoders take this idea further and aim to learn the structure of the manifold.

To understand how autoencoders do this, we must present some important characteristics of manifolds.

An important characterization of a manifold is the set of its tangent planes. At a point x on a d-dimensional manifold, the tangent plane is given by d basis vectors that span the local directions of variation allowed on the manifold. As illustrated in Fig. 14.6, these local directions specify how one can change x infinitesimally while staying on the manifold.

All autoencoder training procedures involve a compromise between two forces:

1. Learning a representation h of a training example x such that x can be approximately recovered from h through a decoder.
The fact that x is drawn from the training data is crucial, because it means the autoencoder need not successfully reconstruct inputs that are not probable under the data generating distribution.

2. Satisfying the constraint or regularization penalty. This can be an architectural constraint that limits the capacity of the autoencoder, or it can be a regularization term added to the reconstruction cost. These techniques generally prefer solutions that are less sensitive to the input.

Clearly, neither force alone would be useful: copying the input to the output is not useful on its own, nor is ignoring the input. Instead, the two forces together are useful because they force the hidden representation to capture information about the structure of the data generating distribution. The important principle is that the autoencoder can afford to represent only the variations that are needed to reconstruct training examples.
If the data generating distribution concentrates near a low-dimensional manifold, this yields representations that implicitly capture a local coordinate system for this manifold: only the variations tangent to the manifold around x need to correspond to changes in h = f(x). Hence the encoder learns a mapping from the input space x to a representation space, a mapping that is only sensitive to changes along the manifold directions, but that is insensitive to changes orthogonal to the manifold.

A one-dimensional example is illustrated in Fig. 14.7, showing that by making the reconstruction function insensitive to perturbations of the input around the data points we recover the manifold structure.

To understand why autoencoders are useful for manifold learning, it is instructive to compare them to other approaches.
What is most commonly learned to characterize a manifold is a representation of the data points on (or near) the
Figure 14.6: An illustration of the concept of a tangent hyperplane. Here we create a one-dimensional manifold in 784-dimensional space. We take an MNIST image with 784 pixels and transform it by translating it vertically. The amount of vertical translation defines a coordinate along a one-dimensional manifold that traces out a curved path through image space. This plot shows a few points along this manifold. For visualization, we have projected the manifold into two dimensional space using PCA. An n-dimensional manifold has an n-dimensional tangent plane at every point. This tangent plane touches the manifold exactly at that point and is oriented parallel to the surface at that point. It defines the space of directions in which it is possible to move while remaining on the manifold. This one-dimensional manifold has a single tangent line. We indicate an example tangent line at one point, with an image showing how this tangent direction appears in image space.
Gray pixels indicate pixels that do not change as we move along the tangent line, white pixels indicate pixels that brighten, and black pixels indicate pixels that darken.
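The construction in Figure 14.6 can be reproduced in a few lines of numpy. This sketch substitutes a synthetic 28×28 bar image for the MNIST digit, and the shift range and finite-difference tangent estimate are illustrative assumptions:

```python
import numpy as np

# Build a 1-D manifold by vertically translating a synthetic 28x28 "image"
# (a bright horizontal bar; a stand-in for the MNIST digit in Fig. 14.6).
img = np.zeros((28, 28))
img[10:14, 4:24] = 1.0

# Each vertical shift gives one point on the manifold in 784-dim space.
shifts = range(-8, 9)
points = np.stack([np.roll(img, s, axis=0).ravel() for s in shifts])

# Project the manifold into 2-D with PCA: center, then keep the
# top-2 right singular vectors of the centered data matrix.
centered = points - points.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
coords2d = centered @ Vt[:2].T          # shape (17, 2): the curved path

# A finite-difference between neighboring shifts approximates the single
# tangent direction of this one-dimensional manifold at the middle point.
mid = len(points) // 2
tangent = points[mid + 1] - points[mid - 1]
```

Reshaping `tangent` back to 28×28 gives exactly the gray/white/black tangent image the caption describes: zero where pixels are unchanged, positive where they brighten, negative where they darken.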
[Plot for Figure 14.7: r(x) versus x, showing the identity function (dashed) and the optimal reconstruction function; data points x0, x1, x2 marked on the x-axis.]
Figure 14.7: If the autoencoder learns a reconstruction function that is invariant to small perturbations near the data points, it captures the manifold structure of the data. Here the manifold structure is a collection of 0-dimensional manifolds. The dashed diagonal line indicates the identity function target for reconstruction. The optimal reconstruction function crosses the identity function wherever there is a data point. The horizontal arrows at the bottom of the plot indicate the r(x) − x reconstruction direction vector at the base of the arrow, in input space, always pointing towards the nearest "manifold" (a single data point, in the 1-D case). The denoising autoencoder explicitly tries to make the derivative of the reconstruction function r(x) small around the data points. The contractive autoencoder does the same for the encoder. Although the derivative of r(x) is asked to be small around the data points, it can be large between the data points. The space between the data points corresponds to the region between the manifolds, where the reconstruction function must have a large derivative in order to map corrupted points back onto the manifold.
manifold. Such a representation for a particular example is also called its embedding. It is typically given by a low-dimensional vector, with fewer dimensions than the "ambient" space of which the manifold is a low-dimensional subset. Some algorithms (non-parametric manifold learning algorithms, discussed below) directly learn an embedding for each training example, while others learn a more general mapping, sometimes called an encoder, or representation function, that maps any point in the ambient space (the input space) to its embedding.

Manifold learning has mostly focused on unsupervised learning procedures that attempt to capture these manifolds. Most of the initial machine learning research on learning nonlinear manifolds has focused on non-parametric methods based on the nearest-neighbor graph. This graph has one node per training example and edges connecting near neighbors to each other. These methods (Schölkopf et al.,
1998; Roweis and Saul, 2000; Tenenbaum et al., 2000; Brand, 2003; Belkin and
Figure 14.8: Non-parametric manifold learning procedures build a nearest neighbor graph whose nodes are training examples and arcs connect nearest neighbors. Various procedures can thus obtain the tangent plane associated with a neighborhood of the graph as well as a coordinate system that associates each training example with a real-valued vector position, or embedding. It is possible to generalize such a representation to new examples by a form of interpolation. So long as the number of examples is large enough to cover the curvature and twists of the manifold, these approaches work well. Images from the QMUL Multiview Face Dataset (Gong et al., 2000).
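The per-neighborhood tangent plane that these procedures obtain can be estimated with a short numpy sketch. The noisy circle standing in for the face images, the neighborhood size k, and the SVD-based basis are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample points near a 1-D manifold (a circle) embedded in 3-D.
t = rng.uniform(0.0, 2.0 * np.pi, size=200)
X = np.stack([np.cos(t), np.sin(t), np.zeros_like(t)], axis=1)
X += 0.01 * rng.normal(size=X.shape)     # small off-manifold noise

def local_tangent_plane(X, i, k=10, d=1):
    """Estimate a d-dim tangent basis at point i from its k nearest neighbors."""
    dists = np.linalg.norm(X - X[i], axis=1)
    neighbors = np.argsort(dists)[1:k + 1]    # exclude the point itself
    diffs = X[neighbors] - X[i]               # difference vectors to neighbors
    # The leading right singular vectors of the difference vectors span
    # the local directions of variation.
    _, _, Vt = np.linalg.svd(diffs, full_matrices=False)
    return Vt[:d]

basis = local_tangent_plane(X, i=0)
```

For a point on this circle, the estimated one-dimensional basis should lie close to the circle's tangent direction and nearly in the xy-plane.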
Niyogi, 2003; Donoho and Grimes, 2003; Weinberger and Saul, 2004; Hinton and Roweis, 2003; van der Maaten and Hinton, 2008) associate each node with a tangent plane that spans the directions of variation associated with the difference vectors between the example and its neighbors, as illustrated in Fig. 14.8.

A global coordinate system can then be obtained through an optimization or by solving a linear system. Fig. 14.9 illustrates how a manifold can be tiled by a large number of locally linear Gaussian-like patches (or "pancakes," because the Gaussians are flat in the tangent directions).

However, there is a fundamental difficulty with such local non-parametric
approaches to manifold learning, raised in Bengio and Monperrus (2005): if the manifolds are not very smooth (they have many peaks and troughs and twists), one may need a very large number of training examples to cover each one of these variations, with no chance to generalize to unseen variations. Indeed, these methods
Figure 14.9: If the tangent planes (see Fig. 14.6) at each location are known, then they can be tiled to form a global coordinate system or a density function. Each local patch can be thought of as a local Euclidean coordinate system or as a locally flat Gaussian, or "pancake," with a very small variance in the directions orthogonal to the pancake and a very large variance in the directions defining the coordinate system on the pancake. A mixture of these Gaussians provides an estimated density function, as in the manifold Parzen window algorithm (Vincent and Bengio, 2003) or its non-local neural-net based variant (Bengio et al., 2006c).
can only generalize the shape of the manifold by interpolating between neighboring examples. Unfortunately, the manifolds involved in AI problems can have very complicated structure that can be difficult to capture from only local interpolation. Consider for example the manifold resulting from translation shown in Fig. 14.6. If we watch just one coordinate within the input vector, x_i, as the image is translated, we will observe that one coordinate encounters a peak or a trough in its value once for every peak or trough in brightness in the image. In other words, the complexity of the patterns of brightness in an underlying image template drives the complexity of the manifolds that are generated by performing simple image transformations. This motivates the use of distributed representations and deep learning for capturing manifold structure.
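The peak-counting observation above can be checked numerically. In this small sketch, the two-bump 1-D "image" and the tracked coordinate are invented for illustration; circular translation stands in for the vertical translation of Fig. 14.6:

```python
import numpy as np

# A 1-D "image" whose brightness pattern has two bumps (at positions 6 and 20).
n = 32
idx = np.arange(n)
row = np.exp(-((idx - 6) ** 2) / 4.0) + np.exp(-((idx - 20) ** 2) / 4.0)

# Track a single coordinate x_i while the image is translated (circularly).
i = 0
trace = np.array([np.roll(row, s)[i] for s in range(n)])

# Count strict local maxima of the tracked coordinate: one per bump of
# brightness in the image, matching the argument in the text.
peaks = np.sum((trace[1:-1] > trace[:-2]) & (trace[1:-1] > trace[2:]))
```

A two-bump image produces a coordinate trace with two peaks; an image with many brightness peaks would produce an equally wiggly, hard-to-interpolate manifold coordinate.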
14.7 Contractive Autoencoders
The contractive autoencoder (Rifai et al., 2011a,b) introduces an explicit regularizer on the code h = f(x), encouraging the derivatives of f to be as small as possible:

Ω(h) = λ || ∂f(x)/∂x ||²_F.  (14.18)

The penalty Ω(h) is the squared Frobenius norm (sum of squared elements) of the Jacobian matrix of partial derivatives associated with the encoder function.

There is a connection between the denoising autoencoder and the contractive autoencoder: Alain and Bengio (2013) showed that in the limit of small Gaussian input noise, the denoising reconstruction error is equivalent to a contractive penalty on the reconstruction function that maps x to r = g(f(x)). In other words, denoising autoencoders make the reconstruction function resist small but finite-sized perturbations of the input, while contractive autoencoders make the feature extraction function resist infinitesimal perturbations of the input.
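For a sigmoidal encoder the Jacobian in Eq. 14.18 has a simple closed form, so the penalty can be computed directly. This is a minimal sketch; the layer sizes and the value of λ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Sigmoidal encoder h = f(x) = sigmoid(W x + b).
n_in, n_hid = 6, 3
W = rng.normal(scale=0.5, size=(n_hid, n_in))
b = np.zeros(n_hid)

def contractive_penalty(x, lam=0.1):
    """Eq. 14.18: lambda * ||df/dx||_F^2 for a sigmoid encoder.

    Since dh_i/dx_j = h_i (1 - h_i) W_ij, the Jacobian is
    diag(h * (1 - h)) @ W, and its squared Frobenius norm is
    sum_ij (h_i (1 - h_i))^2 W_ij^2.
    """
    h = sigmoid(W @ x + b)
    jac = (h * (1.0 - h))[:, None] * W      # shape (n_hid, n_in)
    return lam * np.sum(jac ** 2)

x = rng.normal(size=n_in)
print(contractive_penalty(x))
```

The `h * (1 - h)` factor makes visible why saturating the sigmoid units toward 0 or 1 shrinks the Jacobian, a point the text returns to below.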
When using the Jacobian-based contractive penalty to pretrain features f(x) for use with a classifier, the best classification accuracy usually results from applying the contractive penalty to f(x) rather than to g(f(x)). A contractive penalty on f(x) also has close connections to score matching, as discussed in Sec. 14.5.1.

The name contractive arises from the way that the CAE warps space. Specifically, because the CAE is trained to resist perturbations of its input, it is encouraged to map a neighborhood of input points to a smaller neighborhood of output points. We can think of this as contracting the input neighborhood to a smaller output neighborhood.
Globally, two different points x and x′ may be mapped to points f(x) and f(x′) that are farther apart than the original points. It is plausible that f could be expanding in-between or far from the data manifolds (see for example what happens in the 1-D toy example of Fig. 14.7). When the Ω(h) penalty is applied to sigmoidal units, one easy way to shrink the Jacobian is to make the sigmoid units saturate to 0 or 1. This encourages the CAE to encode input points with extreme values of the sigmoid that may be interpreted as a binary code. It also ensures that the CAE will spread its code values throughout most of the hypercube that its sigmoidal hidden units can span.

We can think of the Jacobian matrix J at a point x as approximating the nonlinear encoder f(x) as being a linear operator. This allows us to use the word "contractive" more formally.
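The saturation argument can be checked directly: a sigmoid unit's contribution to the Jacobian is scaled by h(1 − h), which vanishes as the unit saturates toward 0 or 1. A quick sketch (the pre-activation values are chosen arbitrarily):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The Jacobian of a sigmoid layer is scaled element-wise by h * (1 - h),
# so saturated units contribute almost nothing to the Frobenius norm.
for z in [0.0, 2.0, 5.0, 10.0]:
    h = sigmoid(z)
    print(f"z = {z:4.1f}   h = {h:.6f}   h*(1-h) = {h * (1.0 - h):.2e}")
```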
In the theory ofatlinear operators, a linear op operator erator nonlinear encoder f (x) as being a linear operator. This allows us to use the word 524 of linear operators, a linear op erator “contractive” more formally. In the theory
CHAPTER 14. AUTOENCODERS
is said to be contractive if the norm of Jx remains less than or equal to 1 for all unit-norm x. In other words, J is contractive if it shrinks the unit sphere. We can think of the CAE as penalizing the Frobenius norm of the local linear approximation of f(x) at every training point x in order to encourage each of these local linear operators to become a contraction.

As described in Sec. 14.6, regularized autoencoders learn manifolds by balancing two opposing forces. In the case of the CAE, these two forces are reconstruction error and the contractive penalty Ω(h). Reconstruction error alone would encourage the CAE to learn an identity function. The contractive penalty alone would encourage the CAE to learn features that are constant with respect to x. The compromise between these two forces yields an autoencoder whose derivatives ∂f(x)/∂x are mostly tiny.
Only a small num numb ber of hidden units, corresponding to a ∂x are mostly tin compromise betw een these t wo forces yields an eautoenco der whose derivatives small number of directions in the input, ma may y hav have significan significant t deriv derivatives. atives. are mostly tiny. Only a small number of hidden units, corresponding to a The goal of the CAE is to learn the manifold structure of the data. Directions small number of directions in the input, may have significant derivatives. x with large J x rapidly change h, so these are likely to be directions whic which h The goal of the CAE is to learn the manifold structure of the data. Directions appro approximate ximate the tangent planes of the manifold. Exp Experiments eriments by Rifai et al. (2011a) x J x h with large rapidly c hange , so these are likely e directions h and Rifai et al. (2011b) sho show w that training the CAE resultstoin bmost singular whic values appro ximate the tangent of theand manifold. Exp by Rifai ete.al.How (2011a J dropping of below 1 inplanes magnitude therefore beriments ecoming con contractiv tractiv tractive. Howev ev ever, er,) and Rifai et al. v(alues 2011b)remain show that training the CAE in most singular values some singular ab abo ove 1, because the results reconstruction error penalty J of dropping magnitude and therefore tractiv e. Howev er, encourages thebelow CAE 1toinencode the directions withbecoming the mostcon local variance. The some singular v alues remain ab o ve 1 , because the reconstruction error p enalty directions corresp corresponding onding to the largest singular values are interpreted as the tangen tangentt encouragesthat the CAE totractive encode the directions with the most local, vthese ariance. Thet directions the con contractive autoenco autoencoder der has learned. 
Ideally Ideally, tangen tangent directions corresp to thetolargest interpreted as the tangen directions shouldonding corresp correspond ond real vsingular ariationsvalues in theare data. For example, a CAEt directions that the contractive autoenco der has , these tangen applied to images should learn tangent vectors thatlearned. show ho how wIdeally the image changes ast directions should corresp ond to real v ariations in the data. F or example, a CAE ob objects jects in the image gradually change pose, as shown in Fig. 14.6. Visualizations of applied to images learnsingular tangentvectors vectorsdo that show how theond image changes as the exp experimen erimen erimentally tallyshould obtained seem to corresp correspond to meaningful ob jects in the image gradually changeaspose, as shown Fig. .14.6. Visualizations of transformations of the input image, shown in Fig.in14.10 the experimentally obtained singular vectors do seem to correspond to meaningful One practicalofissue with the CAE transformations the input image, as regularization shown in Fig. criterion 14.10. is that although it is cheap to compute in the case of a single hidden lay layer er autoenco autoencoder, der, it becomes One practical issue the of CAE regularization criterion is that although it much more exp expensiv ensiv ensive e in with the case deep deeper er auto autoenco enco encoders. ders. The strategy follo followed wed by is cheap to (compute case of train a single hidden layer autoenco der, itders, becomes Rifai et al. 2011a) is in to the separately a series of single-la single-lay yer auto autoenco enco encoders, each m uch more exp ensiv e in the case of deep er auto enco ders. The strategy follo wed by trained to reconstruct the previous auto autoenco enco encoder’s der’s hidden lay layer. er. The comp composition osition Rifai et al. 
(2011a) is to separately train a series of single-layer autoencoders, each trained to reconstruct the previous autoencoder's hidden layer. The composition of these autoencoders then forms a deep autoencoder. Because each layer was separately trained to be locally contractive, the deep autoencoder is contractive as well. The result is not the same as what would be obtained by jointly training the entire architecture with a penalty on the Jacobian of the deep model, but it captures many of the desirable qualitative characteristics.

Another practical issue is that the contraction penalty can obtain useless results
[Figure omitted: panels showing an input point, tangent vectors from local PCA (no sharing across regions), and tangent vectors from a contractive autoencoder.]

Figure 14.10: Illustration of tangent vectors of the manifold estimated by local PCA and by a contractive autoencoder. The location on the manifold is defined by the input image of a dog drawn from the CIFAR-10 dataset. The tangent vectors are estimated by the leading singular vectors of the Jacobian matrix ∂h/∂x of the input-to-code mapping. Although both local PCA and the CAE can capture local tangents, the CAE is able to form more accurate estimates from limited training data because it exploits parameter sharing across different locations that share a subset of active hidden units. The CAE tangent directions typically correspond to moving or changing parts of the object (such as the head or legs).
if we do not impose some sort of scale on the decoder. For example, the encoder could consist of multiplying the input by a small constant ε and the decoder could consist of dividing the code by ε. As ε approaches 0, the encoder drives the contractive penalty Ω(h) to approach 0 without having learned anything about the distribution. Meanwhile, the decoder maintains perfect reconstruction. In Rifai et al. (2011a), this is prevented by tying the weights of f and g. Both f and g are standard neural network layers consisting of an affine transformation followed by an element-wise nonlinearity, so it is straightforward to set the weight matrix of g to be the transpose of the weight matrix of f.
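The weight-tying trick can be sketched as follows; the class name and layer sizes are hypothetical, and training is omitted:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TiedAutoencoder:
    """Encoder f and decoder g share one weight matrix; g uses its transpose."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(n_hidden, n_in)) * 0.1
        self.b_enc = np.zeros(n_hidden)
        self.b_dec = np.zeros(n_in)

    def encode(self, x):
        # f(x): affine transformation followed by an element-wise nonlinearity.
        return sigmoid(self.W @ x + self.b_enc)

    def decode(self, h):
        # g(h) reuses W transposed, so the encoder cannot shrink its Jacobian
        # by scaling down while the decoder silently scales back up.
        return sigmoid(self.W.T @ h + self.b_dec)

ae = TiedAutoencoder(n_in=6, n_hidden=4)
x = np.ones(6)
r = ae.decode(ae.encode(x))
print(r.shape)  # prints (6,)
```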
14.8 Predictive Sparse Decomposition
Predictive sparse decomposition (PSD) is a model that is a hybrid of sparse coding and parametric autoencoders (Kavukcuoglu et al., 2008). A parametric encoder is trained to predict the output of iterative inference. PSD has been applied to unsupervised feature learning for object recognition in images and video (Kavukcuoglu et al., 2009, 2010; Jarrett et al., 2009; Farabet et al., 2011), as well as for audio (Henaff et al., 2011). The model consists of an encoder f(x) and a decoder g(h) that are both parametric. During training, h is controlled by the
optimization algorithm. Training proceeds by minimizing

‖x − g(h)‖² + λ|h|₁ + γ‖h − f(x)‖².   (14.19)

Like in sparse coding, the training algorithm alternates between minimization with respect to h and minimization with respect to the model parameters. Minimization with respect to h is fast because f(x) provides a good initial value of h and the cost function constrains h to remain near f(x) anyway. Simple gradient descent can obtain reasonable values of h in as few as ten steps.

The training procedure used by PSD is different from first training a sparse coding model and then training f(x) to predict the values of the sparse coding features. The PSD training procedure regularizes the decoder to use parameters for which f(x) can infer good code values.

Predictive sparse coding is an example of learned approximate inference. In Sec. 19.5, this topic is developed further. The tools presented in Chapter 19 make it clear that PSD can be interpreted as training a directed sparse coding probabilistic
model by maximizing a lower bound on the log-likelihood of the model.

In practical applications of PSD, the iterative optimization is only used during training. The parametric encoder f is used to compute the learned features when the model is deployed. Evaluating f is computationally inexpensive compared to inferring h via gradient descent. Because f is a differentiable parametric function, PSD models may be stacked and used to initialize a deep network to be trained with another criterion.
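A minimal sketch of Eq. 14.19 and the gradient-based inference step over h, assuming a linear decoder and a precomputed encoder output (both are stand-ins; PSD only requires that the encoder and decoder be parametric):

```python
import numpy as np

def psd_cost(x, h, f_x, W_dec, lam=0.1, gamma=1.0):
    # Eq. 14.19: ||x - g(h)||^2 + lam*|h|_1 + gamma*||h - f(x)||^2,
    # here with a linear decoder g(h) = W_dec @ h (an assumption for
    # the sketch; any parametric decoder would do).
    r = W_dec @ h
    return (np.sum((x - r) ** 2)
            + lam * np.sum(np.abs(h))
            + gamma * np.sum((h - f_x) ** 2))

def infer_h(x, f_x, W_dec, lam=0.1, gamma=1.0, lr=0.05, steps=10):
    # Minimize Eq. 14.19 over h. A few gradient steps suffice because
    # h starts at, and is pulled toward, the encoder's prediction f(x).
    h = f_x.copy()
    for _ in range(steps):
        grad = (-2.0 * W_dec.T @ (x - W_dec @ h)   # reconstruction term
                + lam * np.sign(h)                 # subgradient of the L1 term
                + 2.0 * gamma * (h - f_x))         # stay near f(x)
        h -= lr * grad
    return h

rng = np.random.default_rng(0)
W_dec = rng.normal(size=(6, 4)) * 0.3
x = rng.normal(size=6)
f_x = 0.1 * rng.normal(size=4)   # stand-in for the encoder's output f(x)
h = infer_h(x, f_x, W_dec)
print("cost at f(x):   ", float(psd_cost(x, f_x, f_x, W_dec)))
print("cost after steps:", float(psd_cost(x, h, f_x, W_dec)))
```

In full PSD training, this inner minimization over h would alternate with gradient updates to the encoder and decoder parameters.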
14.9 Applications of Autoencoders
Autoencoders have been successfully applied to dimensionality reduction and information retrieval tasks. Dimensionality reduction was one of the first applications of representation learning and deep learning. It was one of the early motivations for studying autoencoders. For example, Hinton and Salakhutdinov (2006) trained a stack of RBMs and then used their weights to initialize a deep autoencoder with gradually smaller hidden layers, culminating in a bottleneck of 30 units. The resulting code yielded less reconstruction error than PCA into 30 dimensions, and the learned representation was qualitatively easier to interpret and relate to the underlying categories, with these categories manifesting as well-separated clusters.

Lower-dimensional representations can improve performance on many tasks, such as classification. Models of smaller spaces consume less memory and runtime.
Many forms of dimensionality reduction place semantically related examples near
each other, as observed by Salakhutdinov and Hinton (2007b) and Torralba et al. (2008). The hints provided by the mapping to the lower-dimensional space aid generalization.

One task that benefits even more than usual from dimensionality reduction is information retrieval, the task of finding entries in a database that resemble a query entry. This task derives the usual benefits from dimensionality reduction that other tasks do, but also derives the additional benefit that search can become extremely efficient in certain kinds of low-dimensional spaces. Specifically, if we train the dimensionality reduction algorithm to produce a code that is low-dimensional and binary, then we can store all database entries in a hash table mapping binary code vectors to entries. This hash table allows us to perform information retrieval by returning all database entries that have the same binary code as the query.
We can also search over slightly less similar entries very efficiently, just by flipping individual bits from the encoding of the query. This approach to information retrieval via dimensionality reduction and binarization is called semantic hashing (Salakhutdinov and Hinton, 2007b, 2009b), and has been applied to both textual input (Salakhutdinov and Hinton, 2007b, 2009b) and images (Torralba et al., 2008; Weiss et al., 2008; Krizhevsky and Hinton, 2011).

To produce binary codes for semantic hashing, one typically uses an encoding function with sigmoids on the final layer. The sigmoid units must be trained to be saturated to nearly 0 or nearly 1 for all input values. One trick that can accomplish this is simply to inject additive noise just before the sigmoid nonlinearity during training. The magnitude of the noise should increase over time.
To fight that noise and preserve as much information as possible, the network must increase the magnitude of the inputs to the sigmoid function, until saturation occurs.

The idea of learning a hashing function has been further explored in several directions, including the idea of training the representations so as to optimize a loss more directly linked to the task of finding nearby examples in the hash table (Norouzi and Fleet, 2011).
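The retrieval scheme described above, a hash table keyed on binary codes with near-miss search by flipping bits, can be sketched as follows (the database entries and codes are made up for illustration):

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical database entries with precomputed 4-bit binary codes
# (in practice the codes would come from a trained sigmoid encoder).
codes = {
    "doc_a": (1, 0, 1, 1),
    "doc_b": (1, 0, 1, 0),
    "doc_c": (0, 1, 0, 0),
}

# Hash table mapping each binary code to the entries that share it.
table = defaultdict(list)
for name, code in codes.items():
    table[code].append(name)

def search(query, radius=1):
    """Return entries whose code is within `radius` bit flips of `query`."""
    hits = list(table.get(query, []))          # exact matches first
    for r in range(1, radius + 1):
        for bits in combinations(range(len(query)), r):
            flipped = tuple(b ^ (i in bits) for i, b in enumerate(query))
            hits.extend(table.get(flipped, []))
    return hits

print(search((1, 0, 1, 1), radius=1))  # → ['doc_a', 'doc_b']
```

Each lookup is a constant-time hash probe, so a radius-r search costs only the number of codes within r bit flips, independent of the database size.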
Chapter 15
Representation Learning

In this chapter, we first discuss what it means to learn representations and how the notion of representation can be useful to design deep architectures. We discuss how learning algorithms share statistical strength across different tasks, including using information from unsupervised tasks to perform supervised tasks. Shared representations are useful to handle multiple modalities or domains, or to transfer learned knowledge to tasks for which few or no examples are given but a task representation exists. Finally, we step back and argue about the reasons for the success of representation learning, starting with the theoretical advantages of distributed representations (Hinton et al., 1986) and deep representations, and ending with the more general idea of underlying assumptions about the data generating process, in particular about underlying causes of the observed data.
Many information processing tasks can be very easy or very difficult depending on how the information is represented. This is a general principle applicable to daily life, to computer science in general, and to machine learning. For example, it is straightforward for a person to divide 210 by 6 using long division. The task becomes considerably less straightforward if it is instead posed using the Roman numeral representation of the numbers. Most modern people asked to divide CCX by VI would begin by converting the numbers to the Arabic numeral representation, permitting long division procedures that make use of the place value system. More concretely, we can quantify the asymptotic runtime of various operations using appropriate or inappropriate representations. For example, inserting a number
For example, inserting a num umb b er concretely , we can quantify the asymptotic run time of v arious op erations using in into to the correct p osition in a sorted list of num numbers bers is an O(n) op operation eration if the appropriate or inappropriate represen tations. For a numasb er O(log n) example, list is represented as a linked list, but only if the list inserting is represented a O ( n in to the correct p osition in a sorted list of num bers is an ) op eration if the red-blac red-black k tree. list is represented as a linked list, but only O(log n) if the list is represented as a In the context of mac machine hine learning, what mak makes es one representation b etter than red-black tree. In the context of machine learning, 529 what makes one representation b etter than 529
CHAPTER 15. REPRESENTATION LEARNING
another? Generally speaking, a good representation is one that makes a subsequent learning task easier. The choice of representation will usually depend on the choice of the subsequent learning task.

We can think of feedforward networks trained by supervised learning as performing a kind of representation learning. Specifically, the last layer of the network is typically a linear classifier, such as a softmax regression classifier. The rest of the network learns to provide a representation to this classifier. Training with a supervised criterion naturally leads to the representation at every hidden layer (but more so near the top hidden layer) taking on properties that make the classification task easier.
For example, classes that were not linearly separable in the input features may become linearly separable in the last hidden layer. In principle, the last layer could be another kind of model, such as a nearest neighbor classifier (Salakhutdinov and Hinton, 2007a). The features in the penultimate layer should learn different properties depending on the type of the last layer.

Supervised training of feedforward networks does not involve explicitly imposing any condition on the learned intermediate features. Other kinds of representation learning algorithms are often explicitly designed to shape the representation in some particular way. For example, suppose we want to learn a representation that makes density estimation easier. Distributions with more independences are easier to model, so we could design an objective function that encourages the elements of the representation vector h to be independent.
Just like supervised networks, unsupervised deep learning algorithms have a main training objective but also learn a representation as a side effect. Regardless of how a representation was obtained, it can be used for another task. Alternatively, multiple tasks (some supervised, some unsupervised) can be learned together with some shared internal representation.

Most representation learning problems face a tradeoff between preserving as much information about the input as possible and attaining nice properties (such as independence).

Representation learning is particularly interesting because it provides one way to perform unsupervised and semi-supervised learning. We often have very large amounts of unlabeled training data and relatively little labeled training data. Training with supervised learning techniques on the labeled subset often results in severe overfitting. Semi-supervised learning offers the chance to resolve this overfitting problem by also learning from the unlabeled data. Specifically, we can learn good representations for the unlabeled data, and then use these representations to solve the supervised learning task.

Humans and animals are able to learn from very few labeled examples. We do
CHAPTER 15. REPRESENTATION LEARNING
not yet know how this is possible. Many factors could explain improved human performance; for example, the brain may use very large ensembles of classifiers or Bayesian inference techniques. One popular hypothesis is that the brain is able to leverage unsupervised or semi-supervised learning. There are many ways to leverage unlabeled data. In this chapter, we focus on the hypothesis that the unlabeled data can be used to learn a good representation.
15.1 Greedy Layer-Wise Unsupervised Pretraining
Unsupervised learning played a key historical role in the revival of deep neural networks, allowing for the first time to train a deep supervised network without requiring architectural specializations like convolution or recurrence. We call this procedure unsupervised pretraining, or more precisely, greedy layer-wise unsupervised pretraining. This procedure is a canonical example of how a representation learned for one task (unsupervised learning, trying to capture the shape of the input distribution) can sometimes be useful for another task (supervised learning with the same input domain).

Greedy layer-wise unsupervised pretraining relies on a single-layer representation learning algorithm such as an RBM, a single-layer autoencoder, a sparse coding model, or another model that learns latent representations. Each layer is pretrained using unsupervised learning, taking the output of the previous layer and producing as output a new representation of the data, whose distribution (or its relation to other variables such as categories to predict) is hopefully simpler. See Algorithm 15.1 for a formal description.

Greedy layer-wise training procedures based on unsupervised criteria have long been used to sidestep the difficulty of jointly training the layers of a deep neural net for a supervised task. This approach dates back at least as far as the Neocognitron (Fukushima, 1975). The deep learning renaissance of 2006 began with the discovery that this greedy learning procedure could be used to find a good initialization for a joint learning procedure over all the layers, and that this approach could be used to successfully train even fully connected architectures (Hinton et al., 2006; Hinton and Salakhutdinov, 2006; Hinton, 2006; Bengio et al., 2007; Ranzato et al., 2007a). Prior to this discovery, only convolutional deep networks or networks whose depth resulted from recurrence were regarded as feasible to train. Today, we know that greedy layer-wise pretraining is not required to train fully connected deep architectures, but the unsupervised pretraining approach was the first method to succeed.

Greedy layer-wise pretraining is called greedy because it is a greedy algorithm,
meaning that it optimizes each piece of the solution independently, one piece at a time, rather than jointly optimizing all pieces. It is called layer-wise because these independent pieces are the layers of the network. Specifically, greedy layer-wise pretraining proceeds one layer at a time, training the k-th layer while keeping the previous ones fixed. In particular, the lower layers (which are trained first) are not adapted after the upper layers are introduced. It is called unsupervised because each layer is trained with an unsupervised representation learning algorithm. However, it is also called pretraining, because it is supposed to be only a first step before a joint training algorithm is applied to fine-tune all the layers together. In the context of a supervised learning task, it can be viewed as a regularizer (in some experiments, pretraining decreases test error without decreasing training error) and a form of parameter initialization.

It is common to use the word "pretraining" to refer not only to the pretraining stage itself but to the entire two-phase protocol that combines the pretraining phase and a supervised learning phase. The supervised learning phase may involve training a simple classifier on top of the features learned in the pretraining phase, or it may involve supervised fine-tuning of the entire network learned in the pretraining phase. No matter what kind of unsupervised learning algorithm or what model type is employed, in the vast majority of cases the overall training scheme is nearly the same. While the choice of unsupervised learning algorithm will obviously impact the details, most applications of unsupervised pretraining follow this basic protocol.

Greedy layer-wise unsupervised pretraining can also be used as initialization for other unsupervised learning algorithms, such as deep autoencoders (Hinton and Salakhutdinov, 2006) and probabilistic models with many layers of latent variables. Such models include deep belief networks (Hinton et al., 2006) and deep Boltzmann machines (Salakhutdinov and Hinton, 2009a). These deep generative models will be described in Chapter 20.

As discussed in Sec. 8.7.4, it is also possible to have greedy layer-wise supervised pretraining. This builds on the premise that training a shallow network is easier than training a deep one, which seems to have been validated in several contexts (Erhan et al., 2010).
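As a concrete illustration, the greedy layer-wise protocol of Algorithm 15.1 can be sketched in a few lines of numpy. This is a minimal sketch, not an implementation from the book: the single-layer unsupervised learner L is a hypothetical stand-in (a PCA projection followed by a tanh), where in practice one would use an RBM or autoencoder, and the fine-tuning step T is omitted.

```python
import numpy as np

def pca_layer(X, k):
    """Hypothetical stand-in for the unsupervised learner L: fits a
    k-component PCA on X and returns an encoder function f."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    W = Vt[:k].T
    # tanh nonlinearity so stacked layers are not just one linear map
    return lambda Z: np.tanh((Z - mu) @ W)

def greedy_pretrain(X, layer_sizes):
    """Algorithm 15.1 without the fine-tuning step: train one layer at a
    time on the previous layer's output, then compose the encoders."""
    encoders = []
    X_tilde = X
    for k in layer_sizes:
        f_k = pca_layer(X_tilde, k)   # f^(k) = L(X~)
        encoders.append(f_k)
        X_tilde = f_k(X_tilde)        # X~ <- f^(k)(X~)
    def f(Z):                         # f = f^(m) o ... o f^(1)
        for enc in encoders:
            Z = enc(Z)
        return Z
    return f

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
f = greedy_pretrain(X, [10, 5])
print(f(X).shape)  # (100, 5)
```

The composed encoder f could then be frozen and a classifier trained on f(X), or f could be used to initialize a network for supervised fine-tuning.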
15.1.1 When and Why Does Unsupervised Pretraining Work?
On many tasks, greedy layer-wise unsupervised pretraining can yield substantial improvements in test error for classification tasks. This observation was responsible for the renewed interest in deep neural networks starting in 2006 (Hinton et al.,
2006; Bengio et al., 2007; Ranzato et al., 2007a). On many other tasks, however, unsupervised pretraining either does not confer a benefit or even causes noticeable harm. Ma et al. (2015) studied the effect of pretraining on machine learning models for chemical activity prediction and found that, on average, pretraining was slightly harmful, but for many tasks was significantly helpful. Because unsupervised pretraining is sometimes helpful but often harmful, it is important to understand when and why it works in order to determine whether it is applicable to a particular task.

Algorithm 15.1 Greedy layer-wise unsupervised pretraining protocol.
Given the following: an unsupervised feature learning algorithm L, which takes a training set of examples and returns an encoder or feature function f. The raw input data is X, with one row per example, and f^(1)(X) is the output of the first stage encoder on X and the dataset used by the second level unsupervised feature learner. In the case where fine-tuning is performed, we use a learner T which takes an initial function f, input examples X (and, in the supervised fine-tuning case, associated targets Y), and returns a tuned function. The number of stages is m.

f ← Identity function
X̃ = X
for k = 1, ..., m do
    f^(k) = L(X̃)
    f ← f^(k) ∘ f
    X̃ ← f^(k)(X̃)
end for
if fine-tuning then
    f ← T(f, X, Y)
end if
Return f

At the outset, it is important to clarify that most of this discussion is restricted to greedy unsupervised pretraining in particular. There are other, completely different paradigms for performing semi-supervised learning with neural networks, such as virtual adversarial training described in Sec. 7.13. It is also possible to train an autoencoder or generative model at the same time as the supervised model.
Examples of this single-stage approach include the discriminative RBM (Larochelle and Bengio, 2008) and the ladder network (Rasmus et al., 2015), in which the total objective is an explicit sum of the two terms (one using the labels and one only using the input).

Unsupervised pretraining combines two different ideas. First, it makes use of
the idea that the choice of initial parameters for a deep neural network can have a significant regularizing effect on the model (and, to a lesser extent, that it can improve optimization). Second, it makes use of the more general idea that learning about the input distribution can help to learn about the mapping from inputs to outputs.

Both of these ideas involve many complicated interactions between several parts of the machine learning algorithm that are not entirely understood.

The first idea, that the choice of initial parameters for a deep neural network can have a strong regularizing effect on its performance, is the least well understood. At the time that pretraining became popular, it was understood as initializing the model in a location that would cause it to approach one local minimum rather than another. Today, local minima are no longer considered to be a serious problem for neural network optimization. We now know that our standard neural network training procedures usually do not arrive at a critical point of any kind. It remains possible that pretraining initializes the model in a location that would otherwise be inaccessible: for example, a region that is surrounded by areas where the cost function varies so much from one example to another that minibatches give only a very noisy estimate of the gradient, or a region surrounded by areas where the Hessian matrix is so poorly conditioned that gradient descent methods must use very small steps. However, our ability to characterize exactly what aspects of the pretrained parameters are retained during the supervised training stage is limited. This is one reason that modern approaches typically use simultaneous unsupervised learning and supervised learning rather than two sequential stages. One may also avoid struggling with these complicated ideas about how optimization in the supervised learning stage preserves information from the unsupervised learning stage by simply freezing the parameters for the feature extractors and using supervised learning only to add a classifier on top of the learned features.

The other idea, that a learning algorithm can use information learned in the unsupervised phase to perform better in the supervised learning stage, is better understood. The basic idea is that some features that are useful for the unsupervised task may also be useful for the supervised learning task. For example, if we train a generative model of images of cars and motorcycles, it will need to know about wheels, and about how many wheels should be in an image. If we are fortunate, the representation of the wheels will take on a form that is easy for the supervised learner to access. This is not yet understood at a mathematical, theoretical level, so it is not always possible to predict which tasks will benefit from unsupervised learning in this way. Many aspects of this approach are highly dependent on the specific models used. For example, if we wish to add a linear classifier on
top of pretrained features, the features must make the underlying classes linearly separable. These properties often occur naturally but do not always do so. This is another reason that simultaneous supervised and unsupervised learning can be preferable: the constraints imposed by the output layer are naturally included from the start.

From the point of view of unsupervised pretraining as learning a representation, we can expect unsupervised pretraining to be more effective when the initial representation is poor. One key example of this is the use of word embeddings. Words represented by one-hot vectors are not very informative, because every two distinct one-hot vectors are the same distance from each other (squared L2 distance of 2). Learned word embeddings naturally encode similarity between words by their distance from each other. Because of this, unsupervised pretraining is especially useful when processing words. It is less useful when processing images, perhaps because images already lie in a rich vector space where distances provide a low quality similarity metric.
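The equidistance of one-hot vectors is easy to verify, and a small embedding shows how learned vectors can instead encode similarity. Note the embedding below is a toy with hand-picked values, not one learned from data:

```python
import numpy as np

# Any two distinct one-hot vectors have squared L2 distance
# 1^2 + 1^2 = 2, regardless of which words they encode.
V = 10                      # vocabulary size (arbitrary)
one_hot = np.eye(V)
d2 = np.sum((one_hot[3] - one_hot[7]) ** 2)
print(d2)                   # 2.0 for every distinct pair

# Learned embeddings, by contrast, can place related words close together.
emb = {"cat": np.array([0.9, 0.1]),
       "dog": np.array([0.8, 0.2]),
       "car": np.array([-0.7, 0.6])}
near = np.sum((emb["cat"] - emb["dog"]) ** 2)   # small: related words
far = np.sum((emb["cat"] - emb["car"]) ** 2)    # large: unrelated words
print(near < far)  # True
```

Because the one-hot geometry carries no similarity information at all, a pretrained embedding gives the supervised learner a strictly more informative starting point.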
From the point of view of unsupervised pretraining as a regularizer, we can expect unsupervised pretraining to be most helpful when the number of labeled examples is very small. Because the source of information added by unsupervised pretraining is the unlabeled data, we may also expect unsupervised pretraining to perform best when the number of unlabeled examples is very large. The advantage of semi-supervised learning via unsupervised pretraining with many unlabeled examples and few labeled examples was made particularly clear in 2011, with unsupervised pretraining winning two international transfer learning competitions (Mesnil et al., 2011; Goodfellow et al., 2011), in settings where the number of labeled examples in the target task was small (from a handful to dozens of examples per class). These effects were also documented in carefully controlled experiments by Paine et al. (2014).

Other factors are likely to be involved. For example, unsupervised pretraining is likely to be most useful when the function to be learned is extremely complicated. Unsupervised learning differs from regularizers like weight decay because it does not bias the learner toward discovering a simple function, but rather toward discovering feature functions that are useful for the unsupervised learning task. If the true underlying functions are complicated and shaped by regularities of the input distribution, unsupervised learning can be a more appropriate regularizer.

These caveats aside, we now analyze some success cases where unsupervised pretraining is known to cause an improvement, and explain what is known about why this improvement occurs. Unsupervised pretraining has usually been used to improve classifiers, and is usually most interesting from the point of view of
CHAPTER 15. REPRESENTATION LEARNING
Figure 15.1: Visualization via nonlinear projection of the learning trajectories of different neural networks in function space (not parameter space, to avoid the issue of many-to-one mappings from parameter vectors to functions), with different random initializations and with or without unsupervised pretraining. Each point corresponds to a different neural network at a particular time during its training process. This figure is adapted with permission from Erhan et al. (2010). A coordinate in function space is an infinite-dimensional vector associating every input x with an output y. Erhan et al. (2010) made a linear projection to high-dimensional space by concatenating the y for many specific x points. They then made a further nonlinear projection to 2-D by Isomap (Tenenbaum et al., 2000). Color indicates time. All networks are initialized near the center of the plot (corresponding to the region of functions that produce approximately uniform distributions over the class y for most inputs). Over time, learning moves the function outward, to points that make strong predictions. Training consistently terminates in one region when using pretraining and in another, non-overlapping region when not using pretraining. Isomap tries to preserve global relative distances (and hence volumes) so the small region corresponding to pretrained models may indicate that the pretraining-based estimator has reduced variance.
reducing test set error. However, unsupervised pretraining can help tasks other than classification, and can act to improve optimization rather than being merely a regularizer. For example, it can improve both train and test reconstruction error for deep autoencoders (Hinton and Salakhutdinov, 2006).

Erhan et al. (2010) performed many experiments to explain several successes of unsupervised pretraining. Both improvements to training error and improvements to test error may be explained in terms of unsupervised pretraining taking the parameters into a region that would otherwise be inaccessible. Neural network training is non-deterministic, and converges to a different function every time it is run. Training may halt at a point where the gradient becomes small, a point where early stopping ends training to prevent overfitting, or at a point where the gradient is large but it is difficult to find a downhill step due to problems such as stochasticity or poor conditioning of the Hessian. Neural networks that receive unsupervised pretraining consistently halt in the same region of function space, while neural networks without pretraining consistently halt in another region. See Fig. 15.1 for a visualization of this phenomenon. The region where pretrained networks arrive is smaller, suggesting that pretraining reduces the variance of the estimation process, which can in turn reduce the risk of severe overfitting. In other words, unsupervised pretraining initializes neural network parameters into a region that they do not escape, and the results following this initialization are more consistent and less likely to be very bad than without this initialization.

Erhan et al. (2010) also provide some answers as to when pretraining works best: the mean and variance of the test error were most reduced by pretraining for deeper networks. Keep in mind that these experiments were performed before the invention and popularization of modern techniques for training very deep networks (rectified linear units, dropout and batch normalization), so less is known about the effect of unsupervised pretraining in conjunction with contemporary approaches.

An important question is how unsupervised pretraining can act as a regularizer. One hypothesis is that pretraining encourages the learning algorithm to discover features that relate to the underlying causes that generate the observed data. This is an important idea motivating many other algorithms besides unsupervised pretraining, and is described further in Sec. 15.3.
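The two-phase procedure under discussion, unsupervised pretraining followed by supervised training, can be sketched in miniature. Everything below is an illustrative assumption (a linear autoencoder, tiny layer sizes, hand-picked step sizes), and for simplicity the second phase trains only a logistic head on the frozen pretrained encoder, whereas full fine-tuning would also update the encoder:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: plentiful unlabeled data, only a few labeled examples.
X_u = rng.normal(size=(200, 10))             # unlabeled set
X_l = X_u[:20]                               # small labeled subset
y_l = (X_l[:, 0] > 0).astype(float)          # binary labels

# Phase 1: unsupervised pretraining of the encoder as a linear autoencoder.
W_enc = rng.normal(scale=0.1, size=(10, 5))
W_dec = rng.normal(scale=0.1, size=(5, 10))

def recon_loss():
    return float(((X_u @ W_enc @ W_dec - X_u) ** 2).mean())

loss_before_pretrain = recon_loss()
for _ in range(300):
    H = X_u @ W_enc
    E = (H @ W_dec - X_u) / len(X_u)         # scaled reconstruction error
    W_enc -= 0.05 * X_u.T @ (E @ W_dec.T)    # descend the reconstruction loss
    W_dec -= 0.05 * H.T @ E
loss_after_pretrain = recon_loss()

# Phase 2: supervised training of a logistic head on the pretrained features.
H_l = X_l @ W_enc                            # features from the frozen encoder
w, b = np.zeros(5), 0.0

def sup_loss():
    p = 1 / (1 + np.exp(-(H_l @ w + b)))
    return float(-np.mean(y_l * np.log(p + 1e-9) + (1 - y_l) * np.log(1 - p + 1e-9)))

loss_before_sup = sup_loss()
for _ in range(300):
    p = 1 / (1 + np.exp(-(H_l @ w + b)))
    w -= 0.1 * H_l.T @ (p - y_l) / len(X_l)  # descend the cross-entropy
    b -= 0.1 * float(np.mean(p - y_l))
loss_after_sup = sup_loss()
```

Each phase has its own objective and its own hyperparameters; in practice the number of phase-1 iterations is often set by early stopping on the reconstruction loss, as noted later in this section.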
Compared to other ways of incorporating this belief by using unsupervised learning, unsupervised pretraining has the disadvantage that it operates with two separate training phases. One reason that these two training phases are disadvantageous is that there is not a single hyperparameter that predictably reduces or increases the strength of the regularization arising from the unsupervised pretraining. Instead, there are very many hyperparameters, whose effect may be
measured after the fact but is often difficult to predict ahead of time. When we perform unsupervised and supervised learning simultaneously, instead of using the pretraining strategy, there is a single hyperparameter, usually a coefficient attached to the unsupervised cost, that determines how strongly the unsupervised objective will regularize the supervised model. One can always predictably obtain less regularization by decreasing this coefficient. In the case of unsupervised pretraining, there is not a way of flexibly adapting the strength of the regularization: either the supervised model is initialized to pretrained parameters, or it is not.

Another disadvantage of having two separate training phases is that each phase has its own hyperparameters. The performance of the second phase usually cannot be predicted during the first phase, so there is a long delay between proposing hyperparameters for the first phase and being able to update them using feedback from the second phase. The most principled approach is to use validation set error in the supervised phase in order to select the hyperparameters of the pretraining phase, as discussed in Larochelle et al. (2009). In practice, some hyperparameters, like the number of pretraining iterations, are more conveniently set during the pretraining phase, using early stopping on the unsupervised objective, which is not ideal but computationally much cheaper than using the supervised objective.

Today, unsupervised pretraining has been largely abandoned, except in the field of natural language processing, where the natural representation of words as one-hot vectors conveys no similarity information and where very large unlabeled sets are available. In that case, the advantage of pretraining is that one can pretrain once on a huge unlabeled set (for example with a corpus containing billions of words), learn a good representation (typically of words, but also of sentences), and then use this representation or fine-tune it for a supervised task for which the training set contains substantially fewer examples. This approach was pioneered by Collobert and Weston (2008b), Turian et al. (2010), and Collobert et al. (2011a) and remains in common use today.

Deep learning techniques based on supervised learning, regularized with dropout or batch normalization, are able to achieve human-level performance on very many tasks, but only with extremely large labeled datasets. These same techniques outperform unsupervised pretraining on medium-sized datasets such as CIFAR-10 and MNIST, which have roughly 5,000 labeled examples per class. On extremely small datasets, such as the alternative splicing dataset, Bayesian methods outperform methods based on unsupervised pretraining (Srivastava, 2013). For these reasons, the popularity of unsupervised pretraining has declined. Nevertheless, unsupervised pretraining remains an important milestone in the history of deep learning research and continues to influence contemporary approaches.
The idea of pretraining has been generalized to supervised pretraining, discussed in Sec. 8.7.4, as a very common approach for transfer learning. Supervised pretraining for transfer learning is popular (Oquab et al., 2014; Yosinski et al., 2014) for use with convolutional networks pretrained on the ImageNet dataset. Practitioners publish the parameters of these trained networks for this purpose, just like pretrained word vectors are published for natural language tasks (Collobert et al., 2011a; Mikolov et al., 2013a).
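Pretrained word vectors are useful precisely because, as noted at the start of this section, distance in the embedding space encodes similarity between words. A toy illustration, with made-up 4-dimensional vectors standing in for real pretrained embeddings (which typically have hundreds of dimensions):

```python
import numpy as np

# Made-up embeddings; real pretrained vectors come from the published
# parameter sets mentioned above.
emb = {
    "king":  np.array([0.8, 0.6, 0.1, 0.0]),
    "queen": np.array([0.7, 0.7, 0.2, 0.0]),
    "apple": np.array([0.0, 0.1, 0.9, 0.5]),
}

def cosine(u, v):
    # Cosine similarity: 1 for identical directions, 0 for orthogonal ones.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_related = cosine(emb["king"], emb["queen"])
sim_unrelated = cosine(emb["king"], emb["apple"])
# Related words lie closer together: sim_related > sim_unrelated.
```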
15.2  Transfer Learning and Domain Adaptation
Transfer learning and domain adaptation refer to the situation where what has been learned in one setting (i.e., distribution P1) is exploited to improve generalization in another setting (say distribution P2). This generalizes the idea presented in the previous section, where we transferred representations between an unsupervised learning task and a supervised learning task.

In transfer learning, the learner must perform two or more different tasks, but we assume that many of the factors that explain the variations in P1 are relevant to the variations that need to be captured for learning P2. This is typically understood in a supervised learning context, where the input is the same but the target may be of a different nature. For example, we may learn about one set of visual categories, such as cats and dogs, in the first setting, then learn about a different set of visual categories, such as ants and wasps, in the second setting. If there is significantly more data in the first setting (sampled from P1), then that may help to learn representations that are useful to quickly generalize from only very few examples drawn from P2. Many visual categories share low-level notions of edges and visual shapes, the effects of geometric changes, changes in lighting, etc. In general, transfer learning, multi-task learning (Sec. 7.7), and domain adaptation can be achieved via representation learning when there exist features that are useful for the different settings or tasks, corresponding to underlying factors that appear in more than one setting. This is illustrated in Fig. 7.2, with shared lower layers and task-dependent upper layers.

However, sometimes, what is shared among the different tasks is not the semantics of the input but the semantics of the output. For example, a speech recognition system needs to produce valid sentences at the output layer, but the earlier layers near the input may need to recognize very different versions of the same phonemes or sub-phonemic vocalizations depending on which person is speaking. In cases like these, it makes more sense to share the upper layers (near the output) of the neural network, and have a task-specific preprocessing, as
illustrated in Fig. 15.2.
Figure 15.2: Example architecture for multi-task or transfer learning when the output variable y has the same semantics for all tasks while the input variable x has a different meaning (and possibly even a different dimension) for each task (or, for example, each user), called x(1), x(2) and x(3) for three tasks. The lower levels (up to the selection switch) are task-specific, while the upper levels are shared. The lower levels learn to translate their task-specific input into a generic set of features.

In the related case of domain adaptation, the task (and the optimal input-to-output mapping) remains the same between each setting, but the input distribution is slightly different. For example, consider the task of sentiment analysis, which consists of determining whether a comment expresses positive or negative sentiment. Comments posted on the web come from many categories. A domain adaptation scenario can arise when a sentiment predictor trained on customer reviews of media content such as books, videos and music is later used to analyze comments about consumer electronics such as televisions or smartphones. One can imagine that there is an underlying function that tells whether any statement is positive, neutral or negative, but of course the vocabulary and style may vary from one domain to another, making it more difficult to generalize across domains. Simple unsupervised pretraining (with denoising autoencoders) has been found to be very successful for sentiment analysis with domain adaptation (Glorot et al., 2011b).
Simple neutral or negative, but (with of course the vocabulary and has style mayfound varytofrom one unsup unsupervised ervised pretraining denoising auto autoenco enco encoders) ders) b een b e very domain to for another, making it more difficult toadaptation generalize across Simple successful sentimen sentiment t analysis with domain (Glorotdomains. et al., 2011b ). unsup ervised pretraining (with denoising auto enco ders) has b een found to b e very A related problem is that of conc oncept ept drift drift,, whic which h we can view as a form of transfer successful for sentiment analysis with domain adaptation (Glorot et al., 2011b). learning due to gradual changes in the data distribution ov over er time. Both concept related problem is that of bconc ept drift which we can view a form oflearning. transfer driftAand transfer learning can e viewed as, particular forms of as multi-task learning due to gradual changes in the data distribution over time. Both concept drift and transfer learning can b e viewed540 as particular forms of multi-task learning.
While the phrase “multi-task learning” typically refers to supervised learning tasks, the more general notion of transfer learning is applicable to unsupervised learning and reinforcement learning as well.

In all of these cases, the objective is to take advantage of data from the first setting to extract information that may be useful when learning or even when directly making predictions in the second setting. The core idea of representation learning is that the same representation may be useful in both settings. Using the same representation in both settings allows the representation to benefit from the training data that is available for both tasks.

As mentioned before, unsupervised deep learning for transfer learning has found success in some machine learning competitions (Mesnil et al., 2011; Goodfellow et al., 2011). In the first of these competitions, the experimental setup is the following. Each participant is first given a dataset from the first setting (from distribution P1), illustrating examples of some set of categories. The participants must use this to learn a good feature space (mapping the raw input to some representation), such that when we apply this learned transformation to inputs from the transfer setting (distribution P2), a linear classifier can be trained and generalize well from very few labeled examples. One of the most striking results found in this competition is that as an architecture makes use of deeper and deeper representations (learned in a purely unsupervised way from data collected in the first setting, P1), the learning curve on the new categories of the second (transfer) setting P2 becomes much better. For deep representations, fewer labeled examples of the transfer tasks are necessary to achieve the apparently asymptotic generalization performance.
Two extreme forms of transfer learning are one-shot learning and zero-shot learning, sometimes also called zero-data learning. Only one labeled example of the transfer task is given for one-shot learning, while no labeled examples are given at all for the zero-shot learning task.

One-shot learning (Fei-Fei et al., 2006) is possible because the representation learns to cleanly separate the underlying classes during the first stage. During the transfer learning stage, only one labeled example is needed to infer the label of many possible test examples that all cluster around the same point in representation space. This works to the extent that the factors of variation corresponding to these invariances have been cleanly separated from the other factors, in the learned
representation space, and we have somehow learned which factors do and do not matter when discriminating objects of certain categories.

As an example of a zero-shot learning setting, consider the problem of having a learner read a large collection of text and then solve object recognition problems.
CHAPTER 15. REPRESENTATION LEARNING
It may be possible to recognize a specific object class even without having seen an image of that object, if the text describes the object well enough. For example, having read that a cat has four legs and pointy ears, the learner might be able to guess that an image is a cat, without having seen a cat before.

Zero-data learning (Larochelle et al., 2008) and zero-shot learning (Palatucci et al., 2009; Socher et al., 2013b) are only possible because additional information has been exploited during training. We can think of the zero-data learning scenario as including three random variables: the traditional inputs x, the traditional outputs or targets y, and an additional random variable describing the task, T. The model is trained to estimate the conditional distribution p(y | x, T), where T is a description of the task we wish the model to perform. In our example of recognizing cats after having read about cats, the output is a binary variable y,
with y = 1 indicating “yes” and y = 0 indicating “no.” The task variable T then represents questions to be answered, such as “Is there a cat in this image?” If we have a training set containing unsupervised examples of objects that live in the same space as T, we may be able to infer the meaning of unseen instances of T. In our example of recognizing cats without having seen an image of the cat, it is important that we have had unlabeled text data containing sentences such as “cats have four legs” or “cats have pointy ears.”

Zero-shot learning requires T to be represented in a way that allows some sort of generalization. For example, T cannot be just a one-hot code indicating an object category. Socher et al. (2013b) provide instead a distributed representation of object categories by using a learned word embedding for the word associated with each category.
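A minimal sketch of why a distributed task representation enables zero-shot generalization: describe each category by a vector of attributes rather than a one-hot code, and score an input's inferred attributes against every category's vector; a category with no training images can still be selected. All attribute values and the hardcoded "predictor" below are invented for illustration:

```python
import numpy as np

# Hypothetical class descriptions T as attribute vectors
# (four_legs, pointy_ears, has_wings): a distributed code, not one-hot.
task_desc = {
    "cat":  np.array([1.0, 1.0, 0.0]),
    "dog":  np.array([1.0, 0.0, 0.0]),
    "bird": np.array([0.0, 0.0, 1.0]),  # no training images of birds
}

def predict_attributes(image):
    # A real system would infer attributes from pixels; here we
    # hardcode a plausible output for a picture of a bird.
    return np.array([0.1, 0.0, 0.9])

def zero_shot_classify(attrs, task_desc):
    """Pick the class whose description best matches the attributes."""
    return max(task_desc, key=lambda c: float(attrs @ task_desc[c]))

attrs = predict_attributes(None)
print(zero_shot_classify(attrs, task_desc))  # -> bird
```

With a one-hot code for T, the "bird" row would share no components with anything learned during training, and no such generalization would be possible.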
A similar phenomenon happens in machine translation (Klementiev et al., 2012; Mikolov et al., 2013b; Gouws et al., 2014): we have words in one language, and the relationships between words can be learned from unilingual corpora; on the other hand, we have translated sentences which relate words in one language with words in the other. Even though we may not have labeled examples translating word A in language X to word B in language Y, we can generalize and guess a translation for word A because we have learned a distributed representation for words in language X, a distributed representation for words in language Y, and created a link (possibly two-way) relating the two spaces, via training examples consisting of matched pairs of sentences in both languages.
This transfer will be most successful if all three ingredients (the two representations and the relations between them) are learned jointly.

Zero-shot learning is a particular form of transfer learning. The same principle explains how one can perform multi-modal learning, capturing a representation in
Figure 15.3: Transfer learning between two domains x and y enables zero-shot learning. Labeled or unlabeled examples of x allow one to learn a representation function f_x, and similarly with examples of y to learn f_y. Each application of the f_x and f_y functions appears as an upward arrow, with the style of the arrows indicating which function is applied. Distance in h_x space provides a similarity metric between any pair of points in x space that may be more meaningful than distance in x space. Likewise, distance in h_y space provides a similarity metric between any pair of points in y space. Both of these similarity functions are indicated with dotted bidirectional arrows. Labeled examples (dashed horizontal lines) are pairs (x, y) which allow one to learn a one-way or two-way map (solid bidirectional arrow) between the representations f_x(x) and the representations f_y(y) and anchor these representations to each other. Zero-data learning is then enabled as follows. One can associate an image x_test to a word y_test, even if no image of that word was ever presented, simply because word-representations f_y(y_test) and image-representations f_x(x_test) can be related to each other via the maps between representation spaces. It works because, although that image and that word were never paired, their respective feature vectors f_x(x_test) and f_y(y_test) have been related to each other. Figure inspired from a suggestion by Hrant Khachatrian.
one modality, a representation in the other, and the relationship (in general a joint distribution) between pairs (x, y) consisting of one observation x in one modality and another observation y in the other modality (Srivastava and Salakhutdinov, 2012). By learning all three sets of parameters (from x to its representation, from y to its representation, and the relationship between the two representations), concepts in one representation are anchored in the other, and vice-versa, allowing one to meaningfully generalize to new pairs. The procedure is illustrated in Fig. 15.3.
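The linking of two representation spaces through matched pairs can be sketched as a least-squares map between embedding spaces; an unseen point in one space is then associated with its nearest neighbor in the other. The toy vocabulary and random embeddings below are invented; real systems would use learned word or image representations:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical embeddings in two spaces related by an unknown rotation.
d = 5
words = ["cat", "dog", "house", "tree", "car", "boat"]
true_map = np.linalg.qr(rng.normal(size=(d, d)))[0]   # hidden relation
hx = {w: rng.normal(size=d) for w in words}           # x-space embeddings
hy = {w: hx[w] @ true_map for w in words}             # y-space embeddings

# Learn a linear map W from a few matched pairs (the "translated
# sentences"); "boat" is held out entirely.
pairs = words[:5]
X = np.stack([hx[w] for w in pairs])
Y = np.stack([hy[w] for w in pairs])
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Zero-shot association: map the held-out x-embedding into y-space and
# find its nearest neighbor among all y-embeddings.
z = hx["boat"] @ W
nearest = min(hy, key=lambda w: float(np.linalg.norm(z - hy[w])))
print(nearest)  # -> boat
```

The pairs anchor the two spaces; once W is learned, points never seen in a pair can still be matched across spaces.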
15.3  Semi-Supervised Disentangling of Causal Factors
An important question about representation learning is “what makes one representation better than another?” One hypothesis is that an ideal representation is one in which the features within the representation correspond to the underlying causes of the observed data, with separate features or directions in feature space corresponding to different causes, so that the representation disentangles the causes from one another. This hypothesis motivates approaches in which we first seek a good representation for p(x). Such a representation may also be a good representation for computing p(y | x) if y is among the most salient causes of x. This idea has guided a large amount of deep learning research since at least the 1990s; see Becker and Hinton (1992) and Hinton and Sejnowski (1999) for more detail. For other arguments about when semi-supervised learning can outperform pure
supervised learning, we refer the reader to Sec. 1.2 of Chapelle et al. (2006).

In other approaches to representation learning, we have often been concerned with a representation that is easy to model, for example one whose entries are sparse, or independent from each other. A representation that cleanly separates the underlying causal factors may not necessarily be one that is easy to model. However, a further part of the hypothesis motivating semi-supervised learning via unsupervised representation learning is that for many AI tasks, these two properties coincide: once we are able to obtain the underlying explanations for what we observe, it generally becomes easy to isolate individual attributes from the others. Specifically, if a representation h represents many of the underlying causes of the observed x, and the outputs y are among the most salient causes, then it is easy to predict y from h.
causes of the observed x , and the outputs y are among the most salient causes, First, let us see ho how w semi-sup semi-supervised ervised learning can fail b ecause unsup unsupervised ervised then it is easy to predict y from h. learning of p(x) is of no help to learn p( y | x). Consider for example the case First, see howdistributed semi-sup ervised b)ecause ervised, | x]. Clearly p (x )let f (x = E[yunsup where is us uniformly and welearning wan antt to can learnfail Clearly, learning of ) is of no help to learn ) . Consider for example the y x p ( x p ( observing a training set of x values alone giv gives es us no informationEab about out p(y case | x). where p (x ) is uniformly distributed and we w|ant to learn f ( x) = [y x]. Clearly, 544 gives us no information ab|out p(y x). observing a training set of x values alone |
Figure 15.4: Example of a density over x that is a mixture over three components. The component identity is an underlying explanatory factor, y. Because the mixture components (e.g., natural object classes in image data) are statistically salient, just modeling p(x) in an unsupervised way with no labeled example already reveals the factor y.
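The scenario of Fig. 15.4 can be simulated directly: cluster unlabeled draws from a well-separated mixture, then attach a single labeled example to each cluster. The sketch below uses plain 1-D k-means on made-up data; any density model that recovers the components would serve:

```python
import random
import statistics

random.seed(0)

# Unlabeled data: a well-separated three-component mixture over x,
# with one component per value of y (made-up means and spread).
means = {1: -5.0, 2: 0.0, 3: 5.0}
xs = [random.gauss(m, 0.5) for m in means.values() for _ in range(200)]

# Unsupervised step: model the components with 1-D k-means
# (Lloyd's algorithm), initialized at the data's extremes and mean.
centers = [min(xs), statistics.mean(xs), max(xs)]
for _ in range(20):
    clusters = [[] for _ in centers]
    for x in xs:
        i = min(range(3), key=lambda i: abs(x - centers[i]))
        clusters[i].append(x)
    centers = [statistics.mean(c) for c in clusters]

# Semi-supervised step: ONE labeled example per class names the
# clusters and thereby defines p(y | x).
labeled = [(-5.0, 1), (0.0, 2), (5.0, 3)]
name = {min(range(3), key=lambda i: abs(x - centers[i])): y
        for x, y in labeled}

def predict(x):
    return name[min(range(3), key=lambda i: abs(x - centers[i]))]

print(predict(4.7), predict(-4.9), predict(0.3))  # -> 3 1 2
```

Almost all of the work is done by the unsupervised stage; the three labels only name the components it already found.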
Next, let us see a simple example of how semi-supervised learning can succeed. Consider the situation where x arises from a mixture, with one mixture component per value of y, as illustrated in Fig. 15.4. If the mixture components are well-separated, then modeling p(x) reveals precisely where each component is, and a single labeled example of each class will then be enough to perfectly learn p(y | x).

But more generally, what could make p(y | x) and p(x) be tied together? If y is closely associated with one of the causal factors of x, then p(x) and p(y | x) will be strongly tied, and unsupervised representation learning that tries to disentangle the underlying factors of variation is likely to be useful as a semi-supervised learning strategy.

Consider the assumption that y is one of the causal factors of x, and let h represent all those factors. The true generative process can be conceived as structured according to this directed graphical model, with h as the parent of x:

p(h, x) = p(x | h)p(h).     (15.1)

As a consequence, the data has marginal probability

p(x) = E_h p(x | h).     (15.2)

From this straightforward observation, we conclude that the best possible model of x (from a generalization point of view) is the one that uncovers the above “true”
structure, with h as a latent variable that explains the observed variations in x. The “ideal” representation learning discussed above should thus recover these latent factors. If y is one of these (or closely related to one of them), then it will be very easy to learn to predict y from such a representation. We also see that the conditional distribution of y given x is tied by Bayes’ rule to the components in the above equation:

p(y | x) = p(x | y)p(y) / p(x).     (15.3)

Thus the marginal p(x) is intimately tied to the conditional p(y | x), and knowledge of the structure of the former should be helpful to learn the latter. Therefore, in situations respecting these assumptions, semi-supervised learning should improve performance.

An important research problem regards the fact that most observations are formed by an extremely large number of underlying causes. Suppose y = h_i, but the unsupervised learner does not know which h_i. The brute force solution is for an unsupervised learner to learn a representation that captures all the reasonably salient generative factors h_j and disentangles them from each other, thus making it easy to predict y from h, regardless of which h_i is associated with y.

In practice, the brute force solution is not feasible because it is not possible to capture all or most of the factors of variation that influence an observation. For example, in a visual scene, should the representation always encode all of the smallest objects in the background? It is a well-documented psychological phenomenon that human beings fail to perceive changes in their environment that are not immediately relevant to the task they are performing; see, e.g., Simons and Levin (1998). An important research frontier in semi-supervised learning is determining what to encode in each situation. Currently, two of the main strategies for dealing with a large number of underlying causes are to use a supervised learning signal at the same time as the unsupervised learning signal, so that the model will choose to capture the most relevant factors of variation, or to use much larger representations if using purely unsupervised learning.

An emerging strategy for unsupervised learning is to modify the definition of which underlying causes are most salient. Historically, autoencoders and generative models have been trained to optimize a fixed criterion, often similar to mean squared error. These fixed criteria determine which causes are considered salient. For example, mean squared error applied to the pixels of an image implicitly specifies that an underlying cause is only salient if it significantly changes the brightness of a large number of pixels. This can be problematic if the task we wish to solve involves interacting with small objects. See Fig. 15.5 for an example
Figure 15.5: An autoencoder trained with mean squared error for a robotics task has failed to reconstruct a ping pong ball. The existence of the ping pong ball and all of its spatial coordinates are important underlying causal factors that generate the image and are relevant to the robotics task. Unfortunately, the autoencoder has limited capacity, and the training with mean squared error did not identify the ping pong ball as being salient enough to encode. Images graciously provided by Chelsea Finn.
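The failure shown in Fig. 15.5 follows directly from the loss: a small object occupies few pixels, so omitting it moves the mean squared error less than a mild global brightness change does. A toy numeric illustration with made-up "images" as arrays:

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

# A 32x32 "scene": uniform background plus a small bright 2x2 ball.
scene = np.full((32, 32), 0.5)
scene[10:12, 10:12] = 1.0

# Reconstruction A: perfect, except the ball is missing entirely.
missing_ball = np.full((32, 32), 0.5)

# Reconstruction B: the ball is kept, but the whole image is dimmed
# by 0.1, a visually mild and task-irrelevant change.
dimmed = scene - 0.1

print(mse(scene, missing_ball))  # 4 pixels off by 0.5 -> 1/1024, ~0.001
print(mse(scene, dimmed))        # every pixel off by 0.1 -> 0.01
# MSE penalizes dropping the task-critical ball far less than mild
# global dimming, so an MSE-trained model may simply omit the ball.
```

Under this criterion, a capacity-limited model minimizing MSE will spend its capacity on the background rather than the ball.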
of a robotics task in which an autoencoder has failed to learn to encode a small ping pong ball. This same robot is capable of successfully interacting with larger objects, such as baseballs, which are more salient according to mean squared error.

Other definitions of salience are possible. For example, if a group of pixels follow a highly recognizable pattern, even if that pattern does not involve extreme brightness or darkness, then that pattern could be considered extremely salient. One way to implement such a definition of salience is to use a recently developed approach called generative adversarial networks (Goodfellow et al., 2014c). In this approach, a generative model is trained to fool a feedforward classifier. The feedforward classifier attempts to recognize all samples from the generative model as being fake, and all samples from the training set as being real. In this framework, any structured pattern that the feedforward network can recognize is highly salient. The generative adversarial network will be described in more detail in Sec. 20.10.4. For the purposes of the present discussion, it is sufficient to understand that such networks learn how to determine what is salient. Lotter et al. (2015) showed that models trained to generate images of human heads will often neglect to generate the ears when trained with mean squared error, but will successfully generate the ears when trained with the adversarial framework. Because the ears are not extremely bright or dark compared to the surrounding skin, they are not especially salient according to mean squared error loss, but their highly recognizable shape and consistent
CHAPTER 15. REPRESENTATION LEARNING
[Figure 15.6: two rows of three image panels each — Ground Truth, MSE, Adversarial]
Figure 15.6: Predictive generative networks provide an example of the importance of learning which features are salient. In this example, the predictive generative network has been trained to predict the appearance of a 3-D model of a human head at a specific viewing angle. (Left) Ground truth. This is the correct image, that the network should emit. (Center) Image produced by a predictive generative network trained with mean squared error alone. Because the ears do not cause an extreme difference in brightness compared to the neighboring skin, they were not sufficiently salient for the model to learn to represent them. (Right) Image produced by a model trained with a combination of mean squared error and adversarial loss. Using this learned cost function, the ears are salient because they follow a predictable pattern. Learning which underlying causes are important and relevant enough to model is an important active area of research. Figures graciously provided by Lotter et al. (2015).
position means that a feedforward network can easily learn to detect them, making them highly salient under the generative adversarial framework. See Fig. 15.6 for example images. Generative adversarial networks are only one step toward determining which factors should be represented. We expect that future research will discover better ways of determining which factors to represent, and develop mechanisms for representing different factors depending on the task.

A benefit of learning the underlying causal factors, as pointed out by Schölkopf et al. (2012), is that if the true generative process has x as an effect and y as a cause, then modeling p(x | y) is robust to changes in p(y). If the cause-effect relationship were reversed, this would not be true, since by Bayes' rule, p(x | y) would be sensitive to changes in p(y). Very often, when we consider changes in distribution due to different domains, temporal non-stationarity, or changes in the nature of the task, the causal mechanisms remain invariant ("the laws of the universe are constant") while the marginal distribution over the underlying causes can change. Hence, better generalization and robustness to all kinds of changes can be expected via learning a generative model that attempts to recover
the causal factors h and p(x | h).

15.4 Distributed Representation

Distributed representations of concepts (representations composed of many elements that can be set separately from each other) are one of the most important tools for representation learning. Distributed representations are powerful because they can use n features with k values to describe k^n different concepts. As we have seen throughout this book, both neural networks with multiple hidden units and probabilistic models with multiple latent variables make use of the strategy of distributed representation. We now introduce an additional observation. Many deep learning algorithms are motivated by the assumption that the hidden units can learn to represent the underlying causal factors that explain the data, as discussed in Sec. 15.3. Distributed representations are natural for this approach, because each direction in representation space can correspond to the value of a different underlying configuration variable.

An example of a distributed representation is a vector of n binary features, which can take 2^n configurations, each potentially corresponding to a different region in input space, as illustrated in Fig. 15.7. This can be compared with a symbolic representation, where the input is associated with a single symbol or category. If there are n symbols in the dictionary, one can imagine n feature detectors, each corresponding to the detection of the presence of the associated category. In that case only n different configurations of the representation space are possible, carving n different regions in input space, as illustrated in Fig. 15.8. Such a symbolic representation is also called a one-hot representation, since it can be captured by a binary vector with n bits that are mutually exclusive (only one of them can be active). A symbolic representation is a specific example of the broader class of non-distributed representations, which are representations that may contain many entries but without significant meaningful separate control over each entry.

Examples of learning algorithms based on non-distributed representations include:

• Clustering methods, including the k-means algorithm: each input point is assigned to exactly one cluster.

• k-nearest neighbors algorithms: one or a few templates or prototype examples are associated with a given input. In the case of k > 1, there are multiple
[Figure 15.7: three lines h1, h2, h3 partition the plane; regions labeled with the codes h = [1,0,0]^T, [1,1,0]^T, [1,0,1]^T, [1,1,1]^T, [0,1,0]^T, [0,1,1]^T, and [0,0,1]^T]
Figure 15.7: Illustration of how a learning algorithm based on a distributed representation breaks up the input space into regions. In this example, there are three binary features h1, h2, and h3. Each feature is defined by thresholding the output of a learned, linear transformation. Each feature divides R^2 into two half-planes. Let h_i^+ be the set of input points for which h_i = 1 and h_i^- be the set of input points for which h_i = 0. In this illustration, each line represents the decision boundary for one h_i, with the corresponding arrow pointing to the h_i^+ side of the boundary. The representation as a whole takes on a unique value at each possible intersection of these half-planes. For example, the representation value [1, 1, 1]^T corresponds to the region h_1^+ ∩ h_2^+ ∩ h_3^+. Compare this to the non-distributed representations in Fig. 15.8. In the general case of d input dimensions, a distributed representation divides R^d by intersecting half-spaces rather than half-planes. The distributed representation with n features assigns unique codes to O(n^d) different regions, while the nearest neighbor algorithm with n examples assigns unique codes to only n regions. The distributed representation is thus able to distinguish exponentially many more regions than the non-distributed one. Keep in mind that not all h values are feasible (there is no h = 0 in this example) and that a linear classifier on top of the distributed representation is not able to assign different class identities to every neighboring region; even a deep linear-threshold network has a VC dimension of only O(w log w), where w is the number of weights (Sontag, 1998). The combination of powerful representation layer and weak classifier layer can be a strong regularizer; a classifier trying to learn the concept of "person" versus "not a person" does not need to assign a different class to an input represented as "woman with glasses" than it assigns to an input represented as "man without glasses." This capacity constraint encourages each classifier to focus on few h_i and encourages h to learn to represent the classes in a linearly separable way.
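The construction in Fig. 15.7 can be reproduced in a few lines. The sketch below thresholds three linear functions over a dense grid and collects the binary codes that actually occur; the three lines are an arbitrary illustrative choice in general position, not the ones in the figure. Exactly 7 of the 2^3 = 8 codes appear, with the all-zeros code infeasible:

```python
import numpy as np
from itertools import product

# Three half-plane features h_i(x) = 1 if w_i . x + b_i > 0 (illustrative choice).
W = np.array([[1.0, 0.0],     # h1 = 1 iff x > 0
              [0.0, 1.0],     # h2 = 1 iff y > 0
              [-1.0, -1.0]])  # h3 = 1 iff x + y < 1
b = np.array([0.0, 0.0, 1.0])

grid = np.linspace(-3.05, 3.05, 100)          # offsets keep points off the lines
points = np.array(list(product(grid, grid)))
codes = {tuple(c) for c in (points @ W.T + b > 0).astype(int)}

print(len(codes))            # 7 distinct codes, not 2**3 = 8
print((0, 0, 0) in codes)    # False: x < 0, y < 0 and x + y > 1 cannot all hold
```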
values describing each input, but they cannot be controlled separately from each other, so this does not qualify as a true distributed representation.

• Decision trees: only one leaf (and the nodes on the path from root to leaf) is activated when an input is given.

• Gaussian mixtures and mixtures of experts: the templates (cluster centers) or experts are now associated with a degree of activation. As with the k-nearest neighbors algorithm, each input is represented with multiple values, but those values cannot readily be controlled separately from each other.

• Kernel machines with a Gaussian kernel (or other similarly local kernel): although the degree of activation of each "support vector" or template example is now continuous-valued, the same issue arises as with Gaussian mixtures.

• Language or translation models based on n-grams: the set of contexts (sequences of symbols) is partitioned according to a tree structure of suffixes. A leaf may correspond to the last two words being w1 and w2, for example. Separate parameters are estimated for each leaf of the tree (with some sharing being possible).

For some of these non-distributed algorithms, the output is not constant by parts but instead interpolates between neighboring regions. The relationship between the number of parameters (or examples) and the number of regions they can define remains linear.

An important related concept that distinguishes a distributed representation from a symbolic one is that generalization arises due to shared attributes between different concepts. As pure symbols, "cat" and "dog" are as far from each other as any other two symbols. However, if one associates them with a meaningful distributed representation, then many of the things that can be said about cats can generalize to dogs and vice versa.
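This contrast between symbolic and distributed codes can be made concrete with toy vectors. In the sketch below every attribute value is invented for illustration; the point is only that one-hot codes make every pair of distinct symbols equidistant, while a distributed code over shared attributes places "cat" near "dog" and far from "car":

```python
import numpy as np

# One-hot (symbolic) codes: all pairs of distinct symbols are equally far apart.
one_hot = {"cat": np.array([1.0, 0.0, 0.0]),
           "dog": np.array([0.0, 1.0, 0.0]),
           "car": np.array([0.0, 0.0, 1.0])}

# Toy distributed codes over hypothetical attributes
# [has_fur, legs / 4, is_vehicle, size]; the values are invented.
dist = {"cat": np.array([1.0, 1.0, 0.0, 0.2]),
        "dog": np.array([1.0, 1.0, 0.0, 0.6]),
        "car": np.array([0.0, 0.0, 1.0, 0.8])}

def d(rep, a, b):
    return float(np.linalg.norm(rep[a] - rep[b]))

print(d(one_hot, "cat", "dog"), d(one_hot, "cat", "car"))  # identical distances
print(d(dist, "cat", "dog"), d(dist, "cat", "car"))        # cat is far closer to dog
```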
For example, our distributed representation may contain entries such as "has_fur" or "number_of_legs" that have the same value for the embedding of both "cat" and "dog." Neural language models that operate on distributed representations of words generalize much better than other models that operate directly on one-hot representations of words, as discussed in Sec. 12.4. Distributed representations induce a rich similarity space, in which semantically close concepts (or inputs) are close in distance, a property that is absent from purely symbolic representations.

When and why can there be a statistical advantage from using a distributed representation as part of a learning algorithm? Distributed representations can
Figure 15.8: Illustration of how the nearest neighbor algorithm breaks up the input space into different regions. The nearest neighbor algorithm provides an example of a learning algorithm based on a non-distributed representation. Different non-distributed algorithms may have different geometry, but they typically break the input space into regions, with a separate set of parameters for each region. The advantage of a non-distributed approach is that, given enough parameters, it can fit the training set without solving a difficult optimization problem, because it is straightforward to choose a different output independently for each region. The disadvantage is that such non-distributed models generalize only locally via the smoothness prior, making it difficult to learn a complicated function with more peaks and troughs than the available number of examples. Contrast this with a distributed representation, Fig. 15.7.
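The partition in Fig. 15.8 is equally simple to reproduce. In the sketch below the four prototype locations are arbitrary; every query point in a grid is assigned the index of its nearest stored example, so the representation can distinguish at most n regions for n examples:

```python
import numpy as np
from itertools import product

prototypes = np.array([[0.0, 0.0], [2.0, 1.0],
                       [-1.0, 2.0], [1.0, -2.0]])   # n = 4 stored examples

grid = np.linspace(-3, 3, 61)
points = np.array(list(product(grid, grid)))

# Non-distributed code: each input activates exactly one symbol,
# the index of its nearest prototype.
sq_dists = ((points[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
labels = sq_dists.argmin(axis=1)

print(len(set(labels.tolist())))   # 4: one region per stored example, never more
```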
have a statistical advantage when an apparently complicated structure can be compactly represented using a small number of parameters. Some traditional non-distributed learning algorithms generalize only due to the smoothness assumption, which states that if u ≈ v, then the target function f to be learned has the property that f(u) ≈ f(v), in general. There are many ways of formalizing such an assumption, but the end result is that if we have an example (x, y) for which we know that f(x) ≈ y, then we choose an estimator f̂ that approximately satisfies these constraints while changing as little as possible when we move to a nearby input x + ε. This assumption is clearly very useful, but it suffers from the curse of dimensionality: in order to learn a target function that increases and decreases many times in many different regions,¹ we may need a number of examples that is at least as large as the number of distinguishable regions. One can think of each of these regions as a category or symbol: by having a separate degree of freedom for each symbol (or region), we can learn an arbitrary decoder mapping from symbol to value. However, this does not allow us to generalize to new symbols for new regions.

If we are lucky, there may be some regularity in the target function, besides being smooth. For example, a convolutional network with max-pooling can recognize an object regardless of its location in the image, even though spatial translation of the object may not correspond to smooth transformations in the input space.

Let us examine a special case of a distributed representation learning algorithm, one that extracts binary features by thresholding linear functions of the input. Each binary feature in this representation divides R^d into a pair of half-spaces, as illustrated in Fig. 15.7. The exponentially large number of intersections of n of the corresponding half-spaces determines how many regions this distributed representation learner can distinguish. How many regions are generated by an arrangement of n hyperplanes in R^d? By applying a general result concerning the intersection of hyperplanes (Zaslavsky, 1975), one can show (Pascanu et al., 2014b) that the number of regions this binary feature representation can distinguish is

    \sum_{j=0}^{d} \binom{n}{j} = O(n^d).    (15.4)

Therefore, we see a growth that is exponential in the input size and polynomial in the number of hidden units.

¹ Potentially, we may want to learn a function whose behavior is distinct in exponentially many regions: in a d-dimensional space with at least 2 different values to distinguish per dimension, we might want f to differ in 2^d different regions, requiring O(2^d) training examples.
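The count in Eq. 15.4 is straightforward to evaluate; the minimal sketch below uses a function name of our own choosing. Note that when n ≤ d every one of the 2^n binary codes is feasible, while for fixed d the count grows only polynomially in n:

```python
from math import comb

def max_regions(n, d):
    """Maximum number of regions that n hyperplanes in general position
    carve out of R^d (Zaslavsky, 1975): the sum over j = 0..d of C(n, j)."""
    return sum(comb(n, j) for j in range(d + 1))

print(max_regions(3, 2))    # 7: three lines in the plane, as in Fig. 15.7
print(max_regions(3, 5))    # 8 = 2**3: with n <= d, all binary codes are feasible
print(max_regions(100, 2))  # 5051: quadratic growth in n for fixed d = 2
```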
This pro provides vides a geometric argument to explain the generalization pow power er of distributed representation: with O (nd) parameters (for n linear-threshold features a geometric argument to explain thespace. generalization powmade er of d ) wepro in RThis canvides distinctly represent representO in input If instead we O (nd) regions O ( nd n distributed representation: parameters (for linear-threshold no about outwith the data,)and used a representation with onefeatures unique Rassumption at all ab in ) we can distinctly represent ) regions in input space. If instead we made O ( n sym symb b ol for each region, and separate parameters for each symbol to recognize its no assumption all abofout data, and used with one unique d O (anrepresentation O (nd ) R dthe corresp corresponding onding pat ortion , then sp specifying ecifying ) regions would require symb ol forMore each region, and separate parameters forthe each symbol torepresen recognize its examples. generally generally, ,Rthe argument in fa fav vor of distributed representation tation O (nwe) correspbonding p ortion of case , then sp ecifying ) regions require could e extended to the where insteadOof(nusing linearwould threshold units examples. More generally , the argument in fa v or of the distributed represen tation use nonlinear, p ossibly contin continuous, uous, feature extractors for each of the attributes in could b e extended to the case where instead of linear threshold units we the distributed represen representation. tation. The argument in using this case is that if a parametric use nonlinear, pwith ossibly continuous, can feature for eachinofinput the attributes in k parameters transformation learnextractors ab about out r regions space, with the represensuc tation. 
The argument this case is task that of if in a terest, parametric k distributed r, and if obtaining such h a representation wasinuseful to the interest, then k r transformation with parameters can learn ab out regions in input space, with we could p otentially generalize muc uch h b etter in this wa way y than in a non-distributed k r, and if obtaining sucneed h a representation wastouseful to the of in terest, then O (r) examples setting where we would obtain thetask same features and w e could p otentially generalize m uc h b etter in this wa y than in a non-distributed asso associated ciated partitioning of the input space into r regions. Using few fewer er parameters to O ( r setting where we w ould need ) examples to obtain the same and represen representt the mo model del means that we hav havee few fewer er parameters to fit, andfeatures thus require asso ciated partitioning of the input space into regions. Using few er parameters to r far fewer training examples to generalize well. represent the mo del means that we have fewer parameters to fit, and thus require A further part of the argument for why mo models dels based on distributed represenfar fewer training examples to generalize well. tations generalize well is that their capacity remains limited despite being able to A further part of man the yargument why mo dels based on represendistinctly enco encode de so many different for regions. For example, thedistributed VC dimension of a tations generalize well is that their capacity remains limited despite being able to neural netw network ork of linear threshold units is only O(w log w ), where w is the num number ber distinctly de man).y This different regions. 
For example, the VC dimension of a neural network of linear threshold units is only O(w log w), where w is the number of weights (Sontag, 1998). This limitation arises because, while we can assign very many unique codes to representation space, we cannot use absolutely all of the code space, nor can we learn arbitrary functions mapping from the representation space h to the output y using a linear classifier. The use of a distributed representation combined with a linear classifier thus expresses a prior belief that the classes to be recognized are linearly separable as a function of the underlying causal factors captured by h. We will typically want to learn categories such as the set of all images of all green objects or the set of all images of cars, but not categories that require nonlinear, XOR logic. For example, we typically do not want to partition
the data into the set of all red cars and green trucks as one class and the set of all green cars and red trucks as another class.

The ideas discussed so far have been abstract, but they may be experimentally validated. Zhou et al. (2015) find that hidden units in a deep convolutional network trained on the ImageNet and Places benchmark datasets learn features that are very often interpretable, corresponding to a label that humans would naturally assign. In practice it is certainly not always the case that hidden units learn something that has a simple linguistic name, but it is interesting to see this emerge near the top levels of the best computer vision deep networks. What such
CHAPTER 15. REPRESENTATION LEARNING
Figure 15.9: A generative model has learned a distributed representation that disentangles the concept of gender from the concept of wearing glasses. If we begin with the representation of the concept of a man with glasses, then subtract the vector representing the concept of a man without glasses, and finally add the vector representing the concept of a woman without glasses, we obtain the vector representing the concept of a woman with glasses. The generative model correctly decodes all of these representation vectors to images that may be recognized as belonging to the correct class. Images reproduced with permission from Radford et al. (2015).
features have in common is that one could imagine learning about each of them without having to see all the configurations of all the others. Radford et al. (2015) demonstrated that a generative model can learn a representation of images of faces, with separate directions in representation space capturing different underlying factors of variation. Fig. 15.9 demonstrates that one direction in representation space corresponds to whether the person is male or female, while another corresponds to whether the person is wearing glasses. These features were discovered automatically, not fixed a priori. There is no need to have labels for the hidden unit classifiers: gradient descent on an objective function of interest naturally learns semantically interesting features, so long as the task requires such features.
We can learn about the distinction between male and female, or about the presence or absence of glasses, without having to characterize all of the configurations of the n − 1 other features by examples covering all of these combinations of values. This form of statistical separability is what allows one to generalize to new configurations of a person's features that have never been seen during training.
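The arithmetic of Fig. 15.9 can be written out on toy vectors. The codes below are invented purely for illustration; a trained model such as Radford et al.'s learns such directions from data. Here one coordinate stands for gender, another for glasses, and the rest hold unrelated factors.

```python
import numpy as np

# Invented 4-D codes: coordinate 0 ~ gender, coordinate 1 ~ glasses,
# coordinates 2-3 ~ unrelated factors (nothing here comes from a real model).
man_with_glasses      = np.array([ 1.0, 1.0, 0.3, -0.2])
man_without_glasses   = np.array([ 1.0, 0.0, 0.3, -0.2])
woman_without_glasses = np.array([-1.0, 0.0, 0.3, -0.2])

result = man_with_glasses - man_without_glasses + woman_without_glasses

# Subtracting removes the "man" direction, adding restores "woman";
# the "glasses" coordinate and the unrelated factors pass through untouched.
woman_with_glasses = np.array([-1.0, 1.0, 0.3, -0.2])
print(np.allclose(result, woman_with_glasses))  # True
```

Because each factor occupies its own direction, editing one factor leaves the others intact, which is exactly the statistical separability described above.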
15.5 Exponential Gains from Depth
We have seen in Sec. 6.4.1 that multilayer perceptrons are universal approximators, and that some functions can be represented by exponentially smaller deep networks compared to shallow networks. This decrease in model size leads to improved statistical efficiency. In this section, we describe how similar results apply more generally to other kinds of models with distributed hidden representations.

In Sec. 15.4, we saw an example of a generative model that learned about the explanatory factors underlying images of faces, including the person's gender and whether they are wearing glasses. The generative model that accomplished this task was based on a deep neural network. It would not be reasonable to expect a shallow network, such as a linear network, to learn the complicated relationship between these abstract explanatory factors and the pixels in the image.
In this and other AI tasks, the factors that can be chosen almost independently in order to generate data are more likely to be very high-level and related in highly nonlinear ways to the input. We argue that this demands deep distributed representations, where the higher level features (seen as functions of the input) or factors (seen as generative causes) are obtained through the composition of many nonlinearities.

It has been proven in many different settings that organizing computation through the composition of many nonlinearities and a hierarchy of reused features can give an exponential boost to statistical efficiency, on top of the exponential boost given by using a distributed representation. Many kinds of networks (e.g., with saturating nonlinearities, Boolean gates, sum/products, or RBF units) with a single hidden layer can be shown to be universal approximators.
A model family that is a universal approximator can approximate a large class of functions (including all continuous functions) up to any non-zero tolerance level, given enough hidden units. However, the required number of hidden units may be very large. Theoretical results concerning the expressive power of deep architectures state that there are families of functions that can be represented efficiently by an architecture of depth k, but would require an exponential number of hidden units (with respect to the input size) with insufficient depth (depth 2 or depth k − 1).

In Sec. 6.4.1, we saw that deterministic feedforward networks are universal approximators of functions. Many structured probabilistic models with a single hidden layer of latent variables, including restricted Boltzmann machines and deep belief networks, are
universal approximators of probability distributions (Le Roux and Bengio, 2008, 2010; Montúfar and Ay, 2011; Montúfar, 2014; Krause et al., 2013).

In Sec. 6.4.1, we saw that a sufficiently deep feedforward network can have an
exponential advantage over a network that is too shallow. Such results can also be obtained for other models such as probabilistic models. One such probabilistic model is the sum-product network or SPN (Poon and Domingos, 2011). These models use polynomial circuits to compute the probability distribution over a set of random variables. Delalleau and Bengio (2011) showed that there exist probability distributions for which a minimum depth of SPN is required to avoid needing an exponentially large model. Later, Martens and Medabalimi (2014) showed that there are significant differences between every two finite depths of SPN, and that some of the constraints used to make SPNs tractable may limit their representational power.

Another interesting development is a set of theoretical results for the expressive
power of families of deep circuits related to convolutional nets, highlighting an exponential advantage for the deep circuit even when the shallow circuit is allowed to only approximate the function computed by the deep circuit (Cohen et al., 2015). By comparison, previous theoretical work made claims regarding only the case where the shallow circuit must exactly replicate particular functions.
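A small numerical illustration of such exponential gains (our sketch, not one of the cited constructions): composing the piecewise-linear "tent" map g(x) = 1 − |2x − 1| with itself k times yields a function with 2^k linear pieces, so the number of regions distinguished grows exponentially with depth while the parameter count grows only linearly.

```python
import numpy as np

# Our sketch, not one of the cited constructions: each "layer" applies the
# tent map g(x) = 1 - |2x - 1|, which folds [0, 1] onto itself. k composed
# layers give a piecewise-linear function with 2^k pieces from O(k) parameters;
# a single-hidden-layer network would need on the order of 2^k units.
def tent(x):
    return 1.0 - np.abs(2.0 * x - 1.0)

def count_linear_pieces(k):
    # Use a dyadic grid so every kink of the composed map lands on a grid point.
    n = 2 ** (k + 3)
    x = np.linspace(0.0, 1.0, n + 1)
    y = x
    for _ in range(k):
        y = tent(y)
    slopes = np.round(np.diff(y) * n)   # exact slopes (+/- 2^k) on this grid
    return 1 + int(np.sum(slopes[1:] != slopes[:-1]))

for k in range(1, 6):
    print(k, count_linear_pieces(k))    # prints 2, 4, 8, 16, 32 pieces
```

Each fold reuses the pieces created by the previous layers, which is the hierarchy-of-reused-features effect described above.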
15.6 Providing Clues to Discover Underlying Causes
To close this chapter, we come back to one of our original questions: what makes one representation better than another? One answer, first introduced in Sec. 15.3, is that an ideal representation is one that disentangles the underlying causal factors of variation that generated the data, especially those factors that are relevant to our applications. Most strategies for representation learning are based on introducing clues that help the learner find these underlying factors of variation. The clues can help the learner separate these observed factors from the others. Supervised learning provides a very strong clue: a label y, presented with each x, that usually specifies the value of at least one of the factors of variation directly. More generally, to make use of abundant unlabeled data, representation learning makes use of other, less direct, hints about the underlying factors.
These hints take the form of implicit prior beliefs that we, the designers of the learning algorithm, impose in order to guide the learner. Results such as the no free lunch theorem show that regularization strategies are necessary to obtain good generalization. While it is impossible to find a universally superior regularization strategy, one goal of deep learning is to find a set of fairly generic regularization strategies that are applicable to a wide variety of AI tasks, similar to the tasks that people and animals are able to solve.

We provide here a list of these generic regularization strategies. The list is
clearly not exhaustive, but gives some concrete examples of ways that learning algorithms can be encouraged to discover features that correspond to underlying factors. This list was introduced in Sec. 3.1 of Bengio et al. (2013d) and has been partially expanded here.

• Smoothness: This is the assumption that f(x + εd) ≈ f(x) for unit d
and small ε. This assumption allows the learner to generalize from training examples to nearby points in input space. Many machine learning algorithms leverage this idea, but it is insufficient to overcome the curse of dimensionality.

• Linearity: Many learning algorithms assume that relationships between some variables are linear. This allows the algorithm to make predictions even very far from the observed data, but can sometimes lead to overly extreme predictions. Most simple machine learning algorithms that do not make the smoothness assumption instead make the linearity assumption. These are in fact different assumptions: linear functions with large weights applied to high-dimensional spaces may not be very smooth. See Goodfellow et al. (2014b) for a further discussion of the limitations of the linearity assumption.

• Multiple explanatory factors: Many representation learning
algorithms are motivated by the assumption that the data is generated by multiple underlying explanatory factors, and that most tasks can be solved easily given the state of each of these factors. Sec. 15.3 describes how this view motivates semi-supervised learning via representation learning. Learning the structure of p(x) requires learning some of the same features that are useful for modeling p(y | x) because both refer to the same underlying explanatory factors. Sec. 15.4 describes how this view motivates the use of distributed representations, with separate directions in representation space corresponding to separate factors of variation.

• Causal factors: the model is constructed in such a way that it treats the factors of variation described by the learned representation h as the causes of the observed data x, and not vice-versa. As discussed in Sec. 15.3, this
15.3, this factors of variation edervised by the learning learned represen tation as the causes • is adv for describ semi-sup and makes thehlearned mo advantageous antageous semi-supervised model del of the observ ed data , and not vice-v ersa. As discussed in Sec. 15.3 , this x more robust when the distribution ov over er the underlying causes changes or is adv antageous for semi-sup ervised learning and makes the learned mo del when we use the mo model del for a new task. more robust when the distribution over the underlying causes changes or • when Depth we usea the mo del for aorganization new task. , or hierarchical of explanatory factors factors:: Highlev level, el, abstract concepts can b e defined in terms of simple concepts, forming a Depth hierarchical organization of ofexplanatory factors : Highhierarc hierarch h,yor . Farom another p oint of view, the use a deep arc architecture hitecture expresses level,b elief abstract b e defined in terms of simple a • our thatconcepts the taskcan should b e accomplished via a mconcepts, ulti-step forming program, hierarchy. From another p oint of view, the use of a deep architecture expresses our b elief that the task should b558 e accomplished via a multi-step program,
with each step referring back to the output of the processing accomplished via previous steps.

• Shared factors across tasks: In the context where we have many tasks, corresponding to different y_i variables sharing the same input x, or where each task is associated with a subset or a function f^(i)(x) of a global input x, the assumption is that each y_i is associated with a different subset from a common pool of relevant factors h. Because these subsets overlap, learning all the P(y_i | x) via a shared intermediate representation P(h | x) allows sharing of statistical strength between the tasks.

• Manifolds: Probability mass concentrates, and the regions in which it concentrates are locally connected and occupy a tiny volume. In the continuous
case, these regions can be approximated by low-dimensional manifolds with a much smaller dimensionality than the original space where the data lives. Many machine learning algorithms behave sensibly only on this manifold (Goodfellow et al., 2014b). Some machine learning algorithms, especially autoencoders, attempt to explicitly learn the structure of the manifold.

• Natural clustering: Many machine learning algorithms assume that each connected manifold in the input space may be assigned to a single class. The data may lie on many disconnected manifolds, but the class remains constant within each one of these. This assumption motivates a variety of learning algorithms, including tangent propagation, double backprop, the manifold tangent classifier and adversarial training.
• Temporal and spatial coherence: Slow feature analysis and related algorithms make the assumption that the most important explanatory factors change slowly over time, or at least that it is easier to predict the true underlying explanatory factors than to predict raw observations such as pixel values. See Sec. 13.3 for further description of this approach.

• Sparsity: Most features should presumably not be relevant to describing most inputs: there is no need to use a feature that detects elephant trunks when representing an image of a cat. It is therefore reasonable to impose a prior that any feature that can be interpreted as "present" or "absent" should be absent most of the time.

• Simplicity of Factor Dependencies: In good high-level representations, the factors are related to each other through simple dependencies.
The simplest possible is marginal independence, P(h) = ∏_i P(h_i), but linear
dependencies or those captured by a shallow autoencoder are also reasonable assumptions. This can be seen in many laws of physics, and is assumed when plugging a linear predictor or a factorized prior on top of a learned representation.

The concept of representation learning ties together all of the many forms of deep learning. Feedforward and recurrent networks, autoencoders and deep probabilistic models all learn and exploit representations. Learning the best possible representation remains an exciting avenue of research.
Chapter 16

Structured Probabilistic Models for Deep Learning

Deep learning draws upon many modeling formalisms that researchers can use to guide their design efforts and describe their algorithms. One of these formalisms is the idea of structured probabilistic models. We have already discussed structured probabilistic models briefly in Sec. 3.14. That brief presentation was sufficient to understand how to use structured probabilistic models as a language to describe some of the algorithms in Part II. Now, in Part III, structured probabilistic models are a key ingredient of many of the most important research topics in deep learning. In order to prepare to discuss these research ideas, this chapter describes structured probabilistic models in much greater detail. This chapter is intended to be self-contained; the reader does not need to review the earlier introduction before continuing with this chapter.

A structured probabilistic model is a way of describing a probability distribution, using a graph to describe which random variables in the probability distribution interact with each other directly. Here we use "graph" in the graph theory sense—a set of vertices connected to one another by a set of edges. Because the structure of the model is defined by a graph, these models are often also referred to as graphical models.

The graphical models research community is large and has developed many different models, training algorithms, and inference algorithms. In this chapter, we provide basic background on some of the most central ideas of graphical models, with an emphasis on the concepts that have proven most useful to the deep learning research community. If you already have a strong background in graphical models, you may wish to skip most of this chapter. However, even a graphical model expert may benefit from reading the final section of this chapter, Sec. 16.7, in which we highlight some of the unique ways that graphical models are used for deep learning algorithms. Deep learning practitioners tend to use very different model structures, learning algorithms and inference procedures than are commonly used by the rest of the graphical models research community. In this chapter, we identify these differences in preferences and explain the reasons for them.

In this chapter we first describe the challenges of building large-scale probabilistic models. Next, we describe how to use a graph to describe the structure of a probability distribution. While this approach allows us to overcome many challenges, it is not without its own complications. One of the major difficulties in graphical modeling is understanding which variables need to be able to interact directly, i.e., which graph structures are most suitable for a given problem. We outline two approaches to resolving this difficulty by learning about the dependencies in Sec. 16.5. Finally, we close with a discussion of the unique emphasis that deep learning practitioners place on specific approaches to graphical modeling in Sec. 16.7.
16.1 The Challenge of Unstructured Modeling
The goal of deep learning is to scale machine learning to the kinds of challenges needed to solve artificial intelligence. This means being able to understand high-dimensional data with rich structure. For example, we would like AI algorithms to be able to understand natural images,¹ audio waveforms representing speech, and documents containing multiple words and punctuation characters.

¹ A natural image is an image that might be captured by a camera in a reasonably ordinary environment, as opposed to a synthetically rendered image, a screenshot of a web page, etc.

Classification algorithms can take an input from such a rich high-dimensional distribution and summarize it with a categorical label—what object is in a photo, what word is spoken in a recording, what topic a document is about. The process of classification discards most of the information in the input and produces a single output (or a probability distribution over values of that single output). The classifier is also often able to ignore many parts of the input. For example, when recognizing an object in a photo, it is usually possible to ignore the background of the photo.

It is possible to ask probabilistic models to do many other tasks. These tasks are often more expensive than classification. Some of them require producing multiple output values. Most require a complete understanding of the entire structure of the input, with no option to ignore sections of it. These tasks include the following:

• Density estimation: given an input x, the machine learning system returns an estimate of the true density p(x) under the data generating distribution. This requires only a single output, but it does require a complete understanding of the entire input. If even one element of the vector is unusual, the system must assign it a low probability.

• Denoising: given a damaged or incorrectly observed input x̃, the machine learning system returns an estimate of the original or correct x. For example, the machine learning system might be asked to remove dust or scratches from an old photograph. This requires multiple outputs (every element of the estimated clean example x) and an understanding of the entire input (since even one damaged area will still reveal the final estimate as being damaged).

• Missing value imputation: given the observations of some elements of x, the model is asked to return estimates of or a probability distribution over some or all of the unobserved elements of x. This requires multiple outputs. Because the model could be asked to restore any of the elements of x, it must understand the entire input.

• Sampling: the model generates new samples from the distribution p(x). Applications include speech synthesis, i.e. producing new waveforms that sound like natural human speech. This requires multiple output values and a good model of the entire input. If the samples have even one element drawn from the wrong distribution, then the sampling process is wrong.

For an example of a sampling task using small natural images, see Fig. 16.1.

Modeling a rich distribution over thousands or millions of random variables is a challenging task, both computationally and statistically. Suppose we only wanted to model binary variables. This is the simplest possible case, and yet already it seems overwhelming. For a small, 32 × 32 pixel color (RGB) image, there are 2^3072 possible binary images of this form. This number is over 10^800 times larger than the estimated number of atoms in the universe.

In general, if we wish to model a distribution over a random vector x containing n discrete variables capable of taking on k values each, then the naive approach of representing P(x) by storing a lookup table with one probability value per possible outcome requires k^n parameters!

This is not feasible for several reasons:
Figure 16.1: Probabilistic modeling of natural images. (Top) Example 32 × 32 pixel color images from the CIFAR-10 dataset (Krizhevsky and Hinton, 2009). (Bottom) Samples drawn from a structured probabilistic model trained on this dataset. Each sample appears at the same position in the grid as the training example that is closest to it in Euclidean space. This comparison allows us to see that the model is truly synthesizing new images, rather than memorizing the training data. Contrast of both sets of images has been adjusted for display. Figure reproduced with permission from Courville et al. (2011).
• Memory: the cost of storing the representation: For all but very small values of n and k, representing the distribution as a table will require too many values to store.

• Statistical efficiency: As the number of parameters in a model increases, so does the amount of training data needed to choose the values of those parameters using a statistical estimator. Because the table-based model has an astronomical number of parameters, it will require an astronomically large training set to fit accurately. Any such model will overfit the training set very badly unless additional assumptions are made linking the different entries in the table (for example, like in back-off or smoothed n-gram models, Sec. 12.4.1).

• Runtime: the cost of inference: Suppose we want to perform an inference task where we use our model of the joint distribution P(x) to compute some other distribution, such as the marginal distribution P(x_1) or the conditional distribution P(x_2 | x_1). Computing these distributions will require summing across the entire table, so the runtime of these operations is as high as the intractable memory cost of storing the model.

• Runtime: the cost of sampling: Likewise, suppose we want to draw a sample from the model. The naive way to do this is to sample some value u ~ U(0, 1), then iterate through the table adding up the probability values until they exceed u and return the outcome whose probability value was added last. This requires reading through the whole table in the worst case, so it has the same exponential cost as the other operations.
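The naive sampling procedure in the last bullet is easy to sketch directly. The tiny table below is invented for illustration; a real joint table over n binary variables would have 2^n entries, which is exactly the problem — the first two lines check the text's 32 × 32 image arithmetic with exact integers.

```python
import random

# Even the simplest case is overwhelming: a 32 x 32 binary RGB image has
# 2**3072 possible configurations, more than 10**800 times the roughly
# 10**80 atoms estimated to be in the observable universe.
n_configurations = 2 ** (32 * 32 * 3)
assert n_configurations > 10 ** 800 * 10 ** 80

# A tiny explicit lookup table: P(x) for every outcome of two binary variables.
table = {(0, 0): 0.4, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.1}

def sample_from_table(table, rng=random):
    """Naive table-based sampling: draw u ~ U(0, 1), then scan the table
    accumulating probability mass until the running sum exceeds u.
    Worst case reads the whole table, so the cost is exponential in the
    number of variables."""
    u = rng.random()
    cumulative = 0.0
    for outcome, p in table.items():
        cumulative += p
        if u < cumulative:
            return outcome
    return outcome  # guard against floating-point round-off

print(sample_from_table(table))
```

With many draws, the empirical frequency of each outcome approaches its table entry, but each draw may touch every entry of the table — the exponential cost described above.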
The problem with the table-based approach is that we are explicitly modeling every possible kind of interaction between every possible subset of variables. The probability distributions we encounter in real tasks are much simpler than this. Usually, most variables influence each other only indirectly.

For example, consider modeling the finishing times of a team in a relay race. Suppose the team consists of three runners: Alice, Bob and Carol. At the start of the race, Alice carries a baton and begins running around a track. After completing her lap around the track, she hands the baton to Bob. Bob then runs his own lap and hands the baton to Carol, who runs the final lap. We can model each of their finishing times as a continuous random variable. Alice's finishing time does not depend on anyone else's, since she goes first. Bob's finishing time depends on Alice's, because Bob does not have the opportunity to start his lap until Alice has completed hers. If Alice finishes faster, Bob will finish faster, all else being equal. Finally, Carol's finishing time depends on both her teammates. If Alice is slow, Bob will probably finish late too. As a consequence, Carol will have quite a late starting time and thus is likely to have a late finishing time as well. However, Carol's finishing time depends only indirectly on Alice's finishing time via Bob's. If we already know Bob's finishing time, we will not be able to estimate Carol's finishing time better by finding out what Alice's finishing time was. This means we can model the relay race using only two interactions: Alice's effect on Bob and Bob's effect on Carol. We can omit the third, indirect interaction between Alice and Carol from our model.

Structured probabilistic models provide a formal framework for modeling only direct interactions between random variables. This allows the models to have significantly fewer parameters which can in turn be estimated reliably from less data. These smaller models also have dramatically reduced computational cost in terms of storing the model, performing inference in the model, and drawing samples from the model.
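The claim that Carol's time depends on Alice's only through Bob's can be checked numerically. The toy conditional tables below are invented for illustration; only the chain structure p(t0) p(t1 | t0) p(t2 | t1) comes from the example.

```python
import itertools

# Toy discretization: each finishing time is "fast" (0) or "slow" (1).
p_t0 = {0: 0.6, 1: 0.4}                 # Alice
p_t1_given_t0 = {0: {0: 0.7, 1: 0.3},   # Bob, conditioned on Alice
                 1: {0: 0.2, 1: 0.8}}
p_t2_given_t1 = {0: {0: 0.9, 1: 0.1},   # Carol, conditioned on Bob
                 1: {0: 0.3, 1: 0.7}}

# Joint distribution built from the chain factorization p(t0) p(t1|t0) p(t2|t1).
joint = {(a, b, c): p_t0[a] * p_t1_given_t0[a][b] * p_t2_given_t1[b][c]
         for a, b, c in itertools.product((0, 1), repeat=3)}

def p_t2_given(t1, t0):
    """p(t2 | t1, t0), computed from the joint by conditioning."""
    norm = sum(joint[(t0, t1, c)] for c in (0, 1))
    return {c: joint[(t0, t1, c)] / norm for c in (0, 1)}

# Once Bob's time is known, Alice's time carries no extra information
# about Carol's: this distribution does not change if we set t0=1.
print(p_t2_given(t1=0, t0=0))
```

However we pick the numbers in the tables, conditioning on t1 makes t0 irrelevant to t2, because t0 only enters the joint through factors that cancel in the conditional.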
16.2 Using Graphs to Describe Model Structure

Structured probabilistic models use graphs (in the graph theory sense of "nodes" or "vertices" connected by edges) to represent interactions between random variables. Each node represents a random variable. Each edge represents a direct interaction. These direct interactions imply other, indirect interactions, but only the direct interactions need to be explicitly modeled.

There is more than one way to describe the interactions in a probability distribution using a graph. In the following sections we describe some of the most popular and useful approaches. Graphical models can be largely divided into two categories: models based on directed acyclic graphs, and models based on undirected graphs.
16.2.1 Directed Models

One kind of structured probabilistic model is the directed graphical model, otherwise known as the belief network or Bayesian network² (Pearl, 1985).

² Judea Pearl suggested using the term "Bayesian network" when one wishes to "emphasize the judgmental" nature of the values computed by the network, i.e. to highlight that they usually represent degrees of belief rather than frequencies of events.

[Figure 16.2: a directed graph with three nodes, t0 (Alice), t1 (Bob) and t2 (Carol), and directed edges t0 → t1 → t2.]

Figure 16.2: A directed graphical model depicting the relay race example. Alice's finishing time t0 influences Bob's finishing time t1, because Bob does not get to start running until Alice finishes. Likewise, Carol only gets to start running after Bob finishes, so Bob's finishing time t1 directly influences Carol's finishing time t2.

Directed graphical models are called "directed" because their edges are directed,
that is, they point from one vertex to another. This direction is represented in the dra drawing wing with an arro arrow. w. The direction of the arrow indicates which variable’s that is, they point fromisone vertex another. direction is represented in probabilit probability y distribution defined in to terms of the This other’s. Drawing an arrow from wing with w. The direction y of distribution the arrow indicates which variable’s athe todra b means thatan wearro define the probabilit probability ov over er b via a conditional probability distribution is defined terms ofonthe Drawing arrow from distribution, with a as one of the in variables theother’s. right side of theanconditioning a to b means that we define the probabilit y distribution ov er b via a conditional bar. In other words, the distribution ov over er b dep depends ends on the value of a. distribution, with a as one of the variables on the right side of the conditioning Contin tinuing uingwords, with the race example from supp suppose ose we Alice’s bar.Con In tin other therelay distribution over b depSec. ends16.1 on ,the value of name a. finishing time t0 , Bob’s finishing time t1 , and Carol’s finishing time t 2. As we saw Conour tinuing with the race example from Sec. 16.1 ose wedirectly name Alice’s earlier, estimate of t1relay dep depends ends on t0. Our estimate of ,tsupp depends ends on t 1 2 dep finishing time t , Bob’s finishing time t , and Carol’s finishing time t . As we saw but only indirectly on t0 . We can dra draw w this relationship in a directed graphical earlier, our estimate t dep on t . Our estimate of t depends directly on t mo model, del, illustrated in of Fig. 16.2ends . but only indirectly on t . We can draw this relationship in a directed graphical Formally ormally,, a directed graphical mo model del defined on variables x is defined by a model, illustrated in Fig. 16.2. 
directed acyclic graph G whose vertices are the random variables in the mo model, del, and x F ormally , a directed graphical mo del defined on v ariables is defined (x ia) a set of lo loccal conditional pr prob ob obability ability distributions p(xi | P aG (xi)) where P a Gby directed vertices are thedistribution random variables moby del, and giv gives es theacyclic paren parents tsgraph of xi inwhose G . The probability ov over er xin is the given a set of local conditional G probability distributions p(x P a (x )) where P a (x ) gives the parents of x in . pThe probability (x) = Πi p(x i | Pdistribution aG (xi)) )).. | over x is given by(16.1) G p(x) = Π p(x P a (x )). (16.1) In our rela relay y race example, this means that, using the graph drawn in Fig. 16.2, | In our relay race example, this means that, drawn in Fig.(16.2) 16.2, | t1graph p(t0, t 1 , t2 ) = p(t0 )p(t 1 | tusing ). 0 )p(t2the p(t , t , t ) = p(t )p(t t )p(t t ). (16.2) This is our first time seeing a structured probabilistic mo model del in action. We | | w structured mo can examine the cost of using it, in order to observ observe e ho how modeling deling has This is our first time seeing a structured probabilistic mo del in action. We man many y adv advan an antages tages relative to unstructured mo modeling. deling. can examine the cost of using it, in order to observe how structured modeling has Suppose we represented by discretizing time ranging from min minute ute 0 to manSupp y advose antages relative to time unstructured modeling. min minute ute 10 into 6 second ch chunks. unks. This would mak makee t0 , t 1 and t 2 eac each h be discrete Suppose represented time bIfy w discretizing rangingt pfrom 0 to (t0, t 1min variables withwe100 possible values. e attemptedtime to represen represent , t 2 ute ) with a min ute 10 into 6 second ch unks. 
This would mak e t , t and t eac h b e discrete table, it would need to store 999,999 values (100 values of t 0 × 100 values of t 1 × variables possible values. If we attempted t p(t , t , t )iswith 100 valueswith of t2100 , minus 1, since the probabilit probability y of onetoofrepresen the configurations madea table, it would need to store 999,999 values (100 values of t 100 values of t 567 100 values of t , minus 1, since the probability of one of the configurations is made × ×
CHAPTER 16. STRUCTURED PROBABILISTIC MODELS FOR DEEP LEARNING
redundant by the constraint that the sum of the probabilities be 1). If instead we only make a table for each of the conditional probability distributions, then the distribution over t_0 requires 99 values, the table defining t_1 given t_0 requires 9,900 values, and so does the table defining t_2 given t_1. This comes to a total of 19,899 values. This means that using the directed graphical model reduced our number of parameters by a factor of more than 50!

In general, to model n discrete variables each having k values, the cost of the single-table approach scales like O(k^n), as we have observed before. Now suppose we build a directed graphical model over these variables. If m is the maximum number of variables appearing (on either side of the conditioning bar) in a single conditional probability distribution, then the cost of the tables for the directed model scales like O(k^m). As long as we can design a model such that m ≪ n, we get very dramatic savings.
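The parameter counts above can be checked with a short calculation. This is a sketch; the function names are ours, not the book's.

```python
def full_table_params(n, k):
    # A single joint table over n variables with k values each needs
    # k**n entries, minus 1 for the sum-to-one constraint.
    return k ** n - 1

def directed_model_params(parent_counts, k):
    # Each variable with m parents needs a conditional table with k**m
    # columns, each column holding k - 1 free probabilities.
    return sum((k - 1) * k ** m for m in parent_counts)

# Relay race: t_0 has no parents, t_1 and t_2 have one parent each; k = 100.
print(full_table_params(3, 100))              # 999999
print(directed_model_params([0, 1, 1], 100))  # 19899
```

The ratio 999,999 / 19,899 is just over 50, matching the factor quoted in the text.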
In other words, so long as each variable has few parents in the graph, the distribution can be represented with very few parameters. Some restrictions on the graph structure, such as requiring it to be a tree, can also guarantee that operations like computing marginal or conditional distributions over subsets of the variables are efficient.

It is important to realize what kinds of information can and cannot be encoded in the graph. The graph encodes only simplifying assumptions about which variables are conditionally independent from each other. It is also possible to make other kinds of simplifying assumptions. For example, suppose we assume Bob always runs the same regardless of how Alice performed.
(In reality, Alice's performance probably influences Bob's performance; depending on Bob's personality, if Alice runs especially fast in a given race, this might encourage Bob to push hard and match her exceptional performance, or it might make him overconfident and lazy.) Then the only effect Alice has on Bob's finishing time is that we must add Alice's finishing time to the total amount of time we think Bob needs to run. This observation allows us to define a model with O(k) parameters instead of O(k²). However, note that t_0 and t_1 are still directly dependent with this assumption, because t_1 represents the absolute time at which Bob finishes, not the total time he himself spends running. This means our graph must still contain an arrow from t_0 to t_1. The assumption that Bob's personal running time is independent from all other factors cannot be encoded in a graph over t_0, t_1, and t_2.
Instead, we encode this information in the definition of the conditional distribution itself. The conditional distribution is no longer a k × (k − 1)-element table indexed by t_0 and t_1 but is now a slightly more complicated formula using only k − 1 parameters. The directed graphical model syntax does not place any constraint on how we define
our conditional distributions. It only defines which variables they are allowed to take in as arguments.
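Because the directed factorization orders each variable after its parents, drawing a joint sample is straightforward: sample each variable from its conditional given already-sampled values. The sketch below illustrates this for the relay-race model; the Gaussian conditionals and their parameters are invented for illustration (the text uses discrete tables).

```python
import random

def sample_race(rng=random):
    # Ancestral sampling from p(t0, t1, t2) = p(t0) p(t1 | t0) p(t2 | t1):
    # each variable is drawn given its already-sampled parents.
    t0 = rng.gauss(5.0, 1.0)       # p(t0): Alice's finishing time (minutes)
    t1 = t0 + rng.gauss(4.0, 1.0)  # p(t1 | t0): Bob finishes after Alice
    t2 = t1 + rng.gauss(4.5, 1.0)  # p(t2 | t1): Carol finishes after Bob
    return t0, t1, t2

random.seed(0)
t0, t1, t2 = sample_race()
```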
16.2.2 Undirected Models
Directed graphical models give us one language for describing structured probabilistic models. Another popular language is that of undirected models, otherwise known as Markov random fields (MRFs) or Markov networks (Kindermann, 1980). As their name implies, undirected models use graphs whose edges are undirected.

Directed models are most naturally applicable to situations where there is a clear reason to draw each arrow in one particular direction. Often these are situations where we understand the causality and the causality only flows in one direction. One such situation is the relay race example. Earlier runners affect the finishing times of later runners; later runners do not affect the finishing times of earlier runners.

Not all situations we might want to model have such a clear direction to their interactions. When the interactions seem to have no intrinsic direction, or to operate in both directions, it may be more appropriate to use an undirected model.
As an example of such a situation, suppose we want to model a distribution over three binary variables: whether or not you are sick, whether or not your coworker is sick, and whether or not your roommate is sick. As in the relay race example, we can make simplifying assumptions about the kinds of interactions that take place. Assuming that your coworker and your roommate do not know each other, it is very unlikely that one of them will give the other a disease such as a cold directly. This event can be seen as so rare that it is acceptable not to model it. However, it is reasonably likely that either of them could give you a cold, and that you could pass it on to the other.
We can model the indirect transmission of a cold from your coworker to your roommate by modeling the transmission of the cold from your coworker to you and the transmission of the cold from you to your roommate.

In this case, it is just as easy for you to cause your roommate to get sick as it is for your roommate to make you sick, so there is not a clean, uni-directional narrative on which to base the model. This motivates using an undirected model. As with directed models, if two nodes in an undirected model are connected by an edge, then the random variables corresponding to those nodes interact with each other directly. Unlike directed models, the edge in an undirected model has no arrow, and is not associated with a conditional probability distribution.

We denote the random variable representing your health as h_y, the random
Figure 16.3: An undirected graph representing how your roommate's health h_r, your health h_y, and your work colleague's health h_c affect each other. You and your roommate might infect each other with a cold, and you and your work colleague might do the same, but assuming that your roommate and your colleague do not know each other, they can only infect each other indirectly via you.
variable representing your roommate's health as h_r, and the random variable representing your colleague's health as h_c. See Fig. 16.3 for a drawing of the graph representing this scenario.

Formally, an undirected graphical model is a structured probabilistic model defined on an undirected graph G. For each clique C in the graph,³ a factor φ(C) (also called a clique potential) measures the affinity of the variables in that clique for being in each of their possible joint states. The factors are constrained to be non-negative. Together they define an unnormalized probability distribution

p̃(x) = Π_{C∈G} φ(C).  (16.3)

The unnormalized probability distribution is efficient to work with so long as all the cliques are small. It encodes the idea that states with higher affinity are more likely. However, unlike in a Bayesian network, there is little structure to the definition of the cliques, so there is nothing to guarantee that multiplying them
together will yield a valid probability distribution. See Fig. 16.4 for an example of reading factorization information from an undirected graph.

Our example of the cold spreading between you, your roommate, and your colleague contains two cliques. One clique contains h_y and h_c. The factor for this clique can be defined by a table, and might have values resembling these:

            h_y = 0   h_y = 1
  h_c = 0      2         1
  h_c = 1      1        10

A state of 1 indicates good health, while a state of 0 indicates poor health (having been infected with a cold). Both of you are usually healthy, so the

³A clique of the graph is a subset of nodes that are all connected to each other by an edge of the graph.
corresponding state has the highest affinity. The state where only one of you is sick has the lowest affinity, because this is a rare state. The state where both of you are sick (because one of you has infected the other) is a higher affinity state, though still not as common as the state where both are healthy.

To complete the model, we would need to also define a similar factor for the clique containing h_y and h_r.
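The full model can be enumerated by brute force. In this sketch we reuse the (h_y, h_c) table from the text and, purely as an assumption for illustration, give the (h_y, h_r) clique the same values, since the text only says it would be "similar".

```python
# Factor table for the (h_y, h_c) clique, copied from the text.
# Assumption: the (h_y, h_r) clique reuses the same values.
phi_yc = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 10.0}
phi_yr = dict(phi_yc)

def p_tilde(h_y, h_r, h_c):
    # Unnormalized probability: the product of the clique factors (Eq. 16.3).
    return phi_yc[(h_y, h_c)] * phi_yr[(h_y, h_r)]

# Summing over all 2**3 joint states normalizes the distribution.
states = [(y, r, c) for y in (0, 1) for r in (0, 1) for c in (0, 1)]
Z = sum(p_tilde(*s) for s in states)

# Everyone healthy is the most likely joint state:
print(max(states, key=lambda s: p_tilde(*s)))  # (1, 1, 1)
```

Note that the factors alone sum to 130, not 1; dividing by that sum is exactly the normalization discussed in the next section.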
16.2.3 The Partition Function
While the unnormalized probability distribution is guaranteed to be non-negative everywhere, it is not guaranteed to sum or integrate to 1. To obtain a valid probability distribution, we must use the corresponding normalized probability distribution:⁴

p(x) = (1/Z) p̃(x)  (16.4)

where Z is the value that results in the probability distribution summing or integrating to 1:

Z = ∫ p̃(x) dx.  (16.5)
You can think of Z as a constant when the φ functions are held constant. Note that if the φ functions have parameters, then Z is a function of those parameters. It is common in the literature to write Z with its arguments omitted to save space. The normalizing constant Z is known as the partition function, a term borrowed from statistical physics.

Since Z is an integral or sum over all possible joint assignments of the state x, it is often intractable to compute. In order to be able to obtain the normalized probability distribution of an undirected model, the model structure and the definitions of the φ functions must be conducive to computing Z efficiently. In the context of deep learning, Z is usually intractable. Due to the intractability of computing Z exactly, we must resort to approximations. Such approximate algorithms are the topic of Chapter 18.

One important consideration to keep in mind when designing undirected models
is that it is possible to specify the factors in such a way that Z does not exist. This happens if some of the variables in the model are continuous and the integral of p̃ over their domain diverges. For example, suppose we want to model a single

⁴A distribution defined by normalizing a product of clique potentials is also called a Gibbs distribution.
scalar variable x ∈ ℝ with a single clique potential φ(x) = x². In this case,

Z = ∫ x² dx.  (16.6)

Since this integral diverges, there is no probability distribution corresponding to this choice of φ(x). Sometimes the choice of some parameter of the φ functions determines whether the probability distribution is defined. For example, for φ(x; β) = exp(−βx²), the β parameter determines whether Z exists. Positive β results in a Gaussian distribution over x, but all other values of β make φ impossible to normalize.

One key difference between directed modeling and undirected modeling is that directed models are defined directly in terms of probability distributions from the start, while undirected models are defined more loosely by φ functions that are then converted into probability distributions.
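Returning to the φ(x; β) = exp(−βx²) example above: for positive β the partition function is Z = √(π/β), which can be checked with a rough numerical quadrature (the integration limits and grid size below are arbitrary choices of ours).

```python
import math

def Z_numeric(beta, lim=50.0, n=200_000):
    # Midpoint-rule estimate of Z = ∫ exp(-beta * x**2) dx over [-lim, lim].
    # For beta > 0 the tails beyond |x| = lim are negligible; for beta <= 0
    # the integrand does not decay, and the true integral diverges.
    dx = 2 * lim / n
    return sum(math.exp(-beta * (-lim + (i + 0.5) * dx) ** 2)
               for i in range(n)) * dx

beta = 2.0
print(Z_numeric(beta))            # close to sqrt(pi / beta)
print(math.sqrt(math.pi / beta))  # ≈ 1.2533
```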
This changes the intuitions one must develop in order to work with these models. One key idea to keep in mind while working with undirected models is that the domain of each of the variables has a dramatic effect on the kind of probability distribution that a given set of φ functions corresponds to. For example, consider an n-dimensional vector-valued random variable x and an undirected model parametrized by a vector of biases b. Suppose we have one clique for each element of x, φ^(i)(x_i) = exp(b_i x_i). What kind of probability distribution does this result in? The answer is that we do not have enough information, because we have not yet specified the domain of x. If x ∈ ℝⁿ, then the integral defining Z diverges and no probability distribution exists. If x ∈ {0, 1}ⁿ, then p(x) factorizes into n independent distributions, with p(x_i = 1) = sigmoid(b_i). If the domain of x is the set of elementary basis vectors {[1, 0, . . . , 0], [0, 1, . . . , 0], . . . , [0, 0, . . . , 1]}, then p(x) = softmax(b), so a large
value of b_i actually reduces p(x_j = 1) for j ≠ i. Often, it is possible to leverage the effect of a carefully chosen domain of a variable in order to obtain complicated behavior from a relatively simple set of φ functions. We will explore a practical application of this idea later, in Sec. 20.6.
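The dependence on the domain can be verified by brute-force enumeration for a small n. This is a sketch; the bias values are arbitrary.

```python
import math

b = [1.0, -0.5, 2.0]  # arbitrary bias vector
n = len(b)

def p_tilde(x):
    # Product of the per-element factors phi_i(x_i) = exp(b_i * x_i).
    return math.exp(sum(bi * xi for bi, xi in zip(b, x)))

# Domain {0, 1}^n: the model factorizes, with p(x_i = 1) = sigmoid(b_i).
binary_states = [[(s >> i) & 1 for i in range(n)] for s in range(2 ** n)]
Z_bin = sum(p_tilde(x) for x in binary_states)
p_x0 = sum(p_tilde(x) for x in binary_states if x[0] == 1) / Z_bin
print(p_x0, 1 / (1 + math.exp(-b[0])))  # both equal sigmoid(b[0])

# Domain restricted to one-hot vectors: the same factors give softmax(b).
one_hots = [[int(i == j) for i in range(n)] for j in range(n)]
Z_hot = sum(p_tilde(x) for x in one_hots)
p_hot = [p_tilde(x) / Z_hot for x in one_hots]
print(p_hot)  # equals softmax(b)
```

The same φ functions yield two very different distributions purely because the set of states summed over in Z changes.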
16.2.4 Energy-Based Models
Many interesting theoretical results about undirected models depend on the assumption that ∀x, p̃(x) > 0. A convenient way to enforce this condition is to use an energy-based model (EBM) where

p̃(x) = exp(−E(x))  (16.7)

and E(x) is known as the energy function. Because exp(z) is positive for all z, this guarantees that no energy function will result in a probability of zero for any state x.
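A minimal sketch of this guarantee, using an invented energy function over one binary variable:

```python
import math

def energy(x):
    # Hypothetical energy function: state 1 has higher energy (is less likely).
    return 3.0 * x

# Eq. 16.7: exp(-E(x)) is strictly positive for any finite energy,
# so no state is assigned probability zero.
p_tilde = {x: math.exp(-energy(x)) for x in (0, 1)}
Z = sum(p_tilde.values())
p = {x: v / Z for x, v in p_tilde.items()}

print(all(v > 0 for v in p.values()))  # True
print(p[0] > p[1])                     # True: lower energy, higher probability
```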
Figure 16.4: This graph implies that p(a, b, c, d, e, f) can be written as (1/Z) φ_{a,b}(a, b) φ_{b,c}(b, c) φ_{a,d}(a, d) φ_{b,e}(b, e) φ_{e,f}(e, f) for an appropriate choice of the φ functions.
Being completely free to choose the energy function makes learning simpler. If we learned the clique potentials directly, we would need to use constrained optimization to arbitrarily impose some specific minimal probability value. By learning the energy function, we can use unconstrained optimization.⁵ The probabilities in an energy-based model can approach arbitrarily close to zero but never reach it.

Any distribution of the form given by Eq. 16.7 is an example of a Boltzmann distribution. For this reason, many energy-based models are called Boltzmann machines (Fahlman et al., 1983; Ackley et al., 1985; Hinton et al., 1984; Hinton and Sejnowski, 1986). There is no accepted guideline for when to call a model an energy-based model and when to call it a Boltzmann machine. The term Boltzmann
machine was first introduced to describe a model with exclusively binary variables, but today many models such as the mean-covariance restricted Boltzmann machine incorporate real-valued variables as well. While Boltzmann machines were originally defined to encompass both models with and without latent variables, the term Boltzmann machine is today most often used to designate models with latent variables, while Boltzmann machines without latent variables are more often called Markov random fields or log-linear models.

Cliques in an undirected graph correspond to factors of the unnormalized probability function. Because exp(a) exp(b) = exp(a + b), this means that different cliques in the undirected graph correspond to the different terms of the energy function.
In other words, an energy-based model is just a special kind of Markov network: the exponentiation makes each term in the energy function correspond to a factor for a different clique. See Fig. 16.5 for an example of how to read the form of the energy function from an undirected graph structure. One can view an energy-based model with multiple terms in its energy function as being a product of experts (Hinton, 1999). Each term in the energy function corresponds to another

⁵For some models, we may still need to use constrained optimization to make sure Z exists.
CHAPTER 16. STRUCTURED PROBABILISTIC MODELS FOR DEEP LEARNING
Figure 16.5: This graph implies that E(a, b, c, d, e, f) can be written as E_{a,b}(a, b) + E_{b,c}(b, c) + E_{a,d}(a, d) + E_{b,e}(b, e) + E_{e,f}(e, f) for an appropriate choice of the per-clique energy functions. Note that we can obtain the φ functions in Fig. 16.4 by setting each φ to the exponential of the corresponding negative energy, e.g., φ_{a,b}(a, b) = exp(−E(a, b)).

factor in the probability distribution. Each term of the energy function can be thought of as an "expert" that determines whether a particular soft constraint is satisfied. Each expert may enforce only one constraint that concerns only a low-dimensional projection of the random variables, but when combined by multiplication of probabilities, the experts together enforce a complicated high-dimensional constraint.

One part of the definition of an energy-based model serves no functional purpose from a machine learning point of view: the − sign in Eq. 16.7. This − sign could be incorporated into the definition of E, or for many functions E the learning algorithm could simply learn parameters with opposite sign. The − sign is present primarily to preserve compatibility between the machine learning literature and the physics literature. Many advances in probabilistic modeling
were originally developed by statistical physicists, for whom E refers to actual, physical energy and does not have arbitrary sign. Terminology such as "energy" and "partition function" remains associated with these techniques, even though their mathematical applicability is broader than the physics context in which they were developed. Some machine learning researchers (e.g., Smolensky (1986), who referred to negative energy as harmony) have chosen to omit the negation, but this is not the standard convention.

Many algorithms that operate on probabilistic models do not need to compute p_model(x) but only log p̃_model(x). For energy-based models with latent variables h, these algorithms are sometimes phrased in terms of the negative of this quantity, called the free energy:

    F(x) = −log Σ_h exp(−E(x, h)).        (16.8)

In this book, we usually prefer the more general log p̃_model(x) formulation.
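For small models, Eq. 16.8 can be evaluated directly by enumerating the latent states. The following is a hedged sketch: the `energy` function and the choice of two binary latent variables are hypothetical, made up only to illustrate the marginalization over h.

```python
import itertools
import math

# Free energy F(x) = -log sum_h exp(-E(x, h))  (Eq. 16.8), illustrated
# for a made-up model with two binary latent variables.
def energy(x, h):
    # x: observed value, h: tuple of binary latents (arbitrary toy energy)
    return -x * sum(h) + 0.5 * sum(h)

def free_energy(x):
    # Marginalize the latents out by brute-force enumeration.
    total = sum(math.exp(-energy(x, h))
                for h in itertools.product([0, 1], repeat=2))
    return -math.log(total)

# -F(x) equals log p_tilde(x) after h has been summed out.
print(free_energy(0.0), free_energy(1.0))
```

For larger latent spaces the sum is intractable and must be approximated, which is one motivation for the approximate inference methods of Chapter 19.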
Figure 16.6: (a) The path between random variable a and random variable b through s is active, because s is not observed. This means that a and b are not separated. (b) Here s is shaded in, to indicate that it is observed. Because the only path between a and b is through s, and that path is inactive, we can conclude that a and b are separated given s.
16.2.5 Separation and D-Separation
The edges in a graphical model tell us which variables directly interact. We often need to know which variables indirectly interact. Some of these indirect interactions can be enabled or disabled by observing other variables. More formally, we would like to know which subsets of variables are conditionally independent from each other, given the values of other subsets of variables.

Identifying the conditional independences in a graph is very simple in the case of undirected models. In this case, conditional independence implied by the graph is called separation. We say that a set of variables A is separated from another set of variables B given a third set of variables S if the graph structure implies that A is independent from B given S. If two variables a and b are connected by a path
involving only unobserved variables, then those variables are not separated. If no path exists between them, or all paths contain an observed variable, then they are separated. We refer to paths involving only unobserved variables as "active" and paths including an observed variable as "inactive."

When we draw a graph, we can indicate observed variables by shading them in. See Fig. 16.6 for a depiction of how active and inactive paths in an undirected model look when drawn in this way. See Fig. 16.7 for an example of reading separation from an undirected graph.

Similar concepts apply to directed models, except that in the context of directed models, these concepts are referred to as d-separation.
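The separation rule for undirected graphs reduces to a reachability check: delete the observed nodes and see whether any path remains. A minimal sketch, assuming the hypothetical a–s–b structure of Fig. 16.6 (the function name `separated` and the edge-list representation are our own choices):

```python
from collections import deque

def separated(edges, a, b, observed):
    # a and b are separated given `observed` iff every path between them
    # passes through an observed node, i.e. b is unreachable from a once
    # the observed nodes are removed from the graph.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    frontier, seen = deque([a]), {a}
    while frontier:
        node = frontier.popleft()
        if node == b:
            return False          # found an active (fully unobserved) path
        for nxt in adj.get(node, ()):
            if nxt not in seen and nxt not in observed:
                seen.add(nxt)
                frontier.append(nxt)
    return True

edges = [("a", "s"), ("s", "b")]
print(separated(edges, "a", "b", set()))   # False: the path a-s-b is active
print(separated(edges, "a", "b", {"s"}))   # True: observing s blocks it
```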
The "d" stands for "dependence." D-separation for directed graphs is defined the same as separation for undirected graphs: We say that a set of variables A is d-separated from another set of variables B given a third set of variables S if the graph structure implies that A is independent from B given S.

As with undirected models, we can examine the independences implied by the graph by looking at what active paths exist in the graph. As before, two variables are dependent if there is an active path between them, and d-separated if no such
Figure 16.7: An example of reading separation properties from an undirected graph. Here b is shaded to indicate that it is observed. Because observing b blocks the only path from a to c, we say that a and c are separated from each other given b. The observation of b also blocks one path between a and d, but there is a second, active path between them. Therefore, a and d are not separated given b.
path exists. In directed nets, determining whether a path is active is somewhat more complicated. See Fig. 16.8 for a guide to identifying active paths in a directed model. See Fig. 16.9 for an example of reading some properties from a graph.

It is important to remember that separation and d-separation tell us only about those conditional independences that are implied by the graph. There is no requirement that the graph imply all independences that are present. In particular, it is always legitimate to use the complete graph (the graph with all possible edges) to represent any distribution. In fact, some distributions contain independences that are not possible to represent with existing graphical notation. Context-specific independences are independences that are present dependent on the value of some variables in the network. For example, consider a model of three binary variables: a, b and c.
Suppose that when a is 0, b and c are independent, but when a is 1, b is deterministically equal to c. Encoding the behavior when a = 1 requires an edge connecting b and c. The graph then fails to indicate that b and c are independent when a = 0.

In general, a graph will never imply that an independence exists when it does not. However, a graph may fail to encode an independence.
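D-separation can be tested mechanically with a reachability procedure over active trails. The sketch below follows the well-known "Bayes ball"-style algorithm rather than code from the book; the representation of a DAG as a node → parents dict is our own choice. Note how it captures the V-structure rules of Fig. 16.8: a collider is active exactly when it (or a descendant) is observed.

```python
def d_separated(parents, a, b, observed):
    """parents: dict mapping each node to a set of its parents (a DAG).
    Returns True if a and b are d-separated given the set `observed`."""
    children = {}
    for node, ps in parents.items():
        for p in ps:
            children.setdefault(p, set()).add(node)

    # Ancestors of the observed set, needed for the V-structure rule:
    # a collider is active if it or any of its descendants is observed.
    anc, frontier = set(observed), list(observed)
    while frontier:
        node = frontier.pop()
        for p in parents.get(node, ()):
            if p not in anc:
                anc.add(p)
                frontier.append(p)

    # Depth-first search over (node, direction-of-arrival) states.
    visited, reachable = set(), set()
    frontier = [(a, "up")]
    while frontier:
        node, direction = frontier.pop()
        if (node, direction) in visited:
            continue
        visited.add((node, direction))
        if node not in observed:
            reachable.add(node)
        if direction == "up" and node not in observed:
            # Trail may continue through both parents and children.
            frontier += [(p, "up") for p in parents.get(node, ())]
            frontier += [(c, "down") for c in children.get(node, ())]
        elif direction == "down":
            if node not in observed:
                frontier += [(c, "down") for c in children.get(node, ())]
            if node in anc:   # active collider: explaining away
                frontier += [(p, "up") for p in parents.get(node, ())]
    return b not in reachable

# Hypothetical V-structure, as in Fig. 16.8(c): a -> s <- b.
v_structure = {"s": {"a", "b"}}
print(d_separated(v_structure, "a", "b", set()))   # True: collider blocks the path
print(d_separated(v_structure, "a", "b", {"s"}))   # False: observing s activates it
```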
16.2.6 Converting between Undirected and Directed Graphs
We often refer to a specific machine learning model as being undirected or directed. For example, we typically refer to RBMs as undirected and sparse coding as directed. This choice of wording can be somewhat misleading, because no probabilistic model is inherently directed or undirected. Instead, some models are most easily described using a directed graph, or most easily described using an undirected graph.
Figure 16.8: All of the kinds of active paths of length two that can exist between random variables a and b. (a) Any path with arrows proceeding directly from a to b or vice versa. This kind of path becomes blocked if s is observed. We have already seen this kind of path in the relay race example. (b) a and b are connected by a common cause s. For example, suppose s is a variable indicating whether or not there is a hurricane and a and b measure the wind speed at two different nearby weather monitoring outposts. If we observe very high winds at station a, we might expect to also see high winds at b. This kind of path can be blocked by observing s. If we already know there is a hurricane, we expect to see high winds at b, regardless of what is observed at a. A lower than expected wind at a (for a hurricane) would not change our expectation of winds at b (knowing there is a hurricane). However, if s is not observed, then a and b are dependent, i.e., the path is active. (c) a and b are both parents of s. This is called a V-structure or the collider case. The V-structure causes a and b to be related by the explaining away effect. In this case, the path is actually active when s is observed. For example, suppose s is a variable indicating that your colleague is not at work. The variable a represents her being sick, while b represents her being on vacation. If you observe that she is not at work, you can presume she is probably sick or on vacation, but it is not especially likely that both have happened at the same time. If you find out that she is on vacation, this fact is sufficient to explain her absence. You can infer that she is probably not also sick. (d) The explaining away effect happens even if any descendant of s is observed! For example, suppose that c is a variable representing whether you have received a report from your colleague. If you notice that you have not received the report, this increases your estimate of the probability that she is not at work today, which in turn makes it more likely that she is either sick or on vacation. The only way to block a path through a V-structure is to observe none of the descendants of the shared child.
Figure 16.9: From this graph, we can read out several d-separation properties. Examples include:

• a and b are d-separated given the empty set.
• a and e are d-separated given c.
• d and e are d-separated given c.

We can also see that some variables are no longer d-separated when we observe some variables:

• a and b are not d-separated given c.
• a and b are not d-separated given d.
Figure 16.10: Examples of complete graphs, which can describe any probability distribution. Here we show examples with four random variables. (Left) The complete undirected graph. In the undirected case, the complete graph is unique. (Right) A complete directed graph. In the directed case, there is not a unique complete graph. We choose an ordering of the variables and draw an arc from each variable to every variable that comes after it in the ordering. There are thus a factorial number of complete graphs for every set of random variables. In this example we order the variables from left to right, top to bottom.
Directed models and undirected models both have their advantages and disadvantages. Neither approach is clearly superior and universally preferred. Instead, we should choose which language to use for each task. This choice will partially depend on which probability distribution we wish to describe. We may choose to use either directed modeling or undirected modeling based on which approach can capture the most independences in the probability distribution or which approach uses the fewest edges to describe the distribution. There are other factors that can affect the decision of which language to use. Even while working with a single probability distribution, we may sometimes switch between different modeling languages. Sometimes a different language becomes more appropriate if we observe a certain subset of variables, or if we wish to perform a different computational task.
For example, the directed model description often provides a straightforward approach to efficiently draw samples from the model (described in Sec. 16.3), while the undirected model formulation is often useful for deriving approximate inference procedures (as we will see in Chapter 19, where the role of undirected models is highlighted in Eq. 19.56).

Every probability distribution can be represented by either a directed model or by an undirected model. In the worst case, one can always represent any distribution by using a "complete graph." In the case of a directed model, the complete graph is any directed acyclic graph where we impose some ordering on the random variables, and each variable has all other variables that precede it in the ordering as its ancestors in the graph. For an undirected model, the complete graph is simply a graph containing a single clique encompassing all of the variables. See Fig. 16.10 for an example.
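The two complete-graph constructions just described are small enough to write down directly. This is an illustrative sketch; the variable names and the particular ordering are arbitrary choices.

```python
import itertools

# Complete graphs over four variables, as in Fig. 16.10.
variables = ["a", "b", "c", "d"]

# Undirected complete graph: one edge per unordered pair (unique).
undirected = {frozenset(p) for p in itertools.combinations(variables, 2)}

# Directed complete graph: choose an ordering, then draw an arc from each
# variable to every variable that comes after it in the ordering.
directed = [(u, v) for i, u in enumerate(variables)
            for v in variables[i + 1:]]

print(len(undirected), len(directed))  # both have C(4,2) = 6 edges
```

A different ordering of `variables` yields a different (but equally valid) complete directed graph, which is why there are factorially many of them.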
Of course, the utility of a graphical model is that the graph implies that some variables do not interact directly. The complete graph is not very useful because it does not imply any independences.

When we represent a probability distribution with a graph, we want to choose a graph that implies as many independences as possible, without implying any independences that do not actually exist.

From this point of view, some distributions can be represented more efficiently using directed models, while other distributions can be represented more efficiently using undirected models. In other words, directed models can encode some independences that undirected models cannot encode, and vice versa.

Directed models are able to use one specific kind of substructure that undirected models cannot represent perfectly. This substructure is called an immorality. The structure occurs when two random variables a and b are both parents of a third
random variable c, and there is no edge directly connecting a and b in either direction. (The name "immorality" may seem strange; it was coined in the graphical models literature as a joke about unmarried parents.) To convert a directed model with graph D into an undirected model, we need to create a new graph U. For every pair of variables x and y, we add an undirected edge connecting x and y to U if there is a directed edge (in either direction) connecting x and y in D or if x and y are both parents in D of a third variable z. The resulting U is known as a moralized graph. See Fig. 16.11 for examples of converting directed models to undirected models via moralization.

Likewise, undirected models can include substructures that no directed model can represent perfectly.
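The moralization procedure described above (drop edge directions, then "marry" the co-parents of every node) is short enough to sketch directly. The representation of the directed graph D as a child → parents mapping is an assumption of this illustration:

```python
import itertools

def moralize(parents):
    """Convert a directed graph (dict node -> set of parents) into the
    set of undirected edges of its moralized graph U."""
    edges = set()
    for node, ps in parents.items():
        # Keep every directed edge, dropping its orientation.
        for p in ps:
            edges.add(frozenset((p, node)))
        # "Marry" every pair of parents of a common child: this is the
        # extra edge an immorality forces us to add.
        for p, q in itertools.combinations(ps, 2):
            edges.add(frozenset((p, q)))
    return edges

# The immorality of Fig. 16.11 (center): a -> c <- b, no edge between a and b.
print(moralize({"c": {"a", "b"}}))
```

The output contains the marrying edge between a and b, so a, b, c form a single clique in U, and the independence a⊥b is no longer encoded.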
Specifically, a directed graph D cannot capture all of the conditional independences implied by an undirected graph U if U contains a loop of length greater than three, unless that loop also contains a chord. A loop is a sequence of variables connected by undirected edges, with the last variable in the sequence connected back to the first variable in the sequence. A chord is a connection between any two non-consecutive variables in the sequence defining a loop. If U has loops of length four or greater and does not have chords for these loops, we must add the chords before we can convert it to a directed model. Adding these chords discards some of the independence information that was encoded in U. The graph formed by adding chords to U is known as a chordal or triangulated graph, because all the loops can now be described in terms of smaller, triangular loops.
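The chord condition just defined can be checked mechanically. This is a hypothetical helper following the definitions above (a loop is a cyclic sequence of variables; a chord is an edge between two non-consecutive members of that sequence); the function name and graph representation are our own:

```python
import itertools

def has_chord(loop, edges):
    # loop: list of variables in cyclic order; edges: undirected edge list.
    edge_set = {frozenset(e) for e in edges}
    n = len(loop)
    for i, j in itertools.combinations(range(n), 2):
        # Consecutive positions (including the wrap-around pair) are the
        # loop's own edges, not chords.
        consecutive = (j - i == 1) or (i == 0 and j == n - 1)
        if not consecutive and frozenset((loop[i], loop[j])) in edge_set:
            return True
    return False

square = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
print(has_chord(["a", "b", "c", "d"], square))                 # False: length-4 loop, no chord
print(has_chord(["a", "b", "c", "d"], square + [("a", "c")]))  # True: a-c triangulates it
```

The chordless square is exactly the kind of undirected structure that no directed graph can represent without adding such a triangulating edge first.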
o buildformed a directed graph chords the chordal we need also assign graph, because alledges. the loops candoing now bso, e describ ed in of asmaller, triangular U Ue must directions to the When w notterms create directed cycle in lo ops. T o build a directed graph from the chordal graph, we need to also D, or the result do does es not define a valid directed probabilistic mo model. del. Oneassign wa way y directions to the edges. When doing we imp must a directed cycle in Din Dso, to assign directions to the edges is to impose osenot an create ordering on the random , or the result does not define a valid directed probabilistic model. One way to is to impose an ordering on the random D assign directions to the edges in 580 D
CHAPTER 16. STRUCTURED PROBABILISTIC MODELS FOR DEEP LEARNING
Figure 16.11: Examples of converting directed models (top row) to undirected models (bottom row) by constructing moralized graphs. (Left) This simple chain can be converted to a moralized graph merely by replacing its directed edges with undirected edges. The resulting undirected model implies exactly the same set of independences and conditional independences. (Center) This graph is the simplest directed model that cannot be converted to an undirected model without losing some independences. This graph consists entirely of a single immorality. Because a and b are parents of c, they are connected by an active path when c is observed. To capture this dependence, the undirected model must include a clique encompassing all three variables. This clique fails to encode the fact that a⊥b. (Right) In general, moralization may add many edges to the graph, thus losing many implied independences. For example, this sparse coding graph requires adding moralizing edges between every pair of hidden units, thus introducing a quadratic number of new direct dependences.
Figure 16.12: Converting an undirected model to a directed model. (Left) This undirected model cannot be converted to a directed model because it has a loop of length four with no chords. Specifically, the undirected model encodes two different independences that no directed model can capture simultaneously: a⊥c | {b, d} and b⊥d | {a, c}. (Center) To convert the undirected model to a directed model, we must triangulate the graph, by ensuring that all loops of greater than length three have a chord. To do so, we can either add an edge connecting a and c or we can add an edge connecting b and d. In this example, we choose to add the edge connecting a and c. (Right) To finish the conversion process, we must assign a direction to each edge. When doing so, we must not create any directed cycles. One way to avoid directed cycles is to impose an ordering over the nodes, and always point each edge from the node that comes earlier in the ordering to the node that comes later in the ordering. In this example, we use the variable names to impose alphabetical order.
variables, then point each edge from the node that comes earlier in the ordering to the node that comes later in the ordering. See Fig. 16.12 for a demonstration.
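The ordering trick can be sketched as follows. The code is illustrative, not from the book: given the edges of the triangulated graph from Fig. 16.12 and an ordering of the nodes, it points every edge from the earlier node to the later one, which rules out directed cycles by construction.

```python
def orient_edges(undirected_edges, ordering):
    """Point each undirected edge from the node that comes earlier in
    `ordering` to the node that comes later, yielding directed edges."""
    rank = {node: i for i, node in enumerate(ordering)}
    return {tuple(sorted(e, key=rank.get)) for e in undirected_edges}

# The triangulated square from Fig. 16.12, oriented alphabetically.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("a", "c")]
print(orient_edges(edges, ordering="abcd"))
```

Because every directed edge goes from a lower-ranked node to a higher-ranked one, any directed path strictly increases rank and can never return to its start.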
16.2.7 Factor Graphs
Factor graphs are another way of drawing undirected models that resolve an ambiguity in the graphical representation of standard undirected model syntax. In an undirected model, the scope of every φ function must be a subset of some clique in the graph. However, it is not necessary that there exist any φ whose scope contains the entirety of every clique. Factor graphs explicitly represent the scope of each φ function. Specifically, a factor graph is a graphical representation of an undirected model that consists of a bipartite undirected graph. Some of the nodes are drawn as circles. These nodes correspond to random variables as in a standard undirected model. The rest of the nodes are drawn as squares. These nodes correspond to the factors φ of the unnormalized probability distribution. Variables and factors may be connected with undirected edges. A variable and a
factor are connected in the graph if and only if the variable is one of the arguments to the factor in the unnormalized probability distribution. No factor may be connected to another factor in the graph, nor can a variable be connected to a
variable. See Fig. 16.13 for an example of how factor graphs can resolve ambiguity in the interpretation of undirected networks.
Figure 16.13: An example of how a factor graph can resolve ambiguity in the interpretation of undirected networks. (Left) An undirected network with a clique involving three variables: a, b and c. (Center) A factor graph corresponding to the same undirected model. This factor graph has one factor over all three variables. (Right) Another valid factor graph for the same undirected model. This factor graph has three factors, each over only two variables. Representation, inference, and learning are all asymptotically cheaper for the (Right) factor graph than for the (Center) one, even though both require the same undirected graph to represent.
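One way to see the extra information a factor graph carries is to store it as a plain bipartite structure. This toy sketch (names are illustrative, not from the book) reproduces the two factor graphs of Fig. 16.13, which share the same undirected graph over a, b, c but differ in their variable-factor edges:

```python
def factor_graph(factors):
    """factors: dict mapping factor name -> tuple of its argument variables.
    Returns the variable-factor edges of the bipartite factor graph."""
    return {(var, f) for f, scope in factors.items() for var in scope}

# Two factor graphs with the same underlying undirected graph over a, b, c:
one_big = factor_graph({"f1": ("a", "b", "c")})          # one ternary factor
pairwise = factor_graph({"f1": ("a", "b"), "f2": ("b", "c"),
                         "f3": ("a", "c")})              # three pairwise factors
print(len(one_big), len(pairwise))
```

The two structures are distinct as bipartite graphs (3 edges versus 6), even though both induce the same clique {a, b, c} in the undirected representation.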
16.3 Sampling from Graphical Models
Graphical models also facilitate the task of drawing samples from a model.

One advantage of directed graphical models is that a simple and efficient procedure called ancestral sampling can produce a sample from the joint distribution represented by the model.

The basic idea is to sort the variables x_i in the graph into a topological ordering, so that for all i and j, j is greater than i if x_i is a parent of x_j. The variables can then be sampled in this order. In other words, we first sample x_1 ∼ P(x_1), then sample P(x_2 | Pa_G(x_2)), and so on, until finally we sample P(x_n | Pa_G(x_n)). So long as each conditional distribution p(x_i | Pa_G(x_i)) is easy to sample from, then the whole model is easy to sample from. The topological sorting operation guarantees that we can read the conditional distributions in Eq. 16.1 and sample from them in order. Without the topological sorting, we might attempt to sample a variable before its parents are available.

For some graphs, more than one topological ordering is possible. Ancestral sampling may be used with any of these topological orderings.

Ancestral sampling is generally very fast (assuming sampling from each conditional is easy) and convenient.
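A minimal sketch of ancestral sampling, with illustrative names not taken from the book: each node's conditional distribution is represented as a function from already-sampled parent values to a sample, and nodes are visited in a topological ordering.

```python
import random

def ancestral_sample(topo_order, parents, conditionals, rng):
    """Draw one joint sample from a directed model.
    topo_order: nodes listed so parents always precede children.
    conditionals: node -> function(parent_values, rng) returning a sample."""
    sample = {}
    for x in topo_order:
        parent_values = {p: sample[p] for p in parents[x]}  # already available
        sample[x] = conditionals[x](parent_values, rng)
    return sample

# A toy chain a -> b: P(a = 1) = 0.5, and P(b = 1 | a) is 0.9 if a else 0.1.
rng = random.Random(0)
s = ancestral_sample(
    topo_order=["a", "b"],
    parents={"a": [], "b": ["a"]},
    conditionals={
        "a": lambda pv, r: int(r.random() < 0.5),
        "b": lambda pv, r: int(r.random() < (0.9 if pv["a"] else 0.1)),
    },
    rng=rng,
)
print(s)
```

A single pass over the ordering yields a fair sample from the joint distribution, which is exactly what makes this procedure cheap when each conditional is easy to sample.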
One drawback to ancestral sampling is that it only applies to directed graphical models. Another drawback is that it does not support every conditional sampling operation. When we wish to sample from a subset of the variables in a directed graphical model, given some other variables, we often require that all the conditioning variables come earlier than the variables to be sampled in the ordered graph. In this case, we can sample from the local conditional probability distributions specified by the model distribution. Otherwise, the conditional distributions we need to sample from are the posterior distributions given the observed variables. These posterior distributions are usually not explicitly specified and parametrized in the model. Inferring these posterior distributions can be costly. In models where this is the case, ancestral sampling is no longer efficient.

Unfortunately, ancestral sampling is only applicable to directed models. We
can sample from undirected models by converting them to directed models, but this often requires solving intractable inference problems (to determine the marginal distribution over the root nodes of the new directed graph) or requires introducing so many edges that the resulting directed model becomes intractable. Sampling from an undirected model without first converting it to a directed model seems to require resolving cyclical dependencies. Every variable interacts with every other variable, so there is no clear beginning point for the sampling process. Unfortunately, drawing samples from an undirected graphical model is an expensive, multi-pass process. The conceptually simplest approach is Gibbs sampling. Suppose we
have a graphical model over an n-dimensional vector of random variables x. We iteratively visit each variable x_i and draw a sample conditioned on all of the other variables, from p(x_i | x_−i). Due to the separation properties of the graphical model, we can equivalently condition on only the neighbors of x_i. Unfortunately, after we have made one pass through the graphical model and sampled all n variables, we still do not have a fair sample from p(x). Instead, we must repeat the process and resample all n variables using the updated values of their neighbors. Asymptotically, after many repetitions, this process converges to sampling from the correct distribution. It can be difficult to determine when the samples have reached a sufficiently accurate approximation of the desired distribution. Sampling techniques for undirected models are an advanced topic, covered in more detail in Chapter 17.
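A hedged sketch of Gibbs sampling on a toy undirected model, assuming a binary pairwise (Ising-style) parameterization that is not from the book: spins x_i in {−1, +1} with a factor exp(J x_i x_j) on each edge. Each sweep resamples every variable conditioned only on its neighbors.

```python
import math
import random

def gibbs_sweep(state, neighbors, J, rng):
    """One pass: resample every variable given the current values of its
    neighbors (separation lets us ignore all the other variables)."""
    for i in state:
        field = J * sum(state[j] for j in neighbors[i])
        p_up = 1.0 / (1.0 + math.exp(-2.0 * field))  # P(x_i = +1 | neighbors)
        state[i] = 1 if rng.random() < p_up else -1
    return state

rng = random.Random(0)
state = {i: 1 for i in range(4)}                        # a ring of 4 spins
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
for _ in range(100):   # one sweep is not a fair sample; many are needed
    gibbs_sweep(state, neighbors, J=0.5, rng=rng)
print(state)
```

Note the multi-pass character: the loop of 100 sweeps reflects the fact that the chain only converges to the model distribution asymptotically, and knowing when it is "close enough" is itself difficult.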
16.4 Advantages of Structured Modeling
The primary advantage of using structured probabilistic models is that they allow us to dramatically reduce the cost of representing probability distributions as well
as learning and inference. Sampling is also accelerated in the case of directed models, while the situation can be complicated with undirected models. The primary mechanism that allows all of these operations to use less runtime and memory is choosing to not model certain interactions. Graphical models convey information by leaving edges out. Anywhere there is not an edge, the model specifies the assumption that we do not need to model a direct interaction.

A less quantifiable benefit of using structured probabilistic models is that they allow us to explicitly separate representation of knowledge from learning of knowledge or inference given existing knowledge. This makes our models easier to develop and debug. We can design, analyze, and evaluate learning algorithms and inference algorithms that are applicable to broad classes of graphs. Independently,
we can design models that capture the relationships we believe are important in our data. We can then combine these different algorithms and structures and obtain a Cartesian product of different possibilities. It would be much more difficult to design end-to-end algorithms for every possible situation.
16.5 Learning about Dependencies
A good generative model needs to accurately capture the distribution over the observed or "visible" variables v. Often the different elements of v are highly dependent on each other. In the context of deep learning, the approach most commonly used to model these dependencies is to introduce several latent or "hidden" variables, h. The model can then capture dependencies between any pair of variables v_i and v_j indirectly, via direct dependencies between v_i and h, and direct dependencies between h and v_j.

A good model of v which did not contain any latent variables would need to have very large numbers of parents per node in a Bayesian network or very large cliques in a Markov network. Just representing these higher order interactions is costly, both in a computational sense, because the number of parameters that
Just representing these higher order must be stored in memory scales exp exponentially onentially with the num umb b er of interactions members in is a costly—b because the number of bparameters that clique, butoth alsoininaacomputational statistical sense,sense, because this exp exponential onential num umb er of parameters must be astored in of memory exponentially the number of members in a requires wealth data toscales estimate accurately accurately.with . clique, but also in a statistical sense, because this exponential number of parameters When the mo model del is in intended tended to capture dep dependencies endencies betw etween een visible variables requires a wealth of data to estimate accurately. with direct connections, it is usually infeasible to connect all variables, so the graph When the model intended to capture depthat endencies between visible and variables must be designed toisconnect those variables are tightly coupled omit with direct connections, it is usually infeasible to connect all v ariables, so the graph edges bet etw ween other variables. An en entire tire field of machine learning called structur structuree m ust b e designed to connect those v ariables that are tightly coupled and le learning arning is devoted to this problem For a go goo od reference on structure learning,omit see edges b et w een other v ariables. An en tire field of machine learning called structur (Koller and Friedman, 2009). Most structure learning techniques are a form ofe learning is devoted to this problem For a good reference on structure learning, see 585 (Koller and Friedman, 2009). Most structure learning techniques are a form of
greedy search. A structure is proposed, a model with that structure is trained, then given a score. The score rewards high training set accuracy and penalizes model complexity. Candidate structures with a small number of edges added or removed are then proposed as the next step of the search. The search proceeds to a new structure that is expected to increase the score.

Using latent variables instead of adaptive structure avoids the need to perform discrete searches and multiple rounds of training. A fixed structure over visible and hidden variables can use direct interactions between visible and hidden units to impose indirect interactions between visible units. Using simple parameter learning techniques we can learn a model with a fixed structure that imputes the right structure on the marginal p(v).

Latent variables have advantages beyond their role in efficiently capturing p(v). The new variables h also provide an alternative representation for v.
For example, as discussed in Sec. 3.9.6, the mixture of Gaussians model learns a latent variable that corresponds to which category of examples the input was drawn from. This means that the latent variable in a mixture of Gaussians model can be used to do classification. In Chapter 14 we saw how simple probabilistic models like sparse coding learn latent variables that can be used as input features for a classifier, or as coordinates along a manifold. Other models can be used in this same way, but deeper models and models with different kinds of interactions can create even richer descriptions of the input. Many approaches accomplish feature learning by learning latent variables. Often, given some model of v and h, experimental observations show that E[h | v] or argmax_h p(h, v) is a good feature mapping for v.
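As a small worked example of E[h | v] as a feature mapping, consider a one-dimensional mixture of two Gaussians with made-up parameters (everything below is illustrative, not from the book). The posterior responsibilities p(h = k | v) form the feature vector, and their argmax acts as a classifier:

```python
import math

def responsibilities(v, weights, means, stds):
    """Posterior p(h = k | v) for each component k of a 1-D Gaussian mixture."""
    dens = [w * math.exp(-0.5 * ((v - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
            for w, m, s in zip(weights, means, stds)]
    z = sum(dens)                      # unnormalized p(v)
    return [d / z for d in dens]

r = responsibilities(v=-1.9, weights=[0.5, 0.5], means=[-2.0, 2.0], stds=[1.0, 1.0])
print(r)   # nearly all posterior mass on the first component
```

For an input near the first component's mean, the responsibility vector concentrates on that component, so thresholding or taking the argmax of E[h | v] classifies the input.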
16.6 Inference and Approximate Inference
One of the main ways we can use a probabilistic model is to ask questions about how variables are related to each other. Given a set of medical tests, we can ask what disease a patient might have. In a latent variable model, we might want to extract features E[h | v] describing the observed variables v. Sometimes we need to solve such problems in order to perform other tasks. We often train our models using the principle of maximum likelihood. Because

log p(v) = E_{h∼p(h|v)} [log p(h, v) − log p(h | v)],    (16.9)

we often want to compute p(h | v) in order to implement a learning rule. (Eq. 16.9 follows from the product rule: p(h, v) = p(h | v) p(v) implies log p(v) = log p(h, v) − log p(h | v) for every value of h, and the left side is unchanged by taking the expectation over h ∼ p(h | v).) All of these are examples of inference problems in which we must predict the value of
some variables given other variables, or predict the probability distribution over some variables given the value of other variables.

Unfortunately, for most interesting deep models, these inference problems are intractable, even when we use a structured graphical model to simplify them. The graph structure allows us to represent complicated, high-dimensional distributions with a reasonable number of parameters, but the graphs used for deep learning are usually not restrictive enough to also allow efficient inference.

It is straightforward to see that computing the marginal probability of a general graphical model is #P hard. The complexity class #P is a generalization of the complexity class NP. Problems in NP require determining only whether a problem has a solution and finding a solution if one exists. Problems in #P require counting the number of solutions. To construct a worst-case graphical model, imagine that
Problems in #P require coun we define a graphical mo model del ov over er the binary variables in a 3-SA 3-SAT T problem. ting We the nimp umbose er of solutions.distribution To construct aerworst-case graphical del,then imagine can impose a uniform ov over these variables. Wemo can add that one w e define a graphical over that the binary variables in eac a 3-SA T problem. We binary laten latent t variablemo perdel clause indicates whether each h clause is satisfied. can imp ose a uniform distribution ov er these v ariables. W e can then add one We can then add another latent variable indicating whether all of the clauses are binary laten t vcan ariable per clause indicates h clause aisreduction satisfied. satisfied. This be done withoutthat making a largewhether clique, eac by building W e can then add anotherwith latent variable indicating of the clauses are tree of latent variables, each no node de in the tree whether rep reporting ortingallwhether two other Thissatisfied. can be done making a large by building a reduction vsatisfied. ariables are Thewithout leav leaves es of this tree are clique, the variables for eac each h clause. tree of latent v ariables, with each no de in the tree rep orting whether t w The ro root ot of the tree rep reports orts whether the entire problem is satisfied. Dueotoother the vuniform ariablesdistribution are satisfied. The leav es of this tree are the v ariables for eac h clause. over the literals, the marginal distribution ov over er the ro root ot of the The root oftree thesp tree repwhat orts whether entire problem is satisfied. Due to the reduction specifies ecifies fraction the of assignments satisfy the problem. While uniform distribution over the example, literals, the distribution over thein ropractical ot of the this is a con contriv triv trived ed worst-case NPmarginal hard graphs commonly arise reduction specifies what fraction of assignments satisfy the problem. While real-w real-world orld tree scenarios. 
this is a contrived worst-case example, NP hard graphs commonly arise in practical This motiv motivates ates the use of appro approximate ximate inference. In the con context text of deep real-world scenarios. learning, this usually refers to variational inference, in which we approximate the motivates of appro inference. In the qcon of is deep | v )use v ) that p( hthe (h|text trueThis distribution by seeking anximate approximate distribution as learning, this usually refers to v ariational inference, in which we approximate the close to the true one as possible. This and other techniques are describ described ed in depth h v h v p ( q ( true distribution ) by seeking an approximate distribution ) that is as in Chapter 19. close to the true one as | possible. This and other techniques are describ | ed in depth in Chapter 19.
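As a quick sanity check on Eq. 16.9, the identity follows from the chain rule of probability in a few lines (a sketch, using the same notation as above):

```latex
% Chain rule: p(h, v) = p(h \mid v)\, p(v), so for any h with p(h \mid v) > 0,
\log p(v) = \log p(h, v) - \log p(h \mid v).
% The right-hand side takes the same value for every such h, so taking the
% expectation with respect to h \sim p(h \mid v) changes nothing:
\log p(v) = \mathbb{E}_{h \sim p(h \mid v)}\left[\log p(h, v) - \log p(h \mid v)\right].
```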
16.7 The Deep Learning Approach to Structured Probabilistic Models

Deep learning practitioners generally use the same basic computational tools as other machine learning practitioners who work with structured probabilistic models. However, in the context of deep learning, we usually make different design decisions about how to combine these tools, resulting in overall algorithms and models that
have a very different flavor from more traditional graphical models.

Deep learning does not always involve especially deep graphical models. In the context of graphical models, we can define the depth of a model in terms of the graphical model graph rather than the computational graph. We can think of a latent variable h_i as being at depth j if the shortest path from h_i to an observed variable is j steps. We usually describe the depth of the model as being the greatest depth of any such h_i. This kind of depth is different from the depth induced by the computational graph. Many generative models used for deep learning have no latent variables or only one layer of latent variables, but use deep computational graphs to define the conditional distributions within a model.

Deep learning essentially always makes use of the idea of distributed representations.
Even shallow models used for deep learning purposes (such as pretraining shallow models that will later be composed to form deep ones) nearly always have a single, large layer of latent variables. Deep learning models typically have more latent variables than observed variables. Complicated nonlinear interactions between variables are accomplished via indirect connections that flow through multiple latent variables.

By contrast, traditional graphical models usually contain mostly variables that are at least occasionally observed, even if many of the variables are missing at random from some training examples. Traditional models mostly use higher-order terms and structure learning to capture complicated nonlinear interactions between variables. If there are latent variables, they are usually few in number.

The way that latent variables are designed also differs in deep learning. The deep learning practitioner typically does not intend for the latent variables to take on any specific semantics ahead of time; the training algorithm is free to invent the concepts it needs to model a particular dataset. The latent variables are usually not very easy for a human to interpret after the fact, though visualization techniques may allow some rough characterization of what they represent. When latent variables are used in the context of traditional graphical models, they are often designed with some specific semantics in mind: the topic of a document, the intelligence of a student, the disease causing a patient's symptoms, and so on. These models are often much more interpretable by human practitioners and often have more theoretical guarantees, yet are less able to scale to complex problems and are not reusable in as many different contexts as deep models.

Another obvious difference is the kind of connectivity typically used in the deep learning approach. Deep graphical models typically have large groups of units that are all connected to other groups of units, so that the interactions between two groups may be described by a single matrix. Traditional graphical models
have very few connections and the choice of connections for each variable may be individually designed. The design of the model structure is tightly linked with the choice of inference algorithm. Traditional approaches to graphical models typically aim to maintain the tractability of exact inference. When this constraint is too limiting, a popular approximate inference algorithm is an algorithm called loopy belief propagation. Both of these approaches often work well with very sparsely connected graphs. By comparison, models used in deep learning tend to connect each visible unit v_i to very many hidden units h_j, so that h can provide a distributed representation of v_i (and probably several other observed variables too). Distributed representations have many advantages, but from the point of view
of graphical models and computational complexity, distributed representations have the disadvantage of usually yielding graphs that are not sparse enough for the traditional techniques of exact inference and loopy belief propagation to be relevant. As a consequence, one of the most striking differences between the larger graphical models community and the deep graphical models community is that loopy belief propagation is almost never used for deep learning. Most deep models are instead designed to make Gibbs sampling or variational inference algorithms efficient. Another consideration is that deep learning models contain a very large number of latent variables, making efficient numerical code essential.
This provides an additional motivation, besides the choice of high-level inference algorithm, for grouping the units into layers with a matrix describing the interaction between two layers. This allows the individual steps of the algorithm to be implemented with efficient matrix product operations, or sparsely connected generalizations, like block diagonal matrix products or convolutions.

Finally, the deep learning approach to graphical modeling is characterized by a marked tolerance of the unknown. Rather than simplifying the model until all quantities we might want can be computed exactly, we increase the power of the model until it is just barely possible to train or use. We often use models whose marginal distributions cannot be computed, and are satisfied simply to draw approximate samples from these models.
We often train models with an intractable objective function that we cannot even approximate in a reasonable amount of time, but we are still able to approximately train the model if we can efficiently obtain an estimate of the gradient of such a function. The deep learning approach is often to figure out what the minimum amount of information we absolutely need is, and then to figure out how to get a reasonable approximation of that information as quickly as possible.
Figure 16.14: An RBM drawn as a Markov network.
16.7.1 Example: The Restricted Boltzmann Machine
The restricted Boltzmann machine (RBM) (Smolensky, 1986) or harmonium is the quintessential example of how graphical models are used for deep learning. The RBM is not itself a deep model. Instead, it has a single layer of latent variables that may be used to learn a representation for the input. In Chapter 20, we will see how RBMs can be used to build many deeper models. Here, we show how the RBM exemplifies many of the practices used in a wide variety of deep graphical models: its units are organized into large groups called layers, the connectivity between layers is described by a matrix, the connectivity is relatively dense, the model is designed to allow efficient Gibbs sampling, and the emphasis of the model design is on freeing the training algorithm to learn latent variables whose semantics were not specified by the designer.
Later, in Sec. 20.2, we will revisit the RBM in more detail.

The canonical RBM is an energy-based model with binary visible and hidden units. Its energy function is

E(v, h) = −b^T v − c^T h − v^T W h,    (16.10)

where b, c, and W are unconstrained, real-valued, learnable parameters. We can see that the model is divided into two groups of units, v and h, and the interaction between them is described by a matrix W. The model is depicted graphically in Fig. 16.14. As this figure makes clear, an important aspect of this model is that there are no direct interactions between any two visible units or between any two hidden units (hence the "restricted"; a general Boltzmann machine may have arbitrary connections).

The restrictions on the RBM structure yield the nice properties

p(h | v) = ∏_i p(h_i | v)    (16.11)

and
p(v | h) = ∏_i p(v_i | h).    (16.12)
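To make Eq. 16.10 concrete, here is a minimal NumPy sketch of the RBM energy function. The names (`rbm_energy`, `v`, `h`, `b`, `c`, `W`) are ours, chosen to mirror the notation above; this is an illustration, not the book's code.

```python
import numpy as np

def rbm_energy(v, h, b, c, W):
    """Energy of a binary RBM (Eq. 16.10): E(v, h) = -b'v - c'h - v'Wh."""
    return -b @ v - c @ h - v @ W @ h

# A tiny configuration matching Fig. 16.14: 3 visible units, 4 hidden units.
rng = np.random.default_rng(0)
b = rng.standard_normal(3)          # visible biases
c = rng.standard_normal(4)          # hidden biases
W = rng.standard_normal((3, 4))     # visible-hidden interaction matrix
v = np.array([1.0, 0.0, 1.0])       # a binary visible configuration
h = np.array([0.0, 1.0, 1.0, 0.0])  # a binary hidden configuration
print(rbm_energy(v, h, b, c, W))
```

Because the energy is linear in each parameter, lower energy corresponds to higher unnormalized probability exp(−E(v, h)).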
Figure 16.15: Samples from a trained RBM, and its weights. Image reproduced with permission from LISA (2008). (Left) Samples from a model trained on MNIST, drawn using Gibbs sampling. Each column is a separate Gibbs sampling process. Each row represents the output of another 1,000 steps of Gibbs sampling. Successive samples are highly correlated with one another. (Right) The corresponding weight vectors. Compare this to the samples and weights of a linear factor model, shown in Fig. 13.2. The samples here are much better because the RBM prior p(h) is not constrained to be factorial. The RBM can learn which features should appear together when sampling. On the other hand, the RBM posterior p(h | v) is factorial, while the sparse coding posterior p(h | v) is not, so the sparse coding model may be better for feature extraction. Other models are able to have both a non-factorial p(h) and a non-factorial p(h | v).
The individual conditionals are simple to compute as| well. For the binary RBM
we obtain

P(h_i = 1 | v) = σ(v^T W_{:,i} + c_i),    (16.13)

P(h_i = 0 | v) = 1 − σ(v^T W_{:,i} + c_i),    (16.14)

where c_i is the hidden bias from Eq. 16.10 and σ is the logistic sigmoid. Together these properties allow for efficient block Gibbs sampling, which alternates between sampling all of h simultaneously and sampling all of v simultaneously. Samples generated by Gibbs sampling from an RBM model are shown in Fig. 16.15.

Since the energy function itself is just a linear function of the parameters, it is easy to take derivatives of the energy function. For example,

∂E(v, h)/∂W_{i,j} = −v_i h_j.    (16.15)

These two properties, efficient Gibbs sampling and efficient derivatives, make training convenient. In Chapter 18, we will see that undirected models may be trained by computing such derivatives applied to samples from the model.

Training the model induces a representation h of the data v.
We can often use E_{h∼p(h|v)}[h] as a set of features to describe v.
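The block Gibbs updates described above can be sketched in NumPy as follows. This is a toy illustration, not the book's implementation: it assumes the parametrization of Eq. 16.10, with c as the hidden biases and b as the visible biases.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def block_gibbs_step(v, b, c, W, rng):
    """One alternating sweep of block Gibbs sampling in a binary RBM.

    Given v, the hidden units are conditionally independent (Eq. 16.11),
    so the entire vector h is sampled at once; likewise for v given h.
    """
    p_h = sigmoid(c + v @ W)   # P(h_j = 1 | v), cf. Eq. 16.13
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = sigmoid(b + W @ h)   # P(v_i = 1 | h), by the symmetry of the energy
    v = (rng.random(p_v.shape) < p_v).astype(float)
    return v, h

rng = np.random.default_rng(0)
b, c = np.zeros(6), np.zeros(4)
W = 0.01 * rng.standard_normal((6, 4))
v = (rng.random(6) < 0.5).astype(float)
for _ in range(1000):          # burn-in; successive samples are highly correlated
    v, h = block_gibbs_step(v, b, c, W, rng)
# The energy gradient with respect to W at a sample is simply
# -np.outer(v, h), cf. Eq. 16.15.
```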
Overall, the RBM demonstrates the typical deep learning approach to graphical models: representation learning accomplished via layers of latent variables, combined with efficient interactions between layers parametrized by matrices.

The language of graphical models provides an elegant, flexible, and clear way of describing probabilistic models. In the chapters ahead, we use this language, among other perspectives, to describe a wide variety of deep probabilistic models.
Chapter 17
Monte Carlo Methods

Randomized algorithms fall into two rough categories: Las Vegas algorithms and Monte Carlo algorithms. Las Vegas algorithms always return precisely the correct answer (or report that they failed). These algorithms consume a random amount of resources, usually memory or time. In contrast, Monte Carlo algorithms return answers with a random amount of error. The amount of error can typically be reduced by expending more resources (usually running time and memory). For any fixed computational budget, a Monte Carlo algorithm can provide an approximate answer.

Many problems in machine learning are so difficult that we can never expect to obtain precise answers to them. This excludes precise deterministic algorithms and Las Vegas algorithms. Instead, we must use deterministic approximate algorithms or Monte Carlo approximations.
Both approaches are ubiquitous in machine learning. In this chapter, we focus on Monte Carlo methods.
17.1 Sampling and Monte Carlo Methods
Many important technologies used to accomplish machine learning goals are based on drawing samples from some probability distribution and using these samples to form a Monte Carlo estimate of some desired quantity.
17.1.1 Why Sampling?
There are many reasons that we may wish to draw samples from a probability distribution. Sampling provides a flexible way to approximate many sums and
integrals at reduced cost. Sometimes we use this to provide a significant speedup to a costly but tractable sum, as in the case when we subsample the full training cost with minibatches. In other cases, our learning algorithm requires us to approximate an intractable sum or integral, such as the gradient of the log partition function of an undirected model. In many other cases, sampling is actually our goal, in the sense that we want to train a model that can sample from the training distribution.
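The minibatch case can be illustrated with a toy NumPy sketch: subsampling a large sum gives an unbiased Monte Carlo estimate of the full average. The data and sizes here are synthetic, chosen arbitrarily for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-example losses for a large synthetic "dataset". The full training cost
# is an average over all of them.
losses = rng.standard_normal(1_000_000) ** 2
full_cost = losses.mean()                  # exact, but touches every example

# A minibatch subsample gives an unbiased Monte Carlo estimate of the same
# average at a tiny fraction of the cost.
batch = rng.choice(losses, size=1_000, replace=False)
minibatch_cost = batch.mean()

print(full_cost, minibatch_cost)           # the two values are typically close
```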
17.1.2 Basics of Monte Carlo Sampling
When a sum or an integral cannot be computed exactly (for example, the sum has an exponential number of terms and no exact simplification is known), it is often possible to approximate it using Monte Carlo sampling. The idea is to view the sum or integral as if it were an expectation under some distribution and to approximate the expectation by a corresponding average. Let

s = ∑_x p(x) f(x) = E_p[f(x)]    (17.1)

or

s = ∫ p(x) f(x) dx = E_p[f(x)]    (17.2)

be the sum or integral to estimate, rewritten as an expectation, with the constraint that p is a probability distribution (for the sum) or a probability density (for the integral) over random variable x.

We can approximate s by drawing n samples x^{(1)}, …, x^{(n)} from p and then forming the empirical average

ŝ_n = (1/n) ∑_{i=1}^{n} f(x^{(i)}).    (17.3)

This approximation is justified by a few different properties. The first trivial observation is that the estimator ŝ_n is unbiased, since

E[ŝ_n] = (1/n) ∑_{i=1}^{n} E[f(x^{(i)})] = (1/n) ∑_{i=1}^{n} s = s.    (17.4)

But in addition, the law of large numbers states that if the samples x^{(i)} are i.i.d., then the average converges almost surely to the expected value:

lim_{n→∞} ŝ_n = s,    (17.5)
CHAPTER 17. MONTE CARLO METHODS
provided that the variance of the individual terms, Var[f(x^{(i)})], is bounded. To see this more clearly, consider the variance of \hat{s}_n as n increases. The variance Var[\hat{s}_n] decreases and converges to 0, so long as Var[f(x^{(i)})] < \infty:

    Var[\hat{s}_n] = \frac{1}{n^2} \sum_{i=1}^n Var[f(x)]             (17.6)

                   = \frac{Var[f(x)]}{n}.                             (17.7)

This convenient result also tells us how to estimate the uncertainty in a Monte Carlo average, or equivalently the amount of expected error of the Monte Carlo approximation. We compute both the empirical average of the f(x^{(i)}) and their empirical variance,¹ and then divide the estimated variance by the number of samples n to obtain an estimator of Var[\hat{s}_n]. The central limit theorem tells us that the distribution of the average, \hat{s}_n, converges to a normal distribution with mean s and variance \frac{Var[f(x)]}{n}. This allows us to estimate confidence intervals around the estimate \hat{s}_n, using the cumulative distribution of the normal density.
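As a concrete illustration of Eqs. 17.3 through 17.7, the following sketch estimates an expectation by an empirical average and uses the empirical variance to form a standard error. The function name and the particular choice of p and f are illustrative assumptions, not from the book:

```python
# Sketch of the basic Monte Carlo estimator (Eqs. 17.3-17.7).
import numpy as np

rng = np.random.default_rng(0)

def monte_carlo_estimate(f, sampler, n):
    """Return (s_hat, standard_error) for s = E_p[f(x)],
    where sampler(n) draws n i.i.d. samples from p."""
    fx = f(sampler(n))
    s_hat = fx.mean()                   # empirical average, Eq. 17.3
    var_hat = fx.var(ddof=1)            # unbiased empirical variance (divides by n - 1)
    return s_hat, np.sqrt(var_hat / n)  # square root of Var[s_hat], Eq. 17.7

# Toy example: under a standard normal, E[x^2] is exactly 1.
s_hat, se = monte_carlo_estimate(lambda x: x**2, rng.standard_normal, 100_000)
# By the central limit theorem, s_hat +/- 1.96 * se gives a ~95% confidence interval.
```

With 100,000 samples the standard error is small, so the interval is tight around the true value of 1.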
However, all this relies on our ability to easily sample from the base distribution p(x), but doing so is not always possible. When it is not feasible to sample from p, an alternative is to use importance sampling, presented in Sec. 17.2. A more general approach is to form a sequence of estimators that converge towards the distribution of interest. That is the approach of Monte Carlo Markov chains (Sec. 17.3).
17.2 Importance Sampling
An important step in the decomposition of the integrand (or summand) used by the Monte Carlo method in Eq. 17.2 is deciding which part of the integrand should play the role of the probability p(x) and which part of the integrand should play the role of the quantity f(x) whose expected value (under that probability distribution) is to be estimated. There is no unique decomposition, because p(x)f(x) can always be rewritten as

    p(x) f(x) = q(x) \frac{p(x) f(x)}{q(x)},                          (17.8)

where we now sample from q and average \frac{p(x) f(x)}{q(x)}. In many cases, we wish to compute an expectation for a given p and an f, and the fact that the problem is specified

¹ The unbiased estimator of the variance is often preferred, in which the sum of squared differences is divided by n − 1 instead of n.
from the start as an expectation suggests that this p and f would be a natural choice of decomposition. However, the original specification of the problem may not be the optimal choice in terms of the number of samples required to obtain a given level of accuracy. Fortunately, the form of the optimal choice q^* can be derived easily. The optimal q^* corresponds to what is called optimal importance sampling.

Because of the identity shown in Eq. 17.8, any Monte Carlo estimator

    \hat{s}_p = \frac{1}{n} \sum_{i=1, x^{(i)} \sim p}^n f(x^{(i)})   (17.9)

can be transformed into an importance sampling estimator

    \hat{s}_q = \frac{1}{n} \sum_{i=1, x^{(i)} \sim q}^n \frac{p(x^{(i)}) f(x^{(i)})}{q(x^{(i)})}.   (17.10)

We see readily that the expected value of the estimator does not depend on q:
    E_q[\hat{s}_q] = E_p[\hat{s}_p] = s.                              (17.11)

However, the variance of an importance sampling estimator can be greatly sensitive to the choice of q. The variance is given by

    Var[\hat{s}_q] = Var\left[ \frac{p(x) f(x)}{q(x)} \right] / n.    (17.12)

The minimum variance occurs when q is

    q^*(x) = \frac{p(x) |f(x)|}{Z},                                   (17.13)

where Z is the normalization constant, chosen so that q^*(x) sums or integrates to 1 as appropriate. Better importance sampling distributions put more weight where the integrand is larger. In fact, when f(x) does not change sign, Var[\hat{s}_{q^*}] = 0, meaning that a single sample is sufficient when the optimal distribution is used. Of course, this is only because the computation of q^* has essentially solved the original problem, so it is usually not practical to use this approach of drawing a single sample from the optimal distribution.

Any choice of sampling distribution q is valid (in the sense of yielding the correct expected value) and q^* is the optimal one (in the sense of yielding minimum variance). Sampling from q^* is usually infeasible, but other choices of q can be feasible while still reducing the variance somewhat.
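To make the estimator in Eq. 17.10 concrete, here is a small sketch in which the particular p, q, and f are illustrative assumptions: the target p is a standard normal, f(x) = x² (so the true value is s = E_p[x²] = 1), and the proposal q is a wider normal:

```python
# Importance sampling estimator of Eq. 17.10: draw x^(i) ~ q, then
# average f(x^(i)) weighted by p(x^(i)) / q(x^(i)).
import numpy as np

rng = np.random.default_rng(1)

def p_density(x):                 # target p: standard normal N(0, 1)
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def q_density(x):                 # proposal q: wider normal N(0, 4)
    return np.exp(-x**2 / 8) / (2 * np.sqrt(2 * np.pi))

f = lambda x: x**2                # true value: s = E_p[x^2] = 1

n = 200_000
x = 2.0 * rng.standard_normal(n)                         # x^(i) ~ q = N(0, 4)
s_hat_q = np.mean(p_density(x) * f(x) / q_density(x))    # Eq. 17.10
```

The expected value of `s_hat_q` does not depend on q (Eq. 17.11), but its variance does (Eq. 17.12): a q that places more mass where p(x)|f(x)| is large yields a lower-variance estimator.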
Another approach is to use biased importance sampling, which has the advantage of not requiring normalized p or q. In the case of discrete variables, the biased importance sampling estimator is given by

    \hat{s}_{BIS} = \frac{ \sum_{i=1}^n \frac{p(x^{(i)})}{q(x^{(i)})} f(x^{(i)}) }{ \sum_{i=1}^n \frac{p(x^{(i)})}{q(x^{(i)})} }           (17.14)

                  = \frac{ \sum_{i=1}^n \frac{p(x^{(i)})}{\tilde{q}(x^{(i)})} f(x^{(i)}) }{ \sum_{i=1}^n \frac{p(x^{(i)})}{\tilde{q}(x^{(i)})} }   (17.15)

                  = \frac{ \sum_{i=1}^n \frac{\tilde{p}(x^{(i)})}{\tilde{q}(x^{(i)})} f(x^{(i)}) }{ \sum_{i=1}^n \frac{\tilde{p}(x^{(i)})}{\tilde{q}(x^{(i)})} },   (17.16)
where \tilde{p} and \tilde{q} are the unnormalized forms of p and q and the x^{(i)} are the samples from q. This estimator is biased because E[\hat{s}_{BIS}] \neq s, except asymptotically when n \to \infty and the denominator of Eq. 17.14 converges to 1. Hence this estimator is called asymptotically unbiased.

Although a good choice of q can greatly improve the efficiency of Monte Carlo estimation, a poor choice of q can make the efficiency much worse. Going back to Eq. 17.12, we see that if there are samples of q for which \frac{p(x)|f(x)|}{q(x)} is large, then the variance of the estimator can get very large. This may happen when q(x) is tiny while neither p(x) nor f(x) is small enough to cancel it. The q distribution is usually chosen to be a very simple distribution so that it is easy to sample from. When x is high-dimensional, this simplicity in q causes it to match p or p|f| poorly. When q(x^{(i)}) \gg p(x^{(i)}) |f(x^{(i)})|, importance sampling collects useless samples (summing tiny numbers or zeros).
On the other hand, when q(x^{(i)}) \ll p(x^{(i)}) |f(x^{(i)})|, which will happen more rarely, the ratio can be huge. Because these latter events are rare, they may not show up in a typical sample, yielding typical underestimation of s, compensated only rarely by gross overestimation. Such very large or very small numbers are typical when x is high dimensional, because in high dimension the dynamic range of joint probabilities can be very large.

In spite of this danger, importance sampling and its variants have been found very useful in many machine learning algorithms, including deep learning algorithms. For example, see the use of importance sampling to accelerate training in neural language models with a large vocabulary (Sec. 12.4.3.3) or other neural nets with a large number of outputs. See also how importance sampling has been used to estimate a partition function (the normalization constant of a probability
distribution) in Sec. 18.7, and to estimate the log-likelihood in deep directed models such as the variational autoencoder, in Sec. 20.10.3. Importance sampling may also be used to improve the estimate of the gradient of the cost function used to train model parameters with stochastic gradient descent, particularly for models such as classifiers, where most of the total value of the cost function comes from a small number of misclassified examples. Sampling more difficult examples more frequently can reduce the variance of the gradient in such cases (Hinton, 2006).
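As a concrete illustration of the biased (self-normalized) estimator in Eq. 17.16, the following sketch uses only unnormalized densities; the specific \tilde{p}, \tilde{q}, and f are illustrative assumptions, not from the book:

```python
# Biased importance sampling (Eq. 17.16): the unknown normalization
# constants of p and q cancel in the ratio of the two weighted sums.
import numpy as np

rng = np.random.default_rng(2)

def p_tilde(x):                   # unnormalized target: N(0, 1) up to a constant
    return np.exp(-x**2 / 2)

def q_tilde(x):                   # unnormalized proposal: N(0, 4) up to a constant
    return np.exp(-x**2 / 8)

f = lambda x: x**2                # true value under p is E_p[x^2] = 1

x = 2.0 * rng.standard_normal(200_000)        # x^(i) ~ q = N(0, 4)
w = p_tilde(x) / q_tilde(x)                   # unnormalized importance weights
s_hat_bis = np.sum(w * f(x)) / np.sum(w)      # self-normalized average, Eq. 17.16
```

The estimator is biased for any finite n, but as the sample size grows the normalized weights concentrate around their expectation and the estimate approaches the true value.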
17.3 Markov Chain Monte Carlo Methods
In many cases, we wish to use a Monte Carlo technique but there is no tractable method for drawing exact samples from the distribution p_{model}(x) or from a good (low variance) importance sampling distribution q(x). In the context of deep learning, this most often happens when p_{model}(x) is represented by an undirected model. In these cases, we introduce a mathematical tool called a Markov chain to approximately sample from p_{model}(x). The family of algorithms that use Markov chains to perform Monte Carlo estimates is called Markov chain Monte Carlo (MCMC) methods. Markov chain Monte Carlo methods for machine learning are described at greater length in Koller and Friedman (2009). The most standard, generic guarantees for MCMC techniques are only applicable when the model does not assign zero probability to any state. Therefore, it is most convenient to present these techniques as sampling from an energy-based model (EBM)
it is most convenien exp (−E (x )) as pdo (xes) ∝ describ described ed in Sec. 16.2.4.Therefore, In the EBM form formulation, ulation, everyt to presen t these tecto hniques as sampling from an energy-based mo del state is guaranteed ha hav ve non-zero probability probability. . MCMC methods are (EBM) in fact exp ( E ( x )) p ( x ) as describ ed in Sec. 16.2.4 . In the EBM form ulation, more broadly applicable and can b e used with many probabilit probability y distributionsevery that state is guaranteed to ha v e non-zero probability . MCMC methods are in ∝ − con contain tain zero probability states. Ho How wev ever, er, the theoretical guaran guarantees tees concerning fact the more broadly applicable and b ebused with many probabilit distributions that b eha ehavior vior of MCMC metho methods dscan must e prov proven en on a case-b case-by-case y-casey basis for different contain zero probability states.InHothe wevcontext er, the theoretical guarantees the families of such distributions. of deep learning, it is concerning most common b eha vioronof the MCMC ds must b e prov en on tees a case-b basis apply for different to rely mostmetho general theoretical guaran guarantees thaty-case naturally to all families of such distributions. In the context of deep learning, it is most common energy-based mo models. dels. to rely on the most general theoretical guarantees that naturally apply to all To understand why dra drawing wing samples from an energy-based mo model del is difficult, energy-based mo dels. consider an EBM ov over er just two variables, defining a distribution p(a, b). In order T o understand why drawing samples an in energy-based mo delb is difficult, b ), and p(a | from to sample a, we must draw a from order to sample , we must consider an EBM ov er just t w o v ariables, defining a distribution p ( a , b ) . In order dra draw w it from p(b | a). It seems to b e an intractable chic chick ken-and-egg problem. 
p(a | b), and in order to sample b, we must draw it from p(b | a). It seems to be an intractable chicken-and-egg problem. Directed models avoid this because their graph is directed and acyclic. To perform ancestral sampling, one simply samples each of the variables in topological order, conditioning on each variable's parents, which are guaranteed to have already been sampled (Sec. 16.3). Ancestral sampling defines an efficient, single-pass method of
obtaining a sample.

In an EBM, we can avoid this chicken-and-egg problem by sampling using a Markov chain. The core idea of a Markov chain is to have a state x that begins as an arbitrary value. Over time, we randomly update x repeatedly. Eventually x becomes (very nearly) a fair sample from p(x). Formally, a Markov chain is defined by a random state x and a transition distribution T(x' | x) specifying the probability that a random update will go to state x' if it starts in state x. Running the Markov chain means repeatedly updating the state x to a value x' sampled from T(x' | x).

To gain some theoretical understanding of how MCMC methods work, it is useful to reparametrize the problem. First, we restrict our attention to the case where the random variable x has countably many states. In this case, we can represent the state as just a positive integer x. Different integer values of x map back to different states x in the original problem.
Consider what happens when we run infinitely many Markov chains in parallel. All of the states of the different Markov chains are drawn from some distribution q^{(t)}(x), where t indicates the number of time steps that have elapsed. At the beginning, q^{(0)} is some distribution that we used to arbitrarily initialize x for each Markov chain. Later, q^{(t)} is influenced by all of the Markov chain steps that have run so far. Our goal is for q^{(t)}(x) to converge to p(x).

Because we have reparametrized the problem in terms of a positive integer x, we can describe the probability distribution q using a vector v, with

    q(x = i) = v_i.                                                   (17.17)

Consider what happens when we update a single Markov chain's state x to a new state x'. The probability of a single state landing in state x' is given by

    q^{(t+1)}(x') = \sum_x q^{(t)}(x) T(x' | x).                      (17.18)
Using our integer parametrization, we can represent the effect of the transition operator T using a matrix A. We define A so that

    A_{i,j} = T(x' = i | x = j).                                      (17.19)
Using this definition, we can now rewrite Eq. 17.18. Rather than writing it in terms of q and T to understand how a single state is updated, we may now use v and A to describe how the entire distribution over all the different Markov chains run in parallel shifts as we apply an update:

    v^{(t)} = A v^{(t-1)}.                                            (17.20)
Applying the Markov chain update repeatedly corresponds to multiplying by the matrix A repeatedly. In other words, we can think of the process as exponentiating the matrix A:

    v^{(t)} = A^t v^{(0)}.                                            (17.21)

The matrix A has special structure because each of its columns represents a probability distribution. Such matrices are called stochastic matrices. If there is a non-zero probability of transitioning from any state x to any other state x' for some power t, then the Perron-Frobenius theorem (Perron, 1907; Frobenius, 1908) guarantees that the largest eigenvalue is real and equal to 1. Over time, we can see that all of the eigenvalues are exponentiated:

    v^{(t)} = (V \mathrm{diag}(\lambda) V^{-1})^t v^{(0)} = V \mathrm{diag}(\lambda)^t V^{-1} v^{(0)}.   (17.22)
This process causes all of the eigenvalues that are not equal to 1 to decay to zero. Under some additional mild conditions, A is guaranteed to have only one eigenvector with eigenvalue 1. The process thus converges to a stationary distribution, sometimes also called the equilibrium distribution. At convergence,

    v' = A v = v,                                                     (17.23)
and this same condition holds for every additional step. This is an eigenvector equation. To be a stationary point, v must be an eigenvector with corresponding eigenvalue 1. This condition guarantees that once we have reached the stationary distribution, repeated applications of the transition sampling procedure do not change the distribution over the states of all the various Markov chains (although the transition operator does change each individual state, of course).

If we have chosen T correctly, then the stationary distribution q will be equal to the distribution p we wish to sample from. We will describe how to choose T shortly, in Sec. 17.4.

Most properties of Markov chains with countable states can be generalized to continuous variables.
In this situation, some authors call the Markov chain a Harris chain, but we use the term Markov chain to describe both conditions. In general, a Markov chain with transition operator T will converge, under mild conditions, to a fixed point described by the equation

    q'(x') = E_{x \sim q} T(x' | x),                                  (17.24)

which in the discrete case is just rewriting Eq. 17.23. When x is discrete, the expectation corresponds to a sum, and when x is continuous, the expectation corresponds to an integral.
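The matrix view of Eqs. 17.17 through 17.23 can be checked on a toy chain. The following sketch (a three-state example of my own, not from the book) applies the update v ← Av repeatedly and verifies that the result is a fixed point of A:

```python
# Repeatedly applying a column-stochastic transition matrix A (Eq. 17.20)
# drives the distribution v toward the stationary distribution (Eq. 17.23).
import numpy as np

# Column j holds the transition distribution T(x' | x = j); columns sum to 1.
A = np.array([[0.5, 0.2, 0.3],
              [0.3, 0.6, 0.3],
              [0.2, 0.2, 0.4]])

v = np.array([1.0, 0.0, 0.0])     # q^(0): start deterministically in state 0
for _ in range(100):              # Eq. 17.21: v^(t) = A^t v^(0)
    v = A @ v

# At convergence, v is an eigenvector of A with eigenvalue 1: Av = v.
assert np.allclose(A @ v, v)
```

Because this chain has non-zero probability of reaching every state, the same stationary distribution is reached regardless of the initial vector v.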
Regardless of whether the state is continuous or discrete, all Markov chain methods consist of repeatedly applying stochastic updates until eventually the state begins to yield samples from the equilibrium distribution. Running the Markov chain until it reaches its equilibrium distribution is called "burning in" the Markov chain. After the chain has reached equilibrium, a sequence of infinitely many samples may be drawn from the equilibrium distribution. They are identically distributed, but any two successive samples will be highly correlated with each other. A finite sequence of samples may thus not be very representative of the equilibrium distribution. One way to mitigate this problem is to return only every n samples, so that our estimate of the statistics of the equilibrium distribution is not as biased by the correlation between an MCMC sample and the next several samples.
Markov chains are thus expensive to use because of the time required to burn in to the equilibrium distribution and the time required to transition from one sample to another reasonably decorrelated sample after reaching equilibrium. If one desires truly independent samples, one can run multiple Markov chains in parallel. This approach uses extra parallel computation to eliminate latency. The strategy of using only a single Markov chain to generate all samples and the strategy of using one Markov chain for each desired sample are two extremes; deep learning practitioners usually use a number of chains that is similar to the number of examples in a minibatch and then draw as many samples as are needed from this fixed set of Markov chains. A commonly used number of Markov chains is 100.

Another difficulty is that we do not know in advance how many steps the Markov chain must run before reaching its equilibrium distribution. This length of time is called the mixing time. It is also very difficult to test whether a Markov chain has reached equilibrium. We do not have a precise enough theory for guiding us in answering this question. Theory tells us that the chain will converge, but not much more. If we analyze the Markov chain from the point of view of a matrix A acting on a vector of probabilities v, then we know that the chain mixes when A^t has effectively lost all of the eigenvalues of A besides the unique eigenvalue of 1. This means that the magnitude of the second largest eigenvalue will determine the mixing time. However, in practice, we cannot actually represent our Markov chain in terms of a matrix. The number of states that our probabilistic model can visit is exponentially large in the number of variables, so it is infeasible to represent v, A, or the eigenvalues of A.
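For a chain small enough to write down as a matrix, the spectral picture above can be checked directly. The helper below (our own toy sketch, not from the text) computes the magnitude of the second largest eigenvalue, which governs how quickly A^t forgets the initial distribution:

```python
import numpy as np

def second_eigenvalue_magnitude(T):
    # Sort eigenvalue magnitudes in decreasing order; for a stochastic
    # matrix the largest is the eigenvalue 1, so index 1 is the one
    # that controls the mixing time.
    mags = np.sort(np.abs(np.linalg.eigvals(T)))[::-1]
    return mags[1]

# Hypothetical chains: the first forgets its starting state in one step,
# the second retains it for a long time.
fast_chain = np.array([[0.5, 0.5],
                       [0.5, 0.5]])    # second eigenvalue magnitude 0
slow_chain = np.array([[0.99, 0.01],
                       [0.01, 0.99]])  # second eigenvalue magnitude 0.98
```

A second eigenvalue magnitude near 1 means the chain needs on the order of 1/(1 − |λ₂|) steps to mix, which is exactly why the nearly-diagonal chain above is slow.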
Due to these and other obstacles, we usually do not know whether a Markov chain has mixed. Instead, we simply run the Markov chain for an amount of time that we roughly estimate to be sufficient, and use heuristic methods to determine whether the chain has mixed. These heuristic methods include manually inspecting samples or measuring correlations between successive samples.
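One such heuristic can be sketched in a few lines: estimate the lag-k autocorrelation of some scalar statistic of the chain; values near zero suggest adequate mixing, while values near ±1 are a warning sign. This is a generic illustration, not a procedure from the text:

```python
def autocorrelation(xs, lag):
    """Sample autocorrelation of the sequence xs at the given lag."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    cov = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(n - lag)) / n
    return cov / var

# A strongly trending sequence is highly correlated at lag 1, while an
# alternating sequence is strongly anti-correlated.
trending = [float(i) for i in range(100)]
alternating = [1.0, -1.0] * 50
```

In practice one would compute this at several lags on a statistic of the sampled states and look for the autocorrelation to decay toward zero.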
17.4
Gibbs Sampling
So far we have described how to draw samples from a distribution q(x) by repeatedly updating x ← x' ∼ T(x' | x). However, we have not described how to ensure that q(x) is a useful distribution. Two basic approaches are considered in this book. The first one is to derive T from a given learned model p_model, described below with the case of sampling from EBMs. The second one is to directly parametrize T and learn it, so that its stationary distribution implicitly defines the p_model of interest. Examples of this second approach are discussed in Sec. 20.12 and Sec. 20.13.

In the context of deep learning, we commonly use Markov chains to draw samples from an energy-based model defining a distribution p_model(x). In this case, we want the q(x) for the Markov chain to be p_model(x). To obtain the desired q(x), we must choose an appropriate T(x' | x).

A conceptually simple and effective approach to building a Markov chain
that samples from p_model(x) is to use Gibbs sampling, in which sampling from T(x' | x) is accomplished by selecting one variable x_i and sampling it from p_model conditioned on its neighbors in the undirected graph G defining the structure of the energy-based model. It is also possible to sample several variables at the same time so long as they are conditionally independent given all of their neighbors. As shown in the RBM example in Sec. 16.7.1, all of the hidden units of an RBM may be sampled simultaneously because they are conditionally independent from each other given all of the visible units. Likewise, all of the visible units may be sampled simultaneously because they are conditionally independent from each other given all of the hidden units. Gibbs sampling approaches that update many variables simultaneously in this way are called block Gibbs sampling.
Alternate approaches to designing Markov chains to sample from p_model are possible. For example, the Metropolis-Hastings algorithm is widely used in other disciplines. In the context of the deep learning approach to undirected modeling, it is rare to use any approach other than Gibbs sampling. Improved sampling techniques are one possible research frontier.
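Block Gibbs sampling for an RBM can be sketched as follows. The weights and dimensions below are made up purely for illustration; the only point is that all hidden units are sampled at once given the visible units, and vice versa:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 4
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))  # hypothetical weights
b = np.zeros(n_visible)  # visible biases
c = np.zeros(n_hidden)   # hidden biases

v = rng.integers(0, 2, size=n_visible).astype(float)
for _ in range(100):
    # Block update: the h_j are conditionally independent given v.
    h = (rng.random(n_hidden) < sigmoid(c + v @ W)).astype(float)
    # Block update: the v_i are conditionally independent given h.
    v = (rng.random(n_visible) < sigmoid(b + W @ h)).astype(float)
```

Each pass performs only two sampling operations, regardless of the number of units, which is what makes block Gibbs sampling attractive for bipartite models like the RBM.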
17.5
The Challenge of Mixing between Separated Modes
The primary difficulty involved with MCMC methods is that they have a tendency to mix poorly. Ideally, successive samples from a Markov chain designed to sample from p(x) would be completely independent from each other and would visit many different regions in x space proportional to their probability. Instead, especially in high dimensional cases, MCMC samples become very correlated. We refer
to such behavior as slow mixing or even failure to mix. MCMC methods with slow mixing can be seen as inadvertently performing something resembling noisy gradient descent on the energy function, or equivalently noisy hill climbing on the probability, with respect to the state of the chain (the random variables being sampled). The chain tends to take small steps (in the state space of the Markov chain), from a configuration x^(t−1) to a configuration x^(t), with the energy E(x^(t)) generally lower than or approximately equal to the energy E(x^(t−1)), with a preference for moves that yield lower energy configurations. When starting from a rather improbable configuration (higher energy than the typical ones from p(x)), the chain tends to gradually reduce the energy of the state and only occasionally move to another mode. Once the chain has found a region of low energy (for example, if the variables are pixels in an image, a region of low energy might be
a connected manifold of images of the same object), which we call a mode, the chain will tend to walk around that mode (following a kind of random walk). Once in a while it will step out of that mode and generally return to it or (if it finds an escape route) move towards another mode. The problem is that successful escape routes are rare for many interesting distributions, so the Markov chain will continue to sample the same mode longer than it should.

This is very clear when we consider the Gibbs sampling algorithm (Sec. 17.4). In this context, consider the probability of going from one mode to a nearby mode within a given number of steps. What will determine that probability is the shape of the "energy barrier" between these modes. Transitions between two modes that are separated by a high energy barrier (a region of low probability) are exponentially less likely (in terms of the height of the energy barrier). This is illustrated in Fig. 17.1. The problem arises when there are multiple modes with high probability that are separated by regions of low probability, especially when each Gibbs sampling step must update only a small subset of variables whose values are largely determined by the other variables.

As a simple example, consider an energy-based model over two variables a and b, which are both binary with a sign, taking on values −1 and 1. If E(a, b) = −wab for some large positive number w, then the model expresses a strong belief that a and b have the same sign. Consider updating b using a Gibbs sampling step with a = 1. The conditional distribution over b is given by P(b = 1 | a = 1) = e^w / (e^w + e^{−w}) = σ(2w). If w is large, the sigmoid saturates, and the probability of also assigning b to be 1 is close to 1. Likewise, if a = −1, the probability of assigning b to be −1 is close to 1. According to P_model(a, b), both signs of the two variables are equally likely. According to P_model(a | b), both variables should have the same sign. This means that Gibbs sampling will only very rarely flip the signs of these variables.
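The two-variable example can be verified numerically. The sketch below (our own illustration, not code from the text) runs Gibbs sampling on E(a, b) = −wab, computing each conditional exactly, and counts how often a sign flips; a weak coupling flips constantly while a strong one almost never does:

```python
import math
import random

def gibbs_flip_count(w, steps=10000, seed=0):
    rng = random.Random(seed)
    a, b = 1, 1
    flips = 0
    for _ in range(steps):
        # Exact conditional: P(b = 1 | a) = exp(w*a) / (exp(w*a) + exp(-w*a)).
        p_b1 = math.exp(w * a) / (math.exp(w * a) + math.exp(-w * a))
        new_b = 1 if rng.random() < p_b1 else -1
        flips += (new_b != b)
        b = new_b
        # Update a given b, symmetrically.
        p_a1 = math.exp(w * b) / (math.exp(w * b) + math.exp(-w * b))
        new_a = 1 if rng.random() < p_a1 else -1
        flips += (new_a != a)
        a = new_a
    return flips
```

With w = 0.1 the chain flips thousands of times over 10,000 sweeps; with w = 5 it flips essentially never, even though the two aligned sign configurations are equally probable under the joint distribution.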
Figure 17.1: Paths followed by Gibbs sampling for three distributions, with the Markov chain initialized at the mode in all three cases. (Left) A multivariate normal distribution with two independent variables. Gibbs sampling mixes well because the variables are independent. (Center) A multivariate normal distribution with highly correlated variables. The correlation between variables makes it difficult for the Markov chain to mix. Because each variable must be updated conditioned on the other, the correlation reduces the rate at which the Markov chain can move away from the starting point. (Right) A mixture of Gaussians with widely separated modes that are not axis-aligned. Gibbs sampling mixes very slowly because it is difficult to change modes while altering only one variable at a time.
In more practical scenarios, the challenge is even greater because we care not only about making transitions between two modes but more generally between all the many modes that a real model might contain. If several such transitions are difficult because of the difficulty of mixing between modes, then it becomes very expensive to obtain a reliable set of samples covering most of the modes, and convergence of the chain to its stationary distribution is very slow.

Sometimes this problem can be resolved by finding groups of highly dependent units and updating all of them simultaneously in a block. Unfortunately, when the dependencies are complicated, it can be computationally intractable to draw a sample from the group. After all, the problem that the Markov chain was originally introduced to solve is this problem of sampling from a large group of variables.
In the context of models with latent variables, which define a joint distribution p_model(x, h), we often draw samples of x by alternating between sampling from p_model(x | h) and sampling from p_model(h | x). From the point of view of mixing rapidly, we would like p_model(h | x) to have very high entropy. However, from the point of view of learning a useful representation of x, we would like h to encode
Figure 17.2: An illustration of the slow mixing problem in deep probabilistic models. Each panel should be read left to right, top to bottom. (Left) Consecutive samples from Gibbs sampling applied to a deep Boltzmann machine trained on the MNIST dataset. Consecutive samples are similar to each other. Because the Gibbs sampling is performed in a deep graphical model, this similarity is based more on semantic rather than raw visual features, but it is still difficult for the Gibbs chain to transition from one mode of the distribution to another, for example by changing the digit identity. (Right) Consecutive ancestral samples from a generative adversarial network. Because ancestral sampling generates each sample independently from the others, there is no mixing problem.
enough information about x to reconstruct it well, which implies that h and x should have very high mutual information. These two goals are at odds with each other. We often learn generative models that very precisely encode x into h but are not able to mix very well. This situation arises frequently with Boltzmann machines: the sharper the distribution a Boltzmann machine learns, the harder it is for a Markov chain sampling from the model distribution to mix well. This problem is illustrated in Fig. 17.2.

All this could make MCMC methods less useful when the distribution of interest has a manifold structure with a separate manifold for each class: the distribution is concentrated around many modes and these modes are separated by vast regions of high energy. This type of distribution is what we expect in many classification problems, and it would make MCMC methods converge very slowly because of poor mixing between modes.
17.5.1
Tempering to Mix between Modes
When a distribution has sharp peaks of high probability surrounded by regions of low probability, it is difficult to mix between the different modes of the distribution.
Several techniques for faster mixing are based on constructing alternative versions of the target distribution in which the peaks are not as high and the surrounding valleys are not as low. Energy-based models provide a particularly simple way to do so. So far, we have described an energy-based model as defining a probability distribution

p(x) ∝ exp(−E(x)).    (17.25)

Energy-based models may be augmented with an extra parameter β controlling how sharply peaked the distribution is:

p_β(x) ∝ exp(−βE(x)).    (17.26)

The β parameter is often described as being the reciprocal of the temperature, reflecting the origin of energy-based models in statistical physics. When the temperature falls to zero and β rises to infinity, the energy-based model becomes deterministic. When the temperature rises to infinity and β falls to zero, the distribution (for discrete x) becomes uniform.

Typically, a model is trained to be evaluated at β = 1. However, we can make use of other temperatures, particularly those where β < 1. Tempering is a general strategy of mixing between modes of p_1 rapidly by drawing samples with β < 1. Markov chains based on tempered transitions (Neal, 1994) temporarily sample from higher-temperature distributions in order to mix to different modes, then resume sampling from the unit temperature distribution. These techniques have been applied to models such as RBMs (Salakhutdinov, 2010). Another approach is to use parallel tempering (Iba, 2001), in which the Markov chain simulates many different states in parallel, at different temperatures. The highest temperature states mix quickly, while the lowest temperature states, at temperature 1, provide accurate samples from the model.
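The effect of β in Eq. 17.26 is easy to see on a toy model. The sketch below (a made-up two-state example, not from the text) normalizes exp(−βE) at several temperatures: a small β flattens the distribution toward uniform, and a large β concentrates it on the lowest-energy state:

```python
import math

def tempered_distribution(energies, beta):
    weights = [math.exp(-beta * E) for E in energies]
    Z = sum(weights)                # partition function at this beta
    return [w / Z for w in weights]

energies = [0.0, 2.0]                              # state 0 has lower energy
hot = tempered_distribution(energies, beta=0.01)   # near-uniform
unit = tempered_distribution(energies, beta=1.0)   # the trained model
cold = tempered_distribution(energies, beta=10.0)  # near-deterministic
```

Sampling at small β lets the chain cross between modes easily, which is exactly what the tempering strategies above exploit before returning to β = 1.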
The transition operator includes stochastically swapping states between two different temperature levels, so that a sufficiently high-probability sample from a high-temperature slot can jump into a lower temperature slot. This approach has also been applied to RBMs (Desjardins et al., 2010; Cho et al., 2010). Although tempering is a promising approach, at this point it has not allowed researchers to make a strong advance in solving the challenge of sampling from complex EBMs. One possible reason is that there are critical temperatures around which the temperature transition must be very slow (as the temperature is gradually reduced) in order for tempering to be effective.
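A schematic version of parallel tempering, on a hypothetical one-dimensional double-well energy (all names and settings are our own illustration, not from the text): several chains run at different β values, and a stochastic swap move between adjacent temperature levels lets a state found by a hot chain migrate into a colder slot:

```python
import math
import random

def energy(x):
    # Hypothetical double-well energy with modes near -2 and +2.
    return (x * x - 4.0) ** 2

def parallel_tempering(betas, steps=5000, seed=0):
    rng = random.Random(seed)
    states = [0.0] * len(betas)
    for _ in range(steps):
        # Within-chain Metropolis update at each temperature.
        for i, beta in enumerate(betas):
            prop = states[i] + rng.gauss(0.0, 0.5)
            log_accept = -beta * (energy(prop) - energy(states[i]))
            if rng.random() < math.exp(min(0.0, log_accept)):
                states[i] = prop
        # Stochastic swap between a random adjacent pair of levels.
        i = rng.randrange(len(betas) - 1)
        log_accept = (betas[i] - betas[i + 1]) * (energy(states[i]) - energy(states[i + 1]))
        if rng.random() < math.exp(min(0.0, log_accept)):
            states[i], states[i + 1] = states[i + 1], states[i]
    return states

final = parallel_tempering(betas=[0.05, 0.3, 1.0])  # ordered hot to cold
```

The swap acceptance rule used here is the standard Metropolis criterion for exchanging configurations between inverse temperatures, which preserves the joint stationary distribution over all the levels.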
17.5.2 Depth May Help Mixing
When drawing samples from a latent variable model p(h, x), we have seen that if p(h | x) encodes x too well, then sampling from p(x | h) will not change x very
CHAPTER 17. MONTE CARLO METHODS
much and mixing will be poor. One way to resolve this problem is to make h be a deep representation, that encodes x into h in such a way that a Markov chain in the space of h can mix more easily. Many representation learning algorithms, such as autoencoders and RBMs, tend to yield a marginal distribution over h that is more uniform and more unimodal than the original data distribution over x. It can be argued that this arises from trying to minimize reconstruction error while using all of the available representation space, because minimizing reconstruction error over the training examples will be better achieved when different training examples are easily distinguishable from each other in h-space, and thus well separated. Bengio et al. (2013a) observed that deeper stacks of regularized autoencoders or RBMs yield marginal distributions in the top-level h-space that appeared more spread out and more uniform, with less of a gap between the regions corresponding to different modes (categories, in the experiments). Training an RBM in that higher-level space allowed Gibbs sampling to mix faster between modes. It remains however unclear how to exploit this observation to help better train and sample from deep generative models.

Despite the difficulty of mixing, Monte Carlo techniques are useful and are often the best tool available. Indeed, they are the primary tool used to confront the intractable partition function of undirected models, discussed next.
Chapter 18

Confronting the Partition Function

In Sec. 16.2.2 we saw that many probabilistic models (commonly known as undirected graphical models) are defined by an unnormalized probability distribution p̃(x; θ). We must normalize p̃ by dividing by a partition function Z(θ) in order to obtain a valid probability distribution:

p(x; θ) = (1/Z(θ)) p̃(x; θ).   (18.1)

The partition function is an integral (for continuous variables) or sum (for discrete variables) over the unnormalized probability of all states:

Z(θ) = ∫ p̃(x) dx   (18.2)

or

Z(θ) = Σ_x p̃(x).   (18.3)

This operation is intractable for many interesting models.

As we will see in Chapter 20, several deep learning models are designed to have a tractable normalizing constant, or are designed to be used in ways that do not involve computing p(x) at all. However, other models directly confront the challenge of intractable partition functions. In this chapter, we describe techniques used for training and evaluating models that have intractable partition functions.
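To make Eqs. 18.1-18.3 concrete, here is a sketch that computes Z by brute-force enumeration for a hypothetical log-linear model over three binary variables; such enumeration is exactly what becomes intractable as the number of variables grows, since the number of states doubles with each added variable.

```python
import itertools
import numpy as np

# Hypothetical unnormalized distribution over three binary variables:
# p_tilde(x; theta) = exp(theta . x), a tiny log-linear model.
theta = np.array([0.5, -1.0, 2.0])

def p_tilde(x):
    return np.exp(theta @ x)

# Eq. 18.3: Z(theta) is the sum of p_tilde over all 2^3 = 8 states.
states = [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=3)]
Z = sum(p_tilde(x) for x in states)

# Eq. 18.1: dividing by Z yields probabilities that sum to one.
p = np.array([p_tilde(x) / Z for x in states])
print(Z, p.sum())
```

For this particular factorized example Z also has the closed form Π_i (1 + exp(θ_i)), which the enumeration reproduces; most interesting models admit no such closed form.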
18.1 The Log-Likelihood Gradient

What makes learning undirected models by maximum likelihood particularly difficult is that the partition function depends on the parameters. The gradient of the log-likelihood with respect to the parameters has a term corresponding to the gradient of the partition function:

∇_θ log p(x; θ) = ∇_θ log p̃(x; θ) − ∇_θ log Z(θ).   (18.4)

This is a well-known decomposition into the positive phase and negative phase of learning.

For most undirected models of interest, the negative phase is difficult. Models with no latent variables or with few interactions between latent variables typically have a tractable positive phase. The quintessential example of a model with a straightforward positive phase and difficult negative phase is the RBM, which has hidden units that are conditionally independent from each other given the visible
positive is difficult, interactions teractions hidden that are conditionally endent from each en the fo visible b et etw weenunits laten latent t variables, is primarilyindep co cov vered in Chapter 19other . Thisgiv chapter focuses cuses units. The case where the p ositive phase is difficult, with complicated in teractions on the difficulties of the negative phase. between latent variables, is primarily covered in Chapter 19. This chapter focuses Let difficulties us lo look ok more closely at thephase. gradient of log Z : on the of the negative ∂ Let us look more closely at the gradient of log Z : log Z (18.5) ∂θ ∂ log (18.5) ∂ Z Z ∂ θ ∂θ (18.6) = Z PZ ∂ (18.6) = p˜(x) = ∂θ Zx (18.7) Z P ∂ p˜(x) = (18.7) Z p˜(x) . = x ∂θ (18.8) Z p˜(x) For mo models dels that guarantee p(x= all x. , we can substitute exp (log(18.8) ) > 0 for p˜(x)) PZ for p˜(x): For models that guarantee P 0 for p(x)∂>exp (logallp˜(xx,))we can substitute exp (log p˜(x)) x ∂θ (18.9) for p˜(x): P Z exp (log p˜(x)) P ∂ (18.9) exp (log Zp˜(x)) ∂θ log p˜(x) = x (18.10) Z exp (log p˜(x)) log p˜(x) ∂ = (18.10) p˜(x) ∂θ Z log p˜(x) = (18.11) P x Z p˜(x) log p˜(x) 609 =P (18.11) Z P
P
CHAPTER 18. CONFRONTING THE PARTITION FUNCTION
X
∂ log p˜(x) (18.12) ∂θ x ∂ = p(x) log p˜(x) (18.12) ∂θ = Ex∼p(x) log p˜(x). (18.13) ∂θ ∂ E log p˜er(xdiscrete ). x, but a similar(18.13) This deriv derivation ation made use = of summation over result ∂ θ ov X x applies using in integration tegration ov over er con contin tin tinuous uous . In the con contin tin tinuous uous version of the This deriv made use offor summation over under discrete , but a similar result deriv derivation, ation, weation use Leibniz’s rule differen differentiation tiation thexintegral sign to obtain applies using continuous x. In the continuous version of the the iden identit tit tity y integration over Z for differenZtiation under the integral sign to obtain derivation, we use Leibniz’s∂rule ∂ p˜(x)dx = p˜(x)dx. (18.14) the identity ∂θ ∂θ ∂ ∂ ∂ p˜(x)dcertain x = regularity p˜(x)dxconditions . (18.14) p˜(x). This identit identity y is only applicable under on p˜ and ∂θ ∂θ ∂θ In measure theoretic terms, the conditions are: (i) p˜ must be a Leb Lebesgue-integrable esgue-integrable p˜(x). This identit y is only applicable under certain and almost ∂ regularity conditions on p function of x for every value of θ ; (ii) ∂θ p˜(x) must exist for all θ˜ and p ˜ In measure theoretic terms, the conditions are: (i) m ust b e a Leb esgue-integrable ∂ Z integrable Zfunction R( x) that bounds ∂θ p˜( x) (i.e. all x; (iii) There must exist an x θ p ˜ ( x θ function of for every v alue of ; (ii) ) must exist for all and almost ∂ | ∂θ p˜(x)| ≤ R(x) for all θ and almost all x). Fortunately ortunately,, most machine learning all x; (iii) There must exist an integrable function R( x) that bounds p˜( x) (i.e. mo models dels of in interest terest ha hav ve these prop properties. erties. R(x) for all θ and almost all x). Fortunately, most machine learning p˜(x) This iden identit tit tity y mo | dels |of≤interest have these properties. 
(18.15) ∇θ log Z = E x∼p(x)∇θ log p˜(x) This identity E metho is the basis for a variet ariety y of Monte Carlo methods ds maximizing log Z = logfor p˜(xapproximately ) (18.15) the lik likeliho eliho elihoo od of mo models dels with intractable partition functions. ∇ Monte Carlo metho ∇ ds for approximately maximizing is the basis for a variety of The Mon Monte te Carlo approach to learning undirected mo models dels provides an in intuitiv tuitiv tuitivee the likelihood of models with intractable partition functions. framew framework ork in whic which h we can think of both the positive phase and the negativ negativee The Mon te Carlo approach to learning undirected mo dels provides an intuitiv phase. In the positive phase, we increase log p˜( x) for x dra drawn wn from the data. Ine framew ork inphase, which can think of both the positive phase and logthe p˜(xnegativ the negative wewe decrease the partition function by decreasing ) drawne phase.the Inmo thedel positive phase, we increase log p˜( x) for x drawn from the data. In from model distribution. the negative phase, we decrease the partition function by decreasing log p˜(x) drawn In the deep learning literature, it is common to parametrize log p˜ in terms of from the model distribution. an energy function (Eq. 16.7). In this case, we can interpret the positiv ositivee phase log p˜ in phase In the deep learning is common to parametrize terms as of as pushing down on the literature, energy of it training examples and the negative an energy (Eq. 16.7 ). In this case,from we can positive in phase pushing upfunction on the energy of samples drawn the interpret mo model, del, asthe illustrated Fig. as pushing down on the energy of training examples and the negative phase as 18.1 18.1.. pushing up on the energy of samples drawn from the model, as illustrated in Fig. 18.1. =
p(x)
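Eq. 18.15 can be checked numerically on any model small enough to enumerate. The sketch below uses a hypothetical log-linear model over three binary variables, for which ∇_θ log p̃(x) = x, and compares a finite-difference estimate of ∇_θ log Z with the model expectation of x:

```python
import itertools
import numpy as np

# Hypothetical model: log p_tilde(x; theta) = theta . x over binary x,
# so grad_theta log p_tilde(x) = x.
theta = np.array([0.3, -0.7, 1.1])
states = np.array(list(itertools.product([0, 1], repeat=3)), dtype=float)

def log_Z(th):
    return np.log(np.sum(np.exp(states @ th)))

# Right-hand side of Eq. 18.15: the model expectation of grad_theta log p_tilde.
p = np.exp(states @ theta - log_Z(theta))
expected_grad = p @ states

# Left-hand side: grad_theta log Z, estimated by central finite differences.
eps = 1e-6
numeric_grad = np.array([
    (log_Z(theta + eps * e) - log_Z(theta - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
print(np.max(np.abs(numeric_grad - expected_grad)))  # should be tiny
```

In an intractable model neither side can be computed exactly; Monte Carlo methods replace the expectation on the right with an average over samples from the model.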
18.2 Stochastic Maximum Likelihood and Contrastive Divergence

The naive way of implementing Eq. 18.15 is to compute it by burning in a set of Markov chains from a random initialization every time the gradient is needed.
When learning is performed using stochastic gradient descent, this means the chains must be burned in once per gradient step. This approach leads to the training procedure presented in Algorithm 18.1. The high cost of burning in the Markov chains in the inner loop makes this procedure computationally infeasible, but this procedure is the starting point that other more practical algorithms aim to approximate.

Algorithm 18.1 A naive MCMC algorithm for maximizing the log-likelihood with an intractable partition function using gradient ascent.

Set ε, the step size, to a small positive number.
Set k, the number of Gibbs steps, high enough to allow burn in. Perhaps 100 to train an RBM on a small image patch.
while not converged do
  Sample a minibatch of m examples {x^(1), ..., x^(m)} from the training set.
  g ← (1/m) Σ_{i=1}^m ∇_θ log p̃(x^(i); θ).
  Initialize a set of m samples {x̃^(1), ..., x̃^(m)} to random values (e.g., from a uniform or normal distribution, or possibly a distribution with marginals matched to the model's marginals).
  for i = 1 to k do
    for j = 1 to m do
      x̃^(j) ← gibbs_update(x̃^(j)).
    end for
  end for
  g ← g − (1/m) Σ_{i=1}^m ∇_θ log p̃(x̃^(i); θ).
  θ ← θ + ε g.
end while

We can view the MCMC approach to maximum likelihood as trying to achieve
balance between two forces, one pushing up on the model distribution where the data occurs, and another pushing down on the model distribution where the model samples occur. Fig. 18.1 illustrates this process. The two forces correspond to maximizing log p̃ and minimizing log Z. Several approximations to the negative phase are possible. Each of these approximations can be understood as making the negative phase computationally cheaper but also making it push down in the wrong locations.

Because the negative phase involves drawing samples from the model's distribution, we can think of it as finding points that the model believes in strongly. Because the negative phase acts to reduce the probability of those points, they are generally considered to represent the model's incorrect beliefs about the world.
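As a concrete and deliberately tiny illustration, the following sketch instantiates Algorithm 18.1 for a hypothetical log-linear model log p̃(x; θ) = θ · x over binary vectors. For this model ∇_θ log p̃(x) = x, and each unit's Gibbs conditional is Bernoulli(σ(θ_i)) independently of the other units, so burn-in is trivially fast; the loop keeps the shape of Algorithm 18.1 anyway. The data and hyperparameters are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# Hypothetical model: log p_tilde(x; theta) = theta . x over binary x, so
# grad_theta log p_tilde(x) = x, and the Gibbs conditional for each unit is
# Bernoulli(sigmoid(theta_i)), independent of the other units.
def gibbs_update(x, theta):
    return (rng.random(x.shape) < sigmoid(theta)).astype(float)

# Synthetic training set with known marginals.
target = np.array([0.9, 0.2, 0.5])
data = (rng.random((1000, 3)) < target).astype(float)

theta = np.zeros(3)
eps, k, m = 0.1, 20, 100
for step in range(500):
    batch = data[rng.integers(len(data), size=m)]
    g = batch.mean(axis=0)                            # positive phase
    x_neg = (rng.random((m, 3)) < 0.5).astype(float)  # random init each step
    for _ in range(k):                                # burn in the chains
        x_neg = gibbs_update(x_neg, theta)
    g -= x_neg.mean(axis=0)                           # negative phase
    theta += eps * g
# Maximum likelihood drives the model marginals toward the data marginals.
print(sigmoid(theta), data.mean(axis=0))
```

The gradient is zero in expectation exactly when the model marginals match the data marginals, which is the balance of forces described above.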
[Two-panel figure: "The positive phase" and "The negative phase," each plotting p_model(x) and p_data(x) against x.]
Figure 18.1: The view of Algorithm 18.1 as having a "positive phase" and "negative phase." (Left) In the positive phase, we sample points from the data distribution, and push up on their unnormalized probability. This means points that are likely in the data get pushed up on more. (Right) In the negative phase, we sample points from the model distribution, and push down on their unnormalized probability. This counteracts the positive phase's tendency to just add a large constant to the unnormalized probability everywhere. When the data distribution and the model distribution are equal, the positive phase has the same chance to push up at a point as the negative phase has to push down. When this occurs, there is no longer any gradient (in expectation) and training must terminate.
They are frequently referred to in the literature as "hallucinations" or "fantasy particles." In fact, the negative phase has been proposed as a possible explanation for dreaming in humans and other animals (Crick and Mitchison, 1983), the idea being that the brain maintains a probabilistic model of the world and follows the gradient of log p̃ while experiencing real events while awake and follows the negative gradient of log p̃ to minimize log Z while sleeping and experiencing events sampled from the current model. This view explains much of the language used to describe algorithms with a positive and negative phase, but it has not been proven to be correct with neuroscientific experiments. In machine learning models, it is usually necessary to use the positive and negative phase simultaneously, rather than in separate time periods of wakefulness and REM sleep. As we will see in Sec. 19.5, other machine learning algorithms draw samples from the model distribution for other purposes and such algorithms could also provide an account for the function of dream sleep.

Given this understanding of the role of the positive and negative phase of learning, we can attempt to design a less expensive alternative to Algorithm 18.1. The main cost of the naive MCMC algorithm is the cost of burning in the Markov
chains from a random initialization at each step. A natural solution is to initialize the Markov chains from a distribution that is very close to the model distribution, so that the burn in operation does not take as many steps.

The contrastive divergence (CD, or CD-k to indicate CD with k Gibbs steps) algorithm initializes the Markov chain at each step with samples from the data distribution (Hinton, 2000, 2010). This approach is presented as Algorithm 18.2. Obtaining samples from the data distribution is free, because they are already available in the data set. Initially, the data distribution is not close to the model distribution, so the negative phase is not very accurate. Fortunately, the positive phase can still accurately increase the model's probability of the data. After the
positive phase has had some time to act, the model distribution is closer to the data distribution, and the negative phase starts to become accurate.

Algorithm 18.2 The contrastive divergence algorithm, using gradient ascent as the optimization procedure.

Set ε, the step size, to a small positive number.
Set k, the number of Gibbs steps, high enough to allow a Markov chain sampling from p(x; θ) to mix when initialized from p_data. Perhaps 1-20 to train an RBM on a small image patch.
while not converged do
  Sample a minibatch of m examples {x^(1), ..., x^(m)} from the training set.
  g ← (1/m) Σ_{i=1}^m ∇_θ log p̃(x^(i); θ).
  for i = 1 to m do
    x̃^(i) ← x^(i).
  end for
  for i = 1 to k do
    for j = 1 to m do
      x̃^(j) ← gibbs_update(x̃^(j)).
    end for
  end for
  g ← g − (1/m) Σ_{i=1}^m ∇_θ log p̃(x̃^(i); θ).
  θ ← θ + ε g.
end while

Of course, CD is still an approximation to the correct negative phase. The
main way that CD qualitatively fails to implement the correct negative phase is that it fails to suppress regions of high probability that are far from actual training examples. These regions that have high probability under the model but low probability under the data generating distribution are called spurious modes.
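For comparison with Algorithm 18.2, here is a minimal CD-1 training loop for a toy binary RBM. The data (two repeated binary patterns), the layer sizes, and the hyperparameters are all invented for the example; the positive phase uses the exact hidden conditionals, and the negative phase takes a single Gibbs step starting from the data, which is exactly the initialization choice that gives rise to the spurious-mode problem described above.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# A minimal binary RBM trained with CD-1 (hypothetical toy setup).
n_v, n_h, m = 6, 4, 50
W = 0.01 * rng.normal(size=(n_v, n_h))
b, c = np.zeros(n_v), np.zeros(n_h)  # visible and hidden biases

# Toy data: two repeated binary patterns.
patterns = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
data = patterns[rng.integers(2, size=500)]

def sample(p):
    return (rng.random(p.shape) < p).astype(float)

eps = 0.05
for step in range(2000):
    v0 = data[rng.integers(len(data), size=m)]
    ph0 = sigmoid(v0 @ W + c)           # positive phase statistics
    h0 = sample(ph0)
    v1 = sample(sigmoid(h0 @ W.T + b))  # one Gibbs step from the data (CD-1)
    ph1 = sigmoid(v1 @ W + c)
    W += eps * (v0.T @ ph0 - v1.T @ ph1) / m
    b += eps * (v0 - v1).mean(axis=0)
    c += eps * (ph0 - ph1).mean(axis=0)

# Reconstructions of the training patterns should be close to the patterns.
h = sigmoid(patterns @ W + c)
recon = sigmoid(h @ W.T + b)
print(np.round(recon, 2))
```

Because the chain runs only one step from the data, spurious modes far from the patterns would never be visited by the negative phase, which is the failure mode illustrated in Fig. 18.2.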
[Figure: p_model(x) and p_data(x) plotted against x, with one mode of p_model labeled "A spurious mode."]
Figure 18.2: An illustration of how the negative phase of contrastive divergence (Algorithm 18.2) can fail to suppress spurious modes. A spurious mode is a mode that is present in the model distribution but absent in the data distribution. Because contrastive divergence initializes its Markov chains from data points and runs the Markov chain for only a few steps, it is unlikely to visit modes in the model that are far from the data points. This means that when sampling from the model, we will sometimes get samples that do not resemble the data. It also means that due to wasting some of its probability mass on these modes, the model will struggle to place high probability mass on the correct modes. For the purpose of visualization, this figure uses a somewhat simplified concept of distance: the spurious mode is far from the correct mode along the number line in ℝ. This corresponds to a Markov chain based on making local moves with a single variable x ∈ ℝ. For most deep probabilistic models, the Markov chains are based on Gibbs sampling and can make non-local moves of individual variables but cannot move all of the variables simultaneously. For these problems, it is usually better to consider the edit distance between modes, rather than the Euclidean distance. However, edit distance in a high dimensional space is difficult to depict in a 2-D plot.
Fig. 18.2 illustrates why this happens. Essentially, it is because modes in the model distribution that are far from the data distribution will not be visited by Markov chains initialized at training points, unless k is very large.

Carreira-Perpiñan and Hinton (2005) showed experimentally that the CD estimator is biased for RBMs and fully visible Boltzmann machines, in that it converges to different points than the maximum likelihood estimator. They argue that because the bias is small, CD could be used as an inexpensive way to initialize a model that could later be fine-tuned via more expensive MCMC methods. Bengio and Delalleau (2009) showed that CD can be interpreted as discarding the smallest terms of the correct MCMC update gradient, which explains the bias.

CD is useful for training shallow models like RBMs. These can in turn be
CHAPTER 18. CONFRONTING THE PARTITION FUNCTION
stacked to initialize deeper models like DBNs or DBMs. However, CD does not provide much help for training deeper models directly. This is because it is difficult to obtain samples of the hidden units given samples of the visible units. Since the hidden units are not included in the data, initializing from training points cannot solve the problem. Even if we initialize the visible units from the data, we will still need to burn in a Markov chain sampling from the distribution over the hidden units conditioned on those visible samples.

The CD algorithm can be thought of as penalizing the model for having a Markov chain that changes the input rapidly when the input comes from the data. This means training with CD somewhat resembles autoencoder training. Even though CD is more biased than some of the other training methods, it can be useful for pretraining shallow models that will later be stacked.
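To make the procedure concrete, here is a minimal NumPy sketch of the CD-k update for a binary RBM. The helper names (`sample_h`, `cd_k_update`) and the specific parametrization are our own illustrative assumptions, not code from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h(v, W, c):
    """Sample hidden units given visible units for a binary RBM."""
    p = sigmoid(v @ W + c)
    return (rng.random(p.shape) < p).astype(float), p

def sample_v(h, W, b):
    """Sample visible units given hidden units."""
    p = sigmoid(h @ W.T + b)
    return (rng.random(p.shape) < p).astype(float), p

def cd_k_update(v_data, W, b, c, k=1, lr=0.01):
    """One CD-k gradient step: the negative chain starts at the data."""
    # Positive phase: hidden probabilities given the data.
    _, ph_data = sample_h(v_data, W, c)
    # Negative phase: k steps of Gibbs sampling, initialized at the data.
    v = v_data
    for _ in range(k):
        h, _ = sample_h(v, W, c)
        v, _ = sample_v(h, W, b)
    _, ph_model = sample_h(v, W, c)
    m = v_data.shape[0]
    W += lr * (v_data.T @ ph_data - v.T @ ph_model) / m
    b += lr * (v_data - v).mean(axis=0)
    c += lr * (ph_data - ph_model).mean(axis=0)
    return W, b, c
```

Because the negative chain is initialized at the data, modes far from the training points are rarely visited when k is small, which is the source of the spurious-mode problem discussed above.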
This is because the earliest models in the stack are encouraged to copy more information up to their latent variables, thereby making it available to the later models. This should be thought of more as an often-exploitable side effect of CD training than as a principled design advantage.

Sutskever and Tieleman (2010) showed that the CD update direction is not the gradient of any function. This allows for situations where CD could cycle forever, but in practice this is not a serious problem.

A different strategy that resolves many of the problems with CD is to initialize the Markov chains at each gradient step with their states from the previous gradient step. This approach was first discovered under the name stochastic maximum likelihood (SML) in the applied mathematics and statistics community (Younes,
1998) and later independently rediscovered under the name persistent contrastive divergence (PCD, or PCD-k to indicate the use of k Gibbs steps per update) in the deep learning community (Tieleman, 2008). See Algorithm 18.3. The basic idea of this approach is that, so long as the steps taken by the stochastic gradient algorithm are small, then the model from the previous step will be similar to the model from the current step. It follows that the samples from the previous model's distribution will be very close to being fair samples from the current model's distribution, so a Markov chain initialized with these samples will not require much time to mix.

Because each Markov chain is continually updated throughout the learning process, rather than restarted at each gradient step, the chains are free to wander far enough to find all of the model's modes.
SML is thus considerably more resistant to forming models with spurious modes than CD is. Moreover, because it is possible to store the state of all of the sampled variables, whether visible or latent, SML provides an initialization point for both the hidden and visible units.
CD is only able to provide an initialization for the visible units, and therefore requires burn-in for deep models. SML is able to train deep models efficiently.

Marlin et al. (2010) compared SML to many of the other criteria presented in this chapter. They found that SML results in the best test set log-likelihood for an RBM, and that if the RBM's hidden units are used as features for an SVM classifier, SML results in the best classification accuracy.

SML is vulnerable to becoming inaccurate if the stochastic gradient algorithm can move the model faster than the Markov chain can mix between steps. This can happen if k is too small or ε is too large. The permissible range of values is unfortunately highly problem-dependent. There is no known way to test formally whether the chain is successfully mixing between steps.
Subjectively, if the learning rate is too high for the number of Gibbs steps, the human operator will be able to observe that there is much more variance in the negative phase samples across gradient steps than across different Markov chains. For example, a model trained on MNIST might sample exclusively 7s on one step. The learning process will then push down strongly on the mode corresponding to 7s, and the model might sample exclusively 9s on the next step.

Algorithm 18.3 The stochastic maximum likelihood / persistent contrastive divergence algorithm using gradient ascent as the optimization procedure.

Set ε, the step size, to a small positive number.
Set k, the number of Gibbs steps, high enough to allow a Markov chain sampling from p(x; θ + εg) to burn in, starting from samples from p(x; θ). Perhaps 1 for an RBM on a small image patch, or 5-50 for a more complicated model like a DBM.
Initialize a set of m samples {x̃^(1), . . . , x̃^(m)} to random values (e.g., from a uniform or normal distribution, or possibly a distribution with marginals matched to the model's marginals).
while not converged do
    Sample a minibatch of m examples {x^(1), . . . , x^(m)} from the training set.
    g ← (1/m) Σ_{i=1}^m ∇_θ log p̃(x^(i); θ).
    for i = 1 to k do
        for j = 1 to m do
            x̃^(j) ← gibbs_update(x̃^(j)).
        end for
    end for
    g ← g − (1/m) Σ_{i=1}^m ∇_θ log p̃(x̃^(i); θ).
    θ ← θ + εg.
end while
Care must be taken when evaluating the samples from a model trained with SML. It is necessary to draw the samples starting from a fresh Markov chain initialized from a random starting point after the model is done training. The samples present in the persistent negative chains used for training have been influenced by several recent versions of the model, and thus can make the model appear to have greater capacity than it actually does.

Berglund and Raiko (2013) performed experiments to examine the bias and variance in the estimate of the gradient provided by CD and SML. CD proves to have lower variance than the estimator based on exact sampling. SML has higher variance. The cause of CD's low variance is its use of the same training points in both the positive and negative phase. If the negative phase is initialized from different training points, the variance rises above that of the estimator based on exact sampling.
All of these methods based on using MCMC to draw samples from the model can in principle be used with almost any variant of MCMC. This means that techniques such as SML can be improved by using any of the enhanced MCMC techniques described in Chapter 17, such as parallel tempering (Desjardins et al., 2010; Cho et al., 2010).

One approach to accelerating mixing during learning relies not on changing the Monte Carlo sampling technology but rather on changing the parametrization of the model and the cost function. Fast PCD or FPCD (Tieleman and Hinton, 2009) involves replacing the parameters θ of a traditional model with an expression

θ = θ^(slow) + θ^(fast).    (18.16)

There are now twice as many parameters as before, and they are added together element-wise to provide the parameters used by the original model definition.
The fast copy of the parameters is trained with a much larger learning rate, allowing it to adapt rapidly in response to the negative phase of learning and push the Markov chain to new territory. This forces the Markov chain to mix rapidly, though this effect only occurs during learning while the fast weights are free to change. Typically one also applies significant weight decay to the fast weights, encouraging them to converge to small values, after only transiently taking on large values long enough to encourage the Markov chain to change modes.
One key benefit of the MCMC-based methods described in this section is that they provide an estimate of the gradient of log Z, and thus we can essentially decompose the problem into the log p̃ contribution and the log Z contribution. We can then use any other method to tackle log p̃(x), and just add our negative phase gradient onto the other method's gradient. In particular, this means that
our positive phase can make use of methods that provide only a lower bound on p̃. Most of the other methods of dealing with log Z presented in this chapter are incompatible with bound-based positive phase methods.
18.3 Pseudolikelihood
Monte Carlo approximations to the partition function and its gradient directly confront the partition function. Other approaches sidestep the issue, by training the model without computing the partition function. Most of these approaches are based on the observation that it is easy to compute ratios of probabilities in an undirected probabilistic model. This is because the partition function appears in both the numerator and the denominator of the ratio and cancels out:

p(x)/p(y) = ((1/Z) p̃(x)) / ((1/Z) p̃(y)) = p̃(x)/p̃(y).    (18.17)

The pseudolikelihood is based on the observation that conditional probabilities take this ratio-based form, and thus can be computed without knowledge of the partition function. Suppose that we partition x into a, b and c, where a contains the variables we want to find the conditional distribution over, b contains the variables we want to condition on, and c contains the variables that are not part of our query.
p(a | b) = p(a, b) / p(b) = p(a, b) / (Σ_{a,c} p(a, b, c)) = p̃(a, b) / (Σ_{a,c} p̃(a, b, c)).    (18.18)

This quantity requires marginalizing out a, which can be a very efficient operation provided that a and c do not contain very many variables. In the extreme case, a can be a single variable and c can be empty, making this operation require only as many evaluations of p̃ as there are values of a single random variable.

Unfortunately, in order to compute the log-likelihood, we need to marginalize out large sets of variables. If there are n variables total, we must marginalize a set of size n − 1. By the chain rule of probability,

log p(x) = log p(x_1) + log p(x_2 | x_1) + · · · + log p(x_n | x_{1:n−1}).    (18.19)

In this case, we have made a maximally small, but c can be as large as x_{2:n}. What if we simply move c into b to reduce the computational cost? This yields the
What pseudolikeliho pseudolikelihoo od (Besag, 1975) ob objectiv jectiv jectivee function, based on predicting the value of if we simply move c into b to reduce the computational cost? This yields the pseudolikelihood (Besag, 1975) ob jective618 function, based on predicting the value of
feature x_i given all of the other features x_{−i}:

Σ_{i=1}^n log p(x_i | x_{−i}).    (18.20)

If each random variable has k different values, this requires only k × n evaluations of p̃ to compute, as opposed to the k^n evaluations needed to compute the partition function.

This may look like an unprincipled hack, but it can be proven that estimation by maximizing the pseudolikelihood is asymptotically consistent (Mase, 1995). Of course, in the case of datasets that do not approach the large sample limit, pseudolikelihood may display different behavior from the maximum likelihood estimator.

It is possible to trade computational complexity for deviation from maximum likelihood behavior by using the generalized pseudolikelihood estimator (Huang and Ogata, 2002). The generalized pseudolikelihood estimator uses m different sets S^(i), i = 1, . . . , m of indices of variables that appear together on the left side of the conditioning bar. In the extreme case of m = 1 and S^(1) = {1, . . . , n} the generalized
pseudolikelihood recovers the log-likelihood. In the extreme case of m = n and S^(i) = {i}, the generalized pseudolikelihood recovers the pseudolikelihood. The generalized pseudolikelihood objective function is given by

Σ_{i=1}^m log p(x_{S^(i)} | x_{−S^(i)}).    (18.21)
The performance of pseudolikelihood-based approaches depends largely on how the model will be used. Pseudolikelihood tends to perform poorly on tasks that require a good model of the full joint p(x), such as density estimation and sampling. However, it can perform better than maximum likelihood for tasks that require only the conditional distributions used during training, such as filling in small amounts of missing values. Generalized pseudolikelihood techniques are especially powerful if the data has regular structure that allows the index sets S to be designed to capture the most important correlations while leaving out groups of variables that have only negligible correlation. For example, in natural images, pixels that are widely separated in space also have weak correlation, so the generalized pseudolikelihood can be applied with each S^(i) set being a small, spatially localized window.
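As a concrete illustration, here is a sketch of the pseudolikelihood objective (Eq. 18.20) for a tiny fully visible Boltzmann machine. Each conditional needs only k = 2 evaluations of p̃ per variable, and Z cancels in the ratio (Eq. 18.17), whereas the exact log-likelihood needs all 2^n terms of the partition function. The model and its parameter values are invented for illustration:

```python
import numpy as np
from itertools import product

# A tiny fully visible Boltzmann machine over x in {0,1}^n, with
# unnormalized log-probability log p~(x) = x @ J @ x / 2 + b @ x
# (J symmetric with zero diagonal; values chosen arbitrarily).
n = 4
rng = np.random.default_rng(0)
J = rng.normal(size=(n, n)); J = (J + J.T) / 2; np.fill_diagonal(J, 0)
b = rng.normal(size=n)

def log_ptilde(x):
    return 0.5 * x @ J @ x + b @ x

def pseudolikelihood(x):
    """Eq. 18.20: sum_i log p(x_i | x_{-i}); each conditional is a
    ratio of unnormalized probabilities, so Z cancels (Eq. 18.17)."""
    total = 0.0
    for i in range(n):
        logps = []
        for v in (0.0, 1.0):           # k = 2 evaluations per variable
            x2 = x.copy(); x2[i] = v
            logps.append(log_ptilde(x2))
        total += logps[int(x[i])] - np.logaddexp(logps[0], logps[1])
    return total

# The exact log-likelihood needs the partition function: k**n = 16 terms.
logZ = np.logaddexp.reduce([log_ptilde(np.array(s, dtype=float))
                            for s in product([0, 1], repeat=n)])
x = np.array([1.0, 0.0, 1.0, 1.0])
```

Here the pseudolikelihood costs 2 × 4 = 8 evaluations of p̃, versus 2^4 = 16 just for Z; in general the two objectives differ, coinciding only asymptotically.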
One weakness of the pseudolikelihood estimator is that it cannot be used with other approximations that provide only a lower bound on p̃(x), such as variational inference, which will be covered in Chapter 19. This is because p̃ appears in the
denominator. A lower bound on the denominator provides only an upper bound on the expression as a whole, and there is no benefit to maximizing an upper bound. This makes it difficult to apply pseudolikelihood approaches to deep models such as deep Boltzmann machines, since variational methods are one of the dominant approaches to approximately marginalizing out the many layers of hidden variables that interact with each other. However, pseudolikelihood is still useful for deep learning, because it can be used to train single layer models, or deep models using approximate inference methods that are not based on lower bounds.

Pseudolikelihood has a much greater cost per gradient step than SML, due to its explicit computation of all of the conditionals. However, generalized pseudolikelihood and similar criteria can still perform well if only one randomly selected
pseudoconditional is computed pof erall example Go Goo odfellow etHow al., ever, 2013bgeneralized ), thereb thereby y bringing lik eliho o d and similar criteria can still p erform w ell if only one randomly selected the computational cost do down wn to match that of SML. conditional is computed per example (Goodfellow et al., 2013b), thereby bringing Though the pseudolik pseudolikelihoo elihoo elihood d estimator do does es not explicitly minimize log Z , it the computational cost down to match that of SML. can still be though thoughtt of as having something resem resembling bling a negative phase. The Though the pseudolik elihoo d estimator do es not minimizealgorithm log Z , it denominators of eac each h conditional distribution resultexplicitly in the learning can still be the though t of as having something resem bling phase. from The suppressing probability of all states that hav have e only onea vnegative ariable differing of each conditional distribution result in the learning algorithm adenominators training example. suppressing the probability of all states that have only one variable differing from See Marlin and de Freitas (2011) for a theoretical analysis of the asymptotic a training example. efficiency of pseudolik pseudolikeliho eliho elihoo od. See Marlin and de Freitas (2011) for a theoretical analysis of the asymptotic efficiency of pseudolikelihood.
18.4 Score Matching and Ratio Matching
Score matching (Hyvärinen, 2005) provides another consistent means of training a model without estimating Z or its derivatives. The name score matching comes from terminology in which the derivatives of a log density with respect to its argument, ∇_x log p(x), are called its score. The strategy used by score matching is to minimize the expected squared difference between the derivatives of the model's log density with respect to the input and the derivatives of the data's log density with respect to the input:

L(x, θ) = (1/2) ||∇_x log p_model(x; θ) − ∇_x log p_data(x)||_2^2    (18.22)
J(θ) = (1/2) E_{p_data(x)} L(x, θ)    (18.23)
θ* = min_θ J(θ)    (18.24)

This objective function avoids the difficulties associated with differentiating the partition function Z because Z is not a function of x and therefore ∇_x Z = 0.
CHAPTER 18. CONFRONTING THE PARTITION FUNCTION
Initially, score matching appears to have a new difficulty: computing the score of the data distribution requires knowledge of the true distribution generating the training data, p_data. Fortunately, minimizing the expected value of L(x, θ) is equivalent to minimizing the expected value of

\tilde{L}(x, \theta) = \sum_{j=1}^{n} \left( \frac{\partial^2}{\partial x_j^2} \log p_{\text{model}}(x; \theta) + \frac{1}{2} \left( \frac{\partial}{\partial x_j} \log p_{\text{model}}(x; \theta) \right)^2 \right)    (18.25)

where n is the dimensionality of x.

Because score matching requires taking derivatives with respect to x, it is not applicable to models of discrete data. However, the latent variables in the model may be discrete.

Like the pseudolikelihood, score matching only works when we are able to evaluate log p̃(x) and its derivatives directly.
It is not compatible with methods that only provide a lower bound on log p̃(x), because score matching requires the derivatives and second derivatives of log p̃(x), and a lower bound conveys no information about its derivatives. This means that score matching cannot be applied to estimating models with complicated interactions between the hidden units, such as sparse coding models or deep Boltzmann machines. While score matching can be used to pretrain the first hidden layer of a larger model, it has not been applied as a pretraining strategy for the deeper layers of a larger model. This is probably because the hidden layers of such models usually contain some discrete variables.
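To make Eq. 18.25 concrete, here is a minimal NumPy sketch (not from the book) that fits a univariate Gaussian by gradient descent on the score matching objective; the model family, data, learning rate and iteration count are assumptions chosen for illustration. Note that no access to the true data score is needed, only samples.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=10_000)

def sm_objective(data, mu, t):
    # Eq. 18.25 for a Gaussian model with log p~(x) = -t (x - mu)^2 / 2,
    # parameterized by t = 1 / sigma^2:
    #   d/dx   log p~ = -t (x - mu)
    #   d2/dx2 log p~ = -t
    return np.mean(-t + 0.5 * t ** 2 * (data - mu) ** 2)

# Plain gradient descent on (mu, t); the partition function never appears.
mu, t = 0.0, 1.0
for _ in range(2000):
    grad_mu = -t ** 2 * np.mean(data - mu)
    grad_t = -1.0 + t * np.mean((data - mu) ** 2)
    mu -= 0.1 * grad_mu
    t -= 0.1 * grad_t

print(mu, 1.0 / t)  # close to the sample mean and variance
```

The stationary point of the objective recovers the sample mean and variance, matching the consistency property of score matching for this simple model family.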
While score matching does not explicitly have a negative phase, it can be viewed as a version of contrastive divergence using a specific kind of Markov chain (Hyvärinen, 2007a). The Markov chain in this case is not Gibbs sampling, but rather a different approach that makes local moves guided by the gradient. Score matching is equivalent to CD with this type of Markov chain when the size of the local moves approaches zero.

Lyu (2009) generalized score matching to the discrete case (but made an error in their derivation that was corrected by Marlin et al. (2010)). Marlin et al. (2010) found that generalized score matching (GSM) does not work in high dimensional discrete spaces where the observed probability of many events is 0.

A more successful approach to extending the basic ideas of score matching to discrete data is ratio matching (Hyvärinen, 2007b). Ratio matching applies specifically to binary data.
Ratio matching consists of minimizing the average over
examples of the following objective function:

L^{(RM)}(x, \theta) = \sum_{j=1}^{n} \left( \frac{1}{1 + \frac{p_{\text{model}}(x; \theta)}{p_{\text{model}}(f(x, j); \theta)}} \right)^2,    (18.26)

where f(x, j) returns x with the bit at position j flipped. Ratio matching avoids the partition function using the same trick as the pseudolikelihood estimator: in a ratio of two probabilities, the partition function cancels out. Marlin et al. (2010) found that ratio matching outperforms SML, pseudolikelihood and GSM in terms of the ability of models trained with ratio matching to denoise test set images.

Like the pseudolikelihood estimator, ratio matching requires n evaluations of p̃ per data point, making its computational cost per update roughly n times higher than that of SML.

As with the pseudolikelihood estimator, ratio matching can be thought of as pushing down on all fantasy states that have only one variable different from a training example.
Since ratio matching applies specifically to binary data, this means that it acts on all fantasy states within Hamming distance 1 of the data.

Ratio matching can also be useful as the basis for dealing with high-dimensional sparse data, such as word count vectors. This kind of data poses a challenge for MCMC-based methods because the data is extremely expensive to represent in dense format, yet the MCMC sampler does not yield sparse values until the model has learned to represent the sparsity in the data distribution. Dauphin and Bengio (2013) overcame this issue by designing an unbiased stochastic approximation to ratio matching. The approximation evaluates only a randomly selected subset of the terms of the objective, and does not require the model to generate complete fantasy samples.
See Marlin and de Freitas (2011) for a theoretical analysis of the asymptotic efficiency of ratio matching.
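As an illustration of Eq. 18.26 (an assumption-laden sketch, not the book's experiment), the following computes the ratio matching objective for a small binary model whose unnormalized log-probability is an Ising-style energy chosen for the example; only ratios of p̃ are needed, so Z cancels exactly as in pseudolikelihood.

```python
import numpy as np

def log_p_tilde(x, W, b):
    # Unnormalized log-probability of a small binary model (an Ising-style
    # energy chosen for illustration): log p~(x) = x^T W x / 2 + b^T x.
    return 0.5 * x @ W @ x + b @ x

def ratio_matching_loss(x, W, b):
    # Eq. 18.26: the ratio p_model(x) / p_model(f(x, j)) equals
    # p~(x) / p~(f(x, j)) because the partition function cancels.
    loss = 0.0
    for j in range(len(x)):
        x_flip = x.copy()
        x_flip[j] = 1 - x_flip[j]
        log_ratio = log_p_tilde(x, W, b) - log_p_tilde(x_flip, W, b)
        loss += 1.0 / (1.0 + np.exp(log_ratio)) ** 2
    return loss

rng = np.random.default_rng(0)
n = 5
W = rng.normal(size=(n, n)); W = (W + W.T) / 2; np.fill_diagonal(W, 0.0)
b = rng.normal(size=n)
x = rng.integers(0, 2, size=n)
print(ratio_matching_loss(x, W, b))  # a scalar between 0 and n
```

Each of the n terms touches one state at Hamming distance 1 from the data point, matching the "fantasy states" interpretation in the text.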
18.5 Denoising Score Matching
In some cases we may wish to regularize score matching, by fitting a distribution

p_{\text{smoothed}}(x) = \int p_{\text{data}}(y) \, q(x \mid y) \, dy    (18.27)

rather than the true p_data. The distribution q(x | y) is a corruption process, usually one that forms x by adding a small amount of noise to y.
Denoising score matching is especially useful because in practice we usually do not have access to the true p_data but rather only an empirical distribution defined by samples from it. Any consistent estimator will, given enough capacity, make p_model into a set of Dirac distributions centered on the training points. Smoothing by q helps to reduce this problem, at the loss of the asymptotic consistency property described in Sec. 5.4.5. Kingma and LeCun (2010) introduced a procedure for performing regularized score matching with the smoothing distribution q being normally distributed noise.

Recall from Sec. 14.5.1 that several autoencoder training algorithms are equivalent to score matching or denoising score matching. These autoencoder training algorithms are therefore a way of overcoming the partition function problem.
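A small NumPy sketch of the corruption process in Eq. 18.27 (the Gaussian choices for p_data and q are assumptions made so that p_smoothed has a known closed form): sampling y from the data and then x from q(x | y) yields samples from the smoothed distribution without ever evaluating the integral.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for p_data; here N(0, 1), chosen purely for the example.
y = rng.normal(size=100_000)

# Gaussian corruption process q(x | y) = N(x; y, sigma^2). Ancestral
# sampling (y from p_data, then x from q) draws from p_smoothed
# of Eq. 18.27 directly.
sigma = 0.5
x = y + sigma * rng.normal(size=y.shape)

# For these Gaussian choices, p_smoothed is N(0, 1 + sigma^2).
print(x.var())  # close to 1.25
```

In the denoising setting, the model's score is fit to this smoothed distribution rather than to an empirical sum of Dirac spikes.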
18.6 Noise-Contrastive Estimation
Most techniques for estimating models with intractable partition functions do not provide an estimate of the partition function. SML and CD estimate only the gradient of the log partition function, rather than the partition function itself. Score matching and pseudolikelihood avoid computing quantities related to the partition function altogether.

Noise-contrastive estimation (NCE) (Gutmann and Hyvarinen, 2010) takes a different strategy. In this approach, the probability distribution estimated by the model is represented explicitly as

\log p_{\text{model}}(x) = \log \tilde{p}_{\text{model}}(x; \theta) + c,    (18.28)

where c is explicitly introduced as an approximation of -\log Z(\theta). Rather than estimating only θ, the noise contrastive estimation procedure treats c as just another parameter and estimates θ and c simultaneously, using the same algorithm for both.
The resulting log p_model(x) thus may not correspond exactly to a valid probability distribution, but will become closer and closer to being valid as the estimate of c improves.¹

Such an approach would not be possible using maximum likelihood as the criterion for the estimator. The maximum likelihood criterion would choose to set c arbitrarily high, rather than setting c to create a valid probability distribution.

¹ NCE is also applicable to problems with a tractable partition function, where there is no need to introduce the extra parameter c. However, it has generated the most interest as a means of estimating models with difficult partition functions.
NCE works by reducing the unsupervised learning problem of estimating p(x) to that of learning a probabilistic binary classifier in which one of the categories corresponds to the data generated by the model. This supervised learning problem is constructed in such a way that maximum likelihood estimation in this supervised learning problem defines an asymptotically consistent estimator of the original problem.

Specifically, we introduce a second distribution, the noise distribution p_noise(x). The noise distribution should be tractable to evaluate and to sample from. We can now construct a model over both x and a new, binary class variable y. In the new joint model, we specify that

p_{\text{joint}}(y = 1) = \frac{1}{2},    (18.29)
p_{\text{joint}}(x \mid y = 1) = p_{\text{model}}(x),    (18.30)

and

p_{\text{joint}}(x \mid y = 0) = p_{\text{noise}}(x).    (18.31)

In other words, y is a switch variable that determines whether we will generate x from the model or from the noise distribution.

We can construct a similar joint model of training data. In this case, the switch variable determines whether we draw x from the data or from the noise distribution. Formally, p_train(y = 1) = \frac{1}{2}, p_train(x | y = 1) = p_data(x), and p_train(x | y = 0) = p_noise(x).

We can now just use standard maximum likelihood learning on the supervised learning problem of fitting p_joint to p_train:

\theta, c = \arg\max_{\theta, c} \mathbb{E}_{x, y \sim p_{\text{train}}} \log p_{\text{joint}}(y \mid x).    (18.32)

The distribution p_joint is essentially a logistic regression model applied to the difference in log probabilities of the model and the noise distribution:

p_{\text{joint}}(y = 1 \mid x) = \frac{p_{\text{model}}(x)}{p_{\text{model}}(x) + p_{\text{noise}}(x)}    (18.33)

= \frac{1}{1 + \frac{p_{\text{noise}}(x)}{p_{\text{model}}(x)}}    (18.34)

= \frac{1}{1 + \exp\left( \log \frac{p_{\text{noise}}(x)}{p_{\text{model}}(x)} \right)}    (18.35)
= \sigma\left( -\log \frac{p_{\text{noise}}(x)}{p_{\text{model}}(x)} \right)    (18.36)

= \sigma\left( \log p_{\text{model}}(x) - \log p_{\text{noise}}(x) \right).    (18.37)

NCE is thus simple to apply so long as log p̃ is easy to back-propagate through, and, as specified above, p_noise is easy to evaluate (in order to evaluate p_joint) and sample from (in order to generate the training data).

NCE is most successful when applied to problems with few random variables, but it can work well even if those random variables can take on a high number of values. For example, it has been successfully applied to modeling the conditional distribution over a word given the context of the word (Mnih and Kavukcuoglu, 2013). Though the word may be drawn from a large vocabulary, there is only one word.

When NCE is applied to problems with many random variables, it becomes less efficient. The logistic regression classifier can reject a noise sample by identifying any one variable whose value is unlikely. This means that learning slows down
greatly after p_model has learned the basic marginal statistics. Imagine learning a model of images of faces, using unstructured Gaussian noise as p_noise. If p_model learns about eyes, it can reject almost all unstructured noise samples without having learned anything about other facial features, such as mouths.

The constraint that p_noise must be easy to evaluate and easy to sample from can be overly restrictive. When p_noise is simple, most samples are likely to be too obviously distinct from the data to force p_model to improve noticeably.

Like score matching and pseudolikelihood, NCE does not work if only a lower bound on p̃ is available.
Such a lower bound could be used to construct a lower bound on p_joint(y = 1 | x), but it can only be used to construct an upper bound on p_joint(y = 0 | x), which appears in half the terms of the NCE objective. Likewise, a lower bound on p_noise is not useful, because it provides only an upper bound on p_joint(y = 1 | x).

When the model distribution is copied to define a new noise distribution before each gradient step, NCE defines a procedure called self-contrastive estimation, whose expected gradient is equivalent to the expected gradient of maximum likelihood (Goodfellow, 2014). The special case of NCE where the noise samples are those generated by the model suggests that maximum likelihood can be interpreted as a procedure that forces a model to constantly learn to distinguish reality from its own evolving beliefs, while noise contrastive estimation achieves
some reduced computational cost by only forcing the model to distinguish reality from a fixed baseline (the noise model).
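To make the NCE mechanics concrete, here is a minimal NumPy sketch (an illustration, not the book's procedure): the model's unnormalized density is held fixed and only the parameter c of Eq. 18.28 is learned, by gradient ascent on the supervised objective of Eq. 18.32 using the classifier of Eq. 18.37. The noise distribution, sample sizes and learning rate are all assumptions chosen for the example; at the optimum, c should approach -log Z.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fixed unnormalized model: p~(x) = exp(-x^2 / 2), so the true value of
# c is -log Z = -log sqrt(2*pi). Only c is learned here to keep the
# sketch short; in full NCE, theta is updated by the same loss.
log_p_tilde = lambda x: -0.5 * x ** 2
true_c = -0.5 * np.log(2.0 * np.pi)

# Noise distribution: N(0, 2^2), tractable to evaluate and to sample.
noise_std = 2.0
log_p_noise = lambda x: (-0.5 * (x / noise_std) ** 2
                         - np.log(noise_std * np.sqrt(2.0 * np.pi)))

x_data = rng.normal(size=20_000)               # examples with y = 1
x_noise = noise_std * rng.normal(size=20_000)  # examples with y = 0

c = 0.0
for _ in range(300):
    # Classifier of Eq. 18.37: sigma(log p_model - log p_noise),
    # with log p_model = log p~ + c (Eq. 18.28).
    a_data = log_p_tilde(x_data) + c - log_p_noise(x_data)
    a_noise = log_p_tilde(x_noise) + c - log_p_noise(x_noise)
    # Gradient ascent on the supervised log-likelihood of Eq. 18.32.
    grad_c = np.mean(1.0 - sigmoid(a_data)) - np.mean(sigmoid(a_noise))
    c += 0.5 * grad_c

print(c, true_c)  # the learned c approaches -log Z
```

The supervised objective is concave in c here, so plain gradient ascent converges; the learned c turns the unnormalized p̃ into an approximately normalized density, as the text describes.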
Using the supervised task of classifying between training samples and generated samples (with the model energy function used in defining the classifier) to provide a gradient on the model was introduced earlier in various forms (Welling et al., 2003b; Bengio, 2009).

Noise contrastive estimation is based on the idea that a good generative model should be able to distinguish data from noise. A closely related idea is that a good generative model should be able to generate samples that no classifier can distinguish from data. This idea yields generative adversarial networks (Sec. 20.10.4).
18.7 Estimating the Partition Function
While much of this chapter is dedicated to describing methods that avoid needing to compute the intractable partition function Z(θ) associated with an undirected graphical model, in this section we discuss several methods for directly estimating the partition function.

Estimating the partition function can be important because we require it if we wish to compute the normalized likelihood of data. This is often important in evaluating the model, monitoring training performance, and comparing models to each other.

For example, imagine we have two models: model M_A defining a probability distribution p_A(x; \theta_A) = \frac{1}{Z_A} \tilde{p}_A(x; \theta_A) and model M_B defining a probability distribution p_B(x; \theta_B) = \frac{1}{Z_B} \tilde{p}_B(x; \theta_B). A common way to compare the models is to evaluate and compare the likelihood that both models assign to an i.i.d. test dataset. Suppose the test set consists of m examples {x^{(1)}, \ldots, x^{(m)}}. If
\prod_i p_A(x^{(i)}; \theta_A) > \prod_i p_B(x^{(i)}; \theta_B),

or equivalently if

\sum_i \log p_A(x^{(i)}; \theta_A) - \sum_i \log p_B(x^{(i)}; \theta_B) > 0,    (18.38)

then we say that M_A is a better model than M_B (or, at least, it is a better model of the test set), in the sense that it has a better test log-likelihood. Unfortunately, testing whether this condition holds requires knowledge of the partition function: Eq. 18.38 seems to require evaluating the log probability that the model assigns to each point, which in turn requires evaluating the partition function. We can simplify the situation slightly by re-arranging Eq. 18.38 into a
form where we need to know only the ratio of the two models' partition functions:

\sum_i \log p_A(x^{(i)}; \theta_A) - \sum_i \log p_B(x^{(i)}; \theta_B) = \sum_i \log \frac{\tilde{p}_A(x^{(i)}; \theta_A)}{\tilde{p}_B(x^{(i)}; \theta_B)} - m \log \frac{Z(\theta_A)}{Z(\theta_B)}.    (18.39)

We can thus determine whether M_A is a better model than M_B without knowing the partition function of either model but only their ratio. As we will see shortly, we can estimate this ratio using importance sampling, provided that the two models are similar.

If, however, we wanted to compute the actual probability of the test data under either M_A or M_B, we would need to compute the actual value of the partition functions. That said, if we knew the ratio of two partition functions, r = \frac{Z(\theta_B)}{Z(\theta_A)}, and we knew the actual value of just one of the two, say Z(\theta_A), we could compute the value of the other:

Z(\theta_B) = r Z(\theta_A) = \frac{Z(\theta_B)}{Z(\theta_A)} Z(\theta_A).    (18.40)

A simple way to estimate the partition function is to use a Monte Carlo method such as simple importance sampling.
We present the approach in terms of continuous variables using integrals, but it can be readily applied to discrete variables by replacing the integrals with summation. We use a proposal distribution p_0(x) = \frac{1}{Z_0} \tilde{p}_0(x) which supports tractable sampling and tractable evaluation of both the partition function Z_0 and the unnormalized distribution \tilde{p}_0(x).

Z_1 = \int \tilde{p}_1(x) \, dx    (18.41)

= \int \frac{p_0(x)}{p_0(x)} \tilde{p}_1(x) \, dx    (18.42)

= Z_0 \int p_0(x) \frac{\tilde{p}_1(x)}{\tilde{p}_0(x)} \, dx    (18.43)

\hat{Z}_1 = \frac{Z_0}{K} \sum_{k=1}^{K} \frac{\tilde{p}_1(x^{(k)})}{\tilde{p}_0(x^{(k)})} \quad \text{s.t.} \; x^{(k)} \sim p_0    (18.44)

In the last line, we make a Monte Carlo estimator, \hat{Z}_1, of the integral using samples drawn from p_0(x) and then weight each sample with the ratio of the unnormalized \tilde{p}_1 and the proposal p_0.

We see also that this approach allows us to estimate the ratio between the
CHAPTER 18. CONFRONTING THE PARTITION FUNCTION
partition functions as

$$\frac{1}{K} \sum_{k=1}^{K} \frac{\tilde p_1(x^{(k)})}{\tilde p_0(x^{(k)})} \quad \text{s.t.} \; x^{(k)} \sim p_0. \tag{18.45}$$

This value can then be used directly to compare two models as described in Eq. 18.39.

If the distribution $p_0$ is close to $p_1$, Eq. 18.44 can be an effective way of estimating the partition function (Minka, 2005). Unfortunately, most of the time $p_1$ is both complicated (usually multimodal) and defined over a high dimensional space. It is difficult to find a tractable $p_0$ that is simple enough to evaluate while still being close enough to $p_1$ to result in a high quality approximation. If $p_0$ and $p_1$ are not close, most samples from $p_0$ will have low probability under $p_1$ and therefore make (relatively) negligible contribution to the sum in Eq. 18.44.

Having few samples with significant weights in this sum will result in an estimator that is of poor quality due to high variance. This can be understood quantitatively through an estimate of the variance of our estimate $\hat Z_1$:
$$\widehat{\mathrm{Var}}\left(\hat Z_1\right) = \frac{Z_0}{K^2} \sum_{k=1}^{K} \left( \frac{\tilde p_1(x^{(k)})}{\tilde p_0(x^{(k)})} - \hat Z_1 \right)^2 \tag{18.46}$$

This quantity is largest when there is significant deviation in the values of the importance weights $\frac{\tilde p_1(x^{(k)})}{\tilde p_0(x^{(k)})}$.

We now turn to two related strategies developed to cope with the challenging task of estimating partition functions for complex distributions over high-dimensional spaces: annealed importance sampling and bridge sampling. Both start with the simple importance sampling strategy introduced above, and both attempt to overcome the problem of the proposal $p_0$ being too far from $p_1$ by introducing intermediate distributions that attempt to bridge the gap between $p_0$ and $p_1$.
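The behavior described above can be illustrated numerically. The sketch below is a toy 1-D Gaussian example of our own (not from the text): it computes the estimator of Eq. 18.44 for a well-matched and for a badly mismatched proposal, along with a normalized effective-sample-size diagnostic (an illustrative helper, not from the text) that collapses toward zero when a few samples dominate the weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy example: the "intractable" target p1_tilde is an unnormalized Gaussian
# with std 0.8, so the true partition function is Z1 = 0.8 * sqrt(2*pi).
# The proposal p0 is a unit-variance normal centered at `shift`, with
# known Z0 = sqrt(2*pi).
def p1_tilde(x):
    return np.exp(-0.5 * (x / 0.8) ** 2)

def estimate(shift, K=100_000):
    x = shift + rng.standard_normal(K)           # x^(k) ~ p0
    p0_tilde = np.exp(-0.5 * (x - shift) ** 2)   # unnormalized proposal
    w = p1_tilde(x) / p0_tilde                   # importance weights
    Z0 = np.sqrt(2.0 * np.pi)
    Z1_hat = (Z0 / K) * np.sum(w)                # Eq. 18.44
    # Fraction of samples carrying significant weight (in (0, 1]).
    ess = np.sum(w) ** 2 / (K * np.sum(w ** 2))
    return Z1_hat, ess

true_Z1 = 0.8 * np.sqrt(2.0 * np.pi)
good_Z1, good_ess = estimate(shift=0.0)   # proposal overlaps the target
bad_Z1, bad_ess = estimate(shift=6.0)     # proposal far from the target
```

With the overlapping proposal the estimate is accurate and nearly all samples contribute; with the shifted proposal the weights degenerate onto a handful of samples, which is exactly the high-variance regime that Eq. 18.46 quantifies.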
18.7.1 Annealed Importance Sampling
In situations where $D_{\mathrm{KL}}(p_0 \| p_1)$ is large (i.e., where there is little overlap between $p_0$ and $p_1$), a strategy called annealed importance sampling (AIS) attempts to bridge the gap by introducing intermediate distributions (Jarzynski, 1997; Neal, 2001). Consider a sequence of distributions $p_{\eta_0}, \ldots, p_{\eta_n}$, with $0 = \eta_0 < \eta_1 < \cdots < \eta_{n-1} < \eta_n = 1$, so that the first and last distributions in the sequence are $p_0$ and $p_1$ respectively.
This approach allows us to estimate the partition function of a multimodal distribution defined over a high-dimensional space (such as the distribution defined by a trained RBM). We begin with a simpler model with a known partition function (such as an RBM with zeroes for weights) and estimate the ratio between the two models' partition functions. The estimate of this ratio is based on the estimates of the ratios of a sequence of many similar distributions, such as the sequence of RBMs with weights interpolating between zero and the learned weights.

We can now write the ratio $\frac{Z_1}{Z_0}$ as
$$\frac{Z_1}{Z_0} = \frac{Z_1}{Z_0} \frac{Z_{\eta_1}}{Z_{\eta_1}} \cdots \frac{Z_{\eta_{n-1}}}{Z_{\eta_{n-1}}} \tag{18.47}$$
$$= \frac{Z_{\eta_1}}{Z_0} \frac{Z_{\eta_2}}{Z_{\eta_1}} \cdots \frac{Z_{\eta_{n-1}}}{Z_{\eta_{n-2}}} \frac{Z_1}{Z_{\eta_{n-1}}} \tag{18.48}$$
$$= \prod_{j=0}^{n-1} \frac{Z_{\eta_{j+1}}}{Z_{\eta_j}}. \tag{18.49}$$

Provided the distributions $p_{\eta_j}$ and $p_{\eta_{j+1}}$, for all $0 \le j \le n-1$, are sufficiently
close, we can reliably estimate each of the factors $\frac{Z_{\eta_{j+1}}}{Z_{\eta_j}}$ using simple importance sampling and then use these to obtain an estimate of $\frac{Z_1}{Z_0}$.

Where do these intermediate distributions come from? Just as the original proposal distribution $p_0$ is a design choice, so is the sequence of distributions $p_{\eta_1} \ldots p_{\eta_{n-1}}$. That is, it can be specifically constructed to suit the problem domain. One general-purpose and popular choice for the intermediate distributions is to use the weighted geometric average of the target distribution $p_1$ and the starting proposal distribution (for which the partition function is known) $p_0$:

$$p_{\eta_j} \propto p_1^{\eta_j} \, p_0^{1-\eta_j} \tag{18.50}$$
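To make Eq. 18.50 concrete, the sketch below interpolates between two hypothetical unnormalized 1-D Gaussians (toy endpoints of our own, not from the text). Because a geometric average of Gaussians is again Gaussian, the mode of $p_{\eta}$ slides smoothly from the mode of $p_0$ to the mode of $p_1$ as $\eta$ grows from 0 to 1.

```python
import numpy as np

# Hypothetical unnormalized endpoints: standard normal (mode 0) and a
# shifted normal (mode 3).
def p0_tilde(x):
    return np.exp(-0.5 * x ** 2)

def p1_tilde(x):
    return np.exp(-0.5 * (x - 3.0) ** 2)

def p_eta_tilde(x, eta):
    # Eq. 18.50: weighted geometric average of the two endpoints.
    return p1_tilde(x) ** eta * p0_tilde(x) ** (1.0 - eta)

# Locate the mode of p_eta on a fine grid for a few values of eta.
grid = np.linspace(-2.0, 5.0, 7001)
modes = [grid[np.argmax(p_eta_tilde(grid, eta))] for eta in (0.0, 0.5, 1.0)]
```

At $\eta = 0$ the mode sits at 0, at $\eta = 1$ at 3, and at $\eta = 0.5$ exactly halfway, illustrating how the sequence of intermediate distributions gradually morphs the proposal into the target.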
In order to sample from these intermediate distributions, we define a series of Markov chain transition functions $T_{\eta_j}(x' \mid x)$ that define the conditional probability distribution of transitioning to $x'$ given we are currently at $x$. The transition operator $T_{\eta_j}(x' \mid x)$ is defined to leave $p_{\eta_j}(x)$ invariant:

$$p_{\eta_j}(x) = \int p_{\eta_j}(x') \, T_{\eta_j}(x \mid x') \, dx' \tag{18.51}$$

These transitions may be constructed as any Markov chain Monte Carlo method (e.g., Metropolis-Hastings, Gibbs), including methods involving multiple passes through all of the random variables or other kinds of iterations.
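The invariance condition of Eq. 18.51 can be checked numerically in the discrete case, where the integral becomes a sum. The sketch below (a hypothetical two-state example of our own) builds a Metropolis-Hastings transition matrix for a target distribution and verifies that the target is left invariant.

```python
import numpy as np

# Hypothetical two-state target distribution.
p = np.array([0.3, 0.7])

# Metropolis-Hastings with a symmetric "flip state" proposal:
# accept a move i -> j with probability min(1, p[j] / p[i]).
T = np.zeros((2, 2))
for i in range(2):
    j = 1 - i
    accept = min(1.0, p[j] / p[i])
    T[i, j] = accept          # probability of moving to the other state
    T[i, i] = 1.0 - accept    # probability of staying put

# Discrete analogue of Eq. 18.51: p(x) = sum_{x'} p(x') T(x | x').
stationary = p @ T
```

Because Metropolis-Hastings satisfies detailed balance, `stationary` recovers `p` exactly, confirming that the constructed transition operator leaves the target distribution invariant.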
The AIS sampling strategy is then to generate samples from $p_0$ and then use the transition operators to sequentially generate samples from the intermediate distributions until we arrive at samples from the target distribution $p_1$:

• for $k = 1 \ldots K$
  – Sample $x_{\eta_1}^{(k)} \sim p_0(x)$
  – Sample $x_{\eta_2}^{(k)} \sim T_{\eta_1}(x_{\eta_2}^{(k)} \mid x_{\eta_1}^{(k)})$
  – $\ldots$
  – Sample $x_{\eta_{n-1}}^{(k)} \sim T_{\eta_{n-2}}(x_{\eta_{n-1}}^{(k)} \mid x_{\eta_{n-2}}^{(k)})$
  – Sample $x_{\eta_n}^{(k)} \sim T_{\eta_{n-1}}(x_{\eta_n}^{(k)} \mid x_{\eta_{n-1}}^{(k)})$
• end
For sample $k$, we can derive the importance weight by chaining together the importance weights for the jumps between the intermediate distributions given in Eq. 18.49:

$$w^{(k)} = \frac{\tilde p_{\eta_1}(x_{\eta_1}^{(k)})}{\tilde p_{\eta_0}(x_{\eta_1}^{(k)})} \, \frac{\tilde p_{\eta_2}(x_{\eta_2}^{(k)})}{\tilde p_{\eta_1}(x_{\eta_2}^{(k)})} \cdots \frac{\tilde p_{1}(x_{\eta_n}^{(k)})}{\tilde p_{\eta_{n-1}}(x_{\eta_n}^{(k)})}. \tag{18.52}$$

To avoid computational issues such as overflow, it is probably best to do the computation in log space:

$$\log w^{(k)} = \log \tilde p_{\eta_1}\left(x_{\eta_1}^{(k)}\right) - \log \tilde p_{\eta_0}\left(x_{\eta_1}^{(k)}\right) + \ldots \tag{18.53}$$

With the sampling procedure thus defined and the importance weights given in Eq. 18.52, the estimate of the ratio of partition functions is given by:

$$\frac{Z_1}{Z_0} \approx \frac{1}{K} \sum_{k=1}^{K} w^{(k)} \tag{18.54}$$

In order to verify that this procedure defines a valid importance sampling scheme, we can show (Neal, 2001) that the AIS procedure corresponds to simple importance sampling on an extended state space, with points sampled over the product space $[x_{\eta_1}, \ldots, x_{\eta_{n-1}}, x_1]$. To do this, we define the distribution over the extended space as:
$$\tilde p(x_{\eta_1}, \ldots, x_{\eta_{n-1}}, x_1) \tag{18.55}$$
$$= \tilde p_1(x_1) \, \tilde T_{\eta_{n-1}}(x_{\eta_{n-1}} \mid x_1) \, \tilde T_{\eta_{n-2}}(x_{\eta_{n-2}} \mid x_{\eta_{n-1}}) \ldots \tilde T_{\eta_1}(x_{\eta_1} \mid x_{\eta_2}), \tag{18.56}$$
where $\tilde T_a$ is the reverse of the transition operator defined by $T_a$ (via an application of Bayes' rule):

$$\tilde T_a(x' \mid x) = \frac{p_a(x')}{p_a(x)} T_a(x \mid x') = \frac{\tilde p_a(x')}{\tilde p_a(x)} T_a(x \mid x'). \tag{18.57}$$

Plugging the above into the expression for the joint distribution on the extended state space given in Eq. 18.56, we get:

$$\tilde p(x_{\eta_1}, \ldots, x_{\eta_{n-1}}, x_1) \tag{18.58}$$
$$= \tilde p_1(x_1) \, \frac{\tilde p_{\eta_{n-1}}(x_{\eta_{n-1}})}{\tilde p_{\eta_{n-1}}(x_1)} T_{\eta_{n-1}}(x_1 \mid x_{\eta_{n-1}}) \prod_{i=1}^{n-2} \frac{\tilde p_{\eta_i}(x_{\eta_i})}{\tilde p_{\eta_i}(x_{\eta_{i+1}})} T_{\eta_i}(x_{\eta_{i+1}} \mid x_{\eta_i}) \tag{18.59}$$
$$= \frac{\tilde p_1(x_1)}{\tilde p_{\eta_{n-1}}(x_1)} T_{\eta_{n-1}}(x_1 \mid x_{\eta_{n-1}}) \, \tilde p_{\eta_1}(x_{\eta_1}) \prod_{i=1}^{n-2} \frac{\tilde p_{\eta_{i+1}}(x_{\eta_{i+1}})}{\tilde p_{\eta_i}(x_{\eta_{i+1}})} T_{\eta_i}(x_{\eta_{i+1}} \mid x_{\eta_i}). \tag{18.60}$$
We now have a means of generating samples from the joint proposal distribution $q$ over the extended space via the sampling scheme given above, with the joint distribution given by:

$$q(x_{\eta_1}, \ldots, x_{\eta_{n-1}}, x_1) = p_0(x_{\eta_1}) \, T_{\eta_1}(x_{\eta_2} \mid x_{\eta_1}) \ldots T_{\eta_{n-1}}(x_1 \mid x_{\eta_{n-1}}). \tag{18.61}$$

We have a joint distribution on the extended space given by Eq. 18.60. Taking $q(x_{\eta_1}, \ldots, x_{\eta_{n-1}}, x_1)$ as the proposal distribution on the extended state space from which we will draw samples, it remains to determine the importance weights:

$$w^{(k)} = \frac{\tilde p(x_{\eta_1}^{(k)}, \ldots, x_{\eta_{n-1}}^{(k)}, x_1^{(k)})}{q(x_{\eta_1}^{(k)}, \ldots, x_{\eta_{n-1}}^{(k)}, x_1^{(k)})} = \frac{\tilde p_1(x_1^{(k)})}{\tilde p_{\eta_{n-1}}(x_1^{(k)})} \cdots \frac{\tilde p_{\eta_2}(x_{\eta_2}^{(k)})}{\tilde p_{\eta_1}(x_{\eta_2}^{(k)})} \, \frac{\tilde p_{\eta_1}(x_{\eta_1}^{(k)})}{\tilde p_0(x_{\eta_1}^{(k)})}. \tag{18.62}$$

These weights are the same as proposed for AIS. Thus we can interpret AIS as simple importance sampling applied to an extended state space, and its validity follows immediately from the validity of importance sampling.
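Putting the pieces together, the following sketch runs the full AIS procedure on a pair of hypothetical 1-D Gaussians (a toy example of our own, not from the text): geometric intermediate distributions as in Eq. 18.50, Metropolis-Hastings transitions that leave each $p_{\eta_j}$ invariant, log-space weight accumulation as in Eq. 18.53, and the final estimate of Eq. 18.54. The true $Z_1 = 2\sqrt{2\pi}$ is known here, so the estimate can be sanity-checked.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy endpoints: p0 is a standard normal with known Z0 = sqrt(2*pi);
# p1_tilde is an unnormalized Gaussian with std 2, so Z1 = 2 * sqrt(2*pi).
def log_p0_tilde(x):
    return -0.5 * x ** 2

def log_p1_tilde(x):
    return -0.5 * (x / 2.0) ** 2

def log_p_eta(x, eta):
    # Eq. 18.50 in log space: weighted geometric average of the endpoints.
    return eta * log_p1_tilde(x) + (1.0 - eta) * log_p0_tilde(x)

def mh_step(x, eta, step=1.0):
    # One Metropolis-Hastings transition leaving p_eta invariant (Eq. 18.51).
    prop = x + step * rng.standard_normal(x.shape)
    accept = np.log(rng.random(x.shape)) < log_p_eta(prop, eta) - log_p_eta(x, eta)
    return np.where(accept, prop, x)

K, n = 2000, 100                      # parallel chains, intermediate steps
etas = np.linspace(0.0, 1.0, n + 1)
x = rng.standard_normal(K)            # x^(k) ~ p0
log_w = np.zeros(K)
for j in range(n):
    # Accumulate log importance weights (Eq. 18.53), then move each chain
    # with a transition targeting the next intermediate distribution.
    log_w += log_p_eta(x, etas[j + 1]) - log_p_eta(x, etas[j])
    x = mh_step(x, etas[j + 1])

Z0 = np.sqrt(2.0 * np.pi)
Z1_hat = Z0 * np.mean(np.exp(log_w))  # Eq. 18.54, scaled by the known Z0
```

Because the endpoints here overlap reasonably well and the schedule is fine, a modest number of chains already gives an estimate close to the analytic value; for an RBM the same loop structure applies, with the MH step replaced by Gibbs sampling on the model's variables.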
Annealed importance sampling (AIS) was first discovered by Jarzynski (1997) and then again, independently, by Neal (2001). It is currently the most common way of estimating the partition function for undirected probabilistic models. The reasons for this may have more to do with the publication of an influential paper (Salakhutdinov and Murray, 2008) describing its application to estimating the partition function of restricted Boltzmann machines and deep belief networks than
with any inherent advantage the method has over the other method described below. A discussion of the properties of the AIS estimator (e.g., its variance and efficiency) can be found in Neal (2001).

18.7.2 Bridge Sampling
Bridge sampling (Bennett, 1976) is another method that, like AIS, addresses the shortcomings of importance sampling. Rather than chaining together a series of intermediate distributions, bridge sampling relies on a single distribution $p_*$, known as the bridge, to interpolate between a distribution with known partition function, $p_0$, and a distribution $p_1$ for which we are trying to estimate the partition function $Z_1$.

Bridge sampling estimates the ratio $Z_1/Z_0$ as the ratio of the expected importance weights between $\tilde p_0$ and $\tilde p_*$ and between $\tilde p_1$ and $\tilde p_*$:

$$\frac{Z_1}{Z_0} \approx \sum_{k=1}^{K} \frac{\tilde p_*(x_0^{(k)})}{\tilde p_0(x_0^{(k)})} \Bigg/ \sum_{k=1}^{K} \frac{\tilde p_*(x_1^{(k)})}{\tilde p_1(x_1^{(k)})} \tag{18.63}$$

If the bridge distribution $p_*$ is chosen carefully to have a large overlap of support
with both $p_0$ and $p_1$, then bridge sampling can allow the distance between the two distributions (or, more formally, $D_{\mathrm{KL}}(p_0 \| p_1)$) to be much larger than with standard importance sampling.

It can be shown that the optimal bridging distribution is given by $p_*^{(\mathrm{opt})}(x) \propto \frac{\tilde p_0(x) \, \tilde p_1(x)}{r \, \tilde p_0(x) + \tilde p_1(x)}$, where $r = Z_1/Z_0$. At first, this appears to be an unworkable solution, as it would seem to require the very quantity we are trying to estimate, $Z_1/Z_0$. However, it is possible to start with a coarse estimate of $r$ and use the resulting bridge distribution to refine our estimate iteratively (Neal, 2005). That is, we iteratively re-estimate the ratio and use each iteration to update the value of $r$.

Linked importance sampling  Both AIS and bridge sampling have their advantages. If $D_{\mathrm{KL}}(p_0 \| p_1)$ is not too large (because $p_0$ and $p_1$ are sufficiently close), bridge sampling can be a more effective means of estimating the ratio of partition functions than AIS.
If, however, the two distributions are too far apart for a single distribution $p_*$ to bridge the gap, then one can at least use AIS, with its potentially many intermediate distributions, to span the distance between $p_0$ and $p_1$. Neal (2005) showed how his linked importance sampling method leveraged the power of the bridge sampling strategy to bridge the intermediate distributions used in AIS to significantly improve the overall partition function estimates.
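Returning to the basic estimator of Eq. 18.63: the sketch below applies it to hypothetical 1-D Gaussian endpoints (a toy example of our own, not from the text), where both $p_0$ and $p_1$ can be sampled exactly and the true ratio $Z_1/Z_0 = 1.5$ is known. For simplicity it uses a geometric bridge rather than the optimal bridge discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy endpoints: p0 = N(0, 1) and p1 = N(0, 1.5^2), both sampled exactly.
# With these unnormalized densities, Z0 = sqrt(2*pi) and Z1 = 1.5*sqrt(2*pi),
# so the true ratio Z1/Z0 is 1.5.
def p0_tilde(x):
    return np.exp(-0.5 * x ** 2)

def p1_tilde(x):
    return np.exp(-0.5 * (x / 1.5) ** 2)

def p_star_tilde(x):
    # Simple geometric bridge between the endpoints; it overlaps both,
    # which is the key requirement discussed in the text.
    return np.sqrt(p0_tilde(x) * p1_tilde(x))

K = 100_000
x0 = rng.standard_normal(K)            # x_0^(k) ~ p0
x1 = 1.5 * rng.standard_normal(K)      # x_1^(k) ~ p1

# Eq. 18.63: ratio of the two expected importance weights.
ratio_hat = np.mean(p_star_tilde(x0) / p0_tilde(x0)) \
          / np.mean(p_star_tilde(x1) / p1_tilde(x1))
```

Each average targets $Z_*/Z_0$ and $Z_*/Z_1$ respectively, so the unknown bridge normalizer $Z_*$ cancels in the ratio; only the overlap of $p_*$ with each endpoint matters for the variance.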
Estimating the partition function while training  While AIS has become accepted as the standard method for estimating the partition function for many undirected models, it is sufficiently computationally intensive that it remains infeasible to use during training. However, alternative strategies have been explored to maintain an estimate of the partition function throughout training.

Using a combination of bridge sampling, short-chain AIS and parallel tempering, Desjardins et al. (2011) devised a scheme to track the partition function of an RBM throughout the training process. The strategy is based on the maintenance of independent estimates of the partition functions of the RBM at every temperature operating in the parallel tempering scheme. The authors combined bridge sampling estimates of the ratios of partition functions of neighboring chains (i.e. from
from estimate of the partition functions at every iteration of learning. parallel tempering) with AIS estimates across time to come up with a low variance The to tools ols describ described ed in this chapter provide man many y different wa ways ys of overcoming estimate of the partition functions at every iteration of learning. the problem of intractable partition functions, but there can be several other The tools in this and chapter manmo y different ways ofamong overcoming difficulties in inv vdescrib olv olved ed inedtraining usingprovide generative models. dels. Foremost these the problem of intractable partition functions, but there can b e several other is the problem of intractable inference, which we confront next. difficulties involved in training and using generative models. Foremost among these is the problem of intractable inference, which we confront next.
Chapter 19
Approximate Inference

Many probabilistic models are difficult to train because it is difficult to perform inference in them. In the context of deep learning, we usually have a set of visible variables $v$ and a set of latent variables $h$. The challenge of inference usually refers to the difficult problem of computing $p(h \mid v)$ or taking expectations with respect to it. Such operations are often necessary for tasks like maximum likelihood learning.

Many simple graphical models with only one hidden layer, such as restricted Boltzmann machines and probabilistic PCA, are defined in a way that makes inference operations like computing $p(h \mid v)$, or taking expectations with respect to it, simple. Unfortunately, most graphical models with multiple layers of hidden variables have intractable posterior distributions. Exact inference requires an exponential amount of time in these models. Even some models with only a single layer, such as sparse coding, have this problem.

In this chapter, we introduce several of the techniques for confronting these intractable inference problems. Later, in Chapter 20, we will describe how to use these techniques to train probabilistic models that would otherwise be intractable, such as deep belief networks and deep Boltzmann machines.

Intractable inference problems in deep learning usually arise from interactions between latent variables in a structured graphical model. See Fig. 19.1 for some examples. These interactions may be due to direct interactions in undirected models or "explaining away" interactions between mutual ancestors of the same visible unit in directed models.
CHAPTER 19. APPROXIMATE INFERENCE
Figure 19.1: Intractable inference problems in deep learning are usually the result of interactions between latent variables in a structured graphical model. These can be due to edges directly connecting one latent variable to another, or due to longer paths that are activated when the child of a V-structure is observed. (Left) A semi-restricted Boltzmann machine (Osindero and Hinton, 2008) with connections between hidden units. These direct connections between latent variables make the posterior distribution intractable due to large cliques of latent variables. (Center) A deep Boltzmann machine, organized into layers of variables without intra-layer connections, still has an intractable posterior distribution due to the connections between layers. (Right) This directed model has interactions between latent variables when the visible variables are observed, because every two latent variables are co-parents. Some probabilistic models are able to provide tractable inference over the latent variables despite having one of the graph structures depicted above. This is possible if the conditional probability distributions are chosen to introduce additional independences beyond those described by the graph. For example, probabilistic PCA has the graph structure shown on the right, yet still has simple inference due to special properties of the specific conditional distributions it uses (linear-Gaussian conditionals with mutually orthogonal basis vectors).
19.1 Inference as Optimization
Many approaches to confronting the problem of difficult inference make use of the observation that exact inference can be described as an optimization problem. Approximate inference algorithms may then be derived by approximating the underlying optimization problem.

To construct the optimization problem, assume we have a probabilistic model consisting of observed variables $v$ and latent variables $h$. We would like to compute the log probability of the observed data, $\log p(v; \theta)$. Sometimes it is too difficult to compute $\log p(v; \theta)$ if it is costly to marginalize out $h$. Instead, we can compute a lower bound $\mathcal{L}(v, \theta, q)$ on $\log p(v; \theta)$. This bound is called the evidence lower bound (ELBO).
L(v, θ, q) = log p(v; θ) − D_KL(q(h | v) ‖ p(h | v; θ))        (19.1)

where q is an arbitrary probability distribution over h.

Because the difference between log p(v) and L(v, θ, q) is given by the KL divergence, and because the KL divergence is always non-negative, we can see that L always has at most the same value as the desired log probability. The two are equal if and only if q is the same distribution as p(h | v).

Surprisingly, L can be considerably easier to compute for some distributions q. Simple algebra shows that we can rearrange L into a much more convenient form:

L(v, θ, q) = log p(v; θ) − D_KL(q(h | v) ‖ p(h | v; θ))        (19.2)
           = log p(v; θ) − E_{h∼q} log [q(h | v) / p(h | v)]        (19.3)
           = log p(v; θ) − E_{h∼q} log [q(h | v) / (p(h, v; θ) / p(v; θ))]        (19.4)
           = log p(v; θ) − E_{h∼q} [log q(h | v) − log p(h, v; θ) + log p(v; θ)]        (19.5)
           = −E_{h∼q} [log q(h | v) − log p(h, v; θ)].        (19.6)

This yields the more canonical definition of the evidence lower bound,

L(v, θ, q) = E_{h∼q} [log p(h, v)] + H(q).        (19.7)

For an appropriate choice of q, L is tractable to compute. For any choice of q, L provides a lower bound on the likelihood. For q(h | v) that are better
approximations of p(h | v), the lower bound L will be tighter, in other words, closer to log p(v). When q(h | v) = p(h | v), the approximation is perfect, and L(v, θ, q) = log p(v; θ).

We can thus think of inference as the procedure for finding the q that maximizes L. Exact inference maximizes L perfectly by searching over a family of functions q that includes p(h | v). Throughout this chapter, we will show how to derive different forms of approximate inference by using approximate optimization to find q. We can make the optimization procedure less expensive but approximate by restricting the family of distributions q the optimization is allowed to search over or by using an imperfect optimization procedure that may not completely maximize L but merely increase it by a significant amount.

No matter what choice of q we use, L is a lower bound. We can get tighter or looser bounds that are cheaper or more expensive to compute depending on how we choose to approach this optimization problem. We can obtain a poorly matched q but reduce the computational cost by using an imperfect optimization procedure, or by using a perfect optimization procedure over a restricted family of q distributions.
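The relationship between the ELBO and log p(v; θ) can be checked numerically on a toy model. The sketch below evaluates Eq. 19.7 for a single binary latent variable, using an arbitrary made-up joint table `p_joint` standing in for p(h, v), and confirms that L never exceeds log p(v) and touches it exactly when q equals the posterior:

```python
import math

# Hypothetical toy joint p(h, v) over one binary latent h and one binary
# visible v; the numbers are arbitrary and chosen only for illustration.
p_joint = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}

def elbo(v, q1):
    """L(v, q) = E_{h~q}[log p(h, v)] + H(q), as in Eq. 19.7.
    q1 is the variational probability q(h = 1 | v)."""
    q = [1.0 - q1, q1]
    expected_log_joint = sum(q[h] * math.log(p_joint[(h, v)]) for h in (0, 1))
    entropy = -sum(qh * math.log(qh) for qh in q if qh > 0)
    return expected_log_joint + entropy

v = 1
log_pv = math.log(p_joint[(0, v)] + p_joint[(1, v)])   # exact log p(v)
posterior_h1 = p_joint[(1, v)] / math.exp(log_pv)      # exact p(h=1 | v)

# The ELBO never exceeds log p(v), and matches it at q = p(h | v).
for q1 in (0.1, 0.3, 0.5, 0.7, 0.9, posterior_h1):
    assert elbo(v, q1) <= log_pv + 1e-12
assert abs(elbo(v, posterior_h1) - log_pv) < 1e-12
```

Any other choice of q pays a gap equal to D_KL(q(h | v) ‖ p(h | v)), which is exactly what the approximate inference schemes in this chapter trade away for tractability.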
19.2 Expectation Maximization
The first algorithm we introduce based on maximizing a lower bound L is the expectation maximization (EM) algorithm, a popular training algorithm for models with latent variables. We describe here a view on the EM algorithm developed by Neal and Hinton (1999). Unlike most of the other algorithms we describe in this chapter, EM is not an approach to approximate inference, but rather an approach to learning with an approximate posterior.

The EM algorithm consists of alternating between two steps until convergence:

• The E-step (Expectation step): Let θ^(0) denote the value of the parameters at the beginning of the step. Set q(h^(i) | v) = p(h^(i) | v^(i); θ^(0)) for all indices i of the training examples v^(i) we want to train on (both batch and minibatch variants are valid). By this we mean q is defined in terms of the current parameter value of θ^(0); if we vary θ then p(h | v; θ) will change but q(h | v) will remain equal to p(h | v; θ^(0)).
• The M-step (Maximization step): Completely or partially maximize

  Σ_i L(v^(i), θ, q)        (19.8)
with respect to θ using your optimization algorithm of choice.

This can be viewed as a coordinate ascent algorithm to maximize L. On one step, we maximize L with respect to q, and on the other, we maximize L with respect to θ.

Stochastic gradient ascent on latent variable models can be seen as a special case of the EM algorithm where the M step consists of taking a single gradient step. Other variants of the EM algorithm can make much larger steps. For some model families, the M step can even be performed analytically, jumping all the way to the optimal solution for θ given the current q.

Even though the E-step involves exact inference, we can think of the EM algorithm as using approximate inference in some sense. Specifically, the M-step assumes that the same value of q can be used for all values of θ. This will introduce a gap between L and the true log p(v) as the M-step moves further and further away from the value θ^(0) used in the E-step.
Fortunately, the E-step reduces the gap to zero again as we enter the loop for the next time.

The EM algorithm contains a few different insights. First, there is the basic structure of the learning process, in which we update the model parameters to improve the likelihood of a completed dataset, where all missing variables have their values provided by an estimate of the posterior distribution. This particular insight is not unique to the EM algorithm. For example, using gradient descent to maximize the log-likelihood also has this same property; the log-likelihood gradient computations require taking expectations with respect to the posterior distribution over the hidden units. Another key insight in the EM algorithm is that we can continue to use one value of q even after we have moved to a different value of θ. This particular insight is used throughout classical machine learning to derive large
M-step updates. In the context of deep learning, most models are too complex to admit a tractable solution for an optimal large M-step update, so this second insight, which is more unique to the EM algorithm, is rarely used.
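As a concrete illustration, the alternation described above can be sketched for a two-component Gaussian mixture, where the E-step (exact posterior responsibilities) and the M-step (closed-form means) are both available. The data, the initialization, and the choice to hold the mixing weight and variance fixed are arbitrary simplifications for this sketch, not part of the text:

```python
import math

def em_two_gaussians(data, mu, n_iters=20, pi=0.5, var=1.0):
    """Minimal EM sketch for a mixture of two unit-variance Gaussians with
    unknown means mu = [mu0, mu1]; mixing weight and variance stay fixed.
    Returns the final means and the per-iteration log-likelihoods."""
    def logpdf(x, m):
        return -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)

    lls = []
    for _ in range(n_iters):
        # E-step: q(h | x) = p(h | x; theta^(0)), the exact posterior
        # responsibilities under the current parameter values.
        resp, ll = [], 0.0
        for x in data:
            w0 = pi * math.exp(logpdf(x, mu[0]))
            w1 = (1 - pi) * math.exp(logpdf(x, mu[1]))
            resp.append(w1 / (w0 + w1))
            ll += math.log(w0 + w1)
        lls.append(ll)
        # M-step: maximize sum_i L(x^(i), theta, q) over the means, which
        # here has a closed form (responsibility-weighted averages).
        r1 = sum(resp)
        r0 = len(data) - r1
        mu = [sum((1 - r) * x for r, x in zip(resp, data)) / r0,
              sum(r * x for r, x in zip(resp, data)) / r1]
    return mu, lls

data = [-2.1, -1.9, -2.3, 1.8, 2.2, 2.0]
mu, lls = em_two_gaussians(data, mu=[-1.0, 1.0])
# Each EM cycle can only raise (never lower) the log-likelihood.
assert all(b >= a - 1e-9 for a, b in zip(lls, lls[1:]))
```

The monotone log-likelihood trace is exactly the coordinate-ascent view: the E-step closes the KL gap in L, and the M-step raises L with q held fixed.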
19.3 MAP Inference and Sparse Coding
We usually use the term inference to refer to computing the probability distribution over one set of variables given another. When training probabilistic models with latent variables, we are usually interested in computing p(h | v). An alternative form of inference is to compute the single most likely value of the missing variables, rather than to infer the entire distribution over their possible values. In the context
of latent variable models, this means computing

h* = arg max_h p(h | v).        (19.9)
This is known as maximum a posteriori inference, abbreviated MAP inference.

MAP inference is usually not thought of as approximate inference; it does compute the exact most likely value of h*. However, if we wish to develop a learning process based on maximizing L(v, h, q), then it is helpful to think of MAP inference as a procedure that provides a value of q. In this sense, we can think of MAP inference as approximate inference, because it does not provide the optimal q.

Recall from Sec. 19.1 that exact inference consists of maximizing

L(v, θ, q) = E_{h∼q} [log p(h, v)] + H(q)        (19.10)

with respect to q over an unrestricted family of probability distributions, using an exact optimization algorithm. We can derive MAP inference as a form of approximate inference by restricting the family of distributions q may be drawn from. Specifically, we require q to take on a Dirac distribution:

q(h | v) = δ(h − µ).        (19.11)
This means that we can now control q entirely via µ. Dropping terms of L that do not vary with µ, we are left with the optimization problem

µ* = arg max_µ log p(h = µ, v),        (19.12)

which is equivalent to the MAP inference problem

h* = arg max_h p(h | v).        (19.13)
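The equivalence of Eq. 19.12 and Eq. 19.13 is just the observation that p(h | v) = p(h, v) / p(v) and p(v) does not depend on h. A tiny sketch with a hypothetical joint table (the numbers are made up for illustration) makes this concrete:

```python
# Hypothetical values of p(h, v) at one observed v, for a discrete h.
p_joint = {0: 0.1, 1: 0.25, 2: 0.15}
p_v = sum(p_joint.values())   # p(v), constant with respect to h

# Eq. 19.13: maximize the posterior p(h | v) = p(h, v) / p(v).
map_from_posterior = max(p_joint, key=lambda h: p_joint[h] / p_v)
# Eq. 19.12: maximize the joint p(h, v) directly (log is monotone).
map_from_joint = max(p_joint, key=lambda h: p_joint[h])

assert map_from_posterior == map_from_joint
```

Dividing by the constant p(v) (or taking a log) never changes which h attains the maximum, so the two problems select the same point.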
We can thus justify a learning procedure similar to EM, in which we alternate between performing MAP inference to infer h* and then updating θ to increase log p(h*, v). As with EM, this is a form of coordinate ascent on L, where we alternate between using inference to optimize L with respect to q and using parameter updates to optimize L with respect to θ. The procedure as a whole can be justified by the fact that L is a lower bound on log p(v). In the case of MAP inference, this justification is rather vacuous, because the bound is infinitely loose, due to the Dirac distribution's differential entropy of negative infinity. However, adding noise to µ would make the bound meaningful again.
MAP inference is commonly used in deep learning as both a feature extractor and a learning mechanism. It is primarily used for sparse coding models.

Recall from Sec. 13.4 that sparse coding is a linear factor model that imposes a sparsity-inducing prior on its hidden units. A common choice is a factorial Laplace prior, with

p(h_i) = (λ/4) e^{−(1/2) λ |h_i|}.        (19.14)

The visible units are then generated by performing a linear transformation and adding noise:

p(x | h) = N(x; W h + b, β^{−1} I).        (19.15)

Computing or even representing p(h | v) is difficult. Every pair of variables h_i and h_j are both parents of v. This means that when v is observed, the graphical model contains an active path connecting h_i and h_j. All of the hidden units thus participate in one massive clique in p(h | v). If the model were Gaussian then these interactions could be modeled efficiently via the covariance matrix, but the sparse prior makes these interactions non-Gaussian.
Because p(h | v) is intractable, so is the computation of the log-likelihood and its gradient. We thus cannot use exact maximum likelihood learning. Instead, we use MAP inference and learn the parameters by maximizing the ELBO defined by the Dirac distribution around the MAP estimate of h.

If we concatenate all of the h vectors in the training set into a matrix H, then the sparse coding learning process consists of minimizing

J(H, W) = Σ_{i,j} |H_{i,j}| + Σ_{i,j} (X − H W^⊤)²_{i,j}.        (19.16)

Most applications of sparse coding also involve weight decay or a constraint on the norms of the columns of W, in order to prevent the pathological solution with extremely small H and large W.

We can minimize J by alternating between minimization with respect to H and minimization with respect to W. Both sub-problems are convex. In fact, the minimization with respect to W is just a linear regression problem. However, minimization of J with respect to both arguments is usually not a convex problem.

Minimization with respect to H requires specialized algorithms such as the feature-sign search algorithm (Lee et al., 2007).
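A minimal sketch of this alternating scheme follows. The W-step is the closed-form linear regression noted above; for the H-step we substitute a few ISTA (proximal gradient) steps in place of feature-sign search, since ISTA is short and also decreases J monotonically for a small enough step size. Shapes, the random data, and iteration counts are arbitrary choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: X is (n_examples, n_visible), H holds one
# code per row, W is the (n_visible, n_code) dictionary.
n, d, k = 20, 8, 5
X = rng.normal(size=(n, d))
H = np.zeros((n, k))
W = rng.normal(size=(d, k))

def J(H, W):
    """Eq. 19.16: L1 penalty on the codes plus squared reconstruction error."""
    return np.abs(H).sum() + ((X - H @ W.T) ** 2).sum()

costs = [J(H, W)]
for _ in range(10):
    # H-step: a few ISTA steps as a stand-in for feature-sign search.
    # Step size 1 / Lipschitz constant of the squared-error gradient.
    t = 1.0 / (2 * np.linalg.norm(W, 2) ** 2 + 1e-8)
    for _ in range(50):
        G = 2 * (H @ W.T - X) @ W                         # gradient of the error term
        Z = H - t * G
        H = np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)   # soft threshold (prox of L1)
    # W-step: plain linear regression, solved in closed form. (A real
    # implementation would add weight decay or a column-norm constraint
    # to rule out the small-H / large-W pathology noted above.)
    W = np.linalg.lstsq(H, X, rcond=None)[0].T
    costs.append(J(H, W))

# Each alternating step can only lower the objective.
assert all(b <= a + 1e-8 for a, b in zip(costs, costs[1:]))
```

Each half-step solves (or at least improves) one of the two convex sub-problems with the other argument fixed, which is why the cost trace is non-increasing even though the joint problem is non-convex.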
19.4 Variational Inference and Learning
We have seen how the evidence lower bound L(v, θ, q) is a lower bound on log p(v; θ), how inference can be viewed as maximizing L with respect to q, and how learning can be viewed as maximizing L with respect to θ. We have seen that the EM algorithm allows us to make large learning steps with a fixed q and that learning algorithms based on MAP inference allow us to learn using a point estimate of p(h | v) rather than inferring the entire distribution. Now we develop the more general approach to variational learning.

The core idea behind variational learning is that we can maximize L over a restricted family of distributions q. This family should be chosen so that it is easy to compute E_q log p(h, v). A typical way to do this is to introduce assumptions about how q factorizes.

A common approach to variational learning is to impose the restriction that q is a factorial distribution:

q(h | v) = Π_i q(h_i | v).        (19.17)

This is called the mean field approach. More generally, we can impose any graphical model structure we choose on q, to flexibly determine how many interactions we want our approximation to capture. This fully general graphical model approach is called structured variational inference (Saul and Jordan, 1996).

The beauty of the variational approach is that we do not need to specify a specific parametric form for q. We specify how it should factorize, but then the optimization problem determines the optimal probability distribution within those factorization constraints. For discrete latent variables, this just means that we use traditional optimization techniques to optimize a finite number of variables describing the q distribution.
For continuous latent variables, this means that we use a branch of mathematics called calculus of variations to perform optimization over a space of functions, and actually determine which function should be used to represent q. Calculus of variations is the origin of the names "variational learning" and "variational inference," though these names apply even when the latent variables are discrete and calculus of variations is not needed. In the case of continuous latent variables, calculus of variations is a powerful technique that removes much of the responsibility from the human designer of the model, who now must specify only how q factorizes, rather than needing to guess how to design a specific q that can accurately approximate the posterior.

Because L(v, θ, q) is defined to be log p(v; θ) − D_KL(q(h | v) ‖ p(h | v; θ)), we can think of maximizing L with respect to q as minimizing D_KL(q(h | v) ‖ p(h | v)).
In this sense, we are fitting q to p. However, we are doing so with the opposite direction of the KL divergence than we are used to using for fitting an approximation. When we use maximum likelihood learning to fit a model to data, we minimize D_KL(p_data ‖ p_model). As illustrated in Fig. 3.6, this means that maximum likelihood encourages the model to have high probability everywhere that the data has high probability, while our optimization-based inference procedure encourages q to have low probability everywhere the true posterior has low probability. Both directions of the KL divergence can have desirable and undesirable properties. The choice of which to use depends on which properties are the highest priority for each application. In the case of the inference optimization problem, we choose to use D_KL(q(h | v) ‖ p(h | v)) for computational reasons. Specifically, computing
D_KL(q(h | v) ‖ p(h | v)) involves evaluating expectations with respect to q, so by designing q to be simple, we can simplify the required expectations. The opposite direction of the KL divergence would require computing expectations with respect to the true posterior. Because the form of the true posterior is determined by the choice of model, we cannot design a reduced-cost approach to computing D_KL(p(h | v) ‖ q(h | v)) exactly.

19.4.1 Discrete Latent Variables
Variational inference with discrete latent variables is relatively straightforward. We define a distribution q, typically one where each factor of q is just defined by a lookup table over discrete states. In the simplest case, h is binary and we make the mean field assumption that q factorizes over each individual h_i. In this case we can parametrize q with a vector ĥ whose entries are probabilities. Then q(h_i = 1 | v) = ĥ_i.

After determining how to represent q, we simply optimize its parameters. In the case of discrete latent variables, this is just a standard optimization problem. In principle the selection of q could be done with any optimization algorithm, such as gradient descent.

Because this optimization must occur in the inner loop of a learning algorithm, it must be very fast. To achieve this speed, we typically use special optimization
A popular choice is to iterate fixed point equations, in other words, to solve

\frac{\partial}{\partial \hat{h}_i} \mathcal{L} = 0 \quad (19.18)

for ĥ_i. We repeatedly update different elements of ĥ until we satisfy a convergence criterion.
CHAPTER 19. APPROXIMATE INFERENCE
To make this more concrete, we show how to apply variational inference to the binary sparse coding model (we present here the model developed by Henniges et al. (2010) but demonstrate traditional, generic mean field applied to the model, while they introduce a specialized algorithm). This derivation goes into considerable mathematical detail and is intended for the reader who wishes to fully resolve any ambiguity in the high-level conceptual description of variational inference and learning we have presented so far. Readers who do not plan to derive or implement variational learning algorithms may safely skip to the next section without missing any new high-level concepts. Readers who proceed with the binary sparse coding example are encouraged to review the list of useful properties of functions that commonly arise in probabilistic models in Sec. 3.10.
We use these properties liberally throughout the following derivations without highlighting exactly where we use each one.

In the binary sparse coding model, the input v ∈ R^n is generated from the model by adding Gaussian noise to the sum of m different components which can each be present or absent. Each component is switched on or off by the corresponding hidden unit in h ∈ {0, 1}^m:

p(h_i = 1) = \sigma(b_i) \quad (19.19)

p(v \mid h) = \mathcal{N}(v; Wh, \beta^{-1}) \quad (19.20)

where b is a learnable set of biases, W is a learnable weight matrix, and β is a learnable, diagonal precision matrix.

Training this model with maximum likelihood requires taking the derivative with respect to the parameters.
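Before differentiating, the generative process of Eqs. 19.19-19.20 can be made concrete with a short ancestral-sampling sketch. The function name, dimensions, and parameter values below are illustrative choices of ours, not from the text:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_binary_sparse_coding(b, W, beta, rng):
    """Ancestral sampling: h_i ~ Bernoulli(sigmoid(b_i)), then
    v_j ~ Normal(mean=(W h)_j, variance=1/beta_j), since beta holds the
    diagonal of the precision matrix."""
    m = len(b)   # number of hidden units
    n = len(W)   # number of visible units; W has shape n x m
    h = [1 if rng.random() < sigmoid(bi) else 0 for bi in b]
    v = [rng.gauss(sum(W[j][i] * h[i] for i in range(m)),
                   1.0 / math.sqrt(beta[j])) for j in range(n)]
    return h, v

# Toy parameters: m = 3 hidden units, n = 2 visible units.
b = [-1.0, 0.5, 0.0]
W = [[1.0, 0.2, -0.3],
     [0.0, 0.7, 0.4]]
beta = [4.0, 4.0]
h, v = sample_binary_sparse_coding(b, W, beta, random.Random(0))
```

Each hidden unit independently switches its component on or off, and the visible units receive the sum of the active components plus per-dimension Gaussian noise.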
Consider the derivative with respect to one of the biases:

\frac{\partial}{\partial b_i} \log p(v) \quad (19.21)

= \frac{\frac{\partial}{\partial b_i} p(v)}{p(v)} \quad (19.22)

= \frac{\frac{\partial}{\partial b_i} \sum_h p(h, v)}{p(v)} \quad (19.23)

= \frac{\frac{\partial}{\partial b_i} \sum_h p(h) p(v \mid h)}{p(v)} \quad (19.24)
Figure 19.2: The graph structure of a binary sparse coding model with four hidden units. (Left) The graph structure of p(h, v). Note that the edges are directed, and that every two hidden units are co-parents of every visible unit. (Right) The graph structure of p(h | v). In order to account for the active paths between co-parents, the posterior distribution needs an edge between all of the hidden units.
= \frac{\sum_h p(v \mid h) \frac{\partial}{\partial b_i} p(h)}{p(v)} \quad (19.25)

= \sum_h p(h \mid v) \frac{\frac{\partial}{\partial b_i} p(h)}{p(h)} \quad (19.26)

= \mathbb{E}_{h \sim p(h \mid v)} \frac{\partial}{\partial b_i} \log p(h). \quad (19.27)

This requires computing expectations with respect to p(h | v). Unfortunately, p(h | v) is a complicated distribution. See Fig. 19.2 for the graph structure of p(h, v) and p(h | v). The posterior distribution corresponds to the complete graph over the hidden units, so variable elimination algorithms do not help us to compute the required expectations any faster than brute force.

We can resolve this difficulty by using variational inference and variational learning instead.

We can make a mean field approximation:

q(h \mid v) = \prod_i q(h_i \mid v). \quad (19.28)

The latent variables of the binary sparse coding model are binary, so to represent a factorial q we simply need to model m Bernoulli distributions q(h_i | v). A natural way to represent the means of the Bernoulli distributions is with a vector ĥ of probabilities, with q(h_i = 1 | v) = ĥ_i. We impose a restriction that ĥ_i is never equal to 0 or to 1, in order to avoid errors when computing, for example, log ĥ_i. We will see that the variational inference equations never assign 0 or 1 to ĥ_i
analytically. However, in a software implementation, machine rounding error could result in 0 or 1 values. In software, we may wish to implement binary sparse coding using an unrestricted vector of variational parameters z and obtain ĥ via the relation ĥ = σ(z). We can thus safely compute log ĥ_i on a computer by using the identity log σ(z_i) = −ζ(−z_i) relating the sigmoid and the softplus.

To begin our derivation of variational learning in the binary sparse coding model, we show that the use of this mean field approximation makes learning tractable.

The evidence lower bound is given by

\mathcal{L}(v, \theta, q) \quad (19.29)

= \mathbb{E}_{h \sim q}[\log p(h, v)] + H(q) \quad (19.30)

= \mathbb{E}_{h \sim q}[\log p(h) + \log p(v \mid h) - \log q(h \mid v)] \quad (19.31)

= \mathbb{E}_{h \sim q}\left[ \sum_{i=1}^m \log p(h_i) + \sum_{i=1}^n \log p(v_i \mid h) - \sum_{i=1}^m \log q(h_i \mid v) \right] \quad (19.32)

= \sum_{i=1}^m \left[ \hat{h}_i \left( \log \sigma(b_i) - \log \hat{h}_i \right) + (1 - \hat{h}_i)\left( \log \sigma(-b_i) - \log(1 - \hat{h}_i) \right) \right] \quad (19.33)

+ \mathbb{E}_{h \sim q}\left[ \sum_{i=1}^n \log \sqrt{\frac{\beta_i}{2\pi}} \exp\left( -\frac{\beta_i}{2} \left( v_i - W_{i,:} h \right)^2 \right) \right] \quad (19.34)

= \sum_{i=1}^m \left[ \hat{h}_i \left( \log \sigma(b_i) - \log \hat{h}_i \right) + (1 - \hat{h}_i)\left( \log \sigma(-b_i) - \log(1 - \hat{h}_i) \right) \right] \quad (19.35)

+ \sum_{i=1}^n \left[ \frac{1}{2} \log \frac{\beta_i}{2\pi} - \frac{\beta_i}{2} \left( v_i^2 - 2 v_i W_{i,:} \hat{h} + \sum_j \left[ W_{i,j}^2 \hat{h}_j + \sum_{k \neq j} W_{i,j} W_{i,k} \hat{h}_j \hat{h}_k \right] \right) \right] \quad (19.36)

While these equations are somewhat unappealing aesthetically, they show that L can be expressed in a small number of simple arithmetic operations. The evidence lower bound L is therefore tractable. We can use L as a replacement for the intractable log-likelihood.

In principle, we could simply run gradient ascent on both θ and ĥ, and this would make a perfectly acceptable combined inference and training algorithm. Usually, however, we do not do this, for two reasons. First, this would require storing ĥ for each v. We typically prefer algorithms that do not require per-example memory. It is difficult to scale learning algorithms to billions of examples if we must remember a dynamically updated vector associated with each example.
Second, we would like to be able to extract the features ĥ very quickly, in order to recognize the content of v. In a realistic deployed setting, we would need to be able to compute ĥ in real time.

For both these reasons, we typically do not use gradient descent to compute the mean field parameters ĥ. Instead, we rapidly estimate them with fixed point equations.

The idea behind fixed point equations is that we are seeking a local maximum with respect to ĥ, where ∇_ĥ L(v, θ, ĥ) = 0. We cannot efficiently solve this equation with respect to all of ĥ simultaneously. However, we can solve for a single variable:

\frac{\partial}{\partial \hat{h}_i} \mathcal{L}(v, \theta, \hat{h}) = 0. \quad (19.37)

We can then iteratively apply the solution to the equation for i = 1, …, m, and repeat the cycle until we satisfy a convergence criterion.
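This cycle can be sketched as a generic coordinate loop. Here `solve_unit` is a hypothetical stand-in for the model-specific solution of Eq. 19.37 for a single unit given the current values of the others, and the stopping rule is a simple tolerance on how much a full cycle changes ĥ:

```python
def mean_field_fixed_point(h_hat, solve_unit, tol=1e-6, max_cycles=1000):
    """Repeatedly update each element of h_hat in turn.
    solve_unit(i, h_hat) is a hypothetical model-specific function
    returning the value of h_hat[i] that solves Eq. 19.37 given the
    current values of the other units. Stop once a full cycle no longer
    changes h_hat by more than tol."""
    for _ in range(max_cycles):
        biggest_change = 0.0
        for i in range(len(h_hat)):
            new_value = solve_unit(i, h_hat)
            biggest_change = max(biggest_change, abs(new_value - h_hat[i]))
            h_hat[i] = new_value
        if biggest_change < tol:
            break
    return h_hat

# Toy usage with a contraction whose fixed point is h_hat[i] = 0.3:
result = mean_field_fixed_point([0.9, 0.1], lambda i, h: 0.5 * h[i] + 0.15)
```

The toy `solve_unit` is chosen only so that the loop has an obvious fixed point; a real instance would come from a model-specific derivation.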
Common convergence criteria include stopping when a full cycle of updates does not improve L by more than some tolerance amount, or when the cycle does not change ĥ by more than some amount.

Iterating mean field fixed point equations is a general technique that can provide fast variational inference in a broad variety of models. To make this more concrete, we show how to derive the updates for the binary sparse coding model in particular.

First, we must write an expression for the derivatives with respect to ĥ_i. To do so, we substitute Eq. 19.36 into the left side of Eq. 19.37:

\frac{\partial}{\partial \hat{h}_i} \mathcal{L}(v, \theta, \hat{h}) \quad (19.38)

= \frac{\partial}{\partial \hat{h}_i} \Bigg[ \sum_{j=1}^m \left[ \hat{h}_j \left( \log \sigma(b_j) - \log \hat{h}_j \right) + (1 - \hat{h}_j)\left( \log \sigma(-b_j) - \log(1 - \hat{h}_j) \right) \right] \quad (19.39)

+ \sum_{j=1}^n \left[ \frac{1}{2} \log \frac{\beta_j}{2\pi} - \frac{\beta_j}{2} \left( v_j^2 - 2 v_j W_{j,:} \hat{h} + \sum_k \left[ W_{j,k}^2 \hat{h}_k + \sum_{l \neq k} W_{j,k} W_{j,l} \hat{h}_k \hat{h}_l \right] \right) \right] \Bigg] \quad (19.40)

= \log \sigma(b_i) - \log \hat{h}_i - 1 + \log(1 - \hat{h}_i) + 1 - \log \sigma(-b_i) \quad (19.41)

+ \sum_{j=1}^n \beta_j \left[ v_j W_{j,i} - \frac{1}{2} W_{j,i}^2 - \sum_{k \neq i} W_{j,k} W_{j,i} \hat{h}_k \right] \quad (19.42)
= b_i - \log \hat{h}_i + \log(1 - \hat{h}_i) + v^\top \beta W_{:,i} - \frac{1}{2} W_{:,i}^\top \beta W_{:,i} - \sum_{j \neq i} W_{:,j}^\top \beta W_{:,i} \hat{h}_j. \quad (19.43)

To apply the fixed point update inference rule, we solve for the ĥ_i that sets Eq. 19.43 to 0:

\hat{h}_i = \sigma\left( b_i + v^\top \beta W_{:,i} - \frac{1}{2} W_{:,i}^\top \beta W_{:,i} - \sum_{j \neq i} W_{:,j}^\top \beta W_{:,i} \hat{h}_j \right). \quad (19.44)

At this point, we can see that there is a close connection between recurrent neural networks and inference in graphical models. Specifically, the mean field fixed point equations defined a recurrent neural network. The task of this network is to perform inference. We have described how to derive this network from a model description, but it is also possible to train the inference network directly. Several ideas based on this theme are described in Chapter 20.

In the case of binary sparse coding, we can see that the recurrent network connection specified by Eq. 19.44 consists of repeatedly updating the hidden units based on the changing values of the neighboring hidden units. The input always sends a fixed message of v^⊤βW to the hidden units, but the hidden units constantly update the message they send to each other. Specifically, two units ĥ_i and ĥ_j inhibit each other when their weight vectors are aligned. This is a form of competition: between two hidden units that both explain the input, only the one that explains the input best will be allowed to remain active. This competition is the mean field approximation's attempt to capture the explaining away interactions in the binary sparse coding posterior. The explaining away effect actually should cause a multi-modal posterior, so that if we draw samples from the posterior, some samples will have one unit active, other samples will have the other unit active, but very few samples have both active. Unfortunately, explaining away interactions cannot be modeled by the factorial q used for mean field, so the mean field approximation is forced to choose one mode to model. This is an instance of the behavior illustrated in Fig. 3.6.

We can rewrite Eq. 19.44 into an equivalent form that reveals some further insights:

\hat{h}_i = \sigma\left( b_i + \left( v - \sum_{j \neq i} W_{:,j} \hat{h}_j \right)^\top \beta W_{:,i} - \frac{1}{2} W_{:,i}^\top \beta W_{:,i} \right). \quad (19.45)

In this reformulation, we see the input at each step as consisting of v − Σ_{j≠i} W_{:,j} ĥ_j rather than v. We can thus think of unit i as attempting to encode the residual error in v given the code of the other units. We can thus think of sparse coding as an iterative autoencoder, that repeatedly encodes and decodes its input, attempting to fix mistakes in the reconstruction after each iteration.

In this example, we have derived an update rule that updates a single unit at a time. It would be advantageous to be able to update more units simultaneously. Some graphical models, such as deep Boltzmann machines, are structured in such a way that we can solve for many entries of ĥ simultaneously. Unfortunately, binary sparse coding does not admit such block updates. Instead, we can use a heuristic technique called damping to perform block updates. In the damping approach, we solve for the individually optimal values of every element of ĥ, then move all of the values in a small step in that direction.
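As a sketch of how damping interacts with Eq. 19.44, the following applies the fixed point equation to every unit and then moves each ĥ_i only a fraction `alpha` of the way toward its individually optimal value; `alpha = 1` would recover the undamped synchronous update. All parameter values here are toy choices of ours:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def damped_mean_field_step(h_hat, v, b, W, beta, alpha=0.2):
    """One damped block update. W has shape n x m; beta holds the diagonal
    of the precision matrix. Each target value follows Eq. 19.44; damping
    then takes only a small step toward it."""
    n, m = len(W), len(b)
    target = []
    for i in range(m):
        # v^T beta W_{:,i}
        drive = sum(v[j] * beta[j] * W[j][i] for j in range(n))
        # (1/2) W_{:,i}^T beta W_{:,i}
        self_term = 0.5 * sum(beta[j] * W[j][i] ** 2 for j in range(n))
        # sum_{k != i} W_{:,k}^T beta W_{:,i} h_hat_k
        cross = sum(h_hat[k] * sum(beta[j] * W[j][k] * W[j][i] for j in range(n))
                    for k in range(m) if k != i)
        target.append(sigmoid(b[i] + drive - self_term - cross))
    # Damped step: convex combination of the old values and the targets.
    return [(1 - alpha) * h + alpha * t for h, t in zip(h_hat, target)]

# Toy problem: iterate the damped block update toward a fixed point.
b, beta, v = [0.0, 0.0], [1.0, 1.0], [1.0, 0.5]
W = [[1.0, 0.5],
     [0.2, 1.0]]
h_hat = [0.5, 0.5]
for _ in range(300):
    h_hat = damped_mean_field_step(h_hat, v, b, W, beta)
```

Because every ĥ_i stays a convex combination of sigmoid outputs, the iterates remain strictly inside (0, 1), consistent with the restriction imposed earlier on ĥ.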
This approach is no longer guaranteed to increase L at each step, but works well in practice for many models. See Koller and Friedman (2009) for more information about choosing the degree of synchrony and damping strategies in message passing algorithms.
19.4.2 Calculus of Variations
Before continuing with our presentation of variational learning, we must briefly introduce an important set of mathematical tools used in variational learning: calculus of variations.

Many machine learning techniques are based on minimizing a function J(θ) by finding the input vector θ ∈ R^n for which it takes on its minimal value. This can be accomplished with multivariate calculus and linear algebra, by solving for the critical points where ∇_θ J(θ) = 0. In some cases, we actually want to solve for a function f(x), such as when we want to find the probability density function over some random variable. This is what calculus of variations enables us to do.

A function of a function f is known as a functional J[f]. Much as we can take partial derivatives of a function with respect to elements of its vector-valued argument, we can take functional derivatives, also known as variational derivatives, of a functional J[f] with respect to individual values of the function f(x) at any specific value of x. The functional derivative of the functional J with respect to the value of the function f at point x is denoted \frac{\delta}{\delta f(x)} J.

A complete formal development of functional derivatives is beyond the scope of this book. For our purposes, it is sufficient to state that for differentiable functions f(x) and differentiable functions g(y, x) with continuous derivatives, that

\frac{\delta}{\delta f(x)} \int g(f(x), x) \, dx = \frac{\partial}{\partial y} g(f(x), x). \quad (19.46)
To gain some intuition for this identity, one can think of f(x) as being a vector with uncountably many elements, indexed by a real vector x. In this (somewhat incomplete) view, the identity providing the functional derivatives is the same as we would obtain for a vector θ ∈ R^n indexed by positive integers:

\frac{\partial}{\partial \theta_i} \sum_j g(\theta_j, j) = \frac{\partial}{\partial \theta_i} g(\theta_i, i). \quad (19.47)

Many results in other machine learning publications are presented using the more general Euler-Lagrange equation, which allows g to depend on the derivatives of f as well as the value of f, but we do not need this fully general form for the results presented in this book.

To optimize a function with respect to a vector, we take the gradient of the function with respect to the vector and solve for the point where every element of the gradient is equal to zero.
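The "function as an uncountable vector" intuition can be checked numerically: discretizing a functional such as H[p] = −∫ p(x) log p(x) dx on a grid turns it into an ordinary function of the grid values, and by Eq. 19.46 with g(y, x) = −y log y, its partial derivative with respect to one grid value should match the functional derivative −(log p(x) + 1) scaled by the grid spacing. The discretization below is our own illustration, not a construction from the text:

```python
import math

def entropy(p, dx):
    """Discretized H[p] = -sum_k p_k log p_k * dx."""
    return -sum(pk * math.log(pk) for pk in p) * dx

dx = 0.01
p = [0.5 + 0.1 * math.sin(k * dx) for k in range(100)]  # any positive grid values

# Compare a finite-difference partial derivative with the functional
# derivative delta H / delta p(x) = -(log p(x) + 1), scaled by dx.
k, eps = 37, 1e-6
p_bumped = list(p)
p_bumped[k] += eps
numeric = (entropy(p_bumped, dx) - entropy(p, dx)) / eps
analytic = -(math.log(p[k]) + 1.0) * dx
```

The two quantities agree up to finite-difference error, which is exactly the sense in which a functional derivative is the continuum limit of an ordinary gradient.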
Likewise, we can optimize a functional by solving for the function where the functional derivative at every point is equal to zero.

As an example of how this process works, consider the problem of finding the probability distribution function over x ∈ R that has maximal differential entropy. Recall that the entropy of a probability distribution p(x) is defined as

H[p] = -\mathbb{E}_x \log p(x). \quad (19.48)

For continuous values, the expectation is an integral:

H[p] = -\int p(x) \log p(x) \, dx. \quad (19.49)

We cannot simply maximize H[p] with respect to the function p(x), because the result might not be a probability distribution. Instead, we need to use Lagrange multipliers, to add a constraint that p(x) integrates to 1. Also, the entropy increases without bound as the variance increases. This makes the question of which distribution has the greatest entropy uninteresting.
Instead, we ask which distribution has maximal entropy for fixed variance σ². Finally, the problem is underdetermined because the distribution can be shifted arbitrarily without changing the entropy. To impose a unique solution, we add a constraint that the mean of the distribution be µ. The Lagrangian functional for this optimization problem is

\mathcal{L}[p] = \lambda_1 \left( \int p(x) \, dx - 1 \right) + \lambda_2 \left( \mathbb{E}[x] - \mu \right) + \lambda_3 \left( \mathbb{E}[(x - \mu)^2] - \sigma^2 \right) + H[p] \quad (19.50)
= \int \left( \lambda_1 p(x) + \lambda_2 p(x) x + \lambda_3 p(x) (x - \mu)^2 - p(x) \log p(x) \right) dx - \lambda_1 - \mu \lambda_2 - \sigma^2 \lambda_3. \quad (19.51)

To minimize the Lagrangian with respect to p, we set the functional derivatives equal to 0:

\forall x, \ \frac{\delta}{\delta p(x)} \mathcal{L} = \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 - 1 - \log p(x) = 0. \quad (19.52)

This condition now tells us the functional form of p(x). By algebraically re-arranging the equation, we obtain
p(x) = \exp\left( \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 - 1 \right). \quad (19.53)

We never assumed directly that p(x) would take this functional form; we obtained the expression itself by analytically minimizing a functional. To finish the minimization problem, we must choose the λ values to ensure that all of our constraints are satisfied. We are free to choose any λ values, because the gradient of the Lagrangian with respect to the λ variables is zero so long as the constraints are satisfied. To satisfy all of the constraints, we may set λ₁ = 1 − log σ√(2π), λ₂ = 0, and λ₃ = −1/(2σ²) to obtain

p(x) = \mathcal{N}(x; \mu, \sigma^2). \quad (19.54)

This is one reason for using the normal distribution when we do not know the true distribution. Because the normal distribution has the maximum entropy, we impose the least possible amount of structure by making this assumption.
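The choice of multipliers can be sanity-checked numerically: substituting λ₁ = 1 − log σ√(2π), λ₂ = 0, and λ₃ = −1/(2σ²) into Eq. 19.53 should reproduce the normal density pointwise and integrate to 1. The grid limits and step size below are arbitrary choices of ours:

```python
import math

mu, sigma = 1.5, 0.8
lam1 = 1.0 - math.log(sigma * math.sqrt(2.0 * math.pi))
lam2 = 0.0
lam3 = -1.0 / (2.0 * sigma ** 2)

def p(x):
    # Eq. 19.53 with the Lagrange multipliers substituted in.
    return math.exp(lam1 + lam2 * x + lam3 * (x - mu) ** 2 - 1.0)

def normal_pdf(x):
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

# p should agree with the normal density pointwise...
max_gap = max(abs(p(x / 10.0) - normal_pdf(x / 10.0)) for x in range(-100, 100))

# ...and integrate to roughly 1 over a wide grid (simple Riemann sum).
step = 0.001
total = sum(p(mu - 6.0 * sigma + k * step) for k in range(int(12.0 * sigma / step))) * step
```

Both checks confirm that the stationary point of the Lagrangian, with these multiplier values, is exactly the Gaussian with the prescribed mean and variance.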
What ab about out the probability distribution function that minimizesy, w e found only oney critical poinfind t, corresp onding to maximizing the entrop for the entrop entropy? y? Wh Why did we not a second critical point corresp corresponding onding toy the fixed vum? ariance. thethere probability distribution function that minimizes minim minimum? TheWhat reasonabisout that is no sp specific ecific function that achiev achieves es minimal the entrop y? Wh y did we not find a second critical p oint corresp onding µ +the σ en entrop trop tropy y. As functions place more probability density on the tw two o points x = to minim um? The reason is that there is no sp ecific function that achiev es minimal and x = µ − σ, and place less probability density on all other values of x, they lose µ+σ entrop yy. while As functions place the moredesired probability density on the o pfunction oints x =placing en entrop trop tropy maintaining variance. Ho Howev wev wever, er, tw any x, they µ σmass and x = zero , andon place onin alltegrate other vto alues losea exactly all less but probability two poin oints ts density do does es not integrate one,ofand is not trop y while theThere desired Howev er, any function placing − maintaining ven alid probability distribution. thusvariance. is no single minimal entrop entropy y probabilit probability y exactly zero mass on all but t w o p oin ts do es not in tegrate to one, and is not a distribution function, muc much h as there is no single minimal positive real num number. ber. vInstead, alid probability distribution. thus is of noprobability single minimal entropy probabilit y we can say that there There is a sequence distributions con conv verging distribution function, mucon h as there is single minimal positive real num to tow ward putting mass only these tw two o pno oints. This degenerate scenario mayber. 
be Instead, we can say that there is a sequence of probability distributions con v erging describ described ed as a mixture of Dirac distributions. Because Dirac distributions are to w ard putting onlyprobabilit on theseytw o points. This degenerate scenario may bofe not describ described ed bymass a single probability distribution function, no Dirac or mixture described as a mixture of Dirac distributions. Because Dirac distributions are 650 not described by a single probability distribution function, no Dirac or mixture of
CHAPTER 19. APPROXIMATE INFERENCE
Dirac distributions corresponds to a single specific point in function space. These distributions are thus invisible to our method of solving for a specific point where the functional derivatives are zero. This is a limitation of the method. Distributions such as the Dirac must be found by other methods, such as guessing the solution and then proving that it is correct.
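The maximum entropy property just derived can be checked numerically. The sketch below (with an arbitrary illustrative σ) uses the standard closed-form differential entropies of three families, each parameterized to have the same variance σ²; the Gaussian's entropy is the largest, as the derivation predicts:

```python
import math

# Arbitrary common standard deviation; each distribution below is
# parameterized so that its variance equals sigma**2.
sigma = 1.5

# Standard closed-form differential entropies (in nats):
h_gaussian = 0.5 * math.log(2 * math.pi * math.e * sigma**2)
# Laplace with variance sigma^2 has scale b = sigma/sqrt(2); entropy = 1 + log(2b).
h_laplace = 1 + math.log(2 * sigma / math.sqrt(2))
# Uniform with variance sigma^2 has width sigma*sqrt(12); entropy = log(width).
h_uniform = math.log(sigma * math.sqrt(12))

# The Gaussian attains the maximum entropy for fixed variance.
assert h_gaussian > h_laplace and h_gaussian > h_uniform
```

The same ordering holds for any σ, since each of the three entropies differs from log σ only by a constant.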
19.4.3  Continuous Latent Variables
When our graphical model contains continuous latent variables, we may still perform variational inference and learning by maximizing L. However, we must now use calculus of variations when maximizing L with respect to q(h | v).

In most cases, practitioners need not solve any calculus of variations problems themselves. Instead, there is a general equation for the mean field fixed point updates. If we make the mean field approximation

q(h | v) = ∏ᵢ q(hᵢ | v),   (19.55)

and fix q(hⱼ | v) for all j ≠ i, then the optimal q(hᵢ | v) may be obtained by normalizing the unnormalized distribution

q̃(hᵢ | v) = exp( E_{h₋ᵢ∼q(h₋ᵢ|v)} log p̃(v, h) ),   (19.56)

so long as p does not assign 0 probability to any joint configuration of variables. Carrying out the expectation inside the equation will yield the correct functional form of q(hᵢ | v). It is only necessary to derive functional forms of q directly using calculus of variations if one wishes to develop a new form of variational learning; Eq. 19.56 yields the mean field approximation for any probabilistic model.

Eq. 19.56 is a fixed point equation, designed to be iteratively applied for each value of i repeatedly until convergence. However, it also tells us more than that. It tells us the functional form that the optimal solution will take, whether we arrive there by fixed point equations or not. This means we can take the functional form from that equation but regard some of the values that appear in it as parameters, which we can optimize with any optimization algorithm we like.

As an example, consider a very simple probabilistic model, with latent variables h ∈ ℝ² and just one visible variable, v. Suppose that p(h) = N(h; 0, I) and p(v | h) = N(v; w⊤h; 1). We could actually simplify this model by integrating out h; the result is just a Gaussian distribution over v. The model itself is not interesting; we have constructed it only to provide a simple demonstration of how calculus of variations may be applied to probabilistic modeling.
The true posterior is given, up to a normalizing constant, by

p(h | v)   (19.57)
∝ p(h, v)   (19.58)
= p(h₁)p(h₂)p(v | h)   (19.59)
∝ exp( −½ ( h₁² + h₂² + (v − h₁w₁ − h₂w₂)² ) )   (19.60)
= exp( −½ ( h₁² + h₂² + v² + h₁²w₁² + h₂²w₂² − 2vh₁w₁ − 2vh₂w₂ + 2h₁w₁h₂w₂ ) ).   (19.61)

Due to the presence of the terms multiplying h₁ and h₂ together, we can see that the true posterior does not factorize over h₁ and h₂.

Applying Eq. 19.56, we find that

q̃(h₁ | v)   (19.62)
= exp( E_{h₂∼q(h₂|v)} log p̃(v, h) )   (19.63)
= exp( −½ E_{h₂∼q(h₂|v)} [ h₁² + h₂² + v² + h₁²w₁² + h₂²w₂²   (19.64)
    − 2vh₁w₁ − 2vh₂w₂ + 2h₁w₁h₂w₂ ] ).   (19.65)

From this, we can see that there are effectively only two values we need to obtain from q(h₂ | v): E_{h₂∼q(h|v)}[h₂] and E_{h₂∼q(h|v)}[h₂²]. Writing these as ⟨h₂⟩ and ⟨h₂²⟩, we obtain

q̃(h₁ | v) = exp( −½ [ h₁² + ⟨h₂²⟩ + v² + h₁²w₁² + ⟨h₂²⟩w₂²   (19.66)
    − 2vh₁w₁ − 2v⟨h₂⟩w₂ + 2h₁w₁⟨h₂⟩w₂ ] ).   (19.67)

From this, we can see that q̃ has the functional form of a Gaussian. We can thus conclude q(h | v) = N(h; µ, β⁻¹), where µ and diagonal β are variational parameters that we can optimize using any technique we choose. It is important to recall that we did not ever assume that q would be Gaussian; its Gaussian form was derived automatically by using calculus of variations to maximize L with respect to q. Using the same approach on a different model could yield a different functional form of q.

This was, of course, just a small case constructed for demonstration purposes. For examples of real applications of variational learning with continuous variables in the context of deep learning, see Goodfellow et al. (2013d).
19.4.4  Interactions between Learning and Inference
Using approximate inference as part of a learning algorithm affects the learning process, and this in turn affects the accuracy of the inference algorithm.

Specifically, the training algorithm tends to adapt the model in a way that makes the approximating assumptions underlying the approximate inference algorithm become more true. When training the parameters, variational learning increases

E_{h∼q} log p(v, h).   (19.68)

For a specific v, this increases p(h | v) for values of h that have high probability under q(h | v) and decreases p(h | v) for values of h that have low probability under q(h | v).

This behavior causes our approximating assumptions to become self-fulfilling prophecies. If we train the model with a unimodal approximate posterior, we will obtain a model with a true posterior that is far closer to unimodal than we would have obtained by training the model with exact inference.
Computing the true amount of harm imposed on a model by a variational approximation is thus very difficult. There exist several methods for estimating log p(v). We often estimate log p(v; θ) after training the model, and find that the gap with L(v, θ, q) is small. From this, we can conclude that our variational approximation is accurate for the specific value of θ that we obtained from the learning process. We should not conclude that our variational approximation is accurate in general or that the variational approximation did little harm to the learning process. To measure the true amount of harm induced by the variational approximation, we would need to know θ* = arg max_θ log p(v; θ). It is possible for L(v, θ, q) ≈ log p(v; θ) and log p(v; θ) ≪ log p(v; θ*) to hold simultaneously. If max_q L(v, θ*, q) ≪ log p(v; θ*), because θ* induces too complicated of a posterior distribution for our q family to capture, then the learning process will never approach θ*. Such a problem is very difficult to detect, because we can only know for sure that it happened if we have a superior learning algorithm that can find θ* for comparison.
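The kind of diagnostic described above can be made concrete on a model small enough to enumerate. The sketch below (an arbitrary made-up joint over two binary latent variables and a fixed observed v, purely for illustration) computes log p(v) exactly, maximizes L over factorized q by a coarse grid search, and confirms both that L lower-bounds log p(v) and that a factorized family leaves a nonzero gap when the posterior is correlated:

```python
import math
from itertools import product

# Arbitrary illustrative values of p(h1, h2, v) for one fixed observed v.
# The posterior puts most mass on (0, 0) and (1, 1), so it is correlated.
p_hv = {(0, 0): 0.30, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.25}
log_p_v = math.log(sum(p_hv.values()))  # exact log p(v) by enumeration

def elbo(a, b):
    """L(v, q) for the factorized q(h) = Bernoulli(a) * Bernoulli(b)."""
    total = 0.0
    for h1, h2 in product([0, 1], repeat=2):
        q = (a if h1 else 1 - a) * (b if h2 else 1 - b)
        if q > 0:
            total += q * (math.log(p_hv[(h1, h2)]) - math.log(q))
    return total

# Coarse grid search over the factorized family.
best = max(elbo(i / 100, j / 100) for i in range(1, 100) for j in range(1, 100))
assert best <= log_p_v + 1e-12   # L never exceeds log p(v)
assert log_p_v - best > 1e-3     # the factorized family cannot close the gap here
```

The remaining gap is exactly KL(q(h) ‖ p(h | v)) at the best factorized q, which is positive here because no product distribution matches the correlated posterior.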
19.5  Learned Approximate Inference
We have seen that inference can be thought of as an optimization procedure that increases the value of a function L. Explicitly performing optimization via iterative procedures such as fixed point equations or gradient-based optimization is often very expensive and time-consuming. Many approaches to inference avoid
this expense by learning to perform approximate inference. Specifically, we can think of the optimization process as a function f that maps an input v to an approximate distribution q* = arg max_q L(v, q). Once we think of the multi-step iterative optimization process as just being a function, we can approximate it with a neural network that implements an approximation f̂(v; θ).
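A minimal sketch of this amortization idea, reusing the linear-Gaussian toy model of Sec. 19.4.3 (all constants are illustrative): because the optimal mean field means are a linear function of v, a one-weight-per-latent "inference network" μ̂ᵢ(v) = θᵢv trained by gradient descent to imitate the iterative optimizer recovers the closed-form map wᵢ/(1 + ‖w‖²), after which inference is a single multiplication rather than a fixed point iteration:

```python
import random

random.seed(0)
# Toy model from Sec. 19.4.3: p(h) = N(h; 0, I), p(v | h) = N(v; w^T h, 1).
w = [1.0, 0.5]                           # illustrative weights
norm = 1 + sum(wi * wi for wi in w)

def mean_field(v, n_iter=50):
    """Expensive iterative inference: the fixed point updates of Eq. 19.56."""
    mu = [0.0, 0.0]
    for _ in range(n_iter):
        mu[0] = w[0] * (v - w[1] * mu[1]) / (1 + w[0] ** 2)
        mu[1] = w[1] * (v - w[0] * mu[0]) / (1 + w[1] ** 2)
    return mu

# Train a tiny linear "inference network" mu_hat_i(v) = theta[i] * v to
# imitate the optimizer's output on sampled inputs v.
theta = [0.0, 0.0]
lr = 0.02
for _ in range(2000):
    v = random.gauss(0.0, 2.0)
    target = mean_field(v)
    for i in range(2):
        err = theta[i] * v - target[i]
        theta[i] -= lr * err * v         # gradient of 0.5 * err**2 w.r.t. theta[i]

# The learned map should recover the closed-form solution w_i / (1 + ||w||^2).
for i in range(2):
    assert abs(theta[i] - w[i] / norm) < 1e-2
```

Here the targets come from running the optimizer itself; the variational autoencoder discussed later avoids even that, adapting the inference network directly to increase L.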
19.5.1  Wake-Sleep
One of the main difficulties with training a model to infer h from v is that we do not have a supervised training set with which to train the model. Given a v, we do not know the appropriate h. The mapping from v to h depends on the choice of model family, and evolves throughout the learning process as θ changes. The wake-sleep algorithm (Hinton et al., 1995b; Frey et al., 1996) resolves this problem by drawing samples of both h and v from the model distribution. For example, in a directed model, this can be done cheaply by performing ancestral sampling beginning at h and ending at v. The inference network can then be trained to perform the reverse mapping: predicting which h caused the present v. The main drawback to this approach is that we will only be able to train the
inference network on values of v that have high probability under the model. Early in learning, the model distribution will not resemble the data distribution, so the inference network will not have an opportunity to learn on samples that resemble the data.

In Sec. 18.2 we saw that one possible explanation for the role of dream sleep in human beings and animals is that dreams could provide the negative phase samples that Monte Carlo training algorithms use to approximate the negative gradient of the log partition function of undirected models. Another possible explanation for biological dreaming is that it is providing samples from p(h, v) which can be used to train an inference network to predict h given v. In some senses, this explanation is more satisfying than the partition function explanation. Monte Carlo algorithms generally do not perform well if they are run using only the positive phase of the
Monte Carlo algorithms gradien gradientt for several steps then with only the negative phase of the gradient for generally do not perform welland if they are are run usually using only the positive of the sev several eral steps. Human beings animals awak ake e for severalphase consecutive gradien t forasleep several then with only the negative the gradient for hours then for steps several consecutive hours. It is not phase readilyofapparent how this sev eral steps. beings andCarlo animals are usually awake for several sc schedule hedule couldHuman supp support ort Monte training of an undirected mo model. del.consecutive Learning hours then asleep for several consecutive hours. It is not readily apparent how this algorithms based on maximizing L can be run with prolonged perio eriods ds of improving sc hedule could supp ort Monte Carlo training of an undirected mo del. Learning q and prolonged perio eriods ds of impro improving ving θ , how however. ever. If the role of biological dreaming algorithms based on maximizing can b e run with prolonged eriods ofare improving q is to train netw for predicting , then this explains ho able to networks orks how w panimals qremain θ and prolonged p erio ds of impro ving , how ever. If the role of biological dreaming L awak akee for several hours (the longer they are aw awak ak ake, e, the greater the gap q is to train netw orks for predicting , then this explains ho animals are able to bet etw ween L and log p (v ), but L will remain a low lower er bound) wand to remain asleep remain awake for several hours (the longer they are awake, the greater the gap 654 a lower b ound) and to remain asleep between and log p (v ), but will remain L L
for several hours (the generative model itself is not modified during sleep) without damaging their internal models. Of course, these ideas are purely speculative, and there is no hard evidence to suggest that dreaming accomplishes either of these goals. Dreaming may also serve reinforcement learning rather than probabilistic modeling, by sampling synthetic experiences from the animal's transition model, on which to train the animal's policy. Or sleep may serve some other purpose not yet anticipated by the machine learning community.
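Setting the biological speculation aside, the sleep phase itself is simple to sketch. Below, for a made-up directed model with one binary latent and one binary visible unit (all probabilities illustrative, not from the text), we dream (h, v) pairs by ancestral sampling and fit the recognition model on them; with a single binary input, the maximum likelihood recognition model is just a conditional frequency table, the simplest possible inference network:

```python
import random

random.seed(0)

# Illustrative directed model: h ~ Bernoulli(0.5),
# v | h ~ Bernoulli(0.8 if h == 1 else 0.2).
def ancestral_sample():
    h = 1 if random.random() < 0.5 else 0
    v = 1 if random.random() < (0.8 if h == 1 else 0.2) else 0
    return h, v

# Sleep phase: dream up (h, v) pairs from the model, then fit the
# recognition model q(h | v) on them by conditional frequency counting.
counts = {0: [0, 0], 1: [0, 0]}  # counts[v] = [# of h=0, # of h=1]
for _ in range(50000):
    h, v = ancestral_sample()
    counts[v][h] += 1

q1 = counts[1][1] / sum(counts[1])  # learned q(h = 1 | v = 1)
q0 = counts[0][1] / sum(counts[0])  # learned q(h = 1 | v = 0)

# Bayes' rule gives the true posteriors p(h=1 | v=1) = 0.8 and
# p(h=1 | v=0) = 0.2, which the dream-trained model approaches.
assert abs(q1 - 0.8) < 0.02 and abs(q0 - 0.2) < 0.02
```

The drawback discussed above also shows up here: the recognition model is only ever fit on values of v that the generative model itself produces.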
19.5.2  Other Forms of Learned Inference
This strategy of learned approximate inference has also been applied to other models. Salakhutdinov and Larochelle (2010) showed that a single pass in a learned inference network could yield faster inference than iterating the mean field fixed point equations in a DBM. The training procedure is based on running the inference network, then applying one step of mean field to improve its estimates, and training the inference network to output this refined estimate instead of its original estimate.

We have already seen in Sec. 14.8 that the predictive sparse decomposition model trains a shallow encoder network to predict a sparse code for the input. This can be seen as a hybrid between an autoencoder and sparse coding. It is possible to devise probabilistic semantics for the model, under which the encoder
Due to its possible to devise probabilistic semantics for thethe model, which the benco der shallo shallow w enco encoder, der, PSD is not able to implement kindunder of comp competition etition etw etween een ma y b e viewed as p erforming learned approximate MAP inference. Due to units that we hav havee seen in mean field inference. Ho How wev ever, er, that problem can its be shallow enco der, PSDa is notenco ableder to to implement the kindappro of comp etition between remedied by training deep encoder perform learned approximate ximate inference, as units that w e hav e seen in mean field inference. Ho w ev er, that problem can b e in the IST ISTA A tec technique hnique (Gregor and LeCun, 2010b). remedied by training a deep encoder to perform learned approximate inference, as Learned appro approximate ximate inference has recen recently tly become one of the dominan dominantt in the ISTA technique (Gregor and LeCun, 2010b). approac approaches hes to generativ generativee mo modeling, deling, in the form of the variational auto autoenco enco encoder der Learned appro ximate recen tly become one of the dominan (Kingma , 2013 ; Rezende et inference al., 2014). has In this elegan elegant t approac approach, h, there is no need tot approaches to generativ mothe deling, in the form the variational autoenco der construct explicit targetsefor inference netw network. ork.ofInstead, the inference netw network ork (isKingma , 2013 ; Rezende et al. , 2014 ). In this elegan t approac h, there is no need to simply used to define Lelegan elegantt approach, there is no need the inference netw network ork construct explicit targetsL.for the mo inference network. Instead, inference ork. are adapted to increase This model del is describ described ed in depth the later, in Sec. 
netw 20.10.3 is simply used to define elegant approach, there is no need the inference network Using appro approximate ximate inference, it is possible to train and use a wide variety of are adapted to increase L. This model is described in depth later, in Sec. 20.10.3. mo models. dels. Many of these mo models dels are describ described ed in the next chapter. L Using approximate inference, it is possible to train and use a wide variety of models. Many of these models are described in the next chapter.
Chapter 20
Deep Generative Models

In this chapter, we present several of the specific kinds of generative models that can be built and trained using the techniques presented in Chapters 16, 17, 18 and 19. All of these models represent probability distributions over multiple variables in some way. Some allow the probability distribution function to be evaluated explicitly. Others do not allow the evaluation of the probability distribution function, but support operations that implicitly require knowledge of it, such as drawing samples from the distribution. Some of these models are structured probabilistic models described in terms of graphs and factors, using the language of graphical models presented in Chapter 16. Others can not easily be described in terms of factors, but represent probability distributions nonetheless.
20.1  Boltzmann Machines
Boltzmann machines were originally introduced as a general "connectionist" approach to learning arbitrary probability distributions over binary vectors (Fahlman et al., 1983; Ackley et al., 1985; Hinton et al., 1984; Hinton and Sejnowski, 1986). Variants of the Boltzmann machine that include other kinds of variables have long ago surpassed the popularity of the original. In this section we briefly introduce the binary Boltzmann machine and discuss the issues that come up when trying to train and perform inference in the model.

We define the Boltzmann machine over a d-dimensional binary random vector x ∈ {0, 1}ᵈ. The Boltzmann machine is an energy-based model (Sec. 16.2.4),
meaning we define the joint probability distribution using an energy function:

P(x) = exp(−E(x)) / Z,   (20.1)

where E(x) is the energy function and Z is the partition function that ensures that Σₓ P(x) = 1. The energy function of the Boltzmann machine is given by

E(x) = −x⊤Ux − b⊤x,   (20.2)

where U is the "weight" matrix of model parameters and b is the vector of bias parameters.

In the general setting of the Boltzmann machine, we are given a set of training examples, each of which are n-dimensional. Eq. 20.1 describes the joint probability distribution over the observed variables. While this scenario is certainly viable, it does limit the kinds of interactions between the observed variables to those described by the weight matrix. Specifically, it means that the probability of one unit being on is given by a linear model (logistic regression) from the values of the other units.
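For small d, Eqs. 20.1-20.2 can be verified by brute force, since Z is a sum over only 2ᵈ states; the same enumeration also confirms the logistic regression observation. A sketch with arbitrary illustrative parameters:

```python
import math
from itertools import product

# Arbitrary illustrative parameters for a d = 3 Boltzmann machine.
U = [[0.0, 0.5, -0.3],
     [0.0, 0.0, 0.8],
     [0.0, 0.0, 0.0]]   # weights; the energy uses x^T U x
b = [0.1, -0.2, 0.3]    # biases

def energy(x):
    # E(x) = -x^T U x - b^T x   (Eq. 20.2)
    quad = sum(U[i][j] * x[i] * x[j] for i in range(3) for j in range(3))
    return -quad - sum(bi * xi for bi, xi in zip(b, x))

states = list(product([0, 1], repeat=3))
Z = sum(math.exp(-energy(x)) for x in states)
P = {x: math.exp(-energy(x)) / Z for x in states}
assert abs(sum(P.values()) - 1.0) < 1e-12   # Eq. 20.1 normalizes

# The conditional of one unit given the rest is logistic regression:
# P(x_i = 1 | x_-i) = sigmoid(b_i + U_ii + sum_{j != i} (U_ij + U_ji) x_j).
i, rest = 0, (1, 1)     # condition on x_1 = 1, x_2 = 1
p_on = P[(1,) + rest] / (P[(0,) + rest] + P[(1,) + rest])
logit = b[i] + U[i][i] + sum((U[i][j] + U[j][i]) * rest[j - 1] for j in (1, 2))
assert abs(p_on - 1 / (1 + math.exp(-logit))) < 1e-12
```

The conditional check is an exact algebraic identity, which is what makes Gibbs-style updates of one unit at a time cheap in Boltzmann machines.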
The Boltzmann machine becomes more powerful when not all the variables are observed. In this case, the non-observed variables, or latent variables, can act similarly to hidden units in a multi-layer perceptron and model higher-order interactions among the visible units. Just as the addition of hidden units to convert logistic regression into an MLP results in the MLP being a universal approximator of functions, a Boltzmann machine with hidden units is no longer limited to modeling linear relationships between variables. Instead, the Boltzmann machine becomes a universal approximator of probability mass functions over discrete variables (Le Roux and Bengio, 2008).

Formally, we decompose the units x into two subsets: the visible units v and the latent (or hidden) units h. The energy function becomes

    E(v, h) = −v⊤Rv − v⊤Wh − h⊤Sh − b⊤v − c⊤h.    (20.3)
Boltzmann Machine Learning  Learning algorithms for Boltzmann machines are usually based on maximum likelihood. All Boltzmann machines have an intractable partition function, so the maximum likelihood gradient must be approximated using the techniques described in Chapter 18.

One interesting property of Boltzmann machines when trained with learning rules based on maximum likelihood is that the update for a particular weight connecting two units depends only on the statistics of those two units, collected
under different distributions: P_model(v) and P̂_data(v)P_model(h | v). The rest of the network participates in shaping those statistics, but the weight can be updated without knowing anything about the rest of the network or how those statistics were produced. This means that the learning rule is "local," which makes Boltzmann machine learning somewhat biologically plausible. It is conceivable that if each neuron were a random variable in a Boltzmann machine, then the axons and dendrites connecting two random variables could learn only by observing the firing pattern of the cells that they actually physically touch. In particular, in the positive phase, two units that frequently activate together have their connection strengthened.
This is an example of a Hebbian learning rule (Hebb, 1949) often summarized with the mnemonic "fire together, wire together." Hebbian learning rules are among the oldest hypothesized explanations for learning in biological systems and remain relevant today (Giudice et al., 2009).

Other learning algorithms that use more information than local statistics seem to require us to hypothesize the existence of more machinery than this. For example, for the brain to implement back-propagation in a multilayer perceptron, it seems necessary for the brain to maintain a secondary communication network for transmitting gradient information backwards through the network. Proposals for biologically plausible implementations (and approximations) of back-propagation have been made (Hinton, 2007a; Bengio, 2015) but remain to be validated, and Bengio (2015) links back-propagation of gradients to inference in energy-based models similar to the Boltzmann machine (but with continuous latent variables).
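The locality of the weight update can be made concrete: the maximum-likelihood gradient on the weight connecting units i and j involves only the correlation of those two units under the data distribution (positive phase) and under the model distribution (negative phase). The sketch below uses toy placeholder samples; in practice the model-distribution samples would come from MCMC, as described in Chapter 18.

```python
import numpy as np

# Illustrative sketch (not the book's code) of the local Boltzmann machine
# weight update: delta_U[i, j] is proportional to
# E_data[x_i x_j] - E_model[x_i x_j].
def weight_update(data_samples, model_samples, lr=0.01):
    pos = np.mean([np.outer(x, x) for x in data_samples], axis=0)   # positive phase
    neg = np.mean([np.outer(x, x) for x in model_samples], axis=0)  # negative phase
    return lr * (pos - neg)

data = [np.array([1, 1, 0]), np.array([1, 1, 1])]    # toy "data" samples
model = [np.array([0, 1, 0]), np.array([1, 0, 1])]   # toy "model" samples
dU = weight_update(data, model)
# Units 0 and 1 fire together in every data sample but never in the model
# samples, so their connection is strengthened: "fire together, wire together."
assert dU[0, 1] > 0
```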
The negative phase of Boltzmann machine learning is somewhat harder to explain from a biological point of view. As argued in Sec. 18.2, dream sleep may be a form of negative phase sampling. This idea is more speculative though.
20.2 Restricted Boltzmann Machines

Invented under the name harmonium (Smolensky, 1986), restricted Boltzmann machines are some of the most common building blocks of deep probabilistic models. We have briefly described RBMs previously, in Sec. 16.7.1. Here we review the previous information and go into more detail. RBMs are undirected probabilistic graphical models containing a layer of observable variables and a single layer of latent variables. RBMs may be stacked (one on top of the other) to form deeper models. See Fig. 20.1 for some examples. In particular, Fig. 20.1a shows the graph structure of the RBM itself. It is a bipartite graph, with no connections permitted between any variables in the observed layer or between any units in the latent layer.
Figure 20.1: Examples of models that may be built with restricted Boltzmann machines. (a) The restricted Boltzmann machine itself is an undirected graphical model based on a bipartite graph, with visible units in one part of the graph and hidden units in the other part. There are no connections among the visible units, nor any connections among the hidden units. Typically every visible unit is connected to every hidden unit, but it is possible to construct sparsely connected RBMs such as convolutional RBMs. (b) A deep belief network is a hybrid graphical model involving both directed and undirected connections. Like an RBM, it has no intra-layer connections. However, a DBN has multiple hidden layers, and thus there are connections between hidden units that are in separate layers. All of the local conditional probability distributions needed by the deep belief network are copied directly from the local conditional probability distributions of its constituent RBMs. Alternatively, we could also represent the deep belief network with a completely undirected graph, but it would need intra-layer connections to capture the dependencies between parents. (c) A deep Boltzmann machine is an undirected graphical model with several layers of latent variables. Like RBMs and DBNs, DBMs lack intra-layer connections. DBMs are less closely tied to RBMs than DBNs are. When initializing a DBM from a stack of RBMs, it is necessary to modify the RBM parameters slightly. Some kinds of DBMs may be trained without first training a set of RBMs.
We begin with the binary version of the restricted Boltzmann machine, but as we see later there are extensions to other types of visible and hidden units.

More formally, let the observed layer consist of a set of n_v binary random variables which we refer to collectively with the vector v. We refer to the latent or hidden layer of n_h binary random variables as h.

Like the general Boltzmann machine, the restricted Boltzmann machine is an energy-based model with the joint probability distribution specified by its energy function:

    P(v = v, h = h) = (1/Z) exp(−E(v, h)).    (20.4)

The energy function for an RBM is given by

    E(v, h) = −b⊤v − c⊤h − v⊤Wh,    (20.5)

and Z is the normalizing constant known as the partition function:

    Z = Σ_v Σ_h exp{−E(v, h)}.    (20.6)

It is apparent from the definition of the partition function Z that the naive method of computing Z (exhaustively summing over all states) could be computationally intractable, unless a cleverly designed algorithm could exploit regularities in the probability distribution to compute Z faster. In the case of restricted Boltzmann machines, Long and Servedio (2010) formally proved that the partition function Z is intractable. The intractable partition function Z implies that the normalized joint probability distribution P(v) is also intractable to evaluate.
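The naive computation of Z in Eq. 20.6 can be written down directly, which also makes its exponential cost visible: the double sum visits 2^(n_v + n_h) states. The sketch below (with illustrative names and toy sizes of our own choosing) is only feasible because the model is tiny.

```python
import numpy as np
from itertools import product

def rbm_energy(v, h, W, b, c):
    """RBM energy of Eq. 20.5: E(v, h) = -b^T v - c^T h - v^T W h."""
    return -b @ v - c @ h - v @ W @ h

def naive_partition(W, b, c):
    """Exhaustive sum of Eq. 20.6: 2^n_v * 2^n_h terms."""
    n_v, n_h = W.shape
    Z = 0.0
    for v in product([0, 1], repeat=n_v):       # 2^n_v visible states...
        for h in product([0, 1], repeat=n_h):   # ...times 2^n_h hidden states
            Z += np.exp(-rbm_energy(np.array(v), np.array(h), W, b, c))
    return Z

rng = np.random.default_rng(0)
n_v, n_h = 3, 2
W = rng.normal(scale=0.1, size=(n_v, n_h))
b, c = np.zeros(n_v), np.zeros(n_h)
Z = naive_partition(W, b, c)   # already 2^5 = 32 terms; grows exponentially
```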
20.2.1 Conditional Distributions

Though P(v) is intractable, the bipartite graph structure of the RBM has the very special property that its conditional distributions P(h | v) and P(v | h) are factorial and relatively simple to compute and to sample from.

Deriving the conditional distributions from the joint distribution is straightforward:

    P(h | v) = P(h, v) / P(v)    (20.7)
             = (1 / P(v)) (1/Z) exp{b⊤v + c⊤h + v⊤Wh}    (20.8)
             = (1/Z′) exp{c⊤h + v⊤Wh}    (20.9)
             = (1/Z′) exp{ Σ_{j=1}^{n_h} c_j h_j + Σ_{j=1}^{n_h} v⊤W_{:,j} h_j }    (20.10)
             = (1/Z′) Π_{j=1}^{n_h} exp{ c_j h_j + v⊤W_{:,j} h_j }.    (20.11)

Since we are conditioning on the visible units v, we can treat these as constant with respect to the distribution P(h | v). The factorial nature of the conditional P(h | v) follows immediately from our ability to write the joint probability over the vector h as the product of (unnormalized) distributions over the individual elements, h_j. It is now a simple matter of normalizing the distributions over the individual binary h_j:

    P(h_j = 1 | v) = P̃(h_j = 1 | v) / (P̃(h_j = 0 | v) + P̃(h_j = 1 | v))    (20.12)
                   = exp{c_j + v⊤W_{:,j}} / (exp{0} + exp{c_j + v⊤W_{:,j}})    (20.13)
                   = σ(c_j + v⊤W_{:,j}).    (20.14)

We can now express the full conditional over the hidden layer as the factorial distribution:

    P(h | v) = Π_{j=1}^{n_h} σ((2h − 1) ⊙ (c + W⊤v))_j.    (20.15)
P (h v) = A similar deriv derivation ation |will show that the−othercondition of interest to us, P ( v | h), is also a factorial distribution: A similar derivation will show that the other condition of interest to us, P ( v h), nv Y is also a factorial distribution:Y | (20.16) P (v | h) = σ ((2v − 1) (b + W h))i . =
20.2.2 Training Restricted Boltzmann Machines
Because the RBM admits efficient evaluation and differentiation of P̃(v) and efficient MCMC sampling in the form of block Gibbs sampling, it can readily be trained with any of the techniques described in Chapter 18 for training models that have intractable partition functions. This includes CD, SML (PCD), ratio matching and so on. Compared to other undirected models used in deep learning, the RBM is relatively straightforward to train because we can compute P(h | v)
exactly in closed form. Some other deep models, such as the deep Boltzmann machine, combine both the difficulty of an intractable partition function and the difficulty of intractable inference.
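Of the training techniques listed above, CD with a single Gibbs step (CD-1) is the simplest to sketch. The following toy implementation is our own illustration of the update rule, not a reference implementation; names, sizes, and the learning rate are arbitrary choices.

```python
import numpy as np

# Minimal CD-1 (contrastive divergence with one block Gibbs step) sketch
# for a binary RBM, using the factorial conditionals of Eqs. 20.14 / 20.16.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, rng, lr=0.1):
    ph0 = sigmoid(c + v0 @ W)                     # positive phase: P(h=1 | v0)
    h0 = (rng.random(ph0.shape) < ph0) * 1.0      # sample h ~ P(h | v0)
    v1 = (rng.random(b.shape) < sigmoid(b + W @ h0)) * 1.0   # reconstruct v
    ph1 = sigmoid(c + v1 @ W)                     # negative phase: P(h=1 | v1)
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b += lr * (v0 - v1)
    c += lr * (ph0 - ph1)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(4, 3))
b, c = np.zeros(4), np.zeros(3)
v0 = np.array([1.0, 1.0, 0.0, 0.0])
for _ in range(200):                              # fit a single toy pattern
    cd1_update(v0, W, b, c, rng)
```

After these updates the visible biases shift toward the training pattern: the biases of the "on" units grow relative to the biases of the "off" units.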
20.3 Deep Belief Networks
Deep belief networks (DBNs) were one of the first non-convolutional models to successfully admit training of deep architectures (Hinton et al., 2006; Hinton, 2007b). The introduction of deep belief networks in 2006 began the current deep learning renaissance. Prior to the introduction of deep belief networks, deep models were considered too difficult to optimize. Kernel machines with convex objective functions dominated the research landscape. Deep belief networks demonstrated that deep architectures can be successful, by outperforming kernelized support vector machines on the MNIST dataset (Hinton et al., 2006). Today, deep belief networks have mostly fallen out of favor and are rarely used, even compared to other unsupervised or generative learning algorithms, but they are still deservedly recognized for their important role in deep learning history.
Deep belief networks are generative models with several layers of latent variables. The latent variables are typically binary, while the visible units may be binary or real. There are no intra-layer connections. Usually, every unit in each layer is connected to every unit in each neighboring layer, though it is possible to construct more sparsely connected DBNs. The connections between the top two layers are undirected. The connections between all other layers are directed, with the arrows pointed toward the layer that is closest to the data. See Fig. 20.1b for an example.

A DBN with l hidden layers contains l weight matrices: W^(1), …, W^(l). It also contains l + 1 bias vectors: b^(0), …, b^(l), with b^(0) providing the biases for the visible layer. The probability distribution represented by the DBN is given by
    P(h^(l), h^(l−1)) ∝ exp(b^(l)⊤ h^(l) + b^(l−1)⊤ h^(l−1) + h^(l−1)⊤ W^(l) h^(l)),    (20.17)

    P(h_i^(k) = 1 | h^(k+1)) = σ(b_i^(k) + W_{:,i}^{(k+1)⊤} h^(k+1))  ∀i, ∀k ∈ 1, …, l − 2,    (20.18)

    P(v_i = 1 | h^(1)) = σ(b_i^(0) + W_{:,i}^{(1)⊤} h^(1))  ∀i.    (20.19)

In the case of real-valued visible units, substitute

    v ∼ N(v; b^(0) + W^(1)⊤ h^(1), β^−1),    (20.20)
with β diagonal for tractability. Generalizations to other exponential family visible units are straightforward, at least in theory. A DBN with only one hidden layer is just an RBM.

To generate a sample from a DBN, we first run several steps of Gibbs sampling on the top two hidden layers. This stage is essentially drawing a sample from the RBM defined by the top two hidden layers. We can then use a single pass of ancestral sampling through the rest of the model to draw a sample from the visible units.

Deep belief networks incur many of the problems associated with both directed models and undirected models.

Inference in a deep belief network is intractable due to the explaining away effect within each directed layer, and due to the interaction between the two hidden layers that have undirected connections.
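The two-stage sampling procedure described above (Gibbs sampling in the top RBM, then one ancestral pass down through the directed connections) can be sketched for a hypothetical DBN with two hidden layers. All parameter names and sizes below are illustrative assumptions, not from the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_dbn(W1, W2, b0, b1, b2, n_gibbs, rng):
    """Draw one visible sample from a DBN with hidden layers h1, h2."""
    n_h1, n_h2 = W2.shape
    h1 = (rng.random(n_h1) < 0.5) * 1.0           # arbitrary initialization
    h2 = (rng.random(n_h2) < 0.5) * 1.0
    for _ in range(n_gibbs):                      # block Gibbs in the top RBM
        h1 = (rng.random(n_h1) < sigmoid(b1 + W2 @ h2)) * 1.0
        h2 = (rng.random(n_h2) < sigmoid(b2 + h1 @ W2)) * 1.0
    pv = sigmoid(b0 + W1 @ h1)                    # directed pass (Eq. 20.19)
    return (rng.random(len(b0)) < pv) * 1.0       # single ancestral sample

rng = np.random.default_rng(0)
n_v, n_h1, n_h2 = 5, 4, 3
W1 = rng.normal(size=(n_v, n_h1))
W2 = rng.normal(size=(n_h1, n_h2))
v = sample_dbn(W1, W2, np.zeros(n_v), np.zeros(n_h1), np.zeros(n_h2), 50, rng)
```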
Evaluating or maximizing the standard evidence lower bound on the log-likelihood is also intractable, because the evidence lower bound takes the expectation of cliques whose size is equal to the network width.

Evaluating or maximizing the log-likelihood requires not just confronting the problem of intractable inference to marginalize out the latent variables, but also the problem of an intractable partition function within the undirected model of the top two layers.

To train a deep belief network, one begins by training an RBM to maximize E_{v∼p_data} log p(v) using contrastive divergence or stochastic maximum likelihood. The parameters of the RBM then define the parameters of the first layer of the DBN. Next, a second RBM is trained to approximately maximize

    E_{v∼p_data} E_{h^(1)∼p^(1)(h^(1)|v)} log p^(2)(h^(1)),    (20.21)
where p^(1) is the probability distribution represented by the first RBM and p^(2) is the probability distribution represented by the second RBM. In other words, the second RBM is trained to model the distribution defined by sampling the hidden units of the first RBM, when the first RBM is driven by the data. This procedure can be repeated indefinitely, to add as many layers to the DBN as desired, with each new RBM modeling the samples of the previous one. Each RBM defines another layer of the DBN. This procedure can be justified as increasing a variational lower bound on the log-likelihood of the data under the DBN (Hinton et al., 2006).

In most applications, no effort is made to jointly train the DBN after the greedy layer-wise procedure is complete. However, it is possible to perform generative fine-tuning using the wake-sleep algorithm.
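The greedy layer-wise procedure can be sketched at a high level: train an RBM on the data, drive it with the data to produce hidden activities, train the next RBM on those, and repeat. The CD-1 trainer inside is a deliberately crude stand-in (our own toy choices of epochs, learning rate, and passing hidden probabilities rather than samples upward, a common practical shortcut), not a reference implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, rng, epochs=50, lr=0.1):
    """Toy CD-1 trainer; returns (W, b, c) for one RBM layer."""
    n_visible = data.shape[1]
    W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
    b, c = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        for v0 in data:
            ph0 = sigmoid(c + v0 @ W)
            h0 = (rng.random(n_hidden) < ph0) * 1.0
            v1 = (rng.random(n_visible) < sigmoid(b + W @ h0)) * 1.0
            ph1 = sigmoid(c + v1 @ W)
            W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
            b += lr * (v0 - v1)
            c += lr * (ph0 - ph1)
    return W, b, c

def train_dbn(data, layer_sizes, rng):
    """Greedy layer-wise stacking: each RBM models the previous one's
    hidden activities when driven by the data (cf. Eq. 20.21)."""
    rbms, layer_input = [], data
    for n_hidden in layer_sizes:
        W, b, c = train_rbm(layer_input, n_hidden, rng)
        rbms.append((W, b, c))
        layer_input = sigmoid(c + layer_input @ W)   # pass activities up
    return rbms

rng = np.random.default_rng(0)
data = np.array([[1., 1., 0., 0.], [0., 0., 1., 1.]] * 4)
rbms = train_dbn(data, [3, 2], rng)
```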
The trained DBN may be used directly as a generative model, but most of the interest in DBNs arose from their ability to improve classification models. We can take the weights from the DBN and use them to define an MLP:

    h^(1) = σ(b^(1) + v⊤W^(1)),    (20.22)

    h^(l) = σ(b^(l) + h^(l−1)⊤W^(l))  ∀l ∈ 2, …, m.    (20.23)
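Eqs. 20.22 and 20.23 amount to an ordinary deterministic feedforward pass through the DBN's weights. A minimal sketch, where `weights` is an assumed list of (W, b) pairs, one per layer:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dbn_to_mlp_forward(v, weights):
    """Eq. 20.22 then Eq. 20.23: h = sigmoid(b + h_prev^T W) per layer."""
    h = v
    for W, b in weights:
        h = sigmoid(b + h @ W)
    return h

rng = np.random.default_rng(0)
weights = [(rng.normal(size=(6, 4)), np.zeros(4)),   # illustrative two-layer DBN
           (rng.normal(size=(4, 3)), np.zeros(3))]
out = dbn_to_mlp_forward(rng.random(6), weights)     # activations of top layer
```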
After initializing this the weigh weights via generativ generative 2, . . . ,learned m, (20.23)e h MLP = σ bwith +h Wts andl biases training of the DBN, we ma may y train the ∈ a classification task. This MLP to p∀erform After initializing this MLP with the weigh ts and biases learned via generative additional training of the MLP is an example of discriminativ discriminative e fine-tuning. training of the DBN, we may train the MLP to perform a classification task. This This sp specific ecific choice of MLP is somewhat arbitrary arbitrary, , compared to many of the is discriminativ additional training of the MLP an example of e fine-tuning. inference equations in Chapter 19 that are deriv derived ed from first principles. This MLP specific choice of seems MLP is , compared to many of the is aThis heuristic choice that to somewhat work well arbitrary in practice and is used consistently inference equationsMany in Chapter 19 thatinference are derivtechniques ed from firstare principles. This in the literature. approximate motiv motivated ated by MLP their is a heuristic choice that seems to work well in practice and is used consistently abilit ability y to find a maximally tight variational lo low wer bound on the log-likelihoo log-likelihood d in the some literature. approximate are low motiv ated byontheir under set of Many constraints. One caninference constructtechniques a variational lower er bound the abilit yeliho to find a maximally tight lowdefined er bound log-likelihoo d log-lik log-likeliho elihoo od using the hidden unitvariational exp expectations ectations by on thethe DBN’s MLP MLP,, but underis some setany of constraints. can construct a vhidden ariational lower bound this true of probability One distribution ov over er the units, and thereonis the no log-lik eliho o d using the hidden unit exp ectations defined by the DBN’s MLP , but reason to believe that this MLP provides a particularly tight bound. 
In particular, the MLP ignores many important interactions in the DBN graphical model. The MLP propagates information upward from the visible units to the deepest hidden units, but does not propagate any information downward or sideways. The DBN graphical model has explaining away interactions between all of the hidden units within the same layer as well as top-down interactions between layers.

While the log-likelihood of a DBN is intractable, it may be approximated with AIS (Salakhutdinov and Murray, 2008). This permits evaluating its quality as a generative model.

The term “deep belief network” is commonly used incorrectly to refer to any kind of deep neural network, even networks without latent variable semantics.
The term “deep belief network” should refer specifically to models with undirected connections in the deepest layer and directed connections pointing downward between all other pairs of consecutive layers.

The term “deep belief network” may also cause some confusion because the term “belief network” is sometimes used to refer to purely directed models, while deep belief networks contain an undirected layer. Deep belief networks also share the acronym DBN with dynamic Bayesian networks (Dean and Kanazawa, 1989), which are Bayesian networks for representing Markov chains.
CHAPTER 20. DEEP GENERATIVE MODELS
Figure 20.2: The graphical model for a deep Boltzmann machine with one visible layer (bottom) and two hidden layers. Connections are only between units in neighboring layers. There are no intra-layer connections.
20.4 Deep Boltzmann Machines
A deep Boltzmann machine or DBM (Salakhutdinov and Hinton, 2009a) is another kind of deep, generative model. Unlike the deep belief network (DBN), it is an entirely undirected model. Unlike the RBM, the DBM has several layers of latent variables (RBMs have just one). But like the RBM, within each layer, each of the variables are mutually independent, conditioned on the variables in the neighboring layers. See Fig. 20.2 for the graph structure. Deep Boltzmann machines have been applied to a variety of tasks including document modeling (Srivastava et al., 2013).

Like RBMs and DBNs, DBMs typically contain only binary units—as we assume for simplicity of our presentation of the model—but it is straightforward to include real-valued visible units.

A DBM is an energy-based model, meaning that the joint probability distribution over the model variables is parametrized by an energy function E. In the case of a deep Boltzmann machine with one visible layer, v, and three hidden layers, h^{(1)}, h^{(2)} and h^{(3)}, the joint probability is given by:

P(v, h^{(1)}, h^{(2)}, h^{(3)}) = (1/Z(θ)) exp(−E(v, h^{(1)}, h^{(2)}, h^{(3)}; θ)).   (20.24)

To simplify our presentation, we omit the bias parameters below. The DBM energy function is then defined as follows:

E(v, h^{(1)}, h^{(2)}, h^{(3)}; θ) = −v^⊤ W^{(1)} h^{(1)} − h^{(1)⊤} W^{(2)} h^{(2)} − h^{(2)⊤} W^{(3)} h^{(3)}.   (20.25)
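For concreteness, a minimal NumPy sketch (our illustration, not the book's code) of the energy in Eq. 20.25 and the corresponding unnormalized probability from Eq. 20.24; the weight-matrix shapes are our own convention:

```python
import numpy as np

def dbm_energy(v, h1, h2, h3, W1, W2, W3):
    """Energy of a three-hidden-layer DBM (Eq. 20.25, biases omitted).

    W1: (n_v, n_h1), W2: (n_h1, n_h2), W3: (n_h2, n_h3).
    """
    return -(v @ W1 @ h1) - (h1 @ W2 @ h2) - (h2 @ W3 @ h3)

def unnormalized_prob(v, h1, h2, h3, W1, W2, W3):
    """exp(-E); dividing by the partition function Z(theta) gives Eq. 20.24."""
    return np.exp(-dbm_energy(v, h1, h2, h3, W1, W2, W3))
```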
Figure 20.3: A deep Boltzmann machine, re-arranged to reveal its bipartite graph structure.
In comparison to the RBM energy function (Eq. 20.5), the DBM energy function includes connections between the hidden units (latent variables) in the form of the weight matrices (W^{(2)} and W^{(3)}). As we will see, these connections have significant consequences for both the model behavior as well as how we go about performing inference in the model.

In comparison to fully connected Boltzmann machines (with every unit connected to every other unit), the DBM offers some advantages that are similar to those offered by the RBM. Specifically, as illustrated in Fig. 20.3, the DBM layers can be organized into a bipartite graph, with odd layers on one side and even layers on the other. This immediately implies that when we condition on the variables in the even layers, the variables in the odd layers become conditionally independent. Of course, when we condition on the variables in the odd layers, the variables in the even layers also become conditionally independent.

The bipartite structure of the DBM means that we can apply the same equations we have previously used for the conditional distributions of an RBM to determine the conditional distributions in a DBM. The units within a layer are conditionally independent from each other given the values of the neighboring layers, so the distributions over binary variables can be fully described by the Bernoulli parameters giving the probability of each unit being active. In our example with two hidden layers, the activation probabilities are given by:

P(v_i = 1 | h^{(1)}) = σ(W_{i,:}^{(1)} h^{(1)}),   (20.26)
and

P(h_i^{(1)} = 1 | v, h^{(2)}) = σ(v^⊤ W_{:,i}^{(1)} + W_{i,:}^{(2)} h^{(2)}),   (20.27)
P(h_k^{(2)} = 1 | h^{(1)}) = σ(h^{(1)⊤} W_{:,k}^{(2)}).   (20.28)

The bipartite structure makes Gibbs sampling in a deep Boltzmann machine efficient. The naive approach to Gibbs sampling is to update only one variable at a time. RBMs allow all of the visible units to be updated in one block and all of the hidden units to be updated in a second block. One might naively assume that a DBM with l layers requires l + 1 updates, with each iteration updating a block consisting of one layer of units. Instead, it is possible to update all of the units in only two iterations. Gibbs sampling can be divided into two blocks of updates, one including all even layers (including the visible layer) and the other including all odd layers. Due to the bipartite DBM connection pattern, given the even layers, the distribution over the odd layers is factorial and thus can be sampled simultaneously and independently as a block. Likewise, given the odd layers, the even layers can be sampled simultaneously and independently as a block. Efficient sampling is especially important for training with the stochastic maximum likelihood algorithm.
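A minimal sketch of this two-block Gibbs sweep for the two-hidden-layer example, using the conditionals of Eqs. 20.26–20.28 (the shapes of `W1` and `W2` and the use of a NumPy `Generator` are our own conventions, not from the book):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sweep(v, h1, h2, W1, W2, rng):
    """One two-block Gibbs sweep of a DBM with two hidden layers.

    W1: shape (n_v, n_h1), W2: shape (n_h1, n_h2); biases omitted.
    """
    # Odd block: sample h1 given (v, h2) -- Eq. 20.27.
    p_h1 = sigmoid(v @ W1 + W2 @ h2)
    h1 = (rng.random(p_h1.shape) < p_h1).astype(float)
    # Even block: v and h2 are conditionally independent given h1,
    # so they can be sampled simultaneously -- Eqs. 20.26 and 20.28.
    v = (rng.random(W1.shape[0]) < sigmoid(W1 @ h1)).astype(float)
    h2 = (rng.random(W2.shape[1]) < sigmoid(h1 @ W2)).astype(float)
    return v, h1, h2
```

Iterating this sweep realizes the "two iterations per full update" schedule described above.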
20.4.1 Interesting Properties

Deep Boltzmann machines have many interesting properties.

DBMs were developed after DBNs. Compared to DBNs, the posterior distribution P(h | v) is simpler for DBMs. Somewhat counterintuitively, the simplicity of this posterior distribution allows richer approximations of the posterior. In the case of the DBN, we perform classification using a heuristically motivated approximate inference procedure, in which we guess that a reasonable value for the mean field expectation of the hidden units can be provided by an upward pass through the network in an MLP that uses sigmoid activation functions and the same weights as the original DBN. Any distribution Q(h) may be used to obtain a variational lower bound on the log-likelihood. This heuristic procedure therefore allows us to obtain such a bound. However, the bound is not explicitly optimized in any way, so the bound may be far from tight. In particular, the heuristic estimate of Q ignores interactions between hidden units within the same layer as well as the top-down feedback influence of hidden units in deeper layers on hidden units that are closer to the input. Because the heuristic MLP-based inference procedure in the DBN is not able to account for these interactions, the resulting Q is presumably far
from optimal. In DBMs, all of the hidden units within a layer are conditionally independent given the other layers. This lack of intra-layer interaction makes it possible to use fixed point equations to actually optimize the variational lower bound and find the true optimal mean field expectations (to within some numerical tolerance).

The use of proper mean field allows the approximate inference procedure for DBMs to capture the influence of top-down feedback interactions. This makes DBMs interesting from the point of view of neuroscience, because the human brain is known to use many top-down feedback connections. Because of this property, DBMs have been used as computational models of real neuroscientific phenomena (Series et al., 2010; Reichert et al., 2011).

One unfortunate property of DBMs is that sampling from them is relatively difficult. DBNs only need to use MCMC sampling in their top pair of layers. The other layers are used only at the end of the sampling process, in one efficient ancestral sampling pass. To generate a sample from a DBM, it is necessary to use MCMC across all layers, with every layer of the model participating in every Markov chain transition.
20.4.2 DBM Mean Field Inference

The conditional distribution over one DBM layer given the neighboring layers is factorial. In the example of the DBM with two hidden layers, these distributions are P(v | h^{(1)}), P(h^{(1)} | v, h^{(2)}) and P(h^{(2)} | h^{(1)}). The distribution over all hidden layers generally does not factorize because of interactions between layers. In the example with two hidden layers, P(h^{(1)}, h^{(2)} | v) does not factorize due to the interaction weights W^{(2)} between h^{(1)} and h^{(2)}, which render these variables mutually dependent.

As was the case with the DBN, we are left to seek out methods to approximate the DBM posterior distribution. However, unlike the DBN, the DBM posterior distribution over the hidden units—while complicated—is easy to approximate with a variational approximation (as discussed in Sec. 19.4), specifically a mean field approximation. The mean field approximation is a simple form of variational inference, where we restrict the approximating distribution to fully factorial distributions. In the context of DBMs, the mean field equations capture the bidirectional interactions between layers. In this section we derive the iterative approximate inference procedure originally introduced in Salakhutdinov and Hinton (2009a).

In variational approximations to inference, we approach the task of approximating a particular target distribution—in our case, the posterior distribution over
the hidden units given the visible units—by some reasonably simple family of distributions. In the case of the mean field approximation, the approximating family is the set of distributions where the hidden units are conditionally independent.

We now develop the mean field approach for the example with two hidden layers. Let Q(h^{(1)}, h^{(2)} | v) be the approximation of P(h^{(1)}, h^{(2)} | v). The mean field assumption implies that

Q(h^{(1)}, h^{(2)} | v) = ∏_j Q(h_j^{(1)} | v) ∏_k Q(h_k^{(2)} | v).   (20.29)

The mean field approximation attempts to find a member of this family of distributions that best fits the true posterior P(h^{(1)}, h^{(2)} | v). Importantly, the inference process must be run again to find a different distribution Q every time we use a new value of v.

One can conceive of many ways of measuring how well Q(h | v) fits P(h | v). The mean field approach is to minimize

KL(Q‖P) = ∑_h Q(h^{(1)}, h^{(2)} | v) log ( Q(h^{(1)}, h^{(2)} | v) / P(h^{(1)}, h^{(2)} | v) ).   (20.30)

In general, we do not have to provide a parametric form of the approximating distribution beyond enforcing the independence assumptions. The variational approximation procedure is generally able to recover a functional form of the approximate distribution. However, in the case of a mean field assumption on binary hidden units (the case we are developing here) there is no loss of generality resulting from fixing a parametrization of the model in advance.

We parametrize Q as a product of Bernoulli distributions, that is we associate the probability of each element of h^{(1)} with a parameter. Specifically, for each j, ĥ_j^{(1)} = Q(h_j^{(1)} = 1 | v), where ĥ_j^{(1)} ∈ [0, 1], and for each k, ĥ_k^{(2)} = Q(h_k^{(2)} = 1 | v), where ĥ_k^{(2)} ∈ [0, 1]. Thus we have the following approximation to the posterior:

Q(h^{(1)}, h^{(2)} | v) = ∏_j Q(h_j^{(1)} | v) ∏_k Q(h_k^{(2)} | v)   (20.31)
= ∏_j (ĥ_j^{(1)})^{h_j^{(1)}} (1 − ĥ_j^{(1)})^{(1 − h_j^{(1)})} × ∏_k (ĥ_k^{(2)})^{h_k^{(2)}} (1 − ĥ_k^{(2)})^{(1 − h_k^{(2)})}.   (20.32)

Of course, for DBMs with more layers the approximate posterior parametrization can be extended in the obvious way, exploiting the bipartite structure of the graph
to update all of the even layers simultaneously and then to update all of the odd layers simultaneously, following the same schedule as Gibbs sampling.

Now that we have specified our family of approximating distributions Q, it remains to specify a procedure for choosing the member of this family that best fits P. The most straightforward way to do this is to use the mean field equations specified by Eq. 19.56. These equations were derived by solving for where the derivatives of the variational lower bound are zero. They describe in an abstract manner how to optimize the variational lower bound for any model, simply by taking expectations with respect to Q.

Applying these general equations, we obtain the update rules (again, ignoring bias terms):

ĥ_j^{(1)} = σ( ∑_i v_i W_{i,j}^{(1)} + ∑_{k'} W_{j,k'}^{(2)} ĥ_{k'}^{(2)} ),  ∀j,   (20.33)
ĥ_k^{(2)} = σ( ∑_{j'} W_{j',k}^{(2)} ĥ_{j'}^{(1)} ),  ∀k.   (20.34)

At a fixed point of this system of equations, we have a local maximum of the variational lower bound L(Q). Thus these fixed point update equations define an iterative algorithm where we alternate updates of ĥ^{(1)} (using Eq. 20.33) and updates of ĥ^{(2)} (using Eq. 20.34). On small problems such as MNIST, as few as ten iterations can be sufficient to find an approximate positive phase gradient for learning, and fifty usually suffice to obtain a high quality representation of a single specific example to be used for high-accuracy classification. Extending approximate variational inference to deeper DBMs is straightforward.
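The alternating fixed-point iteration might be sketched as follows (our illustration; initializing ĥ^{(2)} to 0.5 and the default of ten iterations are arbitrary choices, not prescribed by the text):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field_inference(v, W1, W2, n_iter=10):
    """Alternate the fixed-point updates of Eqs. 20.33-20.34.

    W1: shape (n_v, n_h1), W2: shape (n_h1, n_h2); biases omitted.
    Returns the Bernoulli parameters (h1_hat, h2_hat) of Q(h | v).
    """
    h2_hat = np.full(W2.shape[1], 0.5)            # arbitrary starting point
    for _ in range(n_iter):
        h1_hat = sigmoid(v @ W1 + W2 @ h2_hat)    # Eq. 20.33
        h2_hat = sigmoid(h1_hat @ W2)             # Eq. 20.34
    return h1_hat, h2_hat
```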
20.4.3 DBM Parameter Learning
Learning in the DBM must confront both the challenge of an intractable partition function, using the techniques from Chapter 18, and the challenge of an intractable posterior distribution, using the techniques from Chapter 19.

As described in Sec. 20.4.2, variational inference allows the construction of a distribution Q(h | v) that approximates the intractable P(h | v). Learning then proceeds by maximizing L(v, Q, θ), the variational lower bound on the intractable log-likelihood, log P(v; θ).
For a deep Boltzmann machine with tw two o hidden lay layers, ers, L is giv given en by XX X X (1) (2) (2) (1) (1) orθa) = deep Boltzmann withˆ o hidden ˆ layers, is given by L(F Q, vi Wi,j0 hˆmachine + htw 0 j j 0 W j0 ,k0h k0 − log Z (θ ) + H (Q). (20.35) L i j0 j0 k 0 ˆ (Q, θ) = log Z (θ) + (Q). (20.35) v W hˆ + h W hˆ This a deep L expression still contains the log partition function, − log Z( θ). Because H Boltzmann mac machine hine contains restricted Boltzmann machines as comp components, onents, the This expression contains thethe logpartition partitionfunction function, a deep logsampling Z( θ). Because hardness resultsstill for computing and that apply to X X X X Boltzmann mac hine contains restricted Boltzmann machines as comp onents, the restricted Boltzmann mac machines hines also apply to deep Boltzmann machines. This means hardness results for computing partition function and sampling that to that ev evaluating aluating the probabilitythe mass function of a Boltzmann mac machine hine apply requires restricted Boltzmann mac hines also apply to deep Boltzmann machines. This means appro approximate ximate metho methods ds suc such h as annealed imp importance ortance sampling. Likewise, training thatmo evdel aluating theapproximations probability mass function of Boltzmann machine requires the model requires to the gradien gradient t ofa the log partition function. See appro ximate metho ds suc h as annealed imp ortance sampling. Likewise, training Chapter 18 for a general description of these metho methods. ds. DBMs are typically trained the mo del requires approximations to the gradien t of the logtechniques partition function. using sto stocchastic maxim maximum um lik likelihoo elihoo elihood. d. Many of the other describ described edSee in Chapter 18 for a general description of these metho ds. DBMs are t ypically trained Chapter 18 are not applicable. Techniques such as pseudolikelihoo pseudolikelihood d require the usingysto maxim likelihood. 
ability to evaluate the unnormalized probabilities, rather than merely obtain a variational lower bound on them. Contrastive divergence is slow for deep Boltzmann machines because they do not allow efficient sampling of the hidden units given the visible units—instead, contrastive divergence would require burning in a Markov chain every time a new negative phase sample is needed.

The non-variational version of the stochastic maximum likelihood algorithm was discussed earlier, in Sec. 18.2. Variational stochastic maximum likelihood as applied to the DBM is given in Algorithm 20.1. Recall that we describe a simplified variant of the DBM that lacks bias parameters; including them is trivial.
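The tractable part of Eq. 20.35—the two interaction terms plus the entropy of the factorial mean field distribution Q—is cheap to compute; only the −log Z(θ) term requires approximation. A minimal NumPy sketch (the function name and array shapes are our own, not from the text):

```python
import numpy as np

def dbm_lower_bound_tractable(v, h1_hat, h2_hat, W1, W2, eps=1e-12):
    """Tractable terms of the variational bound in Eq. 20.35 for a
    bias-free two-layer DBM: the two interaction terms plus the entropy
    H(Q) of the factorial Bernoulli mean field distribution.  The
    remaining -log Z(theta) term must be estimated separately, e.g.
    with annealed importance sampling."""
    interaction = v @ W1 @ h1_hat + h1_hat @ W2 @ h2_hat
    q = np.concatenate([h1_hat, h2_hat])
    entropy = -np.sum(q * np.log(q + eps) + (1 - q) * np.log(1 - q + eps))
    return interaction + entropy
```

With zero weights and all mean field parameters at 0.5, the bound reduces to the maximum entropy of the hidden units, (n₁ + n₂) log 2.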
20.4.4
Layer-Wise Pretraining
Unfortunately, training a DBM using stochastic maximum likelihood (as described above) from a random initialization usually results in failure. In some cases, the model fails to learn to represent the distribution adequately. In other cases, the DBM may represent the distribution well, but with no higher likelihood than could be obtained with just an RBM. A DBM with very small weights in all but the first layer represents approximately the same distribution as an RBM.

Various techniques that permit joint training have been developed and are described in Sec. 20.4.5. However, the original and most popular method for overcoming the joint training problem of DBMs is greedy layer-wise pretraining. In this method, each layer of the DBM is trained in isolation as an RBM. The first layer is trained to model the input data. Each subsequent RBM is trained to
model samples from the previous RBM's posterior distribution. After all of the
Algorithm 20.1 The variational stochastic maximum likelihood algorithm for training a DBM with two hidden layers.

Set ε, the step size, to a small positive number.
Set k, the number of Gibbs steps, high enough to allow a Markov chain of p(v, h^(1), h^(2); θ + ε∆θ) to burn in, starting from samples from p(v, h^(1), h^(2); θ).
Initialize three matrices, Ṽ, H̃^(1) and H̃^(2), each with m columns set to random values (e.g., from Bernoulli distributions, possibly with marginals matched to the model's marginals).
while not converged (learning loop) do
    Sample a minibatch of m examples from the training data and arrange them as the rows of a design matrix V.
    Initialize matrices Ĥ^(1) and Ĥ^(2), possibly to the model's marginals.
    while not converged (mean field inference loop) do
        Ĥ^(1) ← σ(V W^(1) + Ĥ^(2) W^(2)⊤).
        Ĥ^(2) ← σ(Ĥ^(1) W^(2)).
    end while
    ∆W^(1) ← (1/m) V⊤ Ĥ^(1)
    ∆W^(2) ← (1/m) Ĥ^(1)⊤ Ĥ^(2)
    for l = 1 to k (Gibbs sampling) do
        Gibbs block 1:
        ∀i, j, sample Ṽ_i,j from P(Ṽ_i,j = 1) = σ(W^(1)_j,: (H̃^(1)_i,:)⊤).
        ∀i, j, sample H̃^(2)_i,j from P(H̃^(2)_i,j = 1) = σ(H̃^(1)_i,: W^(2)_:,j).
        Gibbs block 2:
        ∀i, j, sample H̃^(1)_i,j from P(H̃^(1)_i,j = 1) = σ(Ṽ_i,: W^(1)_:,j + H̃^(2)_i,: (W^(2)_j,:)⊤).
    end for
    ∆W^(1) ← ∆W^(1) − (1/m) Ṽ⊤ H̃^(1)
    ∆W^(2) ← ∆W^(2) − (1/m) H̃^(1)⊤ H̃^(2)
    W^(1) ← W^(1) + ε∆W^(1) (this is a cartoon illustration; in practice use a more effective algorithm, such as momentum with a decaying learning rate)
    W^(2) ← W^(2) + ε∆W^(2)
end while
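A rough NumPy transcription of one learning step of Algorithm 20.1 may help make the matrix shapes concrete (function and variable names are ours; fixed iteration counts stand in for the convergence tests, and the plain gradient step is the same cartoon update used in the algorithm):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def variational_sml_step(V, W1, W2, V_t, H1_t, H2_t, rng,
                         step_size=1e-3, k=5, mf_steps=10):
    """One step of variational SML for a bias-free two-layer DBM.
    V: minibatch design matrix (m x n_v).  V_t, H1_t, H2_t: persistent
    negative-phase particles (float arrays), updated in place."""
    m = V.shape[0]
    # Mean field inference loop (a fixed number of sweeps for simplicity).
    H1 = np.full((m, W1.shape[1]), 0.5)
    H2 = np.full((m, W2.shape[1]), 0.5)
    for _ in range(mf_steps):
        H1 = sigmoid(V @ W1 + H2 @ W2.T)
        H2 = sigmoid(H1 @ W2)
    # Positive phase statistics.
    dW1 = V.T @ H1 / m
    dW2 = H1.T @ H2 / m
    # Negative phase: k steps of block Gibbs sampling on the particles.
    for _ in range(k):
        V_t[...] = rng.random(V_t.shape) < sigmoid(H1_t @ W1.T)
        H2_t[...] = rng.random(H2_t.shape) < sigmoid(H1_t @ W2)
        H1_t[...] = rng.random(H1_t.shape) < sigmoid(V_t @ W1 + H2_t @ W2.T)
    dW1 -= V_t.T @ H1_t / m
    dW2 -= H1_t.T @ H2_t / m
    # Cartoon update; in practice use momentum with a decaying rate.
    W1 += step_size * dW1
    W2 += step_size * dW2
```

Note that the two blocks of the Gibbs sweep can each be sampled in parallel: given H̃^(1), the visible and second-layer units are conditionally independent of each other.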
RBMs have been trained in this way, they can be combined to form a DBM. The DBM may then be trained with PCD. Typically, PCD training will make only a small change in the model's parameters and its performance as measured by the log-likelihood it assigns to the data, or its ability to classify inputs. See Fig. 20.4 for an illustration of the training procedure.

This greedy layer-wise training procedure is not just coordinate ascent. It bears some passing resemblance to coordinate ascent because we optimize one subset of the parameters at each step. However, in the case of the greedy layer-wise training procedure, we actually use a different objective function at each step.

Greedy layer-wise pretraining of a DBM differs from greedy layer-wise pretraining of a DBN. The parameters of each individual RBM may be copied into the corresponding DBN directly. In the case of the DBM, the RBM parameters must be modified before inclusion in the DBM. A layer in the middle of the stack of RBMs is trained with only bottom-up input, but after the stack is combined to form the DBM, the layer will have both bottom-up and top-down input. To account for this effect, Salakhutdinov and Hinton (2009a) advocate dividing the weights of all but the top and bottom RBM in half before inserting them into the DBM. Additionally, the bottom RBM must be trained using two "copies" of each visible unit, with the weights tied to be equal between the two copies. This means that the weights are effectively doubled during the upward pass. Similarly, the top RBM should be trained with two copies of the topmost layer.

Obtaining state-of-the-art results with the deep Boltzmann machine requires
a modification of the standard SML algorithm, which is to use a small amount of mean field during the negative phase of the joint PCD training step (Salakhutdinov and Hinton, 2009a). Specifically, the expectation of the energy gradient should be computed with respect to the mean field distribution in which all of the units are independent from each other. The parameters of this mean field distribution should be obtained by running the mean field fixed point equations for just one step. See Goodfellow et al. (2013b) for a comparison of the performance of centered DBMs with and without the use of partial mean field in the negative phase.
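The weight rescaling used when assembling the pretrained RBMs into a DBM can be sketched as follows (a hypothetical helper of our own; it assumes the bottom RBM was trained with two tied copies of its visible units and the top RBM with two copies of its topmost layer, so only the intermediate weights are halved):

```python
import numpy as np

def assemble_dbm_weights(rbm_weights):
    """Combine greedily pretrained RBM weight matrices into DBM weights,
    halving all but the top and bottom ones, following Salakhutdinov and
    Hinton (2009a).  rbm_weights: list of arrays, bottom layer first."""
    n = len(rbm_weights)
    return [W / 2.0 if 0 < i < n - 1 else W.copy()
            for i, W in enumerate(rbm_weights)]
```

The halving compensates for each intermediate layer receiving both bottom-up and top-down input once the layers are joined, where during pretraining it received only one of the two.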
20.4.5
Jointly Training Deep Boltzmann Machines
Classic DBMs require greedy unsupervised pretraining and, to perform classification well, require a separate MLP-based classifier on top of the hidden features they extract. This has some undesirable properties. It is hard to track performance during training because we cannot evaluate properties of the full DBM while training the first RBM. Thus, it is hard to tell how well our hyperparameters
Figure 20.4: The deep Boltzmann machine training procedure used to classify the MNIST dataset (Salakhutdinov and Hinton, 2009a; Srivastava et al., 2014). (a) Train an RBM by using CD to approximately maximize log P(v). (b) Train a second RBM that models h^(1) and target class y by using CD-k to approximately maximize log P(h^(1), y), where h^(1) is drawn from the first RBM's posterior conditioned on the data. Increase k from 1 to 20 during learning. (c) Combine the two RBMs into a DBM. Train it to approximately maximize log P(v, y) using stochastic maximum likelihood with k = 5. (d) Delete y from the model. Define a new set of features h^(1) and h^(2) that are obtained by running mean field inference in the model lacking y. Use these features as input to an MLP whose structure is the same as an additional pass of mean field, with an additional output layer for the estimate of y. Initialize the MLP's weights to be the same as the DBM's weights. Train the MLP to approximately maximize log P(y | v) using stochastic gradient descent and dropout. Figure reprinted from Goodfellow et al. (2013b).
are working until quite late in the training process. Software implementations of DBMs need to have many different components for CD training of individual RBMs, PCD training of the full DBM, and training based on back-propagation through the MLP. Finally, the MLP on top of the Boltzmann machine loses many of the advantages of the Boltzmann machine probabilistic model, such as being able to perform inference when some input values are missing.

There are two main ways to resolve the joint training problem of the deep Boltzmann machine. The first is the centered deep Boltzmann machine (Montavon and Muller, 2012), which reparametrizes the model in order to make the Hessian of the cost function better conditioned at the beginning of the learning process. This yields a model that can be trained without a greedy layer-wise pretraining stage. The resulting model obtains excellent test set log-likelihood and produces high quality samples. Unfortunately, it remains unable to compete with appropriately regularized MLPs as a classifier. The second way to jointly train a deep Boltzmann machine is to use a multi-prediction deep Boltzmann machine (Goodfellow et al., 2013b). This model uses an alternative training criterion that allows the use of the back-propagation algorithm in order to avoid the problems with MCMC estimates of the gradient. Unfortunately, the new criterion does not lead to good likelihood or samples, but, compared to the MCMC approach, it does lead to superior classification performance and the ability to reason well about missing inputs.

The centering trick for the Boltzmann machine is easiest to describe if we return to the general view of a Boltzmann machine as consisting of a set of units x with a weight matrix U and biases b. Recall from Eq. 20.2 that the energy function is given by
\[
E(x) = -x^\top U x - b^\top x. \tag{20.36}
\]

Using different sparsity patterns in the weight matrix U, we can implement structures of Boltzmann machines, such as RBMs, or DBMs with different numbers of layers. This is accomplished by partitioning x into visible and hidden units and zeroing out elements of U for units that do not interact. The centered Boltzmann machine introduces a vector µ that is subtracted from all of the states:

\[
E'(x; U, b) = -(x - \mu)^\top U (x - \mu) - (x - \mu)^\top b. \tag{20.37}
\]

Typically µ is a hyperparameter fixed at the beginning of training. It is usually chosen to make sure that x − µ ≈ 0 when the model is initialized. This reparametrization does not change the set of probability distributions that the model can represent, but it does change the dynamics of stochastic gradient descent applied to the likelihood.
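As a quick check on Eqs. 20.36 and 20.37, here is a direct NumPy transcription (function names are ours); with µ = 0 the centered energy reduces to the standard one:

```python
import numpy as np

def energy(x, U, b):
    """Standard Boltzmann machine energy, Eq. 20.36."""
    return -x @ U @ x - b @ x

def centered_energy(x, U, b, mu):
    """Centered Boltzmann machine energy, Eq. 20.37:
    all states are shifted by the centering vector mu."""
    d = x - mu
    return -d @ U @ d - d @ b
```

In practice, µ would typically be set near the data marginals so that x − µ ≈ 0 for states the model visits at initialization.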
Specifically, in many cases, this reparametrization results in a Hessian matrix that is better conditioned. Melchior et al. (2013) experimentally
confirmed that the conditioning of the Hessian matrix improves, and observed that the centering trick is equivalent to another Boltzmann machine learning technique, the enhanced gradient (Cho et al., 2011). The improved conditioning of the Hessian matrix allows learning to succeed, even in difficult cases like training a deep Boltzmann machine with multiple layers.

The other approach to jointly training deep Boltzmann machines is the multi-prediction deep Boltzmann machine (MP-DBM), which works by viewing the mean field equations as defining a family of recurrent networks for approximately solving every possible inference problem (Goodfellow et al., 2013b). Rather than training the model to maximize the likelihood, the model is trained to make each recurrent network obtain an accurate answer to the corresponding inference problem. The training process is illustrated in Fig. 20.5. It consists of randomly sampling a training example, randomly sampling a subset of inputs to the inference network, and then training the inference network to predict the values of the remaining units.

This general principle of back-propagating through the computational graph for approximate inference has been applied to other models (Stoyanov et al., 2011; Brakel et al., 2013). In these models and in the MP-DBM, the final loss is not the lower bound on the likelihood. Instead, the final loss is typically based on the approximate conditional distribution that the approximate inference network imposes over the missing values. This means that the training of these models is somewhat heuristically motivated. If we inspect the p(v) represented by the Boltzmann machine learned by the MP-DBM, it tends to be somewhat defective, in the sense that Gibbs sampling yields poor samples.
Back-propagation through the inference graph has two main advantages. First, it trains the model as it is really used—with approximate inference. This means that approximate inference, for example to fill in missing inputs, or to perform classification despite the presence of missing inputs, is more accurate in the MP-DBM than in the original DBM. The original DBM does not make an accurate classifier on its own; the best classification results with the original DBM were based on training a separate classifier to use features extracted by the DBM, rather than by using inference in the DBM to compute the distribution over the class labels. Mean field inference in the MP-DBM performs well as a classifier without special modifications. The other advantage of back-propagating through approximate inference is that back-propagation computes the exact gradient of the loss. This is better for optimization than the approximate gradients of SML training, which suffer from both bias and variance. This probably explains why MP-DBMs may be trained jointly while DBMs require greedy layer-wise pretraining.
Figure 20.5: An illustration of the multi-prediction training process for a deep Boltzmann machine. Each row indicates a different example within a minibatch for the same training step. Each column represents a time step within the mean field inference process. For each example, we sample a subset of the data variables to serve as inputs to the inference process. These variables are shaded black to indicate conditioning. We then run the mean field inference process, with arrows indicating which variables influence which other variables in the process. In practical applications, we unroll mean field for several steps. In this illustration, we unroll for only two steps. Dashed arrows indicate how the process could be unrolled for more steps. The data variables that were not used as inputs to the inference process become targets, shaded in gray. We can view the inference process for each example as a recurrent network. We use gradient descent and back-propagation to train these recurrent networks to produce the correct targets given their inputs. This trains the mean field process for the MP-DBM to produce accurate estimates. Figure adapted from Goodfellow et al. (2013b).
The disadvantage of back-propagating through the approximate inference graph is that it does not provide a way to optimize the log-likelihood, but rather a heuristic approximation of the generalized pseudolikelihood.

The MP-DBM inspired the NADE-k (Raiko et al., 2014) extension to the NADE framework, which is described in Sec. 20.10.10.

The MP-DBM has some connections to dropout. Dropout shares the same parameters among many different computational graphs, with the difference between each graph being whether it includes or excludes each unit. The MP-DBM also shares parameters across many computational graphs. In the case of the MP-DBM, the difference between the graphs is whether each input unit is observed or not. When a unit is not observed, the MP-DBM does not delete it entirely as in the case of dropout. Instead, the MP-DBM treats it as a latent variable to be inferred. One could imagine applying dropout to the MP-DBM by additionally removing some units rather than making them latent.
20.5
Boltzmann Machines for Real-Valued Data
While Boltzmann machines were originally developed for use with binary data, many applications such as image and audio modeling seem to require the ability to represent probability distributions over real values. In some cases, it is possible to treat real-valued data in the interval [0, 1] as representing the expectation of a binary variable. For example, Hinton (2000) treats grayscale images in the training set as defining [0, 1] probability values. Each pixel defines the probability of a binary value being 1, and the binary pixels are all sampled independently from each other. This is a common procedure for evaluating binary models on grayscale image datasets. However, it is not a particularly theoretically satisfying approach, and binary images sampled independently in this way have a noisy appearance. In this section, we present Boltzmann machines that define a probability density over real-valued data.
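The binarization scheme above can be sketched in a few lines of NumPy. This is a hedged illustration of the general procedure, not the exact preprocessing of any particular paper; the toy 2x2 "image" is an arbitrary assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy "grayscale image": intensities in [0, 1] interpreted as
# independent Bernoulli probabilities P(pixel = 1).
gray = np.array([[0.0, 0.5],
                 [0.9, 1.0]])

# Draw one binary image: each pixel is 1 with probability equal to its intensity.
binary_sample = (rng.random(gray.shape) < gray).astype(int)

# Averaging many independent binary samples recovers the grayscale intensities,
# since the grayscale value is the expectation of the binary pixel.
mean_image = np.mean(
    [rng.random(gray.shape) < gray for _ in range(10000)], axis=0)
```

Because the pixels are sampled independently, each individual binary image looks noisy even though the average of many samples matches the original, which is exactly the aesthetic complaint raised above.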
20.5.1
Gaussian-Bernoulli RBMs
Restricted Boltzmann machines may be developed for many exponential family conditional distributions (Welling et al., 2005). Of these, the most common is the RBM with binary hidden units and real-valued visible units, with the conditional distribution over the visible units being a Gaussian distribution whose mean is a function of the hidden units.

There are many ways of parametrizing Gaussian-Bernoulli RBMs. First, we may
choose whether to use a covariance matrix or a precision matrix for the Gaussian distribution. Here we present the precision formulation. The modification to obtain the covariance formulation is straightforward. We wish to have the conditional distribution

p(v | h) = N(v; Wh, β⁻¹).   (20.38)

We can find the terms we need to add to the energy function by expanding the unnormalized log conditional distribution:

log N(v; Wh, β⁻¹) = −(1/2) (v − Wh)⊤ β (v − Wh) + f(β).   (20.39)

Here f encapsulates all the terms that are a function only of the parameters and not the random variables in the model. We can discard f because its only role is to normalize the distribution, and the partition function of whatever energy function we choose will carry out that role.

If we include all of the terms (with their sign flipped) involving v from Eq. 20.39 in our energy function and do not add any other terms involving v, then our energy function will represent the desired conditional p(v | h).

We have some freedom regarding the other conditional distribution, p(h | v). Note that Eq. 20.39 contains a term

(1/2) h⊤ W⊤ β W h.   (20.40)

This term cannot be included in its entirety because it includes h_i h_j terms. These correspond to edges between the hidden units. If we included these terms, we would have a linear factor model instead of a restricted Boltzmann machine. When designing our Boltzmann machine, we simply omit these h_i h_j cross terms. Omitting them does not change the conditional p(v | h), so Eq. 20.39 is still respected. However, we still have a choice about whether to include the terms involving only a single h_i. If we assume a diagonal precision matrix, we find that for each hidden unit h_i we have a term

(1/2) h_i Σ_j β_j W²_{j,i}.   (20.41)

In the above, we used the fact that h_i² = h_i because h_i ∈ {0, 1}. If we include this term (with its sign flipped) in the energy function, then it will naturally bias h_i to be turned off when the weights for that unit are large and connected to visible units with high precision. The choice of whether or not to include this bias term does not affect the family of distributions the model can represent (assuming that
we include bias parameters for the hidden units) but it does affect the learning dynamics of the model. Including the term may help the hidden unit activations remain reasonable even when the weights rapidly increase in magnitude.

One way to define the energy function on a Gaussian-Bernoulli RBM is thus

E(v, h) = (1/2) v⊤(β ⊙ v) − (v ⊙ β)⊤ W h − b⊤h,   (20.42)

but we may also add extra terms or parametrize the energy in terms of the variance rather than the precision if we choose.

In this derivation, we have not included a bias term on the visible units, but one could easily be added. One final source of variability in the parametrization of a Gaussian-Bernoulli RBM is the choice of how to treat the precision matrix. It may either be fixed to a constant (perhaps estimated based on the marginal precision of the data) or learned. It may also be a scalar times the identity matrix, or it may be a diagonal matrix. Typically we do not allow the precision matrix to be non-diagonal in this context, because some operations would then require inverting the matrix. In the sections ahead, we will see that other forms of Boltzmann machines permit modeling the covariance structure, using various techniques to avoid inverting the precision matrix.
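As a sanity check on the energy of Eq. 20.42, the following sketch (with an assumed diagonal precision stored as a vector `beta`, and arbitrary toy dimensions) verifies numerically that the energy's dependence on v matches the Gaussian log-density of Eq. 20.38 up to a v-independent constant:

```python
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid = 4, 3

W = rng.normal(size=(n_vis, n_hid))    # weights
b = rng.normal(size=n_hid)             # hidden biases
beta = np.array([1.0, 2.0, 0.5, 4.0])  # diagonal precision on the visible units

def energy(v, h):
    # Eq. 20.42: E(v, h) = 1/2 v^T (beta * v) - (v * beta)^T W h - b^T h
    return 0.5 * v @ (beta * v) - (v * beta) @ W @ h - b @ h

def log_gaussian(v, h):
    # log N(v; Wh, beta^{-1}) up to the v-independent normalizer f(beta)
    mean = W @ h
    return -0.5 * (v - mean) @ (beta * (v - mean))

h = np.array([1.0, 0.0, 1.0])
v1, v2 = rng.normal(size=n_vis), rng.normal(size=n_vis)

# Energy differences in v must match the Gaussian log-density differences,
# confirming that this energy yields p(v | h) = N(v; Wh, beta^{-1}).
lhs = energy(v1, h) - energy(v2, h)
rhs = -(log_gaussian(v1, h) - log_gaussian(v2, h))
assert np.isclose(lhs, rhs)
```

The h-only terms (b⊤h and the discarded f) cancel in the difference, which is why comparing two values of v suffices.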
20.5.2
Undirected Models of Conditional Covariance
While the Gaussian RBM has been the canonical energy model for real-valued data, Ranzato et al. (2010a) argue that the Gaussian RBM inductive bias is not well suited to the statistical variations present in some types of real-valued data, especially natural images. The problem is that much of the information content present in natural images is embedded in the covariance between pixels rather than in the raw pixel values. In other words, it is the relationships between pixels and not their absolute values where most of the useful information in images resides. Since the Gaussian RBM only models the conditional mean of the input given the hidden units, it cannot capture conditional covariance information. In response to these criticisms, alternative models have been proposed that attempt to better account for the covariance of real-valued data. These models include the mean and covariance RBM (mcRBM¹), the mean-product of t-distribution (mPoT) model and the spike and slab RBM (ssRBM).

¹ The term “mcRBM” is pronounced by saying the name of the letters M-C-R-B-M; the “mc” is not pronounced like the “Mc” in “McDonald’s.”
Mean and Covariance RBM  The mcRBM uses its hidden units to independently encode the conditional mean and covariance of all observed units. The mcRBM hidden layer is divided into two groups of units: mean units and covariance units. The group that models the conditional mean is simply a Gaussian RBM. The other half is a covariance RBM (Ranzato et al., 2010a), also called a cRBM, whose components model the conditional covariance structure, as described below.

Specifically, with binary mean units h^(m) and binary covariance units h^(c), the mcRBM model is defined as the combination of two energy functions:

E_mc(x, h^(m), h^(c)) = E_m(x, h^(m)) + E_c(x, h^(c)),   (20.43)

where E_m is the standard Gaussian-Bernoulli RBM energy function:²

E_m(x, h^(m)) = (1/2) x⊤x − Σ_j x⊤W_{:,j} h_j^(m) − Σ_j b_j^(m) h_j^(m),   (20.44)

and E_c is the cRBM energy function that models the conditional covariance information:

E_c(x, h^(c)) = (1/2) Σ_j h_j^(c) (x⊤r^(j))² − Σ_j b_j^(c) h_j^(c).   (20.45)

The parameter r^(j) corresponds to the covariance weight vector associated with h_j^(c), and b^(c) is a vector of covariance offsets. The combined energy function defines a joint distribution:

p_mc(x, h^(m), h^(c)) = (1/Z) exp{ −E_mc(x, h^(m), h^(c)) },   (20.46)

and a corresponding conditional distribution over the observations given h^(m) and h^(c) as a multivariate Gaussian distribution:

p_mc(x | h^(m), h^(c)) = N( C^mc_{x|h} Σ_j W_{:,j} h_j^(m),  C^mc_{x|h} ).   (20.47)

Note that the covariance matrix C^mc_{x|h} = ( Σ_j h_j^(c) r^(j) r^(j)⊤ + I )^{−1} is non-diagonal and that W is the weight matrix associated with the Gaussian RBM modeling the conditional means.

² This version of the Gaussian-Bernoulli RBM energy function assumes the image data has zero mean, per pixel. Pixel offsets can easily be added to the model to account for nonzero pixel means.
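The conditional of Eq. 20.47 can be sketched numerically as follows. The dimensions and the particular unit activations are arbitrary illustrative assumptions, and the columns of `R` play the role of the vectors r^(j):

```python
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_mean, n_cov = 3, 2, 4

W = rng.normal(size=(n_vis, n_mean))   # weights of the Gaussian (mean) RBM
R = rng.normal(size=(n_vis, n_cov))    # column j is the covariance weight vector r^(j)

h_m = np.array([1.0, 0.0])             # binary mean units h^(m)
h_c = np.array([1.0, 1.0, 0.0, 1.0])   # binary covariance units h^(c)

# Conditional precision sum_j h_j^(c) r^(j) r^(j)^T + I; its inverse is the
# (non-diagonal) conditional covariance C of Eq. 20.47.
precision = (R * h_c) @ R.T + np.eye(n_vis)
C = np.linalg.inv(precision)

mean = C @ (W @ h_m)                   # conditional mean C W h^(m)
x = rng.multivariate_normal(mean, C)   # one sample from p(x | h^(m), h^(c))
```

Note that each covariance unit that is switched on adds a rank-one term to the precision, so inverting it (as done explicitly here) is exactly the per-iteration cost that the training discussion below is concerned with.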
It is difficult to train the mcRBM via contrastive divergence or persistent contrastive divergence because of its non-diagonal conditional covariance structure. CD and PCD require sampling from the joint distribution of x, h^(m), h^(c), which, in a standard RBM, is accomplished by Gibbs sampling over the conditionals. However, in the mcRBM, sampling from p_mc(x | h^(m), h^(c)) requires computing C^mc_{x|h}, that is, inverting the conditional precision matrix, at every iteration of learning. This can be an impractical computational burden for larger observations. Ranzato and Hinton (2010) avoid direct sampling from the conditional p_mc(x | h^(m), h^(c)) by sampling directly from the marginal p(x) using Hamiltonian (hybrid) Monte Carlo (Neal, 1993) on the mcRBM free energy.

Mean-Product of Student's t-distributions  The mean-product of Student's t-distribution (mPoT) model (Ranzato et al., 2010b) extends the PoT model (Welling et al., 2003a) in a manner similar to how the mcRBM extends the cRBM.
This t-distribution oT) model (Ranzato et al., 2010b ) extends PoT moofdelGaussian (Welling is ac achieved hieved by(mP including nonzero Gaussian means by thethe addition et al., 2003a ) in aunits. manner to howthe thePoT mcRBM extends the cRBM. RBM-lik RBM-like e hidden Lik Likeesimilar the mcRBM, conditional distribution overThis the is ac hieved by including nonzero Gaussian means b y the addition of Gaussian observ observation ation is a multiv multivariate ariate Gaussian (with non-diagonal cov covariance) ariance) distribution; RBM-lik e hidden units. Lik e the mcRBM, the PoT conditional distribution oover ver the the ho how wev ever, er, unlik unlikee the mcRBM, the complemen complementary tary conditional distribution observation is a multiv ariate Gaussian (with non-diagonal covariance) distribution; hidden variables is given by conditionally indep distributions. The independen enden endentt Gamma ho w ev er, unlik e the mcRBM, the complemen tary conditional distribution o ver the Gamma distribution G( k, θ) is a probabilit probability y distribution ov over er positive real num umb bers, hidden variables by conditionally enden t Gamma distributions. with mean not necessary to ha have veindep a more detailed understanding of The the kθ. Itisisgiven ( k , θ Gamma distribution ) is a probabilit y distribution ov er p ositive real n um b ers, Gamma distribution to understand the basic ideas underlying the mP mPoT oT mo model. del. with mean kθ. It is not G necessary to have a more detailed understanding of the The mP mPoT oT energy function is: Gamma distribution to understand the basic ideas underlying the mPoT model. 
The mPoT energy function is:

E_mPoT(x, h^(m), h^(c))   (20.48)
   = E_m(x, h^(m)) + Σ_j ( h_j^(c) ( 1 + (1/2) (r^(j)⊤x)² ) + (1 − γ_j) log h_j^(c) ),   (20.49)

where r^(j) is the covariance weight vector associated with unit h_j^(c) and E_m(x, h^(m)) is as defined in Eq. 20.44.

Just as with the mcRBM, the mPoT model energy function specifies a multivariate Gaussian, with a conditional distribution over x that has non-diagonal covariance. The covariance units h^(c) are conditionally Gamma-distributed:

p_mPoT(h_j^(c) | x) = G( γ_j, 1 + (1/2) (r^(j)⊤x)² ).   (20.50)

Learning in the mPoT model is, as in the mcRBM, complicated by the inability to sample from the non-diagonal Gaussian conditional p_mPoT(x | h^(m), h^(c)), so Ranzato et al. (2010b) also advocate direct sampling of p(x) via Hamiltonian (hybrid) Monte Carlo.
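The sketch below checks one consistent reading of Eqs. 20.48, 20.49 and 20.50: dropping the h^(c)-dependent energy terms into exp(−E) yields an unnormalized density h^{γ−1} exp(−h z), a Gamma distribution with shape γ_j and rate z_j = 1 + (1/2)(r^(j)⊤x)², that is, mean γ_j/z_j under the G(k, θ) convention with θ = 1/z_j. All tensor shapes and parameter values here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_cov = 3, 2

R = rng.normal(size=(n_vis, n_cov))  # column j is the covariance weight vector r^(j)
gamma = np.array([2.0, 3.0])         # Gamma shape parameters gamma_j
x = rng.normal(size=n_vis)

# Rate parameters z_j = 1 + 1/2 (r^(j)^T x)^2, as in Eq. 20.50.
z = 1.0 + 0.5 * (R.T @ x) ** 2

def unnorm_log_p(h, j):
    # Negative of unit j's h^(c) energy terms in Eq. 20.49:
    #   h * z_j + (1 - gamma_j) * log h
    return -(h * z[j] + (1.0 - gamma[j]) * np.log(h))

def gamma_log_pdf(h, shape, rate):
    # Log of the unnormalized Gamma density h^{shape-1} exp(-rate * h)
    return (shape - 1.0) * np.log(h) - rate * h

# The two agree up to an additive constant in h, so the conditional is Gamma.
h1, h2 = 0.7, 2.5
for j in range(n_cov):
    diff_energy = unnorm_log_p(h1, j) - unnorm_log_p(h2, j)
    diff_gamma = gamma_log_pdf(h1, gamma[j], z[j]) - gamma_log_pdf(h2, gamma[j], z[j])
    assert np.isclose(diff_energy, diff_gamma)

# Sample the covariance units (NumPy's gamma takes shape and scale = 1/rate).
h_c = rng.gamma(shape=gamma, scale=1.0 / z)
```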
Spike and Slab Restricted Boltzmann Machines  Spike and slab restricted Boltzmann machines (Courville et al., 2011), or ssRBMs, provide another means of modeling the covariance structure of real-valued data. Compared to mcRBMs, ssRBMs have the advantage of requiring neither matrix inversion nor Hamiltonian Monte Carlo methods. As a model of natural images, the ssRBM is interesting in that, like the mcRBM and the mPoT model, its binary hidden units encode the conditional covariance across pixels through the use of auxiliary real-valued variables.

The spike and slab RBM has two sets of hidden units: binary spike units h, and real-valued slab units s. The mean of the visible units conditioned on the hidden units is given by (h ⊙ s)W⊤. In other words, each column W_{:,i} defines a component that can appear in the input when h_i = 1. The corresponding spike variable h_i determines whether that component is present at all. The corresponding slab variable s_i determines the intensity of that component, if it is present. When a spike variable is active, the corresponding slab variable adds variance to the input along the axis defined by W_{:,i}. This allows us to model the covariance of the inputs. Fortunately, contrastive divergence and persistent contrastive divergence with Gibbs sampling are still applicable. There is no need to invert any matrix.

Formally, the ssRBM model is defined via its energy function:

E_ss(x, s, h) = −Σ_i x⊤W_{:,i} s_i h_i + (1/2) x⊤( Λ + Σ_i Φ_i h_i ) x   (20.51)
       + (1/2) Σ_i α_i s_i² − Σ_i α_i µ_i s_i h_i − Σ_i b_i h_i + (1/2) Σ_i α_i µ_i² h_i,   (20.52)

where b_i is the offset of the spike h_i, and Λ is a diagonal precision matrix on the observations x. The parameter α_i > 0 is a scalar precision parameter for the real-valued slab variable s_i. The parameter Φ_i is a non-negative diagonal matrix that defines an h-modulated quadratic penalty on x. Each µ_i is a mean parameter for the slab variable s_i.

With the joint distribution defined via the energy function, it is relatively straightforward to derive the ssRBM conditional distributions. For example, by marginalizing out the slab variables s, the conditional distribution over the observations given the binary spike variables h is given by:

p_ss(x | h) = (1/P(h)) ∫ (1/Z) exp{ −E_ss(x, s, h) } ds   (20.53)
= N( C^ss_{x|h} Σ_i W_{:,i} µ_i h_i ,  C^ss_{x|h} ),   (20.54)

where C^ss_{x|h} = ( Λ + Σ_i Φ_i h_i − Σ_i α_i^{−1} h_i W_{:,i} W_{:,i}⊤ )^{−1}. The last equality holds only if the covariance matrix C^ss_{x|h} is positive definite.

Gating by the spike variables means that the true marginal distribution over h ⊙ s is sparse. This is different from sparse coding, where samples from the model "almost never" (in the measure theoretic sense) contain zeros in the code, and MAP inference is required to impose sparsity.

Comparing the ssRBM to the mcRBM and the mPoT models, the ssRBM parametrizes the conditional covariance of the observation in a significantly different way. The mcRBM and mPoT both model the covariance structure of the observation as ( Σ_j h_j^(c) r^(j) r^(j)⊤ + I )^{−1}, using the activation of the hidden units h_j > 0 to enforce constraints on the conditional covariance in the direction r^(j).
In contrast, the ssRBM specifies the conditional covariance of the observations using the hidden spike activations h_i = 1 to pinch the precision matrix along the direction specified by the corresponding weight vector. The ssRBM conditional covariance is very similar to that given by a different model: the product of probabilistic principal components analysis (PoPPCA) (Williams and Agakov, 2002). In the overcomplete setting, sparse activations with the ssRBM parametrization permit significant variance (above the nominal variance given by Λ^{−1}) only in the directions of the sparsely activated h_i. In the mcRBM or mPoT models, an overcomplete representation would mean that capturing variation in a particular direction in the observation space requires removing potentially all constraints with positive projection in that direction.
This would suggest that these models are less well suited to the overcomplete setting.

The primary disadvantage of the spike and slab restricted Boltzmann machine is that some settings of the parameters can correspond to a covariance matrix that is not positive definite. Such a covariance matrix places more unnormalized probability on values that are farther from the mean, causing the integral over all possible outcomes to diverge. Generally this issue can be avoided with simple heuristic tricks. There is not yet any theoretically satisfying solution. Using constrained optimization to explicitly avoid the regions where the probability is undefined is difficult to do without being overly conservative and also preventing the model from accessing high-performing regions of parameter space.

Qualitatively, convolutional variants of the ssRBM produce excellent samples of natural images. Some examples are shown in Fig. 16.1.
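The positive-definiteness failure described above is easy to exhibit numerically. In this hypothetical setting (Φ = 0, a handful of units, hand-picked weights), the inverse covariance of Eq. 20.54 stays positive definite for small weights but acquires a negative eigenvalue once the weights grow:

```python
import numpy as np

n_vis = 3
Lam = np.eye(n_vis)           # Lambda: diagonal precision on x
alpha = np.array([1.0, 1.0])  # precisions of the slab variables s_i
h = np.array([1.0, 1.0])      # both spike units active

def conditional_precision(W):
    # Inverse of C^ss_{x|h} (Eq. 20.54), taking Phi = 0 for simplicity:
    #   Lambda - sum_i alpha_i^{-1} h_i W_{:,i} W_{:,i}^T
    P = Lam.copy()
    for i in range(len(h)):
        P -= (h[i] / alpha[i]) * np.outer(W[:, i], W[:, i])
    return P

# Small weights: the precision stays positive definite, so p(x | h) is a
# proper Gaussian.
W_small = 0.1 * np.ones((n_vis, 2))
assert np.all(np.linalg.eigvalsh(conditional_precision(W_small)) > 0)

# Large weights: the same formula yields a "precision" with a negative
# eigenvalue, the pathological parameter setting described above.
W_large = 2.0 * np.ones((n_vis, 2))
assert np.min(np.linalg.eigvalsh(conditional_precision(W_large))) < 0
```

Along the offending eigendirection the unnormalized density grows with distance from the mean, so the distribution cannot be normalized; this is the divergence the heuristic tricks must avoid.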
Some examples are sho wn in Fig. 16.1. 684 C ss x|h
P
=
C
X
> −1 α−1 . i hi W :,iW:,i iN
P
CHAPTER 20. DEEP GENERATIVE MODELS
The ssRBM allows for several extensions. Including higher-order interactions and average-pooling of the slab variables (Courville et al., 2014) enables the model to learn excellent features for a classifier when labeled data is scarce. Adding a term to the energy function that prevents the partition function from becoming undefined results in a sparse coding model, spike and slab sparse coding (Goodfellow et al., 2013d), also known as S3C.
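The conditional covariance formula above and its positive-definiteness requirement can be illustrated numerically. The sketch below is a hypothetical NumPy illustration with invented parameter values (it is not code from the cited papers), and `ssrbm_conditional_covariance` is a made-up helper name.

```python
import numpy as np

def ssrbm_conditional_covariance(Lam, W, alpha, h):
    """Conditional covariance of x given spike activations h:
    C = (Lam - sum_i h_i * alpha_i^{-1} * W[:, i] W[:, i]^T)^{-1}.
    Raises if the precision matrix is not positive definite, which is
    exactly the normalization failure discussed in the text."""
    precision = Lam - (W * (h / alpha)) @ W.T
    if np.any(np.linalg.eigvalsh(precision) <= 0):
        raise ValueError("precision matrix is not positive definite")
    return np.linalg.inv(precision)

rng = np.random.default_rng(0)
d, N = 4, 3
Lam = np.eye(d)                          # nominal precision, so Lambda^{-1} = I
W = 0.15 * rng.standard_normal((d, N))   # small weights keep the model valid
alpha = np.ones(N)

C_off = ssrbm_conditional_covariance(Lam, W, alpha, np.zeros(N))
C_on = ssrbm_conditional_covariance(Lam, W, alpha, np.ones(N))
# Activating spikes removes precision along the weight directions, so the
# variance can only grow relative to the nominal Lambda^{-1}:
assert np.all(np.diag(C_on) >= np.diag(C_off) - 1e-9)
```

With large enough weights, the subtraction makes the precision matrix lose positive definiteness, and the helper raises instead of returning a divergent "covariance", mirroring the undefined-partition-function problem described above.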
20.6
Convolutional Boltzmann Machines
As seen in Chapter 9, extremely high dimensional inputs such as images place great strain on the computation, memory and statistical requirements of machine learning models. Replacing matrix multiplication by discrete convolution with a small kernel is the standard way of solving these problems for inputs that have translation invariant spatial or temporal structure. Desjardins and Bengio (2008) showed that this approach works well when applied to RBMs.

Deep convolutional networks usually require a pooling operation so that the spatial size of each successive layer decreases. Feedforward convolutional networks often use a pooling function such as the maximum of the elements to be pooled. It is unclear how to generalize this to the setting of energy-based models. We could introduce a binary pooling unit p over n binary detector units d and enforce p = max_i d_i by setting the energy function to be ∞ whenever that constraint is violated. This does not scale well, though, as it requires evaluating 2^n different energy configurations to compute the normalization constant. For a small 3 × 3 pooling region this requires 2^9 = 512 energy function evaluations per pooling unit!

Lee et al. (2009) developed a solution to this problem called probabilistic max pooling (not to be confused with "stochastic pooling," which is a technique for implicitly constructing ensembles of convolutional feedforward networks). The strategy behind probabilistic max pooling is to constrain the detector units so that at most one may be active at a time. This means there are only n + 1 total states (one state for each of the n detector units being on, and an additional state corresponding to all of the detector units being off). The pooling unit is on if and only if one of the detector units is on. The state with all units off is assigned energy zero. We can think of this as describing a model with a single variable that has n + 1 states, or equivalently as a model with n + 1 variables that assigns energy ∞ to all but n + 1 joint assignments of those variables.

While efficient, probabilistic max pooling does force the detector units to be mutually exclusive, which may be a useful regularizing constraint in some contexts or a harmful limit on model capacity in other contexts. It also does not support
overlapping pooling regions. Overlapping pooling regions are usually required to obtain the best performance from feedforward convolutional networks, so this constraint probably greatly reduces the performance of convolutional Boltzmann machines.

Lee et al. (2009) demonstrated that probabilistic max pooling could be used to build convolutional deep Boltzmann machines.³ This model is able to perform operations such as filling in missing portions of its input. While intellectually appealing, this model is challenging to make work in practice, and it usually does not perform as well as a classifier as traditional convolutional networks trained with supervised learning.

Many convolutional models work equally well with inputs of many different spatial sizes. For Boltzmann machines, it is difficult to change the input size for a variety of reasons. The partition function changes as the size of the input changes. Moreover, many convolutional networks achieve size invariance by scaling up the size of their pooling regions proportionally to the size of the input, but scaling up Boltzmann machine pooling regions is awkward. Traditional convolutional neural networks can use a fixed number of pooling units and dynamically increase the size of their pooling regions in order to obtain a fixed-size representation of a variable-sized input. For Boltzmann machines, large pooling regions become too expensive for the naive approach. The approach of Lee et al. (2009), making each of the detector units in the same pooling region mutually exclusive, solves the computational problems but still does not allow variable-size pooling regions. For example, suppose we learn a model with 2 × 2 probabilistic max pooling over detector units that learn edge detectors.
This enforces the constraint that only one of these edges may appear in each 2 × 2 region. If we then increase the size of the input image by 50% in each direction, we would expect the number of edges to increase correspondingly. Instead, if we increase the size of the pooling regions by 50% in each direction to 3 × 3, then the mutual exclusivity constraint now specifies that each of these edges may appear only once in a 3 × 3 region. As we grow a model's input image in this way, the model generates edges with less density. Of course, these issues arise only when the model must use variable amounts of pooling in order to emit a fixed-size output vector. Models that use probabilistic max pooling may still accept variable-sized input images, so long as the output of the model is a feature map that can scale in size proportionally to the input image.

Pixels at the boundary of the image also pose some difficulty, which is exacerbated by the fact that connections in a Boltzmann machine are symmetric. If we do not implicitly zero-pad the input, then there are fewer hidden units than visible units, and the visible units at the boundary of the image are not modeled well, because they lie in the receptive field of fewer hidden units. However, if we do implicitly zero-pad the input, then the hidden units at the boundary are driven by fewer input pixels and may fail to activate when needed.

³ The publication describes the model as a "deep belief network," but because it can be described as a purely undirected model with tractable layer-wise mean field fixed point updates, it best fits the definition of a deep Boltzmann machine.
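The state-counting argument behind probabilistic max pooling (2^n joint detector configurations for the naive energy-based pooling versus only n + 1 allowed states) can be sketched as follows. This is an illustrative toy implementation with made-up detector scores, not code from Lee et al. (2009).

```python
import itertools
import math

def naive_num_states(n):
    """Naive max pooling must evaluate every joint configuration of the
    n binary detector units: 2**n energy evaluations."""
    return sum(1 for _ in itertools.product([0, 1], repeat=n))

def prob_max_pool(scores):
    """Probabilistic max pooling: at most one detector unit may be on.
    State 0 = all detectors off (energy 0); state i = only unit i on,
    with energy -scores[i-1]. Returns state probabilities and P(p = 1)."""
    unnorm = [1.0] + [math.exp(s) for s in scores]  # exp(-energy) per state
    Z = sum(unnorm)
    probs = [u / Z for u in unnorm]
    p_on = 1.0 - probs[0]   # pooling unit p = max_i d_i is on unless all off
    return probs, p_on

scores = [0.5, 1.0, -0.2]               # a hypothetical 3-unit pooling region
probs, p_on = prob_max_pool(scores)
assert abs(sum(probs) - 1.0) < 1e-12
assert naive_num_states(9) == 512       # 3x3 region: 2^9 naive evaluations
assert len(prob_max_pool([0.0] * 9)[0]) == 10  # versus only 9 + 1 states
```

The mutual exclusivity constraint is what collapses the normalization from an exponential to a linear number of terms.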
20.7
Boltzmann Machines for Structured or Sequential Outputs

In the structured output scenario, we wish to train a model that can map from some input x to some output y, and the different entries of y are related to each other and must obey some constraints. For example, in the speech synthesis task, y is a waveform, and the entire waveform must sound like a coherent utterance.

A natural way to represent the relationships between the entries in y is to use a probability distribution p(y | x). Boltzmann machines, extended to model conditional distributions, can supply this probabilistic model.

The same tool of conditional modeling with a Boltzmann machine can be used not just for structured output tasks, but also for sequence modeling. In the latter case, rather than mapping an input x to an output y, the model must estimate a probability distribution over a sequence of variables, p(x^(1), . . . , x^(τ)). Conditional Boltzmann machines can represent factors of the form p(x^(t) | x^(1), . . . , x^(t−1)) in order to accomplish this task.

An important sequence modeling task for the video game and film industry is modeling sequences of joint angles of skeletons used to render 3-D characters. These sequences are often collected using motion capture systems to record the movements of actors. A probabilistic model of a character's movement allows the generation of new, previously unseen, but realistic animations. To solve this sequence modeling task, Taylor et al. (2007) introduced a conditional RBM modeling p(x^(t) | x^(t−1), . . . , x^(t−m)) for small m. The model is an RBM over x^(t) whose bias parameters are a linear function of the preceding m values of x. When we condition on different values of x^(t−1) and earlier variables, we get a new RBM over x. The weights in the RBM over x never change, but by conditioning on different past values, we can change the probability of different hidden units in the RBM being active. By activating and deactivating different subsets of hidden units, we can make large changes to the probability distribution induced on x. Other variants of conditional RBM (Mnih et al., 2011) and other variants of sequence
modeling using conditional RBMs are possible (Taylor and Hinton, 2009; Sutskever et al., 2009; Boulanger-Lewandowski et al., 2012).

Another sequence modeling task is to model the distribution over sequences of musical notes used to compose songs. Boulanger-Lewandowski et al. (2012) introduced the RNN-RBM sequence model and applied it to this task. The RNN-RBM is a generative model of a sequence of frames x^(t), consisting of an RNN that emits the RBM parameters for each time step. Unlike the model described above, the RNN emits all of the parameters of the RBM, including the weights. To train the model, we need to be able to back-propagate the gradient of the loss function through the RNN. The loss function is not applied directly to the RNN outputs. Instead, it is applied to the RBM. This means that we must approximately differentiate the loss with respect to the RBM parameters using contrastive divergence or a related algorithm. This approximate gradient may then be back-propagated through the RNN using the usual back-propagation through time algorithm.
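The conditional RBM construction described above, fixed weights with bias parameters that are a linear function of the m preceding frames, can be sketched as follows. All shapes, names, and parameter values here are invented for illustration; this is not Taylor et al.'s implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_hidden, m = 5, 4, 2            # frame size, hidden units, history length

W = 0.1 * rng.standard_normal((d, n_hidden))  # fixed RBM weights: never change
b_static = np.zeros(d)                         # static visible bias
c_static = np.zeros(n_hidden)                  # static hidden bias
A = 0.1 * rng.standard_normal((d, m * d))          # history -> visible bias
B = 0.1 * rng.standard_normal((n_hidden, m * d))   # history -> hidden bias

def conditional_biases(history):
    """Biases of the RBM over x^(t), given the m previous frames
    (a list [x^(t-m), ..., x^(t-1)])."""
    hvec = np.concatenate(history)
    return b_static + A @ hvec, c_static + B @ hvec

history = [rng.standard_normal(d) for _ in range(m)]
b, c = conditional_biases(history)
# Conditioning on a different past changes the biases (and hence which
# hidden units tend to activate), while the weights W stay fixed.
assert b.shape == (d,) and c.shape == (n_hidden,)
```

In the RNN-RBM, by contrast, the recurrent network would also emit W itself at each time step, not just the biases.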
20.8
Other Boltzmann Machines
Many other variants of Boltzmann machines are possible.

Boltzmann machines may be extended with different training criteria. We have focused on Boltzmann machines trained to approximately maximize the generative criterion log p(v). It is also possible to train discriminative RBMs that aim to maximize log p(y | v) instead (Larochelle and Bengio, 2008). This approach often performs best when using a linear combination of both the generative and the discriminative criteria. Unfortunately, RBMs do not seem to be as powerful supervised learners as MLPs, at least using existing methodology.

Most Boltzmann machines used in practice have only second-order interactions in their energy functions, meaning that their energy functions are the sum of many terms, and each individual term includes only the product between two random variables. An example of such a term is v_i W_{i,j} h_j. It is also possible to train higher-order Boltzmann machines (Sejnowski, 1987) whose energy function terms involve the products between many variables. Three-way interactions between a hidden unit and two different images can model spatial transformations from one frame of video to the next (Memisevic and Hinton, 2007, 2010). Multiplication by a one-hot class variable can change the relationship between visible and hidden units depending on which class is present (Nair and Hinton, 2009). One recent example of the use of higher-order interactions is a Boltzmann machine with two groups of hidden units, with one group of hidden units that interact with both the visible
units v and the class label y, and another group of hidden units that interact only with the v input values (Luo et al., 2011). This can be interpreted as encouraging some hidden units to learn to model the input using features that are relevant to the class, but also to learn extra hidden units that explain nuisance details that are necessary for the samples of v to be realistic but do not determine the class of the example. Another use of higher-order interactions is to gate some features. Sohn et al. (2013) introduced a Boltzmann machine with third-order interactions and binary mask variables associated with each visible unit. When these masking variables are set to zero, they remove the influence of a visible unit on the hidden units. This allows visible units that are not relevant to the classification problem to be removed from the inference pathway that estimates the class.

More generally, the Boltzmann machine framework is a rich space of models
permitting many more model structures than have been explored so far. Developing a new form of Boltzmann machine requires somewhat more care and creativity than developing a new neural network layer, because it is often difficult to find an energy function that maintains tractability of all of the different conditional distributions needed to use the Boltzmann machine. But despite this required effort, the field remains open to innovation.
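The contrast between the second-order terms v_i W_{i,j} h_j and the higher-order interactions surveyed above can be made concrete with a small sketch. The tensors below are random placeholders, not parameters of any published model.

```python
import numpy as np

rng = np.random.default_rng(2)
n_v, n_h, n_y = 3, 4, 2
v = rng.integers(0, 2, n_v).astype(float)   # visible units
h = rng.integers(0, 2, n_h).astype(float)   # hidden units
y = np.eye(n_y)[0]                          # one-hot class variable

W = rng.standard_normal((n_v, n_h))         # second-order weights
T = rng.standard_normal((n_v, n_h, n_y))    # third-order weight tensor

E_second = -v @ W @ h                             # sum_ij v_i W_ij h_j
E_third = -np.einsum('i,j,k,ijk->', v, h, y, T)   # sum_ijk v_i h_j y_k T_ijk

# With a one-hot y, the three-way term reduces to a second-order term
# whose weight matrix T[:, :, k] is selected by the class, i.e. the class
# changes the relationship between visible and hidden units.
k = int(np.argmax(y))
assert np.isclose(E_third, -v @ T[:, :, k] @ h)
```

The same algebra explains the gating interpretation: one variable selects which effective pairwise weights the other two variables interact through.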
20.9
Back-Propagation through Random Operations
Traditional neural networks implement a deterministic transformation of some input variables x. When developing generative models, we often wish to extend neural networks to implement stochastic transformations of x. One straightforward way to do this is to augment the neural network with extra inputs z that are sampled from some simple probability distribution, such as a uniform or Gaussian distribution. The neural network can then continue to perform deterministic computation internally, but the function f(x, z) will appear stochastic to an observer who does not have access to z. Provided that f is continuous and differentiable, we can then compute the gradients necessary for training using back-propagation as usual.

As an example, let us consider the operation consisting of drawing samples y from a Gaussian distribution with mean µ and variance σ²:

y ∼ N(µ, σ²).    (20.55)

Because an individual sample of y is not produced by a function, but rather by a sampling process whose output changes every time we query it, it may seem counterintuitive to take the derivatives of y with respect to the parameters of
its distribution, µ and σ2 . Ho How wev ever, er, w wee can rewrite the sampling pro process cess as transforming an underlying random value z ∼ N (z ; 0, 1) to obtain a sample from µ and σ . However, we can rewrite the sampling process as its distribution, the desired distribution: transforming an underlying randomy v=alue from µ +zσz (z ; 0, 1) to obtain a sample (20.56) the desired distribution: ∼N y = through µ + σz the sampling op We are no now w able to bac back-propagate k-propagate operation, eration, by (20.56) regarding it as a deterministic op operation eration with an extra input z. Crucially Crucially,, the extra input W e are no w able to bac k-propagate through the sampling eration, regardis a random variable whose distribution is not a function of op any of theby variables ing it as a deterministic op eration with an extra input z . Crucially , the extra input whose deriv derivatives atives we wan wantt to calculate. The result tells us how an infinitesimal a random is we notcould a function of the op variables cishange in µ orvariable changedistribution the output if rep repeat eat of theany sampling operation eration σ wouldwhose whose deriv atives w e wan t to calculate. The result tells us how an infinitesimal again with the same value of z. change in µ or σ would change the output if we could repeat the sampling operation Being back-propagate through this sampling op operation eration allows us to again withable the to same value of z. incorp incorporate orate it into a larger graph. We can build elements of the graph on top of the Being ablesampling to back-propagate this sampling operation us to output of the distribution.through For example, we can compute theallows deriv derivativ ativ atives es incorp orate it into a larger graph. W e can build elements of the graph on top of the of some loss function J (y). We can also build elements of the graph whose outputs output the sampling distribution. 
Forsampling example,op we can compute the deriv es are the of inputs or the parameters of the operation. eration. For example, we ativ could of some loss function ) . W e can also build elements of the graph whose outputs J ( y build a larger graph with µ = f( x; θ) and σ = g(x ; θ). In this augmented graph, are the use inputs or the parameters of thethese sampling operation. Fore example, w e can bac back-propagation k-propagation through functions to deriv derive ∇ θJ (y). we could build a larger graph with µ = f( x; θ) and σ = g(x ; θ). In this augmented graph, Theuse principle used in this through Gaussianthese sampling example is more generally appliwe can back-propagation functions to deriv e J (y). cable. We can express an any y probability distribution of the form p (y; θ) or p( y | x; θ ) ∇ generally appliThe principle used in Gaussian sampling is more θ , and as p (y | ω ), where ω is a vthis ariable con containing taining both example parameters if applicable, p ()y,; where θ) or pω( yma cable. We can expressa an y probability of the pform xy; θin) x . Given y sampleddistribution (y | ω the inputs value from distribution may as p (ybe ω , where ωofisother a variable containing parameters θ , and if applicable, | turn a )function variables, we canboth rewrite x . Given a value y sampled from distribution p(y ω ), where ω may in the inputs | turn be a function of other variables, y ∼we p(ycan | ωrewrite ) (20.57) | as
y ∼ p(y | ω)    (20.57)

as

y = f(z; ω),    (20.58)

where z is a source of randomness. We may then compute the derivatives of y with respect to ω using traditional tools such as the back-propagation algorithm applied to f, so long as f is continuous and differentiable almost everywhere. Crucially, ω must not be a function of z, and z must not be a function of ω. This technique is often called the reparametrization trick, stochastic back-propagation or perturbation analysis.

The requirement that f be continuous and differentiable of course requires y to be continuous. If we wish to back-propagate through a sampling process that produces discrete-valued samples, it may still be possible to estimate a gradient on ω, using reinforcement learning algorithms such as variants of the REINFORCE algorithm (Williams, 1992), discussed in Sec. 20.9.1.
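As a concrete sketch of the trick (a made-up numpy example, not from the book): with the toy cost J(y) = y², the expected cost E[J(y)] = µ² + σ² has known gradients 2µ and 2σ, so the Monte Carlo gradients obtained by differentiating through y = µ + σz can be checked against them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up cost for illustration: J(y) = y^2, so E[J(y)] = mu^2 + sigma^2
# and the true gradients are 2*mu and 2*sigma.
mu, sigma = 1.5, 0.5

# Sample the fixed noise z ~ N(0, 1), then apply the deterministic map (Eq. 20.56).
z = rng.standard_normal(100_000)
y = mu + sigma * z

# Back-propagate through y = mu + sigma * z:
# dJ/dy = 2y, dy/dmu = 1, dy/dsigma = z.
grad_mu = np.mean(2 * y)
grad_sigma = np.mean(2 * y * z)

print(grad_mu, grad_sigma)  # close to 2*mu = 3.0 and 2*sigma = 1.0
```

Averaging the per-sample gradients over many draws of z recovers the gradient of the expectation; a single draw gives the noisy estimate typically used in training.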
CHAPTER 20. DEEP GENERATIVE MODELS
In neural network applications, we typically choose z to be drawn from some simple distribution, such as a unit uniform or unit Gaussian distribution, and achieve more complex distributions by allowing the deterministic portion of the network to reshape its input.

The idea of propagating gradients or optimizing through stochastic operations dates back to the mid-twentieth century (Price, 1958; Bonnet, 1964) and was first used for machine learning in the context of reinforcement learning (Williams, 1992). More recently, it has been applied to variational approximations (Opper and Archambeau, 2009) and stochastic or generative neural networks (Bengio et al., 2013b; Kingma, 2013; Kingma and Welling, 2014b,a; Rezende et al., 2014; Goodfellow et al., 2014c).
Many networks, such as denoising autoencoders or networks regularized with dropout, are also naturally designed to take noise as an input without requiring any special reparametrization to make the noise independent from the model.
20.9.1 Back-Propagating through Discrete Stochastic Operations
When a model emits a discrete variable y, the reparametrization trick is not applicable. Suppose that the model takes inputs x and parameters θ, both encapsulated in the vector ω, and combines them with random noise z to produce y:

y = f(z; ω).    (20.59)

Because y is discrete, f must be a step function. The derivatives of a step function are not useful at any point. Right at each step boundary, the derivatives are undefined, but that is a small problem. The large problem is that the derivatives are zero almost everywhere, on the regions between step boundaries. The derivatives of any cost function J(y) therefore do not give any information for how to update the model parameters θ.

The REINFORCE algorithm (REward Increment = Non-negative Factor ×
Offset Reinforcement × Characteristic Eligibility) provides a framework defining a family of simple but powerful solutions (Williams, 1992). The core idea is that even though J(f(z; ω)) is a step function with useless derivatives, the expected cost E_{z∼p(z)} J(f(z; ω)) is often a smooth function amenable to gradient descent. Although that expectation is typically not tractable when y is high-dimensional (or is the result of the composition of many discrete stochastic decisions), it can be estimated without bias using a Monte Carlo average. The stochastic estimate of the gradient can be used with SGD or other stochastic gradient-based optimization techniques.
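A minimal illustration of this Monte Carlo estimate, using a made-up Bernoulli example rather than anything from the book: for y ∼ Bernoulli(sigmoid(ω)), the score ∂ log p(y)/∂ω works out to y − sigmoid(ω), so the sampled estimator can be checked against the exact gradient of the expected cost.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(w):
    return 1.0 / (1.0 + np.exp(-w))

# Made-up setup: y ~ Bernoulli(sigmoid(w)) with arbitrary costs J(0), J(1).
w = 0.3
J = np.array([2.0, 0.5])   # J[0] = J(y=0), J[1] = J(y=1)
p1 = sigmoid(w)

# Monte Carlo REINFORCE estimate: average of J(y) * d log p(y)/dw,
# where d log p(y)/dw = y - sigmoid(w) for this parametrization.
m = 200_000
y = (rng.random(m) < p1).astype(int)
grad_est = np.mean(J[y] * (y - p1))

# Exact gradient of E[J(y)] = J(1)*p1 + J(0)*(1 - p1), for comparison.
grad_exact = (J[1] - J[0]) * p1 * (1 - p1)

print(grad_est, grad_exact)
```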
The simplest version of REINFORCE can be derived by simply differentiating the expected cost:

E_z[J(y)] = Σ_y J(y) p(y)    (20.60)

∂E[J(y)]/∂ω = Σ_y J(y) ∂p(y)/∂ω    (20.61)

= Σ_y J(y) p(y) ∂ log p(y)/∂ω    (20.62)

≈ (1/m) Σ_{i=1}^{m} J(y^(i)) ∂ log p(y^(i))/∂ω,  with y^(i) ∼ p(y).    (20.63)

Eq. 20.61 relies on the assumption that J does not reference ω directly. It is trivial to extend the approach to relax this assumption. Eq. 20.62 exploits the derivative rule for the logarithm, ∂ log p(y)/∂ω = (1/p(y)) ∂p(y)/∂ω. Eq. 20.63 gives an unbiased Monte Carlo estimator of the gradient.

Anywhere we write p(y) in this section, one could equally write p(y | x). This is because p(y) is parametrized by ω, and ω contains both θ and x, if x is present.

One issue with the above simple REINFORCE estimator is that it has a very
high variance, so that many samples of y need to be drawn to obtain a good estimator of the gradient, or equivalently, if only one sample is drawn, SGD will converge very slowly and will require a smaller learning rate. It is possible to considerably reduce the variance of that estimator by using variance reduction methods (Wilson, 1984; L'Ecuyer, 1994). The idea is to modify the estimator so that its expected value remains unchanged but its variance gets reduced. In the context of REINFORCE, the proposed variance reduction methods involve the computation of a baseline that is used to offset J(y). Note that any offset b(ω) that does not depend on y would not change the expectation of the estimated gradient because

E_{p(y)}[∂ log p(y)/∂ω] = Σ_y p(y) ∂ log p(y)/∂ω    (20.64)

= Σ_y ∂p(y)/∂ω    (20.65)

= ∂/∂ω Σ_y p(y) = ∂/∂ω 1 = 0,    (20.66)
which means that

E_{p(y)}[(J(y) − b(ω)) ∂ log p(y)/∂ω] = E_{p(y)}[J(y) ∂ log p(y)/∂ω] − b(ω) E_{p(y)}[∂ log p(y)/∂ω]    (20.67)

= E_{p(y)}[J(y) ∂ log p(y)/∂ω].    (20.68)

Furthermore, we can obtain the optimal b(ω) by computing the variance of (J(y) − b(ω)) ∂ log p(y)/∂ω under p(y) and minimizing with respect to b(ω). What we find is that this optimal baseline b*(ω)_i is different for each element ω_i of the vector ω:

b*(ω)_i = E_{p(y)}[J(y) (∂ log p(y)/∂ω_i)²] / E_{p(y)}[(∂ log p(y)/∂ω_i)²].    (20.69)

The gradient estimator with respect to ω_i then becomes

(J(y) − b(ω)_i) ∂ log p(y)/∂ω_i,    (20.70)

where b(ω)_i estimates the above b*(ω)_i. The estimate b is usually obtained by adding extra outputs to the neural network and training the new outputs to estimate E_{p(y)}[J(y) (∂ log p(y)/∂ω_i)²] and E_{p(y)}[(∂ log p(y)/∂ω_i)²] for each element of ω. These extra outputs can be trained with the mean squared error objective, using respectively J(y) (∂ log p(y)/∂ω_i)² and (∂ log p(y)/∂ω_i)² as targets when y is sampled from p(y), for a given ω.
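For a distribution with only two outcomes, Eq. 20.69 can be evaluated in closed form. The sketch below (the same made-up Bernoulli costs as in the earlier example, with scalar ω) computes b* and checks that it yields lower variance than either no baseline or the simple choice b = E[J(y)]; for a two-outcome variable the optimal baseline in fact drives the variance all the way to zero.

```python
import numpy as np

sigmoid = lambda w: 1.0 / (1.0 + np.exp(-w))

# Made-up Bernoulli model: y ~ Bernoulli(sigmoid(w)), costs J(0) = 2.0, J(1) = 0.5.
w, J = 0.3, np.array([2.0, 0.5])
p1 = sigmoid(w)
p = np.array([1 - p1, p1])   # p(y = 0), p(y = 1)
y = np.array([0, 1])
s = y - p1                   # score d log p(y)/dw for each outcome

# Eq. 20.69 (scalar omega): b* = E[J(y) s^2] / E[s^2].
b_star = np.sum(p * J * s**2) / np.sum(p * s**2)

# Exact variance of the estimator (J(y) - b) * s under p(y), for any baseline b.
def grad_var(b):
    vals = (J - b) * s
    mean = np.sum(p * vals)
    return np.sum(p * (vals - mean) ** 2)

print(grad_var(0.0), grad_var(np.sum(p * J)), grad_var(b_star))
```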
The estimate b may then be recovered by substituting these estimates into Eq. 20.69. Mnih and Gregor (2014) preferred to use a single shared output (across all elements i of ω) trained with the target J(y), using as baseline b(ω) ≈ E_{p(y)}[J(y)].

Variance reduction methods have been introduced in the reinforcement learning context (Sutton et al., 2000; Weaver and Tao, 2001), generalizing previous work on the case of binary reward by Dayan (1990). See Bengio et al. (2013b), Mnih and Gregor (2014), Ba et al. (2014), Mnih et al. (2014), or Xu et al. (2015) for examples of modern uses of the REINFORCE algorithm with reduced variance in the context of deep learning. In addition to the use of an input-dependent baseline b(ω), Mnih and Gregor (2014) found that the scale of (J(y) − b(ω)) could be
In addition to the use of an input-dep endent baseline adjusted during training by dividing it by its standard deviation estimated by a ), Mnih and during Gregortraining, (2014) found that of theadaptiv scale eoflearning (J (y ) rate, be b( ωving b( ω))tocould mo moving average as a kind adaptive counter adjusted training by dividing itsduring standard ya the effectduring of important variations thatitoby ccur thedeviation course−of estimated training inbthe moving average training, kind of (adaptiv e learning rate, to varianc countere magnitude of thisduring quantit quantity y. Mnih as anda Gregor 2014) called this heuristic variance the effect of important v ariations that o ccur during the course of training in the normalization normalization.. magnitude of this quantity. Mnih and Gregor (2014) called this heuristic variance 693 normalization.
REINFORCE-based estimators can be understood as estimating the gradient by correlating choices of y with corresponding values of J(y). If a good value of y is unlikely under the current parametrization, it might take a long time to obtain it by chance, and get the required signal that this configuration should be reinforced.
20.10 Directed Generative Nets
As discussed in Chapter 16, directed graphical models make up a prominent class of graphical models. While directed graphical models have been very popular within the greater machine learning community, within the smaller deep learning community they have until roughly 2013 been overshadowed by undirected models such as the RBM.

In this section we review some of the standard directed graphical models that have traditionally been associated with the deep learning community.

We have already described deep belief networks, which are a partially directed model. We have also already described sparse coding models, which can be thought of as shallow directed generative models. They are often used as feature learners in the context of deep learning, though they tend to perform poorly at sample generation and density estimation. We now describe a variety of deep, fully directed models.
20.10.1 Sigmoid Belief Nets
Sigmoid belief networks (Neal, 1990) are a simple form of directed graphical model with a specific kind of conditional probability distribution. In general, we can think of a sigmoid belief network as having a vector of binary states s, with each element of the state influenced by its ancestors:

p(s_i) = σ( Σ_{j<i} W_{j,i} s_j + b_i ).    (20.71)
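Ancestral sampling from Eq. 20.71 is straightforward to sketch. The weights below are made up for illustration; the lower-triangular matrix ensures each unit s_i depends only on the units sampled before it.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# Hypothetical tiny sigmoid belief net over 5 binary units.
# Row i of W holds the weights from ancestors j < i into unit i.
n = 5
W = np.tril(rng.normal(size=(n, n)), k=-1)
b = rng.normal(size=n)

def ancestral_sample():
    s = np.zeros(n)
    for i in range(n):                  # sample units in topological order
        p_i = sigmoid(W[i] @ s + b[i])  # only s[0:i] contribute (Eq. 20.71)
        s[i] = rng.random() < p_i
    return s

sample = ancestral_sample()
print(sample)  # a vector of binary states
```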
The most common structure of sigmoid belief network is one that is divided into many layers, with ancestral sampling proceeding through a series of many hidden layers and then ultimately generating the visible layer. This structure is very similar to the deep belief network, except that the units at the beginning of the sampling process are independent from each other, rather than sampled from a restricted Boltzmann machine. Such a structure is interesting for a variety of
reasons. One reason is that the structure is a universal approximator of probability distributions over the visible units, in the sense that it can approximate any probability distribution over binary variables arbitrarily well, given enough depth, even if the width of the individual layers is restricted to the dimensionality of the visible layer (Sutskever and Hinton, 2008).

While generating a sample of the visible units is very efficient in a sigmoid belief network, most other operations are not. Inference over the hidden units given the visible units is intractable. Mean field inference is also intractable because the variational lower bound involves taking expectations of cliques that encompass entire layers. This problem has remained difficult enough to restrict the popularity of directed discrete networks.

One approach for performing inference in a sigmoid belief network is to construct
a different lower bound that is specialized for sigmoid belief networks (Saul et al., 1996). This approach has only been applied to very small networks. Another approach is to use learned inference mechanisms as described in Sec. 19.5. The Helmholtz machine (Dayan et al., 1995; Dayan and Hinton, 1996) is a sigmoid belief network combined with an inference network that predicts the parameters of the mean field distribution over the hidden units. Modern approaches (Gregor et al., 2014; Mnih and Gregor, 2014) to sigmoid belief networks still use this inference network approach. These techniques remain difficult due to the discrete nature of the latent variables. One cannot simply back-propagate through the output of the inference network, but instead must use the relatively unreliable machinery for back-propagating through discrete sampling processes, described in Sec. 20.9.1.
Recent approaches based on importance reweighted wake-sleep (Bornschein and Bengio, 2015) and bidirectional Helmholtz machines (Bornschein et al., 2015) make it possible to quickly train sigmoid belief networks and reach state-of-the-art performance on benchmark tasks.

A special case of sigmoid belief networks is the case where there are no latent variables. Learning in this case is efficient, because there is no need to marginalize latent variables out of the likelihood. A family of models called auto-regressive networks generalize this fully visible belief network to other kinds of variables besides binary variables and other structures of conditional distributions besides log-linear relationships. Auto-regressive networks are described later, in Sec. 20.10.7.
20.10.2 Differentiable Generator Nets
Many generative models are based on the idea of using a differentiable generator network. The model transforms samples of latent variables z to samples x or
to distributions over samples x using a differentiable function g(z; θ^(g)) which is typically represented by a neural network. This model class includes variational autoencoders, which pair the generator net with an inference net, generative adversarial networks, which pair the generator network with a discriminator network, and techniques that train generator networks in isolation.

Generator networks are essentially just parametrized computational procedures for generating samples, where the architecture provides the family of possible distributions to sample from and the parameters select a distribution from within that family.

As an example, the standard procedure for drawing samples from a normal distribution with mean µ and covariance Σ is to feed samples z from a normal distribution with zero mean and identity covariance into a very simple generator network.
This generator network contains just one affine layer:

x = g(z) = µ + Lz,    (20.72)

where L is given by the Cholesky decomposition of Σ.
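A numpy sketch of this one-affine-layer generator, with a made-up mean and covariance: pushing unit-Gaussian noise through x = µ + Lz produces samples whose empirical moments match the target distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up target Gaussian: mean mu, covariance Sigma.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
L = np.linalg.cholesky(Sigma)   # Sigma = L @ L.T

# Eq. 20.72: one affine layer applied to z ~ N(0, I).
z = rng.standard_normal((100_000, 2))
x = mu + z @ L.T

print(x.mean(axis=0))           # close to mu
print(np.cov(x, rowvar=False))  # close to Sigma
```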
Pseudorandom number generators can also use nonlinear transformations of simple distributions. For example, inverse transform sampling (Devroye, 2013) draws a scalar z from U(0, 1) and applies a nonlinear transformation to a scalar x. In this case g(z) is given by the inverse of the cumulative distribution function F(x) = ∫_{−∞}^{x} p(v) dv. If we are able to specify p(x), integrate over x, and invert the resulting function, we can sample from p(x) without using machine learning.

To generate samples from more complicated distributions that are difficult to specify directly, difficult to integrate over, or whose resulting integrals are difficult to invert, we use a feedforward network to represent a parametric family of nonlinear functions g, and use training data to infer the parameters selecting the desired function.

We can think of g as providing a nonlinear change of variables that transforms
the distribution over z into the desired distribution over x.

Recall from Eq. 3.47 that, for invertible, differentiable, continuous g,

p_z(z) = p_x(g(z)) |det(∂g/∂z)|.    (20.73)

This implicitly imposes a probability distribution over x:

p_x(x) = p_z(g^{−1}(x)) / |det(∂g/∂z)|.    (20.74)
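These formulas can be verified on a simple made-up example: inverse transform sampling for an Exponential(λ) distribution, where g(z) = −ln(1 − z)/λ maps z ∼ U(0, 1) to x, and Eq. 20.74 with p_z = 1 on (0, 1) should recover the density λe^{−λx}.

```python
import numpy as np

lam = 2.0

# Inverse transform sampling for Exponential(lam):
# F(x) = 1 - exp(-lam * x), so g(z) = F^{-1}(z) = -ln(1 - z)/lam.
g = lambda z: -np.log1p(-z) / lam
g_inv = lambda x: -np.expm1(-lam * x)      # 1 - exp(-lam * x)
dg_dz = lambda z: 1.0 / (lam * (1.0 - z))  # derivative of g

# Eq. 20.74 with p_z(z) = 1 on (0, 1): p_x(x) = 1 / |dg/dz| at z = g^{-1}(x).
x = np.linspace(0.1, 2.0, 5)
p_x = 1.0 / np.abs(dg_dz(g_inv(x)))

print(np.allclose(p_x, lam * np.exp(-lam * x)))  # recovers the target density
```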
Of course, this formula may be difficult to evaluate, depending on the choice of g, so we often use indirect means of learning g, rather than trying to maximize log p(x) directly.

In some cases, rather than using g to provide a sample of x directly, we use g to define a conditional distribution over x. For example, we could use a generator net whose final layer consists of sigmoid outputs to provide the mean parameters of Bernoulli distributions:

p(x_i = 1 | z) = g(z)_i.    (20.75)

In this case, when we use g to define p(x | z), we impose a distribution over x by marginalizing z:

p(x) = E_z p(x | z).    (20.76)

Both approaches define a distribution p_g(x) and allow us to train various criteria of p_g using the reparametrization trick of Sec. 20.9.

The two different approaches to formulating generator nets—emitting the parameters of a conditional distribution versus directly emitting samples—have
When the generator net defines a conditional distribution over x, it is capable of generating discrete data as well as continuous data. When the generator net provides samples directly, it is capable of generating only continuous data (we could introduce discretization in the forward propagation, but this would lose the ability to learn the model using back-propagation). The advantage to direct sampling is that we are no longer forced to use conditional distributions whose form can be easily written down and algebraically manipulated by a human designer.

Approaches based on differentiable generator networks are motivated by the success of gradient descent applied to differentiable feedforward networks for classification. In the context of supervised learning, deep feedforward networks trained with gradient-based learning seem practically guaranteed to succeed given enough hidden units and enough training data. Can this same recipe for success transfer to generative modeling?

Generative modeling seems to be more difficult than classification or regression because the learning process requires optimizing intractable criteria. In the context of differentiable generator nets, the criteria are intractable because the data does not specify both the inputs z and the outputs x of the generator net. In the case of supervised learning, both the inputs x and the outputs y were given, and the optimization procedure needs only to learn how to produce the specified mapping. In the case of generative modeling, the learning procedure needs to determine how to arrange z space in a useful way and additionally how to map from z to x.
Dosovitskiy et al. (2015) studied a simplified problem, where the correspondence between z and x is given. Specifically, the training data is computer-rendered imagery of chairs. The latent variables z are parameters given to the rendering engine describing the choice of which chair model to use, the position of the chair, and other configuration details that affect the rendering of the image. Using this synthetically generated data, a convolutional network is able to learn to map z descriptions of the content of an image to x approximations of rendered images. This suggests that contemporary differentiable generator networks have sufficient model capacity to be good generative models, and that contemporary optimization algorithms have the ability to fit them. The difficulty lies in determining how to train generator networks when the value of z for each x is not fixed and known ahead of time.

The following sections describe several approaches to training differentiable generator nets given only training samples of x.
20.10.3 Variational Autoencoders
The variational autoencoder or VAE (Kingma, 2013; Rezende et al., 2014) is a directed model that uses learned approximate inference and can be trained purely with gradient-based methods.

To generate a sample from the model, the VAE first draws a sample z from the code distribution p_model(z). The sample is then run through a differentiable generator network g(z). Finally, x is sampled from a distribution p_model(x; g(z)) = p_model(x | z). However, during training, the approximate inference network (or encoder) q(z | x) is used to obtain z, and p_model(x | z) is then viewed as a decoder network.

The key insight behind variational autoencoders is that they may be trained by maximizing the variational lower bound L(q) associated with data point x:

L(q) = E_{z∼q(z|x)} log p_model(z, x) + H(q(z | x))   (20.77)
     = E_{z∼q(z|x)} log p_model(x | z) − D_KL(q(z | x) ‖ p_model(z))   (20.78)
     ≤ log p_model(x).   (20.79)

In Eq. 20.77, we recognize the first term as the joint log-likelihood of the visible and hidden variables under the approximate posterior over the latent variables (just like with EM, except that we use an approximate rather than the exact posterior). We recognize also a second term, the entropy of the approximate posterior. When q is chosen to be a Gaussian distribution, with noise added to a predicted mean value, maximizing this entropy term encourages increasing the standard deviation
of this noise. More generally, this entropy term encourages the variational posterior to place high probability mass on many z values that could have generated x, rather than collapsing to a single point estimate of the most likely value. In Eq. 20.78, we recognize the first term as the reconstruction log-likelihood found in other autoencoders. The second term tries to make the approximate posterior distribution q(z | x) and the model prior p_model(z) approach each other.

Traditional approaches to variational inference and learning infer q via an optimization algorithm, typically iterated fixed point equations (Sec. 19.4). These approaches are slow and often require the ability to compute E_{z∼q} log p_model(z, x) in closed form. The main idea behind the variational autoencoder is to train a parametric encoder (also sometimes called an inference network or recognition model) that produces the parameters of q. So long as z is a continuous variable, we can then back-propagate through samples of z drawn from q(z | x) = q(z; f(x; θ)) in order to obtain a gradient with respect to θ. Learning then consists solely of maximizing L with respect to the parameters of the encoder and decoder. All of the expectations in L may be approximated by Monte Carlo sampling.

The variational autoencoder approach is elegant, theoretically pleasing, and simple to implement. It also obtains excellent results and is among the state of the art approaches to generative modeling. Its main drawback is that samples from variational autoencoders trained on images tend to be somewhat blurry. The causes of this phenomenon are not yet known. One possibility is that the blurriness is an intrinsic effect of maximum likelihood, which minimizes D_KL(p_data ‖ p_model). As illustrated in Fig. 3.6, this means that the model will assign high probability to points that occur in the training set, but may also assign high probability to other points. These other points may include blurry images. Part of the reason that the model would choose to put probability mass on blurry images rather than some other part of the space is that the variational autoencoders used in practice usually have a Gaussian distribution for p_model(x; g(z)). Maximizing a lower bound on the likelihood of such a distribution is similar to training a traditional autoencoder with mean squared error, in the sense that it has a tendency to ignore features of the input that occupy few pixels or that cause only a small change in the brightness of the pixels that they occupy. This issue is not specific to VAEs and is shared with generative models that optimize a log-likelihood, or equivalently, D_KL(p_data ‖ p_model), as argued by Theis et al. (2015) and by Huszar (2015). Another troubling issue with contemporary VAE models is that they tend to use only a small subset of the dimensions of z, as if the encoder was not able to transform enough of the local directions in input space to a space where the marginal distribution matches the factorized prior.
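The training criterion described above can be sketched numerically. The fragment below evaluates the bound of Eq. 20.78 for one data point, using an analytic KL term for a Gaussian q and the reparametrization z = μ + σε; the tiny linear encoder and decoder (and all their weights) are hypothetical stand-ins for learned networks, purely for illustration.

```python
import numpy as np

# Single-data-point sketch of the bound in Eq. 20.78 with the
# reparametrization trick. The linear "encoder" and "decoder" here are
# illustrative placeholders, not trained networks.
rng = np.random.default_rng(0)
x = np.array([0.5, -1.0])                    # one 2-D observation
We, Wd = np.eye(2) * 0.8, np.eye(2) * 1.1    # toy encoder/decoder weights

def elbo(x, n_samples=1000):
    # encoder: q(z | x) = N(mu, diag(sigma^2)) with parameters f(x; theta)
    mu, log_sigma = We @ x, np.full(2, -1.0)
    sigma = np.exp(log_sigma)
    # analytic KL(q(z|x) || p(z)) against a standard normal prior
    kl = 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * log_sigma)
    # Monte Carlo estimate of E_q log p(x | z), with z = mu + sigma * eps
    eps = rng.standard_normal((n_samples, 2))
    z = mu + sigma * eps
    recon = Wd @ z.T                         # decoder mean g(z)
    # log density of a unit-variance Gaussian p(x | z) = N(x; g(z), I)
    log_px_z = -0.5 * np.sum((x[:, None] - recon)**2, axis=0) \
               - np.log(2.0 * np.pi)
    return np.mean(log_px_z) - kl

bound = elbo(x)
assert np.isfinite(bound)
```

Because z is expressed as a deterministic function of μ, σ, and the noise ε, the same Monte Carlo estimate would be differentiable with respect to the encoder and decoder parameters in an autodiff framework.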
The VAE framework is very straightforward to extend to a wide range of model architectures. This is a key advantage over Boltzmann machines, which require extremely careful model design to maintain tractability. VAEs work very well with a diverse family of differentiable operators. One particularly sophisticated VAE is the deep recurrent attention writer or DRAW model (Gregor et al., 2015). DRAW uses a recurrent encoder and recurrent decoder combined with an attention mechanism. The generation process for the DRAW model consists of sequentially visiting different small image patches and drawing the values of the pixels at those points. VAEs can also be extended to generate sequences by defining variational RNNs (Chung et al., 2015b) by using a recurrent encoder and decoder within the VAE framework. Generating a sample from a traditional RNN involves only non-deterministic operations at the output space. Variational RNNs also have
Variational RNNs alsowithin hav havee the V AE framew ork. Generating a sample from a traditional RNN in v olv es random variabilit ariability y at the poten otentially tially more abstract level captured by the Vonly AE non-deterministic laten latentt variables. operations at the output space. Variational RNNs also have random variability at the potentially more abstract level captured by the VAE The VAE framew framework ork has been extended to maximize not just the traditional laten t variables. variational low lower er bound, but instead the imp importanc ortanc ortancee weighte weighted d auto autoenc enc enco oder (Burda The V AE framew ork has been extended to maximize not just the traditional et al. al.,, 2015) ob objectiv jectiv jective: e: variational lower bound, but instead the importance weighted autoencoder (Burda " # k et al., 2015) ob jective: 1 X pmodel(x, z(i) ) Lk(x, q) = Ez (1) ,...,z(k) ∼q(z|x) log . (20.80) (i) | x) k q ( z 1 i=1 p (x, z ) E (x, q) = log . (20.80) k ) q(zlow L when k = 1. This new Lob objectiv jectiv jectivee is equiv equivalent alent to the traditional lower erxbound | of the true log pmodel (x) Ho How wev ever, er, it may also be in interpreted terpreted as forming an estimate k = 1. This new ob jectiv e is equiv alent to the traditional low er bound when using imp importance ortance sampling of z from prop proposal importance ortance " osal distribution q( z | x#). The imp log pbecomes (x ) Ho wev er,auto it may also in terpreted as aforming an estimate the true p model (xL) and w eigh eighted ted autoenco enco encoder derbe ob objectiv jectiv jective e is also low lower erX bound on log of z q ( using imp ortance sampling of from prop osal distribution ) . The imp ortance z x tigh tighter ter as k increases. (x) and becomes weighted autoencoder ob jective is also a lower bound on log p | Variational auto autoenco enco encoders ders hav havee some interesting connections to the MP-DBM tighter as k increases. 
Variational autoencoders have some interesting connections to the MP-DBM and other approaches that involve back-propagation through the approximate inference graph (Goodfellow et al., 2013b; Stoyanov et al., 2011; Brakel et al., 2013). These previous approaches required an inference procedure such as mean field fixed point equations to provide the computational graph. The variational autoencoder is defined for arbitrary computational graphs, which makes it applicable to a wider range of probabilistic model families because there is no need to restrict the choice of models to those with tractable mean field fixed point equations. The variational autoencoder also has the advantage that it increases a bound on the log-likelihood of the model, while the criteria for the MP-DBM and related models are more heuristic and have little probabilistic interpretation beyond making the results of approximate inference accurate. One disadvantage of the variational autoencoder
One disadv disadvantage antage of the variational auto autoenco enco encoder der heuristic havan e little probabilistic interpretation b eyond making the results of. z is that it and learns inference net for only one problem, inferring giv network work given en x approximate inference accurate. One disadvantage of the variational autoencoder 700 only one problem, inferring z given x. is that it learns an inference network for
The older methods are able to perform approximate inference over any subset of variables given any other subset of variables, because the mean field fixed point equations specify how to share parameters between the computational graphs for all of these different problems.

One very nice property of the variational autoencoder is that simultaneously training a parametric encoder in combination with the generator network forces the model to learn a predictable coordinate system that the encoder can capture. This makes it an excellent manifold learning algorithm. See Fig. 20.6 for examples of low-dimensional manifolds learned by the variational autoencoder. In one of the cases demonstrated in the figure, the algorithm discovered two independent factors of variation present in images of faces: angle of rotation and emotional expression.
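The kind of visualization shown in Fig. 20.6 can be sketched as follows: decode a uniform 2-D grid of codes z through the model's decoder and tile the results into one large image. The `decode` function below is a hypothetical placeholder for a trained p(x | z) mean function producing 28x28 images.

```python
import numpy as np

# Sketch of the Fig. 20.6-style visualization: decode a uniform 2-D
# grid of codes into image tiles. `decode` is a placeholder for a
# trained decoder; here it just maps a code to a synthetic 28x28 array.
def decode(z):
    # hypothetical stand-in for the mean of p(x | z)
    return np.outer(np.full(28, z[0]), np.full(28, z[1]))

grid = np.linspace(-2.0, 2.0, 10)            # 10x10 grid of codes
tiles = np.array([[decode(np.array([zi, zj])) for zj in grid]
                  for zi in grid])           # shape (10, 10, 28, 28)
# Arrange the tiles into one sheet suitable for plotting.
sheet = tiles.transpose(0, 2, 1, 3).reshape(10 * 28, 10 * 28)
assert sheet.shape == (280, 280)
```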
Figure 20.6: Examples of two-dimensional coordinate systems for high-dimensional manifolds, learned by a variational autoencoder (Kingma and Welling, 2014a). Two dimensions may be plotted directly on the page for visualization, so we can gain an understanding of how the model works by training a model with a 2-D latent code, even if we believe the intrinsic dimensionality of the data manifold is much higher. The images shown are not examples from the training set but images x actually generated by the model p(x | z), simply by changing the 2-D “code” z (each image corresponds to a different choice of “code” z on a 2-D uniform grid). (Left) The two-dimensional map of the Frey faces manifold. One dimension that has been discovered (horizontal) mostly corresponds to a rotation of the face, while the other (vertical) corresponds to the emotional expression.
(Right) The two-dimensional map of the MNIST manifold.
20.10.4 Generative Adversarial Networks
Generative adversarial networks or GANs (Goodfellow et al., 2014c) are another generative modeling approach based on differentiable generator networks.

Generative adversarial networks are based on a game theoretic scenario in which the generator network must compete against an adversary. The generator network directly produces samples x = g(z; θ^(g)). Its adversary, the discriminator network, attempts to distinguish between samples drawn from the training data and samples drawn from the generator. The discriminator emits a probability value given by d(x; θ^(d)), indicating the probability that x is a real training example rather than a fake sample drawn from the model.

The simplest way to formulate learning in generative adversarial networks is as a zero-sum game, in which a function v(θ^(g), θ^(d)) determines the payoff of the discriminator. The generator receives −v(θ^(g), θ^(d)) as its own payoff. During learning, each player attempts to maximize its own payoff, so that at convergence

g* = arg min_g max_d v(g, d).   (20.81)

The default choice for v is

v(θ^(g), θ^(d)) = E_{x∼p_data} log d(x) + E_{x∼p_model} log(1 − d(x)).   (20.82)

This drives the discriminator to attempt to learn to correctly classify samples as real or fake. Simultaneously, the generator attempts to fool the classifier into believing its samples are real. At convergence, the generator's samples are indistinguishable from real data, and the discriminator outputs 1/2 everywhere. The discriminator may then be discarded.

The main motivation for the design of GANs is that the learning process requires neither approximate inference nor approximation of a partition function gradient. In the case where max_d v(g, d) is convex in θ^(g) (such as the case where optimization is performed directly in the space of probability density functions), the procedure is guaranteed to converge and is asymptotically consistent.
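A 1-D numeric sketch may make the payoff in Eq. 20.82 concrete. Everything below (real data from N(2, 1), a one-parameter generator g(z) = z + θ_g, a two-parameter logistic discriminator) is an illustrative toy setup, not from the text.

```python
import numpy as np

# Toy evaluation of the GAN payoff v (Eq. 20.82). Real data comes from
# N(2, 1); the generator produces g(z) = z + theta_g with z ~ N(0, 1);
# the discriminator is d(x) = sigmoid(a * (x - b)). All hypothetical.
rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def v(theta_g, a, b, n=20000):
    # v = E_{x~p_data} log d(x) + E_{x~p_model} log(1 - d(x))
    x_real = rng.normal(2.0, 1.0, n)
    x_fake = rng.normal(0.0, 1.0, n) + theta_g
    return (np.mean(np.log(sigmoid(a * (x_real - b)))) +
            np.mean(np.log(1.0 - sigmoid(a * (x_fake - b)))))

# A blind discriminator (a = 0) outputs 1/2 everywhere: v = 2 log(1/2).
assert np.isclose(v(-1.0, 0.0, 0.0), 2.0 * np.log(0.5))
# Against a badly mismatched generator, a separating discriminator
# achieves a strictly higher payoff than the constant-1/2 one.
assert v(-1.0, 2.0, 0.5) > 2.0 * np.log(0.5)
```

In practice the discriminator ascends v while the generator descends it, via simultaneous stochastic gradient steps on θ^(d) and θ^(g).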
Go Goo odfellow Unfortunately , learning in GANs can b e difficult in practice when and d (2014) identified non-con non-conv vergence as an issue that may cause GANs to gunderfit. max v ( g , d aregeneral, represensim tedultaneous by neuralgradien networks andt on two pla )yers’ is not conisvex. odfellow In simultaneous gradient t descen descent play costs not Go guaran guaranteed teed (to 2014 ) identified non-con v ergence as an issue that may cause GANs to underfit. ab,, reach an equilibrium. Consider for example the value function v(a, b ) = ab In general, simultaneous descencost t onabtw players’ costs is not teed a and tincurs b where one play player er controlsgradien , owhile the other pla play yerguaran con controls trols v ( a, b ) = ab to reach an equilibrium. Consider for example the v alue function and receiv receives es a cost −ab . If we mo model del each play player er as making infinitesimally small, where one player controls a and incurs cost ab, while the other player controls b 702 player as making infinitesimally small and receives a cost ab . If we model each −
gradient steps, each player reducing their own cost at the expense of the other player, then a and b go into a stable, circular orbit, rather than arriving at the equilibrium point at the origin. Note that the equilibria for a minimax game are not local minima of v. Instead, they are points that are simultaneously minima for both players' costs. This means that they are saddle points of v that are local minima with respect to the first player's parameters and local maxima with respect to the second player's parameters. It is possible for the two players to take turns increasing then decreasing v forever, rather than landing exactly on the saddle point where neither player is capable of reducing their cost. It is not known to what extent this non-convergence problem affects GANs.
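The failure mode in the v(a, b) = ab example can be simulated directly. This is a self-contained numerical sketch (step size and starting point are arbitrary choices):

```python
# Simultaneous gradient descent on v(a, b) = a*b. Player one minimizes
# ab by descending the gradient with respect to a (which is b); player
# two minimizes -ab by descending the gradient with respect to b (-a).
eta = 0.01                    # step size
a, b = 1.0, 0.0               # start away from the equilibrium (0, 0)
r0 = (a**2 + b**2) ** 0.5
for _ in range(1000):
    da, db = b, -a                        # each player's own gradient
    a, b = a - eta * da, b - eta * db     # simultaneous update
r1 = (a**2 + b**2) ** 0.5
# The pair orbits the origin rather than converging to it; with a finite
# step size each update multiplies the squared radius by (1 + eta**2),
# so the distance to the equilibrium actually grows slightly.
print(r0, r1)
```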
Goodfellow (2014) identified an alternative formulation of the payoffs, in which the game is no longer zero-sum, that has the same expected gradient as maximum likelihood learning whenever the discriminator is optimal. Because maximum likelihood training converges, this reformulation of the GAN game should also converge, given enough samples. Unfortunately, this alternative formulation does not seem to perform well in practice, possibly due to suboptimality of the discriminator, or possibly due to high variance around the expected gradient.

In practice, the best-performing formulation of the GAN game is a different formulation that is neither zero-sum nor equivalent to maximum likelihood, introduced by Goodfellow et al. (2014c) with a heuristic motivation.
In this best-performing formulation, the generator aims to increase the log probability that the discriminator makes a mistake, rather than aiming to decrease the log probability that the discriminator makes the correct prediction. This reformulation is motivated solely by the observation that it causes the derivative of the generator's cost function with respect to the discriminator's logits to remain large even in the situation where the discriminator confidently rejects all generator samples.

Stabilization of GAN learning remains an open problem. Fortunately, GAN learning performs well when the model architecture and hyperparameters are carefully selected. Radford et al. (2015) crafted a deep convolutional GAN (DCGAN) that performs very well for image synthesis tasks, and showed that its latent representation space captures important factors of variation, as shown in Fig. 15.9. See Fig. 20.7 for examples of images generated by a DCGAN generator.
The GAN learning problem can also be simplified by breaking the generation process into many levels of detail. It is possible to train conditional GANs (Mirza and Osindero, 2014) that learn to sample from a distribution p(x | y) rather than simply sampling from a marginal distribution p(x). Denton et al. (2015) showed that a series of conditional GANs can be trained to first generate a very low-resolution version of an image, then incrementally add details to the image.
Figure 20.7: Images generated by GANs trained on the LSUN dataset. (Left) Images of bedrooms generated by a DCGAN model, reproduced with permission from Radford et al. (2015). (Right) Images of churches generated by a LAPGAN model, reproduced with permission from Denton et al. (2015).
This technique is called the LAPGAN model, due to the use of a Laplacian pyramid to generate the images containing varying levels of detail. LAPGAN generators are able to fool not only discriminator networks but also human observers, with experimental subjects identifying up to 40% of the outputs of the network as being real data. See Fig. 20.7 for examples of images generated by a LAPGAN generator.

One unusual capability of the GAN training procedure is that it can fit probability distributions that assign zero probability to the training points. Rather than maximizing the log probability of specific points, the generator net learns to trace out a manifold whose points resemble training points in some way. Somewhat paradoxically, this means that the model may assign a log-likelihood of negative infinity to the test set, while still representing a manifold that a human observer judges to capture the essence of the generation task. This is not clearly an advantage or a disadvantage, and one may also guarantee that the generator network assigns non-zero probability to all points simply by making the last layer of the generator network add Gaussian noise to all of the generated values. Generator networks that add Gaussian noise in this manner sample from the same distribution that one obtains by using the generator network to parametrize the mean of a conditional Gaussian distribution.

Dropout seems to be important in the discriminator network. In particular, units should be stochastically dropped while computing the gradient for the generator network to follow. Following the gradient of the deterministic version of the discriminator with its weights divided by two does not seem to be as effective.
Likewise, never using dropout seems to yield poor results.

While the GAN framework is designed for differentiable generator networks, similar principles can be used to train other kinds of models. For example, self-supervised boosting can be used to train an RBM generator to fool a logistic regression discriminator (Welling et al., 2002).
20.10.5 Generative Moment Matching Networks
Generative moment matching networks (Li et al., 2015; Dziugaite et al., 2015) are another form of generative model based on differentiable generator networks. Unlike VAEs and GANs, they do not need to pair the generator network with any other network: neither an inference network as used with VAEs, nor a discriminator network as used with GANs.

These networks are trained with a technique called moment matching. The basic idea behind moment matching is to train the generator in such a way that many of the statistics of samples generated by the model are as similar as possible to those of the statistics of the examples in the training set. In this context, a moment is an expectation of different powers of a random variable. For example, the first moment is the mean, the second moment is the mean of the squared values, and so on.
In multiple dimensions, each element of the random vector may be raised to different powers, so that a moment may be any quantity of the form

E_x [ Π_i x_i^{n_i} ],   (20.83)

where n = [n_1, n_2, . . . , n_d]^⊤ is a vector of non-negative integers.
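Eq. 20.83 is straightforward to estimate empirically from samples. The sketch below uses a hypothetical toy dataset of independent unit Gaussians, chosen only so the expected values are easy to check:

```python
import numpy as np

rng = np.random.default_rng(0)

def moment(x, n):
    """Empirical estimate of E[prod_i x_i ** n_i] (Eq. 20.83).
    x: samples of shape (num_samples, d); n: d non-negative integers."""
    return np.prod(x ** np.asarray(n), axis=1).mean()

x = rng.normal(size=(100_000, 3))  # toy data: independent N(0, 1) variables

print(moment(x, [1, 0, 0]))  # first moment of x_1: close to 0
print(moment(x, [2, 0, 0]))  # second moment of x_1: close to 1
print(moment(x, [1, 1, 0]))  # cross moment E[x_1 x_2]: close to 0
```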
Upon first examination, this approach seems to be computationally infeasible. For example, if we want to match all the moments of the form x_i x_j, then we need to minimize the difference between a number of values that is quadratic in the dimension of x. Moreover, even matching all of the first and second moments would only be sufficient to fit a multivariate Gaussian distribution, which captures only linear relationships between values. Our ambitions for neural networks are to capture complex nonlinear relationships, which would require far more moments. GANs avoid this problem of exhaustively enumerating all moments by using a dynamically updated discriminator that automatically focuses its attention on whichever statistic the generator network is matching the least effectively.
Instead, generative moment matching networks can be trained by minimizing a cost function called maximum mean discrepancy (Schölkopf and Smola, 2002; Gretton et al., 2012), or MMD. This cost function measures the error in the first moments in an infinite-dimensional space, using an implicit mapping to feature
space defined by a kernel function in order to make computations on infinite-dimensional vectors tractable. The MMD cost is zero if and only if the two distributions being compared are equal.

Visually, the samples from generative moment matching networks are somewhat disappointing. Fortunately, they can be improved by combining the generator network with an autoencoder. First, an autoencoder is trained to reconstruct the training set. Next, the encoder of the autoencoder is used to transform the entire training set into code space. The generator network is then trained to generate code samples, which may be mapped to visually pleasing samples via the decoder.

Unlike GANs, the cost function is defined only with respect to a batch of examples from both the training set and the generator network.
It is not possible to make a training update as a function of only one training example or only one sample from the generator network. This is because the moments must be computed as an empirical average across many samples. When the batch size is too small, MMD can underestimate the true amount of variation in the distributions being sampled. No finite batch size is sufficiently large to eliminate this problem entirely, but larger batches reduce the amount of underestimation. When the batch size is too large, the training procedure becomes infeasibly slow, because many examples must be processed in order to compute a single small gradient step.

As with GANs, it is possible to train a generator net using MMD even if that generator net assigns zero probability to the training points.
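A minimal biased batch estimator of the squared MMD with a Gaussian kernel can be sketched as follows; the kernel bandwidth and the toy distributions standing in for the data and generator batches are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def mmd2(x, y, gamma=0.5):
    """Biased batch estimate of squared MMD with the Gaussian kernel
    k(a, b) = exp(-gamma * ||a - b||^2), for x of shape (n, d) and
    y of shape (m, d)."""
    def gram(a, b):
        sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-gamma * sq_dists)
    return gram(x, x).mean() + gram(y, y).mean() - 2.0 * gram(x, y).mean()

p = rng.normal(0.0, 1.0, size=(500, 2))       # "training" batch
q_same = rng.normal(0.0, 1.0, size=(500, 2))  # "generator" batch, matched
q_diff = rng.normal(2.0, 1.0, size=(500, 2))  # "generator" batch, mismatched

print(mmd2(p, q_same))  # near zero: the distributions agree
print(mmd2(p, q_diff))  # clearly positive: the distributions differ
```

As the text notes, the estimate is defined over a whole batch: with too few samples the empirical averages inside the estimator become unreliable and the discrepancy is underestimated.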
20.10.6 Convolutional Generative Networks
When generating images, it is often useful to use a generator network that includes a convolutional structure (see for example Goodfellow et al. (2014c) or Dosovitskiy et al. (2015)). To do so, we use the "transpose" of the convolution operator, described in Sec. 9.5. This approach often yields more realistic images and does so using fewer parameters than using fully connected layers without parameter sharing.

Convolutional networks for recognition tasks have information flow from the image to some summarization layer at the top of the network, often a class label. As this image flows upward through the network, information is discarded as the representation of the image becomes more invariant to nuisance transformations. In a generator network, the opposite is true. Rich details must be added as
the representation of the image to be generated propagates through the network, culminating in the final representation of the image, which is of course the image itself, in all of its detailed glory, with object positions and poses and textures and
lighting. The primary mechanism for discarding information in a convolutional recognition network is the pooling layer. The generator network seems to need to add information. We cannot put the inverse of a pooling layer into the generator network because most pooling functions are not invertible. A simpler operation is to merely increase the spatial size of the representation. An approach that seems to perform acceptably is to use an "un-pooling" as introduced by Dosovitskiy et al. (2015). This layer corresponds to the inverse of the max-pooling operation under certain simplifying conditions. First, the stride of the max-pooling operation is constrained to be equal to the width of the pooling region. Second, the maximum input within each pooling region is assumed to be the input in the upper-left corner. Finally, all non-maximal inputs within each pooling region are assumed to be zero.
These are very strong and unrealistic assumptions, but they do allow the max-pooling operator to be inverted. The inverse un-pooling operation allocates a tensor of zeros, then copies each value from spatial coordinate i of the input to spatial coordinate i × k of the output. The integer value k defines the size of the pooling region. Even though the assumptions motivating the definition of the un-pooling operator are unrealistic, the subsequent layers are able to learn to compensate for its unusual output, so the samples generated by the model as a whole are visually pleasing.
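The simplified un-pooling operation described above is easy to state in code. This sketch handles a single 2-D feature map:

```python
import numpy as np

def unpool(x, k):
    """Inverse of max-pooling under the stated assumptions: allocate a
    tensor of zeros, then copy the value at spatial coordinate i of the
    input to coordinate i*k of the output (the upper-left corner of
    each k-by-k pooling region)."""
    h, w = x.shape
    out = np.zeros((h * k, w * k), dtype=x.dtype)
    out[::k, ::k] = x
    return out

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])
print(unpool(x, 2))
# [[1. 0. 2. 0.]
#  [0. 0. 0. 0.]
#  [3. 0. 4. 0.]
#  [0. 0. 0. 0.]]
```

Max-pooling this output with stride and region width k recovers x exactly, which is the sense in which the operation inverts max-pooling under the assumptions above.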
20.10.7 Auto-Regressive Networks
Auto-regressive networks are directed probabilistic models with no latent random variables. The conditional probability distributions in these models are represented by neural networks (sometimes extremely simple neural networks such as logistic regression). The graph structure of these models is the complete graph. They decompose a joint probability over the observed variables using the chain rule of probability to obtain a product of conditionals of the form P(x_d | x_{d−1}, . . . , x_1). Such models have been called fully-visible Bayes networks (FVBNs) and used successfully in many forms, first with logistic regression for each conditional distribution (Frey, 1998) and then with neural networks with hidden units (Bengio and Bengio, 2000b; Larochelle and Murray, 2011). In some forms of auto-regressive networks, such as NADE (Larochelle and Murray, 2011), described in Sec.
20.10.10 below, we can introduce a form of parameter sharing that brings both a statistical advantage (fewer unique parameters) and a computational advantage (less computation). This is one more instance of the recurring deep learning motif of reuse of features.
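The chain-rule factorization that defines an FVBN can be sketched for the logistic case. The parameter values below are hypothetical and chosen only so the result is easy to verify:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def log_prob(x, weights, biases):
    """log P(x) for a logistic FVBN over binary x:
    P(x) = prod_d P(x_d | x_1, ..., x_{d-1}), with each conditional
    P(x_d = 1 | x_{<d}) = sigmoid(w_d . x_{<d} + b_d)."""
    lp = 0.0
    for d in range(len(x)):
        p_one = sigmoid(np.dot(weights[d], x[:d]) + biases[d])
        lp += np.log(p_one if x[d] == 1 else 1.0 - p_one)
    return lp

# With all-zero parameters every conditional is uniform, so any binary
# vector of length d has probability 2**(-d).
d = 4
weights = [np.zeros(i) for i in range(d)]
biases = [0.0] * d
x = np.array([1, 0, 1, 1])
print(log_prob(x, weights, biases))  # = 4 log(1/2), about -2.773
```

Because the factorization follows the chain rule, the model is normalized for any choice of parameters; no partition function is ever needed.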
Figure 20.8: A fully visible belief network predicts the i-th variable from the i − 1 previous ones. (Top) The directed graphical model for an FVBN. (Bottom) Corresponding computational graph, in the case of the logistic FVBN, where each prediction is made by a linear predictor.
20.10.8 Linear Auto-Regressive Networks
The simplest form of auto-regressive network has no hidden units and no sharing of parameters or features. Each P(x_i | x_{i−1}, . . . , x_1) is parametrized as a linear model (linear regression for real-valued data, logistic regression for binary data, softmax regression for discrete data). This model was introduced by Frey (1998) and has O(d^2) parameters when there are d variables to model. It is illustrated in Fig. 20.8.

If the variables are continuous, a linear auto-regressive model is merely another way to formulate a multivariate Gaussian distribution, capturing linear pairwise interactions between the observed variables.

Linear auto-regressive networks are essentially the generalization of linear classification methods to generative modeling.
They therefore have the same advantages and disadvantages as linear classifiers. Like linear classifiers, they may be trained with convex loss functions, and sometimes admit closed form solutions (as in the Gaussian case). Like linear classifiers, the model itself does not offer a way of increasing its capacity, so capacity must be raised using techniques like basis expansions of the input or the kernel trick.
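The equivalence between a linear-Gaussian auto-regressive model and a multivariate Gaussian noted above can be checked numerically; the coefficients here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# A linear-Gaussian auto-regressive model over two variables:
# x1 ~ N(0, 1), then x2 | x1 ~ N(0.5 * x1, 1).
n = 200_000
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)

# This is just a reparametrized multivariate Gaussian whose covariance
# is [[1, 0.5], [0.5, 0.5**2 + 1]] = [[1, 0.5], [0.5, 1.25]].
print(np.cov(np.stack([x1, x2])))
```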
CHAPTER 20. DEEP GENERATIVE MODELS
Figure 20.9: A neural auto-regressive network predicts the i-th variable x_i from the i-1 previous ones, but is parametrized so that features (groups of hidden units denoted h_i) that are functions of x_1, ..., x_i can be reused in predicting all of the subsequent variables x_{i+1}, x_{i+2}, ..., x_d.
20.10.9 Neural Auto-Regressive Networks
Neural auto-regressive networks (Bengio and Bengio, 2000a,b) have the same left-to-right graphical model as logistic auto-regressive networks (Fig. 20.8) but employ a different parametrization of the conditional distributions within that graphical model structure. The new parametrization is more powerful in the sense that its capacity can be increased as much as needed, allowing approximation of any joint distribution. The new parametrization can also improve generalization by introducing a parameter sharing and feature sharing principle common to deep learning in general. The models were motivated by the objective of avoiding the curse of dimensionality arising out of traditional tabular graphical models, sharing the same structure as Fig. 20.8. In tabular discrete probabilistic models, each conditional distribution is represented by a table of probabilities, with one entry and one parameter for each possible configuration of the variables involved. By using a neural network instead, two advantages are obtained:

1. The parametrization of each P(x_i | x_{i-1}, ..., x_1) by a neural network with (i-1) × k inputs and k outputs (if the variables are discrete and take k values, encoded one-hot) allows one to estimate the conditional probability without requiring an exponential number of parameters (and examples), yet still is able to capture high-order dependencies between the random variables.

2. Instead of having a different neural network for the prediction of each x_i,
a left-to-right connectivity illustrated in Fig. 20.9 allows one to merge all the neural networks into one. Equivalently, it means that the hidden layer features computed for predicting x_i can be reused for predicting x_{i+k} (k > 0). The hidden units are thus organized in groups that have the particularity that all the units in the i-th group only depend on the input values x_1, ..., x_i. The parameters used to compute these hidden units are jointly optimized to improve the prediction of all the variables in the sequence. This is an instance of the reuse principle that recurs throughout deep learning in scenarios ranging from recurrent and convolutional network architectures to multi-task and transfer learning.

Each P(x_i | x_{i-1}, ..., x_1) can represent a conditional distribution by having outputs of the neural network predict parameters of the conditional distribution of x_i, as discussed in Sec. 6.2.1.1. Although the original neural auto-regressive networks were initially evaluated in the context of purely discrete multivariate data (with a sigmoid output for a Bernoulli variable or softmax output for a multinoulli variable) it is natural to extend such models to continuous variables or joint distributions involving both discrete and continuous variables.
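For a discrete x_i taking k values, the first advantage above can be sketched as a small network whose outputs parametrize a softmax conditional. This is a hypothetical single-hidden-layer sketch; the function and parameter names are ours, not from the cited papers:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def conditional_probs(x_prev_onehot, W1, b1, W2, b2):
    """Parameters of P(x_i | x_{i-1}, ..., x_1) for a k-valued variable.

    x_prev_onehot: one-hot encoding of x_1, ..., x_{i-1}, shape ((i-1)*k,).
    A hidden layer maps the (i-1)*k inputs to k softmax outputs, so the
    parameter count grows linearly in i and k rather than as k^i."""
    h = np.tanh(W1 @ x_prev_onehot + b1)
    return softmax(W2 @ h + b2)
```

For example, with k = 3 and i = 3 the input has dimension 6, and the returned vector is a valid distribution over the 3 possible values of x_3.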
20.10.10 NADE
The neural autoregressive density estimator (NADE) is a very successful recent form of neural auto-regressive network (Larochelle and Murray, 2011). The connectivity is the same as for the original neural auto-regressive network of Bengio and Bengio (2000b) but NADE introduces an additional parameter sharing scheme, as illustrated in Fig. 20.10. The parameters of the hidden units of different groups j are shared.

The weights W'_{j,k,i} from the i-th input x_i to the k-th element of the j-th group of hidden units h_k^{(j)} (j >= i) are shared among the groups:

W'_{j,k,i} = W_{k,i}.   (20.84)

The remaining weights, where j < i, are zero.

Figure 20.10: An illustration of the neural autoregressive density estimator (NADE). The hidden units are organized in groups h^{(j)} so that only the inputs x_1, ..., x_i participate in computing h^{(i)} and predicting P(x_j | x_{j-1}, ..., x_1), for j > i. NADE is differentiated from earlier neural auto-regressive networks by the use of a particular weight sharing pattern: W'_{j,k,i} = W_{k,i} is shared (indicated in the figure by the use of the same line pattern for every instance of a replicated weight) for all the weights going out from x_i to the k-th unit of any group j >= i. Recall that the vector (W_{1,i}, W_{2,i}, ..., W_{n,i}) is denoted W_{:,i}.

Larochelle and Murray (2011) chose this sharing scheme so that forward propagation in a NADE model loosely resembles the computations performed in mean field inference to fill in missing inputs in an RBM. This mean field inference corresponds to running a recurrent network with shared weights and the first step of that inference is the same as in NADE. The only difference is that with NADE, the output weights connecting the hidden units to the output are parametrized independently from the weights connecting the input units to the hidden units. In the RBM, the hidden-to-output weights are the transpose of the input-to-hidden weights. The NADE architecture can be extended to mimic not just one time step of the mean field recurrent inference but to mimic k steps. This approach is called NADE-k (Raiko et al., 2014).

As mentioned previously, auto-regressive networks may be extended to process continuous-valued data. A particularly powerful and generic way of parametrizing a continuous density is as a Gaussian mixture (introduced in Sec. 3.9.6) with mixture weights α_i (the coefficient or prior probability for component i), per-component conditional mean µ_i and per-component conditional variance σ_i^2. A model called RNADE (Uria et al., 2013) uses this parametrization to extend NADE to real values. As with other mixture density networks, the parameters of this distribution are outputs of the network, with the mixture weight probabilities produced by a softmax unit, and the variances parametrized so that they are positive. Stochastic gradient descent can be numerically ill-behaved due to the interactions between the conditional means µ_i and the conditional variances σ_i^2. To reduce this difficulty, Uria et al. (2013) use a pseudo-gradient that replaces the gradient on the mean, in the back-propagation phase.
Another very interesting extension of the neural auto-regressive architectures gets rid of the need to choose an arbitrary order for the observed variables (Murray and Larochelle, 2014). In auto-regressive networks, the idea is to train the network to be able to cope with any order by randomly sampling orders and providing the information to hidden units specifying which of the inputs are observed (on the right side of the conditioning bar) and which are to be predicted and are thus considered missing (on the left side of the conditioning bar). This is nice because it allows one to use a trained auto-regressive network to perform any inference problem (i.e. predict or sample from the probability distribution over any subset of variables given any subset) extremely efficiently. Finally, since many orders of variables are possible (n! for n variables) and each order o of variables yields a different p(x | o), we can form an ensemble of models for many values of o:

p_ensemble(x) = (1/k) Σ_{i=1}^{k} p(x | o^{(i)}).   (20.85)

This ensemble model usually generalizes better and assigns higher probability to the test set than does an individual model defined by a single ordering.

In the same paper, the authors propose deep versions of the architecture, but unfortunately that immediately makes computation as expensive as in the original neural auto-regressive neural network (Bengio and Bengio, 2000b). The first layer and the output layer can still be computed in O(nh) multiply-add operations, as in the regular NADE, where h is the number of hidden units (the size of the groups h_i, in Fig. 20.10 and Fig. 20.9), whereas it is O(n^2 h) in Bengio and Bengio (2000b). However, for the other hidden layers, the computation is O(n^2 h^2) if every "previous" group at layer l participates in predicting the "next" group at layer l + 1, assuming n groups of h hidden units at each layer. Making the i-th group at layer l + 1 only depend on the i-th group, as in Murray and Larochelle (2014) at layer l, reduces it to O(nh^2), which is still h times worse than the regular NADE.
20.11 Drawing Samples from Autoencoders
In Chapter 14, we saw that many kinds of autoencoders learn the data distribution. There are close connections between score matching, denoising autoencoders, and contractive autoencoders. These connections demonstrate that some kinds of autoencoders learn the data distribution in some way. We have not yet seen how to draw samples from such models.

Some kinds of autoencoders, such as the variational autoencoder, explicitly
represent a probability distribution and admit straightforward ancestral sampling. Most other kinds of autoencoders require MCMC sampling.

Contractive autoencoders are designed to recover an estimate of the tangent plane of the data manifold. This means that repeated encoding and decoding with injected noise will induce a random walk along the surface of the manifold (Rifai et al., 2012; Mesnil et al., 2012). This manifold diffusion technique is a kind of Markov chain.

There is also a more general Markov chain that can sample from any denoising autoencoder.
20.11.1 Markov Chain Associated with any Denoising Autoencoder

The above discussion left open the question of what noise to inject and where, in
order to obtain a Markov chain that would generate from the distribution estimated by the autoencoder. Bengio et al. (2013c) showed how to construct such a Markov chain for generalized denoising autoencoders. Generalized denoising autoencoders are specified by a denoising distribution for sampling an estimate of the clean input given the corrupted input.

Each step of the Markov chain that generates from the estimated distribution consists of the following sub-steps, illustrated in Fig. 20.11:

1. Starting from the previous state x, inject corruption noise, sampling x̃ from C(x̃ | x).

2. Encode x̃ into h = f(x̃).

3. Decode h to obtain the parameters ω = g(h) of p(x | ω = g(h)) = p(x | x̃).

4. Sample the next state x from p(x | ω = g(h)) = p(x | x̃).

Bengio et al. (2014) showed that if the autoencoder p(x | x̃) forms a consistent estimator of the corresponding true conditional distribution, then the stationary distribution of the above Markov chain forms a consistent estimator (albeit an implicit one) of the data generating distribution of x.
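The four sub-steps above can be sketched directly. This is a hypothetical sketch: `encode` and `decode` stand in for a trained f and g, and both the corruption C and the reconstruction distribution are taken to be Gaussian, matching the squared-error case discussed around Fig. 20.11:

```python
import numpy as np

def dae_chain_samples(x0, encode, decode, corruption_std, recon_std,
                      n_steps, rng=None):
    """Run the Markov chain of a trained denoising autoencoder.

    Each iteration performs the four sub-steps from the text: corrupt,
    encode, decode to parameters omega, then sample the next state from
    the reconstruction distribution."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(n_steps):
        x_tilde = x + corruption_std * rng.standard_normal(x.shape)  # 1. C(x~ | x)
        h = encode(x_tilde)                                          # 2. h = f(x~)
        omega = decode(h)                                            # 3. omega = g(h)
        x = omega + recon_std * rng.standard_normal(x.shape)         # 4. x ~ p(x | omega)
        samples.append(x.copy())
    return samples
```

With a well-trained autoencoder, step 3 pulls corrupted points back toward the data manifold, so the chain wanders near high-density regions rather than diffusing away.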
20.11.2 Clamping and Conditional Sampling
Similarly to Boltzmann machines, denoising autoencoders and their generalizations (such as GSNs, described below) can be used to sample from a conditional distribution p(x_f | x_o), simply by clamping the observed units x_o and only resampling
Figure 20.11: Each step of the Markov chain associated with a trained denoising autoencoder, that generates the samples from the probabilistic model implicitly trained by the denoising log-likelihood criterion. Each step consists in (a) injecting noise via corruption process C in state x, yielding x̃, (b) encoding it with function f, yielding h = f(x̃), (c) decoding the result with function g, yielding parameters ω for the reconstruction distribution, and (d) given ω, sampling a new state from the reconstruction distribution p(x | ω = g(f(x̃))). In the typical squared reconstruction error case, g(h) = x̂, which estimates E[x | x̃], corruption consists in adding Gaussian noise, and sampling from p(x | ω) consists in adding Gaussian noise, a second time, to the reconstruction x̂. The latter noise level should correspond to the mean squared error of reconstructions, whereas the injected noise is a hyperparameter that controls the mixing speed as well as the extent to which the estimator smooths the empirical distribution (Vincent, 2011). In the example illustrated here, only the C and p conditionals are stochastic steps (f and g are deterministic computations), although noise can also be injected inside the autoencoder, as in generative stochastic networks (Bengio et al., 2014).
the free units x_f given x_o and the sampled latent variables (if any). For example, MP-DBMs can be interpreted as a form of denoising autoencoder, and are able to sample missing inputs. GSNs later generalized some of the ideas present in MP-DBMs to perform the same operation (Bengio et al., 2014). Alain et al. (2015) identified a missing condition from Proposition 1 of Bengio et al. (2014), which is that the transition operator (defined by the stochastic mapping going from one state of the chain to the next) should satisfy a property called detailed balance, which specifies that a Markov chain at equilibrium will remain in equilibrium whether the transition operator is run in forward or reverse.

An experiment in clamping half of the pixels (the right part of the image) and running the Markov chain on the other half is shown in Fig. 20.12.
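Clamping requires only a small change to the chain: after each transition, the observed units are reset to their clamped values so that only the free units are resampled. A sketch follows; `transition` stands in for one full step of a trained denoising-autoencoder or GSN chain, and the names are ours:

```python
import numpy as np

def clamped_step(x, observed_mask, transition):
    """One conditional-sampling step: run the chain's transition, then
    restore the clamped observed units x_o so that only x_f is resampled.

    observed_mask: boolean array, True where units are observed."""
    x_new = transition(x)
    x_new[observed_mask] = x[observed_mask]   # clamp: keep x_o fixed
    return x_new
```

Iterating this step yields (approximate) samples from p(x_f | x_o) once the chain mixes.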
Figure 20.12: Illustration of clamping the right half of the image and running the Markov chain by resampling only the left half at each step. These samples come from a GSN trained to reconstruct MNIST digits at each time step using the walkback procedure.
20.11.3 Walk-Back Training Procedure

The walk-back training procedure was proposed by Bengio et al. (2013c) as a way to accelerate the convergence of generative training of denoising autoencoders. Instead of performing a one-step encode-decode reconstruction, this procedure consists in alternating multiple stochastic encode-decode steps (as in the generative
Markov chain) initialized at a training example (just like with the contrastive divergence algorithm, described in Sec. 18.2) and penalizing the last probabilistic reconstructions (or all of the reconstructions along the way).

Training with k steps is equivalent (in the sense of achieving the same stationary distribution) to training with one step, but practically has the advantage that spurious modes farther from the data can be removed more efficiently.
20.12 Generative Stochastic Networks
Generative stochastic networks or GSNs (Bengio et al., 2014) are generalizations of denoising autoencoders that include latent variables h in the generative Markov chain, in addition to the visible variables (usually denoted x).

A GSN is parametrized by two conditional probability distributions which specify one step of the Markov chain:

1. p(x^{(k)} | h^{(k)}) tells how to generate the next visible variable given the current latent state. Such a "reconstruction distribution" is also found in denoising autoencoders, RBMs, DBNs and DBMs.

2. p(h^{(k)} | h^{(k-1)}, x^{(k-1)}) tells how to update the latent state variable, given the previous latent state and visible variable.
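One step of the chain applies the two conditionals in turn: first update the latent state, then emit a visible variable. A hypothetical sketch, where `latent_update` and `reconstruct` stand in for trained networks and the added Gaussian noise is what makes both conditionals stochastic:

```python
import numpy as np

def gsn_step(h, x, latent_update, reconstruct, noise_std=0.1, rng=None):
    """One GSN transition.

    Draws h^(k) ~ p(h | h^(k-1), x^(k-1)), then x^(k) ~ p(x | h^(k))."""
    rng = rng or np.random.default_rng(0)
    h_next = latent_update(h, x) + noise_std * rng.standard_normal(h.shape)
    x_next = reconstruct(h_next) + noise_std * rng.standard_normal(x.shape)
    return h_next, x_next
```

Iterating `gsn_step` produces the generative Markov chain over (h, x) pairs described in the text.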
Denoising autoencoders and GSNs differ from classical probabilistic models (directed or undirected) in that they parametrize the generative process itself rather than the mathematical specification of the joint distribution of visible and latent variables. Instead, the latter is defined implicitly, if it exists, as the stationary distribution of the generative Markov chain. The conditions for existence of the stationary distribution are mild and are the same conditions required by standard MCMC methods (see Sec. 17.3). These conditions are necessary to guarantee that the chain mixes, but they can be violated by some choices of the transition distributions (for example, if they were deterministic).

One could imagine different training criteria for GSNs. The one proposed and evaluated by Bengio et al. (2014) is simply reconstruction log-probability on the visible units, just like for denoising autoencoders. This is achieved by clamping x^(0) = x to the observed example and maximizing the probability of generating x at some subsequent time steps, i.e., maximizing log p(x^(k) = x | h^(k)), where h^(k) is sampled from the chain, given x^(0) = x. In order to estimate the gradient of log p(x^(k) = x | h^(k)) with respect to the other pieces of the model, Bengio et al. (2014) use the reparametrization trick, introduced in Sec. 20.9.
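A sketch of this clamped-chain objective, again with toy Gaussian conditionals. The update rule, the single scalar parameter w, and the fixed noise scale are assumptions made for illustration; in a real GSN, gradients would flow through the reparametrized noise as in Sec. 20.9:

```python
import numpy as np

rng = np.random.default_rng(1)

def gaussian_logpdf(x, mean, sigma):
    # log N(x; mean, sigma^2 I), summed over dimensions
    return float(np.sum(-0.5 * ((x - mean) / sigma) ** 2
                        - np.log(sigma) - 0.5 * np.log(2.0 * np.pi)))

def clamped_chain_objective(x, w, k=3, sigma=0.1):
    """Clamp x^(0) = x, run k chain steps with reparametrized noise, and
    accumulate log p(x^(k) = x | h^(k)): the chain should regenerate x."""
    h, x_cur, total = np.zeros_like(x), x, 0.0
    for _ in range(k):
        eps = 0.1 * rng.standard_normal(h.shape)  # reparametrized noise
        h = np.tanh(w * (h + x_cur)) + eps        # toy latent update
        mean = w * h                              # toy reconstruction mean
        total += gaussian_logpdf(x, mean, sigma)  # reward regenerating x
        x_cur = mean + sigma * rng.standard_normal(x.shape)
    return total

obj = clamped_chain_objective(np.array([0.3, -0.2]), w=1.0)
# training would adjust w (in a real GSN, the network weights) to increase obj
```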
CHAPTER 20. DEEP GENERATIVE MODELS
The walk-back training protocol (described in Sec. 20.11.3) was used (Bengio et al., 2014) to improve training convergence of GSNs.
20.12.1
Discriminant GSNs

The original formulation of GSNs (Bengio et al., 2014) was meant for unsupervised learning and implicitly modeling p(x) for observed data x, but it is possible to modify the framework to optimize p(y | x).

For example, Zhou and Troyanskaya (2014) generalize GSNs in this way, by only back-propagating the reconstruction log-probability over the output variables, keeping the input variables fixed. They applied this successfully to model sequences (protein secondary structure) and introduced a (one-dimensional) convolutional structure in the transition operator of the Markov chain. It is important to remember that, for each step of the Markov chain, one generates a new sequence for each layer, and that sequence is the input for computing other layer values (say the one below and the one above) at the next time step.

Hence the Markov chain is really over the output variable (and the associated higher-level hidden layers), and the input sequence only serves to condition that chain, with back-propagation allowing the model to learn how the input sequence can condition the output distribution implicitly represented by the Markov chain. It is therefore a case of using the GSN in the context of structured outputs, where p(y | x) does not have a simple parametric form but instead the components of y are statistically dependent on each other, given x, in complicated ways.

Zöhrer and Pernkopf (2014) introduced a hybrid model that combines a supervised objective (as in the above work) and an unsupervised objective (as in the original GSN work), by simply adding (with a different weight) the supervised and unsupervised costs, i.e., the reconstruction log-probabilities of y and x respectively.
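The hybrid criterion is just a weighted sum of the two reconstruction log-probabilities; a one-line sketch (the weight value and the example log-probabilities are arbitrary, hypothetical numbers):

```python
def hybrid_cost(log_p_y, log_p_x, unsup_weight=0.5):
    """Negative hybrid objective: supervised reconstruction log-probability
    of y plus a weighted unsupervised reconstruction term for x."""
    return -(log_p_y + unsup_weight * log_p_x)

# hypothetical per-example log-probabilities from the two reconstruction terms
cost = hybrid_cost(log_p_y=-1.2, log_p_x=-4.0)  # = -(-1.2 + 0.5 * -4.0) = 3.2
```

The weight trades off classification accuracy against how much of the input distribution the model is forced to capture.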
Such a hybrid criterion had previously been introduced for RBMs by Larochelle and Bengio (2008). They show improved classification performance using this scheme.
20.13
Other Generation Schemes

The methods we have described so far use either MCMC sampling, ancestral sampling, or some mixture of the two to generate samples. While these are the most popular approaches to generative modeling, they are by no means the only approaches.
Sohl-Dickstein et al. (2015) developed a diffusion inversion training scheme for learning a generative model, based on non-equilibrium thermodynamics. The approach is based on the idea that the probability distributions we wish to sample from have structure. This structure can gradually be destroyed by a diffusion process that incrementally changes the probability distribution to have more entropy. To form a generative model, we can run the process in reverse, by training a model that gradually restores the structure to an unstructured distribution. By iteratively applying a process that brings a distribution closer to the target one, we can gradually approach that target distribution. This approach resembles MCMC methods in the sense that it involves many iterations to produce a sample. However, the model is defined to be the probability distribution produced by the final step of the chain.
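A minimal sketch of the forward (structure-destroying) half of this idea, using a Gaussian diffusion that shrinks the signal and injects noise at each step; the step count and noise level are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def diffuse(x0, n_steps=200, beta=0.05):
    """Each step mixes in Gaussian noise, incrementally raising entropy
    until the data distribution is destroyed (it approaches N(0, 1))."""
    xs = [x0]
    x = x0
    for _ in range(n_steps):
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)
        xs.append(x)
    return xs

# highly structured "data": a tight cluster far from the origin
x0 = 2.0 + 0.01 * rng.standard_normal(1000)
xs = diffuse(x0)
# a generative model would be trained to undo each of these small steps,
# then run in reverse starting from pure noise
```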
In this sense, there is no approximation induced by the iterative procedure. The approach introduced by Sohl-Dickstein et al. (2015) is also very close to the generative interpretation of the denoising autoencoder (Sec. 20.11.1). Like with the denoising autoencoder, the training objective trains a transition operator which attempts to probabilistically undo the effect of adding some noise, trying to undo one step of the diffusion process. If we compare with the walk-back training procedure (Sec. 20.11.3) for denoising autoencoders and GSNs, the main difference is that instead of reconstructing only towards the observed training point x, the objective function only tries to reconstruct towards the previous point in the diffusion trajectory that started at x (which should be easier). This addresses the following dilemma present with the ordinary reconstruction log-likelihood objective of denoising autoencoders: with small levels of noise the learner only sees configurations near the data points, while with large levels of noise it is asked to do an almost impossible job (because the denoising distribution is going to be highly complex and multi-modal). With the diffusion inversion objective, the learner can learn more precisely the shape of the density around the data points as well as remove spurious modes that could show up far from the data points.

Another approach to sample generation is the approximate Bayesian computation (ABC) framework (Rubin et al., 1984). In this approach, samples are rejected or modified in order to make the moments of selected functions of the samples match those of the desired distribution.
thisapproach, idea uses samples the moments of the or modified in order to make the moments of selected functions of the samples samples lik likee in moment matc matching, hing, it is different from moment matc matching hing because it matc h those of the desired distribution. While this idea uses the moments of the mo modifies difies the samples themselves, rather than training the model to automatically samples like inwith moment matching, it ists.different from moment hing because it emit samples the correct momen moments. Bac Bachman hman and Precupmatc (2015 ) show showed ed ho how w mouse difies thefrom samples rather than learning, training the modelABC to automatically to ideas ABCthemselves, in the con context text of deep by using to shap shapee the emit samples with the correct momen ts. Bac hman and Precup ( 2015 ) show ed how MCMC tra trajectories jectories of GSNs. to use ideas from ABC in the context of deep learning, by using ABC to shape the We exp expect that many other possible approaches to generativ generativee modeling await MCMC traect jectories of GSNs. disco discov very ery.. We expect that many other possible approaches to generative modeling await 718 discovery.
20.14
Evaluating Generative Models

Researchers studying generative models often need to compare one generative model to another, usually in order to demonstrate that a newly invented generative model is better at capturing some distribution than the pre-existing models.

This can be a difficult and subtle task. In many cases, we cannot actually evaluate the log probability of the data under the model, but only an approximation. In these cases, it is important to think and communicate clearly about exactly what is being measured. For example, suppose we can evaluate a stochastic estimate of the log-likelihood for model A, and a deterministic lower bound on the log-likelihood for model B. If model A gets a higher score than model B, which is better? If we care about determining which model has a better internal representation of the distribution, we actually cannot tell, unless we have some way of determining how loose the bound for model B is. However, if we care about how well we can use the model in practice, for example to perform anomaly detection, then it is fair to say that a model is preferable based on a criterion specific to the practical task of interest, e.g., based on ranking test examples and ranking criteria such as precision and recall.

Another subtlety of evaluating generative models is that the evaluation metrics are often hard research problems in and of themselves. It can be very difficult to establish that models are being compared fairly. For example, suppose we use AIS to estimate log Z in order to compute log p̃(x) − log Z for a new model we have just invented.
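The way a partition-function estimate that misses modes inflates the reported likelihood can be reproduced with plain importance sampling, used here as a crude stand-in for AIS; the bimodal target and the single-mode proposal are contrived for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(x):
    # unnormalized density with two well-separated equal modes;
    # the true partition function is Z = 2 * sqrt(2 * pi)
    return np.exp(-0.5 * (x + 4.0) ** 2) + np.exp(-0.5 * (x - 4.0) ** 2)

def estimate_log_Z(n=100_000):
    """Importance-sampling estimate of log Z with a proposal N(-4, 1)
    that effectively misses the mode at +4."""
    x = rng.normal(-4.0, 1.0, size=n)
    q = np.exp(-0.5 * (x + 4.0) ** 2) / np.sqrt(2.0 * np.pi)
    return float(np.log(np.mean(p_tilde(x) / q)))

true_log_Z = float(np.log(2.0 * np.sqrt(2.0 * np.pi)))
est_log_Z = estimate_log_Z()
# est_log_Z falls short of true_log_Z by about log 2, so a reported
# log p(x) = log p_tilde(x) - est_log_Z would be inflated by that amount
```

The estimator is unbiased in expectation, but with any feasible number of samples the missed mode contributes essentially nothing, so the realized estimate is too small.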
A computationally economical implementation of AIS may fail to find several modes of the model distribution and underestimate Z, which will result in us overestimating log p(x). It can thus be difficult to tell whether a high likelihood estimate is due to a good model or a bad AIS implementation.

Other fields of machine learning usually allow for some variation in the preprocessing of the data. For example, when comparing the accuracy of object recognition algorithms, it is usually acceptable to preprocess the input images slightly differently for each algorithm based on what kind of input requirements it has. Generative modeling is different because changes in preprocessing, even very small and subtle ones, are completely unacceptable. Any change to the input data changes the distribution to be captured and fundamentally alters the task. For example, multiplying the input by 0.1 will artificially increase likelihood by a factor of 10.

Issues with preprocessing commonly arise when benchmarking generative models on the MNIST dataset, one of the more popular generative modeling benchmarks. MNIST consists of grayscale images. Some models treat MNIST images as points
in a real vector space, while others treat them as binary. Yet others treat the grayscale values as probabilities for binary samples. It is essential to compare real-valued models only to other real-valued models and binary-valued models only to other binary-valued models. Otherwise the likelihoods measured are not on the same space. For binary-valued models, the log-likelihood can be at most zero, while for real-valued models it can be arbitrarily high, since it is the measurement of a density. Among binary models, it is important to compare models using exactly the same kind of binarization. For example, we might binarize a gray pixel to 0 or 1 by thresholding at 0.5, or by drawing a random sample whose probability of being 1 is given by the gray pixel intensity. If we use the random binarization, we might binarize the whole dataset once, or we might draw a different random example for each step of training and then draw multiple samples for evaluation. Each of these three schemes yields wildly different likelihood numbers, and when comparing different models it is important that both models use the same binarization scheme for training and for evaluation. In fact, researchers who apply a single random binarization step share a file containing the results of the random binarization, so that there is no difference in results based on different outcomes of the binarization step.

Because being able to generate realistic samples from the data distribution is one of the goals of a generative model, practitioners often evaluate generative models by visually inspecting the samples. In the best case, this is done not by the researchers themselves, but by experimental subjects who do not know the source of the samples (Denton et al., 2015). Unfortunately, it is possible for a very poor probabilistic model to produce very good samples. A common practice to verify if the model only copies some of the training examples is illustrated in Fig. 16.1. The idea is to show for some of the generated samples their nearest neighbor in the training set, according to Euclidean distance in the space of x. This test is intended to detect the case where the model overfits the training set and just reproduces training instances. It is even possible to simultaneously underfit and overfit yet still produce samples that individually look good. Imagine a generative model trained on images of dogs and cats that simply learns to reproduce the training images of dogs. Such a model has clearly overfit, because it does not produce images that were not in the training set, but it has also underfit, because it assigns no probability to the training images of cats. Yet a human observer would judge each individual image of a dog to be high quality. In this simple example, it would be easy for a human observer who can inspect many samples to determine that the cats are absent. In more realistic settings, a generative model trained on data with tens of thousands of modes may ignore a small number of modes, and a human observer would not easily be able to inspect or remember
enough images to detect the missing variation.

Since the visual quality of samples is not a reliable guide, we often also evaluate the log-likelihood that the model assigns to the test data, when this is computationally feasible. Unfortunately, in some cases the likelihood seems not to measure any attribute of the model that we really care about. For example, real-valued models of MNIST can obtain arbitrarily high likelihood by assigning arbitrarily low variance to background pixels that never change. Models and algorithms that detect these constant features can reap unlimited rewards, even though this is not a very useful thing to do. The potential to achieve a cost approaching negative infinity is present for any kind of maximum likelihood problem with real values, but it is especially problematic for generative models of MNIST because so many of the output values are trivial to predict. This strongly suggests a need for developing other ways of evaluating generative models.

Theis et al. (2015) review many of the issues involved in evaluating generative models, including many of the ideas described above. They highlight the fact that there are many different uses of generative models and that the choice of metric must match the intended use of the model. For example, some generative models are better at assigning high probability to most realistic points while other generative models are better at rarely assigning high probability to unrealistic points. These differences can result from whether a generative model is designed to minimize D_KL(p_data ‖ p_model) or D_KL(p_model ‖ p_data), as illustrated in Fig. 3.6. Unfortunately, even when we restrict the use of each metric to the task it is most
suited for, all of the metrics currently in use continue to have serious weaknesses. One of the most important research topics in generative modeling is therefore not just how to improve generative models, but in fact designing new techniques to measure our progress.
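Two of the pathologies discussed above can be reproduced numerically. The sketch below is illustrative only: the pixel value and the discrete distributions are made-up toy examples, not drawn from the book. The first part shows how a Gaussian model of a constant pixel drives log-likelihood toward infinity as its variance shrinks; the second shows that the two KL directions prefer different models of a two-mode "data" distribution.

```python
import math

# --- Pathology 1: unbounded likelihood from constant pixels ----------------
# A real-valued model of a background pixel that is always exactly 0 can
# center a Gaussian on it (mu = 0) and shrink sigma, driving the
# log-likelihood of that pixel toward +infinity.

def gaussian_log_density(x, mu, sigma):
    """Log density of a univariate Gaussian N(mu, sigma^2) evaluated at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) \
           - 0.5 * ((x - mu) / sigma) ** 2

for sigma in (1.0, 1e-2, 1e-4, 1e-8):
    print(f"sigma={sigma:g}: log p(0) = {gaussian_log_density(0.0, 0.0, sigma):.2f}")

# --- Pathology 2: the two KL directions prefer different models ------------

def kl(p, q):
    """D_KL(p || q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p_data   = [0.48, 0.02, 0.02, 0.48]   # made-up two-mode "data" distribution
covering = [0.25, 0.25, 0.25, 0.25]   # mode-covering model: spreads mass
seeking  = [0.94, 0.04, 0.01, 0.01]   # mode-seeking model: commits to one mode

# Forward KL, D_KL(p_data || p_model), is the maximum likelihood direction;
# it prefers the covering model, which never leaves a data mode nearly empty.
print(kl(p_data, covering), "<", kl(p_data, seeking))

# Reverse KL, D_KL(p_model || p_data), prefers the seeking model;
# it penalizes model mass placed on outcomes the data finds unlikely.
print(kl(seeking, p_data), "<", kl(covering, p_data))
```

The forward direction must place model probability wherever the data has mass, so it covers both modes; the reverse direction must avoid placing model probability where the data has little, so it collapses onto one mode.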
20.15 Conclusion
Training generative models with hidden units is a powerful way to make models understand the world represented in the given training data. By learning a model p_model(x) and a representation p_model(h | x), a generative model can provide answers to many inference problems about the relationships between input variables in x, and can provide many different ways of representing x by taking expectations of h at different layers of the hierarchy. Generative models hold the promise to provide AI systems with a framework for all of the many different intuitive concepts they need to understand, and the ability to reason about these concepts in the face of uncertainty. We hope that our readers will find new ways to make these
CHAPTER 20. DEEP GENERATIVE MODELS
approaches more powerful and continue the journey to understanding the principles that underlie learning and intelligence.
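The closing point — that a model p_model(x) together with p_model(h | x) supports many inference queries — can be made concrete with a deliberately tiny, hypothetical two-variable model; all of the numbers below are invented for illustration:

```python
# Hypothetical tiny generative model: binary latent h and binary observed x.
# p(h = 1) = 0.3; p(x = 1 | h) = 0.9 if h == 1 else 0.2.
p_h = {0: 0.7, 1: 0.3}
p_x_given_h = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.1, 1: 0.9}}

def p_x(x):
    """p_model(x): marginalize out the latent variable h."""
    return sum(p_h[h] * p_x_given_h[h][x] for h in (0, 1))

def p_h_given_x(h, x):
    """p_model(h | x): Bayes' rule gives the representation of x."""
    return p_h[h] * p_x_given_h[h][x] / p_x(x)

# One simple "representation" of an observation: the expectation of h
# given x, analogous to taking expectations at a layer of the hierarchy.
for x in (0, 1):
    e_h = sum(h * p_h_given_x(h, x) for h in (0, 1))
    print(f"p(x={x}) = {p_x(x):.3f}, E[h | x={x}] = {e_h:.3f}")
```

Even in this toy setting, the same learned quantities answer both a density query (how probable is this observation?) and a representation query (what latent state likely produced it?), which is the pattern the chapter's models scale up.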
Bibliography
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org. 25, 212, 449

Ackley, D. H., Hinton, G. E., and Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9, 147–169. 573, 656

Alain, G. and Bengio, Y. (2013). What regularized auto-encoders learn from the data generating distribution. In ICLR'2013, arXiv:1211.4246. 510, 516, 524

Alain, G., Bengio, Y., Yao, L., Éric Thibodeau-Laufer, Yosinski, J., and Vincent, P. (2015). GSNs: Generative stochastic networks. arXiv:1503.05571. 513, 715

Anderson, E. (1935). The Irises of the Gaspé Peninsula. Bulletin of the American Iris Society, 59, 2–5. 21

Ba, J., Mnih, V., and Kavukcuoglu, K. (2014). Multiple object recognition with visual attention. arXiv:1412.7755. 693

Bachman, P. and Precup, D. (2015). Variational generative stochastic networks with collaborative shaping. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 1964–1972. 718

Bacon, P.-L., Bengio, E., Pineau, J., and Precup, D. (2015). Conditional computation in neural networks using a decision-theoretic approach. In 2nd Multidisciplinary Conference on Reinforcement Learning and Decision Making (RLDM 2015). 453

Bagnell, J. A. and Bradley, D. M. (2009). Differentiable sparse coding. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21 (NIPS'08), pages 113–120. 501
Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In ICLR'2015, arXiv:1409.0473. 25, 101, 399, 421, 422, 468, 478

Bahl, L. R., Brown, P., de Souza, P. V., and Mercer, R. L. (1987). Speech recognition with continuous-parameter hidden Markov models. Computer, Speech and Language, 2, 219–234. 461

Baldi, P. and Hornik, K. (1989). Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2, 53–58. 286

Baldi, P., Brunak, S., Frasconi, P., Soda, G., and Pollastri, G. (1999). Exploiting the past and the future in protein secondary structure prediction. Bioinformatics, 15(11), 937–946. 396

Baldi, P., Sadowski, P., and Whiteson, D. (2014). Searching for exotic particles in high-energy physics with deep learning. Nature communications, 5. 26

Ballard, D. H., Hinton, G. E., and Sejnowski, T. J. (1983). Parallel vision computation. Nature. 455

Barlow, H. B. (1989). Unsupervised learning. Neural Computation, 1, 295–311. 146

Barron, A. E. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. on Information Theory, 39, 930–945. 198

Bartholomew, D. J. (1987). Latent variable models and factor analysis. Oxford University Press. 493

Basilevsky, A. (1994). Statistical Factor Analysis and Related Methods: Theory and Applications. Wiley. 493

Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N., and Bengio, Y. (2012). Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop. 25, 82, 212, 222, 449

Basu, S. and Christensen, J. (2013). Teaching classification boundaries to humans. In AAAI'2013. 329

Baxter, J. (1995). Learning internal representations. In Proceedings of the 8th International Conference on Computational Learning Theory (COLT'95), pages 311–320, Santa Cruz, California. ACM Press. 246

Bayer, J. and Osendorfer, C. (2014). Learning stochastic recurrent networks. ArXiv e-prints. 264

Becker, S. and Hinton, G. (1992). A self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355, 161–163. 544
Behnke, S. (2001). Learning iterative image reconstruction in the neural abstraction pyramid. Int. J. Computational Intelligence and Applications, 1(4), 427–438. 518

Beiu, V., Quintana, J. M., and Avedillo, M. J. (2003). VLSI implementations of threshold logic-a comprehensive survey. Neural Networks, IEEE Transactions on, 14(5), 1217–1243. 454

Belkin, M. and Niyogi, P. (2002). Laplacian eigenmaps and spectral techniques for embedding and clustering. In T. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14 (NIPS'01), Cambridge, MA. MIT Press. 244

Belkin, M. and Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373–1396. 163, 521

Bengio, E., Bacon, P.-L., Pineau, J., and Precup, D. (2015a). Conditional computation in neural networks for faster models. arXiv:1511.06297. 453

Bengio, S. and Bengio, Y. (2000a). Taking on the curse of dimensionality in joint distributions using neural networks. IEEE Transactions on Neural Networks, special issue on Data Mining and Knowledge Discovery, 11(3), 550–557. 709

Bengio, S., Vinyals, O., Jaitly, N., and Shazeer, N. (2015b). Scheduled sampling for sequence prediction with recurrent neural networks. Technical report, arXiv:1506.03099. 385

Bengio, Y. (1991). Artificial Neural Networks and their Application to Sequence Recognition. Ph.D. thesis, McGill University, (Computer Science), Montreal, Canada. 409

Bengio, Y. (2000). Gradient-based optimization of hyperparameters. Neural Computation, 12(8), 1889–1900. 438

Bengio, Y. (2002). New distributed probabilistic language models. Technical Report 1215, Dept. IRO, Université de Montréal. 470

Bengio, Y. (2009). Learning deep architectures for AI. Now Publishers. 200, 626

Bengio, Y. (2013). Deep learning of representations: looking forward. In Statistical Language and Speech Processing, volume 7978 of Lecture Notes in Computer Science, pages 1–37. Springer, also in arXiv at http://arxiv.org/abs/1305.0445. 451

Bengio, Y. (2015). Early inference in energy-based models approximates back-propagation. Technical Report arXiv:1510.02777, Universite de Montreal. 658

Bengio, Y. and Bengio, S. (2000b). Modeling high-dimensional discrete data with multi-layer neural networks. In NIPS 12, pages 400–406. MIT Press. 707, 709, 710, 712

Bengio, Y. and Delalleau, O. (2009). Justifying and generalizing contrastive divergence. Neural Computation, 21(6), 1601–1621. 516, 614
Bengio, Y. and Grandvalet, Y. (2004). No unbiased estimator of the variance of k-fold cross-validation. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16 (NIPS'03), Cambridge, MA. MIT Press, Cambridge. 122

Bengio, Y. and LeCun, Y. (2007). Scaling learning algorithms towards AI. In Large Scale Kernel Machines. 19

Bengio, Y. and Monperrus, M. (2005). Non-local manifold tangent learning. In L. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17 (NIPS'04), pages 129–136. MIT Press. 159, 522

Bengio, Y. and Sénécal, J.-S. (2003). Quick training of probabilistic neural nets by importance sampling. In Proceedings of AISTATS 2003. 473

Bengio, Y. and Sénécal, J.-S. (2008). Adaptive importance sampling to accelerate training of a neural probabilistic language model. IEEE Trans. Neural Networks, 19(4), 713–722. 473

Bengio, Y., De Mori, R., Flammia, G., and Kompe, R. (1991). Phonetically motivated acoustic parameters for continuous speech recognition using artificial neural networks. In Proceedings of EuroSpeech'91. 27, 462

Bengio, Y., De Mori, R., Flammia, G., and Kompe, R. (1992). Neural network-Gaussian mixture hybrid for speech recognition or density estimation. In NIPS 4, pages 175–182. Morgan Kaufmann. 462

Bengio, Y., Frasconi, P., and Simard, P. (1993). The problem of learning long-term dependencies in recurrent networks. In IEEE International Conference on Neural Networks, pages 1183–1195, San Francisco. IEEE Press. (invited paper). 405

Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Tr. Neural Nets. 18, 403, 405, 406, 414

Bengio, Y., Latendresse, S., and Dugas, C. (1999). Gradient-based learning of hyperparameters. Learning Conference, Snowbird. 438

Bengio, Y., Ducharme, R., and Vincent, P. (2001). A neural probabilistic language model. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, NIPS'2000, pages 932–938. MIT Press. 18, 450, 466, 469, 475, 480, 485

Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003). A neural probabilistic language model. JMLR, 3, 1137–1155. 469, 475

Bengio, Y., Le Roux, N., Vincent, P., Delalleau, O., and Marcotte, P. (2006a). Convex neural networks. In NIPS'2005, pages 123–130. 257

Bengio, Y., Delalleau, O., and Le Roux, N. (2006b). The curse of highly variable functions for local kernel machines. In NIPS'2005. 157
Bengio, Y., Larochelle, H., and Vincent, P. (2006c). Non-local manifold Parzen windows. In NIPS'2005. MIT Press. 159, 523

Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007). Greedy layer-wise training of deep networks. In NIPS'2006. 14, 19, 200, 324, 325, 531, 533

Bengio, Y., Louradour, J., Collobert, R., and Weston, J. (2009). Curriculum learning. In ICML'09. 329

Bengio, Y., Mesnil, G., Dauphin, Y., and Rifai, S. (2013a). Better mixing via deep representations. In ICML'2013. 607

Bengio, Y., Léonard, N., and Courville, A. (2013b). Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv:1308.3432. 451, 453, 691, 693

Bengio, Y., Yao, L., Alain, G., and Vincent, P. (2013c). Generalized denoising auto-encoders as generative models. In NIPS'2013. 510, 713, 715

Bengio, Y., Courville, A., and Vincent, P. (2013d). Representation learning: A review and new perspectives. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 35(8), 1798–1828. 558

Bengio, Y., Thibodeau-Laufer, E., Alain, G., and Yosinski, J. (2014). Deep generative stochastic networks trainable by backprop. In ICML'2014. 713, 714, 715, 716, 717

Bennett, C. (1976). Efficient estimation of free energy differences from Monte Carlo data. Journal of Computational Physics, 22(2), 245–268. 632

Bennett, J. and Lanning, S. (2007). The Netflix prize. 482

Berger, A. L., Della Pietra, V. J., and Della Pietra, S. A. (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22, 39–71. 476

Berglund, M. and Raiko, T. (2013). Stochastic gradient estimate variance in contrastive divergence and persistent contrastive divergence. CoRR, abs/1312.6002. 617

Bergstra, J. (2011). Incorporating Complex Cells into Neural Networks for Pattern Classification. Ph.D. thesis, Université de Montréal. 254

Bergstra, J. and Bengio, Y. (2009). Slow, decorrelated features for pretraining complex cell-like networks. In NIPS'2009. 497

Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. J. Machine Learning Res., 13, 281–305. 437, 438

Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proc. SciPy. 25, 82, 212, 222, 449
Bergstra, J., Bardenet, R., Bengio, Y., and Kégl, B. (2011). Algorithms for hyper-parameter optimization. In NIPS'2011. 439

Berkes, P. and Wiskott, L. (2005). Slow feature analysis yields a rich repertoire of complex cell properties. Journal of Vision, 5(6), 579–602. 498

Bertsekas, D. P. and Tsitsiklis, J. (1996). Neuro-Dynamic Programming. Athena Scientific. 106

Besag, J. (1975). Statistical analysis of non-lattice data. The Statistician, 24(3), 179–195. 618

Bishop, C. M. (1994). Mixture density networks. 188

Bishop, C. M. (1995a). Regularization and complexity control in feed-forward networks. In Proceedings International Conference on Artificial Neural Networks ICANN'95, volume 1, pages 141–148. 242, 249

Bishop, C. M. (1995b). Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1), 108–116. 242

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. 98, 145

Blum, A. L. and Rivest, R. L. (1992). Training a 3-node neural network is NP-complete. 293

Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. (1989). Learnability and the Vapnik–Chervonenkis dimension. Journal of the ACM, 36(4), 929–965. 114

Bonnet, G. (1964). Transformations des signaux aléatoires à travers les systèmes non linéaires sans mémoire. Annales des Télécommunications, 19(9–10), 203–220. 691

Bordes, A., Weston, J., Collobert, R., and Bengio, Y. (2011). Learning structured embeddings of knowledge bases. In AAAI 2011. 487

Bordes, A., Glorot, X., Weston, J., and Bengio, Y. (2012). Joint learning of words and meaning representations for open-text semantic parsing. AISTATS'2012. 403, 487

Bordes, A., Glorot, X., Weston, J., and Bengio, Y. (2013a). A semantic matching energy function for learning with multi-relational data. Machine Learning: Special Issue on Learning Semantics. 486

Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., and Yakhnenko, O. (2013b). Translating embeddings for modeling multi-relational data. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2787–2795. Curran Associates, Inc. 487

Bornschein, J. and Bengio, Y. (2015). Reweighted wake-sleep. In ICLR'2015, arXiv:1406.2751. 695
BIBLIOGRAPHY
Bornschein, J., Shabanian, S., Fischer, A., and Bengio, Y. (2015). Training bidirectional Helmholtz machines. Technical report, arXiv:1506.03877. 695

Boser, B. E., Guyon, I. M., and Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In COLT ’92: Proceedings of the fifth annual workshop on Computational learning theory, pages 144–152, New York, NY, USA. ACM. 18, 140

Bottou, L. (1998). Online algorithms and stochastic approximations. In D. Saad, editor, Online Learning in Neural Networks. Cambridge University Press, Cambridge, UK. 296

Bottou, L. (2011). From machine learning to machine reasoning. Technical report, arXiv:1102.1808. 401, 403

Bottou, L. (2015). Multilayer neural networks. Deep Learning Summer School. 443

Bottou, L. and Bousquet, O. (2008). The tradeoffs of large scale learning. In NIPS’2008. 282, 295

Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. (2012). Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In ICML’12. 688

Boureau, Y., Ponce, J., and LeCun, Y. (2010). A theoretical analysis of feature pooling in vision algorithms. In Proc. International Conference on Machine Learning (ICML’10). 346

Boureau, Y., Le Roux, N., Bach, F., Ponce, J., and LeCun, Y. (2011). Ask the locals: multi-way local pooling for image recognition. In Proc. International Conference on Computer Vision (ICCV’11). IEEE. 346

Bourlard, H. and Kamp, Y. (1988). Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59, 291–294. 505

Bourlard, H. and Wellekens, C. (1989). Speech pattern discrimination and multi-layered perceptrons. Computer Speech and Language, 3, 1–19. 462

Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press, New York, NY, USA. 93

Brady, M. L., Raghavan, R., and Slawny, J. (1989). Back-propagation fails to separate where perceptrons succeed. IEEE Transactions on Circuits and Systems, 36, 665–674. 284

Brakel, P., Stroobandt, D., and Schrauwen, B. (2013). Training energy-based models for time-series imputation. Journal of Machine Learning Research, 14, 2771–2797. 676, 700

Brand, M. (2003). Charting a manifold. In NIPS’2002, pages 961–968. MIT Press. 163, 521
Breiman, L. (1994). Bagging predictors. Machine Learning, 24(2), 123–140. 255

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth International Group, Belmont, CA. 145

Bridle, J. S. (1990). Alphanets: a recurrent ‘neural’ network architecture with a hidden Markov model interpretation. Speech Communication, 9(1), 83–92. 185

Briggman, K., Denk, W., Seung, S., Helmstaedter, M. N., and Turaga, S. C. (2009). Maximin affinity learning of image segmentation. In NIPS’2009, pages 1865–1873. 360

Brown, P. F., Cocke, J., Pietra, S. A. D., Pietra, V. D., Jelinek, F., Lafferty, J. D., Mercer, R. L., and Roossin, P. S. (1990). A statistical approach to machine translation. Computational Linguistics, 16(2), 79–85. 21

Brown, P. F., Pietra, V. J. D., DeSouza, P. V., Lai, J. C., and Mercer, R. L. (1992). Class-based n-gram models of natural language. Computational Linguistics, 18, 467–479. 466

Bryson, A. and Ho, Y. (1969). Applied optimal control: optimization, estimation, and control. Blaisdell Pub. Co. 225

Bryson, A. E., Jr. and Denham, W. F. (1961). A steepest-ascent method for solving optimum programming problems. Technical Report BR-1303, Raytheon Company, Missile and Space Division. 225

Buciluǎ, C., Caruana, R., and Niculescu-Mizil, A. (2006). Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 535–541. ACM. 451

Burda, Y., Grosse, R., and Salakhutdinov, R. (2015). Importance weighted autoencoders. arXiv preprint arXiv:1509.00519. 700

Cai, M., Shi, Y., and Liu, J. (2013). Deep maxout neural networks for speech recognition. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on, pages 291–296. IEEE. 193

Carreira-Perpiñan, M. A. and Hinton, G. E. (2005). On contrastive divergence learning. In R. G. Cowell and Z. Ghahramani, editors, Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics (AISTATS’05), pages 33–40. Society for Artificial Intelligence and Statistics. 614

Caruana, R. (1993). Multitask connectionist learning. In Proc. 1993 Connectionist Models Summer School, pages 372–379. 245

Cauchy, A. (1847). Méthode générale pour la résolution de systèmes d’équations simultanées. In Compte rendu des séances de l’académie des sciences, pages 536–538. 83, 224
Cayton, L. (2005). Algorithms for manifold learning. Technical Report CS2008-0923, UCSD. 163

Chandola, V., Banerjee, A., and Kumar, V. (2009). Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3), 15. 102

Chapelle, O., Weston, J., and Schölkopf, B. (2003). Cluster kernels for semi-supervised learning. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15 (NIPS’02), pages 585–592, Cambridge, MA. MIT Press. 244

Chapelle, O., Schölkopf, B., and Zien, A., editors (2006). Semi-Supervised Learning. MIT Press, Cambridge, MA. 244, 544

Chellapilla, K., Puri, S., and Simard, P. (2006). High Performance Convolutional Neural Networks for Document Processing. In Guy Lorette, editor, Tenth International Workshop on Frontiers in Handwriting Recognition, La Baule (France). Université de Rennes 1, Suvisoft. http://www.suvisoft.com. 24, 27, 448

Chen, B., Ting, J.-A., Marlin, B. M., and de Freitas, N. (2010). Deep learning of invariant spatio-temporal features from video. NIPS*2010 Deep Learning and Unsupervised Feature Learning Workshop. 361

Chen, S. F. and Goodman, J. T. (1999). An empirical study of smoothing techniques for language modeling. Computer Speech and Language, 13(4), 359–393. 465, 476

Chen, T., Du, Z., Sun, N., Wang, J., Wu, C., Chen, Y., and Temam, O. (2014a). DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. In Proceedings of the 19th international conference on Architectural support for programming languages and operating systems, pages 269–284. ACM. 454

Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. (2015). MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274. 25

Chen, Y., Luo, T., Liu, S., Zhang, S., He, L., Wang, J., Li, L., Chen, T., Xu, Z., Sun, N., et al. (2014b). DaDianNao: A machine-learning supercomputer. In Microarchitecture (MICRO), 2014 47th Annual IEEE/ACM International Symposium on, pages 609–622. IEEE. 454

Chilimbi, T., Suzue, Y., Apacible, J., and Kalyanaraman, K. (2014). Project Adam: Building an efficient and scalable deep learning training system. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI’14). 450

Cho, K., Raiko, T., and Ilin, A. (2010). Parallel tempering is efficient for learning restricted Boltzmann machines. In IJCNN’2010. 606, 617
Cho, K., Raiko, T., and Ilin, A. (2011). Enhanced gradient and adaptive learning rate for training restricted Boltzmann machines. In ICML’2011, pages 105–112. 676

Cho, K., van Merriënboer, B., Gulcehre, C., Bougares, F., Schwenk, H., and Bengio, Y. (2014a). Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014). 397, 477, 478

Cho, K., Van Merriënboer, B., Bahdanau, D., and Bengio, Y. (2014b). On the properties of neural machine translation: Encoder-decoder approaches. ArXiv e-prints, abs/1409.1259. 414

Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. (2014). The loss surface of multilayer networks. 285, 286

Chorowski, J., Bahdanau, D., Cho, K., and Bengio, Y. (2014). End-to-end continuous speech recognition using attention-based recurrent NN: First results. arXiv:1412.1602. 463

Christianson, B. (1992). Automatic Hessians by reverse accumulation. IMA Journal of Numerical Analysis, 12(2), 135–150. 224

Chrupala, G., Kadar, A., and Alishahi, A. (2015). Learning language through pictures. arXiv 1506.03694. 414

Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. NIPS’2014 Deep Learning workshop, arXiv 1412.3555. 414, 463

Chung, J., Gülçehre, Ç., Cho, K., and Bengio, Y. (2015a). Gated feedback recurrent neural networks. In ICML’15. 414

Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A., and Bengio, Y. (2015b). A recurrent latent variable model for sequential data. In NIPS’2015. 700

Ciresan, D., Meier, U., Masci, J., and Schmidhuber, J. (2012). Multi-column deep neural network for traffic sign classification. Neural Networks, 32, 333–338. 23, 200

Ciresan, D. C., Meier, U., Gambardella, L. M., and Schmidhuber, J. (2010). Deep big simple neural nets for handwritten digit recognition. Neural Computation, 22, 1–14. 24, 27, 449

Coates, A. and Ng, A. Y. (2011). The importance of encoding versus training with sparse coding and vector quantization. In ICML’2011. 27, 254, 501

Coates, A., Lee, H., and Ng, A. Y. (2011). An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2011). 364, 365, 458
Coates, A., Huval, B., Wang, T., Wu, D., Catanzaro, B., and Andrew, N. (2013). Deep learning with COTS HPC systems. In S. Dasgupta and D. McAllester, editors, Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28 (3), pages 1337–1345. JMLR Workshop and Conference Proceedings. 24, 27, 365, 450

Cohen, N., Sharir, O., and Shashua, A. (2015). On the expressive power of deep learning: A tensor analysis. arXiv:1509.05009. 557

Collobert, R. (2004). Large Scale Machine Learning. Ph.D. thesis, Université de Paris VI, LIP6. 196

Collobert, R. (2011). Deep learning for efficient discriminative parsing. In AISTATS’2011. 101, 480

Collobert, R. and Weston, J. (2008a). A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML’2008. 474, 480

Collobert, R. and Weston, J. (2008b). A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML’2008. 538

Collobert, R., Bengio, S., and Bengio, Y. (2001). A parallel mixture of SVMs for very large scale problems. Technical Report IDIAP-RR-01-12, IDIAP. 453

Collobert, R., Bengio, S., and Bengio, Y. (2002). Parallel mixture of SVMs for very large scale problems. Neural Computation, 14(5), 1105–1114. 453

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011a). Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12, 2493–2537. 329, 480, 538, 539

Collobert, R., Kavukcuoglu, K., and Farabet, C. (2011b). Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop. 25, 210, 449

Comon, P. (1994). Independent component analysis – a new concept? Signal Processing, 36, 287–314. 494

Cortes, C. and Vapnik, V. (1995). Support vector networks. Machine Learning, 20, 273–297. 18, 140

Couprie, C., Farabet, C., Najman, L., and LeCun, Y. (2013). Indoor semantic segmentation using depth information. In International Conference on Learning Representations (ICLR2013). 23, 200

Courbariaux, M., Bengio, Y., and David, J.-P. (2015). Low precision arithmetic for deep learning. In Arxiv:1412.7024, ICLR’2015 Workshop. 455

Courville, A., Bergstra, J., and Bengio, Y. (2011). Unsupervised models of images by spike-and-slab RBMs. In ICML’11. 564, 683
Courville, A., Desjardins, G., Bergstra, J., and Bengio, Y. (2014). The spike-and-slab RBM and extensions to discrete and sparse data distributions. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 36(9), 1874–1887. 685

Cover, T. M. and Thomas, J. A. (2006). Elements of Information Theory, 2nd Edition. Wiley-Interscience. 73

Cox, D. and Pinto, N. (2011). Beyond simple features: A large-scale feature search approach to unconstrained face recognition. In Automatic Face & Gesture Recognition and Workshops (FG 2011), 2011 IEEE International Conference on, pages 8–15. IEEE. 364

Cramér, H. (1946). Mathematical methods of statistics. Princeton University Press. 135, 295

Crick, F. H. C. and Mitchison, G. (1983). The function of dream sleep. Nature, 304, 111–114. 612

Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2, 303–314. 197

Dahl, G. E., Ranzato, M., Mohamed, A., and Hinton, G. E. (2010). Phone recognition with the mean-covariance restricted Boltzmann machine. In NIPS’2010. 23

Dahl, G. E., Yu, D., Deng, L., and Acero, A. (2012). Context-dependent pre-trained deep neural networks for large vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 33–42. 462

Dahl, G. E., Sainath, T. N., and Hinton, G. E. (2013). Improving deep neural networks for LVCSR using rectified linear units and dropout. In ICASSP’2013. 462

Dahl, G. E., Jaitly, N., and Salakhutdinov, R. (2014). Multi-task neural networks for QSAR predictions. arXiv:1406.1231. 26

Dauphin, Y. and Bengio, Y. (2013). Stochastic ratio matching of RBMs for sparse high-dimensional inputs. In NIPS26. NIPS Foundation. 622

Dauphin, Y., Glorot, X., and Bengio, Y. (2011). Large-scale learning of embeddings with reconstruction sampling. In ICML’2011. 474

Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In NIPS’2014. 285, 286, 288

Davis, A., Rubinstein, M., Wadhwa, N., Mysore, G., Durand, F., and Freeman, W. T. (2014). The visual microphone: Passive recovery of sound from video. ACM Transactions on Graphics (Proc. SIGGRAPH), 33(4), 79:1–79:10. 455
BIBLIOGRAPHY
Dayan, P. (1990). Reinforcement comparison. In Connectionist Models: Proceedings of the 1990 Connectionist Summer School, San Mateo, CA. 693

Dayan, P. and Hinton, G. E. (1996). Varieties of Helmholtz machine. Neural Networks, 9(8), 1385–1403. 695

Dayan, P., Hinton, G. E., Neal, R. M., and Zemel, R. S. (1995). The Helmholtz machine. Neural Computation, 7(5), 889–904. 695

Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Le, Q., Mao, M., Ranzato, M., Senior, A., Tucker, P., Yang, K., and Ng, A. Y. (2012). Large scale distributed deep networks. In NIPS’2012. 25, 450

Dean, T. and Kanazawa, K. (1989). A model for reasoning about persistence and causation. Computational Intelligence, 5(3), 142–150. 664

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407. 479, 485

Delalleau, O. and Bengio, Y. (2011). Shallow vs. deep sum-product networks. In NIPS. 19, 557

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09. 21

Deng, J., Berg, A. C., Li, K., and Fei-Fei, L. (2010a). What does classifying more than 10,000 image categories tell us? In Proceedings of the 11th European Conference on Computer Vision: Part V, ECCV’10, pages 71–84, Berlin, Heidelberg. Springer-Verlag. 21

Deng, L. and Yu, D. (2014). Deep learning – methods and applications. Foundations and Trends in Signal Processing. 463

Deng, L., Seltzer, M., Yu, D., Acero, A., Mohamed, A., and Hinton, G. (2010b). Binary coding of speech spectrograms using a deep auto-encoder. In Interspeech 2010, Makuhari, Chiba, Japan. 23

Denil, M., Bazzani, L., Larochelle, H., and de Freitas, N. (2012). Learning where to attend with deep architectures for image tracking. Neural Computation, 24(8), 2151–2184. 368

Denton, E., Chintala, S., Szlam, A., and Fergus, R. (2015). Deep generative image models using a Laplacian pyramid of adversarial networks. NIPS. 703, 704, 720

Desjardins, G. and Bengio, Y. (2008). Empirical evaluation of convolutional RBMs for vision. Technical Report 1327, Département d’Informatique et de Recherche Opérationnelle, Université de Montréal. 685
Desjardins, G., Courville, A. C., Bengio, Y., Vincent, P., and Delalleau, O. (2010). Tempered Markov chain Monte Carlo for training of restricted Boltzmann machines. In International Conference on Artificial Intelligence and Statistics, pages 145–152. 606, 617

Desjardins, G., Courville, A., and Bengio, Y. (2011). On tracking the partition function. In NIPS’2011. 633

Desjardins, G., Simonyan, K., Pascanu, R., et al. (2015). Natural neural networks. In Advances in Neural Information Processing Systems, pages 2062–2070. 321

Devlin, J., Zbib, R., Huang, Z., Lamar, T., Schwartz, R., and Makhoul, J. (2014). Fast and robust neural network joint models for statistical machine translation. In Proc. ACL’2014. 476

Devroye, L. (2013). Non-Uniform Random Variate Generation. SpringerLink: Bücher. Springer New York. 696

DiCarlo, J. J. (2013). Mechanisms underlying visual object recognition: Humans vs. neurons vs. machines. NIPS Tutorial. 26, 367

Dinh, L., Krueger, D., and Bengio, Y. (2014). NICE: Non-linear independent components estimation. arXiv:1410.8516. 496

Donahue, J., Hendricks, L. A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., and Darrell, T. (2014). Long-term recurrent convolutional networks for visual recognition and description. arXiv:1411.4389. 102

Donoho, D. L. and Grimes, C. (2003). Hessian eigenmaps: new locally linear embedding techniques for high-dimensional data. Technical Report 2003-08, Dept. Statistics, Stanford University. 163, 522

Dosovitskiy, A., Springenberg, J. T., and Brox, T. (2015). Learning to generate chairs with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1538–1546. 697, 706, 707

Doya, K. (1993). Bifurcations of recurrent neural networks in gradient descent learning. IEEE Transactions on Neural Networks, 1, 75–80. 403, 406

Dreyfus, S. E. (1962). The numerical solution of variational problems. Journal of Mathematical Analysis and Applications, 5(1), 30–45. 225

Dreyfus, S. E. (1973). The computational solution of optimal control problems with time lag. IEEE Transactions on Automatic Control, 18(4), 383–385. 225

Drucker, H. and LeCun, Y. (1992). Improving generalisation performance using double back-propagation. IEEE Transactions on Neural Networks, 3(6), 991–997. 270
Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research. 307

Dudik, M., Langford, J., and Li, L. (2011). Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning, ICML ’11. 485

Dugas, C., Bengio, Y., Bélisle, F., and Nadeau, C. (2001). Incorporating second-order functional knowledge for better option pricing. In T. Leen, T. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13 (NIPS’00), pages 472–478. MIT Press. 68, 196

Dziugaite, G. K., Roy, D. M., and Ghahramani, Z. (2015). Training generative neural networks via maximum mean discrepancy optimization. arXiv preprint arXiv:1505.03906. 705

El Hihi, S. and Bengio, Y. (1996). Hierarchical recurrent neural networks for long-term dependencies. In NIPS’1995. 401, 410

Elkahky, A. M., Song, Y., and He, X. (2015). A multi-view deep learning approach for cross domain user modeling in recommendation systems. In Proceedings of the 24th International Conference on World Wide Web, pages 278–288. 483

Elman, J. L. (1993). Learning and development in neural networks: The importance of starting small. Cognition, 48, 781–799. 329

Erhan, D., Manzagol, P.-A., Bengio, Y., Bengio, S., and Vincent, P. (2009). The difficulty of training deep architectures and the effect of unsupervised pre-training. In Proceedings of AISTATS’2009. 200

Erhan, D., Bengio, Y., Courville, A., Manzagol, P., Vincent, P., and Bengio, S. (2010). Why does unsupervised pre-training help deep learning? J. Machine Learning Res. 532, 536, 537

Fahlman, S. E., Hinton, G. E., and Sejnowski, T. J. (1983). Massively parallel architectures for AI: NETL, thistle, and Boltzmann machines. In Proceedings of the National Conference on Artificial Intelligence AAAI-83. 573, 656

Fang, H., Gupta, S., Iandola, F., Srivastava, R., Deng, L., Dollár, P., Gao, J., He, X., Mitchell, M., Platt, J. C., Zitnick, C. L., and Zweig, G. (2015). From captions to visual concepts and back. arXiv:1411.4952. 102

Farabet, C., LeCun, Y., Kavukcuoglu, K., Culurciello, E., Martini, B., Akselrod, P., and Talay, S. (2011). Large-scale FPGA-based convolutional networks. In R. Bekkerman, M. Bilenko, and J. Langford, editors, Scaling up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press. 526
Farabet, C., Couprie, C., Najman, L., and LeCun, Y. (2013). Learning hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1915–1929. 23, 200, 360

Fei-Fei, L., Fergus, R., and Perona, P. (2006). One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4), 594–611. 541

Finn, C., Tan, X. Y., Duan, Y., Darrell, T., Levine, S., and Abbeel, P. (2015). Learning visual feature spaces for robotic manipulation with deep spatial autoencoders. arXiv preprint arXiv:1509.06113. 25

Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179–188. 21, 105

Földiák, P. (1989). Adaptive network for optimal linear feature extraction. In International Joint Conference on Neural Networks (IJCNN), volume 1, pages 401–405, Washington 1989. IEEE, New York. 497

Franzius, M., Sprekeler, H., and Wiskott, L. (2007). Slowness and sparseness lead to place, head-direction, and spatial-view cells. 498

Franzius, M., Wilbert, N., and Wiskott, L. (2008). Invariant object recognition with slow feature analysis. In Artificial Neural Networks-ICANN 2008, pages 961–970. Springer. 499

Frasconi, P., Gori, M., and Sperduti, A. (1997). On the efficient classification of data structures by neural networks. In Proc. Int. Joint Conf. on Artificial Intelligence. 401, 403

Frasconi, P., Gori, M., and Sperduti, A. (1998). A general framework for adaptive processing of data structures. IEEE Transactions on Neural Networks, 9(5), 768–786. 401, 403

Freund, Y. and Schapire, R. E. (1996a). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of Thirteenth International Conference, pages 148–156, USA. ACM. 257

Freund, Y. and Schapire, R. E. (1996b). Game theory, on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, pages 325–332. 257

Frey, B. J. (1998). Graphical models for machine learning and digital communication. MIT Press. 707, 708

Frey, B. J., Hinton, G. E., and Dayan, P. (1996). Does the wake-sleep algorithm learn good density estimators? In D. Touretzky, M. Mozer, and M. Hasselmo, editors, Advances in Neural Information Processing Systems 8 (NIPS’95), pages 661–670. MIT Press, Cambridge, MA. 654
Frobenius, G. (1908). Über matrizen aus positiven elementen. S. B. Preuss. Akad. Wiss., Berlin, Germany. 600

Fukushima, K. (1975). Cognitron: A self-organizing multilayered neural network. Biological Cybernetics, 20, 121–136. 16, 226, 531

Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36, 193–202. 16, 24, 27, 226, 368

Gal, Y. and Ghahramani, Z. (2015). Bayesian convolutional neural networks with Bernoulli approximate variational inference. arXiv preprint arXiv:1506.02158. 263

Gallinari, P., LeCun, Y., Thiria, S., and Fogelman-Soulie, F. (1987). Memoires associatives distribuees. In Proceedings of COGNITIVA 87, Paris, La Villette. 518

Garcia-Duran, A., Bordes, A., Usunier, N., and Grandvalet, Y. (2015). Combining two and three-way embeddings models for link prediction in knowledge bases. arXiv preprint arXiv:1506.00999. 487

Garofolo, J. S., Lamel, L. F., Fisher, W. M., Fiscus, J. G., and Pallett, D. S. (1993). DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon Technical Report N, 93, 27403. 462

Garson, J. (1900). The metric system of identification of criminals, as used in Great Britain and Ireland. The Journal of the Anthropological Institute of Great Britain and Ireland, (2), 177–227. 21

Gers, F. A., Schmidhuber, J., and Cummins, F. (2000). Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10), 2451–2471. 411, 415

Ghahramani, Z. and Hinton, G. E. (1996). The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, Dpt. of Comp. Sci., Univ. of Toronto. 492

Gillick, D., Brunk, C., Vinyals, O., and Subramanya, A. (2015). Multilingual language processing from bytes. arXiv preprint arXiv:1512.00103. 480

Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2015). Region-based convolutional networks for accurate object detection and segmentation. 429

Giudice, M. D., Manera, V., and Keysers, C. (2009). Programmed to learn? The ontogeny of mirror neurons. Dev. Sci., 12(2), 350–363. 658

Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In AISTATS’2010. 303

Glorot, X., Bordes, A., and Bengio, Y. (2011a). Deep sparse rectifier neural networks. In AISTATS’2011. 16, 173, 196, 226
Glorot, X., Bordes, A., and Bengio, Y. (2011b). Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML’2011. 510, 540

Goldberger, J., Roweis, S., Hinton, G. E., and Salakhutdinov, R. (2005). Neighbourhood components analysis. In L. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17 (NIPS’04). MIT Press. 115

Gong, S., McKenna, S., and Psarrou, A. (2000). Dynamic Vision: From Images to Face Recognition. Imperial College Press. 164, 522

Goodfellow, I., Le, Q., Saxe, A., and Ng, A. (2009). Measuring invariances in deep networks. In NIPS’2009, pages 646–654. 254

Goodfellow, I., Koenig, N., Muja, M., Pantofaru, C., Sorokin, A., and Takayama, L. (2010). Help me help you: Interfaces for personal robots. In Proc. of Human Robot Interaction (HRI), Osaka, Japan. ACM Press. 100

Goodfellow, I. J. (2010). Technical report: Multidimensional, downsampled convolution for autoencoders. Technical report, Université de Montréal. 358

Goodfellow, I. J. (2014). On distinguishability criteria for estimating generative models. In International Conference on Learning Representations, Workshops Track. 625, 702, 703

Goodfellow, I. J., Courville, A., and Bengio, Y. (2011). Spike-and-slab sparse coding for unsupervised feature discovery. In NIPS Workshop on Challenges in Learning Hierarchical Models. 535, 541

Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013a). Maxout networks. In S. Dasgupta and D. McAllester, editors, ICML’13, pages 1319–1327. 192, 263, 345, 366, 458

Goodfellow, I. J., Mirza, M., Courville, A., and Bengio, Y. (2013b). Multi-prediction deep Boltzmann machines. In NIPS26. NIPS Foundation. 100, 620, 673, 674, 675, 676, 677, 700

Goodfellow, I. J., Warde-Farley, D., Lamblin, P., Dumoulin, V., Mirza, M., Pascanu, R., Bergstra, J., Bastien, F., and Bengio, Y. (2013c). Pylearn2: a machine learning research library. arXiv preprint arXiv:1308.4214. 25, 449

Goodfellow, I. J., Courville, A., and Bengio, Y. (2013d). Scaling up spike-and-slab models for unsupervised feature learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1902–1914. 500, 501, 502, 652, 685

Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., and Bengio, Y. (2014a). An empirical investigation of catastrophic forgetting in gradient-based neural networks. In ICLR’2014. 193
BIBLIOGRAPHY
Goodfellow, I. J., Shlens, J., and Szegedy, C. (2014b). Explaining and harnessing adversarial examples. CoRR , abs/1412.6572. 267, 268, 270, 558, 559

Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014c). Generative adversarial networks. In NIPS’2014 . 547, 691, 702, 703, 706

Goodfellow, I. J., Bulatov, Y., Ibarz, J., Arnoud, S., and Shet, V. (2014d). Multi-digit number recognition from Street View imagery using deep convolutional neural networks. In International Conference on Learning Representations . 25, 101, 200, 201, 202, 391, 425, 452

Goodfellow, I. J., Vinyals, O., and Saxe, A. M. (2015). Qualitatively characterizing neural network optimization problems. In International Conference on Learning Representations . 285, 286, 287, 291

Goodman, J. (2001). Classes for fast maximum entropy training. In International Conference on Acoustics, Speech and Signal Processing (ICASSP) , Utah. 470

Gori, M. and Tesi, A. (1992). On the problem of local minima in backpropagation. IEEE Transactions on Pattern Analysis and Machine Intelligence , PAMI-14(1), 76–86. 284

Gosset, W. S. (1908). The probable error of a mean. Biometrika , 6(1), 1–25. Originally published under the pseudonym “Student”. 21

Gouws, S., Bengio, Y., and Corrado, G. (2014). BilBOWA: Fast bilingual distributed representations without word alignments. Technical report, arXiv:1410.2455. 479, 542

Graf, H. P. and Jackel, L. D. (1989). Analog electronic neural network circuits. Circuits and Devices Magazine, IEEE , 5(4), 44–49. 454

Graves, A. (2011). Practical variational inference for neural networks. In NIPS’2011 . 242

Graves, A. (2012). Supervised Sequence Labelling with Recurrent Neural Networks . Studies in Computational Intelligence. Springer. 375, 396, 414, 463

Graves, A. (2013). Generating sequences with recurrent neural networks. Technical report, arXiv:1308.0850. 189, 411, 418, 422

Graves, A. and Jaitly, N. (2014). Towards end-to-end speech recognition with recurrent neural networks. In ICML’2014 . 411

Graves, A. and Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks , 18(5), 602–610. 396

Graves, A. and Schmidhuber, J. (2009). Offline handwriting recognition with multidimensional recurrent neural networks. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, NIPS’2008 , pages 545–552. 396
Graves, A., Fernández, S., Gomez, F., and Schmidhuber, J. (2006). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In ICML’2006 , pages 369–376, Pittsburgh, USA. 463

Graves, A., Liwicki, M., Bunke, H., Schmidhuber, J., and Fernández, S. (2008). Unconstrained on-line handwriting recognition with recurrent neural networks. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, NIPS’2007 , pages 577–584. 396

Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., and Schmidhuber, J. (2009). A novel connectionist system for unconstrained handwriting recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on , 31(5), 855–868. 411

Graves, A., Mohamed, A., and Hinton, G. (2013). Speech recognition with deep recurrent neural networks. In ICASSP’2013 , pages 6645–6649. 396, 399, 401, 411, 413, 414, 463

Graves, A., Wayne, G., and Danihelka, I. (2014a). Neural Turing machines. arXiv:1410.5401. 25

Graves, A., Wayne, G., and Danihelka, I. (2014b). Neural Turing machines. arXiv preprint arXiv:1410.5401 . 419, 421

Grefenstette, E., Hermann, K. M., Suleyman, M., and Blunsom, P. (2015). Learning to transduce with unbounded memory. In NIPS’2015 . 421

Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R., and Schmidhuber, J. (2015). LSTM: a search space odyssey. arXiv preprint arXiv:1503.04069 . 415

Gregor, K. and LeCun, Y. (2010a). Emergence of complex-like cells in a temporal product network with local receptive fields. Technical report, arXiv:1006.0448. 353

Gregor, K. and LeCun, Y. (2010b). Learning fast approximations of sparse coding. In L. Bottou and M. Littman, editors, Proceedings of the Twenty-seventh International Conference on Machine Learning (ICML-10) . ACM. 655

Gregor, K., Danihelka, I., Mnih, A., Blundell, C., and Wierstra, D. (2014). Deep autoregressive networks. In International Conference on Machine Learning (ICML’2014) . 695

Gregor, K., Danihelka, I., Graves, A., and Wierstra, D. (2015). DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623 . 700

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A kernel two-sample test. The Journal of Machine Learning Research , 13(1), 723–773. 705

Gülçehre, Ç. and Bengio, Y. (2013). Knowledge matters: Importance of prior information for optimization. In International Conference on Learning Representations (ICLR’2013) . 25
Guo, H. and Gelfand, S. B. (1992). Classification trees with neural network feature extraction. Neural Networks, IEEE Transactions on , 3(6), 923–933. 453

Gupta, S., Agrawal, A., Gopalakrishnan, K., and Narayanan, P. (2015). Deep learning with limited numerical precision. CoRR , abs/1502.02551. 455

Gutmann, M. and Hyvarinen, A. (2010). Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of The Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS’10) . 623

Hadsell, R., Sermanet, P., Ben, J., Erkan, A., Han, J., Muller, U., and LeCun, Y. (2007). Online learning for offroad robots: Spatial label propagation to learn long-range traversability. In Proceedings of Robotics: Science and Systems , Atlanta, GA, USA. 456

Hajnal, A., Maass, W., Pudlak, P., Szegedy, M., and Turan, G. (1993). Threshold circuits of bounded depth. J. Comput. System. Sci. , 46, 129–154. 198

Håstad, J. (1986). Almost optimal lower bounds for small depth circuits. In Proceedings of the 18th annual ACM Symposium on Theory of Computing , pages 6–20, Berkeley, California. ACM Press. 198

Håstad, J. and Goldmann, M. (1991). On the power of small-depth threshold circuits. Computational Complexity , 1, 113–129. 198

Hastie, T., Tibshirani, R., and Friedman, J. (2001). The elements of statistical learning: data mining, inference and prediction . Springer Series in Statistics. Springer Verlag. 145

He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. arXiv preprint arXiv:1502.01852 . 28, 192

Hebb, D. O. (1949). The Organization of Behavior . Wiley, New York. 14, 17, 658

Henaff, M., Jarrett, K., Kavukcuoglu, K., and LeCun, Y. (2011). Unsupervised learning of sparse features for scalable audio classification. In ISMIR’11 . 526

Henderson, J. (2003). Inducing history representations for broad coverage statistical parsing. In HLT-NAACL , pages 103–110. 480

Henderson, J. (2004). Discriminative training of a neural network statistical parser. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics , page 95. 480

Henniges, M., Puertas, G., Bornschein, J., Eggert, J., and Lücke, J. (2010). Binary sparse coding. In Latent Variable Analysis and Signal Separation , pages 450–457. Springer. 643
Herault, J. and Ans, B. (1984). Circuits neuronaux à synapses modifiables: Décodage de messages composites par apprentissage non supervisé. Comptes Rendus de l’Académie des Sciences , 299(III-13), 525–528. 494

Hinton, G. (2012). Neural networks for machine learning. Coursera, video lectures. 307

Hinton, G., Deng, L., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T., and Kingsbury, B. (2012a). Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine , 29(6), 82–97. 23, 463

Hinton, G., Vinyals, O., and Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 . 451

Hinton, G. E. (1989). Connectionist learning procedures. Artificial Intelligence , 40, 185–234. 497

Hinton, G. E. (1990). Mapping part-whole hierarchies into connectionist networks. Artificial Intelligence , 46(1), 47–75. 421

Hinton, G. E. (1999). Products of experts. In ICANN’1999 . 573

Hinton, G. E. (2000). Training products of experts by minimizing contrastive divergence. Technical Report GCNU TR 2000-004, Gatsby Unit, University College London. 613, 678

Hinton, G. E. (2006). To recognize shapes, first learn to generate images. Technical Report UTML TR 2006-003, University of Toronto. 531, 598

Hinton, G. E. (2007a). How to do backpropagation in a brain. Invited talk at the NIPS’2007 Deep Learning Workshop. 658

Hinton, G. E. (2007b). Learning multiple layers of representation. Trends in cognitive sciences , 11(10), 428–434. 662

Hinton, G. E. (2010). A practical guide to training restricted Boltzmann machines. Technical Report UTML TR 2010-003, Department of Computer Science, University of Toronto. 613

Hinton, G. E. and Ghahramani, Z. (1997). Generative models for discovering sparse distributed representations. Philosophical Transactions of the Royal Society of London . 146

Hinton, G. E. and McClelland, J. L. (1988). Learning representations by recirculation. In NIPS’1987 , pages 358–366. 505

Hinton, G. E. and Roweis, S. (2003). Stochastic neighbor embedding. In NIPS’2002 . 522
Hinton, G. E. and Salakhutdinov, R. (2006). Reducing the dimensionality of data with neural networks. Science , 313(5786), 504–507. 512, 527, 531, 532, 537

Hinton, G. E. and Sejnowski, T. J. (1986). Learning and relearning in Boltzmann machines. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing , volume 1, chapter 7, pages 282–317. MIT Press, Cambridge. 573, 656

Hinton, G. E. and Sejnowski, T. J. (1999). Unsupervised learning: foundations of neural computation . MIT Press. 544

Hinton, G. E. and Shallice, T. (1991). Lesioning an attractor network: investigations of acquired dyslexia. Psychological review , 98(1), 74. 13

Hinton, G. E. and Zemel, R. S. (1994). Autoencoders, minimum description length, and Helmholtz free energy. In NIPS’1993 . 505

Hinton, G. E., Sejnowski, T. J., and Ackley, D. H. (1984). Boltzmann machines: Constraint satisfaction networks that learn. Technical Report TR-CMU-CS-84-119, Carnegie-Mellon University, Dept. of Computer Science. 573, 656

Hinton, G. E., McClelland, J., and Rumelhart, D. (1986). Distributed representations. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition , volume 1, pages 77–109. MIT Press, Cambridge. 17, 225, 529

Hinton, G. E., Revow, M., and Dayan, P. (1995a). Recognizing handwritten digits using mixtures of linear models. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems 7 (NIPS’94) , pages 1015–1022. MIT Press, Cambridge, MA. 492

Hinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. (1995b). The wake-sleep algorithm for unsupervised neural networks. Science , 268, 1158–1161. 507, 654

Hinton, G. E., Dayan, P., and Revow, M. (1997). Modelling the manifolds of images of handwritten digits. IEEE Transactions on Neural Networks , 8, 65–74. 502

Hinton, G. E., Welling, M., Teh, Y. W., and Osindero, S. (2001). A new view of ICA. In Proceedings of 3rd International Conference on Independent Component Analysis and Blind Signal Separation (ICA’01) , pages 746–751, San Diego, CA. 494

Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation , 18, 1527–1554. 14, 19, 27, 142, 531, 532, 662, 663

Hinton, G. E., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., and Kingsbury, B. (2012b). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Process. Mag. , 29(6), 82–97. 101
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2012c). Improving neural networks by preventing co-adaptation of feature detectors. Technical report, arXiv:1207.0580. 239, 261, 266

Hinton, G. E., Vinyals, O., and Dean, J. (2014). Dark knowledge. Invited talk at the BayLearn Bay Area Machine Learning Symposium. 451

Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, T.U. Münich. 18, 403, 405

Hochreiter, S. and Schmidhuber, J. (1995). Simplifying neural nets by discovering flat minima. In Advances in Neural Information Processing Systems 7 , pages 529–536. MIT Press. 243

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation , 9(8), 1735–1780. 18, 411, 414

Hochreiter, S., Informatik, F. F., Bengio, Y., Frasconi, P., and Schmidhuber, J. (2000). Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In J. Kolen and S. Kremer, editors, Field Guide to Dynamical Recurrent Networks . IEEE Press. 414

Holi, J. L. and Hwang, J.-N. (1993). Finite precision error analysis of neural network hardware implementations. Computers, IEEE Transactions on , 42(3), 281–290. 454

Holt, J. L. and Baker, T. E. (1991). Back propagation simulations using limited precision calculations. In Neural Networks, 1991., IJCNN-91-Seattle International Joint Conference on , volume 2, pages 121–126. IEEE. 454

Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks , 2, 359–366. 197

Hornik, K., Stinchcombe, M., and White, H. (1990). Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural networks , 3(5), 551–560. 197

Hsu, F.-H. (2002). Behind Deep Blue: Building the Computer That Defeated the World Chess Champion . Princeton University Press, Princeton, NJ, USA. 2

Huang, F. and Ogata, Y. (2002). Generalized pseudo-likelihood estimates for Markov random fields on lattice. Annals of the Institute of Statistical Mathematics , 54(1), 1–18. 619

Huang, P.-S., He, X., Gao, J., Deng, L., Acero, A., and Heck, L. (2013). Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management , pages 2333–2338. ACM. 483
BIBLIOGRAPHY
Hubel, D. and Wiesel, T. (1968). Receptive fields and functional architecture of monkey striate cortex. Journal of Physiology (London), 195, 215–243. 365

Hubel, D. H. and Wiesel, T. N. (1959). Receptive fields of single neurons in the cat's striate cortex. Journal of Physiology, 148, 574–591. 365

Hubel, D. H. and Wiesel, T. N. (1962). Receptive fields, binocular interaction, and functional architecture in the cat's visual cortex. Journal of Physiology (London), 160, 106–154. 365

Huszar, F. (2015). How (not) to train your generative model: schedule sampling, likelihood, or adversary? arXiv:1511.05101. 699

Hutter, F., Hoos, H., and Leyton-Brown, K. (2011). Sequential model-based optimization for general algorithm configuration. In LION-5. Extended version as UBC Tech report TR-2010-10. 439

Hyotyniemi, H. (1996). Turing machines are recurrent neural networks. In STeP'96, pages 13–24. 380

Hyvärinen, A. (1999). Survey on independent component analysis. Neural Computing Surveys, 2, 94–128. 494

Hyvärinen, A. (2005). Estimation of non-normalized statistical models using score matching. Journal of Machine Learning Research, 6, 695–709. 516, 620

Hyvärinen, A. (2007a). Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. IEEE Transactions on Neural Networks, 18, 1529–1531. 621

Hyvärinen, A. (2007b). Some extensions of score matching. Computational Statistics and Data Analysis, 51, 2499–2512. 621

Hyvärinen, A. and Hoyer, P. O. (1999). Emergence of topography and complex cell properties from natural images using extensions of ICA. In NIPS, pages 827–833. 496

Hyvärinen, A. and Pajunen, P. (1999). Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3), 429–439. 496

Hyvärinen, A., Karhunen, J., and Oja, E. (2001a). Independent Component Analysis. Wiley-Interscience. 494

Hyvärinen, A., Hoyer, P. O., and Inki, M. O. (2001b). Topographic independent component analysis. Neural Computation, 13(7), 1527–1558. 496

Hyvärinen, A., Hurri, J., and Hoyer, P. O. (2009). Natural Image Statistics: A probabilistic approach to early computational vision. Springer-Verlag. 371
Iba, Y. (2001). Extended ensemble Monte Carlo. International Journal of Modern Physics, C12, 623–656. 606

Inayoshi, H. and Kurita, T. (2005). Improved generalization by adding both auto-association and hidden-layer noise to neural-network-based-classifiers. IEEE Workshop on Machine Learning for Signal Processing, pages 141–146. 518

Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. 100, 318, 321

Jacobs, R. A. (1988). Increased rates of convergence through learning rate adaptation. Neural networks, 1(4), 295–307. 307

Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3, 79–87. 188, 453

Jaeger, H. (2003). Adaptive nonlinear system identification with echo state networks. In Advances in Neural Information Processing Systems 15. 406

Jaeger, H. (2007a). Discovering multiscale dynamical features with hierarchical echo state networks. Technical report, Jacobs University. 401

Jaeger, H. (2007b). Echo state network. Scholarpedia, 2(9), 2330. 406

Jaeger, H. (2012). Long short-term memory in echo state networks: Details of a simulation study. Technical report, Jacobs University Bremen. 407

Jaeger, H. and Haas, H. (2004). Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667), 78–80. 27, 406

Jaeger, H., Lukosevicius, M., Popovici, D., and Siewert, U. (2007). Optimization and applications of echo state networks with leaky-integrator neurons. Neural Networks, 20(3), 335–352. 410

Jain, V., Murray, J. F., Roth, F., Turaga, S., Zhigulin, V., Briggman, K. L., Helmstaedter, M. N., Denk, W., and Seung, H. S. (2007). Supervised learning of image restoration with convolutional networks. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8. IEEE. 360

Jaitly, N. and Hinton, G. (2011). Learning a better representation of speech soundwaves using restricted Boltzmann machines. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 5884–5887. IEEE. 461

Jaitly, N. and Hinton, G. E. (2013). Vocal tract length perturbation (VTLP) improves speech recognition. In ICML'2013. 241

Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009). What is the best multi-stage architecture for object recognition? In ICCV'09. 16, 24, 27, 173, 192, 226, 364, 365, 526
Jarzynski, C. (1997). Nonequilibrium equality for free energy differences. Phys. Rev. Lett., 78, 2690–2693. 628, 631

Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press. 53

Jean, S., Cho, K., Memisevic, R., and Bengio, Y. (2014). On using very large target vocabulary for neural machine translation. arXiv:1412.2007. 477

Jelinek, F. and Mercer, R. L. (1980). Interpolated estimation of Markov source parameters from sparse data. In E. S. Gelsema and L. N. Kanal, editors, Pattern Recognition in Practice. North-Holland, Amsterdam. 465, 476

Jia, Y. (2013). Caffe: An open source convolutional architecture for fast feature embedding. http://caffe.berkeleyvision.org/. 25, 210

Jia, Y., Huang, C., and Darrell, T. (2012). Beyond spatial pyramids: Receptive field learning for pooled image features. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3370–3377. IEEE. 346

Jim, K.-C., Giles, C. L., and Horne, B. G. (1996). An analysis of noise in recurrent neural networks: convergence and generalization. IEEE Transactions on Neural Networks, 7(6), 1424–1438. 242

Jordan, M. I. (1998). Learning in Graphical Models. Kluwer, Dordrecht, Netherlands. 18

Joulin, A. and Mikolov, T. (2015). Inferring algorithmic patterns with stack-augmented recurrent nets. arXiv preprint arXiv:1503.01007. 421

Jozefowicz, R., Zaremba, W., and Sutskever, I. (2015). An empirical evaluation of recurrent network architectures. In ICML'2015. 306, 414, 415

Judd, J. S. (1989). Neural Network Design and the Complexity of Learning. MIT Press. 293

Jutten, C. and Herault, J. (1991). Blind separation of sources, part I: an adaptive algorithm based on neuromimetic architecture. Signal Processing, 24, 1–10. 494

Kahou, S. E., Pal, C., Bouthillier, X., Froumenty, P., Gülçehre, Ç., Memisevic, R., Vincent, P., Courville, A., Bengio, Y., Ferrari, R. C., Mirza, M., Jean, S., Carrier, P. L., Dauphin, Y., Boulanger-Lewandowski, N., Aggarwal, A., Zumer, J., Lamblin, P., Raymond, J.-P., Desjardins, G., Pascanu, R., Warde-Farley, D., Torabi, A., Sharma, A., Bengio, E., Côté, M., Konda, K. R., and Wu, Z. (2013). Combining modality specific deep neural networks for emotion recognition in video. In Proceedings of the 15th ACM on International Conference on Multimodal Interaction. 200

Kalchbrenner, N. and Blunsom, P. (2013). Recurrent continuous translation models. In EMNLP'2013. 477
Kalchbrenner, N., Danihelka, I., and Graves, A. (2015). Grid long short-term memory. arXiv preprint arXiv:1507.01526. 397

Kamyshanska, H. and Memisevic, R. (2015). The potential energy of an autoencoder. IEEE Transactions on Pattern Analysis and Machine Intelligence. 518

Karpathy, A. and Li, F.-F. (2015). Deep visual-semantic alignments for generating image descriptions. In CVPR'2015. arXiv:1412.2306. 102

Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., and Fei-Fei, L. (2014). Large-scale video classification with convolutional neural networks. In CVPR. 21

Karush, W. (1939). Minima of Functions of Several Variables with Inequalities as Side Constraints. Master's thesis, Dept. of Mathematics, Univ. of Chicago. 95

Katz, S. M. (1987). Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-35(3), 400–401. 465, 476

Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2008). Fast inference in sparse coding algorithms with applications to object recognition. Technical report, Computational and Biological Learning Lab, Courant Institute, NYU. Tech Report CBLL-TR-2008-12-01. 526

Kavukcuoglu, K., Ranzato, M.-A., Fergus, R., and LeCun, Y. (2009). Learning invariant features through topographic filter maps. In CVPR'2009. 526

Kavukcuoglu, K., Sermanet, P., Boureau, Y.-L., Gregor, K., Mathieu, M., and LeCun, Y. (2010). Learning convolutional feature hierarchies for visual recognition. In NIPS'2010. 365, 526

Kelley, H. J. (1960). Gradient theory of optimal flight paths. ARS Journal, 30(10), 947–954. 225

Khan, F., Zhu, X., and Mutlu, B. (2011). How do humans teach: On curriculum learning and teaching dimension. In Advances in Neural Information Processing Systems 24 (NIPS'11), pages 1449–1457. 329

Kim, S. K., McAfee, L. C., McMahon, P. L., and Olukotun, K. (2009). A highly scalable restricted Boltzmann machine FPGA implementation. In Field Programmable Logic and Applications, 2009. FPL 2009. International Conference on, pages 367–372. IEEE. 454

Kindermann, R. (1980). Markov Random Fields and Their Applications (Contemporary Mathematics; V. 1). American Mathematical Society. 569

Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. 308
Kingma, D. and LeCun, Y. (2010). Regularized estimation of image statistics by score matching. In NIPS'2010. 516, 623

Kingma, D., Rezende, D., Mohamed, S., and Welling, M. (2014). Semi-supervised learning with deep generative models. In NIPS'2014. 429

Kingma, D. P. (2013). Fast gradient-based inference with continuous latent variable models in auxiliary form. Technical report, arxiv:1306.0733. 655, 691, 698

Kingma, D. P. and Welling, M. (2014a). Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations (ICLR). 691, 701

Kingma, D. P. and Welling, M. (2014b). Efficient gradient-based inference through transformations between bayes nets and neural nets. Technical report, arxiv:1402.0480. 691

Kirkpatrick, S., Jr., C. D. G., and Vecchi, M. P. (1983). Optimization by simulated annealing. Science, 220, 671–680. 328

Kiros, R., Salakhutdinov, R., and Zemel, R. (2014a). Multimodal neural language models. In ICML'2014. 102

Kiros, R., Salakhutdinov, R., and Zemel, R. (2014b). Unifying visual-semantic embeddings with multimodal neural language models. arXiv:1411.2539 [cs.LG]. 102, 411

Klementiev, A., Titov, I., and Bhattarai, B. (2012). Inducing crosslingual distributed representations of words. In Proceedings of COLING 2012. 479, 542

Knowles-Barley, S., Jones, T. R., Morgan, J., Lee, D., Kasthuri, N., Lichtman, J. W., and Pfister, H. (2014). Deep learning for the connectome. GPU Technology Conference. 26

Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press. 585, 598, 648

Konig, Y., Bourlard, H., and Morgan, N. (1996). REMAP: Recursive estimation and maximization of a posteriori probabilities – application to transition-based connectionist speech recognition. In D. Touretzky, M. Mozer, and M. Hasselmo, editors, Advances in Neural Information Processing Systems 8 (NIPS'95). MIT Press, Cambridge, MA. 462

Koren, Y. (2009). The BellKor solution to the Netflix grand prize. 256, 482

Kotzias, D., Denil, M., de Freitas, N., and Smyth, P. (2015). From group to individual labels using deep features. In ACM SIGKDD. 106

Koutnik, J., Greff, K., Gomez, F., and Schmidhuber, J. (2014). A clockwork RNN. In ICML'2014. 410

Kočiský, T., Hermann, K. M., and Blunsom, P. (2014). Learning Bilingual Word Representations by Marginalizing Alignments. In Proceedings of ACL. 479
Krause, O., Fischer, A., Glasmachers, T., and Igel, C. (2013). Approximation properties of DBNs with binary hidden units and real-valued visible units. In ICML'2013. 556

Krizhevsky, A. (2010). Convolutional deep belief networks on CIFAR-10. Technical report, University of Toronto. Unpublished Manuscript: http://www.cs.utoronto.ca/ kriz/conv-cifar10-aug2010.pdf. 449

Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Technical report, University of Toronto. 21, 564

Krizhevsky, A. and Hinton, G. E. (2011). Using very deep autoencoders for content-based image retrieval. In ESANN. 528

Krizhevsky, A., Sutskever, I., and Hinton, G. (2012). ImageNet classification with deep convolutional neural networks. In NIPS'2012. 23, 24, 27, 100, 200, 372, 457, 461

Krueger, K. A. and Dayan, P. (2009). Flexible shaping: how learning in small steps helps. Cognition, 110, 380–394. 329

Kuhn, H. W. and Tucker, A. W. (1951). Nonlinear programming. In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, pages 481–492, Berkeley, Calif. University of California Press. 95

Kumar, A., Irsoy, O., Su, J., Bradbury, J., English, R., Pierce, B., Ondruska, P., Iyyer, M., Gulrajani, I., and Socher, R. (2015). Ask me anything: Dynamic memory networks for natural language processing. arXiv:1506.07285. 421, 488

Kumar, M. P., Packer, B., and Koller, D. (2010). Self-paced learning for latent variable models. In NIPS'2010. 329

Lang, K. J. and Hinton, G. E. (1988). The development of the time-delay neural network architecture for speech recognition. Technical Report CMU-CS-88-152, Carnegie-Mellon University. 368, 375, 409

Lang, K. J., Waibel, A. H., and Hinton, G. E. (1990). A time-delay neural network architecture for isolated word recognition. Neural networks, 3(1), 23–43. 375

Langford, J. and Zhang, T. (2008). The epoch-greedy algorithm for contextual multi-armed bandits. In NIPS'2008, pages 1096–1103. 483

Lappalainen, H., Giannakopoulos, X., Honkela, A., and Karhunen, J. (2000). Nonlinear independent component analysis using ensemble learning: Experiments and discussion. In Proc. ICA. Citeseer. 496

Larochelle, H. and Bengio, Y. (2008). Classification using discriminative restricted Boltzmann machines. In ICML'2008. 244, 254, 533, 688, 717
BIBLIOGRAPHY
Larochelle, H. and Hinton, G. E. (2010). Learning to combine foveal glimpses with a third-order Boltzmann machine. In Advances in Neural Information Processing Systems 23, pages 1243–1251. 368

Larochelle, H. and Murray, I. (2011). The Neural Autoregressive Distribution Estimator. In AISTATS’2011. 707, 710

Larochelle, H., Erhan, D., and Bengio, Y. (2008). Zero-data learning of new tasks. In AAAI Conference on Artificial Intelligence. 542

Larochelle, H., Bengio, Y., Louradour, J., and Lamblin, P. (2009). Exploring strategies for training deep neural networks. Journal of Machine Learning Research, 10, 1–40. 538

Lasserre, J. A., Bishop, C. M., and Minka, T. P. (2006). Principled hybrids of generative and discriminative models. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR’06), pages 87–94, Washington, DC, USA. IEEE Computer Society. 244, 252

Le, Q., Ngiam, J., Chen, Z., hao Chia, D. J., Koh, P. W., and Ng, A. (2010). Tiled convolutional neural networks. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23 (NIPS’10), pages 1279–1287. 353

Le, Q., Ngiam, J., Coates, A., Lahiri, A., Prochnow, B., and Ng, A. (2011). On optimization methods for deep learning. In Proc. ICML’2011. ACM. 316

Le, Q., Ranzato, M., Monga, R., Devin, M., Corrado, G., Chen, K., Dean, J., and Ng, A. (2012). Building high-level features using large scale unsupervised learning. In ICML’2012. 24, 27

Le Roux, N. and Bengio, Y. (2008). Representational power of restricted Boltzmann machines and deep belief networks. Neural Computation, 20(6), 1631–1649. 556, 657

Le Roux, N. and Bengio, Y. (2010). Deep belief networks are compact universal approximators. Neural Computation, 22(8), 2192–2207. 556

LeCun, Y. (1985). Une procédure d’apprentissage pour Réseau à seuil assymétrique. In Cognitiva 85: A la Frontière de l’Intelligence Artificielle, des Sciences de la Connaissance et des Neurosciences, pages 599–604, Paris 1985. CESTA, Paris. 225

LeCun, Y. (1986). Learning processes in an asymmetric threshold network. In F. Fogelman-Soulié, E. Bienenstock, and G. Weisbuch, editors, Disordered Systems and Biological Organization, pages 233–240. Springer-Verlag, Les Houches, France. 351

LeCun, Y. (1987). Modèles connexionistes de l’apprentissage. Ph.D. thesis, Université de Paris VI. 18, 505, 518

LeCun, Y. (1989). Generalization and network design strategies. Technical Report CRG-TR-89-4, University of Toronto. 331, 351
LeCun, Y., Jackel, L. D., Boser, B., Denker, J. S., Graf, H. P., Guyon, I., Henderson, D., Howard, R. E., and Hubbard, W. (1989). Handwritten digit recognition: Applications of neural network chips and automatic learning. IEEE Communications Magazine, 27(11), 41–46. 369

LeCun, Y., Bottou, L., Orr, G. B., and Müller, K.-R. (1998a). Efficient backprop. In Neural Networks, Tricks of the Trade, Lecture Notes in Computer Science LNCS 1524. Springer Verlag. 310, 432

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998b). Gradient based learning applied to document recognition. Proc. IEEE. 16, 18, 21, 27, 372, 461, 463

LeCun, Y., Kavukcuoglu, K., and Farabet, C. (2010). Convolutional networks and applications in vision. In Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on, pages 253–256. IEEE. 372

L’Ecuyer, P. (1994). Efficiency improvement and variance reduction. In Proceedings of the 1994 Winter Simulation Conference, pages 122–132. 692

Lee, C.-Y., Xie, S., Gallagher, P., Zhang, Z., and Tu, Z. (2014). Deeply-supervised nets. arXiv preprint arXiv:1409.5185. 327

Lee, H., Battle, A., Raina, R., and Ng, A. (2007). Efficient sparse coding algorithms. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19 (NIPS’06), pages 801–808. MIT Press. 640

Lee, H., Ekanadham, C., and Ng, A. (2008). Sparse deep belief net model for visual area V2. In NIPS’07. 254

Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In L. Bottou and M. Littman, editors, Proceedings of the Twenty-sixth International Conference on Machine Learning (ICML’09). ACM, Montreal, Canada. 364, 685, 686

Lee, Y. J. and Grauman, K. (2011). Learning the easy things first: self-paced visual category discovery. In CVPR’2011. 329

Leibniz, G. W. (1676). Memoir using the chain rule. (Cited in TMME 7:2&3 p 321–332, 2010). 224

Lenat, D. B. and Guha, R. V. (1989). Building large knowledge-based systems; representation and inference in the Cyc project. Addison-Wesley Longman Publishing Co., Inc. 2

Leshno, M., Lin, V. Y., Pinkus, A., and Schocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6, 861–867. 197, 198
Levenberg, K. (1944). A method for the solution of certain non-linear problems in least squares. Quarterly Journal of Applied Mathematics, II(2), 164–168. 312

L’Hôpital, G. F. A. (1696). Analyse des infiniment petits, pour l’intelligence des lignes courbes. Paris: L’Imprimerie Royale. 224

Li, Y., Swersky, K., and Zemel, R. S. (2015). Generative moment matching networks. CoRR, abs/1502.02761. 705

Lin, T., Horne, B. G., Tino, P., and Giles, C. L. (1996). Learning long-term dependencies is not as difficult with NARX recurrent neural networks. IEEE Transactions on Neural Networks, 7(6), 1329–1338. 409

Lin, Y., Liu, Z., Sun, M., Liu, Y., and Zhu, X. (2015). Learning entity and relation embeddings for knowledge graph completion. In Proc. AAAI’15. 487

Linde, N. (1992). The machine that changed the world, episode 3. Documentary miniseries. 2

Lindsey, C. and Lindblad, T. (1994). Review of hardware neural networks: a user’s perspective. In Proc. Third Workshop on Neural Networks: From Biology to High Energy Physics, pages 195–202, Isola d’Elba, Italy. 454

Linnainmaa, S. (1976). Taylor expansion of the accumulated rounding error. BIT Numerical Mathematics, 16(2), 146–160. 225

LISA (2008). Deep learning tutorials: Restricted Boltzmann machines. Technical report, LISA Lab, Université de Montréal. 591

Long, P. M. and Servedio, R. A. (2010). Restricted Boltzmann machines are hard to approximately evaluate or simulate. In Proceedings of the 27th International Conference on Machine Learning (ICML’10). 660

Lotter, W., Kreiman, G., and Cox, D. (2015). Unsupervised learning of visual structure using predictive generative networks. arXiv preprint arXiv:1511.06380. 547, 548

Lovelace, A. (1842). Notes upon L. F. Menabrea’s “Sketch of the Analytical Engine invented by Charles Babbage”. 1

Lu, L., Zhang, X., Cho, K., and Renals, S. (2015). A study of the recurrent neural network encoder-decoder for large vocabulary speech recognition. In Proc. Interspeech. 463

Lu, T., Pál, D., and Pál, M. (2010). Contextual multi-armed bandits. In International Conference on Artificial Intelligence and Statistics, pages 485–492. 483

Luenberger, D. G. (1984). Linear and Nonlinear Programming. Addison Wesley. 317

Lukoševičius, M. and Jaeger, H. (2009). Reservoir computing approaches to recurrent neural network training. Computer Science Review, 3(3), 127–149. 406
Luo, H., Shen, R., Niu, C., and Ullrich, C. (2011). Learning class-relevant features and class-irrelevant features via a hybrid third-order RBM. In International Conference on Artificial Intelligence and Statistics, pages 470–478. 689

Luo, H., Carrier, P. L., Courville, A., and Bengio, Y. (2013). Texture modeling with convolutional spike-and-slab RBMs and deep extensions. In AISTATS’2013. 102

Lyu, S. (2009). Interpretation and generalization of score matching. In Proceedings of the Twenty-fifth Conference in Uncertainty in Artificial Intelligence (UAI’09). 621

Ma, J., Sheridan, R. P., Liaw, A., Dahl, G. E., and Svetnik, V. (2015). Deep neural nets as a method for quantitative structure – activity relationships. J. Chemical information and modeling. 533

Maas, A. L., Hannun, A. Y., and Ng, A. Y. (2013). Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech, and Language Processing. 192

Maass, W. (1992). Bounds for the computational power and learning complexity of analog neural nets (extended abstract). In Proc. of the 25th ACM Symp. Theory of Computing, pages 335–344. 198

Maass, W., Schnitger, G., and Sontag, E. D. (1994). A comparison of the computational power of sigmoid and Boolean threshold circuits. Theoretical Advances in Neural Computation and Learning, pages 127–151. 198

Maass, W., Natschlaeger, T., and Markram, H. (2002). Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11), 2531–2560. 406

MacKay, D. (2003). Information Theory, Inference and Learning Algorithms. Cambridge University Press. 73

Maclaurin, D., Duvenaud, D., and Adams, R. P. (2015). Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492. 438

Mao, J., Xu, W., Yang, Y., Wang, J., Huang, Z., and Yuille, A. L. (2015). Deep captioning with multimodal recurrent neural networks. In ICLR’2015. arXiv:1410.1090. 102

Marcotte, P. and Savard, G. (1992). Novel approaches to the discrimination problem. Zeitschrift für Operations Research (Theory), 36, 517–545. 276

Marlin, B. and de Freitas, N. (2011). Asymptotic efficiency of deterministic estimators for discrete energy-based models: Ratio matching and pseudolikelihood. In UAI’2011. 620, 622
Marlin, B., Swersky, K., Chen, B., and de Freitas, N. (2010). Inductive principles for restricted Boltzmann machine learning. In Proceedings of The Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS’10), volume 9, pages 509–516. 616, 621, 622

Marquardt, D. W. (1963). An algorithm for least-squares estimation of non-linear parameters. Journal of the Society of Industrial and Applied Mathematics, 11(2), 431–441. 312

Marr, D. and Poggio, T. (1976). Cooperative computation of stereo disparity. Science, 194. 368

Martens, J. (2010). Deep learning via Hessian-free optimization. In L. Bottou and M. Littman, editors, Proceedings of the Twenty-seventh International Conference on Machine Learning (ICML-10), pages 735–742. ACM. 304

Martens, J. and Medabalimi, V. (2014). On the expressive efficiency of sum product networks. arXiv:1411.7717. 557

Martens, J. and Sutskever, I. (2011). Learning recurrent neural networks with Hessian-free optimization. In Proc. ICML’2011. ACM. 415

Mase, S. (1995). Consistency of the maximum pseudo-likelihood estimator of continuous state space Gibbsian processes. The Annals of Applied Probability, 5(3), pp. 603–612. 619

McClelland, J., Rumelhart, D., and Hinton, G. (1995). The appeal of parallel distributed processing. In Computation & intelligence, pages 305–341. American Association for Artificial Intelligence. 17

McCulloch, W. S. and Pitts, W. (1943). A logical calculus of ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115–133. 14, 15

Mead, C. and Ismail, M. (2012). Analog VLSI implementation of neural systems, volume 80. Springer Science & Business Media. 454

Melchior, J., Fischer, A., and Wiskott, L. (2013). How to center binary deep Boltzmann machines. arXiv preprint arXiv:1311.1354. 675

Memisevic, R. and Hinton, G. E. (2007). Unsupervised learning of image transformations. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR’07). 688

Memisevic, R. and Hinton, G. E. (2010). Learning to represent spatial transformations with factored higher-order Boltzmann machines. Neural Computation, 22(6), 1473–1492. 688
Mesnil, G., Dauphin, Y., Glorot, X., Rifai, S., Bengio, Y., Goodfellow, I., Lavoie, E., Muller, X., Desjardins, G., Warde-Farley, D., Vincent, P., Courville, A., and Bergstra, J. (2011). Unsupervised and transfer learning challenge: a deep learning approach. In JMLR W&CP: Proc. Unsupervised and Transfer Learning, volume 7. 200, 535, 541

Mesnil, G., Rifai, S., Dauphin, Y., Bengio, Y., and Vincent, P. (2012). Surfing on the manifold. Learning Workshop, Snowbird. 713

Miikkulainen, R. and Dyer, M. G. (1991). Natural language processing with modular PDP networks and distributed lexicon. Cognitive Science, 15, 343–399. 480

Mikolov, T. (2012). Statistical Language Models based on Neural Networks. Ph.D. thesis, Brno University of Technology. 417

Mikolov, T., Deoras, A., Kombrink, S., Burget, L., and Cernocky, J. (2011a). Empirical evaluation and combination of advanced language modeling techniques. In Proc. 12th annual conference of the international speech communication association (INTERSPEECH 2011). 475

Mikolov, T., Deoras, A., Povey, D., Burget, L., and Cernocky, J. (2011b). Strategies for training large scale neural network language models. In Proc. ASRU’2011. 329, 475

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representations in vector space. In International Conference on Learning Representations: Workshops Track. 539

Mikolov, T., Le, Q. V., and Sutskever, I. (2013b). Exploiting similarities among languages for machine translation. Technical report, arXiv:1309.4168. 542

Minka, T. (2005). Divergence measures and message passing. Microsoft Research Cambridge UK Tech Rep MSRTR2005173, 72(TR-2005-173). 628

Minsky, M. L. and Papert, S. A. (1969). Perceptrons. MIT Press, Cambridge. 15

Mirza, M. and Osindero, S. (2014). Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. 703

Mishkin, D. and Matas, J. (2015). All you need is a good init. arXiv preprint arXiv:1511.06422. 305

Misra, J. and Saha, I. (2010). Artificial neural networks in hardware: A survey of two decades of progress. Neurocomputing, 74(1), 239–255. 454

Mitchell, T. M. (1997). Machine Learning. McGraw-Hill, New York. 99

Miyato, T., Maeda, S., Koyama, M., Nakae, K., and Ishii, S. (2015). Distributional smoothing with virtual adversarial training. In ICLR. Preprint: arXiv:1507.00677. 268
BIBLIOGRAPHY
Mnih, A. and Gregor, K. (2014). Neural variational inference and learning in belief networks. In ICML’2014. 693, 695

Mnih, A. and Hinton, G. E. (2007). Three new graphical models for statistical language modelling. In Z. Ghahramani, editor, Proceedings of the Twenty-fourth International Conference on Machine Learning (ICML’07), pages 641–648. ACM. 467

Mnih, A. and Hinton, G. E. (2009). A scalable hierarchical distributed language model. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21 (NIPS’08), pages 1081–1088. 470

Mnih, A. and Kavukcuoglu, K. (2013). Learning word embeddings efficiently with noise-contrastive estimation. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2265–2273. Curran Associates, Inc. 475, 625

Mnih, A. and Teh, Y. W. (2012). A fast and simple algorithm for training neural probabilistic language models. In ICML’2012, pages 1751–1758. 475

Mnih, V. and Hinton, G. (2010). Learning to detect roads in high-resolution aerial images. In Proceedings of the 11th European Conference on Computer Vision (ECCV). 102

Mnih, V., Larochelle, H., and Hinton, G. (2011). Conditional restricted Boltzmann machines for structure output prediction. In Proc. Conf. on Uncertainty in Artificial Intelligence (UAI). 687

Mnih, V., Kavukcuoglo, K., Silver, D., Graves, A., Antonoglou, I., and Wierstra, D. (2013). Playing Atari with deep reinforcement learning. Technical report, arXiv:1312.5602. 106

Mnih, V., Heess, N., Graves, A., and Kavukcuoglu, K. (2014). Recurrent models of visual attention. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, editors, NIPS’2014, pages 2204–2212. 693

Mnih, V., Kavukcuoglo, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidgeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529–533. 25

Mobahi, H. and Fisher, III, J. W. (2015). A theoretical analysis of optimization by Gaussian continuation. In AAAI’2015. 328

Mobahi, H., Collobert, R., and Weston, J. (2009). Deep learning from temporal coherence in video. In L. Bottou and M. Littman, editors, Proceedings of the 26th International Conference on Machine Learning, pages 737–744, Montreal. Omnipress. 497

Mohamed, A., Dahl, G., and Hinton, G. (2009). Deep belief networks for phone recognition. 462
Mohamed, A., Sainath, T. N., Dahl, G., Ramabhadran, B., Hinton, G. E., and Picheny, M. A. (2011). Deep belief networks using discriminative features for phone recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 5060–5063. IEEE. 462

Mohamed, A., Dahl, G., and Hinton, G. (2012a). Acoustic modeling using deep belief networks. IEEE Trans. on Audio, Speech and Language Processing, 20(1), 14–22. 462

Mohamed, A., Hinton, G., and Penn, G. (2012b). Understanding how deep belief networks perform acoustic modelling. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pages 4273–4276. IEEE. 462

Moller, M. F. (1993). A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6, 525–533. 316

Montavon, G. and Muller, K.-R. (2012). Deep Boltzmann machines and the centering trick. In G. Montavon, G. Orr, and K.-R. Müller, editors, Neural Networks: Tricks of the Trade, volume 7700 of Lecture Notes in Computer Science, pages 621–637. Preprint: http://arxiv.org/abs/1203.3783. 675

Montúfar, G. (2014). Universal approximation depth and errors of narrow belief networks with discrete units. Neural Computation, 26. 556

Montúfar, G. and Ay, N. (2011). Refinements of universal approximation results for deep belief networks and restricted Boltzmann machines. Neural Computation, 23(5), 1306–1319. 556

Montufar, G. F., Pascanu, R., Cho, K., and Bengio, Y. (2014). On the number of linear regions of deep neural networks. In NIPS’2014. 19, 199

Mor-Yosef, S., Samueloff, A., Modan, B., Navot, D., and Schenker, J. G. (1990). Ranking the risk factors for cesarean: logistic regression analysis of a nationwide study. Obstet Gynecol, 75(6), 944–7. 3

Morin, F. and Bengio, Y. (2005). Hierarchical probabilistic neural network language model. In AISTATS’2005. 470, 472

Mozer, M. C. (1992). The induction of multiscale temporal structure. In J. M. S. Hanson and R. Lippmann, editors, Advances in Neural Information Processing Systems 4 (NIPS’91), pages 275–282, San Mateo, CA. Morgan Kaufmann. 410

Murphy, K. P. (2012). Machine Learning: a Probabilistic Perspective. MIT Press, Cambridge, MA, USA. 62, 98, 145

Murray, B. U. I. and Larochelle, H. (2014). A deep and tractable density estimator. In ICML’2014. 189, 712

Nair, V. and Hinton, G. (2010). Rectified linear units improve restricted Boltzmann machines. In ICML’2010. 16, 173, 196
Nair, V. and Hinton, G. E. (2009). 3d object recognition with deep belief nets. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1339–1347. Curran Associates, Inc. 688

Narayanan, H. and Mitter, S. (2010). Sample complexity of testing the manifold hypothesis. In NIPS’2010. 163

Naumann, U. (2008). Optimal Jacobian accumulation is NP-complete. Mathematical Programming, 112(2), 427–441. 221

Navigli, R. and Velardi, P. (2005). Structural semantic interconnections: a knowledge-based approach to word sense disambiguation. IEEE Trans. Pattern Analysis and Machine Intelligence, 27(7), 1075–1086. 487

Neal, R. and Hinton, G. (1999). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan, editor, Learning in Graphical Models. MIT Press, Cambridge, MA. 637

Neal, R. M. (1990). Learning stochastic feedforward networks. Technical report. 694

Neal, R. M. (1993). Probabilistic inference using Markov chain Monte-Carlo methods. Technical Report CRG-TR-93-1, Dept. of Computer Science, University of Toronto. 682

Neal, R. M. (1994). Sampling from multimodal distributions using tempered transitions. Technical Report 9421, Dept. of Statistics, University of Toronto. 606

Neal, R. M. (1996). Bayesian Learning for Neural Networks. Lecture Notes in Statistics. Springer. 264

Neal, R. M. (2001). Annealed importance sampling. Statistics and Computing, 11(2), 125–139. 628, 630, 631, 632

Neal, R. M. (2005). Estimating ratios of normalizing constants using linked importance sampling. 632

Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 27, 372–376. 300

Nesterov, Y. (2004). Introductory lectures on convex optimization: a basic course. Applied optimization. Kluwer Academic Publ., Boston, Dordrecht, London. 300

Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. (2011). Reading digits in natural images with unsupervised feature learning. Deep Learning and Unsupervised Feature Learning Workshop, NIPS. 21

Ney, H. and Kneser, R. (1993). Improved clustering techniques for class-based statistical language modelling. In European Conference on Speech Communication and Technology (Eurospeech), pages 973–976, Berlin. 466
Ng, A. (2015). Advice for applying machine learning. https://see.stanford.edu/materials/aimlcs229/ML-advice.pdf. 424

Niesler, T. R., Whittaker, E. W. D., and Woodland, P. C. (1998). Comparison of part-of-speech and automatically derived category-based language models for speech recognition. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 177–180. 466

Ning, F., Delhomme, D., LeCun, Y., Piano, F., Bottou, L., and Barbano, P. E. (2005). Toward automatic phenotyping of developing embryos from videos. Image Processing, IEEE Transactions on, 14(9), 1360–1371. 361

Nocedal, J. and Wright, S. (2006). Numerical Optimization. Springer. 92, 95

Norouzi, M. and Fleet, D. J. (2011). Minimal loss hashing for compact binary codes. In ICML’2011. 528

Nowlan, S. J. (1990). Competing experts: An experimental investigation of associative mixture models. Technical Report CRG-TR-90-5, University of Toronto. 453

Nowlan, S. J. and Hinton, G. E. (1992). Simplifying neural networks by soft weight-sharing. Neural Computation, 4(4), 473–493. 139

Olshausen, B. and Field, D. J. (2005). How close are we to understanding V1? Neural Computation, 17, 1665–1699. 16

Olshausen, B. A. and Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609. 146, 254, 371, 499

Olshausen, B. A., Anderson, C. H., and Van Essen, D. C. (1993). A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. J. Neurosci., 13(11), 4700–4719. 453

Opper, M. and Archambeau, C. (2009). The variational Gaussian approximation revisited. Neural computation, 21(3), 786–792. 691

Oquab, M., Bottou, L., Laptev, I., and Sivic, J. (2014). Learning and transferring mid-level image representations using convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1717–1724. IEEE. 539

Osindero, S. and Hinton, G. E. (2008). Modeling image patches with a directed hierarchy of Markov random fields. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20 (NIPS’07), pages 1121–1128, Cambridge, MA. MIT Press. 635

Ovid and Martin, C. (2004). Metamorphoses. W.W. Norton. 1
Paccanaro, A. and Hinton, G. E. (2000). Extracting distributed representations of concepts and relations from positive and negative propositions. In International Joint Conference on Neural Networks (IJCNN), Como, Italy. IEEE, New York. 487

Paine, T. L., Khorrami, P., Han, W., and Huang, T. S. (2014). An analysis of unsupervised pre-training in light of recent advances. arXiv preprint arXiv:1412.6597. 535

Palatucci, M., Pomerleau, D., Hinton, G. E., and Mitchell, T. M. (2009). Zero-shot learning with semantic output codes. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1410–1418. Curran Associates, Inc. 542

Parker, D. B. (1985). Learning-logic. Technical Report TR-47, Center for Comp. Research in Economics and Management Sci., MIT. 225

Pascanu, R., Mikolov, T., and Bengio, Y. (2013a). On the difficulty of training recurrent neural networks. In ICML’2013. 289, 403, 406, 410, 417, 419

Pascanu, R., Montufar, G., and Bengio, Y. (2013b). On the number of inference regions of deep feed forward networks with piece-wise linear activations. Technical report, U. Montreal, arXiv:1312.6098. 198

Pascanu, R., Gülçehre, Ç., Cho, K., and Bengio, Y. (2014a). How to construct deep recurrent neural networks. In ICLR’2014. 19, 199, 264, 399, 400, 401, 413, 463

Pascanu, R., Montufar, G., and Bengio, Y. (2014b). On the number of inference regions of deep feed forward networks with piece-wise linear activations. In ICLR’2014. 553

Pati, Y., Rezaiifar, R., and Krishnaprasad, P. (1993). Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In Proceedings of the 27th Annual Asilomar Conference on Signals, Systems, and Computers, pages 40–44. 254

Pearl, J. (1985). Bayesian networks: A model of self-activated memory for evidential reasoning. In Proceedings of the 7th Conference of the Cognitive Science Society, University of California, Irvine, pages 329–334. 566

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann. 54

Perron, O. (1907). Zur theorie der matrices. Mathematische Annalen, 64(2), 248–263. 600

Petersen, K. B. and Pedersen, M. S. (2006). The matrix cookbook. Version 20051003. 31

Peterson, G. B. (2004). A day of great illumination: B. F. Skinner’s discovery of shaping. Journal of the Experimental Analysis of Behavior, 82(3), 317–328. 329

Pham, D.-T., Garat, P., and Jutten, C. (1992). Separation of a mixture of independent sources through a maximum likelihood approach. In EUSIPCO, pages 771–774. 494
Pham, P.-H., Jelaca, D., Farabet, C., Martini, B., LeCun, Y., and Culurciello, E. (2012). NeuFlow: dataflow vision processing system-on-a-chip. In Circuits and Systems (MWSCAS), 2012 IEEE 55th International Midwest Symposium on, pages 1044–1047. IEEE. 454

Pinheiro, P. H. O. and Collobert, R. (2014). Recurrent convolutional neural networks for scene labeling. In ICML’2014. 360

Pinheiro, P. H. O. and Collobert, R. (2015). From image-level to pixel-level labeling with convolutional networks. In Conference on Computer Vision and Pattern Recognition (CVPR). 360

Pinto, N., Cox, D. D., and DiCarlo, J. J. (2008). Why is real-world visual object recognition hard? PLoS Comput Biol, 4. 459

Pinto, N., Stone, Z., Zickler, T., and Cox, D. (2011). Scaling up biologically-inspired computer vision: A case study in unconstrained face recognition on facebook. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer Society Conference on, pages 35–42. IEEE. 364

Pollack, J. B. (1990). Recursive distributed representations. Artificial Intelligence, 46(1), 77–105. 401

Polyak, B. and Juditsky, A. (1992). Acceleration of stochastic approximation by averaging. SIAM J. Control and Optimization, 30(4), 838–855. 323

Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5), 1–17. 296

Poole, B., Sohl-Dickstein, J., and Ganguli, S. (2014). Analyzing noise in autoencoders and deep networks. CoRR, abs/1406.1831. 241

Poon, H. and Domingos, P. (2011). Sum-product networks: A new deep architecture. In Proceedings of the Twenty-seventh Conference in Uncertainty in Artificial Intelligence (UAI), Barcelona, Spain. 557

Presley, R. K. and Haggard, R. L. (1994). A fixed point implementation of the backpropagation learning algorithm. In Southeastcon’94. Creative Technology Transfer-A Global Affair., Proceedings of the 1994 IEEE, pages 136–138. IEEE. 454

Price, R. (1958). A useful theorem for nonlinear devices having Gaussian inputs. IEEE Transactions on Information Theory, 4(2), 69–72. 691

Quiroga, R. Q., Reddy, L., Kreiman, G., Koch, C., and Fried, I. (2005). Invariant visual representation by single neurons in the human brain. Nature, 435(7045), 1102–1107. 367
BIBLIOGRAPHY
Radford, A., Metz, L., and Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. 555, 703, 704

Raiko, T., Yao, L., Cho, K., and Bengio, Y. (2014). Iterative neural autoregressive distribution estimator (NADE-k). Technical report, arXiv:1406.1485. 678, 711

Raina, R., Madhavan, A., and Ng, A. Y. (2009). Large-scale deep unsupervised learning using graphics processors. In L. Bottou and M. Littman, editors, Proceedings of the Twenty-sixth International Conference on Machine Learning (ICML'09), pages 873–880, New York, NY, USA. ACM. 27, 449

Ramsey, F. P. (1926). Truth and probability. In R. B. Braithwaite, editor, The Foundations of Mathematics and other Logical Essays, chapter 7, pages 156–198. McMaster University Archive for the History of Economic Thought. 56

Ranzato, M. and Hinton, G. H. (2010). Modeling pixel means and covariances using factorized third-order Boltzmann machines. In CVPR'2010, pages 2551–2558. 682

Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. (2007a). Efficient learning of sparse representations with an energy-based model. In NIPS'2006. 14, 19, 510, 531, 533

Ranzato, M., Huang, F., Boureau, Y., and LeCun, Y. (2007b). Unsupervised learning of invariant feature hierarchies with applications to object recognition. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR'07). IEEE Press. 365

Ranzato, M., Boureau, Y., and LeCun, Y. (2008). Sparse feature learning for deep belief networks. In NIPS'2007. 510

Ranzato, M., Krizhevsky, A., and Hinton, G. E. (2010a). Factored 3-way restricted Boltzmann machines for modeling natural images. In Proceedings of AISTATS 2010. 680, 681

Ranzato, M., Mnih, V., and Hinton, G. (2010b). Generating more realistic images using gated MRFs. In NIPS'2010. 682, 683

Rao, C. (1945). Information and the accuracy attainable in the estimation of statistical parameters. Bulletin of the Calcutta Mathematical Society, 37, 81–89. 135, 295

Rasmus, A., Valpola, H., Honkala, M., Berglund, M., and Raiko, T. (2015). Semi-supervised learning with ladder network. arXiv preprint arXiv:1507.02672. 429, 533

Recht, B., Re, C., Wright, S., and Niu, F. (2011). Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In NIPS'2011. 450

Reichert, D. P., Seriès, P., and Storkey, A. J. (2011). Neuronal adaptation for sampling-based probabilistic inference in perceptual bistability. In Advances in Neural Information Processing Systems, pages 2357–2365. 668
Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In ICML'2014. Preprint: arXiv:1401.4082. 655, 691, 698

Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. (2011a). Contractive auto-encoders: Explicit invariance during feature extraction. In ICML'2011. 524, 525, 526

Rifai, S., Mesnil, G., Vincent, P., Muller, X., Bengio, Y., Dauphin, Y., and Glorot, X. (2011b). Higher order contractive auto-encoder. In ECML PKDD. 524, 525

Rifai, S., Dauphin, Y., Vincent, P., Bengio, Y., and Muller, X. (2011c). The manifold tangent classifier. In NIPS'2011. 270, 271

Rifai, S., Bengio, Y., Dauphin, Y., and Vincent, P. (2012). A generative process for sampling contractive auto-encoders. In ICML'2012. 713

Ringach, D. and Shapley, R. (2004). Reverse correlation in neurophysiology. Cognitive Science, 28(2), 147–166. 369

Roberts, S. and Everson, R. (2001). Independent component analysis: principles and practice. Cambridge University Press. 496

Robinson, A. J. and Fallside, F. (1991). A recurrent error propagation network speech recognition system. Computer Speech and Language, 5(3), 259–274. 27, 462

Rockafellar, R. T. (1997). Convex Analysis. Princeton Landmarks in Mathematics. 93

Romero, A., Ballas, N., Ebrahimi Kahou, S., Chassang, A., Gatta, C., and Bengio, Y. (2015). FitNets: Hints for thin deep nets. In ICLR'2015, arXiv:1412.6550. 326

Rosen, J. B. (1960). The gradient projection method for nonlinear programming. Part I. Linear constraints. Journal of the Society for Industrial and Applied Mathematics, 8(1), pp. 181–217. 93

Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65, 386–408. 14, 15, 27

Rosenblatt, F. (1962). Principles of Neurodynamics. Spartan, New York. 15, 27

Roweis, S. and Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500). 163, 521

Roweis, S., Saul, L., and Hinton, G. (2002). Global coordination of local linear models. In T. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14 (NIPS'01), Cambridge, MA. MIT Press. 492

Rubin, D. B. et al. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. The Annals of Statistics, 12(4), 1151–1172. 718
Rumelhart, D., Hinton, G., and Williams, R. (1986a). Learning representations by back-propagating errors. Nature, 323, 533–536. 14, 18, 23, 203, 225, 374, 479, 485

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986b). Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing, volume 1, chapter 8, pages 318–362. MIT Press, Cambridge. 21, 27, 225

Rumelhart, D. E., McClelland, J. L., and the PDP Research Group (1986c). Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, Cambridge. 17

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. (2014a). ImageNet Large Scale Visual Recognition Challenge. 21

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. (2014b). ImageNet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575. 28

Russel, S. J. and Norvig, P. (2003). Artificial Intelligence: a Modern Approach. Prentice Hall. 86

Rust, N., Schwartz, O., Movshon, J. A., and Simoncelli, E. (2005). Spatiotemporal elements of macaque V1 receptive fields. Neuron, 46(6), 945–956. 368

Sainath, T., Mohamed, A., Kingsbury, B., and Ramabhadran, B. (2013). Deep convolutional neural networks for LVCSR. In ICASSP 2013. 463

Salakhutdinov, R. (2010). Learning in Markov random fields using tempered transitions. In Y. Bengio, D. Schuurmans, C. Williams, J. Lafferty, and A. Culotta, editors, Advances in Neural Information Processing Systems 22 (NIPS'09). 606

Salakhutdinov, R. and Hinton, G. (2009a). Deep Boltzmann machines. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 5, pages 448–455. 24, 27, 532, 665, 668, 673, 674

Salakhutdinov, R. and Hinton, G. (2009b). Semantic hashing. In International Journal of Approximate Reasoning. 528

Salakhutdinov, R. and Hinton, G. E. (2007a). Learning a nonlinear embedding by preserving class neighbourhood structure. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS'07), San Juan, Porto Rico. Omnipress. 530

Salakhutdinov, R. and Hinton, G. E. (2007b). Semantic hashing. In SIGIR'2007. 528
Salakhutdinov, R. and Hinton, G. E. (2008). Using deep belief nets to learn covariance kernels for Gaussian processes. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20 (NIPS'07), pages 1249–1256, Cambridge, MA. MIT Press. 244

Salakhutdinov, R. and Larochelle, H. (2010). Efficient learning of deep Boltzmann machines. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2010), JMLR W&CP, volume 9, pages 693–700. 655

Salakhutdinov, R. and Mnih, A. (2008). Probabilistic matrix factorization. In NIPS'2008. 482

Salakhutdinov, R. and Murray, I. (2008). On the quantitative analysis of deep belief networks. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, Proceedings of the Twenty-fifth International Conference on Machine Learning (ICML'08), volume 25, pages 872–879. ACM. 631, 664

Salakhutdinov, R., Mnih, A., and Hinton, G. (2007). Restricted Boltzmann machines for collaborative filtering. In ICML. 482

Sanger, T. D. (1994). Neural network learning control of robot manipulators using gradually increasing task difficulty. IEEE Transactions on Robotics and Automation, 10(3). 329

Saul, L. K. and Jordan, M. I. (1996). Exploiting tractable substructures in intractable networks. In D. Touretzky, M. Mozer, and M. Hasselmo, editors, Advances in Neural Information Processing Systems 8 (NIPS'95). MIT Press, Cambridge, MA. 641

Saul, L. K., Jaakkola, T., and Jordan, M. I. (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4, 61–76. 27, 695

Savich, A. W., Moussa, M., and Areibi, S. (2007). The impact of arithmetic representation on implementing MLP-BP on FPGAs: A study. Neural Networks, IEEE Transactions on, 18(1), 240–252. 454

Saxe, A. M., Koh, P. W., Chen, Z., Bhand, M., Suresh, B., and Ng, A. (2011). On random weights and unsupervised feature learning. In Proc. ICML'2011. ACM. 364

Saxe, A. M., McClelland, J. L., and Ganguli, S. (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In ICLR. 285, 286, 303

Schaul, T., Antonoglou, I., and Silver, D. (2014). Unit tests for stochastic optimization. In International Conference on Learning Representations. 309

Schmidhuber, J. (1992). Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2), 234–242. 401

Schmidhuber, J. (1996). Sequential neural text compression. IEEE Transactions on Neural Networks, 7(1), 142–146. 480
Schmidhuber, J. (2012). Self-delimiting neural networks. arXiv preprint arXiv:1210.0118. 391

Schölkopf, B. and Smola, A. J. (2002). Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT Press. 705

Schölkopf, B., Smola, A., and Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299–1319. 163, 521

Schölkopf, B., Burges, C. J. C., and Smola, A. J. (1999). Advances in Kernel Methods — Support Vector Learning. MIT Press, Cambridge, MA. 18, 142

Schölkopf, B., Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., and Mooij, J. (2012). On causal and anticausal learning. In ICML'2012, pages 1255–1262. 548

Schuster, M. (1999). On supervised learning from sequential data with applications for speech recognition. 189

Schuster, M. and Paliwal, K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), 2673–2681. 396

Schwenk, H. (2007). Continuous space language models. Computer Speech and Language, 21, 492–518. 469

Schwenk, H. (2010). Continuous space language models for statistical machine translation. The Prague Bulletin of Mathematical Linguistics, 93, 137–146. 476

Schwenk, H. (2014). Cleaned subset of WMT '14 dataset. 21

Schwenk, H. and Bengio, Y. (1998). Training methods for adaptive boosting of neural networks. In M. Jordan, M. Kearns, and S. Solla, editors, Advances in Neural Information Processing Systems 10 (NIPS'97), pages 647–653. MIT Press. 257

Schwenk, H. and Gauvain, J.-L. (2002). Connectionist language modeling for large vocabulary continuous speech recognition. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 765–768, Orlando, Florida. 469

Schwenk, H., Costa-jussà, M. R., and Fonollosa, J. A. R. (2006). Continuous space language models for the IWSLT 2006 task. In International Workshop on Spoken Language Translation, pages 166–173. 476

Seide, F., Li, G., and Yu, D. (2011). Conversational speech transcription using context-dependent deep neural networks. In Interspeech 2011, pages 437–440. 23

Sejnowski, T. (1987). Higher-order Boltzmann machines. In AIP Conference Proceedings 151 on Neural Networks for Computing, pages 398–403. American Institute of Physics Inc. 688
Series, P., Reichert, D. P., and Storkey, A. J. (2010). Hallucinations in Charles Bonnet syndrome induced by homeostasis: a deep Boltzmann machine model. In Advances in Neural Information Processing Systems, pages 2020–2028. 668

Sermanet, P., Chintala, S., and LeCun, Y. (2012). Convolutional neural networks applied to house numbers digit classification. CoRR, abs/1204.3968. 459

Sermanet, P., Kavukcuoglu, K., Chintala, S., and LeCun, Y. (2013). Pedestrian detection with unsupervised multi-stage feature learning. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR'13). IEEE. 23, 200

Shilov, G. (1977). Linear Algebra. Dover Books on Mathematics Series. Dover Publications. 31

Siegelmann, H. (1995). Computation beyond the Turing limit. Science, 268(5210), 545–548. 380

Siegelmann, H. and Sontag, E. (1991). Turing computability with neural nets. Applied Mathematics Letters, 4(6), 77–80. 380

Siegelmann, H. T. and Sontag, E. D. (1995). On the computational power of neural nets. Journal of Computer and Systems Sciences, 50(1), 132–150. 380, 406

Sietsma, J. and Dow, R. (1991). Creating artificial neural networks that generalize. Neural Networks, 4(1), 67–79. 241

Simard, D., Steinkraus, P. Y., and Platt, J. C. (2003). Best practices for convolutional neural networks. In ICDAR'2003. 372

Simard, P. and Graf, H. P. (1994). Backpropagation without multiplication. In Advances in Neural Information Processing Systems, pages 232–239. 454

Simard, P., Victorri, B., LeCun, Y., and Denker, J. (1992). Tangent prop - A formalism for specifying selected invariances in an adaptive network. In NIPS'1991. 269, 271, 357

Simard, P. Y., LeCun, Y., and Denker, J. (1993). Efficient pattern recognition using a new transformation distance. In NIPS'92. 269

Simard, P. Y., LeCun, Y. A., Denker, J. S., and Victorri, B. (1998). Transformation invariance in pattern recognition — tangent distance and tangent propagation. Lecture Notes in Computer Science, 1524. 269

Simons, D. J. and Levin, D. T. (1998). Failure to detect changes to people during a real-world interaction. Psychonomic Bulletin & Review, 5(4), 644–649. 546

Simonyan, K. and Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In ICLR. 324
BIBLIOGRAPHY
Sjöberg, J. and Ljung, L. (1995). Overtraining, regularization and searching for a minimum, with application to neural networks. International Journal of Control, 62(6), 1391–1407. 249

Skinner, B. F. (1958). Reinforcement today. American Psychologist, 13, 94–99. 329

Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony theory. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing, volume 1, chapter 6, pages 194–281. MIT Press, Cambridge. 574, 590, 658

Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. In NIPS’2012. 439

Socher, R., Huang, E. H., Pennington, J., Ng, A. Y., and Manning, C. D. (2011a). Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In NIPS’2011. 401, 403

Socher, R., Manning, C., and Ng, A. Y. (2011b).
Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the Twenty-Eighth International Conference on Machine Learning (ICML’2011). 401

Socher, R., Pennington, J., Huang, E. H., Ng, A. Y., and Manning, C. D. (2011c). Semi-supervised recursive autoencoders for predicting sentiment distributions. In EMNLP’2011. 401

Socher, R., Perelygin, A., Wu, J. Y., Chuang, J., Manning, C. D., Ng, A. Y., and Potts, C. (2013a). Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP’2013. 401, 403

Socher, R., Ganjoo, M., Manning, C. D., and Ng, A. Y. (2013b). Zero-shot learning through cross-modal transfer. In 27th Annual Conference on Neural Information Processing Systems (NIPS 2013). 542

Sohl-Dickstein, J., Weiss, E. A., Maheswaranathan, N., and Ganguli, S. (2015). Deep unsupervised learning using nonequilibrium thermodynamics.
717, 718

Sohn, K., Zhou, G., and Lee, H. (2013). Learning and selecting features jointly with point-wise gated Boltzmann machines. In ICML’2013. 689

Solomonoff, R. J. (1989). A system for incremental learning based on algorithmic probability. 329

Sontag, E. D. (1998). VC dimension of neural networks. NATO ASI Series F Computer and Systems Sciences, 168, 69–96. 550, 554

Sontag, E. D. and Sussman, H. J. (1989). Backpropagation can give rise to spurious local minima even for networks without hidden layers. Complex Systems, 3, 91–106. 284
Sparkes, B. (1996). The Red and the Black: Studies in Greek Pottery. Routledge. 1

Spitkovsky, V. I., Alshawi, H., and Jurafsky, D. (2010). From baby steps to leapfrog: how “less is more” in unsupervised dependency parsing. In HLT’10. 329

Squire, W. and Trapp, G. (1998). Using complex variables to estimate derivatives of real functions. SIAM Rev., 40(1), 110–112. 442

Srebro, N. and Shraibman, A. (2005). Rank, trace-norm and max-norm. In Proceedings of the 18th Annual Conference on Learning Theory, pages 545–560. Springer-Verlag. 239

Srivastava, N. (2013). Improving Neural Networks With Dropout. Master’s thesis, U. Toronto. 538

Srivastava, N. and Salakhutdinov, R. (2012). Multimodal learning with deep Boltzmann machines. In NIPS’2012. 544

Srivastava, N., Salakhutdinov, R. R., and Hinton, G. E. (2013). Modeling documents with deep Boltzmann machines.
arXiv preprint arXiv:1309.6865. 665

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15, 1929–1958. 257, 263, 264, 265, 674

Srivastava, R. K., Greff, K., and Schmidhuber, J. (2015). Highway networks. arXiv:1505.00387. 327

Steinkrau, D., Simard, P. Y., and Buck, I. (2005). Using GPUs for machine learning algorithms. 2013 12th International Conference on Document Analysis and Recognition, 0, 1115–1119. 448

Stoyanov, V., Ropson, A., and Eisner, J. (2011). Empirical risk minimization of graphical model parameters given approximate inference, decoding, and model structure. In
In (AIST (AISTA TS) TS),, volumegiv 15en of approximate JMLR Workshop and Confer Conferenc ence e Pr Pro oceemo dings dings, 725–733, Pr oceLauderdale. edings of the Supplemen 14th International Confer(4 enc e on Artificial Intel ligenc and Statistics F ort Supplementary tary material pages) also av available. ailable. 676,e 700 (AISTATS), volume 15 of JMLR Workshop and Conference Proceedings , pages 725–733, Sukh Sukhbaatar, baatar, S., Szlam, A., Weston, and Fergus, R. (2015). eakly sup supervised Fort Lauderdale. Supplemen tary J., material (4 pages) also avW ailable. 676ervised , 700 memory net netw works. arXiv pr preprint eprint arXiv:1503.08895 . 421 Sukhbaatar, S., Szlam, A., Weston, J., and Fergus, R. (2015). Weakly supervised memory preprint arXiv:1503.08895 Supancic, J. arXiv and Ramanan, D. (2013). Self-paced tracking. king. In networks. . 421 learning for long-term trac CVPR’2013 . 329 Supancic, J. and Ramanan, D. (2013). Self-paced learning for long-term tracking. In CVPR’2013 Sussillo, D. (2014). feed-forward ard net netw works . 329 Random walks: Training very deep nonlinear feed-forw with smart initialization. CoRR, abs/1412.6558. 290, 303, 305, 405 Sussillo, D. (2014). Random walks: Training very deep nonlinear feed-forward networks CoRR Sutsk Sutskev ev ever, er, I. (2012). Training Recurr current ent Neur Neural al Networks Networks. Ph.D. with smart initialization. , abs/1412.6558 . 290, .303 , 305thesis, , 405 Department of computer science, Univ Universit ersit ersity y of Toronto. 408, 415 Sutskever, I. (2012). Training Recurrent Neural Networks . Ph.D. thesis, Department of computer science, University of Toronto.772 408, 415
Sutskever, I. and Hinton, G. E. (2008). Deep narrow sigmoid belief networks are universal approximators. Neural Computation, 20(11), 2629–2636. 695

Sutskever, I. and Tieleman, T. (2010). On the Convergence Properties of Contrastive Divergence. In Y. W. Teh and M. Titterington, editors, Proc. of the International Conference on Artificial Intelligence and Statistics (AISTATS), volume 9, pages 789–795. 615

Sutskever, I., Hinton, G., and Taylor, G. (2009). The recurrent temporal restricted Boltzmann machine. In NIPS’2008. 688

Sutskever, I., Martens, J., and Hinton, G. E. (2011). Generating text with recurrent neural networks. In ICML’2011, pages 1017–1024. 480

Sutskever, I., Martens, J., Dahl, G., and Hinton, G. (2013). On the importance of initialization and momentum in deep learning. In ICML. 300, 408, 415

Sutskever, I., Vinyals, O., and Le, Q. V.
(2014). Sequence to sequence learning with neural networks. In NIPS’2014, arXiv:1409.3215. 25, 101, 397, 411, 414, 477, 478

Sutton, R. and Barto, A. (1998). Reinforcement Learning: An Introduction. MIT Press. 106

Sutton, R. S., Mcallester, D., Singh, S., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In NIPS’1999, pages 1057–1063. MIT Press. 693

Swersky, K., Ranzato, M., Buchman, D., Marlin, B., and de Freitas, N. (2011). On autoencoders and score matching for energy based models. In ICML’2011. ACM. 516

Swersky, K., Snoek, J., and Adams, R. P. (2014). Freeze-thaw Bayesian optimization. arXiv preprint arXiv:1406.3896. 439

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2014a). Going deeper with convolutions. Technical report, arXiv:1409.4842.
24, 27, 200, 257, 267, 327, 348

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and Fergus, R. (2014b). Intriguing properties of neural networks. ICLR, abs/1312.6199. 267, 268, 270

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2015). Rethinking the Inception Architecture for Computer Vision. ArXiv e-prints. 244, 323

Taigman, Y., Yang, M., Ranzato, M., and Wolf, L. (2014). DeepFace: Closing the gap to human-level performance in face verification. In CVPR’2014. 100

Tandy, D. W. (1997). Works and Days: A Translation and Commentary for the Social Sciences. University of California Press. 1
Tang, Y. and Eliasmith, C. (2010). Deep networks for robust visual recognition. In Proceedings of the 27th International Conference on Machine Learning, June 21-24, 2010, Haifa, Israel. 241

Tang, Y., Salakhutdinov, R., and Hinton, G. (2012). Deep mixtures of factor analysers. arXiv preprint arXiv:1206.4635. 492

Taylor, G. and Hinton, G. (2009). Factored conditional restricted Boltzmann machines for modeling motion style. In L. Bottou and M. Littman, editors, Proceedings of the Twenty-sixth International Conference on Machine Learning (ICML’09), pages 1025–1032, Montreal, Quebec, Canada. ACM. 688

Taylor, G., Hinton, G. E., and Roweis, S. (2007). Modeling human motion using binary latent variables. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19 (NIPS’06), pages 1345–1352.
MIT Press, Cambridge, MA. 687

Teh, Y., Welling, M., Osindero, S., and Hinton, G. E. (2003). Energy-based models for sparse overcomplete representations. Journal of Machine Learning Research, 4, 1235–1260. 494

Tenenbaum, J., de Silva, V., and Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2319–2323. 163, 521, 536

Theis, L., van den Oord, A., and Bethge, M. (2015). A note on the evaluation of generative models. arXiv:1511.01844. 699, 721

Thompson, J., Jain, A., LeCun, Y., and Bregler, C. (2014). Joint training of a convolutional network and a graphical model for human pose estimation. In NIPS’2014. 361

Thrun, S. (1995). Learning to play the game of chess. In NIPS’1994. 269

Tibshirani, R. J. (1995). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B, 58, 267–288. 236

Tieleman, T. (2008). Training restricted Boltzmann machines using approximations to
the likelihood gradient. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, Proceedings of the Twenty-fifth International Conference on Machine Learning (ICML’08), pages 1064–1071. ACM. 615

Tieleman, T. and Hinton, G. (2009). Using fast weights to improve persistent contrastive divergence. In L. Bottou and M. Littman, editors, Proceedings of the Twenty-sixth International Conference on Machine Learning (ICML’09), pages 1033–1040. ACM. 617

Tipping, M. E. and Bishop, C. M. (1999). Probabilistic principal components analysis. Journal of the Royal Statistical Society B, 61(3), 611–622. 494
Torralba, A., Fergus, R., and Weiss, Y. (2008). Small codes and large databases for recognition. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR’08), pages 1–8. 528

Touretzky, D. S. and Minton, G. E. (1985). Symbols among the neurons: Details of a connectionist inference architecture. In Proceedings of the 9th International Joint Conference on Artificial Intelligence - Volume 1, IJCAI’85, pages 238–243, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc. 17

Tu, K. and Honavar, V. (2011). On the utility of curricula in unsupervised learning of probabilistic grammars. In IJCAI’2011. 329

Turaga, S. C., Murray, J. F., Jain, V., Roth, F., Helmstaedter, M., Briggman, K., Denk, W., and Seung, H. S. (2010). Convolutional networks can learn to generate affinity graphs for image segmentation. Neural Computation, 22(2), 511–538. 360

Turian, J., Ratinov, L., and Bengio, Y. (2010). Word representations: A simple and
general method for semi-supervised learning. In Proc. ACL’2010, pages 384–394. 538

Töscher, A., Jahrer, M., and Bell, R. M. (2009). The BigChaos solution to the Netflix grand prize. 482

Uria, B., Murray, I., and Larochelle, H. (2013). RNADE: The real-valued neural autoregressive density-estimator. In NIPS’2013. 711

van den Oörd, A., Dieleman, S., and Schrauwen, B. (2013). Deep content-based music recommendation. In NIPS’2013. 483

van der Maaten, L. and Hinton, G. E. (2008). Visualizing data using t-SNE. J. Machine Learning Res., 9. 480, 522

Vanhoucke, V., Senior, A., and Mao, M. Z. (2011). Improving the speed of neural networks on CPUs. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop. 447, 455

Vapnik, V. N. (1982). Estimation of Dependences Based on Empirical Data. Springer-Verlag, Berlin. 114

Vapnik, V. N. (1995). The
Nature of Statistical Learning Theory. Springer, New York. 114

Vapnik, V. N. and Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 16, 264–280. 114

Vincent, P. (2011). A connection between score matching and denoising autoencoders. Neural Computation, 23(7). 516, 518, 714
Vincent, P. and Bengio, Y. (2003). Manifold Parzen windows. In NIPS’2002. MIT Press. 523

Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. In ICML 2008. 241, 518

Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Machine Learning Res., 11. 518

Vincent, P., de Brébisson, A., and Bouthillier, X. (2015). Efficient exact gradient update for training deep networks with very large sparse targets. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1108–1116. Curran Associates, Inc. 468

Vinyals, O., Kaiser, L., Koo, T., Petrov, S., Sutskever, I., and Hinton, G. (2014a).
Grammar as a foreign language. Technical report, arXiv:1412.7449. 411

Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2014b). Show and tell: a neural image caption generator. arXiv 1411.4555. 411

Vinyals, O., Fortunato, M., and Jaitly, N. (2015a). Pointer networks. arXiv preprint arXiv:1506.03134. 421

Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015b). Show and tell: a neural image caption generator. In CVPR’2015. arXiv:1411.4555. 102

Viola, P. and Jones, M. (2001). Robust real-time object detection. In International Journal of Computer Vision. 452

Visin, F., Kastner, K., Cho, K., Matteucci, M., Courville, A., and Bengio, Y. (2015). ReNet: A recurrent neural network based alternative to convolutional networks. arXiv preprint arXiv:1505.00393. 397

Von Melchner, L., Pallas, S. L., and Sur, M. (2000). Visual behaviour mediated by retinal projections directed to the auditory pathway. Nature, 404(6780), 871–876. 16
Wager, S., Wang, S., and Liang, P. (2013). Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems 26, pages 351–359. 264

Waibel, A., Hanazawa, T., Hinton, G. E., Shikano, K., and Lang, K. (1989). Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37, 328–339. 375, 456, 462

Wan, L., Zeiler, M., Zhang, S., LeCun, Y., and Fergus, R. (2013). Regularization of neural networks using dropconnect. In ICML’2013. 265

Wang, S. and Manning, C. (2013). Fast dropout training. In ICML’2013. 264
BIBLIOGRAPHY
Wang, Z., Zhang, J., Feng, J., and Chen, Z. (2014a). Knowledge graph and text jointly embedding. In Proc. EMNLP’2014. 487

Wang, Z., Zhang, J., Feng, J., and Chen, Z. (2014b). Knowledge graph embedding by translating on hyperplanes. In Proc. AAAI’2014. 487

Warde-Farley, D., Goodfellow, I. J., Courville, A., and Bengio, Y. (2014). An empirical analysis of dropout in piecewise linear networks. In ICLR’2014. 261, 265, 266

Wawrzynek, J., Asanovic, K., Kingsbury, B., Johnson, D., Beck, J., and Morgan, N. (1996). Spert-II: A vector microprocessor system. Computer, 29(3), 79–86. 454

Weaver, L. and Tao, N. (2001). The optimal reward baseline for gradient-based reinforcement learning. In Proc. UAI’2001, pages 538–545. 693

Weinberger, K. Q. and Saul, L. K. (2004). Unsupervised learning of image manifolds by semidefinite programming. In CVPR’2004, pages 988–995. 163, 522

Weiss, Y., Torralba, A., and Fergus, R. (2008). Spectral hashing. In NIPS, pages 1753–1760. 528

Welling, M., Zemel, R. S., and Hinton, G. E. (2002). Self supervised boosting. In Advances in Neural Information Processing Systems, pages 665–672. 705

Welling, M., Hinton, G. E., and Osindero, S. (2003a). Learning sparse topographic representations with products of Student-t distributions. In NIPS’2002. 682

Welling, M., Zemel, R., and Hinton, G. E. (2003b). Self-supervised boosting. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15 (NIPS’02), pages 665–672. MIT Press. 626

Welling, M., Rosen-Zvi, M., and Hinton, G. E. (2005). Exponential family harmoniums with an application to information retrieval. In L. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17 (NIPS’04), volume 17, Cambridge, MA. MIT Press. 678

Werbos, P. J. (1981). Applications of advances in nonlinear sensitivity analysis. In Proceedings of the 10th IFIP Conference, 31.8 - 4.9, NYC, pages 762–770. 225

Weston, J., Bengio, S., and Usunier, N. (2010). Large scale image annotation: learning to rank with joint word-image embeddings. Machine Learning, 81(1), 21–35. 403

Weston, J., Chopra, S., and Bordes, A. (2014). Memory networks. arXiv preprint arXiv:1410.3916. 421, 488

Widrow, B. and Hoff, M. E. (1960). Adaptive switching circuits. In 1960 IRE WESCON Convention Record, volume 4, pages 96–104. IRE, New York. 15, 21, 24, 27
Wikipedia (2015). List of animals by number of neurons — Wikipedia, the free encyclopedia. [Online; accessed 4-March-2015]. 24, 27

Williams, C. K. I. and Agakov, F. V. (2002). Products of Gaussians and Probabilistic Minor Component Analysis. Neural Computation, 14(5), 1169–1182. 684

Williams, C. K. I. and Rasmussen, C. E. (1996). Gaussian processes for regression. In D. Touretzky, M. Mozer, and M. Hasselmo, editors, Advances in Neural Information Processing Systems 8 (NIPS’95), pages 514–520. MIT Press, Cambridge, MA. 142

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229–256. 690, 691

Williams, R. J. and Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1, 270–280. 222

Wilson, D. R. and Martinez, T. R. (2003). The general inefficiency of batch training for gradient descent learning. Neural Networks, 16(10), 1429–1451. 279

Wilson, J. R. (1984). Variance reduction techniques for digital simulation. American Journal of Mathematical and Management Sciences, 4(3), 277–312. 692

Wiskott, L. and Sejnowski, T. J. (2002). Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4), 715–770. 497

Wolpert, D. and MacReady, W. (1997). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1, 67–82. 293

Wolpert, D. H. (1996). The lack of a priori distinction between learning algorithms. Neural Computation, 8(7), 1341–1390. 116

Wu, R., Yan, S., Shan, Y., Dang, Q., and Sun, G. (2015). Deep image: Scaling up image recognition. arXiv:1501.02876. 450

Wu, Z. (1997). Global continuation for distance geometry problems. SIAM Journal of Optimization, 7, 814–836. 328

Xiong, H. Y., Barash, Y., and Frey, B. J. (2011). Bayesian prediction of tissue-regulated splicing using RNA sequence and cellular context. Bioinformatics, 27(18), 2554–2562. 264

Xu, K., Ba, J. L., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R. S., and Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In ICML’2015, arXiv:1502.03044. 102, 411, 693

Yildiz, I. B., Jaeger, H., and Kiebel, S. J. (2012). Re-visiting the echo state property. Neural networks, 35, 1–9. 407
Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. (2014). How transferable are features in deep neural networks? In NIPS’2014. 324, 539

Younes, L. (1998). On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates. In Stochastics and Stochastics Models, pages 177–228. 615

Yu, D., Wang, S., and Deng, L. (2010). Sequential labeling using deep-structured conditional random fields. IEEE Journal of Selected Topics in Signal Processing. 324

Zaremba, W. and Sutskever, I. (2014). Learning to execute. arXiv 1410.4615. 330

Zaremba, W. and Sutskever, I. (2015). Reinforcement learning neural Turing machines. arXiv:1505.00521. 422

Zaslavsky, T. (1975). Facing Up to Arrangements: Face-Count Formulas for Partitions of Space by Hyperplanes. Number no. 154 in Memoirs of the American Mathematical Society. American Mathematical Society. 553

Zeiler, M. D. and Fergus, R. (2014). Visualizing and understanding convolutional networks. In ECCV’14. 6

Zeiler, M. D., Ranzato, M., Monga, R., Mao, M., Yang, K., Le, Q., Nguyen, P., Senior, A., Vanhoucke, V., Dean, J., and Hinton, G. E. (2013). On rectified linear units for speech processing. In ICASSP 2013. 462

Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. (2015). Object detectors emerge in deep scene CNNs. ICLR’2015, arXiv:1412.6856. 554

Zhou, J. and Troyanskaya, O. G. (2014). Deep supervised and convolutional generative stochastic network for protein secondary structure prediction. In ICML’2014. 717

Zhou, Y. and Chellappa, R. (1988). Computation of optical flow using a neural network. In Neural Networks, 1988., IEEE International Conference on, pages 71–78. IEEE. 340

Zöhrer, M. and Pernkopf, F. (2014). General stochastic networks for classification. In NIPS’2014. 717
Index
0-1 loss, 104, 276
Absolute value rectification, 192
Accuracy, 426
Activation function, 170
Active constraint, 95
AdaGrad, 307
ADALINE, see adaptive linear element
Adam, 308, 428
Adaptive linear element, 15, 24, 27
Adversarial example, 267
Adversarial training, 268, 270, 533
Affine, 110
AIS, see annealed importance sampling
Almost everywhere, 71
Almost sure convergence, 130
Ancestral sampling, 583, 598
ANN, see Artificial neural network
Annealed importance sampling, 628, 670, 719
Approximate Bayesian computation, 718
Approximate inference, 586
Artificial intelligence, 1
Artificial neural network, see Neural network
ASR, see automatic speech recognition
Asymptotically unbiased, 124
Audio, 102, 361, 461
Autoencoder, 4, 357, 505
Automatic speech recognition, 461
Back-propagation, 203
Back-propagation through time, 385
Backprop, see back-propagation
Bag of words, 474
Bagging, 255
Batch normalization, 266, 428
Bayes error, 117
Bayes’ rule, 70
Bayesian hyperparameter optimization, 439
Bayesian network, see directed graphical model
Bayesian probability, 55
Bayesian statistics, 135
Belief network, see directed graphical model
Bernoulli distribution, 62
BFGS, 316
Bias, 124, 229
Bias parameter, 110
Biased importance sampling, 596
Bigram, 465
Binary relation, 485
Block Gibbs sampling, 602
Boltzmann distribution, 573
Boltzmann machine, 573, 656
BPTT, see back-propagation through time
Broadcasting, 34
Burn-in, 600
CAE, see contractive autoencoder
Calculus of variations, 179
Categorical distribution, see multinoulli distribution
CD, see contrastive divergence
Centering trick (DBM), 675
Central limit theorem, 63
Chain rule (calculus), 206
Chain rule of probability, 59
Chess, 2
Chord, 582
Chordal graph, 582
Class-based language models, 466
Classical dynamical system, 376
Classification, 100
Clique potential, see factor (graphical model)
CNN, see convolutional neural network
Collaborative Filtering, 481
Collider, see explaining away
Color images, 361
Complex cell, 366
Computational graph, 204
Computer vision, 455
Concept drift, 541
Condition number, 279
Conditional computation, see dynamic structure
Conditional independence, xiii, 60
Conditional probability, 59
Conditional RBM, 687
Connectionism, 17, 446
Connectionist temporal classification, 463
Consistency, 130, 516
Constrained optimization, 93, 237
Content-based addressing, 422
Content-based recommender systems, 483
Context-specific independence, 576
Contextual bandits, 483
Continuation methods, 328
Contractive autoencoder, 524
Contrast, 457
Contrastive divergence, 291, 613, 674
Convex optimization, 141
Convolution, 331, 685
Convolutional network, 16
Convolutional neural network, 252, 331, 428, 463
Coordinate descent, 322, 673
Correlation, 61
Cost function, see objective function
Covariance, xiii, 61
Covariance matrix, 62
Coverage, 427
Critical temperature, 606
Cross-correlation, 333
Cross-entropy, 75, 132
Cross-validation, 122
CTC, see connectionist temporal classification
Curriculum learning, 329
Curse of dimensionality, 154
Cyc, 2
D-separation, 575
DAE, see denoising autoencoder
Data generating distribution, 111, 131
Data generating process, 111
Data parallelism, 450
Dataset, 105
Dataset augmentation, 270, 460
DBM, see deep Boltzmann machine
DCGAN, 554, 555, 703
Decision tree, 145, 551
Decoder, 4
Deep belief network, 27, 532, 634, 659, 662, 686, 694
Deep Blue, 2
Deep Boltzmann machine, 24, 27, 532, 634, 655, 659, 665, 674, 686
Deep feedforward network, 167, 428
Deep learning, 2, 5
Denoising autoencoder, 513, 691
Denoising score matching, 622
Density estimation, 103
Derivative, xiii, 83
Design matrix, 106
Detector layer, 340
Determinant, xii
Diagonal matrix, 41
Differential entropy, 74, 649
Dirac delta function, 65
Directed graphical model, 77, 510, 566, 694
Directional derivative, 85
Discriminative fine-tuning, see supervised fine-tuning
Discriminative RBM, 688
Distributed representation, 17, 150, 549
Domain adaptation, 539
Dot product, 34, 141
Double backprop, 270
Doubly block circulant matrix, 334
Dream sleep, 612, 655
DropConnect, 265
Dropout, 257, 428, 433, 434, 674, 691
Dynamic structure, 451, 452
E-step, 637
Early stopping, 246, 249, 272, 273, 428
EBM, see energy-based model
Echo state network, 24, 27, 406
Effective capacity, 114
Eigendecomposition, 42
Eigenvalue, 42
Eigenvector, 42
ELBO, see evidence lower bound
Element-wise product, see Hadamard product
EM, see expectation maximization
Embedding, 519
Empirical distribution, 66
Empirical risk, 276
Empirical risk minimization, 276
Encoder, 4
Energy function, 572
Energy-based model, 572, 598, 656, 665
Ensemble methods, 255
Epoch, 247
Equality constraint, 94
Equivariance, 339
Error function, see objective function
ESN, see echo state network
Euclidean norm, 39
Euler-Lagrange equation, 649
Evidence lower bound, 636, 663
Example, 99
Expectation, 60
Expectation maximization, 637
Expected value, see expectation
Explaining away, 577, 634, 647
Exploitation, 484
Exploration, 484
Exponential distribution, 65
F-score, 426
Factor (graphical model), 570
Factor analysis, 493
Factor graph, 582
Factors of variation, 4
Feature, 99
Feature selection, 236
Feedforward neural network, 167
Fine-tuning, 324
Finite differences, 442
Forget gate, 306
Forward propagation, 203
Fourier transform, 361, 363
Fovea, 367
FPCD, 617
Free energy, 574, 682
Freebase, 486
Frequentist probability, 55
Frequentist statistics, 135
Frobenius norm, 46
Fully-visible Bayes network, 707
Functional derivatives, 648
FVBN, see fully-visible Bayes network
Gabor function, 369
GANs, see generative adversarial networks
Gated recurrent unit, 428
Gaussian distribution, see normal distribution
Gaussian kernel, 142
Gaussian mixture, 67, 188
GCN, see global contrast normalization
GeneOntology, 486
Generalization, 110
Generalized Lagrange function, see generalized Lagrangian
Generalized Lagrangian, 94
Generative adversarial networks, 691, 702
Generative moment matching networks, 705
Generator network, 695
Gibbs distribution, 571
Gibbs sampling, 584, 602
Global contrast normalization, 457
GPU, see graphics processing unit
Gradient, 84
Gradient clipping, 289, 417
Gradient descent, 83, 85
Graph, xii
Graphical model, see structured probabilistic model
Graphics processing unit, 447
Greedy algorithm, 324
Greedy layer-wise unsupervised pretraining, 531
Greedy supervised pretraining, 324
Grid search, 435
Hadamard product, xii, 34
Hard tanh, 196
Harmonium, see restricted Boltzmann machine
Harmony theory, 574
Helmholtz free energy, see evidence lower bound
Hessian, 223
Hessian matrix, xiii, 87
Heteroscedastic, 187
Hidden layer, 6, 167
Hill climbing, 86
Hyperparameter optimization, 435
Hyperparameters, 120, 433
Hypothesis space, 112, 118
i.i.d. assumptions, 111, 122, 267
Identity matrix, 36
ILSVRC, see ImageNet Large-Scale Visual Recognition Challenge
ImageNet Large-Scale Visual Recognition Challenge, 23
Immorality, 580
Importance sampling, 595, 627, 700
Importance weighted autoencoder, 700
Independence, xiii, 60
Independent and identically distributed, see i.i.d. assumptions
Independent component analysis, 494
Independent subspace analysis, 496
Inequality constraint, 94
Inference, 565, 586, 634, 636, 638, 641, 651, 653
Information retrieval, 528
Initialization, 301
Integral, xiii
Invariance, 343
Isotropic, 65
Jacobian matrix, xiii, 72, 86
Joint probability, 57
k-means, 365, 549
k-nearest neighbors, 143, 551
Karush-Kuhn-Tucker conditions, 95, 237
Karush–Kuhn–Tucker, 94
Kernel (convolution), 332, 333
Kernel machine, 551
Kernel trick, 141
KKT, see Karush–Kuhn–Tucker conditions
KKT conditions, see Karush-Kuhn-Tucker conditions
KL divergence, see Kullback-Leibler divergence
Knowledge base, 2, 486
Krylov methods, 224
Kullback-Leibler divergence, xiii, 74
Label smoothing, 243
Lagrange multipliers, 94, 649
Lagrangian, see generalized Lagrangian
LAPGAN, 704
Laplace distribution, 65, 499, 500
Latent variable, 67
Layer (neural network), 167
LCN, see local contrast normalization
Leaky ReLU, 192
Leaky units, 409
Learning rate, 85
Line search, 85, 86, 93
Linear combination, 37
Linear dependence, 38
Linear factor models, 492
Linear regression, 107, 110, 140
Link prediction, 487
Lipschitz constant, 92
Lipschitz continuous, 92
Liquid state machine, 406
INDEX
Local conditional probability distribution, 567
Local contrast normalization, 459
Logistic regression, 3, 140
Logistic sigmoid, 7, 67
Long short-term memory, 18, 25, 306, 411, 428
Loop, 582
Loopy belief propagation, 588
Loss function, see objective function
L^p norm, 39
LSTM, see long short-term memory
M-step, 637
Machine learning, 2
Machine translation, 101
Main diagonal, 33
Manifold, 160
Manifold hypothesis, 161
Manifold learning, 161
Manifold tangent classifier, 270
MAP approximation, 138, 508
Marginal probability, 58
Markov chain, 598
Markov chain Monte Carlo, 598
Markov network, see undirected model
Markov random field, see undirected model
Matrix, xi, xii, 32
Matrix inverse, 36
Matrix product, xii, 34
Max norm, 40
Max pooling, 340
Maximum likelihood, 131
Maxout, 192, 428
MCMC, see Markov chain Monte Carlo
Mean field, 641, 642, 674
Mean squared error, 108
Measure theory, 71
Measure zero, 71
Memory network, 419, 421
Method of steepest descent, see gradient descent
Minibatch, 279
Missing inputs, 100
Mixing (Markov chain), 604
Mixture density networks, 188
Mixture distribution, 66
Mixture model, 188, 513
Mixture of experts, 453, 551
MLP, see multilayer perceptron
MNIST, 21, 22, 674
Model averaging, 255
Model compression, 451
Model identifiability, 284
Model parallelism, 450
Moment matching, 705
Moore-Penrose pseudoinverse, 45, 240
Moralized graph, 580
MP-DBM, see multi-prediction DBM
MRF (Markov Random Field), see undirected model
MSE, see mean squared error
Multi-modal learning, 542
Multi-prediction DBM, 676
Multi-task learning, 245, 541
Multilayer perceptron, 5, 27
Multinomial distribution, 62
Multinoulli distribution, 62
n-gram, 464
NADE, 710
Naive Bayes, 3
Nat, 73
Natural image, 562
Natural language processing, 464
Nearest neighbor regression, 115
Negative definite, 89
Negative phase, 473, 609, 611
Neocognitron, 16, 24, 27, 368
Nesterov momentum, 300
Netflix Grand Prize, 256, 482
Neural language model, 466, 479
Neural network, 13
Neural Turing machine, 421
Neuroscience, 15
Newton’s method, 89, 310
NLM, see neural language model
NLP, see natural language processing
No free lunch theorem, 116
Noise-contrastive estimation, 623
Non-parametric model, 114
Norm, xiv, 39
Normal distribution, 63, 64, 125
Normal equations, 109, 112, 234
Normalized initialization, 303
Numerical differentiation, see finite differences
Object detection, 456
Object recognition, 456
Objective function, 82
OMP-k, see orthogonal matching pursuit
One-shot learning, 541
Operation, 204
Optimization, 80, 82
Orthodox statistics, see frequentist statistics
Orthogonal matching pursuit, 27, 254
Orthogonal matrix, 42
Orthogonality, 41
Output layer, 167
Parallel distributed processing, 17
Parameter initialization, 301, 408
Parameter sharing, 251, 336, 374, 376, 389
Parameter tying, see parameter sharing
Parametric model, 114
Parametric ReLU, 192
Partial derivative, 84
Partition function, 571, 608, 671
PCA, see principal components analysis
PCD, see stochastic maximum likelihood
Perceptron, 15, 27
Persistent contrastive divergence, see stochastic maximum likelihood
Perturbation analysis, see reparametrization trick
Point estimator, 122
Policy, 483
Pooling, 331, 685
Positive definite, 89
Positive phase, 473, 609, 611, 658, 670
Precision, 426
Precision (of a normal distribution), 63, 65
Predictive sparse decomposition, 526
Preprocessing, 456
Pretraining, 324, 531
Primary visual cortex, 366
Principal components analysis, 48, 146–148, 493, 634
Prior probability distribution, 135
Probabilistic max pooling, 685
Probabilistic PCA, 493, 494, 635
Probability density function, 58
Probability distribution, 56
Probability mass function, 56
Probability mass function estimation, 103
Product of experts, 573
Product rule of probability, see chain rule of probability
PSD, see predictive sparse decomposition
Pseudolikelihood, 618
Quadrature pair, 370
Quasi-Newton condition, 316
Quasi-Newton methods, 316
Radial basis function, 196
Random search, 437
Random variable, 56
Ratio matching, 621
RBF, 196
RBM, see restricted Boltzmann machine
Recall, 426
Receptive field, 338
Recommender systems, 481
Rectified linear unit, 171, 192, 428, 510
Recurrent network, 27
Recurrent neural network, 379
Regression, 101
Regularization, 120, 177, 228, 433
Regularizer, 119
REINFORCE, 691
Reinforcement learning, 25, 106, 483, 691
Relational database, 486
Reparametrization trick, 690
Representation learning, 3
Representational capacity, 114
Restricted Boltzmann machine, 357, 462, 482, 590, 634, 658, 659, 674, 678, 680, 683, 685
Ridge regression, see weight decay
Risk, 275
RNN-RBM, 688
Saddle points, 285
Sample mean, 125
Scalar, xi, xii, 31
Score matching, 516, 620
Secant condition, 316
Second derivative, 86
Second derivative test, 89
Self-information, 73
Semantic hashing, 528
Semi-supervised learning, 244
Separable convolution, 363
Separation (probabilistic modeling), 575
Set, xii
SGD, see stochastic gradient descent
Shannon entropy, xiii, 73
Shortlist, 469
Sigmoid, xiv, see logistic sigmoid
Sigmoid belief network, 27
Simple cell, 366
Singular value, see singular value decomposition
Singular value decomposition, 44, 148, 482
Singular vector, see singular value decomposition
Slow feature analysis, 496
SML, see stochastic maximum likelihood
Softmax, 183, 421, 453
Softplus, xiv, 68, 196
Spam detection, 3
Sparse coding, 322, 357, 499, 634, 694
Sparse initialization, 304, 408
Sparse representation, 146, 226, 253, 508, 559
Spearmint, 439
Spectral radius, 407
Speech recognition, see automatic speech recognition
Sphering, see whitening
Spike and slab restricted Boltzmann machine, 683
SPN, see sum-product network
Square matrix, 38
ssRBM, see spike and slab restricted Boltzmann machine
Standard deviation, 61
Standard error, 127
Standard error of the mean, 128, 278
Statistic, 122
Statistical learning theory, 110
Steepest descent, see gradient descent
Stochastic back-propagation, see reparametrization trick
Stochastic gradient descent, 15, 150, 279, 294, 674
Stochastic maximum likelihood, 615, 674
Stochastic pooling, 265
Structure learning, 585
Structured output, 101, 687
Structured probabilistic model, 77, 561
Sum rule of probability, 58
Sum-product network, 556
Supervised fine-tuning, 532, 664
Supervised learning, 105
Support vector machine, 140
Surrogate loss function, 276
SVD, see singular value decomposition
Symmetric matrix, 41, 43
Tangent distance, 269
Tangent plane, 519
Tangent prop, 269
TDNN, see time-delay neural network
Teacher forcing, 383, 384
Tempering, 606
Template matching, 141
Tensor, xi, xii, 33
Test set, 110
Tikhonov regularization, see weight decay
Tiled convolution, 353
Time-delay neural network, 369, 375
Toeplitz matrix, 334
Topographic ICA, 496
Trace operator, 46
Training error, 110
Transcription, 101
Transfer learning, 539
Transpose, xii, 33
Triangle inequality, 39
Triangulated graph, see chordal graph
Trigram, 465
Unbiased, 124
Undirected graphical model, 77, 510
Undirected model, 569
Uniform distribution, 57
Unigram, 465
Unit norm, 41
Unit vector, 41
Universal approximation theorem, 197
Universal approximator, 556
Unnormalized probability distribution, 570
Unsupervised learning, 105, 146
Unsupervised pretraining, 462, 531
V-structure, see explaining away
V1, 366
VAE, see variational autoencoder
Vapnik-Chervonenkis dimension, 114
Variance, xiii, 61, 229
Variational autoencoder, 691, 698
Variational derivatives, see functional derivatives
Variational free energy, see evidence lower bound
VC dimension, see Vapnik-Chervonenkis dimension
Vector, xi, xii, 32
Virtual adversarial examples, 268
Visible layer, 6
Volumetric data, 361
Wake-sleep, 654, 663
Weight decay, 118, 177, 231, 434
Weight space symmetry, 284
Weights, 15, 107
Whitening, 458
Wikibase, 486
Word embedding, 467
Word-sense disambiguation, 487
WordNet, 486
Zero-data learning, see zero-shot learning
Zero-shot learning, 541