Deep Learning

Ian Goodfellow, Yoshua Bengio and Aaron Courville
Contents

Website vii

Acknowledgments viii

Notation xi

1 Introduction 1
   1.1 Who Should Read This Book? 8
   1.2 Historical Trends in Deep Learning 11

I Applied Math and Machine Learning Basics 29

2 Linear Algebra 31
   2.1 Scalars, Vectors, Matrices and Tensors 31
   2.2 Multiplying Matrices and Vectors 34
   2.3 Identity and Inverse Matrices 36
   2.4 Linear Dependence and Span 37
   2.5 Norms 39
   2.6 Special Kinds of Matrices and Vectors 40
   2.7 Eigendecomposition 42
   2.8 Singular Value Decomposition 44
   2.9 The Moore-Penrose Pseudoinverse 45
   2.10 The Trace Operator 46
   2.11 The Determinant 47
   2.12 Example: Principal Components Analysis 48

3 Probability and Information Theory 53
   3.1 Why Probability? 54
   3.2 Random Variables 56
   3.3 Probability Distributions 56
   3.4 Marginal Probability 58
   3.5 Conditional Probability 59
   3.6 The Chain Rule of Conditional Probabilities 59
   3.7 Independence and Conditional Independence 60
   3.8 Expectation, Variance and Covariance 60
   3.9 Common Probability Distributions 62
   3.10 Useful Properties of Common Functions 67
   3.11 Bayes' Rule 70
   3.12 Technical Details of Continuous Variables 71
   3.13 Information Theory 72
   3.14 Structured Probabilistic Models 75

4 Numerical Computation 80
   4.1 Overflow and Underflow 80
   4.2 Poor Conditioning 82
   4.3 Gradient-Based Optimization 82
   4.4 Constrained Optimization 93
   4.5 Example: Linear Least Squares 96

5 Machine Learning Basics 98
   5.1 Learning Algorithms 99
   5.2 Capacity, Overfitting and Underfitting 110
   5.3 Hyperparameters and Validation Sets 120
   5.4 Estimators, Bias and Variance 122
   5.5 Maximum Likelihood Estimation 131
   5.6 Bayesian Statistics 135
   5.7 Supervised Learning Algorithms 139
   5.8 Unsupervised Learning Algorithms 145
   5.9 Stochastic Gradient Descent 150
   5.10 Building a Machine Learning Algorithm 152
   5.11 Challenges Motivating Deep Learning 154

II Deep Networks: Modern Practices 165

6 Deep Feedforward Networks 167
   6.1 Example: Learning XOR 170
   6.2 Gradient-Based Learning 176
   6.3 Hidden Units 190
   6.4 Architecture Design 196
   6.5 Back-Propagation and Other Differentiation Algorithms 203
   6.6 Historical Notes 224

7 Regularization for Deep Learning 228
   7.1 Parameter Norm Penalties 230
   7.2 Norm Penalties as Constrained Optimization 237
   7.3 Regularization and Under-Constrained Problems 239
   7.4 Dataset Augmentation 240
   7.5 Noise Robustness 242
   7.6 Semi-Supervised Learning 244
   7.7 Multi-Task Learning 245
   7.8 Early Stopping 246
   7.9 Parameter Tying and Parameter Sharing 251
   7.10 Sparse Representations 253
   7.11 Bagging and Other Ensemble Methods 255
   7.12 Dropout 257
   7.13 Adversarial Training 267
   7.14 Tangent Distance, Tangent Prop, and Manifold Tangent Classifier 268

8 Optimization for Training Deep Models 274
   8.1 How Learning Differs from Pure Optimization 275
   8.2 Challenges in Neural Network Optimization 282
   8.3 Basic Algorithms 294
   8.4 Parameter Initialization Strategies 301
   8.5 Algorithms with Adaptive Learning Rates 306
   8.6 Approximate Second-Order Methods 310
   8.7 Optimization Strategies and Meta-Algorithms 318

9 Convolutional Networks 331
   9.1 The Convolution Operation 332
   9.2 Motivation 336
   9.3 Pooling 340
   9.4 Convolution and Pooling as an Infinitely Strong Prior 346
   9.5 Variants of the Basic Convolution Function 348
   9.6 Structured Outputs 359
   9.7 Data Types 361
   9.8 Efficient Convolution Algorithms 363
   9.9 Random or Unsupervised Features 364
   9.10 The Neuroscientific Basis for Convolutional Networks 365
   9.11 Convolutional Networks and the History of Deep Learning 372

10 Sequence Modeling: Recurrent and Recursive Nets 374
   10.1 Unfolding Computational Graphs 376
   10.2 Recurrent Neural Networks 379
   10.3 Bidirectional RNNs 396
   10.4 Encoder-Decoder Sequence-to-Sequence Architectures 397
   10.5 Deep Recurrent Networks 399
   10.6 Recursive Neural Networks 401
   10.7 The Challenge of Long-Term Dependencies 403
   10.8 Echo State Networks 406
   10.9 Leaky Units and Other Strategies for Multiple Time Scales 409
   10.10 The Long Short-Term Memory and Other Gated RNNs 411
   10.11 Optimization for Long-Term Dependencies 415
   10.12 Explicit Memory 419

11 Practical Methodology 424
   11.1 Performance Metrics 425
   11.2 Default Baseline Models 428
   11.3 Determining Whether to Gather More Data 429
   11.4 Selecting Hyperparameters 430
   11.5 Debugging Strategies 439
   11.6 Example: Multi-Digit Number Recognition 443

12 Applications 446
   12.1 Large-Scale Deep Learning 446
   12.2 Computer Vision 455
   12.3 Speech Recognition 461
   12.4 Natural Language Processing 464
   12.5 Other Applications 480

III Deep Learning Research 489

13 Linear Factor Models 492
   13.1 Probabilistic PCA and Factor Analysis 493
   13.2 Independent Component Analysis (ICA) 494
   13.3 Slow Feature Analysis 496
   13.4 Sparse Coding 499
   13.5 Manifold Interpretation of PCA 502

14 Autoencoders 505
   14.1 Undercomplete Autoencoders 506
   14.2 Regularized Autoencoders 507
   14.3 Representational Power, Layer Size and Depth 511
   14.4 Stochastic Encoders and Decoders 512
   14.5 Denoising Autoencoders 513
   14.6 Learning Manifolds with Autoencoders 518
   14.7 Contractive Autoencoders 524
   14.8 Predictive Sparse Decomposition 526
   14.9 Applications of Autoencoders 527

15 Representation Learning 529
   15.1 Greedy Layer-Wise Unsupervised Pretraining 531
   15.2 Transfer Learning and Domain Adaptation 539
   15.3 Semi-Supervised Disentangling of Causal Factors 544
   15.4 Distributed Representation 549
   15.5 Exponential Gains from Depth 556
   15.6 Providing Clues to Discover Underlying Causes 557

16 Structured Probabilistic Models for Deep Learning 561
   16.1 The Challenge of Unstructured Modeling 562
   16.2 Using Graphs to Describe Model Structure 566
   16.3 Sampling from Graphical Models 583
   16.4 Advantages of Structured Modeling 584
   16.5 Learning about Dependencies 585
   16.6 Inference and Approximate Inference 586
   16.7 The Deep Learning Approach to Structured Probabilistic Models 587

17 Monte Carlo Methods 593
   17.1 Sampling and Monte Carlo Methods 593
   17.2 Importance Sampling 595
   17.3 Markov Chain Monte Carlo Methods 598
   17.4 Gibbs Sampling 602
   17.5 The Challenge of Mixing between Separated Modes 602

18 Confronting the Partition Function 608
   18.1 The Log-Likelihood Gradient 609
   18.2 Stochastic Maximum Likelihood and Contrastive Divergence 610
   18.3 Pseudolikelihood 618
   18.4 Score Matching and Ratio Matching 620
   18.5 Denoising Score Matching 622
   18.6 Noise-Contrastive Estimation 623
   18.7 Estimating the Partition Function 626

19 Approximate Inference 634
   19.1 Inference as Optimization 636
   19.2 Expectation Maximization 637
   19.3 MAP Inference and Sparse Coding 638
   19.4 Variational Inference and Learning 641
   19.5 Learned Approximate Inference 653

20 Deep Generative Models 656
   20.1 Boltzmann Machines 656
   20.2 Restricted Boltzmann Machines 658
   20.3 Deep Belief Networks 662
   20.4 Deep Boltzmann Machines 665
   20.5 Boltzmann Machines for Real-Valued Data 678
   20.6 Convolutional Boltzmann Machines 685
   20.7 Boltzmann Machines for Structured or Sequential Outputs 687
   20.8 Other Boltzmann Machines 688
   20.9 Back-Propagation through Random Operations 689
   20.10 Directed Generative Nets 694
   20.11 Drawing Samples from Autoencoders 712
   20.12 Generative Stochastic Networks 716
   20.13 Other Generation Schemes 717
   20.14 Evaluating Generative Models 719
   20.15 Conclusion 721

Bibliography 723

Index 780
Website

www.deeplearningbook.org

This book is accompanied by the above website. The website provides a variety of supplementary material, including exercises, lecture slides, corrections of mistakes, and other resources that should be useful to both readers and instructors.
Acknowledgments

This book would not have been possible without the contributions of many people.

We would like to thank those who commented on our proposal for the book and helped plan its contents and organization: Guillaume Alain, Kyunghyun Cho, Çağlar Gülçehre, David Krueger, Hugo Larochelle, Razvan Pascanu and Thomas Rohée.

We would like to thank the people who offered feedback on the content of the book itself. Some offered feedback on many chapters: Martín Abadi, Guillaume Alain, Ion Androutsopoulos, Fred Bertsch, Olexa Bilaniuk, Ufuk Can Biçici, Matko Bošnjak, John Boersma, Greg Brockman, Pierre Luc Carrier, Sarath Chandar, Pawel Chilinski, Mark Daoust, Oleg Dashevskii, Laurent Dinh, Stephan Dreseitl, Jim Fan, Miao Fan, Meire Fortunato, Frédéric Francis, Nando de Freitas, Çağlar Gülçehre, Jurgen Van Gael, Javier Alonso García, Jonathan Hunt, Gopi Jeyaram, Chingiz Kabytayev, Lukasz Kaiser, Varun Kanade, Akiel Khan, John King, Diederik P. Kingma, Yann LeCun, Rudolf Mathey, Matías Mattamala, Abhinav Maurya, Kevin Murphy, Oleg Mürk, Roman Novak, Augustus Q. Odena, Simon Pavlik, Karl Pichotta, Kari Pulli, Tapani Raiko, Anurag Ranjan, Johannes Roith, Halis Sak, César Salgado, Grigory Sapunov, Mike Schuster, Julian Serban, Nir Shabat, Ken Shirriff, Scott Stanley, David Sussillo, Ilya Sutskever, Carles Gelada Sáez, Graham Taylor, Valentin Tolmer, An Tran, Shubhendu Trivedi, Alexey Umnov, Vincent Vanhoucke, Marco Visentini-Scarzanella, David Warde-Farley, Dustin Webb, Kelvin Xu, Wei Xue, Li Yao, Zygmunt Zając and Ozan Çağlayan.

We would also like to thank those who provided us with useful feedback on individual chapters:

• Chapter 1, Introduction: Yusuf Akgul, Sebastien Bratieres, Samira Ebrahimi, Charlie Gorichanaz, Brendan Loudermilk, Eric Morris, Cosmin Pârvulescu and Alfredo Solano.

• Chapter 2, Linear Algebra: Amjad Almahairi, Nikola Banić, Kevin Bennett, Philippe Castonguay, Oscar Chang, Eric Fosler-Lussier, Andrey Khalyavin, Sergey Oreshkov, István Petrás, Dennis Prangle, Thomas Rohée, Colby Toland, Massimiliano Tomassoli, Alessandro Vitale and Bob Welland.

• Chapter 3, Probability and Information Theory: John Philip Anderson, Kai Arulkumaran, Vincent Dumoulin, Rui Fa, Stephan Gouws, Artem Oboturov, Antti Rasmus, Andre Simpelo, Alexey Surkov and Volker Tresp.

• Chapter 4, Numerical Computation: Tran Lam An, Ian Fischer, and Hu Yuhuang.

• Chapter 5, Machine Learning Basics: Dzmitry Bahdanau, Nikhil Garg, Makoto Otsuka, Bob Pepin, Philip Popien, Emmanuel Rayner, Kee-Bong Song, Zheng Sun and Andy Wu.

• Chapter 6, Deep Feedforward Networks: Uriel Berdugo, Fabrizio Bottarel, Elizabeth Burl, Ishan Durugkar, Jeff Hlywa, Jong Wook Kim, David Krueger and Aditya Kumar Praharaj.

• Chapter 7, Regularization for Deep Learning: Inkyu Lee, Sunil Mohan and Joshua Salisbury.
Chapter 7, Regularization for Deep Learning: Inkyu Lee, Sunil Mohan and Joshua Salisbury . Chapter 8, Optimization for Training Deep Mo Models dels dels:: Marcel Ackermann, Ro Row wel Atienza, Andrew Bro Brock, ck, Tegan Mahara Maharaj, j, James Martens and Klaus Chapter 8, Optimization for Training Deep Models: Marcel Ackermann, Strobl. Rowel Atienza, Andrew Brock, Tegan Mahara j, James Martens and Klaus Strobl. 9, Conv Chapter Convolutional olutional Netw Networks orks orks:: Martín Arjovsky Arjovsky,, Eugene Brevdo, Eric Jensen, Asifullah Khan, Mehdi Mirza, Alex Paino, Eddie Pierce, Marjorie Chapter 9, Conv olutional Networks Sa Say yer, Ryan Stout and Wentao Wu.: Martín Arjovsky, Eugene Brevdo, Eric Jensen, Asifullah Khan, Mehdi Mirza, Alex Paino, Eddie Pierce, Marjorie Sayer, Ryan and Modeling: Wentao Wu.Recurren Chapter 10, Stout Sequence Recurrentt and Recursive Nets Nets:: Gökçen Eraslan, Stev Steven en Hickson, Razv Razvan an Pascan Pascanu, u, Lorenzo von Ritter, Rui Ro Rodrigues, drigues, Chapter 10 , Sequence Modeling: Recurren t and Recursive Nets : Gökçen Mihaela Rosca, Dmitriy Serdyuk, Dongyu Shi and Kaiyu Yang. Eraslan, Steven Hickson, Razvan Pascanu, Lorenzo von Ritter, Rui Rodrigues, Mihaela Dmitriy Serdyuk, and Kaiyu Yang. Chapter Rosca, 11, Practical metho methodology dologyDongyu : DanielShi Bec Beckstein. kstein. 11, Applications Practical metho dologyDahl : Daniel kstein. Chapter 12 : George and Bec Ribana Roscher. 12, Representation Applications: George Dahl and Ribana Chapter 15 Learning : Kunal Ghosh.Roscher.
Learning: Mo Kunal Chapter 15 16, Representation Structured Probabilistic Models dels Ghosh. for Deep Learning: Minh Lê and Anton Varfolom. Chapter 16, Structured Probabilistic Models for Deep Learning: Minh Lê • and Chapter 18,VConfronting the Partition Function unction:: Sam Bowman. Anton arfolom. ix Chapter 18, Confronting the Partition Function: Sam Bowman.
• Chapter 20, Deep Generative Models: Nicolas Chapados, Daniel Galvez, Wenming Ma, Fady Medhat, Shakir Mohamed and Grégoire Montavon.

• Bibliography: Leslie N. Smith.

We also want to thank those who allowed us to reproduce images, figures or data from their publications. We indicate their contributions in the figure captions throughout the text.

We would like to thank Ian's wife Daniela Flori Goodfellow for patiently supporting Ian during the writing of the book as well as for help with proofreading.

We would like to thank the Google Brain team for providing an intellectual environment where Ian could devote a tremendous amount of time to writing this book and receive feedback and guidance from colleagues. We would especially like to thank Ian's former manager, Greg Corrado, and his current manager, Samy Bengio, for their support of this project. Finally, we would like to thank Geoffrey Hinton for encouragement when writing was difficult.
Notation
This section provides a concise reference describing the notation used throughout this book. If you are unfamiliar with any of the corresponding mathematical concepts, this notation reference may seem intimidating. However, do not despair: we describe most of these ideas in chapters 2–4.

Numbers and Arrays

$a$: A scalar (integer or real)
$\boldsymbol{a}$: A vector
$\boldsymbol{A}$: A matrix
$\mathsf{A}$: A tensor
$\boldsymbol{I}_n$: Identity matrix with $n$ rows and $n$ columns
$\boldsymbol{I}$: Identity matrix with dimensionality implied by context
$\boldsymbol{e}^{(i)}$: Standard basis vector $[0, \dots, 0, 1, 0, \dots, 0]$ with a 1 at position $i$
$\mathrm{diag}(\boldsymbol{a})$: A square, diagonal matrix with diagonal entries given by $\boldsymbol{a}$
$\mathrm{a}$: A scalar random variable
$\mathbf{a}$: A vector-valued random variable
$\mathbf{A}$: A matrix-valued random variable
Sets and Graphs

$\mathbb{A}$: A set
$\mathbb{R}$: The set of real numbers
$\{0, 1\}$: The set containing 0 and 1
$\{0, 1, \dots, n\}$: The set of all integers between 0 and $n$
$[a, b]$: The real interval including $a$ and $b$
$(a, b]$: The real interval excluding $a$ but including $b$
$\mathbb{A} \backslash \mathbb{B}$: Set subtraction, i.e., the set containing the elements of $\mathbb{A}$ that are not in $\mathbb{B}$
$\mathcal{G}$: A graph
$Pa_{\mathcal{G}}(\mathrm{x}_i)$: The parents of $\mathrm{x}_i$ in $\mathcal{G}$
Indexing

$a_i$: Element $i$ of vector $\boldsymbol{a}$, with indexing starting at 1
$a_{-i}$: All elements of vector $\boldsymbol{a}$ except for element $i$
$A_{i,j}$: Element $i, j$ of matrix $\boldsymbol{A}$
$\boldsymbol{A}_{i,:}$: Row $i$ of matrix $\boldsymbol{A}$
$\boldsymbol{A}_{:,i}$: Column $i$ of matrix $\boldsymbol{A}$
$\mathsf{A}_{i,j,k}$: Element $(i, j, k)$ of a 3-D tensor $\mathsf{A}$
$\mathsf{A}_{:,:,i}$: 2-D slice of a 3-D tensor
$\mathrm{a}_i$: Element $i$ of the random vector $\mathbf{a}$
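These indexing conventions translate almost directly into array libraries. The following is a brief NumPy sketch added for illustration (not part of the book's notation table); note that NumPy indexes from 0, while the notation above indexes from 1.

```python
import numpy as np

# A 2-D matrix; NumPy indexes from 0, the book's notation from 1.
A = np.array([[1., 2., 3.],
              [4., 5., 6.]])

row_1 = A[0, :]   # the book's A_{1,:}, the first row
col_2 = A[:, 1]   # the book's A_{:,2}, the second column
elem = A[0, 1]    # the book's A_{1,2}

# A 3-D tensor and one of its 2-D slices, the book's A_{:,:,i}.
T = np.arange(24).reshape(2, 3, 4)
slice_0 = T[:, :, 0]

print(row_1)          # [1. 2. 3.]
print(col_2)          # [2. 5.]
print(elem)           # 2.0
print(slice_0.shape)  # (2, 3)
```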
Linear Algebra Operations

$\boldsymbol{A}^\top$: Transpose of matrix $\boldsymbol{A}$
$\boldsymbol{A}^+$: Moore-Penrose pseudoinverse of $\boldsymbol{A}$
$\boldsymbol{A} \odot \boldsymbol{B}$: Element-wise (Hadamard) product of $\boldsymbol{A}$ and $\boldsymbol{B}$
$\det(\boldsymbol{A})$: Determinant of $\boldsymbol{A}$
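For readers who want to experiment, the same operations are available in NumPy. This sketch is added for illustration and is not part of the original notation table; the operations themselves are covered in detail in Chapter 2.

```python
import numpy as np

A = np.array([[1., 2.],
              [3., 4.]])
B = np.array([[5., 6.],
              [7., 8.]])

At = A.T                    # transpose, A^T
A_pinv = np.linalg.pinv(A)  # Moore-Penrose pseudoinverse, A^+
hadamard = A * B            # element-wise (Hadamard) product, A ⊙ B
det_A = np.linalg.det(A)    # determinant, det(A)

# For an invertible matrix, the pseudoinverse equals the ordinary inverse.
print(np.allclose(A_pinv, np.linalg.inv(A)))  # True
print(det_A)  # -2.0 (approximately)
```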
Calculus

$\frac{dy}{dx}$: Derivative of $y$ with respect to $x$
$\frac{\partial y}{\partial x}$: Partial derivative of $y$ with respect to $x$
$\nabla_{\boldsymbol{x}} y$: Gradient of $y$ with respect to $\boldsymbol{x}$
$\nabla_{\boldsymbol{X}} y$: Matrix derivatives of $y$ with respect to $\boldsymbol{X}$
$\nabla_{\mathsf{X}} y$: Tensor containing derivatives of $y$ with respect to $\mathsf{X}$
$\frac{\partial f}{\partial \boldsymbol{x}}$: Jacobian matrix $\boldsymbol{J} \in \mathbb{R}^{m \times n}$ of $f : \mathbb{R}^n \to \mathbb{R}^m$
$\nabla_{\boldsymbol{x}}^2 f(\boldsymbol{x})$ or $\boldsymbol{H}(f)(\boldsymbol{x})$: The Hessian matrix of $f$ at input point $\boldsymbol{x}$
$\int f(\boldsymbol{x}) d\boldsymbol{x}$: Definite integral over the entire domain of $\boldsymbol{x}$
$\int_{\mathbb{S}} f(\boldsymbol{x}) d\boldsymbol{x}$: Definite integral with respect to $\boldsymbol{x}$ over the set $\mathbb{S}$

Probability and Information Theory

$\mathrm{a} \perp \mathrm{b}$: The random variables $\mathrm{a}$ and $\mathrm{b}$ are independent
$\mathrm{a} \perp \mathrm{b} \mid \mathrm{c}$: They are conditionally independent given $\mathrm{c}$
$P(\mathrm{a})$: A probability distribution over a discrete variable
$p(\mathrm{a})$: A probability distribution over a continuous variable, or over a variable whose type has not been specified
$\mathrm{a} \sim P$: Random variable $\mathrm{a}$ has distribution $P$
$\mathbb{E}_{\mathrm{x} \sim P}[f(x)]$ or $\mathbb{E} f(x)$: Expectation of $f(x)$ with respect to $P(\mathrm{x})$
$\mathrm{Var}(f(x))$: Variance of $f(x)$ under $P(\mathrm{x})$
$\mathrm{Cov}(f(x), g(x))$: Covariance of $f(x)$ and $g(x)$ under $P(\mathrm{x})$
$H(\mathrm{x})$: Shannon entropy of the random variable $\mathrm{x}$
$D_{\mathrm{KL}}(P \| Q)$: Kullback-Leibler divergence of $P$ and $Q$
$\mathcal{N}(\boldsymbol{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma})$: Gaussian distribution over $\boldsymbol{x}$ with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$
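A few of these quantities are straightforward to compute for small discrete distributions. The sketch below, added for illustration and assuming natural logarithms (so results are in nats), defines entropy and KL divergence directly from their formulas; the example distributions are invented.

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(x) of a discrete distribution p, in nats."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(P || Q) for discrete P, Q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

uniform = [0.5, 0.5]
skewed = [0.9, 0.1]

print(entropy(uniform))                 # log 2, about 0.6931
print(kl_divergence(skewed, uniform))   # positive; zero only when P = Q
print(kl_divergence(uniform, uniform))  # 0.0
```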
Functions

$f : \mathbb{A} \to \mathbb{B}$: The function $f$ with domain $\mathbb{A}$ and range $\mathbb{B}$
$f \circ g$: Composition of the functions $f$ and $g$
$f(\boldsymbol{x}; \boldsymbol{\theta})$: A function of $\boldsymbol{x}$ parametrized by $\boldsymbol{\theta}$. (Sometimes we just write $f(\boldsymbol{x})$ and ignore the argument $\boldsymbol{\theta}$ to lighten notation.)
$\log x$: Natural logarithm of $x$
$\sigma(x)$: Logistic sigmoid, $\frac{1}{1 + \exp(-x)}$
$\zeta(x)$: Softplus, $\log(1 + \exp(x))$
$\|\boldsymbol{x}\|_p$: $L^p$ norm of $\boldsymbol{x}$
$\|\boldsymbol{x}\|$: $L^2$ norm of $\boldsymbol{x}$
$x^+$: Positive part of $x$, i.e., $\max(0, x)$
$\boldsymbol{1}_{\mathrm{condition}}$: is 1 if the condition is true, 0 otherwise

Sometimes we use a function $f$ whose argument is a scalar but apply it to a vector, matrix, or tensor: $f(\boldsymbol{x})$, $f(\boldsymbol{X})$, or $f(\mathsf{X})$. This means to apply $f$ to the array element-wise. For example, if $\mathsf{C} = \sigma(\mathsf{X})$, then $\mathsf{C}_{i,j,k} = \sigma(\mathsf{X}_{i,j,k})$ for all valid values of $i$, $j$ and $k$.

Datasets and Distributions

$p_{\mathrm{data}}$: The data generating distribution
$\hat{p}_{\mathrm{data}}$: The empirical distribution defined by the training set
$\mathbb{X}$: A set of training examples
$\boldsymbol{x}^{(i)}$: The $i$-th example (input) from a dataset
$y^{(i)}$ or $\boldsymbol{y}^{(i)}$: The target associated with $\boldsymbol{x}^{(i)}$ for supervised learning
$\boldsymbol{X}$: The $m \times n$ matrix with input example $\boldsymbol{x}^{(i)}$ in row $\boldsymbol{X}_{i,:}$
Chapter 1
Introduction

Inventors have long dreamed of creating machines that think. This desire dates back to at least the time of ancient Greece. The mythical figures Pygmalion, Daedalus, and Hephaestus may all be interpreted as legendary inventors, and Galatea, Talos, and Pandora may all be regarded as artificial life (Ovid and Martin, 2004; Sparkes, 1996; Tandy, 1997).

When programmable computers were first conceived, people wondered whether they might become intelligent, over a hundred years before one was built (Lovelace, 1842). Today, artificial intelligence (AI) is a thriving field with many practical applications and active research topics. We look to intelligent software to automate routine labor, understand speech or images, make diagnoses in medicine and support basic scientific research.

In the early days of artificial intelligence, the field rapidly tackled and solved problems that are intellectually difficult for human beings but relatively straightforward for computers—problems that can be described by a list of formal, mathematical rules. The true challenge to artificial intelligence proved to be solving the tasks that are easy for people to perform but hard for people to describe formally—problems that we solve intuitively, that feel automatic, like recognizing spoken words or faces in images.

This book is about a solution to these more intuitive problems. This solution is to allow computers to learn from experience and understand the world in terms of a hierarchy of concepts, with each concept defined in terms of its relation to simpler concepts. By gathering knowledge from experience, this approach avoids the need for human operators to formally specify all of the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones. If we draw a graph showing how these concepts are built on top of each other, the graph is deep, with many layers. For this reason, we call this approach to AI deep learning.

Many of the early successes of AI took place in relatively sterile and formal environments and did not require computers to have much knowledge about the world. For example, IBM's Deep Blue chess-playing system defeated world champion Garry Kasparov in 1997 (Hsu, 2002). Chess is of course a very simple world, containing only sixty-four locations and thirty-two pieces that can move in only rigidly circumscribed ways. Devising a successful chess strategy is a tremendous accomplishment, but the challenge is not due to the difficulty of describing the set of chess pieces and allowable moves to the computer. Chess can be completely described by a very brief list of completely formal rules, easily provided ahead of time by the programmer.
Ironically, abstract and formal tasks that are among the most difficult mental undertakings for a human being are among the easiest for a computer. Computers have long been able to defeat even the best human chess player, but are only recently matching some of the abilities of average human beings to recognize objects or speech. A person's everyday life requires an immense amount of knowledge about the world. Much of this knowledge is subjective and intuitive, and therefore difficult to articulate in a formal way. Computers need to capture this same knowledge in order to behave in an intelligent way. One of the key challenges in artificial intelligence is how to get this informal knowledge into a computer.

Several artificial intelligence projects have sought to hardcode knowledge about the world in formal languages. A computer can reason about statements in these formal languages automatically using logical inference rules. This is known as the knowledge base approach to artificial intelligence. None of these projects has led to a major success. One of the most famous such projects is Cyc (Lenat and Guha, 1989). Cyc is an inference engine and a database of statements in a language called CycL. These statements are entered by a staff of human supervisors. It is an unwieldy process. People struggle to devise formal rules with enough complexity to accurately describe the world. For example, Cyc failed to understand a story about a person named Fred shaving in the morning (Linde, 1992). Its inference engine detected an inconsistency in the story: it knew that people do not have electrical parts, but because Fred was holding an electric razor, it believed the entity "FredWhileShaving" contained electrical parts. It therefore asked whether Fred was still a person while he was shaving.

The difficulties faced by systems relying on hardcoded knowledge suggest that AI systems need the ability to acquire their own knowledge, by extracting patterns from raw data. This capability is known as machine learning. The introduction of machine learning allowed computers to tackle problems involving knowledge of the real world and make decisions that appear subjective. A simple machine learning algorithm called logistic regression can determine whether to recommend cesarean delivery (Mor-Yosef et al., 1990). A simple machine learning algorithm called naive Bayes can separate legitimate email from spam email.

The performance of these simple machine learning algorithms depends heavily on the representation of the data they are given. For example, when logistic regression is used to recommend cesarean delivery, the AI system does not examine the patient directly. Instead, the doctor tells the system several pieces of relevant information, such as the presence or absence of a uterine scar. Each piece of information included in the representation of the patient is known as a feature. Logistic regression learns how each of these features of the patient correlates with various outcomes. However, it cannot influence the way that the features are defined in any way. If logistic regression was given an MRI scan of the patient, rather than the doctor's formalized report, it would not be able to make useful predictions. Individual pixels in an MRI scan have negligible correlation with any complications that might occur during delivery.

This dependence on representations is a general phenomenon that appears throughout computer science and even daily life. In computer science, operations such as searching a collection of data can proceed exponentially faster if the collection is structured and indexed intelligently. People can easily perform arithmetic on Arabic numerals, but find arithmetic on Roman numerals much more time-consuming.
It is not surprising that the choice of representation has an enormous effect on the performance of machine learning algorithms. For a simple visual example, see Fig. 1.1.

Many artificial intelligence tasks can be solved by designing the right set of features to extract for that task, then providing these features to a simple machine learning algorithm. For example, a useful feature for speaker identification from sound is an estimate of the size of the speaker's vocal tract. It therefore gives a strong clue as to whether the speaker is a man, woman, or child.

However, for many tasks, it is difficult to know what features should be extracted. For example, suppose that we would like to write a program to detect cars in photographs. We know that cars have wheels, so we might like to use the presence of a wheel as a feature. Unfortunately, it is difficult to describe exactly what a wheel looks like in terms of pixel values. A wheel has a simple geometric shape but its image may be complicated by shadows falling on the wheel, the sun glaring off the metal parts of the wheel, the fender of the car or an object in the foreground obscuring part of the wheel, and so on.
Figure 1.1: Example of different representations: suppose we want to separate two categories of data by drawing a line between them in a scatterplot. In the plot on the left, we represent some data using Cartesian coordinates, and the task is impossible. In the plot on the right, we represent the data with polar coordinates and the task becomes simple to solve with a vertical line. (Figure produced in collaboration with David Warde-Farley)
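The situation in Figure 1.1 is easy to reproduce numerically: points drawn from two concentric rings cannot be split by any vertical line in Cartesian coordinates, but after converting to polar coordinates a single threshold on the radius separates them perfectly. A sketch added for illustration (the ring radii are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two categories: points on a small ring (r = 1) and a large ring (r = 3).
def ring(radius, n):
    angles = rng.uniform(0.0, 2.0 * np.pi, n)
    return np.stack([radius * np.cos(angles), radius * np.sin(angles)], axis=1)

inner, outer = ring(1.0, 100), ring(3.0, 100)

# Convert Cartesian (x, y) to polar (r, theta).
def to_polar(points):
    x, y = points[:, 0], points[:, 1]
    return np.hypot(x, y), np.arctan2(y, x)

r_inner, _ = to_polar(inner)
r_outer, _ = to_polar(outer)

# In the (r, theta) representation, a "vertical line" r = 2 separates
# the two categories; no vertical line x = c does so in Cartesian form.
threshold = 2.0
print(np.all(r_inner < threshold) and np.all(r_outer > threshold))  # True
```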
One solution to this problem is to use machine learning to discover not only the mapping from representation to output but also the representation itself. This approach is known as representation learning. Learned representations often result in much better performance than can be obtained with hand-designed representations. They also allow AI systems to rapidly adapt to new tasks, with minimal human intervention. A representation learning algorithm can discover a good set of features for a simple task in minutes, or for a complex task in hours to months. Manually designing features for a complex task requires a great deal of human time and effort; it can take decades for an entire community of researchers.

The quintessential example of a representation learning algorithm is the autoencoder. An autoencoder is the combination of an encoder function that converts the input data into a different representation, and a decoder function that converts the new representation back into the original format. Autoencoders are trained to preserve as much information as possible when an input is run through the encoder and then the decoder, but they are also trained to make the new representation have various nice properties. Different kinds of autoencoders aim to achieve different kinds of properties.

When designing features or algorithms for learning features, our goal is usually to separate the factors of variation that explain the observed data. In this context, we use the word "factors" simply to refer to separate sources of influence; the factors are usually not combined by multiplication. Such factors are often not quantities
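To make the encoder/decoder idea concrete, here is a minimal linear autoencoder trained by plain gradient descent on a mean-squared reconstruction loss. This is a toy sketch of our own, not the book's method: the data (3-D points that really vary along one direction), the 1-D code size, and the learning-rate and step-count choices are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# 3-D inputs that actually vary along one direction: a 1-D code can capture them.
z = rng.normal(size=(500, 1))
X = z @ np.array([[2.0, -1.0, 0.5]]) + 0.01 * rng.normal(size=(500, 3))

W_enc = rng.normal(scale=0.1, size=(3, 1))  # encoder: input -> code
W_dec = rng.normal(scale=0.1, size=(1, 3))  # decoder: code -> reconstruction

lr = 0.01
for _ in range(500):
    code = X @ W_enc              # encoder output (the learned representation)
    X_hat = code @ W_dec          # decoder output (the reconstruction)
    err = X_hat - X               # reconstruction error
    # Gradients of the mean squared reconstruction loss for each weight matrix.
    W_dec -= lr * code.T @ err / len(X)
    W_enc -= lr * X.T @ (err @ W_dec.T) / len(X)

mse = np.mean((X @ W_enc @ W_dec - X) ** 2)
print(mse)  # small: most of the variance survives the 1-D bottleneck
```

The encoder compresses each 3-D input to a single number, yet the decoder can still reconstruct the input well, because training discovered the one direction of variation that matters.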
that are directly observed. Instead, they may exist either as unobserved objects or unobserved forces in the physical world that affect observable quantities. They may also exist as constructs in the human mind that provide useful simplifying explanations or inferred causes of the observed data. They can be thought of as concepts or abstractions that help us make sense of the rich variability in the data. When analyzing a speech recording, the factors of variation include the speaker's age, their sex, their accent and the words that they are speaking. When analyzing an image of a car, the factors of variation include the position of the car, its color, and the angle and brightness of the sun.

A major source of difficulty in many real-world artificial intelligence applications is that many of the factors of variation influence every single piece of data we are able to observe. The individual pixels in an image of a red car might be very close to black at night. The shape of the car's silhouette depends on the viewing angle. Most applications require us to disentangle the factors of variation and discard the ones that we do not care about.

Of course, it can be very difficult to extract such high-level, abstract features from raw data. Many of these factors of variation, such as a speaker's accent, can be identified only using sophisticated, nearly human-level understanding of the data. When it is nearly as difficult to obtain a representation as to solve the original problem, representation learning does not, at first glance, seem to help us.

Deep learning solves this central problem in representation learning by introducing representations that are expressed in terms of other, simpler representations. Deep learning allows the computer to build complex concepts out of simpler concepts. Fig. 1.2 shows how a deep learning system can represent the concept of an image of a person by combining simpler concepts, such as corners and contours, which are in turn defined in terms of edges.

The quintessential example of a deep learning model is the feedforward deep network or multilayer perceptron (MLP). A multilayer perceptron is just a mathematical function mapping some set of input values to output values. The function is formed by composing many simpler functions. We can think of each application of a different mathematical function as providing a new representation of the input.

The idea of learning the right representation for the data provides one perspective on deep learning. Another perspective on deep learning is that depth allows the computer to learn a multi-step computer program. Each layer of the representation can be thought of as the state of the computer's memory after executing another set of instructions in parallel. Networks with greater depth can execute more instructions in sequence. Sequential instructions offer great power because later instructions can refer back to the results of earlier instructions. According to this
[Figure 1.2 layer labels, bottom to top: visible layer (input pixels); 1st hidden layer (edges); 2nd hidden layer (corners and contours); 3rd hidden layer (object parts); output (object identity: CAR, PERSON, ANIMAL).]
Figure 1.2: Illustration of a deep learning model. It is difficult for a computer to understand the meaning of raw sensory input data, such as this image represented as a collection of pixel values. The function mapping from a set of pixels to an object identity is very complicated. Learning or evaluating this mapping seems insurmountable if tackled directly. Deep learning resolves this difficulty by breaking the desired complicated mapping into a series of nested simple mappings, each described by a different layer of the model. The input is presented at the visible layer, so named because it contains the variables that we are able to observe. Then a series of hidden layers extracts increasingly abstract features from the image. These layers are called "hidden" because their values are not given in the data; instead the model must determine which concepts are useful for explaining the relationships in the observed data. The images here are visualizations of the kind of feature represented by each hidden unit. Given the pixels, the first layer can easily identify edges, by comparing the brightness of neighboring pixels. Given the first hidden layer's description of the edges, the second hidden layer can easily search for corners and extended contours, which are recognizable as collections of edges. Given the second hidden layer's description of the image in terms of corners and contours, the third hidden layer can detect entire parts of specific objects, by finding specific collections of contours and corners. Finally, this description of the image in terms of the object parts it contains can be used to recognize the objects present in the image. Images reproduced with permission from Zeiler and Fergus (2014).
Figure 1.3: Illustration of computational graphs mapping an input to an output where each node performs an operation. Depth is the length of the longest path from input to output, but depends on the definition of what constitutes a possible computational step. The computation depicted in these graphs is the output of a logistic regression model, σ(wᵀx), where σ is the logistic sigmoid function. If we use addition, multiplication and logistic sigmoids as the elements of our computer language, then this model has depth three. If we view logistic regression as an element itself, then this model has depth one.
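The two depth measurements of Fig. 1.3 can be written out directly. Below, the same logistic-regression output σ(wᵀx) is computed once as a composition of elementary steps (multiplication, then addition, then the sigmoid: depth three in that language) and once as a single opaque element (depth one). The weight and input values are arbitrary illustrative numbers of our own choosing.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

w = [0.5, -1.25]
x = [2.0, 0.8]

# Depth-three view: elementwise multiplies, then an addition, then the sigmoid.
products = [wi * xi for wi, xi in zip(w, x)]   # step 1: multiplication
s = products[0] + products[1]                  # step 2: addition
y_deep = sigmoid(s)                            # step 3: nonlinearity

# Depth-one view: logistic regression treated as a single primitive element.
def logistic_regression(w, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

y_shallow = logistic_regression(w, x)
assert y_deep == y_shallow  # same function, two depth measurements
print(y_deep)  # 0.5, since w·x = 0 here
```

The function computed is identical either way; only the choice of what counts as one computational step changes, and with it the reported depth.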
view of deep learning, not all of the information in a layer's activations necessarily encodes factors of variation that explain the input. The representation also stores state information that helps to execute a program that can make sense of the input. This state information could be analogous to a counter or pointer in a traditional computer program. It has nothing to do with the content of the input specifically, but it helps the model to organize its processing.

There are two main ways of measuring the depth of a model. The first view is based on the number of sequential instructions that must be executed to evaluate the architecture. We can think of this as the length of the longest path through a flow chart that describes how to compute each of the model's outputs given its inputs. Just as two equivalent computer programs will have different lengths depending on which language the program is written in, the same function may be drawn as a flowchart with different depths depending on which functions we allow to be used as individual steps in the flowchart. Fig. 1.3 illustrates how this choice of language can give two different measurements for the same architecture.

Another approach, used by deep probabilistic models, regards the depth of a model as being not the depth of the computational graph but the depth of the graph describing how concepts are related to each other. In this case, the depth of the flowchart of the computations needed to compute the representation of
each concept may be much deeper than the graph of the concepts themselves. This is because the system's understanding of the simpler concepts can be refined given information about the more complex concepts. For example, an AI system observing an image of a face with one eye in shadow may initially only see one eye. After detecting that a face is present, it can then infer that a second eye is probably present as well. In this case, the graph of concepts only includes two layers—a layer for eyes and a layer for faces—but the graph of computations includes 2n layers if we refine our estimate of each concept given the other n times.

Because it is not always clear which of these two views—the depth of the computational graph, or the depth of the probabilistic modeling graph—is most relevant, and because different people choose different sets of smallest elements from which to construct their graphs, there is no single correct value for the depth of an architecture, just as there is no single correct value for the length of a computer program. Nor is there a consensus about how much depth a model requires to qualify as "deep." However, deep learning can safely be regarded as the study of models that involve a greater amount of composition of either learned functions or learned concepts than traditional machine learning does.

To summarize, deep learning, the subject of this book, is an approach to AI. Specifically, it is a type of machine learning, a technique that allows computer systems to improve with experience and data. According to the authors of this book, machine learning is the only viable approach to building AI systems that can operate in complicated, real-world environments. Deep learning is a particular kind of machine learning that achieves great power and flexibility by learning to represent the world as a nested hierarchy of concepts, with each concept defined in relation to simpler concepts, and more abstract representations computed in terms of less abstract ones. Fig. 1.4 illustrates the relationship between these different AI disciplines. Fig. 1.5 gives a high-level schematic of how each works.
1.1 Who Should Read This Book?
This book can be useful for a variety of readers, but we wrote it with two main target audiences in mind. One of these target audiences is university students (undergraduate or graduate) learning about machine learning, including those who are beginning a career in deep learning and artificial intelligence research. The other target audience is software engineers who do not have a machine learning or statistics background but want to rapidly acquire one and begin using deep learning in their product or platform. Deep learning has already proven useful in many software disciplines, including computer vision, speech and audio processing,
[Figure 1.4 Venn-diagram labels, innermost to outermost: deep learning (example: MLPs); representation learning (example: shallow autoencoders); machine learning (example: logistic regression); AI (example: knowledge bases).]
Figure 1.4: A Venn diagram showing how deep learning is a kind of representation learning, which is in turn a kind of machine learning, which is used for many but not all approaches to AI. Each section of the Venn diagram includes an example of an AI technology.
[Figure 1.5 flowchart labels. Rule-based systems: input, hand-designed program, output. Classic machine learning: input, hand-designed features, mapping from features, output. Representation learning: input, features, mapping from features, output. Deep learning: input, simple features, additional layers of more abstract features, mapping from features, output.]
Figure 1.5: Flowcharts showing how the different parts of an AI system relate to each other within different AI disciplines. Shaded boxes indicate components that are able to learn from data.
natural language processing, robotics, bioinformatics and chemistry, video games, search engines, online advertising and finance.

This book has been organized into three parts in order to best accommodate a variety of readers. Part I introduces basic mathematical tools and machine learning concepts. Part II describes the most established deep learning algorithms that are essentially solved technologies. Part III describes more speculative ideas that are widely believed to be important for future research in deep learning.

Readers should feel free to skip parts that are not relevant given their interests or background. Readers familiar with linear algebra, probability, and fundamental machine learning concepts can skip Part I, for example, while readers who just want to implement a working system need not read beyond Part II. To help choose which chapters to read, Fig. 1.6 provides a flowchart showing the high-level organization of the book.

We do assume that all readers come from a computer science background. We assume familiarity with programming, a basic understanding of computational performance issues, complexity theory, introductory level calculus and some of the terminology of graph theory.
1.2 Historical Trends in Deep Learning
It is easiest to understand deep learning with some historical context. Rather than providing a detailed history of deep learning, we identify a few key trends:

• Deep learning has had a long and rich history, but has gone by many names reflecting different philosophical viewpoints, and has waxed and waned in popularity.

• Deep learning has become more useful as the amount of available training data has increased.

• Deep learning models have grown in size over time as computer hardware and software infrastructure for deep learning has improved.

• Deep learning has solved increasingly complicated applications with increasing accuracy over time.
[Figure 1.6 chapter flowchart. 1. Introduction. Part I: Applied Math and Machine Learning Basics (2. Linear Algebra; 3. Probability and Information Theory; 4. Numerical Computation; 5. Machine Learning Basics). Part II: Deep Networks: Modern Practices (6. Deep Feedforward Networks; 7. Regularization; 8. Optimization; 9. CNNs; 10. RNNs; 11. Practical Methodology; 12. Applications). Part III: Deep Learning Research (13. Linear Factor Models; 14. Autoencoders; 15. Representation Learning; 16. Structured Probabilistic Models; 17. Monte Carlo Methods; 18. Partition Function; 19. Inference; 20. Deep Generative Models).]
Figure 1.6: The high-level organization of the book. An arrow from one chapter to another indicates that the former chapter is prerequisite material for understanding the latter.
1.2.1 The Many Names and Changing Fortunes of Neural Networks

We expect that many readers of this book have heard of deep learning as an exciting new technology, and are surprised to see a mention of "history" in a book about an emerging field. In fact, deep learning dates back to the 1940s. Deep learning only appears to be new, because it was relatively unpopular for several years preceding its current popularity, because it has gone through many different names, and because it has only recently become called "deep learning." The field has been rebranded many times, reflecting the influence of different researchers and different perspectives.

A comprehensive history of deep learning is beyond the scope of this textbook. However, some basic context is useful for understanding deep learning. Broadly speaking, there have been three waves of development of deep learning: deep learning known as cybernetics in the 1940s–1960s, deep learning known as connectionism in the 1980s–1990s, and the current resurgence under the name deep learning beginning in 2006. This is quantitatively illustrated in Fig. 1.7.

Some of the earliest learning algorithms we recognize today were intended to be computational models of biological learning, i.e. models of how learning happens or could happen in the brain. As a result, one of the names that deep learning has gone by is artificial neural networks (ANNs). The corresponding perspective on deep learning models is that they are engineered systems inspired by the biological brain (whether the human brain or the brain of another animal). While the kinds of neural networks used for machine learning have sometimes been used to understand brain function (Hinton and Shallice, 1991), they are generally not designed to be realistic models of biological function.
The neural bersp een used etoonunderstand brain function 1991idea ), they are p erspectiv ectiv ective deep learning is motiv motivated ated(Hinton by tw two o and mainShallice ideas. ,One is that generally not designed to e realistic dels of biological function. The neural the brain provides a pro proof of bby example mo that intelligen intelligent t behavior is possible, and a p ersp ectiv e on deep learning is motiv ated b y tw o main ideas. One idea is conceptually straightforw straightforward ard path to building in intelligence telligence is to rev reverse erse engineerthat the the brain provides a pro of by example that intelligen t b ehavior is p ossible, and a computational principles behind the brain and duplicate its functionalit functionality y. Another conceptually ardbpath to building intelligence is to revthe ersebrain engineer the p ersp erspectiv ectiv ectivee isstraightforw that it would e deeply interesting to understand and the computational behindintelligence, the brain and its functionalit Another principles that principles underlie human so duplicate machine learning mo models delsy.that shed p ersp ectiv e is that it w ould b e deeply interesting to understand the brain and thee ligh lightt on these basic scien scientiﬁc tiﬁc questions are useful apart from their abilit ability y to solv solve principles that underlie human intelligence, so machine learning mo dels that shed engineering applications. light on these basic scientiﬁc questions are useful apart from their ability to solve The mo modern dern term “deep learning” go goes es beyond the neuroscientiﬁc persp erspective ective engineering applications. on the curren currentt breed of mac machine hine learning models. 
It app appeals eals to a more general The mo dern term “deep learning” go es b eyond the neuroscientiﬁc ersp ective principle of learning multiple levels of comp omposition osition osition,, which can be appliedpin machine on the curren t breed of mac hine learning models. It app eals to a more general learning framew frameworks orks that are not necessarily neurally inspired. principle of learning multiple levels of composition, which can be applied in machine learning frameworks that are not necessarily neurally inspired. 13
[Figure 1.7 appears here: a plot of "Frequency of Word or Phrase" (y-axis, 0 to 0.00025) against "Year" (x-axis, 1940–2000), with curves for the phrases "cybernetics" and "connectionism + neural networks".]
Figure 1.7: The figure shows two of the three historical waves of artificial neural nets research, as measured by the frequency of the phrases "cybernetics" and "connectionism" or "neural networks" according to Google Books (the third wave is too recent to appear). The first wave started with cybernetics in the 1940s–1960s, with the development of theories of biological learning (McCulloch and Pitts, 1943; Hebb, 1949) and implementations of the first models such as the perceptron (Rosenblatt, 1958) allowing the training of a single neuron. The second wave started with the connectionist approach of the 1980–1995 period, with backpropagation (Rumelhart et al., 1986a) to train a neural network with one or two hidden layers. The current and third wave, deep learning, started around 2006 (Hinton et al., 2006; Bengio et al., 2007; Ranzato et al., 2007a), and is just now appearing in book form as of 2016. The other two waves similarly appeared in book form much later than the corresponding scientific activity occurred.
The earliest predecessors of modern deep learning were simple linear models motivated from a neuroscientific perspective. These models were designed to take a set of n input values x_1, ..., x_n and associate them with an output y. These models would learn a set of weights w_1, ..., w_n and compute their output f(x, w) = x_1 w_1 + ... + x_n w_n. This first wave of neural networks research was known as cybernetics, as illustrated in Fig. 1.7.

The McCulloch-Pitts Neuron (McCulloch and Pitts, 1943) was an early model of brain function. This linear model could recognize two different categories of inputs by testing whether f(x, w) is positive or negative. Of course, for the model to correspond to the desired definition of the categories, the weights needed to be set correctly. These weights could be set by the human operator. In the 1950s, the perceptron (Rosenblatt, 1958, 1962) became the first model that could learn the weights defining the categories given examples of inputs from each category. The adaptive linear element (ADALINE), which dates from about the same time, simply returned the value of f(x) itself to predict a real number (Widrow and Hoff, 1960), and could also learn to predict these numbers from data.

These simple learning algorithms greatly affected the modern landscape of machine learning. The training algorithm used to adapt the weights of the ADALINE was a special case of an algorithm called stochastic gradient descent. Slightly modified versions of the stochastic gradient descent algorithm remain the dominant training algorithms for deep learning models today.

Models based on the f(x, w) used by the perceptron and ADALINE are called linear models.
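This style of training can be sketched in a few lines of code. The sketch below is an illustrative toy, not the historical ADALINE implementation: the function name `adaline_sgd` and the synthetic data are ours, and the update rule is stochastic gradient descent on a squared error, in the spirit of the algorithms described here.

```python
import random

def adaline_sgd(data, lr=0.05, steps=2000, seed=0):
    """Learn weights for f(x, w) = x_1*w_1 + ... + x_n*w_n by stochastic
    gradient descent on the squared error (y_hat - y)^2 / 2."""
    rng = random.Random(seed)
    n = len(data[0][0])
    w = [0.0] * n
    for _ in range(steps):
        x, y = rng.choice(data)                       # one random example
        y_hat = sum(xi * wi for xi, wi in zip(x, w))  # f(x, w)
        err = y_hat - y
        # SGD step: w_i <- w_i - lr * err * x_i
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

# Synthetic data generated by y = 2*x1 - 3*x2; SGD recovers the weights.
data = [([x1, x2], 2 * x1 - 3 * x2)
        for x1 in (-1.0, 0.0, 1.0) for x2 in (-1.0, 0.0, 1.0)]
w = adaline_sgd(data)
print([round(wi, 2) for wi in w])  # converges to approximately [2.0, -3.0]
```

Each update shrinks the error along the sampled input direction, so when the target really is linear in the inputs, the weights converge to the values that generated the data.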
These models remain some of the most widely used machine learning models, though in many cases they are trained in different ways than the original models were trained.

Linear models have many limitations. Most famously, they cannot learn the XOR function, where f([0, 1], w) = 1 and f([1, 0], w) = 1 but f([1, 1], w) = 0 and f([0, 0], w) = 0. Critics who observed these flaws in linear models caused a backlash against biologically inspired learning in general (Minsky and Papert, 1969). This was the first major dip in the popularity of neural networks.

Today, neuroscience is regarded as an important source of inspiration for deep learning researchers, but it is no longer the predominant guide for the field.

The main reason for the diminished role of neuroscience in deep learning research today is that we simply do not have enough information about the brain to use it as a guide. To obtain a deep understanding of the actual algorithms used by the brain, we would need to be able to monitor the activity of (at the very least) thousands of interconnected neurons simultaneously. Because we are not able to do this, we are far from understanding even some of the most simple and
well-studied parts of the brain (Olshausen and Field, 2005).

Neuroscience has given us a reason to hope that a single deep learning algorithm can solve many different tasks. Neuroscientists have found that ferrets can learn to "see" with the auditory processing region of their brain if their brains are rewired to send visual signals to that area (Von Melchner et al., 2000). This suggests that much of the mammalian brain might use a single algorithm to solve most of the different tasks that the brain solves. Before this hypothesis, machine learning research was more fragmented, with different communities of researchers studying natural language processing, vision, motion planning and speech recognition. Today, these application communities are still separate, but it is common for deep learning research groups to study many or even all of these application areas simultaneously.

We are able to draw some rough guidelines from neuroscience. The basic idea of having many computational units that become intelligent only via their interactions with each other is inspired by the brain. The Neocognitron (Fukushima, 1980) introduced a powerful model architecture for processing images that was inspired by the structure of the mammalian visual system and later became the basis for the modern convolutional network (LeCun et al., 1998b), as we will see in Sec. 9.10. Most neural networks today are based on a model neuron called the rectified linear unit. The original Cognitron (Fukushima, 1975) introduced a more complicated version that was highly inspired by our knowledge of brain function. The simplified modern version was developed incorporating ideas from many viewpoints, with Nair and Hinton (2010) and Glorot et al. (2011a) citing neuroscience as an influence, and Jarrett et al. (2009) citing more engineering-oriented influences. While neuroscience is an important source of inspiration, it need not be taken as a rigid guide. We know that actual neurons compute very different functions than modern rectified linear units, but greater neural realism has not yet led to an improvement in machine learning performance. Also, while neuroscience has successfully inspired several neural network architectures, we do not yet know enough about biological learning for neuroscience to offer much guidance for the learning algorithms we use to train these architectures.

Media accounts often emphasize the similarity of deep learning to the brain. While it is true that deep learning researchers are more likely to cite the brain as an influence than researchers working in other machine learning fields such as kernel machines or Bayesian statistics, one should not view deep learning as an attempt to simulate the brain. Modern deep learning draws inspiration from many fields, especially applied math fundamentals like linear algebra, probability, information theory, and numerical optimization. While some deep learning researchers cite neuroscience as an important source of inspiration, others are not concerned with
neuroscience at all.

It is worth noting that the effort to understand how the brain works on an algorithmic level is alive and well. This endeavor is primarily known as "computational neuroscience" and is a separate field of study from deep learning. It is common for researchers to move back and forth between both fields. The field of deep learning is primarily concerned with how to build computer systems that are able to successfully solve tasks requiring intelligence, while the field of computational neuroscience is primarily concerned with building more accurate models of how the brain actually works.

In the 1980s, the second wave of neural network research emerged in great part via a movement called connectionism or parallel distributed processing (Rumelhart et al., 1986c; McClelland et al., 1995). Connectionism arose in the context of cognitive science. Cognitive science is an interdisciplinary approach to understanding the mind, combining multiple different levels of analysis. During the early 1980s, most cognitive scientists studied models of symbolic reasoning. Despite their popularity, symbolic models were difficult to explain in terms of how the brain could actually implement them using neurons. The connectionists began to study models of cognition that could actually be grounded in neural implementations (Touretzky and Minton, 1985), reviving many ideas dating back to the work of psychologist Donald Hebb in the 1940s (Hebb, 1949).

The central idea in connectionism is that a large number of simple computational units can achieve intelligent behavior when networked together. This insight applies equally to neurons in biological nervous systems and to hidden units in computational models.

Several key concepts arose during the connectionism movement of the 1980s that remain central to today's deep learning.

One of these concepts is that of distributed representation (Hinton et al., 1986). This is the idea that each input to a system should be represented by many features, and each feature should be involved in the representation of many possible inputs. For example, suppose we have a vision system that can recognize cars, trucks, and birds and these objects can each be red, green, or blue. One way of representing these inputs would be to have a separate neuron or hidden unit that activates for each of the nine possible combinations: red truck, red car, red bird, green truck, and so on. This requires nine different neurons, and each neuron must independently learn the concept of color and object identity. One way to improve on this situation is to use a distributed representation, with three neurons describing the color and three neurons describing the object identity. This requires only six neurons total instead of nine, and the neuron describing redness is able to learn about redness
from images of cars, trucks and birds, not only from images of one specific category of objects. The concept of distributed representation is central to this book, and will be described in greater detail in Chapter 15.

Another major accomplishment of the connectionist movement was the successful use of backpropagation to train deep neural networks with internal representations and the popularization of the backpropagation algorithm (Rumelhart et al., 1986a; LeCun, 1987). This algorithm has waxed and waned in popularity but as of this writing is currently the dominant approach to training deep models.

During the 1990s, researchers made important advances in modeling sequences with neural networks. Hochreiter (1991) and Bengio et al. (1994) identified some of the fundamental mathematical difficulties in modeling long sequences, described in Sec. 10.7. Hochreiter and Schmidhuber (1997) introduced the long short-term memory or LSTM network to resolve some of these difficulties. Today, the LSTM is widely used for many sequence modeling tasks, including many natural language processing tasks at Google.

The second wave of neural networks research lasted until the mid-1990s. Ventures based on neural networks and other AI technologies began to make unrealistically ambitious claims while seeking investments. When AI research did not fulfill these unreasonable expectations, investors were disappointed. Simultaneously, other fields of machine learning made advances. Kernel machines (Boser et al., 1992; Cortes and Vapnik, 1995; Schölkopf et al., 1999) and graphical models (Jordan, 1998) both achieved good results on many important tasks. These two factors led to a decline in the popularity of neural networks that lasted until 2007.
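The backpropagation algorithm discussed above is, at its core, an application of the chain rule layer by layer. A minimal sketch for a one-hidden-layer network with a squared-error loss is shown below; the parameter values are made up for illustration, and the analytic gradient is checked against a central finite difference.

```python
import math

def forward(x, W1, b1, w2, b2):
    """One hidden layer: h = tanh(W1 x + b1), y_hat = w2 . h + b2."""
    h = [math.tanh(sum(wij * xj for wij, xj in zip(row, x)) + bi)
         for row, bi in zip(W1, b1)]
    return h, sum(wi * hi for wi, hi in zip(w2, h)) + b2

def backprop(x, y, W1, b1, w2, b2):
    """Gradients of L = (y_hat - y)^2 / 2 via the chain rule."""
    h, y_hat = forward(x, W1, b1, w2, b2)
    d_out = y_hat - y                         # dL/dy_hat
    dw2 = [d_out * hi for hi in h]
    db2 = d_out
    # Through tanh: dtanh(a)/da = 1 - tanh(a)^2 = 1 - h^2
    db1 = [d_out * wi * (1 - hi * hi) for wi, hi in zip(w2, h)]
    dW1 = [[dbi * xj for xj in x] for dbi in db1]
    return dW1, db1, dw2, db2

# Illustrative parameters; compare dL/dW1[0][0] with a central difference.
x, y = [0.5, -1.0], 1.0
W1, b1 = [[0.1, 0.2], [0.3, -0.4]], [0.0, 0.1]
w2, b2 = [0.5, -0.6], 0.05
dW1, _, _, _ = backprop(x, y, W1, b1, w2, b2)

eps = 1e-6
def loss_at(w00):
    _, y_hat = forward(x, [[w00, W1[0][1]], W1[1]], b1, w2, b2)
    return (y_hat - y) ** 2 / 2

numeric = (loss_at(W1[0][0] + eps) - loss_at(W1[0][0] - eps)) / (2 * eps)
print(abs(dW1[0][0] - numeric) < 1e-6)  # prints True
```

The finite-difference check is a standard way to validate a hand-written backward pass: the two estimates agree to well within numerical precision when the chain-rule derivation is correct.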
During this time, neural networks continued to obtain impressive performance on some tasks (LeCun et al., 1998b; Bengio et al., 2001). The Canadian Institute for Advanced Research (CIFAR) helped to keep neural networks research alive via its Neural Computation and Adaptive Perception (NCAP) research initiative. This program united machine learning research groups led by Geoffrey Hinton at University of Toronto, Yoshua Bengio at University of Montreal, and Yann LeCun at New York University. The CIFAR NCAP research initiative had a multidisciplinary nature that also included neuroscientists and experts in human and computer vision.

At this point in time, deep networks were generally believed to be very difficult to train. We now know that algorithms that have existed since the 1980s work quite well, but this was not apparent circa 2006. The issue is perhaps simply that these algorithms were too computationally costly to allow much experimentation with the hardware available at the time.

The third wave of neural networks research began with a breakthrough in
CHAPTER 1. INTRODUCTION
2006. Geoffrey Hinton showed that a kind of neural network called a deep belief network could be efficiently trained using a strategy called greedy layer-wise pretraining (Hinton et al., 2006), which will be described in more detail in Sec. 15.1. The other CIFAR-affiliated research groups quickly showed that the same strategy could be used to train many other kinds of deep networks (Bengio et al., 2007; Ranzato et al., 2007a) and systematically helped to improve generalization on test examples. This wave of neural networks research popularized the use of the term deep learning to emphasize that researchers were now able to train deeper neural networks than had been possible before, and to focus attention on the theoretical importance of depth (Bengio and LeCun, 2007; Delalleau and Bengio, 2011; Pascanu et al., 2014a; Montufar et al., 2014).

At this time, deep neural networks outperformed competing AI systems based on other machine learning technologies as well as hand-designed functionality. This third wave of popularity of neural networks continues to the time of this writing, though the focus of deep learning research has changed dramatically within the time of this wave. The third wave began with a focus on new unsupervised learning techniques and the ability of deep models to generalize well from small datasets, but today there is more interest in much older supervised learning algorithms and the ability of deep models to leverage large labeled datasets.
1.2.2 Increasing Dataset Sizes
One may wonder why deep learning has only recently become recognized as a crucial technology though the first experiments with artificial neural networks were conducted in the 1950s. Deep learning has been successfully used in commercial applications since the 1990s, but was often regarded as being more of an art than a technology, and something that only an expert could use, until recently. It is true that some skill is required to get good performance from a deep learning algorithm. Fortunately, the amount of skill required reduces as the amount of training data increases. The learning algorithms reaching human performance on complex tasks today are nearly identical to the learning algorithms that struggled to solve toy problems in the 1980s, though the models we train with these algorithms have undergone changes that simplify the training of very deep architectures.

The most important new development is that today we can provide these algorithms with the resources they need to succeed. Fig. 1.8 shows how the size of benchmark datasets has increased remarkably over time. This trend is driven by the increasing digitization of society. As more and more of our activities take place on computers, more and more of what we do is recorded. As our computers are increasingly networked together, it becomes easier to centralize these records and curate them
into a dataset appropriate for machine learning applications. The age of "Big Data" has made machine learning much easier because the key burden of statistical estimation (generalizing well to new data after observing only a small amount of data) has been considerably lightened. As of 2016, a rough rule of thumb is that a supervised deep learning algorithm will generally achieve acceptable performance with around 5,000 labeled examples per category, and will match or exceed human performance when trained with a dataset containing at least 10 million labeled examples. Working successfully with datasets smaller than this is an important research area, focusing in particular on how we can take advantage of large quantities of unlabeled examples, with unsupervised or semi-supervised learning.
1.2.3 Increasing Model Sizes
Another key reason that neural networks are wildly successful today after enjoying comparatively little success since the 1980s is that we have the computational resources to run much larger models today. One of the main insights of connectionism is that animals become intelligent when many of their neurons work together. An individual neuron or small collection of neurons is not particularly useful.

Biological neurons are not especially densely connected. As seen in Fig. 1.10, our machine learning models have had a number of connections per neuron that was within an order of magnitude of even mammalian brains for decades.

In terms of the total number of neurons, neural networks have been astonishingly small until quite recently, as shown in Fig. 1.11. Since the introduction of hidden units, artificial neural networks have doubled in size roughly every 2.4 years. This growth is driven by faster computers with larger memory and by the availability of larger datasets. Larger networks are able to achieve higher accuracy on more complex tasks. This trend looks set to continue for decades. Unless new technologies allow faster scaling, artificial neural networks will not have the same number of neurons as the human brain until at least the 2050s. Biological neurons may represent more complicated functions than current artificial neurons, so biological neural networks may be even larger than this plot portrays.

In retrospect, it is not particularly surprising that neural networks with fewer neurons than a leech were unable to solve sophisticated artificial intelligence problems. Even today's networks, which we consider quite large from a computational systems point of view, are smaller than the nervous system of even relatively primitive vertebrate animals like frogs.

The increase in model size over time, due to the availability of faster CPUs,
[Figure 1.8: a log-scale plot of dataset size (number of examples, 10^0 to 10^9) versus year, 1900-2015. Marked datasets include Criminals, Iris, T vs G vs F, Rotated T vs C, MNIST, Public SVHN, ImageNet, CIFAR-10, ImageNet10k, ILSVRC 2014, Sports-1M, WMT, and the Canadian Hansard.]
Figure 1.8: Dataset sizes have increased greatly over time. In the early 1900s, statisticians studied datasets using hundreds or thousands of manually compiled measurements (Garson, 1900; Gosset, 1908; Anderson, 1935; Fisher, 1936). In the 1950s through 1980s, the pioneers of biologically inspired machine learning often worked with small, synthetic datasets, such as low-resolution bitmaps of letters, that were designed to incur low computational cost and demonstrate that neural networks were able to learn specific kinds of functions (Widrow and Hoff, 1960; Rumelhart et al., 1986b). In the 1980s and 1990s, machine learning became more statistical in nature and began to leverage larger datasets containing tens of thousands of examples such as the MNIST dataset (shown in Fig. 1.9) of scans of handwritten numbers (LeCun et al., 1998b). In the first decade of the 2000s, more sophisticated datasets of this same size, such as the CIFAR-10 dataset (Krizhevsky and Hinton, 2009), continued to be produced. Toward the end of that decade and throughout the first half of the 2010s, significantly larger datasets, containing hundreds of thousands to tens of millions of examples, completely changed what was possible with deep learning. These datasets included the public Street View House Numbers dataset (Netzer et al., 2011), various versions of the ImageNet dataset (Deng et al., 2009, 2010a; Russakovsky et al., 2014a), and the Sports-1M dataset (Karpathy et al., 2014). At the top of the graph, we see that datasets of translated sentences, such as IBM's dataset constructed from the Canadian Hansard (Brown et al., 1990) and the WMT 2014 English to French dataset (Schwenk, 2014), are typically far ahead of other dataset sizes.
Figure 1.9: Example inputs from the MNIST dataset. The "NIST" stands for National Institute of Standards and Technology, the agency that originally collected this data. The "M" stands for "modified," since the data has been preprocessed for easier use with machine learning algorithms. The MNIST dataset consists of scans of handwritten digits and associated labels describing which digit 0-9 is contained in each image. This simple classification problem is one of the simplest and most widely used tests in deep learning research. It remains popular despite being quite easy for modern techniques to solve. Geoffrey Hinton has described it as "the drosophila of machine learning," meaning that it allows machine learning researchers to study their algorithms in controlled laboratory conditions, much as biologists often study fruit flies.
the advent of general purpose GPUs (described in Sec. 12.1.2), faster network connectivity and better software infrastructure for distributed computing, is one of the most important trends in the history of deep learning. This trend is generally expected to continue well into the future.
1.2.4 Increasing Accuracy, Complexity and Real-World Impact
Since the 1980s, deep learning has consistently improved in its ability to provide accurate recognition or prediction. Moreover, deep learning has consistently been applied with success to broader and broader sets of applications.

The earliest deep models were used to recognize individual objects in tightly cropped, extremely small images (Rumelhart et al., 1986a). Since then there has been a gradual increase in the size of images neural networks could process. Modern object recognition networks process rich high-resolution photographs and do not have a requirement that the photo be cropped near the object to be recognized (Krizhevsky et al., 2012). Similarly, the earliest networks could only recognize two kinds of objects (or in some cases, the absence or presence of a single kind of object), while these modern networks typically recognize at least 1,000 different categories of objects.

The largest contest in object recognition is the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) held each year. A dramatic moment in the meteoric rise of deep learning came when a convolutional network won this challenge for the first time and by a wide margin, bringing down the state-of-the-art top-5 error rate from 26.1% to 15.3% (Krizhevsky et al., 2012), meaning that the convolutional network produces a ranked list of possible categories for each image, and the correct category appeared in the first five entries of this list for all but 15.3% of the test examples. Since then, these competitions are consistently won by deep convolutional nets, and as of this writing, advances in deep learning have brought the latest top-5 error rate in this contest down to 3.6%, as shown in Fig. 1.12.
Deep learning has also had a dramatic impact on speech recognition. After improving throughout the 1990s, the error rates for speech recognition stagnated starting in about 2000. The introduction of deep learning (Dahl et al., 2010; Deng et al., 2010b; Seide et al., 2011; Hinton et al., 2012a) to speech recognition resulted in a sudden drop of error rates, with some error rates cut in half. We will explore this history in more detail in Sec. 12.3.

Deep networks have also had spectacular successes for pedestrian detection and image segmentation (Sermanet et al., 2013; Farabet et al., 2013; Couprie et al., 2013) and yielded superhuman performance in traffic sign classification (Ciresan
[Figure 1.10: a log-scale plot of connections per neuron (10^1 to 10^4) versus year, 1950-2015, for the ten networks listed in the caption, with reference levels for the fruit fly, mouse, cat, and human.]
Figure 1.10: Initially, the number of connections between neurons in artificial neural networks was limited by hardware capabilities. Today, the number of connections between neurons is mostly a design consideration. Some artificial neural networks have nearly as many connections per neuron as a cat, and it is quite common for other neural networks to have as many connections per neuron as smaller mammals like mice. Even the human brain does not have an exorbitant number of connections per neuron. Biological neural network sizes from Wikipedia (2015).
1. Adaptive linear element (Widrow and Hoff, 1960)
2. Neocognitron (Fukushima, 1980)
3. GPU-accelerated convolutional network (Chellapilla et al., 2006)
4. Deep Boltzmann machine (Salakhutdinov and Hinton, 2009a)
5. Unsupervised convolutional network (Jarrett et al., 2009)
6. GPU-accelerated multilayer perceptron (Ciresan et al., 2010)
7. Distributed autoencoder (Le et al., 2012)
8. Multi-GPU convolutional network (Krizhevsky et al., 2012)
9. COTS HPC unsupervised convolutional network (Coates et al., 2013)
10. GoogLeNet (Szegedy et al., 2014a)
et al., 2012).

At the same time that the scale and accuracy of deep networks has increased, so has the complexity of the tasks that they can solve. Goodfellow et al. (2014d) showed that neural networks could learn to output an entire sequence of characters transcribed from an image, rather than just identifying a single object. Previously, it was widely believed that this kind of learning required labeling of the individual elements of the sequence (Gülçehre and Bengio, 2013). Recurrent neural networks, such as the LSTM sequence model mentioned above, are now used to model relationships between sequences and other sequences rather than just fixed inputs. This sequence-to-sequence learning seems to be on the cusp of revolutionizing another application: machine translation (Sutskever et al., 2014; Bahdanau et al., 2015).

This trend of increasing complexity has been pushed to its logical conclusion
with the introduction of neural Turing machines (Graves et al., 2014a) that learn to read from memory cells and write arbitrary content to memory cells. Such neural networks can learn simple programs from examples of desired behavior. For example, they can learn to sort lists of numbers given examples of scrambled and sorted sequences. This self-programming technology is in its infancy, but in the future could in principle be applied to nearly any task.

Another crowning achievement of deep learning is its extension to the domain of reinforcement learning. In the context of reinforcement learning, an autonomous agent must learn to perform a task by trial and error, without any guidance from the human operator. DeepMind demonstrated that a reinforcement learning system based on deep learning is capable of learning to play Atari video games, reaching human-level performance on many tasks (Mnih et al., 2015).
Deep learning has also significantly improved the performance of reinforcement learning for robotics (Finn et al., 2015).

Many of these applications of deep learning are highly profitable. Deep learning is now used by many top technology companies including Google, Microsoft, Facebook, IBM, Baidu, Apple, Adobe, Netflix, NVIDIA and NEC.

Advances in deep learning have also depended heavily on advances in software infrastructure. Software libraries such as Theano (Bergstra et al., 2010; Bastien et al., 2012), PyLearn2 (Goodfellow et al., 2013c), Torch (Collobert et al., 2011b), DistBelief (Dean et al., 2012), Caffe (Jia, 2013), MXNet (Chen et al., 2015), and TensorFlow (Abadi et al., 2015) have all supported important research projects or commercial products.

Deep learning has also made contributions back to other sciences. Modern
con conv volutional netw networks orks for ob object ject recognition provide a mo model del of visual pro processing cessing Deep learning has also made contributions back to other sciences. Modern 25 convolutional networks for ob ject recognition provide a model of visual processing
CHAPTER 1. INTRODUCTION
that neuroscientists can study (DiCarlo, 2013). Deep learning also provides useful tools for processing massive amounts of data and making useful predictions in scientific fields. It has been successfully used to predict how molecules will interact in order to help pharmaceutical companies design new drugs (Dahl et al., 2014), to search for subatomic particles (Baldi et al., 2014), and to automatically parse microscope images used to construct a 3-D map of the human brain (Knowles-Barley et al., 2014). We expect deep learning to appear in more and more scientific fields in the future.

In summary, deep learning is an approach to machine learning that has drawn heavily on our knowledge of the human brain, statistics and applied math as it developed over the past several decades. In recent years, it has seen tremendous growth in its popularity and usefulness, due in large part to more powerful computers, larger datasets and techniques to train deeper networks. The years ahead are full of challenges and opportunities to improve deep learning even further and bring it to new frontiers.
[Figure 1.11 plots network size (number of neurons, logarithmic scale, roughly 10^-2 to 10^11) against year, from 1950 to a projected 2056, with the numbered networks below shown alongside biological reference points: sponge, roundworm, leech, ant, bee, frog, octopus and human.]

Figure 1.11: Since the introduction of hidden units, artificial neural networks have doubled in size roughly every 2.4 years. Biological neural network sizes from Wikipedia (2015).

1. Perceptron (Rosenblatt, 1958, 1962)
2. Adaptive linear element (Widrow and Hoff, 1960)
3. Neocognitron (Fukushima, 1980)
4. Early back-propagation network (Rumelhart et al., 1986b)
5. Recurrent neural network for speech recognition (Robinson and Fallside, 1991)
6. Multilayer perceptron for speech recognition (Bengio et al., 1991)
7. Mean field sigmoid belief network (Saul et al., 1996)
8. LeNet-5 (LeCun et al., 1998b)
9. Echo state network (Jaeger and Haas, 2004)
10. Deep belief network (Hinton et al., 2006)
11. GPU-accelerated convolutional network (Chellapilla et al., 2006)
12. Deep Boltzmann machine (Salakhutdinov and Hinton, 2009a)
13. GPU-accelerated deep belief network (Raina et al., 2009)
14. Unsupervised convolutional network (Jarrett et al., 2009)
15. GPU-accelerated multilayer perceptron (Ciresan et al., 2010)
16. OMP-1 network (Coates and Ng, 2011)
17. Distributed autoencoder (Le et al., 2012)
18. Multi-GPU convolutional network (Krizhevsky et al., 2012)
19. COTS HPC unsupervised convolutional network (Coates et al., 2013)
20. GoogLeNet (Szegedy et al., 2014a)
[Figure 1.12 plots ILSVRC classification error rate against year, falling from roughly 0.28 in 2010 to below 0.05 in 2015.]

Figure 1.12: Since deep networks reached the scale necessary to compete in the ImageNet Large Scale Visual Recognition Challenge, they have consistently won the competition every year, and yielded lower and lower error rates each time. Data from Russakovsky et al. (2014b) and He et al. (2015).
Part I
Applied Math and Machine Learning Basics
This part of the book introduces the basic mathematical concepts needed to understand deep learning. We begin with general ideas from applied math that allow us to define functions of many variables, find the highest and lowest points on these functions and quantify degrees of belief.

Next, we describe the fundamental goals of machine learning. We describe how to accomplish these goals by specifying a model that represents certain beliefs, designing a cost function that measures how well those beliefs correspond with reality and using a training algorithm to minimize that cost function.

This elementary framework is the basis for a broad variety of machine learning algorithms, including approaches to machine learning that are not deep. In the subsequent parts of the book, we develop deep learning algorithms within this framework.
Chapter 2
Linear Algebra

Linear algebra is a branch of mathematics that is widely used throughout science and engineering. However, because linear algebra is a form of continuous rather than discrete mathematics, many computer scientists have little experience with it. A good understanding of linear algebra is essential for understanding and working with many machine learning algorithms, especially deep learning algorithms. We therefore precede our introduction to deep learning with a focused presentation of the key linear algebra prerequisites.

If you are already familiar with linear algebra, feel free to skip this chapter. If you have previous experience with these concepts but need a detailed reference sheet to review key formulas, we recommend The Matrix Cookbook (Petersen and Pedersen, 2006). If you have no exposure at all to linear algebra, this chapter will teach you enough to read this book, but we highly recommend that you also consult another resource focused exclusively on teaching linear algebra, such as Shilov (1977). This chapter will completely omit many important linear algebra topics that are not essential for understanding deep learning.
2.1 Scalars, Vectors, Matrices and Tensors

The study of linear algebra involves several types of mathematical objects:

• Scalars: A scalar is just a single number, in contrast to most of the other objects studied in linear algebra, which are usually arrays of multiple numbers. We write scalars in italics. We usually give scalars lowercase variable names. When we introduce them, we specify what kind of number they are. For
example, we might say “Let s ∈ R be the slope of the line,” while defining a real-valued scalar, or “Let n ∈ N be the number of units,” while defining a natural number scalar.

• Vectors: A vector is an array of numbers. The numbers are arranged in order. We can identify each individual number by its index in that ordering. Typically we give vectors lower case names written in bold typeface, such as x. The elements of the vector are identified by writing its name in italic typeface, with a subscript. The first element of x is x_1, the second element is x_2 and so on. We also need to say what kind of numbers are stored in the vector. If each element is in R, and the vector has n elements, then the vector lies in the set formed by taking the Cartesian product of R n times, denoted as R^n. When we need to explicitly identify the elements of a vector, we write them as a column enclosed in square brackets:

        ⎡ x_1 ⎤
    x = ⎢ x_2 ⎥        (2.1)
        ⎢  ⋮  ⎥
        ⎣ x_n ⎦

We can think of vectors as identifying points in space, with each element giving the coordinate along a different axis.

Sometimes we need to index a set of elements of a vector. In this case, we define a set containing the indices and write the set as a subscript. For example, to access x_1, x_3 and x_6, we define the set S = {1, 3, 6} and write x_S. We use the − sign to index the complement of a set. For example x_{−1} is the vector containing all elements of x except for x_1, and x_{−S} is the vector containing all of the elements of x except for x_1, x_3 and x_6.

• Matrices: A matrix is a 2-D array of numbers, so each element is identified by two indices instead of just one. We usually give matrices uppercase variable names with bold typeface, such as A. If a real-valued matrix A has a height of m and a width of n, then we say that A ∈ R^{m×n}. We usually identify the elements of a matrix using its name in italic but not bold font, and the indices are listed with separating commas. For example, A_{1,1} is the upper left entry of A and A_{m,n} is the bottom right entry. We can identify all the numbers with vertical coordinate i by writing a “:” for the horizontal coordinate. For example, A_{i,:} denotes the horizontal cross section of A with vertical coordinate i. This is known as the i-th row of A. Likewise, A_{:,i} is
        ⎡ A_{1,1}  A_{1,2} ⎤
    A = ⎢ A_{2,1}  A_{2,2} ⎥   ⟹   A^⊤ = ⎡ A_{1,1}  A_{2,1}  A_{3,1} ⎤
        ⎣ A_{3,1}  A_{3,2} ⎦               ⎣ A_{1,2}  A_{2,2}  A_{3,2} ⎦

Figure 2.1: The transpose of the matrix can be thought of as a mirror image across the main diagonal.
the i-th column of A. When we need to explicitly identify the elements of a matrix, we write them as an array enclosed in square brackets:

    ⎡ A_{1,1}  A_{1,2} ⎤        (2.2)
    ⎣ A_{2,1}  A_{2,2} ⎦

Sometimes we may need to index matrix-valued expressions that are not just a single letter. In this case, we use subscripts after the expression, but do not convert anything to lower case. For example, f(A)_{i,j} gives element (i, j) of the matrix computed by applying the function f to A.

• Tensors: In some cases we will need an array with more than two axes. In the general case, an array of numbers arranged on a regular grid with a variable number of axes is known as a tensor. We denote a tensor named “A” with this typeface: A. We identify the element of A at coordinates (i, j, k) by writing A_{i,j,k}.

One important operation on matrices is the transpose. The transpose of a matrix is the mirror image of the matrix across a diagonal line, called the main diagonal, running down and to the right, starting from its upper left corner. See Fig. 2.1 for a graphical depiction of this operation. We denote the transpose of a matrix A as A^⊤, and it is defined such that

    (A^⊤)_{i,j} = A_{j,i}.        (2.3)

Vectors can be thought of as matrices that contain only one column. The transpose of a vector is therefore a matrix with only one row. Sometimes we
define a vector by writing out its elements in the text inline as a row matrix, then using the transpose operator to turn it into a standard column vector, e.g., x = [x_1, x_2, x_3]^⊤.

A scalar can be thought of as a matrix with only a single entry. From this, we can see that a scalar is its own transpose: a = a^⊤.

We can add matrices to each other, as long as they have the same shape, just by adding their corresponding elements: C = A + B where C_{i,j} = A_{i,j} + B_{i,j}. We can also add a scalar to a matrix or multiply a matrix by a scalar, just by performing that operation on each element of a matrix: D = a · B + c where D_{i,j} = a · B_{i,j} + c.

In the context of deep learning, we also use some less conventional notation. We allow the addition of a matrix and a vector, yielding another matrix: C = A + b, where C_{i,j} = A_{i,j} + b_j. In other words, the vector b is added to each row of the matrix. This shorthand eliminates the need to define a matrix with b copied into each row before doing the addition. This implicit copying of b to many locations is called broadcasting.
2.2 Multiplying Matrices and Vectors

One of the most important operations involving matrices is multiplication of two matrices. The matrix product of matrices A and B is a third matrix C. In order for this product to be defined, A must have the same number of columns as B has rows. If A is of shape m × n and B is of shape n × p, then C is of shape m × p. We can write the matrix product just by placing two or more matrices together, e.g.

    C = AB.        (2.4)

The product operation is defined by

    C_{i,j} = Σ_k A_{i,k} B_{k,j}.        (2.5)
Note that the standard product of two matrices is not just a matrix containing the product of the individual elements. Such an operation exists and is called the element-wise product or Hadamard product, and is denoted as A ⊙ B.

The dot product between two vectors x and y of the same dimensionality is the matrix product x^⊤y. We can think of the matrix product C = AB as computing C_{i,j} as the dot product between row i of A and column j of B.
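The distinction between the matrix product, the Hadamard product and the dot product can be made concrete in a short sketch (NumPy here is an assumption of the illustration; the book defines only the math):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])

C = A @ B        # matrix product, Eqs. 2.4-2.5
H = A * B        # element-wise (Hadamard) product: a different operation

# C_{i,j} is the dot product of row i of A and column j of B.
assert C[0, 1] == A[0, :] @ B[:, 1]

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])
dot = x @ y      # dot product x^T y, a scalar
```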
Matrix product operations have many useful properties that make mathematical analysis of matrices more convenient. For example, matrix multiplication is distributive:

    A(B + C) = AB + AC.        (2.6)

It is also associative:

    A(BC) = (AB)C.        (2.7)

Matrix multiplication is not commutative (the condition AB = BA does not always hold), unlike scalar multiplication. However, the dot product between two vectors is commutative:

    x^⊤y = y^⊤x.        (2.8)

The transpose of a matrix product has a simple form:

    (AB)^⊤ = B^⊤A^⊤.        (2.9)

This allows us to demonstrate Eq. 2.8, by exploiting the fact that the value of such a product is a scalar and therefore equal to its own transpose:

    x^⊤y = (x^⊤y)^⊤ = y^⊤x.        (2.10)
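These identities can be spot-checked on random matrices, up to floating-point rounding. A minimal sketch, with NumPy and the particular random seed being assumptions of the example rather than anything from the book:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
C = rng.standard_normal((3, 3))

# Distributive (Eq. 2.6) and associative (Eq. 2.7) properties.
assert np.allclose(A @ (B + C), A @ B + A @ C)
assert np.allclose(A @ (B @ C), (A @ B) @ C)

# Transpose of a product reverses the order (Eq. 2.9).
assert np.allclose((A @ B).T, B.T @ A.T)

# Matrix multiplication is generally NOT commutative: for generic
# random matrices, AB and BA differ.
not_commutative = not np.allclose(A @ B, B @ A)
```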
Since the focus of this textbook is not linear algebra, we do not attempt to develop a comprehensive list of useful properties of the matrix product here, but the reader should be aware that many more exist.

We now know enough linear algebra to write down a system of linear equations:

    Ax = b        (2.11)

where A ∈ R^{m×n} is a known matrix, b ∈ R^m is a known vector, and x ∈ R^n is a vector of unknown variables we would like to solve for. Each element x_i of x is one of these unknown variables. Each row of A and each element of b provide another constraint. We can rewrite Eq. 2.11 as:

    A_{1,:} x = b_1        (2.12)
    A_{2,:} x = b_2        (2.13)
    ⋮        (2.14)
    A_{m,:} x = b_m        (2.15)

or, even more explicitly, as:

    A_{1,1} x_1 + A_{1,2} x_2 + ⋯ + A_{1,n} x_n = b_1        (2.16)
    ⎡ 1  0  0 ⎤
    ⎢ 0  1  0 ⎥
    ⎣ 0  0  1 ⎦

Figure 2.2: Example identity matrix: This is I_3.
    A_{2,1} x_1 + A_{2,2} x_2 + ⋯ + A_{2,n} x_n = b_2        (2.17)
    ⋮        (2.18)
    A_{m,1} x_1 + A_{m,2} x_2 + ⋯ + A_{m,n} x_n = b_m.        (2.19)

Matrix-vector product notation provides a more compact representation for equations of this form.
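The compactness this notation buys can be checked directly: each row-wise constraint A_{i,:} x = b_i is one component of the single product Ax. A small sketch, using NumPy (an assumption of the illustration) and 0-based indices:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0],
              [0.0, 4.0]])      # m = 3 constraints, n = 2 unknowns
x = np.array([1.0, 2.0])

b = A @ x                        # all m constraints evaluated at once

# Row i of A paired with element i of b is one scalar constraint
# (the expanded form of Eqs. 2.12-2.15).
for i in range(A.shape[0]):
    assert A[i, :] @ x == b[i]
```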
2.3 Identity and Inverse Matrices

Linear algebra offers a powerful tool called matrix inversion that allows us to analytically solve Eq. 2.11 for many values of A.

To describe matrix inversion, we first need to define the concept of an identity matrix. An identity matrix is a matrix that does not change any vector when we multiply that vector by that matrix. We denote the identity matrix that preserves n-dimensional vectors as I_n. Formally, I_n ∈ R^{n×n}, and

    ∀x ∈ R^n, I_n x = x.        (2.20)

The structure of the identity matrix is simple: all of the entries along the main diagonal are 1, while all of the other entries are zero. See Fig. 2.2 for an example.

The matrix inverse of A is denoted as A^{−1}, and it is defined as the matrix such that

    A^{−1} A = I_n.        (2.21)

We can now solve Eq. 2.11 by the following steps:

    Ax = b        (2.22)
    A^{−1} Ax = A^{−1} b        (2.23)
    I_n x = A^{−1} b        (2.24)
    x = A^{−1} b.        (2.25)

Of course, this depends on it being possible to find A^{−1}. We discuss the conditions for the existence of A^{−1} in the following section.

When A^{−1} exists, several different algorithms exist for finding it in closed form. In theory, the same inverse matrix can then be used to solve the equation many times for different values of b. However, A^{−1} is primarily useful as a theoretical tool, and should not actually be used in practice for most software applications. Because A^{−1} can be represented with only limited precision on a digital computer, algorithms that make use of the value of b can usually obtain more accurate estimates of x.
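The advice above, preferring algorithms that use b directly over forming A^{−1}, corresponds in practice to calling a linear solver rather than inverting explicitly. A sketch in NumPy, offered as an illustration under that assumption rather than as the book's own code:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

# Theoretical route: x = A^{-1} b (Eq. 2.25).
x_inv = np.linalg.inv(A) @ b

# Preferred in practice: solve Ax = b directly, without forming A^{-1}.
x_solve = np.linalg.solve(A, b)

assert np.allclose(A @ x_solve, b)   # the solution satisfies the system
assert np.allclose(x_inv, x_solve)   # both routes agree here
```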
2.4
Linear Dep Dependence endence and Span
In for A−1 to exist, Eq. 2.11 must have e exactly one solution for every value 2.4orderLinear Dep endence andhav Span of b. How However, ever, it is also possible for the system of equations to hav havee no solutions A In order for to exist, Eq. 2.11 must hav e exactly one solution value or inﬁnitely many solutions for some values of b. It is not possiblefor to every ha have ve more of b. one Howbut ever, it than is alsoinﬁnitely possibleman for ythe system for of equations tobhav no solutions x and y than less many solutions a particular ; if eboth b or inﬁnitely many solutions for some v alues of . It is not p ossible to ha ve more are solutions then than one but less than inﬁnitelyzman and y = αyxsolutions + (1 − α)for y a particular b ; if both x (2.26) are solutions then is also a solution for any real αz. = αx + (1 α)y (2.26) To analyze ho how w man many y solutions the equation − has, we can think of the columns is also a solution for any real α. of A as sp specifying ecifying diﬀerent directions we can tra trave ve vell from the origin (the point T o analyze ho w man y solutions the equation has, wewcan think of the columns sp speciﬁed eciﬁed by the vector of all zeros), and determine ho how many wa ways ys there are of A of as sp ecifying diﬀerent directions we can tra ve l from the origin (the peloint reac reaching hing b. In this view, each element of x sp speciﬁes eciﬁes ho how w far we should trav travel in sp eciﬁed by the vector ofwith all zeros), and determine many ys direction there are of of xi sp eac each h of these directions, specifying ecifying how far ho towmo mov ve inwa the reachingi:b. In this view, each element of x speciﬁes how far we should travel in column X each of these directions, with xAx sp= ecifying how. far to move in the direction of x iA (2.27) :,i column i: i Ax = xA . (2.27) In general, this kind of op operation eration is called a line linear ar combination ombination.. 
Formally, a linear combination of some set of vectors {v^{(1)}, . . . , v^{(n)}} is given by multiplying each vector v^{(i)} by a corresponding scalar coefficient c_i and adding the results:
Σ_i c_i v^{(i)}.    (2.28)

The span of a set of vectors is the set of all points obtainable by linear combination of the original vectors.
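Eq. 2.27 can be verified directly: multiplying a matrix by a vector gives the same result as explicitly forming the linear combination of its columns. A short NumPy sketch (values chosen arbitrarily for illustration):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [3.0, 0.0]])
x = np.array([2.0, -1.0])

# The linear combination of the columns of A, weighted by the entries
# of x, as in Eq. 2.27.
combo = x[0] * A[:, 0] + x[1] * A[:, 1]

assert np.allclose(combo, A @ x)  # identical to matrix-vector multiplication
```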
CHAPTER 2. LINEAR ALGEBRA
Determining whether Ax = b has a solution thus amounts to testing whether b is in the span of the columns of A. This particular span is known as the column space or the range of A.

In order for the system Ax = b to have a solution for all values of b ∈ R^m, we therefore require that the column space of A be all of R^m. If any point in R^m is excluded from the column space, that point is a potential value of b that has no solution. The requirement that the column space of A be all of R^m immediately implies that A must have at least m columns, i.e., n ≥ m. Otherwise, the dimensionality of the column space would be less than m. For example, consider a 3 × 2 matrix. The target b is 3-D, but x is only 2-D, so modifying the value of x at best allows us to trace out a 2-D plane within R^3. The equation has a solution if and only if b lies on that plane.

Having n ≥ m is only a necessary condition for every point to have a solution.
It is not a sufficient condition, because it is possible for some of the columns to be redundant. Consider a 2 × 2 matrix where both of the columns are equal to each other. This has the same column space as a 2 × 1 matrix containing only one copy of the replicated column. In other words, the column space is still just a line, and fails to encompass all of R^2, even though there are two columns.

Formally, this kind of redundancy is known as linear dependence. A set of vectors is linearly independent if no vector in the set is a linear combination of the other vectors. If we add a vector to a set that is a linear combination of the other vectors in the set, the new vector does not add any points to the set's span. This means that for the column space of the matrix to encompass all of R^m, the matrix must contain at least one set of m linearly independent columns. This condition is both necessary and sufficient for Eq. 2.11 to have a solution for every value of b.
Note that the requirement is for a set to have exactly m linearly independent columns, not at least m. No set of m-dimensional vectors can have more than m mutually linearly independent columns, but a matrix with more than m columns may have more than one such set.

In order for the matrix to have an inverse, we additionally need to ensure that Eq. 2.11 has at most one solution for each value of b. To do so, we need to ensure that the matrix has at most m columns. Otherwise there is more than one way of parametrizing each solution.

Together, this means that the matrix must be square, that is, we require that m = n and that all of the columns be linearly independent. A square matrix with linearly dependent columns is known as singular.

If A is not square or is square but singular, it can still be possible to solve the
equation. However, we cannot use the method of matrix inversion to find the solution.

So far we have discussed matrix inverses as being multiplied on the left. It is also possible to define an inverse that is multiplied on the right:

AA^{-1} = I.    (2.29)

For square matrices, the left inverse and right inverse are equal.
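A matrix with linearly dependent columns can be detected numerically through its rank, and attempting to invert it fails. A brief NumPy sketch (the example matrix is ours, mirroring the two-equal-columns example above):

```python
import numpy as np

# Two equal columns: the columns are linearly dependent, so the
# matrix is singular and its column space is only a line.
A = np.array([[1.0, 1.0],
              [2.0, 2.0]])

rank = np.linalg.matrix_rank(A)  # 1, not 2

inverted = True
try:
    np.linalg.inv(A)  # raises LinAlgError: the matrix is singular
except np.linalg.LinAlgError:
    inverted = False
```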
2.5 Norms
Sometimes we need to measure the size of a vector. In machine learning, we usually measure the size of vectors using a function called a norm. Formally, the L^p norm is given by

||x||_p = ( Σ_i |x_i|^p )^{1/p}    (2.30)

for p ∈ R, p ≥ 1.

Norms, including the L^p norm, are functions mapping vectors to non-negative values. On an intuitive level, the norm of a vector x measures the distance from the origin to the point x. More rigorously, a norm is any function f that satisfies the following properties:

• f(x) = 0 ⇒ x = 0
• f(x + y) ≤ f(x) + f(y) (the triangle inequality)
• ∀ α ∈ R, f(αx) = |α| f(x)

The L^2 norm, with p = 2, is known as the Euclidean norm. It is simply the Euclidean distance from the origin to the point identified by x. The L^2 norm is used so frequently in machine learning that it is often denoted simply as ||x||, with the subscript 2 omitted. It is also common to measure the size of a vector using the squared L^2 norm, which can be calculated simply as x^T x.

The squared L^2 norm is more convenient to work with mathematically and computationally than the L^2 norm itself. For example, the derivatives of the squared L^2 norm with respect to each element of x each depend only on the corresponding element of x, while all of the derivatives of the L^2 norm depend on the entire vector. In many contexts, the squared L^2 norm may be undesirable
because it increases very slowly near the origin. In several machine learning applications, it is important to discriminate between elements that are exactly zero and elements that are small but nonzero. In these cases, we turn to a function that grows at the same rate in all locations, but retains mathematical simplicity: the L^1 norm. The L^1 norm may be simplified to

||x||_1 = Σ_i |x_i|.    (2.31)
The L^1 norm is commonly used in machine learning when the difference between zero and nonzero elements is very important. Every time an element of x moves away from 0 by ε, the L^1 norm increases by ε.

We sometimes measure the size of the vector by counting its number of nonzero elements. Some authors refer to this function as the "L^0 norm," but this is incorrect terminology. The number of nonzero entries in a vector is not a norm, because scaling the vector by α does not change the number of nonzero entries. The L^1 norm is often used as a substitute for the number of nonzero entries.

One other norm that commonly arises in machine learning is the L^∞ norm, also known as the max norm. This norm simplifies to the absolute value of the element with the largest magnitude in the vector,

||x||_∞ = max_i |x_i|.    (2.32)
Sometimes we may also wish to measure the size of a matrix. In the context of deep learning, the most common way to do this is with the otherwise obscure Frobenius norm

||A||_F = sqrt( Σ_{i,j} A_{i,j}^2 ),    (2.33)
which is analogous to the L^2 norm of a vector.

The dot product of two vectors can be rewritten in terms of norms. Specifically,

x^T y = ||x||_2 ||y||_2 cos θ,    (2.34)

where θ is the angle between x and y.
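All of these norms are available through `np.linalg.norm`. A quick sketch (the vector and matrix values are our own, chosen so the results are easy to check by hand):

```python
import numpy as np

x = np.array([3.0, -4.0])

l1 = np.linalg.norm(x, 1)         # L^1 norm: |3| + |-4| = 7
l2 = np.linalg.norm(x)            # L^2 (Euclidean) norm: sqrt(9 + 16) = 5
linf = np.linalg.norm(x, np.inf)  # max norm: max(|3|, |-4|) = 4

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
fro = np.linalg.norm(A, 'fro')    # Frobenius norm: sqrt(1 + 4 + 9 + 16)

# Eq. 2.34: the dot product in terms of norms and the angle between vectors.
y = np.array([1.0, 0.0])
cos_theta = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))  # 3/5
```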
2.6 Special Kinds of Matrices and Vectors
Some special kinds of matrices and vectors are particularly useful.
Diagonal matrices consist mostly of zeros and have nonzero entries only along the main diagonal. Formally, a matrix D is diagonal if and only if D_{i,j} = 0 for all i ≠ j. We have already seen one example of a diagonal matrix: the identity matrix, where all of the diagonal entries are 1. We write diag(v) to denote a square diagonal matrix whose diagonal entries are given by the entries of the vector v. Diagonal matrices are of interest in part because multiplying by a diagonal matrix is very computationally efficient. To compute diag(v)x, we only need to scale each element x_i by v_i. In other words, diag(v)x = v ⊙ x. Inverting a square diagonal matrix is also efficient. The inverse exists only if every diagonal entry is nonzero, and in that case, diag(v)^{-1} = diag([1/v_1, . . . , 1/v_n]^T). In many cases, we may derive some very general machine learning algorithm in terms of arbitrary matrices,
but obtain a less expensive (and less descriptive) algorithm by restricting some matrices to be diagonal.

Not all diagonal matrices need be square. It is possible to construct a rectangular diagonal matrix. Non-square diagonal matrices do not have inverses, but it is still possible to multiply by them cheaply. For a non-square diagonal matrix D, the product Dx will involve scaling each element of x, and either concatenating some zeros to the result if D is taller than it is wide, or discarding some of the last elements of the vector if D is wider than it is tall.

A symmetric matrix is any matrix that is equal to its own transpose:

A = A^T.    (2.35)

Symmetric matrices often arise when the entries are generated by some function of two arguments that does not depend on the order of the arguments. For example,
if A is a matrix of distance measurements, with A_{i,j} giving the distance from point i to point j, then A_{i,j} = A_{j,i} because distance functions are symmetric.

A unit vector is a vector with unit norm:

||x||_2 = 1.    (2.36)

A vector x and a vector y are orthogonal to each other if x^T y = 0. If both vectors have nonzero norm, this means that they are at a 90 degree angle to each other. In R^n, at most n vectors may be mutually orthogonal with nonzero norm. If the vectors are not only orthogonal but also have unit norm, we call them orthonormal.

An orthogonal matrix is a square matrix whose rows are mutually orthonormal and whose columns are mutually orthonormal:

A^T A = AA^T = I.    (2.37)
This implies that

A^{-1} = A^T,    (2.38)

so orthogonal matrices are of interest because their inverse is very cheap to compute. Pay careful attention to the definition of orthogonal matrices. Counterintuitively, their rows are not merely orthogonal but fully orthonormal. There is no special term for a matrix whose rows or columns are orthogonal but not orthonormal.
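The identities above are easy to confirm numerically. The sketch below (ours; it uses a 2-D rotation as the example orthogonal matrix) checks the elementwise diagonal product, the diagonal inverse, and Eq. 2.38:

```python
import numpy as np

# diag(v) x is just elementwise scaling of x by v.
v = np.array([1.0, 2.0, 4.0])
x = np.array([3.0, 3.0, 3.0])
assert np.allclose(np.diag(v) @ x, v * x)

# The inverse of diag(v) is diag([1/v_1, ..., 1/v_n]) when all v_i != 0.
assert np.allclose(np.linalg.inv(np.diag(v)), np.diag(1.0 / v))

# A 2-D rotation is an orthogonal matrix: its rows and columns are
# orthonormal, so its inverse is simply its transpose (Eq. 2.38).
t = 0.3
Q = np.array([[np.cos(t), -np.sin(t)],
              [np.sin(t),  np.cos(t)]])
assert np.allclose(Q.T @ Q, np.eye(2))
assert np.allclose(np.linalg.inv(Q), Q.T)
```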
2.7 Eigendecomposition
Many mathematical objects can be understood better by breaking them into constituent parts, or finding some properties of them that are universal, not caused by the way we choose to represent them.

For example, integers can be decomposed into prime factors. The way we represent the number 12 will change depending on whether we write it in base ten or in binary, but it will always be true that 12 = 2 × 2 × 3. From this representation we can conclude useful properties, such as that 12 is not divisible by 5, or that any integer multiple of 12 will be divisible by 3.

Much as we can discover something about the true nature of an integer by decomposing it into prime factors, we can also decompose matrices in ways that show us information about their functional properties that is not obvious from the representation of the matrix as an array of elements.
One of the most widely used kinds of matrix decomposition is called eigendecomposition, in which we decompose a matrix into a set of eigenvectors and eigenvalues.

An eigenvector of a square matrix A is a nonzero vector v such that multiplication by A alters only the scale of v:

Av = λv.    (2.39)

The scalar λ is known as the eigenvalue corresponding to this eigenvector. (One can also find a left eigenvector such that v^T A = λv^T, but we are usually concerned with right eigenvectors.)

If v is an eigenvector of A, then so is any rescaled vector sv for s ∈ R, s ≠ 0. Moreover, sv still has the same eigenvalue. For this reason, we usually only look for unit eigenvectors.

Suppose that a matrix A has n linearly independent eigenvectors, {v^{(1)}, . . . , v^{(n)}}, with corresponding eigenvalues {λ_1, . . . , λ_n}. We may concatenate all of the
We ma may y concatenate all of the v(n) } , with corresp Suppose that a matrix A has n linearly indep endent eigenvectors, v , . . . , , with corresponding eigenvalues λ42, . . . , λ . We may concatenate{all of the v } { }
[Figure 2.3: two panels titled "Before multiplication" (axes x_0, x_1, showing the unit circle with eigenvectors v^{(1)} and v^{(2)}) and "After multiplication" (axes x_0', x_1', showing the distorted circle with λ_1 v^{(1)} and λ_2 v^{(2)}).]
Figure 2.3: An example of the effect of eigenvectors and eigenvalues. Here, we have a matrix A with two orthonormal eigenvectors, v^{(1)} with eigenvalue λ_1 and v^{(2)} with eigenvalue λ_2. (Left) We plot the set of all unit vectors u ∈ R^2 as a unit circle. (Right) We plot the set of all points Au. By observing the way that A distorts the unit circle, we can see that it scales space in direction v^{(i)} by λ_i.
eigenvectors to form a matrix V with one eigenvector per column: V = [v^{(1)}, . . . , v^{(n)}]. Likewise, we can concatenate the eigenvalues to form a vector λ = [λ_1, . . . , λ_n]^T. The eigendecomposition of A is then given by

A = V diag(λ) V^{-1}.    (2.40)
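Numerically, `np.linalg.eig` returns exactly these objects: a vector of eigenvalues and a matrix with one eigenvector per column. A sketch (the example matrix is ours, chosen to have real, distinct eigenvalues 5 and 2):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

# Eigenvalues 5 and 2 (in some order); eigenvectors in the columns of V.
lam, V = np.linalg.eig(A)

# Each column of V satisfies Eq. 2.39: A v = lambda v.
for i in range(2):
    assert np.allclose(A @ V[:, i], lam[i] * V[:, i])

# Eq. 2.40: reassemble A from its eigendecomposition.
assert np.allclose(V @ np.diag(lam) @ np.linalg.inv(V), A)
```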
We have seen that constructing matrices with specific eigenvalues and eigenvectors allows us to stretch space in desired directions. However, we often want to decompose matrices into their eigenvalues and eigenvectors. Doing so can help us to analyze certain properties of the matrix, much as decomposing an integer into its prime factors can help us understand the behavior of that integer.

Not every matrix can be decomposed into eigenvalues and eigenvectors. In some
cases, the decomposition exists, but may involve complex rather than real numbers. Fortunately, in this book, we usually need to decompose only a specific class of matrices that have a simple decomposition. Specifically, every real symmetric matrix can be decomposed into an expression using only real-valued eigenvectors and eigenvalues:

A = QΛQ^T,    (2.41)

where Q is an orthogonal matrix composed of eigenvectors of A, and Λ is a diagonal matrix. The eigenvalue Λ_{i,i} is associated with the eigenvector in column i of Q, denoted as Q_{:,i}. Because Q is an orthogonal matrix, we can think of A as scaling space by λ_i in direction v^{(i)}. See Fig. 2.3 for an example.

While any real symmetric matrix A is guaranteed to have an eigendecomposition, the eigendecomposition may not be unique. If any two or more eigenvectors
any tw o oralently more eigenv are eigenvectors ectors with that eigenvalue, and weIfcould equiv equivalently chooseectors aQ share the same eigenv alue, then an y set of orthogonal vectors lying in their span using those eigenv eigenvectors ectors instead. By con conv ven ention, tion, we usually sort the en entries tries of Λ are also eigenv ectors with that eigenv alue, and w e could equiv alently choose aQ in descending order. Under this conv convention, ention, the eigendecomp eigendecomposition osition is unique only using eigenvalues ectorsare instead. if all ofthose the eigenv eigenvalues unique.By convention, we usually sort the entries of Λ in descending order. Under this convention, the eigendecomposition is unique only The eigendecomp eigendecomposition osition of a matrix tells us man many y useful facts about the if all of the eigenvalues are unique. matrix. The matrix is singular if and only if any of the eigenv eigenvalues alues are 0. The The eigendecomp osition of a matrix tells us man y useful factstoabout the eigendecomp eigendecomposition osition of a real symmetric matrix can also be used optimize matrix. The matrix isof singular if any of the eigenv alues are 0. The x) =only x> Ax x x 2 = quadratic expressions the form iff (and sub subject ject to  11.. Whenever eigendecomp osition of a real symmetric matrix can also be used to optimize is equal to an eigenv eigenvector ector of A, f tak takes es on the value of the corresponding eigenv eigenvalue. alue. f ( x ) = x Ax x x quadratic expressions of the form sub ject to = 1 . Whenever The maxim maximum um value of f within the constrain constraintt region is the maximum eigenv eigenvalue alue A f is equal to a n eigenv ector of , tak es on the v alue of the corresponding eigenv alue.  um eigen and its minim minimum um value within the constraint region is theminim minimum eigenv value. 
A matrix whose eigenvalues are all positive is called positive definite. A matrix whose eigenvalues are all positive or zero-valued is called positive semidefinite. Likewise, if all eigenvalues are negative, the matrix is negative definite, and if all eigenvalues are negative or zero-valued, it is negative semidefinite. Positive semidefinite matrices are interesting because they guarantee that ∀x, x^T A x ≥ 0. Positive definite matrices additionally guarantee that x^T A x = 0 ⇒ x = 0.

2.8 Singular Value Decomposition

In Sec. 2.7, we saw how to decompose a matrix into eigenvectors and eigenvalues. The singular value decomposition (SVD) provides another way to factorize a matrix, into singular vectors and singular values. The SVD allows us to discover some of the same kind of information as the eigendecomposition. However, the SVD is
Ho How wev ever, er, the SVD is into singular vectors and singular values. The SVD allows us to discover some of the same kind of information as the eigendecomp osition. However, the SVD is 44
CHAPTER 2. LINEAR ALGEBRA
more generally applicable. Every real matrix has a singular value decomposition, but the same is not true of the eigenvalue decomposition. For example, if a matrix is not square, the eigendecomposition is not defined, and we must use a singular value decomposition instead.

Recall that the eigendecomposition involves analyzing a matrix A to discover a matrix V of eigenvectors and a vector of eigenvalues λ such that we can rewrite A as

    A = V diag(λ) V^{-1}.    (2.42)

The singular value decomposition is similar, except this time we will write A as a product of three matrices:

    A = U D V^T.    (2.43)

Suppose that A is an m × n matrix. Then U is defined to be an m × m matrix, D to be an m × n matrix, and V to be an n × n matrix.

Each of these matrices is defined to have a special structure. The matrices U and V are both defined to be orthogonal matrices. The matrix D is defined to be a diagonal matrix. Note that D is not necessarily square.

The elements along the diagonal of D are known as the singular values of the matrix A. The columns of U are known as the left-singular vectors. The columns of V are known as the right-singular vectors.

We can actually interpret the singular value decomposition of A in terms of the eigendecomposition of functions of A. The left-singular vectors of A are the eigenvectors of A A^T. The right-singular vectors of A are the eigenvectors of A^T A. The nonzero singular values of A are the square roots of the eigenvalues of A^T A. The same is true for A A^T.

Perhaps the most useful feature of the SVD is that we can use it to partially generalize matrix inversion to non-square matrices, as we will see in the next section.
2.9 The Moore-Penrose Pseudoinverse
Matrix inversion is not defined for matrices that are not square. Suppose we want to make a left-inverse B of a matrix A, so that we can solve a linear equation

    Ax = y    (2.44)

by left-multiplying each side to obtain

    x = By.    (2.45)

Depending on the structure of the problem, it may not be possible to design a unique mapping from A to B.

If A is taller than it is wide, then it is possible for this equation to have no solution. If A is wider than it is tall, then there could be multiple possible solutions.

The Moore-Penrose pseudoinverse allows us to make some headway in these cases. The pseudoinverse of A is defined as a matrix

    A^+ = lim_{α↘0} (A^T A + αI)^{-1} A^T.    (2.46)

Practical algorithms for computing the pseudoinverse are not based on this definition, but rather the formula

    A^+ = V D^+ U^T,    (2.47)

where U, D and V are the singular value decomposition of A, and the pseudoinverse D^+ of a diagonal matrix D is obtained by taking the reciprocal of its nonzero elements, then taking the transpose of the resulting matrix.

When A has more columns than rows, then solving a linear equation using the pseudoinverse provides one of the many possible solutions. Specifically, it provides the solution x = A^+ y with minimal Euclidean norm ||x||_2 among all possible solutions.

When A has more rows than columns, it is possible for there to be no solution. In this case, using the pseudoinverse gives us the x for which Ax is as close as possible to y in terms of Euclidean norm ||Ax − y||_2.
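As a sketch of Eq. 2.47 (again using NumPy purely for illustration; the wide matrix and seed are arbitrary choices), we can build the pseudoinverse from the SVD and check the minimal-norm property for an underdetermined system:

```python
import numpy as np

rng = np.random.default_rng(1)

# A wide matrix: Ax = y has many solutions, and the pseudoinverse
# selects the one with minimal Euclidean norm.
A = rng.standard_normal((3, 5))
y = rng.standard_normal(3)

# Pseudoinverse built from the SVD, following Eq. 2.47: A+ = V D+ U^T,
# where D+ reciprocates the nonzero singular values and transposes.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T
assert np.allclose(A_pinv, np.linalg.pinv(A))

x = A_pinv @ y
assert np.allclose(A @ x, y)  # x really solves the system

# Any other solution differs from x by a null-space vector, and such a
# vector is orthogonal to x, so it can only increase the norm.
null_vec = rng.standard_normal(5)
null_vec -= A_pinv @ (A @ null_vec)  # project onto the null space of A
assert np.allclose(A @ null_vec, 0)
assert np.linalg.norm(x) <= np.linalg.norm(x + null_vec)
```

The reciprocal `1.0 / s` is safe here only because this random matrix has full rank; a production routine (like `np.linalg.pinv`) zeroes out singular values below a tolerance instead.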
2.10 The Trace Operator

The trace operator gives the sum of all of the diagonal entries of a matrix:

    Tr(A) = Σ_i A_{i,i}.    (2.48)

The trace operator is useful for a variety of reasons. Some operations that are difficult to specify without resorting to summation notation can be specified using matrix products and the trace operator. For example, the trace operator provides an alternative way of writing the Frobenius norm of a matrix:

    ||A||_F = √(Tr(A A^T)).    (2.49)

Writing an expression in terms of the trace operator opens up opportunities to manipulate the expression using many useful identities. For example, the trace operator is invariant to the transpose operator:

    Tr(A) = Tr(A^T).    (2.50)

The trace of a square matrix composed of many factors is also invariant to moving the last factor into the first position, if the shapes of the corresponding matrices allow the resulting product to be defined:

    Tr(ABC) = Tr(CAB) = Tr(BCA)    (2.51)

or more generally,

    Tr(∏_{i=1}^{n} F^{(i)}) = Tr(F^{(n)} ∏_{i=1}^{n−1} F^{(i)}).    (2.52)

This invariance to cyclic permutation holds even if the resulting product has a different shape. For example, for A ∈ R^{m×n} and B ∈ R^{n×m}, we have

    Tr(AB) = Tr(BA)    (2.53)

even though AB ∈ R^{m×m} and BA ∈ R^{n×n}.

Another useful fact to keep in mind is that a scalar is its own trace: a = Tr(a).
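These identities can be verified numerically. Here is a small NumPy check (the matrix shapes are arbitrary) of Eqs. 2.49, 2.50 and 2.53, including the case where AB and BA have different shapes:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 4))

# Frobenius norm via the trace operator (Eq. 2.49).
assert np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.trace(A @ A.T)))

# Invariance to transposition (Eq. 2.50), for a square matrix.
M = rng.standard_normal((4, 4))
assert np.isclose(np.trace(M), np.trace(M.T))

# Cyclic permutation (Eq. 2.53): Tr(AB) = Tr(BA), even though
# AB is 4x4 while BA is 3x3.
assert np.isclose(np.trace(A @ B), np.trace(B @ A))
```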
2.11 The Determinant

The determinant of a square matrix, denoted det(A), is a function mapping matrices to real scalars. The determinant is equal to the product of all the eigenvalues of the matrix. The absolute value of the determinant can be thought of as a measure of how much multiplication by the matrix expands or contracts space. If the determinant is 0, then space is contracted completely along at least one dimension, causing it to lose all of its volume. If the determinant is 1, then the transformation is volume-preserving.
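A quick numerical illustration (NumPy, with an arbitrary random matrix): the determinant matches the product of the eigenvalues, and scaling every axis of R^3 by 2 scales volume by 2^3 = 8:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((3, 3))

# The determinant equals the product of the eigenvalues.  For a general
# real matrix the eigenvalues may be complex, but their product is real
# (complex eigenvalues come in conjugate pairs).
eigvals = np.linalg.eigvals(A)
assert np.isclose(np.linalg.det(A), np.prod(eigvals).real)

# |det| measures volume scaling: doubling every axis of 3-D space
# multiplies volume by 2^3 = 8.
assert np.isclose(np.linalg.det(2.0 * np.eye(3)), 8.0)
```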
2.12 Example: Principal Components Analysis

One simple machine learning algorithm, principal components analysis or PCA, can be derived using only knowledge of basic linear algebra.

Suppose we have a collection of m points {x^{(1)}, ..., x^{(m)}} in R^n. Suppose we would like to apply lossy compression to these points. Lossy compression means storing the points in a way that requires less memory but may lose some precision. We would like to lose as little precision as possible.

One way we can encode these points is to represent a lower-dimensional version of them. For each point x^{(i)} ∈ R^n we will find a corresponding code vector c^{(i)} ∈ R^l. If l is smaller than n, it will take less memory to store the code points than the original data. We will want to find some encoding function that produces the code for an input, f(x) = c, and a decoding function that produces the reconstructed input given its code, x ≈ g(f(x)).

PCA is defined by our choice of the decoding function. Specifically, to make the decoder very simple, we choose to use matrix multiplication to map the code back into R^n. Let g(c) = Dc, where D ∈ R^{n×l} is the matrix defining the decoding.

Computing the optimal code for this decoder could be a difficult problem. To keep the encoding problem easy, PCA constrains the columns of D to be orthogonal to each other. (Note that D is still not technically "an orthogonal matrix" unless l = n.)

With the problem as described so far, many solutions are possible, because we can increase the scale of D_{:,i} if we decrease c_i proportionally for all points. To give the problem a unique solution, we constrain all of the columns of D to have unit norm.

In order to turn this basic idea into an algorithm we can implement, the first thing we need to do is figure out how to generate the optimal code point c* for each input point x. One way to do this is to minimize the distance between the input point x and its reconstruction, g(c*). We can measure this distance using a norm. In the principal components algorithm, we use the L^2 norm:

    c* = arg min_c ||x − g(c)||_2.    (2.54)
We can switch to the squared L^2 norm instead of the L^2 norm itself, because both are minimized by the same value of c. This is because the L^2 norm is non-negative and the squaring operation is monotonically increasing for non-negative arguments.

    c* = arg min_c ||x − g(c)||_2^2.    (2.55)

The function being minimized simplifies to

    (x − g(c))^T (x − g(c))    (2.56)

(by the definition of the L^2 norm, Eq. 2.30)

    = x^T x − x^T g(c) − g(c)^T x + g(c)^T g(c)    (2.57)

(by the distributive property)

    = x^T x − 2 x^T g(c) + g(c)^T g(c)    (2.58)

(because the scalar g(c)^T x is equal to the transpose of itself).

We can now change the function being minimized again, to omit the first term, since this term does not depend on c:

    c* = arg min_c −2 x^T g(c) + g(c)^T g(c).    (2.59)

To make further progress, we must substitute in the definition of g(c):

    c* = arg min_c −2 x^T Dc + c^T D^T Dc    (2.60)
       = arg min_c −2 x^T Dc + c^T I_l c    (2.61)

(by the orthogonality and unit norm constraints on D)

       = arg min_c −2 x^T Dc + c^T c.    (2.62)

We can solve this optimization problem using vector calculus (see Sec. 4.3 if you do not know how to do this):

    ∇_c (−2 x^T Dc + c^T c) = 0    (2.63)
    −2 D^T x + 2c = 0    (2.64)
    c = D^T x.    (2.65)

This makes the algorithm efficient: we can optimally encode x just using a matrix-vector operation. To encode a vector, we apply the encoder function

    f(x) = D^T x.    (2.66)
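We can sanity-check that c = D^T x really minimizes Eq. 2.55. In the NumPy sketch below (for illustration only; a decoder D with orthonormal columns is built via QR factorization, which is just one convenient way to satisfy the constraints), random perturbations of the optimal code never reduce the reconstruction error:

```python
import numpy as np

rng = np.random.default_rng(4)
n, l = 5, 2

# A hypothetical decoder matrix D with orthonormal columns, obtained by
# QR-factorizing a random matrix (this enforces the PCA constraints on D).
D, _ = np.linalg.qr(rng.standard_normal((n, l)))
assert np.allclose(D.T @ D, np.eye(l))

x = rng.standard_normal(n)
c_star = D.T @ x  # the optimal code, Eq. 2.66

def recon_error(c):
    # Distance between x and its reconstruction g(c) = Dc, as in Eq. 2.54.
    return np.linalg.norm(x - D @ c)

# Any perturbation of the optimal code gives a worse (or equal)
# reconstruction, consistent with c* = arg min_c ||x - Dc||_2.
for _ in range(100):
    c_other = c_star + 0.1 * rng.standard_normal(l)
    assert recon_error(c_star) <= recon_error(c_other)
```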
Using a further matrix multiplication, we can also define the PCA reconstruction operation:

    r(x) = g(f(x)) = D D^T x.    (2.67)

Next, we need to choose the encoding matrix D. To do so, we revisit the idea of minimizing the L^2 distance between inputs and reconstructions. However, since we will use the same matrix D to decode all of the points, we can no longer consider the points in isolation. Instead, we must minimize the Frobenius norm of the matrix of errors computed over all dimensions and all points:

    D* = arg min_D √( Σ_{i,j} ( x_j^{(i)} − r(x^{(i)})_j )^2 )  subject to  D^T D = I_l.    (2.68)

To derive the algorithm for finding D*, we will start by considering the case where l = 1. In this case, D is just a single vector, d. Substituting Eq. 2.67 into Eq. 2.68 and simplifying D into d, the problem reduces to

    d* = arg min_d Σ_i || x^{(i)} − d d^T x^{(i)} ||_2^2  subject to  ||d||_2 = 1.    (2.69)

The above formulation is the most direct way of performing the substitution, but is not the most stylistically pleasing way to write the equation. It places the scalar value d^T x^{(i)} on the right of the vector d. It is more conventional to write scalar coefficients on the left of the vector they operate on. We therefore usually write such a formula as

    d* = arg min_d Σ_i || x^{(i)} − (d^T x^{(i)}) d ||_2^2  subject to  ||d||_2 = 1,    (2.70)

or, exploiting the fact that a scalar is its own transpose, as

    d* = arg min_d Σ_i || x^{(i)} − (x^{(i)T} d) d ||_2^2  subject to  ||d||_2 = 1.    (2.71)

The reader should aim to become familiar with such cosmetic rearrangements.

At this point, it can be helpful to rewrite the problem in terms of a single design matrix of examples, rather than as a sum over separate example vectors. This will allow us to use more compact notation. Let X ∈ R^{m×n} be the matrix defined by stacking all of the vectors describing the points, such that X_{i,:} = x^{(i)T}. We can now rewrite the problem as

    d* = arg min_d || X − X d d^T ||_F^2  subject to  d^T d = 1.    (2.72)
Disregarding the constraint for the moment, we can simplify the Frobenius norm portion as follows:

    arg min_d || X − X d d^T ||_F^2    (2.73)

    = arg min_d Tr( (X − X d d^T)^T (X − X d d^T) )    (2.74)

(by Eq. 2.49)

    = arg min_d Tr( X^T X − X^T X d d^T − d d^T X^T X + d d^T X^T X d d^T )    (2.75)

    = arg min_d Tr(X^T X) − Tr(X^T X d d^T) − Tr(d d^T X^T X) + Tr(d d^T X^T X d d^T)    (2.76)

    = arg min_d − Tr(X^T X d d^T) − Tr(d d^T X^T X) + Tr(d d^T X^T X d d^T)    (2.77)

(because terms not involving d do not affect the arg min)

    = arg min_d −2 Tr(X^T X d d^T) + Tr(d d^T X^T X d d^T)    (2.78)

(because we can cycle the order of the matrices inside a trace, Eq. 2.52)

    = arg min_d −2 Tr(X^T X d d^T) + Tr(X^T X d d^T d d^T)    (2.79)

(using the same property again)

At this point, we reintroduce the constraint:

    arg min_d −2 Tr(X^T X d d^T) + Tr(X^T X d d^T d d^T)  subject to  d^T d = 1    (2.80)

    = arg min_d −2 Tr(X^T X d d^T) + Tr(X^T X d d^T)  subject to  d^T d = 1    (2.81)

(due to the constraint)

    = arg min_d − Tr(X^T X d d^T)  subject to  d^T d = 1    (2.82)

    = arg max_d Tr(X^T X d d^T)  subject to  d^T d = 1    (2.83)

    = arg max_d Tr(d^T X^T X d)  subject to  d^T d = 1.    (2.84)

This optimization problem may be solved using eigendecomposition. Specifically, the optimal d is given by the eigenvector of X^T X corresponding to the largest eigenvalue.

In the general case, where l > 1, the matrix D is given by the l eigenvectors corresponding to the largest eigenvalues. This may be shown using proof by induction. We recommend writing this proof as an exercise.

Linear algebra is one of the fundamental mathematical disciplines that is necessary to understand deep learning. Another key area of mathematics that is ubiquitous in machine learning is probability theory, presented next.
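Putting the whole derivation together, here is a minimal PCA sketch in NumPy (the synthetic data, dimensions and seed are arbitrary assumptions for illustration, and the data is not centered, matching the presentation above): the decoder D is formed from the top-l eigenvectors of X^T X, the encoder is f(x) = D^T x, and the reconstruction r(x) = D D^T x recovers nearly rank-l data almost exactly:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, l = 200, 4, 2

# Synthetic data that mostly lies in a 2-D subspace of R^4, plus small noise.
basis = rng.standard_normal((n, l))
X = rng.standard_normal((m, l)) @ basis.T + 0.01 * rng.standard_normal((m, n))

# PCA decoder: the l eigenvectors of X^T X with the largest eigenvalues.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)  # eigh returns ascending order
D = eigvecs[:, -l:]                         # top-l eigenvectors, shape (n, l)
assert np.allclose(D.T @ D, np.eye(l))      # the constraint D^T D = I_l holds

# Encode and reconstruct every point: f(x) = D^T x, r(x) = D D^T x.
codes = X @ D          # each row is a code c^{(i)} in R^l
X_rec = codes @ D.T    # each row is a reconstruction r(x^{(i)})

# Because the data is nearly rank-2, the reconstruction error is tiny.
rel_err = np.linalg.norm(X - X_rec) / np.linalg.norm(X)
assert rel_err < 0.05
```

Note that `np.linalg.eigh` (for symmetric matrices such as X^T X) returns eigenvalues in ascending order, so the largest-eigenvalue eigenvectors are the last columns.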
Chapter 3

Probability and Information Theory

In this chapter, we describe probability theory and information theory.

Probability theory is a mathematical framework for representing uncertain statements. It provides a means of quantifying uncertainty and axioms for deriving new uncertain statements. In artificial intelligence applications, we use probability theory in two major ways. First, the laws of probability tell us how AI systems should reason, so we design our algorithms to compute or approximate various expressions derived using probability theory. Second, we can use probability and statistics to theoretically analyze the behavior of proposed AI systems.

Probability theory is a fundamental tool of many disciplines of science and engineering. We provide this chapter to ensure that readers whose background is primarily in software engineering with limited exposure to probability theory can understand the material in this book.

While probability theory allows us to make uncertain statements and reason in the presence of uncertainty, information theory allows us to quantify the amount of uncertainty in a probability distribution.

If you are already familiar with probability theory and information theory, you may wish to skip all of this chapter except for Sec. 3.14, which describes the graphs we use to describe structured probabilistic models for machine learning. If you have absolutely no prior experience with these subjects, this chapter should be sufficient to successfully carry out deep learning research projects, but we do suggest that you consult an additional resource, such as Jaynes (2003).
3.1
Wh Why y Probabilit Probability? y?
Man Many branches of computer science entirely tirely 3.1 y branc Whhes y Probabilit y? deal mostly with entities that are en deterministic and certain. A programmer can usually safely assume that a CPU will Many branc of computer science deal .mostly entities do that are en tirely execute eac each hhes machine instruction ﬂawlessly ﬂawlessly. Errorswith in hardware occur, but are deterministic and certain. A programmer can usually safely assume that a CPU will rare enough that most softw software are applications do not need to be designed to account execute h machine instruction ﬂawlessly . Errors hardware do occur, butinare for them.eacGiv Given en that man many y computer scientists andinsoftw software are engineers work a rare enough that most softw are applications do not need to b e designed to account relativ relatively ely clean and certain environmen environment, t, it can be surprising that mac machine hine learning for them. Giv en that man y computer scientists and softw are engineers work in a mak makes es hea heavy vy use of probabilit probability y theory theory.. relatively clean and certain environment, it can be surprising that machine learning isvy because learning alwa always ys deal with uncertain quantities, makThis es hea use of machine probabilit y theorymust . and sometimes may also need to deal with sto stocchastic (nondeterministic) quan quantities. tities. This is b ecause machine learning must alwa ys deal with uncertain quantities, Uncertain Uncertaintty and sto stocchasticit hasticity y can arise from man many y sources. Researc Researchers hers ha hav ve made and sometimes may also need to deal with sto c hastic (nondeterministic) quan comp compelling elling argumen arguments ts for quantifying uncertaint uncertainty y using probability since at tities. least Uncertain t y and sto c hasticit y can arise from man y sources. Researc hers ha v e made the 1980s. 
Many of the arguments presented here are summarized from or inspired by Pearl (1988).

Nearly all activities require some ability to reason in the presence of uncertainty. In fact, beyond mathematical statements that are true by definition, it is difficult to think of any proposition that is absolutely true or any event that is absolutely guaranteed to occur.

There are three possible sources of uncertainty:

1. Inherent stochasticity in the system being modeled. For example, most interpretations of quantum mechanics describe the dynamics of subatomic particles as being probabilistic. We can also create theoretical scenarios that we postulate to have random dynamics, such as a hypothetical card game where we assume that the cards are truly shuffled into a random order.

2. Incomplete observability. Even deterministic systems can appear stochastic when we cannot observe all of the variables that drive the behavior of the system.
For example, in the Monty Hall problem, a game show contestant is asked to choose between three doors and wins a prize held behind the chosen door. Two doors lead to a goat while a third leads to a car. The outcome given the contestant's choice is deterministic, but from the contestant's point of view, the outcome is uncertain.

3. Incomplete modeling. When we use a model that must discard some of the information we have observed, the discarded information results in uncertainty in the model's predictions. For example, suppose we build a robot that can exactly observe the location of every object around it. If the
robot discretizes space when predicting the future location of these objects, then the discretization makes the robot immediately become uncertain about the precise position of objects: each object could be anywhere within the discrete cell that it was observed to occupy.

In many cases, it is more practical to use a simple but uncertain rule rather than a complex but certain one, even if the true rule is deterministic and our modeling system has the fidelity to accommodate a complex rule. For example, the simple rule “Most birds fly” is cheap to develop and is broadly useful, while a rule of the form, “Birds fly, except for very young birds that have not yet learned to fly, sick or injured birds that have lost the ability to fly, flightless species of birds including the cassowary, ostrich and kiwi. . .
” is expensive to develop, maintain and communicate, and after all of this effort is still very brittle and prone to failure.

Given that we need a means of representing and reasoning about uncertainty, it is not immediately obvious that probability theory can provide all of the tools we want for artificial intelligence applications. Probability theory was originally developed to analyze the frequencies of events. It is easy to see how probability theory can be used to study events like drawing a certain hand of cards in a game of poker. These kinds of events are often repeatable. When we say that an outcome has a probability p of occurring, it means that if we repeated the experiment (e.g., draw a hand of cards) infinitely many times, then proportion p of the repetitions would result in that outcome. This
kind of reasoning does not seem immediately applicable to propositions that are not repeatable. If a doctor analyzes a patient and says that the patient has a 40% chance of having the flu, this means something very different: we cannot make infinitely many replicas of the patient, nor is there any reason to believe that different replicas of the patient would present with the same symptoms yet have varying underlying conditions. In the case of the doctor diagnosing the patient, we use probability to represent a degree of belief, with 1 indicating absolute certainty that the patient has the flu and 0 indicating absolute certainty that the patient does not have the flu. The former kind of probability, related directly to the rates at which events occur, is known as frequentist probability, while the latter, related to qualitative levels of certainty, is known as Bayesian probability.
If we list several properties that we expect common sense reasoning about uncertainty to have, then the only way to satisfy those properties is to treat Bayesian probabilities as behaving exactly the same as frequentist probabilities. For example, if we want to compute the probability that a player will win a poker game given that she has a certain set of cards, we use exactly the same formulas as when we compute the probability that a patient has a disease given that she
has certain symptoms. For more details about why a small set of common sense assumptions implies that the same axioms must control both kinds of probability, see Ramsey (1926).

Probability can be seen as the extension of logic to deal with uncertainty. Logic provides a set of formal rules for determining what propositions are implied to be true or false given the assumption that some other set of propositions is true or false. Probability theory provides a set of formal rules for determining the likelihood of a proposition being true given the likelihood of other propositions.
3.2 Random Variables
A random variable is a variable that can take on different values randomly. We typically denote the random variable itself with a lower case letter in plain typeface, and the values it can take on with lower case script letters. For example, x1 and x2 are both possible values that the random variable x can take on. For vector-valued variables, we would write the random variable as x and one of its values as x. On its own, a random variable is just a description of the states that are possible; it must be coupled with a probability distribution that specifies how likely each of these states are.

Random variables may be discrete or continuous. A discrete random variable is one that has a finite or countably infinite number of states. Note that these states are not necessarily the integers; they can also just be named states that are not considered to have any numerical value.
A continuous random variable is associated with a real value.
3.3 Probability Distributions
A probability distribution is a description of how likely a random variable or set of random variables is to take on each of its possible states. The way we describe probability distributions depends on whether the variables are discrete or continuous.
3.3.1 Discrete Variables and Probability Mass Functions
A probability distribution over discrete variables may be described using a probability mass function (PMF). We typically denote probability mass functions with a capital P. Often we associate each random variable with a different probability
mass function and the reader must infer which probability mass function to use based on the identity of the random variable, rather than the name of the function; P(x) is usually not the same as P(y).

The probability mass function maps from a state of a random variable to the probability of that random variable taking on that state. The probability that x = x is denoted as P(x = x), with a probability of 1 indicating that x = x is certain and a probability of 0 indicating that x = x is impossible. Sometimes to disambiguate which PMF to use, we write the name of the random variable explicitly: P(x = x). Sometimes we define a variable first, then use ∼ notation to specify which distribution it follows later: x ∼ P(x).

Probability mass functions can act on many variables at the same time. Such a probability distribution over many variables is known as a joint probability distribution.
P(x = x, y = y) denotes the probability that x = x and y = y simultaneously. We may also write P(x, y) for brevity.

To be a probability mass function on a random variable x, a function P must satisfy the following properties:

• The domain of P must be the set of all possible states of x.

• ∀x ∈ x, 0 ≤ P(x) ≤ 1. An impossible event has probability 0 and no state can be less probable than that. Likewise, an event that is guaranteed to happen has probability 1, and no state can have a greater chance of occurring.

• Σ_{x∈x} P(x) = 1. We refer to this property as being normalized. Without this property, we could obtain probabilities greater than one by computing the probability of one of many events occurring.

For example, consider a single discrete random variable x with k different states. We can place a uniform distribution on x, that is, make each of its states equally
likely, by setting its probability mass function to

P(x = x_i) = 1/k    (3.1)

for all i. We can see that this fits the requirements for a probability mass function. The value 1/k is positive because k is a positive integer. We also see that

Σ_i P(x = x_i) = Σ_i 1/k = k/k = 1,    (3.2)

so the distribution is properly normalized.
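As a quick numeric illustration (a sketch, not from the book), the uniform PMF of Eq. 3.1 and the normalization check of Eq. 3.2 can be written in a few lines of Python; the state names and the choice k = 6 are arbitrary placeholders:

```python
# Uniform PMF over k named states, stored as a dict from state to probability.
# This mirrors Eq. 3.1: P(x = x_i) = 1/k for every state.
k = 6
states = ["s%d" % i for i in range(k)]
P = {x: 1.0 / k for x in states}

# Every probability lies in [0, 1] ...
assert all(0.0 <= p <= 1.0 for p in P.values())

# ... and the distribution is properly normalized (Eq. 3.2): the probabilities
# sum to k * (1/k) = 1, up to floating-point rounding.
assert abs(sum(P.values()) - 1.0) < 1e-12
```

Because the states are named rather than numeric, the dict keys here can be any hashable labels, matching the remark above that discrete states need not carry numerical values.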
3.3.2 Continuous Variables and Probability Density Functions
When working with continuous random variables, we describe probability distributions using a probability density function (PDF) rather than a probability mass function. To be a probability density function, a function p must satisfy the following properties:

• The domain of p must be the set of all possible states of x.

• ∀x ∈ x, p(x) ≥ 0. Note that we do not require p(x) ≤ 1.

• ∫ p(x)dx = 1.

A probability density function p(x) does not give the probability of a specific state directly; instead the probability of landing inside an infinitesimal region with volume δx is given by p(x)δx.

We can integrate the density function to find the actual probability mass of a set of points. Specifically, the probability that x lies in some set S is given by the integral of p(x) over that set.
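These properties can be checked numerically for a simple density. The sketch below (not from the book) uses a hypothetical uniform density on [2, 5], and approximates the integral with a midpoint Riemann sum rather than computing it exactly:

```python
def p(x, a=2.0, b=5.0):
    """A uniform density on [a, b]: 1/(b - a) inside the interval, 0 outside."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

# Nonnegativity holds everywhere by construction; spot-check a few points.
assert all(p(x) >= 0.0 for x in [-1.0, 2.0, 3.5, 5.0, 7.0])

# Approximate the integral of p over its support with a midpoint Riemann sum.
# The total probability mass should come out close to 1.
n = 100_000
dx = (5.0 - 2.0) / n
mass = sum(p(2.0 + (i + 0.5) * dx) * dx for i in range(n))
assert abs(mass - 1.0) < 1e-9
```

Note that nothing stops p(x) from exceeding 1 at individual points (for instance, a uniform density on [0, 0.5] takes the value 2 everywhere on its support); only the integral is constrained.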
In the univariate example, the probability that x lies in the interval [a, b] is given by ∫_{[a,b]} p(x)dx.

For an example of a probability density function corresponding to a specific probability density over a continuous random variable, consider a uniform distribution on an interval of the real numbers. We can do this with a function u(x; a, b), where a and b are the endpoints of the interval, with b > a. The “;” notation means “parametrized by”; we consider x to be the argument of the function, while a and b are parameters that define the function. To ensure that there is no probability mass outside the interval, we say u(x; a, b) = 0 for all x ∉ [a, b]. Within [a, b], u(x; a, b) = 1/(b − a). We can see that this is nonnegative everywhere. Additionally, it integrates to 1. We often denote that x follows the uniform distribution on [a, b] by writing x ∼ U(a, b).

3.4 Marginal Probability
The probability Sometimes woevknow probabilit ywn distribution ov er a set ofability variables and we want distribution er thethe subset is kno known as the mar marginal ginal pr prob ob obability distribution. to know the probability distribution over just a subset of them. The probability For example, supp suppose ose we ha hav ve discrete random variables x and y, and we know distribution over the subset is known as the marginal probability distribution. P (x, y). We can ﬁnd P (x) with the sum rule: For example, suppose we have discrete random variables x and y, and we know X ∀ x ∈ x , P ( x = x ) = (3.3) P (x, y). We can ﬁnd P (x) with the sum ruleP: (x = x, y = y ). x
∀x ∈ x, P(x = x) = Σ_y P(x = x, y = y).    (3.3)
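The sum rule can be sketched directly in code; the joint PMF below is a hypothetical example, with values chosen as exact binary fractions so the sums come out exactly:

```python
# Hypothetical joint PMF P(x, y) over two binary variables.
P_xy = {(0, 0): 0.125, (0, 1): 0.125,
        (1, 0): 0.5,   (1, 1): 0.25}

# Sum rule (Eq. 3.3): P(x = x) is the sum of P(x = x, y = y) over all y.
P_x = {}
for (x, y), p in P_xy.items():
    P_x[x] = P_x.get(x, 0.0) + p

print(P_x)  # → {0: 0.25, 1: 0.75}
```

The same loop with the roles of x and y swapped would produce the marginal P(y) instead.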
The name “marginal probability” comes from the process of computing marginal probabilities on paper. When the values of P(x, y) are written in a grid with different values of x in rows and different values of y in columns, it is natural to sum across a row of the grid, then write P(x) in the margin of the paper just to the right of the row.

For continuous variables, we need to use integration instead of summation:

p(x) = ∫ p(x, y)dy.    (3.4)
3.5 Conditional Probability
In many cases, we are interested in the probability of some event, given that some other event has happened. This is called a conditional probability. We denote the conditional probability that y = y given x = x as P(y = y | x = x). This conditional probability can be computed with the formula

P(y = y | x = x) = P(y = y, x = x) / P(x = x).    (3.5)

The conditional probability is only defined when P(x = x) > 0. We cannot compute the conditional probability conditioned on an event that never happens.

It is important not to confuse conditional probability with computing what would happen if some action were undertaken. The conditional probability that a person is from Germany given that they speak German is quite high, but if a randomly selected person is taught to speak German, their country of origin
does not change. Computing the consequences of an action is called making an intervention query. Intervention queries are the domain of causal modeling, which we do not explore in this book.
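The defining formula of this section, Eq. 3.5, can be sketched by dividing a joint probability by the marginal of the conditioning variable. The joint PMF below is a hypothetical example, and conditioning on a zero-probability event raises an error, matching the restriction P(x = x) > 0:

```python
# Hypothetical joint PMF over binary x and y (exact binary fractions).
P_xy = {(0, 0): 0.125, (0, 1): 0.125,
        (1, 0): 0.5,   (1, 1): 0.25}

def conditional(y, x):
    """P(y = y | x = x) per Eq. 3.5, defined only when P(x = x) > 0."""
    # Marginal P(x = x), obtained with the sum rule.
    P_x = sum(p for (xv, _), p in P_xy.items() if xv == x)
    if P_x == 0.0:
        raise ValueError("cannot condition on an event that never happens")
    return P_xy[(x, y)] / P_x

# P(y = 0 | x = 1) = 0.5 / (0.5 + 0.25) = 2/3.
print(conditional(0, 1))  # → 0.6666666666666666
```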
3.6 The Chain Rule of Conditional Probabilities
Any joint probability distribution over many random variables may be decomposed into conditional distributions over only one variable:

P(x^(1), . . . , x^(n)) = P(x^(1)) Π_{i=2}^{n} P(x^(i) | x^(1), . . . , x^(i-1)).    (3.6)

This observation is known as the chain rule or product rule of probability. It follows immediately from the definition of conditional probability in Eq. 3.5. For
example, applying the definition twice, we get

P(a, b, c) = P(a | b, c)P(b, c)
P(b, c) = P(b | c)P(c)
P(a, b, c) = P(a | b, c)P(b | c)P(c).
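This identity can be checked numerically: computing each conditional as a ratio of marginals and multiplying them back telescopes to the joint. A sketch (not from the book) with a hypothetical joint PMF over three binary variables:

```python
import itertools
import random

# A hypothetical joint PMF over three binary variables (a, b, c),
# built by normalizing random weights so it sums to 1.
rng = random.Random(0)
states = list(itertools.product([0, 1], repeat=3))
weights = [rng.random() for _ in states]
total = sum(weights)
P = {s: w / total for s, w in zip(states, weights)}

def marg(**fixed):
    """Marginal probability of a partial assignment; keys are 'a', 'b', 'c'."""
    idx = {"a": 0, "b": 1, "c": 2}
    return sum(p for s, p in P.items()
               if all(s[idx[k]] == v for k, v in fixed.items()))

# Chain rule: P(a, b, c) = P(a | b, c) P(b | c) P(c).
a, b, c = 1, 0, 1
lhs = P[(a, b, c)]
rhs = ((marg(a=a, b=b, c=c) / marg(b=b, c=c))
       * (marg(b=b, c=c) / marg(c=c))
       * marg(c=c))
assert abs(lhs - rhs) < 1e-12
```

The ratios cancel exactly as in the derivation above, so the check holds for any assignment of (a, b, c) whose marginals are nonzero.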
3.7 Independence and Conditional Independence

Two random variables x and y are independent if their probability distribution can be expressed as a product of two factors, one involving only x and one involving only y:

∀x ∈ x, y ∈ y, p(x = x, y = y) = p(x = x)p(y = y).    (3.7)

Two random variables x and y are conditionally independent given a random variable z if the conditional probability distribution over x and y factorizes in this way for every value of z:

∀x ∈ x, y ∈ y, z ∈ z, p(x = x, y = y | z = z) = p(x = x | z = z)p(y = y | z = z).    (3.8)

We can denote independence and conditional independence with compact notation: x⊥y means that x and y are independent, while x⊥y | z means that x and y are conditionally independent given z.
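Eq. 3.7 suggests a direct numeric check for independence: compare each joint probability against the product of the corresponding marginals. A sketch with two hypothetical joint PMFs, one independent by construction and one not:

```python
# An independent joint built as a product of marginals (Eq. 3.7),
# and a dependent joint in which x and y always take the same value.
px = {0: 0.25, 1: 0.75}
py = {0: 0.5, 1: 0.5}
indep = {(x, y): px[x] * py[y] for x in px for y in py}
dep = {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.5}

def is_independent(P, tol=1e-12):
    """True if P(x, y) factorizes into its marginals for every pair of values."""
    mx = {x: sum(p for (xv, _), p in P.items() if xv == x) for (x, _) in P}
    my = {y: sum(p for (_, yv), p in P.items() if yv == y) for (_, y) in P}
    return all(abs(P[(x, y)] - mx[x] * my[y]) <= tol for (x, y) in P)

print(is_independent(indep), is_independent(dep))  # → True False
```

For the dependent joint, both marginals are uniform, yet P(0, 0) = 0.5 differs from the product 0.25, so the factorization test fails as expected.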
3.8 Expectation, Variance and Covariance
The expectation or expected value of some function f(x) with respect to a probability distribution P(x) is the average or mean value that f takes on when x is drawn from P. For discrete variables this can be computed with a summation:

E_{x∼P}[f(x)] = Σ_x P(x)f(x),    (3.9)

while for continuous variables, it is computed with an integral:

E_{x∼p}[f(x)] = ∫ p(x)f(x)dx.    (3.10)
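Eq. 3.9 is just a probability-weighted sum, which can be sketched directly; the distribution and the function f below are arbitrary illustrations, not from the book:

```python
# Eq. 3.9 for a discrete distribution: E[f(x)] = sum over x of P(x) f(x).
P = {0: 0.25, 1: 0.5, 2: 0.25}

def f(x):
    return x ** 2

E_f = sum(P[x] * f(x) for x in P)
print(E_f)  # → 0.25*0 + 0.5*1 + 0.25*4 = 1.5
```

The continuous case, Eq. 3.10, replaces this sum with an integral and is typically approximated numerically, as in the Riemann-sum sketch earlier in the chapter.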
When the identity of the distribution is clear from the context, we may simply write the name of the random variable that the expectation is over, as in E_x[f(x)]. If it is clear which random variable the expectation is over, we may omit the subscript entirely, as in E[f(x)]. By default, we can assume that E[·] averages over the values of all the random variables inside the brackets. Likewise, when there is no ambiguity, we may omit the square brackets.

Expectations are linear, for example,

E_x[αf(x) + βg(x)] = αE_x[f(x)] + βE_x[g(x)],    (3.11)
when \alpha and \beta are not dependent on x.

The variance gives a measure of how much the values of a function of a random variable x vary as we sample different values of x from its probability distribution:

\mathrm{Var}(f(x)) = \mathbb{E}\left[ (f(x) - \mathbb{E}[f(x)])^2 \right].   (3.12)

When the variance is low, the values of f(x) cluster near their expected value. The square root of the variance is known as the standard deviation.

The covariance gives some sense of how much two values are linearly related to each other, as well as the scale of these variables:

\mathrm{Cov}(f(x), g(y)) = \mathbb{E}\left[ (f(x) - \mathbb{E}[f(x)]) (g(y) - \mathbb{E}[g(y)]) \right].   (3.13)

High absolute values of the covariance mean that the values change very much and are both far from their respective means at the same time. If the sign of the covariance is positive, then both variables tend to take on relatively high values simultaneously. If the sign of the covariance is negative, then one variable tends to take on a relatively high value at the times that the other takes on a relatively low value and vice versa. Other measures such as correlation normalize the contribution of each variable in order to measure only how much the variables are related, rather than also being affected by the scale of the separate variables.

The notions of covariance and dependence are related, but are in fact distinct concepts. They are related because two variables that are independent have zero covariance, and two variables that have nonzero covariance are dependent. However, independence is a distinct property from covariance. For two variables to have zero covariance, there must be no linear dependence between them. Independence is a stronger requirement than zero covariance, because independence also excludes nonlinear relationships. It is possible for two variables to be dependent but have zero covariance. For example, suppose we first sample a real number x from a uniform distribution over the interval [-1, 1]. We next sample a random variable
s. With probability \frac{1}{2}, we choose the value of s to be 1. Otherwise, we choose the value of s to be -1. We can then generate a random variable y by assigning y = sx. Clearly, x and y are not independent, because x completely determines the magnitude of y. However, \mathrm{Cov}(x, y) = 0.

The covariance matrix of a random vector \mathbf{x} \in \mathbb{R}^n is an n \times n matrix, such that

\mathrm{Cov}(\mathbf{x})_{i,j} = \mathrm{Cov}(x_i, x_j).   (3.14)

The diagonal elements of the covariance give the variance:

\mathrm{Cov}(x_i, x_i) = \mathrm{Var}(x_i).   (3.15)
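The dependent-but-uncorrelated example above can be checked by Monte Carlo simulation. This is a sketch with our own helper name; the tolerances are loose because the covariance is only estimated from samples.

```python
import random

def sample_cov(xs, ys):
    # Sample estimate of Cov(x, y) = E[(x - E[x])(y - E[y])], Eq. 3.13.
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

random.seed(0)
xs, ys = [], []
for _ in range(100_000):
    x = random.uniform(-1, 1)               # x ~ Uniform[-1, 1]
    s = 1 if random.random() < 0.5 else -1  # s = +1 or -1, each with probability 1/2
    xs.append(x)
    ys.append(s * x)                        # y = sx: x determines |y|, yet Cov(x, y) = 0

cov_xy = sample_cov(xs, ys)  # close to 0: no *linear* dependence
var_x = sample_cov(xs, xs)   # Var(x) = Cov(x, x), which is 1/3 for Uniform[-1, 1]
```

The estimate `cov_xy` hovers near zero even though y is a deterministic function of x and s, illustrating that zero covariance does not imply independence.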
3.9 Common Probability Distributions
Several simple probability distributions are useful in many contexts in machine learning.
3.9.1 Bernoulli Distribution

The Bernoulli distribution is a distribution over a single binary random variable. It is controlled by a single parameter \phi \in [0, 1], which gives the probability of the random variable being equal to 1. It has the following properties:

P(x = 1) = \phi   (3.16)
P(x = 0) = 1 - \phi   (3.17)
P(x = x) = \phi^x (1 - \phi)^{1 - x}   (3.18)
\mathbb{E}_x[x] = \phi   (3.19)
\mathrm{Var}_x(x) = \phi (1 - \phi)   (3.20)
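Eqs. 3.18 through 3.20 can be verified directly by summing over the two states, since the support is just \{0, 1\}. The helper name below is our own; this is a sketch, not library code.

```python
def bernoulli_pmf(x, phi):
    # P(x) = phi^x * (1 - phi)^(1 - x) for x in {0, 1}, Eq. 3.18.
    return phi ** x * (1 - phi) ** (1 - x)

phi = 0.3
# E[x] = sum over x of x * P(x) = phi (Eq. 3.19).
mean = sum(x * bernoulli_pmf(x, phi) for x in (0, 1))
# Var(x) = E[(x - E[x])^2] = phi * (1 - phi) (Eq. 3.20).
var = sum((x - mean) ** 2 * bernoulli_pmf(x, phi) for x in (0, 1))
```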
3.9.2 Multinoulli Distribution

The multinoulli or categorical distribution is a distribution over a single discrete variable with k different states, where k is finite.¹ The multinoulli distribution is

¹ "Multinoulli" is a term that was recently coined by Gustavo Lacerdo and popularized by Murphy (2012). The multinoulli distribution is a special case of the multinomial distribution. A multinomial distribution is the distribution over vectors in \{0, \ldots, n\}^k representing how many times each of the k categories is visited when n samples are drawn from a multinoulli distribution. Many texts use the term "multinomial" to refer to multinoulli distributions without clarifying that they refer only to the n = 1 case.
parametrized by a vector \mathbf{p} \in [0, 1]^{k-1}, where p_i gives the probability of the i-th state. The final, k-th state's probability is given by 1 - \mathbf{1}^\top \mathbf{p}. Note that we must constrain \mathbf{1}^\top \mathbf{p} \le 1. Multinoulli distributions are often used to refer to distributions over categories of objects, so we do not usually assume that state 1 has numerical value 1, etc. For this reason, we do not usually need to compute the expectation or variance of multinoulli-distributed random variables.

The Bernoulli and multinoulli distributions are sufficient to describe any distribution over their domain. This is because they model discrete variables for which it is feasible to simply enumerate all of the states. When dealing with continuous variables, there are uncountably many states, so any distribution described by a small number of parameters must impose strict limits on the distribution.
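The (k-1)-vector parametrization can be sketched as follows, with the final state receiving the remaining probability mass. The function name is our own, for illustration only.

```python
def multinoulli_probs(p):
    # p holds k-1 entries in [0, 1] with sum(p) <= 1; the final, k-th
    # state's probability is 1 - sum(p), matching the text's 1 - 1^T p.
    assert all(0.0 <= pi <= 1.0 for pi in p) and sum(p) <= 1.0
    return list(p) + [1.0 - sum(p)]

probs = multinoulli_probs([0.2, 0.5])  # a k = 3 state distribution
```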
3.9.3 Gaussian Distribution

The most commonly used distribution over real numbers is the normal distribution, also known as the Gaussian distribution:

\mathcal{N}(x; \mu, \sigma^2) = \sqrt{\frac{1}{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2} (x - \mu)^2 \right).   (3.21)

See Fig. 3.1 for a plot of the density function.

The two parameters \mu \in \mathbb{R} and \sigma \in (0, \infty) control the normal distribution. The parameter \mu gives the coordinate of the central peak. This is also the mean of the distribution: \mathbb{E}[x] = \mu. The standard deviation of the distribution is given by \sigma, and the variance by \sigma^2.

When we evaluate the PDF, we need to square and invert \sigma. When we need to frequently evaluate the PDF with different parameter values, a more efficient way of parametrizing the distribution is to use a parameter \beta \in (0, \infty) to control the precision or inverse variance of the distribution:

\mathcal{N}(x; \mu, \beta^{-1}) = \sqrt{\frac{\beta}{2\pi}} \exp\left( -\frac{1}{2} \beta (x - \mu)^2 \right).   (3.22)

Normal distributions are a sensible choice for many applications.
In the absence of prior knowledge about what form a distribution over the real numbers should take, the normal distribution is a good default choice for two major reasons.

First, many distributions we wish to model are truly close to being normal distributions. The central limit theorem shows that the sum of many independent random variables is approximately normally distributed. This means that in
Figure 3.1: The normal distribution. The normal distribution \mathcal{N}(x; \mu, \sigma^2) exhibits a classic "bell curve" shape, with the x coordinate of its central peak given by \mu, and the width of its peak controlled by \sigma. The maximum is at x = \mu and the inflection points are at x = \mu \pm \sigma. In this example, we depict the standard normal distribution, with \mu = 0 and \sigma = 1.
practice, many complicated systems can be modeled successfully as normally distributed noise, even if the system can be decomposed into parts with more structured behavior.

Second, out of all possible probability distributions with the same variance, the normal distribution encodes the maximum amount of uncertainty over the real numbers. We can thus think of the normal distribution as being the one that inserts the least amount of prior knowledge into a model. Fully developing and justifying this idea requires more mathematical tools, and is postponed to Sec. 19.4.2.

The normal distribution generalizes to \mathbb{R}^n, in which case it is known as the multivariate normal distribution. It may be parametrized with a positive definite symmetric matrix \Sigma:

\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \Sigma) = \sqrt{\frac{1}{(2\pi)^n \det(\Sigma)}} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right).   (3.23)

The parameter \boldsymbol{\mu} still gives the mean of the distribution, though now it is vector-valued.
The parameter \Sigma gives the covariance matrix of the distribution. As in the univariate case, when we wish to evaluate the PDF several times for
many different values of the parameters, the covariance is not a computationally efficient way to parametrize the distribution, since we need to invert \Sigma to evaluate the PDF. We can instead use a precision matrix \beta:

\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \beta^{-1}) = \sqrt{\frac{\det(\beta)}{(2\pi)^n}} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \beta (\mathbf{x} - \boldsymbol{\mu}) \right).   (3.24)

We often fix the covariance matrix to be a diagonal matrix. An even simpler version is the isotropic Gaussian distribution, whose covariance matrix is a scalar times the identity matrix.
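For the diagonal-precision case just mentioned, Eq. 3.24 is easy to evaluate directly, because det(\beta) is the product of the diagonal entries and the quadratic form separates per coordinate. The following sketch (helper names are our own) checks that the joint density then factorizes into a product of univariate densities from Eq. 3.22.

```python
import math

def normal_pdf_precision(x, mu, beta):
    # Univariate Eq. 3.22: sqrt(beta / (2*pi)) * exp(-0.5 * beta * (x - mu)^2).
    return math.sqrt(beta / (2 * math.pi)) * math.exp(-0.5 * beta * (x - mu) ** 2)

def mvn_pdf_diag_precision(x, mu, beta_diag):
    # Eq. 3.24 restricted to a diagonal precision matrix: det(beta) is the
    # product of the diagonal, and the quadratic form is a per-coordinate sum.
    n = len(x)
    det_beta = math.prod(beta_diag)
    quad = sum(b * (xi - mi) ** 2 for b, xi, mi in zip(beta_diag, x, mu))
    return math.sqrt(det_beta / (2 * math.pi) ** n) * math.exp(-0.5 * quad)

# With a diagonal precision, the joint density factorizes into univariate ones.
joint = mvn_pdf_diag_precision([0.5, -1.0], [0.0, 0.0], [1.0, 4.0])
factored = normal_pdf_precision(0.5, 0.0, 1.0) * normal_pdf_precision(-1.0, 0.0, 4.0)
```

This factorization is exactly why diagonal (and isotropic) Gaussians are so cheap to evaluate: no matrix inversion or general determinant is needed.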
3.9.4 Exponential and Laplace Distributions
In the context of deep learning, we often want to have a probability distribution with a sharp point at x = 0. To accomplish this, we can use the exponential distribution:

p(x; \lambda) = \lambda \, \mathbf{1}_{x \ge 0} \exp(-\lambda x).   (3.25)

The exponential distribution uses the indicator function \mathbf{1}_{x \ge 0} to assign probability zero to all negative values of x.

A closely related probability distribution that allows us to place a sharp peak of probability mass at an arbitrary point \mu is the Laplace distribution

\mathrm{Laplace}(x; \mu, \gamma) = \frac{1}{2\gamma} \exp\left( -\frac{|x - \mu|}{\gamma} \right).   (3.26)

3.9.5 The Dirac Distribution and Empirical Distribution

In some cases, we wish to specify that all of the mass in a probability distribution clusters around a single point. This can be accomplished by defining a PDF using the Dirac delta function, \delta(x):

p(x) = \delta(x - \mu).   (3.27)

The Dirac delta function is defined such that it is zero-valued everywhere except 0, yet integrates to 1.
The Dirac delta function is not an ordinary function that associates each value x with a real-valued output; instead it is a different kind of mathematical object called a generalized function that is defined in terms of its properties when integrated. We can think of the Dirac delta function as being the limit point of a series of functions that put less and less mass on all points other than \mu.
By defining p(x) to be \delta shifted by -\mu we obtain an infinitely narrow and infinitely high peak of probability mass where x = \mu.

A common use of the Dirac delta distribution is as a component of an empirical distribution,

\hat{p}(\mathbf{x}) = \frac{1}{m} \sum_{i=1}^{m} \delta(\mathbf{x} - \mathbf{x}^{(i)})   (3.28)

which puts probability mass \frac{1}{m} on each of the m points \mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(m)} forming a given data set or collection of samples. The Dirac delta distribution is only necessary to define the empirical distribution over continuous variables. For discrete variables, the situation is simpler: an empirical distribution can be conceptualized as a multinoulli distribution, with a probability associated to each possible input value that is simply equal to the empirical frequency of that value in the training set.

We can view the empirical distribution formed from a dataset of training examples as specifying the distribution that we sample from when we train a model on this dataset.
Another important perspective on the empirical distribution is that it is the probability density that maximizes the likelihood of the training data (see Sec. 5.5).
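For the discrete case, the multinoulli view of the empirical distribution amounts to counting frequencies. A minimal sketch (the function name is our own):

```python
from collections import Counter

def empirical_pmf(samples):
    # Discrete empirical distribution: each value's probability equals
    # its observed frequency in the data set.
    counts = Counter(samples)
    m = len(samples)
    return {v: c / m for v, c in counts.items()}

pmf = empirical_pmf([1, 1, 2, 3, 3, 3])  # probabilities 2/6, 1/6, 3/6
```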
3.9.6 Mixtures of Distributions

It is also common to define probability distributions by combining other simpler probability distributions. One common way of combining distributions is to construct a mixture distribution. A mixture distribution is made up of several component distributions. On each trial, the choice of which component distribution generates the sample is determined by sampling a component identity from a multinoulli distribution:

P(x) = \sum_i P(c = i) P(x \mid c = i)   (3.29)
where P(c) is the multinoulli distribution over component identities.

We have already seen one example of a mixture distribution: the empirical distribution over real-valued variables is a mixture distribution with one Dirac component for each training example.

The mixture model is one simple strategy for combining probability distributions to create a richer distribution. In Chapter 16, we explore the art of building complex probability distributions from simple ones in more detail.
The mixture model allows us to briefly glimpse a concept that will be of paramount importance later: the latent variable. A latent variable is a random variable that we cannot observe directly. The component identity variable c of the mixture model provides an example. Latent variables may be related to \mathbf{x} through the joint distribution, in this case, P(\mathbf{x}, c) = P(\mathbf{x} \mid c) P(c). The distribution P(c) over the latent variable and the distribution P(\mathbf{x} \mid c) relating the latent variables to the visible variables determines the shape of the distribution P(\mathbf{x}) even though it is possible to describe P(\mathbf{x}) without reference to the latent variable. Latent variables are discussed further in Sec. 16.5.

A very powerful and common type of mixture model is the Gaussian mixture model, in which the components p(\mathbf{x} \mid c = i) are Gaussians. Each component has a separately parametrized mean \boldsymbol{\mu}^{(i)} and covariance \Sigma^{(i)}. Some mixtures can have more constraints. For example, the covariances could be shared across components via the constraint \Sigma^{(i)} = \Sigma, \forall i. As with a single Gaussian distribution, the mixture of Gaussians might constrain the covariance matrix for each component to be diagonal or isotropic.

In addition to the means and covariances, the parameters of a Gaussian mixture specify the prior probability \alpha_i = P(c = i) given to each component i. The word "prior" indicates that it expresses the model's beliefs about c before it has observed \mathbf{x}. By comparison, P(c \mid \mathbf{x}) is a posterior probability, because it is computed after observation of \mathbf{x}. A Gaussian mixture model is a universal approximator of densities, in the sense that any smooth density can be approximated with any specific, nonzero amount of error by a Gaussian mixture model with enough components.

Fig. 3.2 shows samples from a Gaussian mixture model.
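The sampling procedure described above (pick c from a multinoulli, then sample from that component) is known as ancestral sampling, and Eq. 3.29 gives the density. A minimal univariate sketch, with made-up parameter values and helper names of our own:

```python
import math
import random

def gmm_density(x, alphas, mus, sigmas):
    # P(x) = sum_i P(c = i) p(x | c = i), Eq. 3.29 with Gaussian components.
    return sum(a * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
               for a, m, s in zip(alphas, mus, sigmas))

def sample_gmm(alphas, mus, sigmas):
    # Ancestral sampling: first draw the component identity c from a
    # multinoulli with prior probabilities alphas, then draw x from
    # that component's Gaussian.
    c = random.choices(range(len(alphas)), weights=alphas)[0]
    return c, random.gauss(mus[c], sigmas[c])

random.seed(0)
alphas, mus, sigmas = [0.3, 0.7], [-2.0, 1.0], [0.5, 1.0]
c, x = sample_gmm(alphas, mus, sigmas)
density_at_x = gmm_density(x, alphas, mus, sigmas)
```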
3.10 Useful Properties of Common Functions

Certain functions arise often while working with probability distributions, especially the probability distributions used in deep learning models.

One of these functions is the logistic sigmoid:

\sigma(x) = \frac{1}{1 + \exp(-x)}.   (3.30)

The logistic sigmoid is commonly used to produce the \phi parameter of a Bernoulli distribution because its range is (0, 1), which lies within the valid range of values for the \phi parameter. See Fig. 3.3 for a graph of the sigmoid function. The sigmoid
Figure 3.2: Samples from a Gaussian mixture model. In this example, there are three components. From left to right, the first component has an isotropic covariance matrix, meaning it has the same amount of variance in each direction. The second has a diagonal covariance matrix, meaning it can control the variance separately along each axis-aligned direction. This example has more variance along the x_2 axis than along the x_1 axis. The third component has a full-rank covariance matrix, allowing it to control the variance separately along an arbitrary basis of directions.
function saturates when its argument is very positive or very negative, meaning that the function becomes very flat and insensitive to small changes in its input.

Another commonly encountered function is the softplus function (Dugas et al., 2001):

\zeta(x) = \log(1 + \exp(x)).   (3.31)

The softplus function can be useful for producing the \beta or \sigma parameter of a normal distribution because its range is (0, \infty). It also arises commonly when manipulating expressions involving sigmoids. The name of the softplus function comes from the fact that it is a smoothed or "softened" version of

x^+ = \max(0, x).   (3.32)

See Fig. 3.4 for a graph of the softplus function.

The following properties are all useful enough that you may wish to memorize them:

\sigma(x) = \frac{\exp(x)}{\exp(x) + \exp(0)}   (3.33)

\frac{d}{dx} \sigma(x) = \sigma(x)(1 - \sigma(x))   (3.34)
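The derivative identity in Eq. 3.34 can be checked against a central finite difference. A minimal sketch:

```python
import math

def sigmoid(x):
    # Eq. 3.30: sigma(x) = 1 / (1 + exp(-x)).
    return 1.0 / (1.0 + math.exp(-x))

# Eq. 3.34: d/dx sigma(x) = sigma(x) * (1 - sigma(x)).
x, h = 0.7, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)  # central difference
analytic = sigmoid(x) * (1 - sigmoid(x))
```

This derivative form is one reason the sigmoid is convenient in gradient-based learning: the gradient can be computed from the forward-pass value alone.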
CHAPTER 3. PROBABILITY AND INFORMATION THEORY
Figure 3.3: The logistic sigmoid function.
Figure 3.4: The softplus function.
1 − σ(x) = σ(−x)    (3.35)

log σ(x) = −ζ(−x)    (3.36)

d/dx ζ(x) = σ(x)    (3.37)

∀x ∈ (0, 1),  σ⁻¹(x) = log ( x / (1 − x) )    (3.38)

∀x > 0,  ζ⁻¹(x) = log (exp(x) − 1)    (3.39)

ζ(x) = ∫_{−∞}^{x} σ(y) dy    (3.40)

ζ(x) − ζ(−x) = x    (3.41)
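These identities are easy to spot-check numerically; a minimal sketch with direct implementations of σ and ζ:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softplus(x):
    return math.log1p(math.exp(x))

for x in [-3.0, -0.5, 0.0, 1.2, 4.0]:
    assert abs((1 - sigmoid(x)) - sigmoid(-x)) < 1e-9           # Eq. 3.35
    assert abs(math.log(sigmoid(x)) + softplus(-x)) < 1e-9      # Eq. 3.36
    assert abs((softplus(x) - softplus(-x)) - x) < 1e-9         # Eq. 3.41

# Inverses: the logit (Eq. 3.38) and the softplus inverse (Eq. 3.39).
logit = lambda p: math.log(p / (1 - p))
softplus_inv = lambda y: math.log(math.expm1(y))
assert abs(logit(sigmoid(1.2)) - 1.2) < 1e-9
assert abs(softplus_inv(softplus(1.2)) - 1.2) < 1e-9
print("identities hold")
```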
The function σ⁻¹(x) is called the logit in statistics, but this term is more rarely used in machine learning.

Eq. 3.41 provides extra justification for the name “softplus.” The softplus function is intended as a smoothed version of the positive part function, x⁺ = max{0, x}. The positive part function is the counterpart of the negative part function, x⁻ = max{0, −x}. To obtain a smooth function that is analogous to the negative part, one can use ζ(−x). Just as x can be recovered from its positive and negative parts via the identity x⁺ − x⁻ = x, it is also possible to recover x using the same relationship between ζ(x) and ζ(−x), as shown in Eq. 3.41.

3.11 Bayes’ Rule

We often find ourselves in a situation where we know P(y | x) and need to know P(x | y). Fortunately, if we also know P(x), we can compute the desired quantity using Bayes’ rule:

P(x | y) = P(x) P(y | x) / P(y).    (3.42)

Note that while P(y) appears in the formula, it is usually feasible to compute P(y) = Σ_x P(y | x) P(x), so we do not need to begin with knowledge of P(y).

Bayes’ rule is straightforward to derive from the definition of conditional probability, but it is useful to know the name of this formula since many texts refer to it by name. It is named after the Reverend Thomas Bayes, who first discovered a special case of the formula. The general version presented here was independently discovered by Pierre-Simon Laplace.
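As a sketch of Eq. 3.42 together with the sum-rule computation of P(y), using a hypothetical diagnostic-test example with made-up numbers:

```python
# Hypothetical example: x = has disease, y = test comes back positive.
# All probabilities below are invented for illustration.
p_x = 0.01            # P(x): prior probability of disease
p_y_given_x = 0.95    # P(y | x): test sensitivity
p_y_given_not_x = 0.05  # P(y | not x): false positive rate

# P(y) via the sum rule: P(y) = sum_x P(y | x) P(x)
p_y = p_y_given_x * p_x + p_y_given_not_x * (1 - p_x)

# Bayes' rule (Eq. 3.42): P(x | y) = P(x) P(y | x) / P(y)
p_x_given_y = p_x * p_y_given_x / p_y
print(p_x_given_y)  # roughly 0.16: even a positive test leaves the disease unlikely
```

Note that P(y) never had to be supplied directly; it was assembled from P(x) and P(y | x), exactly as the text describes.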
3.12 Technical Details of Continuous Variables

A proper formal understanding of continuous random variables and probability density functions requires developing probability theory in terms of a branch of mathematics known as measure theory. Measure theory is beyond the scope of this textbook, but we can briefly sketch some of the issues that measure theory is employed to resolve.

In Sec. 3.3.2, we saw that the probability of a continuous vector-valued x lying in some set S is given by the integral of p(x) over the set S. Some choices of set S can produce paradoxes.² For example, it is possible to construct two sets S₁ and S₂ such that p(x ∈ S₁) + p(x ∈ S₂) > 1 but S₁ ∩ S₂ = ∅. These sets are generally constructed making very heavy use of the infinite precision of real numbers, for example by making fractal-shaped sets or sets that are defined by transforming the set of rational numbers.
One of the key contributions of measure theory is to provide a characterization of the set of sets that we can compute the probability of without encountering paradoxes. In this book, we only integrate over sets with relatively simple descriptions, so this aspect of measure theory never becomes a relevant concern.

For our purposes, measure theory is more useful for describing theorems that apply to most points in Rⁿ but do not apply to some corner cases. Measure theory provides a rigorous way of describing that a set of points is negligibly small. Such a set is said to have “measure zero.” We do not formally define this concept in this textbook. However, it is useful to understand the intuition that a set of measure zero occupies no volume in the space we are measuring. For example, within R², a line has measure zero, while a filled polygon has positive measure. Likewise, an individual point has measure zero.
Any union of countably many sets that each have measure zero also has measure zero (so the set of all the rational numbers has measure zero, for instance).

Another useful term from measure theory is “almost everywhere.” A property that holds almost everywhere holds throughout all of space except for on a set of measure zero. Because the exceptions occupy a negligible amount of space, they can be safely ignored for many applications. Some important results in probability theory hold for all discrete values but only hold “almost everywhere” for continuous values.

Another technical detail of continuous variables relates to handling continuous random variables that are deterministic functions of one another. Suppose we have two random variables, x and y, such that y = g(x), where g is an invertible, continuous, differentiable transformation. One might expect that p_y(y) = p_x(g⁻¹(y)). This is actually not the case.

As a simple example, suppose we have scalar random variables x and y. Suppose that y = x/2 and x ∼ U(0, 1). If we use the rule p_y(y) = p_x(2y) then p_y will be 0 everywhere except the interval [0, 1/2], and it will be 1 on this interval. This means

∫ p_y(y) dy = 1/2,    (3.43)

which violates the definition of a probability distribution.

This common mistake is wrong because it fails to account for the distortion of space introduced by the function g. Recall that the probability of x lying in an infinitesimally small region with volume δx is given by p(x)δx. Since g can expand or contract space, the infinitesimal volume surrounding x in x space may have different volume in y space.

To see how to correct the problem, we return to the scalar case.

² The Banach-Tarski theorem provides a fun example of such sets.
We need to preserve the property

p_y(g(x)) dy = p_x(x) dx.    (3.44)

Solving from this, we obtain

p_y(y) = p_x(g⁻¹(y)) | ∂x/∂y |    (3.45)

or equivalently

p_x(x) = p_y(g(x)) | ∂g(x)/∂x |.    (3.46)

In higher dimensions, the derivative generalizes to the determinant of the Jacobian matrix: the matrix with J_{i,j} = ∂x_i/∂y_j. Thus, for real-valued vectors x and y,

p_x(x) = p_y(g(x)) | det ( ∂g(x)/∂x ) |.    (3.47)
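The x ∼ U(0, 1), y = x/2 example can be checked by simulation: a histogram estimate of the density of y comes out near 2 on [0, 1/2], matching the Jacobian-corrected rule of Eq. 3.45 rather than the naive p_x(2y) = 1. A sketch:

```python
import random

random.seed(0)
n = 100_000
xs = [random.random() for _ in range(n)]  # x ~ U(0, 1)
ys = [x / 2 for x in xs]                  # y = g(x) = x/2

# Correct density of y: p_y(y) = p_x(2y) * |dx/dy| = 1 * 2 on [0, 1/2].
# Estimate the density near y = 0.25 with a narrow histogram bin.
width = 0.02
count = sum(1 for y in ys if 0.24 <= y < 0.26)
density_estimate = count / (n * width)
print(density_estimate)  # close to 2, not 1: the naive rule is off by |dx/dy|
```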
3.13 Information Theory

Information theory is a branch of applied mathematics that revolves around quantifying how much information is present in a signal. It was originally invented to study sending messages from discrete alphabets over a noisy channel, such as communication via radio transmission. In this context, information theory tells how to design optimal codes and calculate the expected length of messages sampled from
specific probability distributions using various encoding schemes. In the context of machine learning, we can also apply information theory to continuous variables where some of these message length interpretations do not apply. This field is fundamental to many areas of electrical engineering and computer science. In this textbook, we mostly use a few key ideas from information theory to characterize probability distributions or quantify similarity between probability distributions. For more detail on information theory, see Cover and Thomas (2006) or MacKay (2003).

The basic intuition behind information theory is that learning that an unlikely event has occurred is more informative than learning that a likely event has occurred. A message saying “the sun rose this morning” is so uninformative as
to be unnecessary to send, but a message saying “there was a solar eclipse this morning” is very informative.

We would like to quantify information in a way that formalizes this intuition. Specifically,

• Likely events should have low information content, and in the extreme case, events that are guaranteed to happen should have no information content whatsoever.

• Less likely events should have higher information content.

• Independent events should have additive information. For example, finding out that a tossed coin has come up as heads twice should convey twice as much information as finding out that a tossed coin has come up as heads once.

In order to satisfy all three of these properties, we define the self-information of an event x = x to be

I(x) = −log P(x).    (3.48)

In this book, we always use log to mean the natural logarithm, with base e.
Our definition of I(x) is therefore written in units of nats. One nat is the amount of information gained by observing an event of probability 1/e. Other texts use base-2 logarithms and units called bits or shannons; information measured in bits is just a rescaling of information measured in nats.

When x is continuous, we use the same definition of information by analogy, but some of the properties from the discrete case are lost. For example, an event with unit density still has zero information, despite not being an event that is guaranteed to occur.
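A sketch of Eq. 3.48 and the nats/bits conventions described above:

```python
import math

def self_information(p, base=math.e):
    # I(x) = -log P(x); the natural log gives nats, base 2 gives bits/shannons.
    return -math.log(p) / math.log(base)

# One fair coin flip carries one bit of information...
print(self_information(0.5, base=2))        # approx. 1.0 bit
# ...and independent events add: two heads carry twice the information.
print(self_information(0.25, base=2))       # approx. 2.0 bits
# Bits are just a rescaling of nats by a factor of log(2).
print(self_information(0.5) / math.log(2))  # approx. 1.0
```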
Figure 3.5: This plot shows how distributions that are closer to deterministic have low Shannon entropy while distributions that are close to uniform have high Shannon entropy. On the horizontal axis, we plot p, the probability of a binary random variable being equal to 1. The entropy is given by (p − 1) log(1 − p) − p log p. When p is near 0, the distribution is nearly deterministic, because the random variable is nearly always 0. When p is near 1, the distribution is nearly deterministic, because the random variable is nearly always 1. When p = 0.5, the entropy is maximal, because the distribution is uniform over the two outcomes.
Self-information deals only with a single outcome. We can quantify the amount of uncertainty in an entire probability distribution using the Shannon entropy:

H(x) = E_{x∼P}[I(x)] = −E_{x∼P}[log P(x)],    (3.49)

also denoted H(P). In other words, the Shannon entropy of a distribution is the expected amount of information in an event drawn from that distribution. It gives a lower bound on the number of bits (if the logarithm is base 2, otherwise the units are different) needed on average to encode symbols drawn from a distribution P. Distributions that are nearly deterministic (where the outcome is nearly certain) have low entropy; distributions that are closer to uniform have high entropy. See Fig. 3.5 for a demonstration. When x is continuous, the Shannon entropy is known as the differential entropy.

If we have two separate probability distributions P(x) and Q(x) over the same random variable x, we can measure how different these two distributions are using the Kullback-Leibler (KL) divergence:

D_KL(P‖Q) = E_{x∼P}[ log ( P(x)/Q(x) ) ] = E_{x∼P}[ log P(x) − log Q(x) ].    (3.50)
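For discrete distributions, Eq. 3.49 and Eq. 3.50 reduce to finite sums; a sketch with made-up example distributions:

```python
import math

def entropy(p):
    # H(P) = -sum_x P(x) log P(x), in nats; terms with P(x) = 0 contribute 0.
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_x P(x) (log P(x) - log Q(x))
    return sum(pi * (math.log(pi) - math.log(qi))
               for pi, qi in zip(p, q) if pi > 0)

uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.85, 0.05, 0.05, 0.05]

print(entropy(uniform))  # log 4, approx. 1.386: maximal for four outcomes
print(entropy(peaked))   # much lower: the distribution is nearly deterministic
print(kl_divergence(uniform, uniform))      # 0.0: identical distributions
print(kl_divergence(uniform, peaked) >= 0)  # True: the KL divergence is non-negative
```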
In the case of discrete variables, it is the extra amount of information (measured in bits if we use the base-2 logarithm, but in machine learning we usually use nats and the natural logarithm) needed to send a message containing symbols drawn from probability distribution P, when we use a code that was designed to minimize the length of messages drawn from probability distribution Q.

The KL divergence has many useful properties, most notably that it is non-negative. The KL divergence is 0 if and only if P and Q are the same distribution in the case of discrete variables, or equal “almost everywhere” in the case of continuous variables. Because the KL divergence is non-negative and measures the difference between two distributions, it is often conceptualized as measuring some sort of distance between these distributions. However, it is not a true distance measure
because it is not symmetric: D_KL(P‖Q) ≠ D_KL(Q‖P) for some P and Q. This asymmetry means that there are important consequences to the choice of whether to use D_KL(P‖Q) or D_KL(Q‖P). See Fig. 3.6 for more detail.

A quantity that is closely related to the KL divergence is the cross-entropy H(P, Q) = H(P) + D_KL(P‖Q), which is similar to the KL divergence but lacking the term on the left:

H(P, Q) = −E_{x∼P} log Q(x).    (3.51)

Minimizing the cross-entropy with respect to Q is equivalent to minimizing the KL divergence, because Q does not participate in the omitted term.

When computing many of these quantities, it is common to encounter expressions of the form 0 log 0. By convention, in the context of information theory, we treat these expressions as lim_{x→0} x log x = 0.
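A sketch demonstrating the asymmetry of the KL divergence and the decomposition H(P, Q) = H(P) + D_KL(P‖Q), with made-up distributions:

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl(p, q):
    return sum(pi * (math.log(pi) - math.log(qi))
               for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    # H(P, Q) = -E_{x~P} log Q(x)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.8, 0.1, 0.1]
q = [0.5, 0.25, 0.25]

# Asymmetry: D_KL(P||Q) != D_KL(Q||P) in general.
print(kl(p, q), kl(q, p))  # two different non-negative numbers
# Cross-entropy decomposes as H(P, Q) = H(P) + D_KL(P||Q).
print(abs(cross_entropy(p, q) - (entropy(p) + kl(p, q))) < 1e-12)  # True
```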
3.14 Structured Probabilistic Models

Machine learning algorithms often involve probability distributions over a very large number of random variables. Often, these probability distributions involve direct interactions between relatively few variables. Using a single function to describe the entire joint probability distribution can be very inefficient (both computationally and statistically).

Instead of using a single function to represent a probability distribution, we can split a probability distribution into many factors that we multiply together. For example, suppose we have three random variables: a, b and c. Suppose that a influences the value of b and b influences the value of c, but that a and c are independent given b. We can represent the probability distribution over all three
Figure 3.6: The KL divergence is asymmetric. Suppose we have a distribution p(x) and wish to approximate it with another distribution q(x). We have the choice of minimizing either D_KL(p‖q) or D_KL(q‖p). We illustrate the effect of this choice using a mixture of two Gaussians for p, and a single Gaussian for q. The choice of which direction of the KL divergence to use is problem-dependent. Some applications require an approximation that usually places high probability anywhere that the true distribution places high probability, while other applications require an approximation that rarely places high probability anywhere that the true distribution places low probability. The choice of the direction of the KL divergence reflects which of these considerations takes priority for each application. (Left) The effect of minimizing D_KL(p‖q). In this case, we select a q that has high probability where p has high probability. When p has multiple modes, q chooses to blur the modes together, in order to put high probability mass on all of them. (Right) The effect of minimizing D_KL(q‖p). In this case, we select a q that has low probability where p has low probability. When p has multiple modes that are sufficiently widely separated, as in this figure, the KL divergence is minimized by choosing a single mode, in order to avoid putting probability mass in the low-probability areas between modes of p. Here, we illustrate the outcome when q is chosen to emphasize the left mode. We could also have achieved an equal value of the KL divergence by choosing the right mode. If the modes are not separated by a sufficiently strong low-probability region, then this direction of the KL divergence can still choose to blur the modes.
CHAPTER 3. PROBABILITY AND INFORMATION THEORY
variables as a product of probability distributions over two variables:

p(a, b, c) = p(a)p(b | a)p(c | b).    (3.52)

These factorizations can greatly reduce the number of parameters needed to describe the distribution. Each factor uses a number of parameters that is exponential in the number of variables in the factor. This means that we can greatly reduce the cost of representing a distribution if we are able to find a factorization into distributions over fewer variables.

We can describe these kinds of factorizations using graphs. Here we use the word "graph" in the sense of graph theory: a set of vertices that may be connected to each other with edges. When we represent the factorization of a probability distribution with a graph, we call it a structured probabilistic model or graphical model.

There are two main kinds of structured probabilistic models: directed and undirected. Both kinds of graphical models use a graph G in which each node in the graph corresponds to a random variable, and an edge connecting two random variables means that the probability distribution is able to represent direct interactions between those two random variables.

Directed models use graphs with directed edges, and they represent factorizations into conditional probability distributions, as in the example above. Specifically, a directed model contains one factor for every random variable x_i in the distribution, and that factor consists of the conditional distribution over x_i given the parents of x_i, denoted Pa_G(x_i):

p(x) = ∏_i p(x_i | Pa_G(x_i)).    (3.53)

See Fig. 3.7 for an example of a directed graph and the factorization of probability distributions it represents.

Undirected models use graphs with undirected edges, and they represent factorizations into a set of functions; unlike in the directed case, these functions are usually not probability distributions of any kind. Any set of nodes that are all connected to each other in G is called a clique. Each clique C^(i) in an undirected model is associated with a factor φ^(i)(C^(i)). These factors are just functions, not probability distributions. The output of each factor must be nonnegative, but there is no constraint that the factor must sum or integrate to 1 like a probability distribution.

The probability of a configuration of random variables is proportional to the product of all of these factors: assignments that result in larger factor values are
Figure 3.7: A directed graphical model over random variables a, b, c, d and e. This graph corresponds to probability distributions that can be factored as

p(a, b, c, d, e) = p(a)p(b | a)p(c | a, b)p(d | b)p(e | c).    (3.54)

This graph allows us to quickly see some properties of the distribution. For example, a and c interact directly, but a and e interact only indirectly via c.
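To make the parameter savings from such a factorization concrete, consider the case where every variable in Eq. 3.54 is binary. The sketch below is an illustration of the counting argument, not code from the book; the helper names are hypothetical.

```python
# Count free parameters for binary variables under Eq. 3.54.
# Parent counts follow the factorization p(a)p(b|a)p(c|a,b)p(d|b)p(e|c).

def full_joint_params(n_vars):
    # A full joint table over n binary variables has 2^n - 1 free parameters.
    return 2 ** n_vars - 1

def factor_params(n_parents):
    # A conditional distribution over a binary child needs one free
    # parameter per configuration of its parents: 2^n_parents.
    return 2 ** n_parents

parent_counts = {"a": 0, "b": 1, "c": 2, "d": 1, "e": 1}
factored = sum(factor_params(k) for k in parent_counts.values())

print(full_joint_params(5))  # 31 for the unstructured joint
print(factored)              # 1 + 2 + 4 + 2 + 2 = 11 for the factored form
```

The gap widens exponentially: with more variables of small in-degree, the factored representation stays linear-ish in the number of factors while the full table doubles per variable.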
more likely. Of course, there is no guarantee that this product will sum to 1. We therefore divide by a normalizing constant Z, defined to be the sum or integral over all states of the product of the φ functions, in order to obtain a normalized probability distribution:

p(x) = (1/Z) ∏_i φ^(i)(C^(i)).    (3.55)

See Fig. 3.8 for an example of an undirected graph and the factorization of probability distributions it represents.

Keep in mind that these graphical representations of factorizations are a language for describing probability distributions. They are not mutually exclusive families of probability distributions. Being directed or undirected is not a property of a probability distribution; it is a property of a particular description of a probability distribution, but any probability distribution may be described in both ways.

Throughout Part I and Part II of this book, we will use structured probabilistic models merely as a language to describe which direct probabilistic relationships different machine learning algorithms choose to represent. No further understanding of structured probabilistic models is needed until the discussion of research topics, in Part III, where we will explore structured probabilistic models in much greater detail.
Figure 3.8: An undirected graphical model over random variables a, b, c, d and e. This graph corresponds to probability distributions that can be factored as

p(a, b, c, d, e) = (1/Z) φ^(1)(a, b, c) φ^(2)(b, d) φ^(3)(c, e).    (3.56)

This graph allows us to quickly see some properties of the distribution. For example, a and c interact directly, but a and e interact only indirectly via c.
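As a concrete instance of Eqs. 3.55 and 3.56, the sketch below picks arbitrary nonnegative clique factors over binary variables and computes Z by brute-force summation. The factor definitions are illustrative assumptions, not part of the book's example.

```python
# Normalize the undirected model of Eq. 3.56 by brute-force summation
# over all 2^5 joint states of binary variables a, b, c, d, e.
import itertools

def phi1(a, b, c): return 1.0 + a + b + c   # factor on clique {a, b, c}
def phi2(b, d):    return 1.0 + 2 * b * d   # factor on clique {b, d}
def phi3(c, e):    return 1.0 + c + e       # factor on clique {c, e}

def unnormalized(a, b, c, d, e):
    # Product of all clique factors: proportional to p(a, b, c, d, e).
    return phi1(a, b, c) * phi2(b, d) * phi3(c, e)

# Z is the sum of the product of factors over all states (Eq. 3.55).
Z = sum(unnormalized(*s) for s in itertools.product([0, 1], repeat=5))

def p(a, b, c, d, e):
    return unnormalized(a, b, c, d, e) / Z

total = sum(p(*s) for s in itertools.product([0, 1], repeat=5))
print(round(total, 10))  # 1.0: the normalized probabilities sum to one
```

Brute-force summation is only feasible for toy models; for large n, computing Z exactly is generally intractable, which is one reason undirected models are harder to work with than directed ones.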
This chapter has reviewed the basic concepts of probability theory that are most relevant to deep learning. One more set of fundamental mathematical tools remains: numerical methods.
Chapter 4
Numerical Computation

Machine learning algorithms usually require a high amount of numerical computation. This typically refers to algorithms that solve mathematical problems by methods that update estimates of the solution via an iterative process, rather than analytically deriving a formula providing a symbolic expression for the correct solution. Common operations include optimization (finding the value of an argument that minimizes or maximizes a function) and solving systems of linear equations. Even just evaluating a mathematical function on a digital computer can be difficult when the function involves real numbers, which cannot be represented precisely using a finite amount of memory.
4.1 Overflow and Underflow
The fundamental difficulty in performing continuous math on a digital computer is that we need to represent infinitely many real numbers with a finite number of bit patterns. This means that for almost all real numbers, we incur some approximation error when we represent the number in the computer. In many cases, this is just rounding error. Rounding error is problematic, especially when it compounds across many operations, and can cause algorithms that work in theory to fail in practice if they are not designed to minimize the accumulation of rounding error.

One form of rounding error that is particularly devastating is underflow. Underflow occurs when numbers near zero are rounded to zero. Many functions behave qualitatively differently when their argument is zero rather than a small positive number. For example, we usually want to avoid division by zero (some software
environments will raise exceptions when this occurs, others will return a result with a placeholder not-a-number value) or taking the logarithm of zero (this is usually treated as −∞, which then becomes not-a-number if it is used for many further arithmetic operations).

Another highly damaging form of numerical error is overflow. Overflow occurs when numbers with large magnitude are approximated as ∞ or −∞. Further arithmetic will usually change these infinite values into not-a-number values.

One example of a function that must be stabilized against underflow and overflow is the softmax function. The softmax function is often used to predict the probabilities associated with a multinoulli distribution. The softmax function is defined to be

softmax(x)_i = exp(x_i) / ∑_{j=1}^{n} exp(x_j).    (4.1)

Consider what happens when all of the x_i are equal to some constant c. Analytically, we can see that all of the outputs should be equal to 1/n. Numerically, this may not occur when c has large magnitude. If c is very negative, then exp(c) will underflow. This means the denominator of the softmax will become 0, so the final result is undefined. When c is very large and positive, exp(c) will overflow, again resulting in the expression as a whole being undefined. Both of these difficulties can be resolved by instead evaluating softmax(z) where z = x − max_i x_i. Simple algebra shows that the value of the softmax function is not changed analytically by adding or subtracting a scalar from the input vector. Subtracting max_i x_i results in the largest argument to exp being 0, which rules out the possibility of overflow. Likewise, at least one term in the denominator has a value of 1, which rules out the possibility of underflow in the denominator leading to a division by zero.

There is still one small problem. Underflow in the numerator can still cause the expression as a whole to evaluate to zero. This means that if we implement log softmax(x) by first running the softmax subroutine then passing the result to the log function, we could erroneously obtain −∞. Instead, we must implement a separate function that calculates log softmax in a numerically stable way. The log softmax function can be stabilized using the same trick as we used to stabilize the softmax function.

For the most part, we do not explicitly detail all of the numerical considerations involved in implementing the various algorithms described in this book. Developers of low-level libraries should keep numerical issues in mind when implementing deep learning algorithms. Most readers of this book can simply rely on low-level libraries that provide stable implementations. In some cases, it is possible to implement a new algorithm and have the new implementation automatically
stabilized. Theano (Bergstra et al., 2010; Bastien et al., 2012) is an example of a software package that automatically detects and stabilizes many common numerically unstable expressions that arise in the context of deep learning.
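The stabilization trick described above can be sketched in a few lines. This is a minimal illustration using NumPy, not the implementation used by any particular library:

```python
# Numerically stable softmax and log-softmax via the max-subtraction trick.
import numpy as np

def softmax(x):
    # Subtracting max(x) leaves the result analytically unchanged but rules
    # out overflow; the shifted maximum contributes exp(0) = 1 to the
    # denominator, ruling out division by zero.
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

def log_softmax(x):
    # Computed directly rather than as log(softmax(x)), so that underflow
    # in the numerator cannot erroneously produce log(0) = -inf.
    z = x - np.max(x)
    return z - np.log(np.sum(np.exp(z)))

x = np.array([1000.0, 1000.0, 1000.0])  # naive exp(1000.0) would overflow
print(softmax(x))      # each entry is 1/3, as the analytic answer requires
print(log_softmax(x))  # each entry is -log(3)
```

In practice, a library routine such as `scipy.special.logsumexp` applies the same trick and is usually preferable to a hand-rolled version.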
4.2 Poor Conditioning
Conditioning refers to how rapidly a function changes with respect to small changes in its inputs. Functions that change rapidly when their inputs are perturbed slightly can be problematic for scientific computation because rounding errors in the inputs can result in large changes in the output.

Consider the function f(x) = A⁻¹x. When A ∈ ℝ^(n×n) has an eigenvalue decomposition, its condition number is

max_{i,j} |λ_i / λ_j|.    (4.2)

This is the ratio of the magnitude of the largest and smallest eigenvalue. When this number is large, matrix inversion is particularly sensitive to error in the input.

This sensitivity is an intrinsic property of the matrix itself, not the result of rounding error during matrix inversion. Poorly conditioned matrices amplify pre-existing errors when we multiply by the true matrix inverse. In practice, the error will be compounded further by numerical errors in the inversion process itself.
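Eq. 4.2 is easy to evaluate numerically. The sketch below, assuming NumPy and arbitrarily chosen diagonal matrices, contrasts a well-conditioned matrix with an ill-conditioned one:

```python
# Condition number as in Eq. 4.2: the ratio of the largest to smallest
# eigenvalue magnitude.
import numpy as np

def condition_number(A):
    mags = np.abs(np.linalg.eigvals(A))
    return mags.max() / mags.min()

well = np.diag([1.0, 2.0])   # eigenvalues 1 and 2
ill = np.diag([1.0, 1e-8])   # eigenvalues 1 and 1e-8

print(condition_number(well))  # 2.0
print(condition_number(ill))   # 1e8: inversion greatly amplifies input error
```

For general matrices, NumPy also provides `np.linalg.cond`, whose default is based on the ratio of singular values rather than eigenvalues.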
4.3 Gradient-Based Optimization
Most deep learning algorithms involve optimization of some sort. Optimization refers to the task of either minimizing or maximizing some function f(x) by altering x. We usually phrase most optimization problems in terms of minimizing f(x). Maximization may be accomplished via a minimization algorithm by minimizing −f(x).

The function we want to minimize or maximize is called the objective function or criterion. When we are minimizing it, we may also call it the cost function, loss function, or error function. In this book, we use these terms interchangeably, though some machine learning publications assign special meaning to some of these terms.

We often denote the value that minimizes or maximizes a function with a superscript ∗. For example, we might say x∗ = arg min f(x).
[Figure 4.1 plot: f(x) = (1/2)x² and its derivative f'(x) = x. The global minimum is at x = 0; since f'(0) = 0, gradient descent halts there. For x < 0, we have f'(x) < 0, so we can decrease f by moving rightward; for x > 0, we have f'(x) > 0, so we can decrease f by moving leftward.]
Figure 4.1: An illustration of how the derivatives of a function can be used to follow the function downhill to a minimum. This technique is called gradient descent.
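The procedure illustrated in Fig. 4.1 can be sketched directly. In this minimal example the step size and iteration count are chosen arbitrarily; we descend f(x) = x²/2, whose derivative is f'(x) = x:

```python
# One-dimensional gradient descent on f(x) = x**2 / 2, as in Fig. 4.1.

def f_prime(x):
    # Derivative of f(x) = x**2 / 2.
    return x

x = -2.0        # start to the left of the minimum
epsilon = 0.1   # small, arbitrarily chosen step size

for _ in range(200):
    # Move in small steps with the opposite sign of the derivative.
    x = x - epsilon * f_prime(x)

print(x)  # close to the global minimum at x = 0
```

Each update here multiplies x by (1 − epsilon), so the iterate shrinks geometrically toward the critical point where f'(x) = 0.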
We assume the reader is already familiar with calculus, but provide a brief review of how calculus concepts relate to optimization here.

Suppose we have a function y = f(x), where both x and y are real numbers. The derivative of this function is denoted as f'(x) or as dy/dx. The derivative f'(x) gives the slope of f(x) at the point x. In other words, it specifies how to scale a small change in the input in order to obtain the corresponding change in the output: f(x + ε) ≈ f(x) + εf'(x).

The derivative is therefore useful for minimizing a function because it tells us how to change x in order to make a small improvement in y. For example, we know that f(x − ε sign(f'(x))) is less than f(x) for small enough ε. We can thus reduce f(x) by moving x in small steps with the opposite sign of the derivative. This technique is called gradient descent (Cauchy, 1847). See Fig. 4.1 for an example of this technique.

When f'(x) = 0, the derivative provides no information about which direction to move. Points where f'(x) = 0 are known as critical points or stationary points. A local minimum is a point where f(x) is lower than at all neighboring points, so it is no longer possible to decrease f(x) by making infinitesimal steps. A local maximum is a point where f(x) is higher than at all neighboring points, so it is
Figure 4.2: Examples of each of the three types of critical points in 1D. A critical point is a point with zero slope. Such a point can either be a local minimum, which is lower than the neighboring points, a local maximum, which is higher than the neighboring points, or a saddle point, which has neighbors that are both higher and lower than the point itself.
not possible to increase f(x) by making infinitesimal steps. Some critical points are neither maxima nor minima. These are known as saddle points. See Fig. 4.2 for examples of each type of critical point.

A point that obtains the absolute lowest value of f(x) is a global minimum. It is possible for there to be only one global minimum or multiple global minima of the function. It is also possible for there to be local minima that are not globally optimal. In the context of deep learning, we optimize functions that may have many local minima that are not optimal, and many saddle points surrounded by very flat regions. All of this makes optimization very difficult, especially when the input to the function is multidimensional. We therefore usually settle for finding a value of f that is very low, but not necessarily minimal in any formal sense. See Fig. 4.3 for an example.

We often minimize functions that have multiple inputs: f : ℝⁿ → ℝ. For the concept of "minimization" to make sense, there must still be only one (scalar) output.

For functions with multiple inputs, we must make use of the concept of partial derivatives. The partial derivative ∂/∂x_i f(x) measures how f changes as only the variable x_i increases at point x. The gradient generalizes the notion of derivative to the case where the derivative is with respect to a vector: the gradient of f is the vector containing all of the partial derivatives, denoted ∇_x f(x). Element i of the gradient is the partial derivative of f with respect to x_i. In multiple dimensions,
CHAPTER 4. NUMERICAL COMPUTATION
[Figure 4.3 plot: approximate minimization of a one-dimensional cost function f(x). Annotations: "Ideally, we would like to arrive at the global minimum, but this might not be possible." "This local minimum performs nearly as well as the global one, so it is an acceptable halting point." "This local minimum performs poorly, and should be avoided."]
Figure 4.3: Optimization algorithms may fail to find a global minimum when there are multiple local minima or plateaus present. In the context of deep learning, we generally accept such solutions even though they are not truly minimal, so long as they correspond to significantly low values of the cost function.
critical points are points where every element of the gradient is equal to zero.

The directional derivative in direction u (a unit vector) is the slope of the function f in direction u. In other words, the directional derivative is the derivative of the function f(x + αu) with respect to α, evaluated at α = 0. Using the chain rule, we can see that ∂/∂α f(x + αu), evaluated at α = 0, is u⊤∇_x f(x).

To minimize f, we would like to find the direction in which f decreases the fastest. We can do this using the directional derivative:

    min_{u, u⊤u=1} u⊤∇_x f(x)                              (4.3)
    = min_{u, u⊤u=1} ||u||₂ ||∇_x f(x)||₂ cos θ            (4.4)

where θ is the angle between u and the gradient. Substituting in ||u||₂ = 1 and ignoring factors that do not depend on u, this simplifies to min_u cos θ. This is minimized when u points in the opposite direction as the gradient. In other words, the gradient points directly uphill, and the negative gradient points directly downhill. We can decrease f by moving in the direction of the negative gradient. This is known as the method of steepest descent or gradient descent.

Steepest descent proposes a new point

    x′ = x − ε∇_x f(x)                                     (4.5)
where ε is the learning rate, a positive scalar determining the size of the step. We can choose ε in several different ways. A popular approach is to set ε to a small constant. Sometimes, we can solve for the step size that makes the directional derivative vanish. Another approach is to evaluate f(x − ε∇_x f(x)) for several values of ε and choose the one that results in the smallest objective function value. This last strategy is called a line search.

Steepest descent converges when every element of the gradient is zero (or, in practice, very close to zero). In some cases, we may be able to avoid running this iterative algorithm, and just jump directly to the critical point by solving the equation ∇_x f(x) = 0 for x.

Although gradient descent is limited to optimization in continuous spaces, the general concept of making small moves (that are approximately the best small move) towards better configurations can be generalized to discrete spaces. Ascending an objective function of discrete parameters is called hill climbing (Russel and Norvig, 2003).
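The gradient descent update of Eq. 4.5, combined with the line search strategy described above, can be sketched in a few lines of NumPy. This is a minimal illustration only; the quadratic objective, the candidate step sizes, and the stopping tolerance are our own choices, not from the text:

```python
import numpy as np

def gradient_descent(f, grad, x0, step_sizes=(1e-3, 1e-2, 1e-1),
                     n_steps=1000, tol=1e-8):
    """Minimize f by steepest descent with a crude line search:
    at each step, evaluate f(x - eps * grad) for several learning
    rates eps and keep the candidate with the smallest value."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        g = grad(x)
        if np.linalg.norm(g) < tol:      # gradient ~ 0: (near-)critical point
            break
        # line search over a handful of candidate step sizes
        candidates = [x - eps * g for eps in step_sizes]
        x = min(candidates, key=f)
    return x

# Example: f(x) = 1/2 x^T A x with a positive definite A; minimum at the origin.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
x_min = gradient_descent(f, grad, x0=[5.0, -3.0])
```

On this positive definite quadratic the iterates converge to the unique global minimum at the origin; the stopping test mirrors the convergence criterion above (every element of the gradient close to zero).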
4.3.1 Beyond the Gradient: Jacobian and Hessian Matrices
Sometimes we need to find all of the partial derivatives of a function whose input and output are both vectors. The matrix containing all such partial derivatives is known as a Jacobian matrix. Specifically, if we have a function f : ℝᵐ → ℝⁿ, then the Jacobian matrix J ∈ ℝⁿˣᵐ of f is defined such that J_{i,j} = ∂f(x)_i / ∂x_j.

We are also sometimes interested in a derivative of a derivative. This is known as a second derivative. For example, for a function f : ℝⁿ → ℝ, the derivative with respect to x_i of the derivative of f with respect to x_j is denoted as ∂²f / (∂x_i ∂x_j). In a single dimension, we can denote d²f/dx² by f″(x). The second derivative tells us how the first derivative will change as we vary the input. This is important
because it tells us whether a gradient step will cause as much of an improvement as we would expect based on the gradient alone. We can think of the second derivative as measuring curvature. Suppose we have a quadratic function (many functions that arise in practice are not quadratic but can be approximated well as quadratic, at least locally). If such a function has a second derivative of zero, then there is no curvature. It is a perfectly flat line, and its value can be predicted using only the gradient. If the gradient is 1, then we can make a step of size ε along the negative gradient, and the cost function will decrease by ε. If the second derivative is negative, the function curves downward, so the cost function will actually decrease by more than ε. Finally, if the second derivative is positive, the function curves upward, so the cost function can decrease by less than ε. See Fig.
[Figure 4.4 plot: three quadratic functions f(x) plotted against x, showing negative curvature, no curvature, and positive curvature.]
Figure 4.4: The second derivative determines the curvature of a function. Here we show quadratic functions with various curvature. The dashed line indicates the value of the cost function we would expect based on the gradient information alone as we make a gradient step downhill. In the case of negative curvature, the cost function actually decreases faster than the gradient predicts. In the case of no curvature, the gradient predicts the decrease correctly. In the case of positive curvature, the function decreases slower than expected and eventually begins to increase, so too large of step sizes can actually increase the function inadvertently.
4.4 to see how different forms of curvature affect the relationship between the value of the cost function predicted by the gradient and the true value.

When our function has multiple input dimensions, there are many second derivatives. These derivatives can be collected together into a matrix called the Hessian matrix. The Hessian matrix H(f)(x) is defined such that

    H(f)(x)_{i,j} = ∂²f(x) / (∂x_i ∂x_j).                  (4.6)

Equivalently, the Hessian is the Jacobian of the gradient.

Anywhere that the second partial derivatives are continuous, the differential operators are commutative, i.e. their order can be swapped:

    ∂²f(x) / (∂x_i ∂x_j) = ∂²f(x) / (∂x_j ∂x_i).           (4.7)

This implies that H_{i,j} = H_{j,i}, so the Hessian matrix is symmetric at such points. Most of the functions we encounter in the context of deep learning have a symmetric Hessian almost everywhere. Because the Hessian matrix is real and symmetric, we can decompose it into a set of real eigenvalues and an orthogonal basis of
eigenvectors. The second derivative in a specific direction represented by a unit vector d is given by d⊤Hd. When d is an eigenvector of H, the second derivative in that direction is given by the corresponding eigenvalue. For other directions of d, the directional second derivative is a weighted average of all of the eigenvalues, with weights between 0 and 1, and eigenvectors that have smaller angle with d receiving more weight. The maximum eigenvalue determines the maximum second derivative and the minimum eigenvalue determines the minimum second derivative.

The (directional) second derivative tells us how well we can expect a gradient descent step to perform. We can make a second-order Taylor series approximation to the function f(x) around the current point x^(0):

    f(x) ≈ f(x^(0)) + (x − x^(0))⊤g + ½(x − x^(0))⊤H(x − x^(0)),   (4.8)

where g is the gradient and H is the Hessian at x^(0). If we use a learning rate of ε, then the new point x will be given by x^(0) − εg. Substituting this into our approximation, we obtain

    f(x^(0) − εg) ≈ f(x^(0)) − εg⊤g + ½ε²g⊤Hg.             (4.9)

There are three terms here: the original value of the function, the expected improvement due to the slope of the function, and the correction we must apply to account for the curvature of the function. When this last term is too large, the gradient descent step can actually move uphill. When g⊤Hg is zero or negative, the Taylor series approximation predicts that increasing ε forever will decrease f forever. In practice, the Taylor series is unlikely to remain accurate for large ε, so one must resort to more heuristic choices of ε in this case. When g⊤Hg is positive, solving for the optimal step size that decreases the Taylor series approximation of the function the most yields

    ε* = g⊤g / (g⊤Hg).                                     (4.10)

In the worst case, when g aligns with the eigenvector of H corresponding to the maximal eigenvalue λ_max, then this optimal step size is given by 1/λ_max. To the extent that the function we minimize can be approximated well by a quadratic function, the eigenvalues of the Hessian thus determine the scale of the learning rate.

The second derivative can be used to determine whether a critical point is a local maximum, a local minimum, or saddle point. Recall that on a critical point, f′(x) = 0. When f″(x) > 0, this means that f′(x) increases as we move to the right, and f′(x) decreases as we move to the left. This means f′(x − ε) < 0 and
f′(x + ε) > 0 for small enough ε. In other words, as we move right, the slope begins to point uphill to the right, and as we move left, the slope begins to point uphill to the left. Thus, when f′(x) = 0 and f″(x) > 0, we can conclude that x is a local minimum. Similarly, when f′(x) = 0 and f″(x) < 0, we can conclude that x is a local maximum. This is known as the second derivative test. Unfortunately, when f″(x) = 0, the test is inconclusive. In this case x may be a saddle point, or a part of a flat region.

In multiple dimensions, we need to examine all of the second derivatives of the function. Using the eigendecomposition of the Hessian matrix, we can generalize the second derivative test to multiple dimensions. At a critical point, where
∇_x f(x) = 0, we can examine the eigenvalues of the Hessian to determine whether the critical point is a local maximum, local minimum, or saddle point. When the Hessian is positive definite (all its eigenvalues are positive), the point is a local minimum. This can be seen by observing that the directional second derivative in any direction must be positive, and making reference to the univariate second derivative test. Likewise, when the Hessian is negative definite (all its eigenvalues are negative), the point is a local maximum. In multiple dimensions, it is actually possible to find positive evidence of saddle points in some cases. When at least one eigenvalue is positive and at least one eigenvalue is negative, we know that x is a local maximum on one cross section of f but a local minimum on another cross section. See Fig. 4.5 for an example.
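The eigenvalue-based test described above can be sketched as a small function. This is an illustration only; the function name, the zero tolerance, and the example Hessians are our own choices, not from the text:

```python
import numpy as np

def classify_critical_point(H, tol=1e-10):
    """Multidimensional second derivative test: examine the signs of
    the eigenvalues of the (symmetric) Hessian H at a critical point."""
    eigvals = np.linalg.eigvalsh(H)      # real eigenvalues, ascending order
    if np.any(np.abs(eigvals) < tol):
        return "inconclusive"            # at least one (near-)zero eigenvalue
    if np.all(eigvals > 0):
        return "local minimum"           # positive definite
    if np.all(eigvals < 0):
        return "local maximum"           # negative definite
    return "saddle point"                # mixed signs

# Hessian of f(x) = x1^2 - x2^2 (the saddle of Fig. 4.5) at the origin:
H_saddle = np.array([[2.0, 0.0], [0.0, -2.0]])
print(classify_critical_point(H_saddle))   # saddle point
```

Since the Hessian is symmetric, `eigvalsh` is the appropriate routine: it exploits symmetry and guarantees real eigenvalues.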
Finally, the multidimensional second derivative test can be inconclusive, just like the univariate version. The test is inconclusive whenever all of the nonzero eigenvalues have the same sign, but at least one eigenvalue is zero. This is because the univariate second derivative test is inconclusive in the cross section corresponding to the zero eigenvalue.

In multiple dimensions, there can be a wide variety of different second derivatives at a single point, because there is a different second derivative for each direction. The condition number of the Hessian measures how much the second derivatives vary. When the Hessian has a poor condition number, gradient descent performs poorly. This is because in one direction, the derivative increases rapidly, while in another direction, it increases slowly.
Gradient descent is unaware of this change in the derivative, so it does not know that it needs to explore preferentially in the direction where the derivative remains negative for longer. It also makes it difficult to choose a good step size. The step size must be small enough to avoid overshooting the minimum and going uphill in directions with strong positive curvature. This usually means that the step size is too small to make significant progress in other directions with less curvature. See Fig. 4.6 for an example.

This issue can be resolved by using information from the Hessian matrix to
Figure 4.5: A saddle point containing both positive and negative curvature. The function in this example is f(x) = x₁² − x₂². Along the axis corresponding to x₁, the function curves upward. This axis is an eigenvector of the Hessian and has a positive eigenvalue. Along the axis corresponding to x₂, the function curves downward. This direction is an eigenvector of the Hessian with negative eigenvalue. The name "saddle point" derives from the saddle-like shape of this function. This is the quintessential example of a function with a saddle point. In more than one dimension, it is not necessary to have an eigenvalue of 0 in order to get a saddle point: it is only necessary to have both positive and negative eigenvalues. We can think of a saddle point with both signs of eigenvalues as being a local maximum within one cross section and a local minimum within another cross section.
Figure 4.6: Gradient descent fails to exploit the curvature information contained in the Hessian matrix. Here we use gradient descent to minimize a quadratic function f(x) whose Hessian matrix has condition number 5. This means that the direction of most curvature has five times more curvature than the direction of least curvature. In this case, the most curvature is in the direction [1, 1]⊤ and the least curvature is in the direction [1, −1]⊤. The red lines indicate the path followed by gradient descent. This very elongated quadratic function resembles a long canyon. Gradient descent wastes time repeatedly descending canyon walls, because they are the steepest feature. Because the step size is somewhat too large, it has a tendency to overshoot the bottom of the function and thus needs to descend the opposite canyon wall on the next iteration. The large positive eigenvalue of the Hessian corresponding to the eigenvector pointed in this direction indicates that this directional derivative is rapidly increasing, so an optimization algorithm based on the Hessian could predict that the steepest direction is not actually a promising search direction in this context.
guide the search. The simplest method for doing so is known as Newton's method. Newton's method is based on using a second-order Taylor series expansion to approximate f(x) near some point x^(0):

    f(x) ≈ f(x^(0)) + (x − x^(0))⊤∇_x f(x^(0)) + ½(x − x^(0))⊤H(f)(x^(0))(x − x^(0)).   (4.11)

If we then solve for the critical point of this function, we obtain:

    x* = x^(0) − H(f)(x^(0))⁻¹∇_x f(x^(0)).                (4.12)

When f is a positive definite quadratic function, Newton's method consists of applying Eq. 4.12 once to jump to the minimum of the function directly. When f is not truly quadratic but can be locally approximated as a positive definite quadratic, Newton's method consists of applying Eq. 4.12 multiple times. Iteratively updating the approximation and jumping to the minimum of the approximation can reach the critical point much faster than gradient descent would. This is a useful property near a local minimum, but it can be a harmful property near a saddle point.
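Eq. 4.12 translates directly into an iterative procedure. The sketch below is our own minimal NumPy illustration; the solver-based update and the quadratic example are assumptions, not from the text:

```python
import numpy as np

def newtons_method(grad, hess, x0, n_steps=20, tol=1e-10):
    """Newton's method: repeatedly jump to the critical point of the
    local second-order Taylor approximation (Eq. 4.12)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # x <- x - H(f)(x)^{-1} grad f(x); solve a linear system
        # rather than explicitly forming the inverse Hessian
        x = x - np.linalg.solve(hess(x), g)
    return x

# For a positive definite quadratic f(x) = 1/2 x^T A x - b^T x,
# a single Newton step jumps to the exact minimum x* = A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda x: A @ x - b
hess = lambda x: A
x_star = newtons_method(grad, hess, x0=[10.0, 10.0])
```

On this quadratic the first iteration lands on the minimum (up to rounding), matching the discussion above; for non-quadratic functions the same update is simply repeated.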
As discussed in Sec. 8.2.3, Newton's method is only appropriate when the nearby critical point is a minimum (all the eigenvalues of the Hessian are positive), whereas gradient descent is not attracted to saddle points unless the gradient points toward them.

Optimization algorithms such as gradient descent that use only the gradient are called first-order optimization algorithms. Optimization algorithms such as Newton's method that also use the Hessian matrix are called second-order optimization algorithms (Nocedal and Wright, 2006).

The optimization algorithms employed in most contexts in this book are applicable to a wide variety of functions, but come with almost no guarantees. This is because the family of functions used in deep learning is quite complicated. In many other fields, the dominant approach to optimization is to design optimization algorithms for a limited family of functions.
In the context of deep learning, we sometimes gain some guarantees by restricting ourselves to functions that are either Lipschitz continuous or have Lipschitz continuous derivatives. A Lipschitz continuous function is a function f whose rate of change is bounded by a Lipschitz constant L:

∀x, ∀y, |f(x) − f(y)| ≤ L ||x − y||_2.   (4.13)

This property is useful because it allows us to quantify our assumption that a small change in the input made by an algorithm such as gradient descent will have a small change in the output. Lipschitz continuity is also a fairly weak constraint,
and many optimization problems in deep learning can be made Lipschitz continuous with relatively minor modifications.

Perhaps the most successful field of specialized optimization is convex optimization. Convex optimization algorithms are able to provide many more guarantees by making stronger restrictions. Convex optimization algorithms are applicable only to convex functions—functions for which the Hessian is positive semidefinite everywhere. Such functions are well-behaved because they lack saddle points and all of their local minima are necessarily global minima. However, most problems in deep learning are difficult to express in terms of convex optimization. Convex optimization is used only as a subroutine of some deep learning algorithms. Ideas from the analysis of convex optimization algorithms can be useful for proving the
convergence of deep learning algorithms. However, in general, the importance of convex optimization is greatly diminished in the context of deep learning. For more information about convex optimization, see Boyd and Vandenberghe (2004) or Rockafellar (1997).
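As a quick numerical illustration of the Lipschitz condition in Eq. 4.13: f = sin has derivative bounded by 1 everywhere, so L = 1 is a valid Lipschitz constant for it. The sampled points below are an arbitrary choice made for this sketch:

```python
import numpy as np

# Check |f(x) - f(y)| <= L |x - y| for f = sin on many random point pairs.
# (For scalar inputs the L2 norm of Eq. 4.13 reduces to absolute value.)
rng = np.random.default_rng(0)
x, y = rng.normal(size=1000), rng.normal(size=1000)
L = 1.0
assert np.all(np.abs(np.sin(x) - np.sin(y)) <= L * np.abs(x - y) + 1e-12)
```

A bound of this form is what lets one relate the size of a gradient descent step to the size of the resulting change in function value.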
4.4
Constrained Optimization
Sometimes we wish not only to maximize or minimize a function f(x) over all possible values of x. Instead we may wish to find the maximal or minimal value of f(x) for values of x in some set S. This is known as constrained optimization. Points x that lie within the set S are called feasible points in constrained optimization terminology.

We often wish to find a solution that is small in some sense. A common approach in such situations is to impose a norm constraint, such as ||x|| ≤ 1.

One simple approach to constrained optimization is simply to modify gradient descent taking the constraint into account. If we use a small constant step size ε, we can make gradient descent steps, then project the result back into S. If we use a line search, we can search only over step sizes that yield new x points that are feasible, or we can project each point on the line back into the constraint region.
When possible, this method can be made more efficient by projecting the gradient into the tangent space of the feasible region before taking the step or beginning the line search (Rosen, 1960).

A more sophisticated approach is to design a different, unconstrained optimization problem whose solution can be converted into a solution to the original, constrained optimization problem. For example, if we want to minimize f(x) for x ∈ R² with x constrained to have exactly unit L² norm, we can instead minimize
g(θ) = f([cos θ, sin θ]^T) with respect to θ, then return [cos θ, sin θ] as the solution to the original problem. This approach requires creativity; the transformation between optimization problems must be designed specifically for each case we encounter.

The Karush–Kuhn–Tucker (KKT) approach¹ provides a very general solution to constrained optimization. With the KKT approach, we introduce a new function called the generalized Lagrangian or generalized Lagrange function.

To define the Lagrangian, we first need to describe S in terms of equations and inequalities. We want a description of S in terms of m functions g^(i) and n functions h^(j) so that S = {x | ∀i, g^(i)(x) = 0 and ∀j, h^(j)(x) ≤ 0}. The equations involving g are called the equality constraints and the inequalities involving h are called inequality constraints.
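The unit-circle reparameterization g(θ) = f([cos θ, sin θ]^T) described above can be sketched numerically. The particular f below, and the dense grid search over θ standing in for a proper unconstrained optimizer, are illustrative assumptions:

```python
import numpy as np

# Minimize f(x) over the unit circle ||x||_2 = 1 by instead minimizing the
# unconstrained g(theta) = f([cos(theta), sin(theta)]^T).
f = lambda x: x[0] + 2 * x[1]   # linear f: constrained minimum is -[1,2]/sqrt(5)
g = lambda t: f(np.array([np.cos(t), np.sin(t)]))

thetas = np.linspace(0.0, 2 * np.pi, 100001)  # dense grid over theta
t_best = thetas[np.argmin(g(thetas))]

x_best = np.array([np.cos(t_best), np.sin(t_best)])
assert np.allclose(x_best, -np.array([1.0, 2.0]) / np.sqrt(5), atol=1e-3)
```

The point returned always satisfies the constraint exactly, by construction, which is the appeal of this transformation.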
involving g are called the{ equality c onstr aints and the inequalities ∀ ∀ ≤ } involving h We in intro tro troduce duce new variables λi and α j for each constraint, these are called the are called inequality constraints. KKT multipliers. The generalized Lagrangian is then deﬁned as We introduce new variables λ and these are called the Xα for each constraint, X (i) is then deﬁned (j ) as KKT multipliers. The generalized Lagrangian L(x, λ, α) = f (x) + λ ig (x) + α j h (x). (4.14) i
j
L(x, λ, α) = f (x) + λ g (x) + α h (x). (4.14) We can no now w solve a constrained minimization problem using unconstrained optimization of the generalized Lagrangian. Observe that, so long as at least one We can noexists w solve a constrained problem unconstrained feasible point and f (x) is not pminimization ermitted to hav have e value using ∞, then X optimization of the generalized Lagrangian. ObserveX that, so long as at least one feasible point exists and f (xmin ) ismax not pmax ermitted hav L(x, to λ, α ). e value , then (4.15) x λ α,α≥0 ∞ min max max L(x, λ, α). (4.15) has the same optimal ob objectiv jectiv jectivee function value and set of optimal points x as has the same optimal ob jective function as min fv(alue x). and set of optimal points x (4.16) x∈S
min f (x). are satisﬁed, This follows because any time the constraints
(4.16)
This follows because any time max the maxconstraints L(x, λ, α)are = fsatisﬁed, (x),
(4.17)
max max L(x, λ, α) = f (x), while any time a constraint is violated,
(4.17)
while any time a constraintmax is violated, max L(x, λ, α) = ∞.
(4.18)
λ
λ
α,α≥0
α,α≥0
(4.18) max max L(x, λ, α) = . These prop properties erties guarantee that no infeasible poin ointt will ever be optimal, and that ∞ the optimum within the feasible poin oints ts is unchanged. These properties guarantee that no infeasible point will ever be optimal, and that 1 KKT approach generalizes method Lagrange multipliers which allows equality the The optimum within the feasiblethepoin ts is of unchanged. constraints but not inequality constraints.
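A small numeric illustration of Eqs. 4.17 and 4.18: maximizing over the multiplier recovers f(x) at a feasible point and blows up at an infeasible one. The choices of f and h, and the finite grid standing in for the unbounded maximization over α, are assumptions made for this sketch:

```python
import numpy as np

# One inequality constraint h(x) = x^T x - 1 <= 0 (the unit ball); f is an
# arbitrary illustrative objective. No equality constraints, so lambda is absent.
f = lambda x: float(np.sum(x))
h = lambda x: float(x @ x - 1.0)

def max_over_alpha(x, alphas):
    """Approximate max_{alpha >= 0} f(x) + alpha * h(x) on a grid of alphas."""
    return max(f(x) + a * h(x) for a in alphas)

alphas = np.linspace(0.0, 1e6, 11)  # large finite grid stands in for alpha -> inf

x_feasible = np.array([0.5, 0.5])   # h(x) < 0: the max is attained at alpha = 0
x_violated = np.array([2.0, 0.0])   # h(x) > 0: the value grows without bound

assert np.isclose(max_over_alpha(x_feasible, alphas), f(x_feasible))  # Eq. 4.17
assert max_over_alpha(x_violated, alphas) > 1e6                       # Eq. 4.18
```

This is exactly the mechanism by which the min-max of Eq. 4.15 rules out infeasible points: any constraint violation makes the inner maximization arbitrarily large.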
To perform constrained maximization, we can construct the generalized Lagrange function of −f(x), which leads to this optimization problem:

min_x max_λ max_{α, α≥0} −f(x) + Σ_i λ_i g^(i)(x) + Σ_j α_j h^(j)(x).   (4.19)

We may also convert this to a problem with maximization in the outer loop:

max_x min_λ min_{α, α≥0} f(x) + Σ_i λ_i g^(i)(x) − Σ_j α_j h^(j)(x).   (4.20)

The sign of the term for the equality constraints does not matter; we may define it with addition or subtraction as we wish, because the optimization is free to choose any sign for each λ_i.

The inequality constraints are particularly interesting. We say that a constraint h^(i)(x) is active if h^(i)(x*) = 0. If a constraint is not active, then the solution to the problem found using that constraint would remain at least a local solution if that constraint were removed. It is possible that an inactive constraint excludes other solutions. For example, a convex problem with an entire region of globally optimal points (a wide, flat region of equal cost) could have a subset of this region eliminated by constraints, or a non-convex problem could have better local stationary points excluded by a constraint that is inactive at convergence. However, the point found at convergence remains a stationary point whether or not the inactive constraints are included. Because an inactive h^(i) has negative value, then the solution to min_x max_λ max_{α, α≥0} L(x, λ, α) will have α_i = 0. We can thus observe that at the solution, α ⊙ h(x) = 0. In other words, for all i, we know that at least one of the constraints α_i ≥ 0 and h^(i)(x) ≤ 0 must be active at the solution. To gain some intuition for this idea, we can say that either the solution is on the boundary imposed by the inequality and we must use its KKT multiplier to influence the solution to x, or the inequality has no influence on the solution and we represent this by zeroing out its KKT multiplier.

The properties that the gradient of the generalized Lagrangian is zero, all constraints on both x and the KKT multipliers are satisfied, and α ⊙ h(x) = 0 are called the Karush–Kuhn–Tucker (KKT) conditions (Karush, 1939; Kuhn and Tucker, 1951). Together, these properties describe the optimal points of constrained optimization problems.

For more information about the KKT approach, see Nocedal and Wright (2006).
4.5
Example: Linear Least Squares
Suppose we want to find the value of x that minimizes

f(x) = (1/2) ||Ax − b||_2^2.   (4.21)

There are specialized linear algebra algorithms that can solve this problem efficiently. However, we can also explore how to solve it using gradient-based optimization as a simple example of how these techniques work.

First, we need to obtain the gradient:

∇_x f(x) = A^T (Ax − b) = A^T Ax − A^T b.   (4.22)

We can then follow this gradient downhill, taking small steps. See Algorithm 4.1 for details.

Algorithm 4.1 An algorithm to minimize f(x) = (1/2)||Ax − b||_2^2 with respect to x using gradient descent.

  Set the step size (ε) and tolerance (δ) to small, positive numbers.
  while ||A^T Ax − A^T b||_2 > δ do
    x ← x − ε(A^T Ax − A^T b)
  end while

One can also solve this problem using Newton's method. In this case, because the true function is quadratic, the quadratic approximation employed by Newton's method is exact, and the algorithm converges to the global minimum in a single step.

Now suppose we wish to minimize the same function, but subject to the constraint x^T x ≤ 1. To do so, we introduce the Lagrangian

L(x, λ) = f(x) + λ(x^T x − 1).   (4.23)

We can now solve the problem

min_x max_{λ, λ≥0} L(x, λ).   (4.24)

The smallest-norm solution to the unconstrained least squares problem may be found using the Moore–Penrose pseudoinverse: x = A^+ b. If this point is feasible, then it is the solution to the constrained problem. Otherwise, we must find a
solution where the constraint is active. By differentiating the Lagrangian with respect to x, we obtain the equation

A^T Ax − A^T b + 2λx = 0.   (4.25)

This tells us that the solution will take the form

x = (A^T A + 2λI)^(−1) A^T b.   (4.26)

The magnitude of λ must be chosen such that the result obeys the constraint. We can find this value by performing gradient ascent on λ. To do so, observe

∂/∂λ L(x, λ) = x^T x − 1.   (4.27)

When the norm of x exceeds 1, this derivative is positive, so to follow the derivative uphill and increase the Lagrangian with respect to λ, we increase λ. Because the coefficient on the x^T x penalty has increased, solving the linear equation for x will now yield a solution with smaller norm. The process of solving the linear equation and adjusting λ continues until x has the correct norm and the derivative on λ is 0.

This concludes the mathematical preliminaries that we use to develop machine learning algorithms. We are now ready to build and analyze some full-fledged learning systems.
Chapter 5
Machine Learning Basics

Deep learning is a specific kind of machine learning. In order to understand deep learning well, one must have a solid understanding of the basic principles of machine learning. This chapter provides a brief course in the most important general principles that will be applied throughout the rest of the book. Novice readers or those who want a wider perspective are encouraged to consider machine learning textbooks with a more comprehensive coverage of the fundamentals, such as Murphy (2012) or Bishop (2006). If you are already familiar with machine learning basics, feel free to skip ahead to Sec. 5.11. That section covers some perspectives on traditional machine learning techniques that have strongly influenced the development of deep learning algorithms.

We begin with a definition of what a learning algorithm is, and present an example: the linear regression algorithm.
We then proceed to describe how the challenge of fitting the training data differs from the challenge of finding patterns that generalize to new data. Most machine learning algorithms have settings called hyperparameters that must be determined external to the learning algorithm itself; we discuss how to set these using additional data. Machine learning is essentially a form of applied statistics with increased emphasis on the use of computers to statistically estimate complicated functions and a decreased emphasis on proving confidence intervals around these functions; we therefore present the two central approaches to statistics: frequentist estimators and Bayesian inference. Most machine learning algorithms can be divided into the categories of supervised learning and unsupervised learning; we describe these categories and give some examples of simple learning algorithms from each category.
Most deep learning algorithms are based on an optimization algorithm called stochastic gradient descent. We describe how to combine various algorithm components such as an
CHAPTER 5. MACHINE LEARNING BASICS
optimization algorithm, a cost function, a model, and a dataset to build a machine learning algorithm. Finally, in Sec. 5.11, we describe some of the factors that have limited the ability of traditional machine learning to generalize. These challenges have motivated the development of deep learning algorithms that overcome these obstacles.
5.1
Learning Algorithms
A machine learning algorithm is an algorithm that is able to learn from data. But what do we mean by learning? Mitchell (1997) provides the definition "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." One can imagine a very wide variety of experiences E, tasks T, and performance measures P, and we do not make any attempt in this book to provide a formal definition of what may be used for each of these entities. Instead, the following sections provide intuitive descriptions and examples of the different kinds of tasks, performance measures and experiences that can be used to construct machine learning algorithms.
5.1.1
The Task, T

Machine learning allows us to tackle tasks that are too difficult to solve with
fixed programs written and designed by human beings. From a scientific and philosophical point of view, machine learning is interesting because developing our understanding of machine learning entails developing our understanding of the principles that underlie intelligence.

In this relatively formal definition of the word "task," the process of learning itself is not the task. Learning is our means of attaining the ability to perform the task. For example, if we want a robot to be able to walk, then walking is the task. We could program the robot to learn to walk, or we could attempt to directly write a program that specifies how to walk manually.

Machine learning tasks are usually described in terms of how the machine learning system should process an example.
eAn examplerepresen is a collection of featur the mac machine hine learning pro process. cess. W typically represent t an example ases a that ha v e b een quantitativ ely measured from some ob ject or ev en t that w e w an n vector x ∈ R where eac each h en entry try xi of the vector is another feature. For example,t the features machine of learning system to process. We typically represent an example as a the R an image are usually the values of the pixels in the image. vector x where each entry x of the vector is another feature. For example, the features ∈ of an image are usually the values of the pixels in the image. 99
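The representation of an example as a feature vector can be sketched as follows. This is a minimal illustration, not from the original text; the toy image and its pixel values are made up:

```python
import numpy as np

# A hypothetical 2x2 grayscale "image": each entry is a pixel brightness.
image = np.array([[0.0, 0.5],
                  [0.5, 1.0]])

# Represent the example as a feature vector x in R^n by flattening:
# each entry x_i is one feature (here, one pixel value).
x = image.flatten()

print(x.shape)  # (4,): n = 4 features
```

In practice the feature vector may come from any measurement process, not just pixels; flattening an image is simply the most common case mentioned in the text.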
CHAPTER 5. MACHINE LEARNING BASICS
Many kinds of tasks can be solved with machine learning. Some of the most common machine learning tasks include the following:

• Classification: In this type of task, the computer program is asked to specify which of k categories some input belongs to. To solve this task, the learning algorithm is usually asked to produce a function f : R^n → {1, . . . , k}. When y = f(x), the model assigns an input described by vector x to a category identified by numeric code y. There are other variants of the classification task, for example, where f outputs a probability distribution over classes. An example of a classification task is object recognition, where the input is an image (usually described as a set of pixel brightness values), and the output is a numeric code identifying the object in the image. For example, the Willow Garage PR2 robot is able to act as a waiter that can recognize different kinds of drinks and deliver them to people on command (Goodfellow et al., 2010). Modern object recognition is best accomplished with deep learning (Krizhevsky et al., 2012; Ioffe and Szegedy, 2015). Object recognition is the same basic technology that allows computers to recognize faces (Taigman et al., 2014), which can be used to automatically tag people in photo collections and allow computers to interact more naturally with their users.

• Classification with missing inputs: Classification becomes more challenging if the computer program is not guaranteed that every measurement in its input vector will always be provided. In order to solve the classification task, the learning algorithm only has to define a single function mapping from a vector input to a categorical output. When some of the inputs may be missing, rather than providing a single classification function, the learning algorithm must learn a set of functions. Each function corresponds to classifying x with a different subset of its inputs missing. This kind of situation arises frequently in medical diagnosis, because many kinds of medical tests are expensive or invasive. One way to efficiently define such a large set of functions is to learn a probability distribution over all of the relevant variables, then solve the classification task by marginalizing out the missing variables. With n input variables, we can now obtain all 2^n different classification functions needed for each possible set of missing inputs, but we only need to learn a single function describing the joint probability distribution. See Goodfellow et al. (2013b) for an example of a deep probabilistic model applied to such a task in this way. Many of the other tasks described in this section can also be generalized to work with missing inputs; classification with missing inputs is just one example of what machine learning can do.
• Regression: In this type of task, the computer program is asked to predict a numerical value given some input. To solve this task, the learning algorithm is asked to output a function f : R^n → R. This type of task is similar to classification, except that the format of output is different. An example of a regression task is the prediction of the expected claim amount that an insured person will make (used to set insurance premiums), or the prediction of future prices of securities. These kinds of predictions are also used for algorithmic trading.

• Transcription: In this type of task, the machine learning system is asked to observe a relatively unstructured representation of some kind of data and transcribe it into discrete, textual form. For example, in optical character recognition, the computer program is shown a photograph containing an image of text and is asked to return this text in the form of a sequence of characters (e.g., in ASCII or Unicode format). Google Street View uses deep learning to process address numbers in this way (Goodfellow et al., 2014d). Another example is speech recognition, where the computer program is provided an audio waveform and emits a sequence of characters or word ID codes describing the words that were spoken in the audio recording. Deep learning is a crucial component of modern speech recognition systems used at major companies including Microsoft, IBM and Google (Hinton et al., 2012b).

• Machine translation: In a machine translation task, the input already consists of a sequence of symbols in some language, and the computer program must convert this into a sequence of symbols in another language. This is commonly applied to natural languages, such as to translate from English to French. Deep learning has recently begun to have an important impact on this kind of task (Sutskever et al., 2014; Bahdanau et al., 2015).

• Structured output: Structured output tasks involve any task where the output is a vector (or other data structure containing multiple values) with important relationships between the different elements. This is a broad category, and subsumes the transcription and translation tasks described above, but also many other tasks. One example is parsing—mapping a natural language sentence into a tree that describes its grammatical structure and tagging nodes of the trees as being verbs, nouns, or adverbs, and so on. See Collobert (2011) for an example of deep learning applied to a parsing task. Another example is pixel-wise segmentation of images, where the computer program assigns every pixel in an image to a specific category. For example, deep learning can be used to annotate the locations of roads in aerial photographs (Mnih and Hinton, 2010). The output need not have its form mirror the structure of the input as closely as in these annotation-style tasks. For example, in image captioning, the computer program observes an image and outputs a natural language sentence describing the image (Kiros et al., 2014a,b; Mao et al., 2015; Vinyals et al., 2015b; Donahue et al., 2014; Karpathy and Li, 2015; Fang et al., 2015; Xu et al., 2015). These tasks are called structured output tasks because the program must output several values that are all tightly interrelated. For example, the words produced by an image captioning program must form a valid sentence.

• Anomaly detection: In this type of task, the computer program sifts through a set of events or objects, and flags some of them as being unusual or atypical. An example of an anomaly detection task is credit card fraud detection. By modeling your purchasing habits, a credit card company can detect misuse of your cards. If a thief steals your credit card or credit card information, the thief's purchases will often come from a different probability distribution over purchase types than your own. The credit card company can prevent fraud by placing a hold on an account as soon as that card has been used for an uncharacteristic purchase. See Chandola et al. (2009) for a survey of anomaly detection methods.

• Synthesis and sampling: In this type of task, the machine learning algorithm is asked to generate new examples that are similar to those in the training data. Synthesis and sampling via machine learning can be useful for media applications where it can be expensive or boring for an artist to generate large volumes of content by hand. For example, video games can automatically generate textures for large objects or landscapes, rather than requiring an artist to manually label each pixel (Luo et al., 2013). In some cases, we want the sampling or synthesis procedure to generate some specific kind of output given the input. For example, in a speech synthesis task, we provide a written sentence and ask the program to emit an audio waveform containing a spoken version of that sentence. This is a kind of structured output task, but with the added qualification that there is no single correct output for each input, and we explicitly desire a large amount of variation in the output, in order for the output to seem more natural and realistic.

• Imputation of missing values: In this type of task, the machine learning algorithm is given a new example x ∈ R^n, but with some entries x_i of x missing. The algorithm must provide a prediction of the values of the missing entries.
• Denoising: In this type of task, the machine learning algorithm is given in input a corrupted example x̃ ∈ R^n obtained by an unknown corruption process from a clean example x ∈ R^n. The learner must predict the clean example x from its corrupted version x̃, or more generally predict the conditional probability distribution p(x | x̃).

• Density estimation or probability mass function estimation: In the density estimation problem, the machine learning algorithm is asked to learn a function p_model : R^n → R, where p_model(x) can be interpreted as a probability density function (if x is continuous) or a probability mass function (if x is discrete) on the space that the examples were drawn from. To do such a task well (we will specify exactly what that means when we discuss performance measures P), the algorithm needs to learn the structure of the data it has seen. It must know where examples cluster tightly and where they are unlikely to occur. Most of the tasks described above require that the learning algorithm has at least implicitly captured the structure of the probability distribution. Density estimation allows us to explicitly capture that distribution. In principle, we can then perform computations on that distribution in order to solve the other tasks as well. For example, if we have performed density estimation to obtain a probability distribution p(x), we can use that distribution to solve the missing value imputation task. If a value x_i is missing and all of the other values, denoted x_{-i}, are given, then we know the distribution over it is given by p(x_i | x_{-i}). In practice, density estimation does not always allow us to solve all of these related tasks, because in many cases the required operations on p(x) are computationally intractable.

Of course, many other tasks and types of tasks are possible. The types of tasks we list here are intended only to provide examples of what machine learning can do, not to define a rigid taxonomy of tasks.
5.1.2
The Performance Measure, P

In order to evaluate the abilities of a machine learning algorithm, we must design a quantitative measure of its performance. Usually this performance measure P is specific to the task T being carried out by the system.

For tasks such as classification, classification with missing inputs, and transcription, we often measure the accuracy of the model. Accuracy is just the proportion of examples for which the model produces the correct output. We can also obtain
equivalent information by measuring the error rate, the proportion of examples for which the model produces an incorrect output. We often refer to the error rate as the expected 0-1 loss. The 0-1 loss on a particular example is 0 if it is correctly classified and 1 if it is not. For tasks such as density estimation, it does not make sense to measure accuracy, error rate, or any other kind of 0-1 loss. Instead, we must use a different performance metric that gives the model a continuous-valued score for each example. The most common approach is to report the average log-probability the model assigns to some examples.

Usually we are interested in how well the machine learning algorithm performs on data that it has not seen before, since this determines how well it will work when deployed in the real world. We therefore evaluate these performance measures using a test set of data that is separate from the data used for training the machine learning system.

The choice of performance measure may seem straightforward and objective, but it is often difficult to choose a performance measure that corresponds well to the desired behavior of the system.

In some cases, this is because it is difficult to decide what should be measured. For example, when performing a transcription task, should we measure the accuracy of the system at transcribing entire sequences, or should we use a more fine-grained performance measure that gives partial credit for getting some elements of the sequence correct? When performing a regression task, should we penalize the system more if it frequently makes medium-sized mistakes or if it rarely makes very large mistakes? These kinds of design choices depend on the application.

In other cases, we know what quantity we would ideally like to measure, but measuring it is impractical. For example, this arises frequently in the context of density estimation. Many of the best probabilistic models represent probability distributions only implicitly. Computing the actual probability value assigned to a specific point in space in many such models is intractable. In these cases, one must design an alternative criterion that still corresponds to the design objectives, or design a good approximation to the desired criterion.
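The performance measures discussed in this section can be computed as follows. This is a minimal sketch with made-up predictions and probability values:

```python
import numpy as np

# Hypothetical model predictions and true labels for five test examples.
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# Accuracy: proportion of examples with the correct output.
accuracy = np.mean(y_pred == y_true)
# Error rate (expected 0-1 loss): proportion with an incorrect output.
error_rate = np.mean(y_pred != y_true)
print(accuracy, error_rate)  # 0.8 0.2

# For density estimation, report instead the average log-probability
# the model assigns to the test examples (values are made up).
p_model = np.array([0.5, 0.25, 0.125, 0.5, 0.25])
avg_log_prob = np.mean(np.log(p_model))
print(avg_log_prob)
```

Note that both measures are computed on held-out test examples, per the test-set discussion above.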
5.1.3
The Experience, E

Machine learning algorithms can be broadly categorized as unsupervised or supervised by what kind of experience they are allowed to have during the learning process.

Most of the learning algorithms in this book can be understood as being allowed to experience an entire dataset. A dataset is a collection of many examples, as
deﬁned in Sec. 5.1.1. Sometimes we will also call examples data points oints.. One of the oldest datasets studied by statisticians and mac machine hine learning redeﬁned in Sec. 5.1.1. Sometimes we will also call examples data points. searc searchers hers is the Iris dataset (Fisher, 1936). It is a collection of measuremen measurements ts of One of the oldest datasets studied b y statisticians and mac hine learning rediﬀeren diﬀerentt parts of 150 iris plants. Eac Each h individual plant corresp corresponds onds to one example. searcfeatures hers is the Iris eac dataset (Fisher ). It is a collection ofofmeasuremen of The within each h example are, 1936 the measurements of each the parts oftsthe diﬀeren t parts of length, 150 iris sepal plants.width, Each individual plant to one plan plant: t: the sepal petal length andcorresp petal onds width. Theexample. dataset The features within eac h example are the measurements of each of the parts of are the also records which sp species ecies each plan plantt belonged to. Three diﬀeren diﬀerentt sp species ecies plan t: the sepal width, petal length and petal width. The dataset represen represented tedsepal in thelength, dataset. also records which species each plant belonged to. Three diﬀerent species are Unsup Unsupervise ervise ervised d le learning arning algorithms exp experience erience a dataset containing many features, represen ted in the dataset. then learn useful prop properties erties of the structure of this dataset. In the con context text of deep Unsup ervise d le arning algorithms exp erience a dataset containing many features, learning, we usually wan antt to learn the entire probabilit probability y distribution that generated learn useful prop erties of as theinstructure of this dataset. 
In the con text of deep athen dataset, whether explicitly densit density y estimation or implicitly for tasks lik likee learning, w e usually w an t to learn the entire probabilit y distribution that generated syn synthesis thesis or denoising. Some other unsupervised learning algorithms perform other a dataset, whether explicitly as in densit y estimation or implicitly for tasks like roles, like clustering, whic which h consists of dividing the dataset into clusters of similar synthesis or denoising. Some other unsupervised learning algorithms perform other examples. roles, like clustering, which consists of dividing the dataset into clusters of similar Sup Supervise ervise ervised d le learning arning algorithms exp experience erience a dataset con containing taining features, but examples. eac each h example is also asso associated ciated with a lab label el or tar target get. For example, the Iris dataset Supervisedwith learning algorithms expiris erience dataset containing features, but is annotated the sp species ecies of each plant.a A supervised learning algorithm eachstudy example is also associated withtoa classify label or iris target . Ftsorinto example, the Irist dataset can the Iris dataset and learn plan plants three diﬀeren diﬀerent sp species ecies is annotated with the sp ecies of each iris plant. A supervised learning algorithm based on their measurements. can study the Iris dataset and learn to classify iris plants into three diﬀerent species Roughly sp speaking, eaking, unsup unsupervised ervised learning inv involves olves observing sev several eral examples based on their measurements. 
of a random vector x, and attempting to implicitly or explicitly learn the probability distribution p(x), or some interesting properties of that distribution, while supervised learning involves observing several examples of a random vector x and an associated value or vector y, and learning to predict y from x, usually by estimating p(y | x). The term supervised learning originates from the view of the target y being provided by an instructor or teacher who shows the machine learning system what to do. In unsupervised learning, there is no instructor or teacher, and the algorithm must learn to make sense of the data without this guide.

Unsupervised learning and supervised learning are not formally defined terms. The lines between them are often blurred. Many machine learning technologies can
be used to perform both tasks. For example, the chain rule of probability states that for a vector x ∈ R^n, the joint distribution can be decomposed as

p(\mathbf{x}) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1}).    (5.1)

This decomposition means that we can solve the ostensibly unsupervised problem of modeling p(x) by splitting it into n supervised learning problems. Alternatively, we
CHAPTER 5. MACHINE LEARNING BASICS
can solve the supervised learning problem of learning p(y | x) by using traditional unsupervised learning technologies to learn the joint distribution p(x, y) and inferring

p(y \mid \mathbf{x}) = \frac{p(\mathbf{x}, y)}{\sum_{y'} p(\mathbf{x}, y')}.    (5.2)

Though unsupervised learning and supervised learning are not completely formal or distinct concepts, they do help to roughly categorize some of the things we do with machine learning algorithms. Traditionally, people refer to regression, classification and structured output problems as supervised learning. Density estimation in support of other tasks is usually considered unsupervised learning.

Other variants of the learning paradigm are possible. For example, in semi-supervised learning, some examples include a supervision target but others do not. In multi-instance learning, an entire collection of examples is labeled as containing or not containing an example of a class, but the individual members of the collection are not labeled.
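Eq. 5.2 above is easy to check numerically for a small discrete distribution. The sketch below is purely illustrative (the joint probability table is a hypothetical example, and numpy is assumed available): it builds a joint p(x, y) and recovers the conditional p(y | x) by normalizing over y.

```python
import numpy as np

# Hypothetical joint distribution p(x, y): rows index values of x,
# columns index values of y. The numbers are arbitrary but sum to 1.
joint = np.array([[0.10, 0.30],
                  [0.25, 0.35]])

# Eq. 5.2: p(y | x) = p(x, y) / sum_{y'} p(x, y')
p_y_given_x = joint / joint.sum(axis=1, keepdims=True)

print(p_y_given_x)  # each row is now a valid distribution over y
```

Each row of `p_y_given_x` sums to 1, since dividing by the row sum is exactly the marginalization over y' in the denominator of Eq. 5.2.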
For a recent example of multi-instance learning with deep models, see Kotzias et al. (2015).

Some machine learning algorithms do not just experience a fixed dataset. For example, reinforcement learning algorithms interact with an environment, so there is a feedback loop between the learning system and its experiences. Such algorithms are beyond the scope of this book. Please see Sutton and Barto (1998) or Bertsekas and Tsitsiklis (1996) for information about reinforcement learning, and Mnih et al. (2013) for the deep learning approach to reinforcement learning.

Most machine learning algorithms simply experience a dataset. A dataset can be described in many ways. In all cases, a dataset is a collection of examples, which are in turn collections of features.

One common way of describing a dataset is with a design matrix. A design matrix is a matrix containing a different example in each row.
Each column of the matrix corresponds to a different feature. For instance, the Iris dataset contains 150 examples with four features for each example. This means we can represent the dataset with a design matrix X ∈ R^{150×4}, where X_{i,1} is the sepal length of plant i, X_{i,2} is the sepal width of plant i, etc. We will describe most of the learning algorithms in this book in terms of how they operate on design matrix datasets.

Of course, to describe a dataset as a design matrix, it must be possible to describe each example as a vector, and each of these vectors must be the same size. This is not always possible. For example, if you have a collection of photographs with different widths and heights, then different photographs will contain different numbers of pixels, so not all of the photographs may be described with the same length of vector. Sec.
9.7 and Chapter 10 describe how to handle different types
of such heterogeneous data. In cases like these, rather than describing the dataset as a matrix with m rows, we will describe it as a set containing m elements: {x^{(1)}, x^{(2)}, \ldots, x^{(m)}}. This notation does not imply that any two example vectors x^{(i)} and x^{(j)} have the same size.

In the case of supervised learning, the example contains a label or target as well as a collection of features. For example, if we want to use a learning algorithm to perform object recognition from photographs, we need to specify which object appears in each of the photos. We might do this with a numeric code, with 0 signifying a person, 1 signifying a car, 2 signifying a cat, etc. Often when working with a dataset containing a design matrix of feature observations X, we also provide a vector of labels y, with y_i providing the label for example i.

Of course, sometimes the label may be more than just a single number. For
example, if we want to train a speech recognition system to transcribe entire sentences, then the label for each example sentence is a sequence of words.

Just as there is no formal definition of supervised and unsupervised learning, there is no rigid taxonomy of datasets or experiences. The structures described here cover most cases, but it is always possible to design new ones for new applications.
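The design-matrix and label-vector conventions described above can be sketched in a few lines. The snippet below builds a synthetic matrix shaped like the Iris data (150 examples, 4 features) with an integer-coded label vector; the values are random stand-ins, not real measurements, and numpy is assumed available.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic design matrix: one example per row, one feature per column,
# shaped like the Iris dataset (150 plants, 4 measurements each).
X = rng.uniform(0.1, 8.0, size=(150, 4))

# Label vector: y[i] is the integer-coded species of plant i,
# e.g. 0, 1 or 2 for the three Iris species.
y = rng.integers(0, 3, size=150)

m, n_features = X.shape
print(m, n_features)   # number of examples, number of features
print(X[0], y[0])      # features and label of example 0
```

Rows index examples and columns index features, so `X[i, 0]` would play the role the text assigns to X_{i,1} (here with 0-based indexing).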
5.1.4
Example: Linear Regression
Our definition of a machine learning algorithm as an algorithm that is capable of improving a computer program's performance at some task via experience is somewhat abstract. To make this more concrete, we present an example of a simple machine learning algorithm: linear regression. We will return to this example repeatedly as we introduce more machine learning concepts that help to understand its behavior.

As the name implies, linear regression solves a regression problem. In other words, the goal is to build a system that can take a vector x ∈ R^n as input and predict the value of a scalar y ∈ R as its output. In the case of linear regression, the output is a linear function of the input. Let ŷ be the value that our model predicts y should take on. We define the output to be

\hat{y} = \mathbf{w}^\top \mathbf{x}    (5.3)

where w ∈ R^n is a vector of parameters.

Parameters are values that control the behavior of the system. In this case, w_i is
the coefficient that we multiply by feature x_i before summing up the contributions from all the features. We can think of w as a set of weights that determine how each feature affects the prediction. If a feature x_i receives a positive weight w_i,
then increasing the value of that feature increases the value of our prediction ŷ. If a feature receives a negative weight, then increasing the value of that feature decreases the value of our prediction. If a feature's weight is large in magnitude, then it has a large effect on the prediction. If a feature's weight is zero, it has no effect on the prediction.

We thus have a definition of our task T: to predict y from x by outputting ŷ = w^⊤x. Next we need a definition of our performance measure, P.

Suppose that we have a design matrix of m example inputs that we will not use for training, only for evaluating how well the model performs. We also have a vector of regression targets providing the correct value of y for each of these examples. Because this dataset will only be used for evaluation, we call it the test set.
We refer to the design matrix of inputs as X^{(test)} and the vector of regression targets as y^{(test)}.

One way of measuring the performance of the model is to compute the mean squared error of the model on the test set. If ŷ^{(test)} gives the predictions of the model on the test set, then the mean squared error is given by

\text{MSE}_{\text{test}} = \frac{1}{m} \sum_i \left( \hat{\mathbf{y}}^{(\text{test})} - \mathbf{y}^{(\text{test})} \right)_i^2.    (5.4)

Intuitively, one can see that this error measure decreases to 0 when ŷ^{(test)} = y^{(test)}. We can also see that

\text{MSE}_{\text{test}} = \frac{1}{m} \left\| \hat{\mathbf{y}}^{(\text{test})} - \mathbf{y}^{(\text{test})} \right\|_2^2,    (5.5)

so the error increases whenever the Euclidean distance between the predictions and the targets increases.

To make a machine learning algorithm, we need to design an algorithm that will improve the weights w in a way that reduces MSE_test when the algorithm is allowed to gain experience by observing a training set (X^{(train)}, y^{(train)}). One intuitive way of doing this (which we will justify later, in Sec.
5.5.1) is just to minimize the mean squared error on the training set, MSE_train.

To minimize MSE_train
, we can simply solve for where its gradient is 0:

\nabla_{\mathbf{w}} \text{MSE}_{\text{train}} = 0    (5.6)

\Rightarrow \nabla_{\mathbf{w}} \frac{1}{m} \left\| \hat{\mathbf{y}}^{(\text{train})} - \mathbf{y}^{(\text{train})} \right\|_2^2 = 0    (5.7)

\Rightarrow \frac{1}{m} \nabla_{\mathbf{w}} \left\| \mathbf{X}^{(\text{train})} \mathbf{w} - \mathbf{y}^{(\text{train})} \right\|_2^2 = 0    (5.8)
[Figure 5.1: two panels illustrating linear regression. Left, "Linear regression example": y plotted against x_1 with the fitted line ŷ = w^⊤x. Right, "Optimization of w": MSE(train) plotted against w_1.]
\Rightarrow \nabla_{\mathbf{w}} \left( \mathbf{X}^{(\text{train})} \mathbf{w} - \mathbf{y}^{(\text{train})} \right)^\top \left( \mathbf{X}^{(\text{train})} \mathbf{w} - \mathbf{y}^{(\text{train})} \right) = 0    (5.9)

\Rightarrow \nabla_{\mathbf{w}} \left( \mathbf{w}^\top \mathbf{X}^{(\text{train})\top} \mathbf{X}^{(\text{train})} \mathbf{w} - 2 \mathbf{w}^\top \mathbf{X}^{(\text{train})\top} \mathbf{y}^{(\text{train})} + \mathbf{y}^{(\text{train})\top} \mathbf{y}^{(\text{train})} \right) = 0    (5.10)

\Rightarrow 2 \mathbf{X}^{(\text{train})\top} \mathbf{X}^{(\text{train})} \mathbf{w} - 2 \mathbf{X}^{(\text{train})\top} \mathbf{y}^{(\text{train})} = 0    (5.11)

\Rightarrow \mathbf{w} = \left( \mathbf{X}^{(\text{train})\top} \mathbf{X}^{(\text{train})} \right)^{-1} \mathbf{X}^{(\text{train})\top} \mathbf{y}^{(\text{train})}    (5.12)

The system of equations whose solution is given by Eq. 5.12 is known as the normal equations. Evaluating Eq. 5.12 constitutes a simple learning algorithm. For an example of the linear regression learning algorithm in action, see Fig. 5.1.

It is worth noting that the term linear regression is often used to refer to a slightly more sophisticated model with one additional parameter: an intercept term b. In this model

\hat{y} = \mathbf{w}^\top \mathbf{x} + b    (5.13)

so the mapping from parameters to predictions is still a linear function but the mapping from features to predictions is now an affine function. This extension to affine functions means that the plot of the model's predictions still looks like a line, but it need not pass through the origin. Instead of adding the bias parameter b, one can continue to use the model with only weights but augment x with an
extra entry that is always set to 1. The weight corresponding to the extra 1 entry plays the role of the bias parameter. We will frequently use the term "linear" when referring to affine functions throughout this book.

The intercept term b is often called the bias parameter of the affine transformation. This terminology derives from the point of view that the output of the transformation is biased toward being b in the absence of any input. This term is different from the idea of a statistical bias, in which a statistical estimation algorithm's expected estimate of a quantity is not equal to the true quantity.

Linear regression is of course an extremely simple and limited learning algorithm, but it provides an example of how a learning algorithm can work. In the subsequent sections we will describe some of the basic principles underlying learning algorithm
design and demonstrate how these principles can be used to build more complicated learning algorithms.
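The pieces of this section, the affine prediction of Eq. 5.13, the MSE of Eq. 5.4 and the normal equations of Eq. 5.12, fit together in a short sketch. Everything below uses synthetic data chosen for illustration; the bias is handled by augmenting x with a constant 1 entry as just described, and the normal equations are solved with a linear solve rather than an explicit matrix inverse (a common numerical choice, with numpy assumed available).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data generated by a known affine model.
w_true, b_true = np.array([2.0, -3.0]), 0.5
X_train = rng.normal(size=(40, 2))
y_train = X_train @ w_true + b_true          # noiseless targets for clarity

# Augment each example with a constant 1 feature; its weight acts as b.
X_aug = np.hstack([X_train, np.ones((40, 1))])

# Normal equations, Eq. 5.12: w = (X^T X)^{-1} X^T y,
# computed here as a linear solve instead of forming the inverse.
w_hat = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y_train)

# Mean squared error on the training set, as in Eq. 5.4.
mse_train = np.mean((X_aug @ w_hat - y_train) ** 2)
print(w_hat, mse_train)
```

On this noiseless data the recovered weights match (w_true, b_true) and the training MSE is numerically zero; with noisy targets the same code yields the least-squares fit instead.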
5.2
Capacity, Overfitting and Underfitting
The central challenge in machine learning is that we must perform well on new, previously unseen inputs, not just those on which our model was trained. The ability to perform well on previously unobserved inputs is called generalization.

Typically, when training a machine learning model, we have access to a training set; we can compute some error measure on the training set, called the training error, and we reduce this training error. So far, what we have described is simply an optimization problem. What separates machine learning from optimization is that we want the generalization error, also called the test error, to be low as well. The generalization error is defined as the expected value of the error on a new input.
Here the expectation is taken across different possible inputs, drawn from the distribution of inputs we expect the system to encounter in practice.

We typically estimate the generalization error of a machine learning model by measuring its performance on a test set of examples that were collected separately from the training set.

In our linear regression example, we trained the model by minimizing the training error,

\frac{1}{m^{(\text{train})}} \left\| \mathbf{X}^{(\text{train})} \mathbf{w} - \mathbf{y}^{(\text{train})} \right\|_2^2,    (5.14)

but we actually care about the test error, \frac{1}{m^{(\text{test})}} \left\| \mathbf{X}^{(\text{test})} \mathbf{w} - \mathbf{y}^{(\text{test})} \right\|_2^2.

How can we affect performance on the test set when we get to observe only the training set? The field of statistical learning theory provides some answers. If the
training and the test set are collected arbitrarily, there is indeed little we can do. If we are allowed to make some assumptions about how the training and test set are collected, then we can make some progress.

The train and test data are generated by a probability distribution over datasets called the data generating process. We typically make a set of assumptions known collectively as the i.i.d. assumptions. These assumptions are that the examples in each dataset are independent from each other, and that the train set and test set are identically distributed, drawn from the same probability distribution as each other. This assumption allows us to describe the data generating process with a probability distribution over a single example. The same distribution is then used to generate every train example and every test example. We call that shared underlying distribution the data generating distribution, denoted p_data. This
W e call that probabilistic framework and the i.i.d. assumptions allo allow w us to mathematically sharedthe underlying distribution data gener . This study relationship betw etween eenthe training errorating and distribution test error. , denoted p probabilistic framework and the i.i.d. assumptions allow us to mathematically One immediate connection can observe betw between een error. the training and test error study the relationship betweenwe training error and test is that the expected training error of a randomly selected mo model del is equal to the One immediate connection we can observe betw een the training test error exp expected ected test error of that mo model. del. Suppose we ha hav ve a probabilityand distribution the expected error of a randomly moset deland is equal to the pis(xthat , y ) and we sampletraining from it rep repeatedly eatedly to generateselected the train the test set. exp ected test error of that mo del. Suppose w e ha v e a probability distribution For some ﬁxed value w , then the exp expected ected training set error is exactly the same as p ( x , y ) and w e sample from it rep eatedly generate theare train set and thethe test set. the exp expected ected test set error, b ecause both to exp expectations ectations formed using same w F or some ﬁxed v alue , then the exp ected training set error is exactly the same as dataset sampling process. The only diﬀerence betw etween een the tw two o conditions is the the exp testtoset b ecause both expectations are formed using the same name wected e assign theerror, dataset we sample. dataset sampling process. The only diﬀerence between the two conditions is the Ofwcourse, we e use awe machine name e assignwhen to thew dataset sample.learning algorithm, we do not ﬁx the parameters ahead of time, then sample b oth datasets. 
We sample the training set, course, when the we parameters use a machine learning algorithm, wethen do not ﬁx the thenOfuse it to choose to reduce training set error, sample the parameters aheadthis of time, thenthe sample b othtest datasets. Wegreater samplethan the training test set. Under pro process, cess, exp expected ected error is or equalset, to then use it to c hoose the parameters to reduce training set error, then sample the the exp expected ected value of training error. The factors determining ho how w well a machine test set. Under this pro cess, the exp ected test error is greater than or equal to learning algorithm will perform are its abilit ability y to: the expected value of training error. The factors determining how well a machine learning algorithm will perror erform are its ability to: 1. Mak Make e the training small. 1. Mak training error small. and test error small. 2. Makee the gap betw etween een training 2.These Maketwthe gap betw een ond training and error small. o factors corresp correspond to the twotest cen central tral challenges in machine learning: underﬁtting and overﬁtting. Underﬁtting occurs when the model is not able to These two factors corresp ond to the wo training central challenges in machine learning: obtain a suﬃcien suﬃciently tly lo low w error value on tthe set. Ov Overﬁtting erﬁtting occurs when underﬁtting and overﬁtting . Underﬁtting occurs when the model is not able to the gap betw etween een the training error and test error is to too o large. obtain a suﬃciently low error value on the training set. Overﬁtting occurs when We can con control trol whether a mo model del is more lik likely ely to ov overﬁt erﬁt or underﬁt by altering the gap between the training error and test error is too large. its cap apacity acity acity.. 
Informally Informally,, a mo model’s del’s capacit capacity y is its abilit ability y to ﬁt a wide variety of We can control whether a model is more likely to overﬁt or underﬁt by altering 111 y is its ability to ﬁt a wide variety of its capacity. Informally, a mo del’s capacit
CHAPTER 5. MACHINE LEARNING BASICS
functions. Models with low capacity may struggle to fit the training set. Models with high capacity can overfit by memorizing properties of the training set that do not serve them well on the test set.

One way to control the capacity of a learning algorithm is by choosing its hypothesis space, the set of functions that the learning algorithm is allowed to select as being the solution. For example, the linear regression algorithm has the set of all linear functions of its input as its hypothesis space. We can generalize linear regression to include polynomials, rather than just linear functions, in its hypothesis space. Doing so increases the model's capacity.

A polynomial of degree one gives us the linear regression model with which we are already familiar, with prediction

ŷ = b + wx.    (5.15)

By introducing x² as another feature provided to the linear regression model, we can learn a model that is quadratic as a function of x:

ŷ = b + w₁x + w₂x².    (5.16)

Though this model implements a quadratic function of its input, the output is still a linear function of the parameters, so we can still use the normal equations to train the model in closed form. We can continue to add more powers of x as additional features, for example to obtain a polynomial of degree 9:

ŷ = b + Σᵢ₌₁⁹ wᵢxⁱ.    (5.17)
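As a concrete sketch of this feature expansion (illustrative code, not from the book; the helper names `poly_features` and `fit_normal_equations` are our own), a polynomial model can be trained with the same closed-form least-squares machinery as ordinary linear regression:

```python
import numpy as np

def poly_features(x, degree):
    """Map a 1-D input array to columns [1, x, x^2, ..., x^degree]."""
    return np.stack([x ** i for i in range(degree + 1)], axis=1)

def fit_normal_equations(X, y):
    """Least-squares solution of X w = y (lstsq for numerical stability)."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# Noise-free data from y = 1 + 2x + 3x^2; with the quadratic feature map,
# plain linear regression recovers the coefficients exactly.
x = np.linspace(-1.0, 1.0, 20)
y = 1 + 2 * x + 3 * x ** 2

w = fit_normal_equations(poly_features(x, degree=2), y)
print(np.round(w, 6))   # [b, w1, w2] = [1, 2, 3]
```

The model stays linear in its parameters w even though it is quadratic in x, which is why the closed-form solution still applies.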
Machine learning algorithms will generally perform best when their capacity is appropriate in regard to the true complexity of the task they need to perform and the amount of training data they are provided with. Models with insufficient capacity are unable to solve complex tasks. Models with high capacity can solve complex tasks, but when their capacity is higher than needed to solve the present task they may overfit.

Fig. 5.2 shows this principle in action. We compare a linear, quadratic and degree-9 predictor attempting to fit a problem where the true underlying function is quadratic. The linear function is unable to capture the curvature in the true underlying problem, so it underfits. The degree-9 predictor is capable of representing the correct function, but it is also capable of representing infinitely many other functions that pass exactly through the training points, because we have more
parameters than training examples. We have little chance of choosing a solution that generalizes well when so many wildly different solutions exist. In this example, the quadratic model is perfectly matched to the true structure of the task so it generalizes well to new data.
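The experiment can be sketched roughly as follows (an assumed setup, not the book's actual code for Fig. 5.2; the data sizes and noise level are arbitrary choices):

```python
import numpy as np

# Fit degree-1, degree-2, and degree-9 polynomials to noisy samples of a
# quadratic function and compare training vs. test error.
rng = np.random.default_rng(0)
true_fn = lambda x: x ** 2

x_train = np.linspace(-1, 1, 10)
y_train = true_fn(x_train) + 0.05 * rng.normal(size=x_train.size)
x_test = np.linspace(-1, 1, 100)
y_test = true_fn(x_test) + 0.05 * rng.normal(size=x_test.size)

results = {}
for degree in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares fit
    results[degree] = (
        np.mean((np.polyval(coeffs, x_train) - y_train) ** 2),  # train MSE
        np.mean((np.polyval(coeffs, x_test) - y_test) ** 2),    # test MSE
    )
    print(degree, results[degree])
# Degree 1 underfits (high error on both sets); degree 9 passes exactly
# through the 10 training points, yet its test error is typically worse
# than the quadratic fit's: overfitting.
```

With 10 parameters and 10 training points, the degree-9 model interpolates the training data, which is exactly the regime the text warns about.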
Figure 5.2: A linear, quadratic, and degree-9 predictor fit to training points drawn from a quadratic function (axes x and y).
So far we have only described changing a model's capacity by changing the number of input features it has (and simultaneously adding new parameters associated with those features). There are in fact many ways of changing a model's capacity. Capacity is not determined only by the choice of model. The model specifies which family of functions the learning algorithm can choose from when varying the parameters in order to reduce a training objective. This is called the representational capacity of the model. In many cases, finding the best function within this family is a very difficult optimization problem. In practice, the learning algorithm does not actually find the best function, but merely one that significantly reduces the training error. These additional limitations, such as the imperfection
of the optimization algorithm, mean that the learning algorithm's effective capacity may be less than the representational capacity of the model family.

Our modern ideas about improving the generalization of machine learning models are refinements of thought dating back to philosophers at least as early as Ptolemy. Many early scholars invoke a principle of parsimony that is now most widely known as Occam's razor (c. 1287–1347). This principle states that among competing hypotheses that explain known observations equally well, one should choose the "simplest" one. This idea was formalized and made more precise in the 20th century by the founders of statistical learning theory (Vapnik and Chervonenkis, 1971; Vapnik, 1982; Blumer et al., 1989; Vapnik, 1995).

Statistical learning theory provides various means of quantifying model capacity.
Among these, the most well-known is the Vapnik-Chervonenkis dimension, or VC dimension. The VC dimension measures the capacity of a binary classifier. The VC dimension is defined as being the largest possible value of m for which there exists a training set of m different x points that the classifier can label arbitrarily.

Quantifying the capacity of the model allows statistical learning theory to make quantitative predictions. The most important results in statistical learning theory show that the discrepancy between training error and generalization error is bounded from above by a quantity that grows as the model capacity grows but shrinks as the number of training examples increases (Vapnik and Chervonenkis, 1971; Vapnik, 1982; Blumer et al., 1989; Vapnik, 1995).
These bounds provide intellectual justification that machine learning algorithms can work, but they are rarely used in practice when working with deep learning algorithms. This is in part because the bounds are often quite loose and in part because it can be quite difficult to determine the capacity of deep learning algorithms. The problem of determining the capacity of a deep learning model is especially difficult because the effective capacity is limited by the capabilities of the optimization algorithm, and we have little theoretical understanding of the very general nonconvex optimization problems involved in deep learning.

We must remember that while simpler functions are more likely to generalize (to have a small gap between training and test error) we must still choose a sufficiently complex hypothesis to achieve low training error. Typically, training error decreases until it asymptotes to the minimum possible error value as model capacity increases (assuming the error measure has a minimum value).
Typically, generalization error has a U-shaped curve as a function of model capacity. This is illustrated in Fig. 5.3.

To reach the most extreme case of arbitrarily high capacity, we introduce the concept of nonparametric models. So far, we have seen only parametric
models, such as linear regression. Parametric models learn a function described by a parameter vector whose size is finite and fixed before any data is observed. Nonparametric models have no such limitation.

Sometimes, nonparametric models are just theoretical abstractions (such as an algorithm that searches over all possible probability distributions) that cannot be implemented in practice. However, we can also design practical nonparametric models by making their complexity a function of the training set size. One example of such an algorithm is nearest neighbor regression. Unlike linear regression, which has a fixed-length vector of weights, the nearest neighbor regression model simply stores the X and y from the training set. When asked to classify a test point x, the model looks up the nearest entry in the training set and returns the associated regression target. In other words, ŷ = yᵢ where i = arg minᵢ ‖Xᵢ,: − x‖₂². The algorithm can also be generalized to distance metrics other than the L² norm, such as learned distance metrics (Goldberger et al., 2005). If the algorithm is allowed to break ties by averaging the yᵢ values for all Xᵢ,: that are tied for nearest, then this algorithm is able to achieve the minimum possible training error (which might be greater than zero, if two identical inputs are associated with different outputs) on any regression dataset.

Finally, we can also create a nonparametric learning algorithm by wrapping a parametric learning algorithm inside another algorithm that increases the number
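A minimal sketch of nearest neighbor regression as described above (the function name and example data are our own illustrations): "training" just stores X and y, and prediction averages the targets of all points tied for nearest.

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, x):
    """Return y of the nearest stored point, averaging over ties."""
    dists = np.sum((X_train - x) ** 2, axis=1)   # squared L2 distances
    nearest = dists == dists.min()               # boolean mask of ties
    return y_train[nearest].mean()               # average tied targets

# Toy training set with two identical inputs at x = 2.
X = np.array([[0.0], [1.0], [2.0], [2.0]])
y = np.array([0.0, 10.0, 4.0, 6.0])

print(nearest_neighbor_predict(X, y, np.array([0.9])))   # nearest is x=1 -> 10.0
print(nearest_neighbor_predict(X, y, np.array([2.0])))   # ties at x=2 -> 5.0
```

The second query shows the point made in the text: with two identical inputs mapped to different outputs, even the tie-averaging predictor cannot reach zero training error.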
of parameters as needed. For example, we could imagine an outer loop of learning that changes the degree of the polynomial learned by linear regression on top of a polynomial expansion of the input.

The ideal model is an oracle that simply knows the true probability distribution that generates the data. Even such a model will still incur some error on many problems, because there may still be some noise in the distribution. In the case of supervised learning, the mapping from x to y may be inherently stochastic, or y may be a deterministic function that involves other variables besides those included in x. The error incurred by an oracle making predictions from the true distribution p(x, y) is called the Bayes error.

Training and generalization error vary as the size of the training set varies. Expected generalization error can never
increase as the number of training examples increases. For nonparametric models, more data yields better generalization until the best possible error is achieved. Any fixed parametric model with less than optimal capacity will asymptote to an error value that exceeds the Bayes error. See Fig. 5.4 for an illustration. Note that it is possible for the model to have optimal capacity and yet still have a large gap between training and generalization error. In this situation, we may be able to reduce this gap by gathering more training examples.
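The outer-loop idea mentioned above can be sketched as follows (the threshold, maximum degree, and data are illustrative assumptions of our own, not the book's): the wrapper grows the polynomial degree, and hence the number of parameters, until training error is low enough.

```python
import numpy as np

def fit_until_threshold(x, y, threshold=1e-3, max_degree=9):
    """Grow the polynomial degree until training MSE <= threshold."""
    for degree in range(1, max_degree + 1):
        coeffs = np.polyfit(x, y, degree)
        mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
        if mse <= threshold:
            return degree, coeffs
    return max_degree, coeffs

x = np.linspace(-1, 1, 30)
y = x ** 3 - x          # a cubic: degrees 1 and 2 cannot fit it closely
degree, _ = fit_until_threshold(x, y)
print(degree)           # stops at 3, the smallest adequate degree
```

The combined procedure has no fixed parameter count decided before seeing the data, which is what makes the wrapped algorithm nonparametric.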
5.2.1 The No Free Lunch Theorem
Learning theory claims that a machine learning algorithm can generalize well from a finite training set of examples. This seems to contradict some basic principles of logic. Inductive reasoning, or inferring general rules from a limited set of examples, is not logically valid. To logically infer a rule describing every member of a set, one must have information about every member of that set.

In part, machine learning avoids this problem by offering only probabilistic rules, rather than the entirely certain rules used in purely logical reasoning. Machine learning promises to find rules that are probably correct about most members of the set they concern.

Unfortunately, even this does not resolve the entire problem. The no free lunch theorem for machine learning (Wolpert, 1996) states that, averaged over all possible data generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points.
In other words, in some sense, no machine learning algorithm is universally any better than any other. The most sophisticated algorithm we can conceive of has the same average performance (over all possible tasks) as merely predicting that every point belongs to the same class.
Fortunately, these results hold only when we average over all possible data generating distributions. If we make assumptions about the kinds of probability distributions we encounter in real-world applications, then we can design learning algorithms that perform well on these distributions.

This means that the goal of machine learning research is not to seek a universal learning algorithm or the absolute best learning algorithm. Instead, our goal is to understand what kinds of distributions are relevant to the "real world" that an AI agent experiences, and what kinds of machine learning algorithms perform well on data drawn from the kinds of data generating distributions we care about.
5.2.2 Regularization
The no free lunch theorem implies that we must design our machine learning algorithms to perform well on a specific task. We do so by building a set of preferences into the learning algorithm. When these preferences are aligned with the learning problems we ask the algorithm to solve, it performs better.

So far, the only method of modifying a learning algorithm we have discussed is to increase or decrease the model's capacity by adding or removing functions from the hypothesis space of solutions the learning algorithm is able to choose. We gave the specific example of increasing or decreasing the degree of a polynomial for a regression problem. The view we have described so far is oversimplified.

The behavior of our algorithm is strongly affected not just by how large we make the set of functions allowed in its hypothesis space, but by the specific identity of those functions.
The learning algorithm we have studied so far, linear regression, has a hypothesis space consisting of the set of linear functions of its input. These linear functions can be very useful for problems where the relationship between inputs and outputs truly is close to linear. They are less useful for problems that behave in a very nonlinear fashion. For example, linear regression would not perform very well if we tried to use it to predict sin(x) from x. We can thus control the performance of our algorithms by choosing what kind of functions we allow them to draw solutions from, as well as by controlling the amount of these functions.

We can also give a learning algorithm a preference for one solution in its hypothesis space to another. This means that both functions are eligible, but one is preferred. The unpreferred solution will be chosen only if it fits the training data significantly better than the preferred solution.
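A quick numerical check of the sin(x) remark above (the range and sample count are our own illustrative choices): the best linear fit leaves most of the target's variance unexplained, because no member of the linear hypothesis space is close to sin.

```python
import numpy as np

# Best linear fit to sin(x) over two full periods.
x = np.linspace(-2 * np.pi, 2 * np.pi, 200)
y = np.sin(x)

slope, intercept = np.polyfit(x, y, 1)
mse = np.mean((slope * x + intercept - y) ** 2)
print(f"linear-fit MSE: {mse:.3f}, variance of sin(x): {np.var(y):.3f}")
# The residual error remains a large fraction of the target's own
# variance: the linear model cannot capture the oscillation at all.
```

No amount of additional data fixes this; the failure comes from the hypothesis space, not from estimation error.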
For example, we can modify the training criterion for linear regression to include weight decay. To perform linear regression with weight decay, we minimize
a sum comprising both the mean squared error on the training set and a criterion J(w) that expresses a preference for the weights to have smaller squared L² norm. Specifically,

J(w) = MSE_train + λwᵀw,    (5.18)

where λ is a value chosen ahead of time that controls the strength of our preference for smaller weights. When λ = 0, we impose no preference, and larger λ forces the weights to become smaller. Minimizing J(w) results in a choice of weights that make a tradeoff between fitting the training data and being small. This gives us solutions that have a smaller slope, or put weight on fewer of the features. As an example of how we can control a model's tendency to overfit or underfit via weight decay, we can train a high-degree polynomial regression model with different values of λ. See Fig. 5.5 for the results.
Figure 5.5: A high-degree polynomial regression model trained with different values of the weight decay strength λ.
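The weight decay criterion also has a closed-form minimizer. The sketch below is our own illustration (not the book's code): assuming MSE is averaged over m examples, setting the gradient of Eq. 5.18 to zero gives w = (XᵀX + λmI)⁻¹Xᵀy; the factor of m is one common scaling convention.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize mean squared error plus lam * w^T w in closed form."""
    m, n = X.shape
    return np.linalg.solve(X.T @ X + lam * m * np.eye(n), X.T @ y)

# Synthetic regression problem with known weights (illustrative data).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + 0.1 * rng.normal(size=50)

w_small = ridge_fit(X, y, lam=0.0)     # no preference: ordinary least squares
w_large = ridge_fit(X, y, lam=100.0)   # strong preference for small weights
print(np.linalg.norm(w_small), np.linalg.norm(w_large))
# Larger lam shrinks the norm of the learned weights toward zero.
```

Increasing λ trades training fit for smaller weights, which is exactly the underfit/overfit dial the text describes.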
More generally, we can regularize a model that learns a function f(x; θ) by adding a penalty called a regularizer to the cost function. In the case of weight decay, the regularizer is Ω(w) = wᵀw. In Chapter 7, we will see that many other
regularizers are possible.

Expressing preferences for one function over another is a more general way of controlling a model's capacity than including or excluding members from the hypothesis space. We can think of excluding a function from a hypothesis space as expressing an infinitely strong preference against that function.

In our weight decay example, we expressed our preference for linear functions defined with smaller weights explicitly, via an extra term in the criterion we minimize. There are many other ways of expressing preferences for different solutions, both implicitly and explicitly. Together, these different approaches are known as regularization. Regularization is one of the central concerns of the field of machine learning, rivaled in its importance only by optimization.

The no free lunch theorem has made it clear that there is no best machine learning algorithm, and, in particular, no best form of regularization.
Instead, we must choose a form of regularization that is well suited to the particular task we want to solve. The philosophy of deep learning in general, and this book in particular, is that a very wide range of tasks (such as all of the intellectual tasks that people can do) may all be solved effectively using very general-purpose forms of regularization.
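The weight decay example above can be made concrete in a few lines of NumPy. This is a minimal sketch, not code from this book: the dataset, the λ values, and the closed-form solver (obtained by setting the gradient of the regularized cost to zero) are illustrative assumptions.

```python
import numpy as np

def ridge_cost(X, y, w, lam):
    # Training MSE plus the weight decay regularizer: lam * w^T w.
    return np.mean((X @ w - y) ** 2) + lam * w @ w

def ridge_fit(X, y, lam):
    # Closed-form minimizer of the cost above, from
    # (2/m) X^T (X w - y) + 2 lam w = 0.
    m, n = X.shape
    return np.linalg.solve(X.T @ X + m * lam * np.eye(n), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=50)

w_ols = ridge_fit(X, y, lam=0.0)    # no preference among linear functions
w_reg = ridge_fit(X, y, lam=10.0)   # strong preference for small weights
# The regularized solution has a smaller norm: the penalty expresses a
# preference for functions with smaller weights without excluding any
# function from the hypothesis space.
```

Note that, unlike restricting the hypothesis space, every weight vector remains admissible here; larger λ only shifts the preference more strongly toward small weights.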
5.3 Hyperparameters and Validation Sets
Most machine learning algorithms have several settings that we can use to control the behavior of the learning algorithm. These settings are called hyperparameters. The values of hyperparameters are not adapted by the learning algorithm itself (though we can design a nested learning procedure where one learning algorithm learns the best hyperparameters for another learning algorithm).

In the polynomial regression example we saw in Fig. 5.2, there is a single hyperparameter: the degree of the polynomial, which acts as a capacity hyperparameter. The λ value used to control the strength of weight decay is another example of a hyperparameter.

Sometimes a setting is chosen to be a hyperparameter that the learning algorithm does not learn because it is difficult to optimize. More frequently, we do not learn the hyperparameter because it is not appropriate to learn that hyperparameter on the training set. This applies to all hyperparameters that control
model capacity. If learned on the training set, such hyperparameters would always
choose the maximum possible model capacity, resulting in overfitting (refer to Fig. 5.3). For example, we can always fit the training set better with a higher degree polynomial and a weight decay setting of λ = 0 than we could with a lower degree polynomial and a positive weight decay setting.

To solve this problem, we need a validation set of examples that the training algorithm does not observe.

Earlier we discussed how a held-out test set, composed of examples coming from the same distribution as the training set, can be used to estimate the generalization error of a learner, after the learning process has completed. It is important that the test examples are not used in any way to make choices about the model, including its hyperparameters. For this reason, no example from the test set can be used in the validation set. Therefore, we always construct the validation set from the training data. Specifically, we split the training data into two disjoint subsets.
One of these subsets is used to learn the parameters. The other subset is our validation set, used to estimate the generalization error during or after training, allowing for the hyperparameters to be updated accordingly. The subset of data used to learn the parameters is still typically called the training set, even though this may be confused with the larger pool of data used for the entire training process. The subset of data used to guide the selection of hyperparameters is called the validation set. Typically, one uses about 80% of the training data for training and 20% for validation. Since the validation set is used to "train" the hyperparameters, the validation set error will underestimate the generalization error, though typically by a smaller amount than the training error.
After all hyperparameter optimization is complete, the generalization error may be estimated using the test set.

In practice, when the same test set has been used repeatedly to evaluate performance of different algorithms over many years, and especially if we consider all the attempts from the scientific community at beating the reported state-of-the-art performance on that test set, we end up having optimistic evaluations with the test set as well. Benchmarks can thus become stale and then do not reflect the true field performance of a trained system. Thankfully, the community tends to move on to new (and usually more ambitious and larger) benchmark datasets.
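The 80/20 train/validation split described above can be sketched as follows. The array names, the shuffling scheme, and the fixed seed are illustrative choices, not a prescribed API:

```python
import numpy as np

def train_valid_split(X, y, valid_fraction=0.2, seed=0):
    # Shuffle the training data, then hold out a fraction as the validation set.
    m = X.shape[0]
    perm = np.random.default_rng(seed).permutation(m)
    n_valid = int(m * valid_fraction)
    valid_idx, train_idx = perm[:n_valid], perm[n_valid:]
    return X[train_idx], y[train_idx], X[valid_idx], y[valid_idx]

X = np.arange(20.0).reshape(10, 2)
y = np.arange(10.0)
X_train, y_train, X_valid, y_valid = train_valid_split(X, y)
# 8 examples are used to learn parameters and 2 to guide hyperparameter
# selection; the test set is never touched in this procedure.
```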
5.3.1 Cross-Validation
Dividing the dataset into a fixed training set and a fixed test set can be problematic if it results in the test set being small. A small test set implies statistical uncertainty around the estimated average test error, making it difficult to claim that algorithm A works better than algorithm B on the given task.
When the dataset has hundreds of thousands of examples or more, this is not a serious issue. When the dataset is too small, there are alternative procedures, which allow one to use all of the examples in the estimation of the mean test error, at the price of increased computational cost. These procedures are based on the idea of repeating the training and testing computation on different randomly chosen subsets or splits of the original dataset. The most common of these is the k-fold cross-validation procedure, shown in Algorithm 5.1, in which a partition of the dataset is formed by splitting it into k non-overlapping subsets. The test error may then be estimated by taking the average test error across k trials. On trial i, the i-th subset of the data is used as the test set and the rest of the data is used as the training set.
One problem is that there exist no unbiased estimators of the variance of such average error estimators (Bengio and Grandvalet, 2004), but approximations are typically used.
5.4 Estimators, Bias and Variance
The field of statistics gives us many tools that can be used to achieve the machine learning goal of solving a task not only on the training set but also to generalize. Foundational concepts such as parameter estimation, bias and variance are useful to formally characterize notions of generalization, underfitting and overfitting.
5.4.1 Point Estimation
Point estimation is the attempt to provide the single "best" prediction of some quantity of interest. In general the quantity of interest can be a single parameter or a vector of parameters in some parametric model, such as the weights in our linear regression example in Sec. 5.1.4, but it can also be a whole function.

In order to distinguish estimates of parameters from their true value, our convention will be to denote a point estimate of a parameter θ by θ̂.

Let {x^(1), ..., x^(m)} be a set of m independent and identically distributed (i.i.d.) data points. A point estimator or statistic is any function of the data:

    θ̂_m = g(x^(1), ..., x^(m)).    (5.19)

The definition does not require that g return a value that is close to the true θ or even that the range of g is the same as the set of allowable values of θ. This definition of a point estimator is very general and allows the designer of an estimator great flexibility.
While almost any function thus qualifies as an estimator,
Algorithm 5.1 The k-fold cross-validation algorithm. It can be used to estimate generalization error of a learning algorithm A when the given dataset D is too small for a simple train/test or train/valid split to yield accurate estimation of generalization error, because the mean of a loss L on a small test set may have too high variance. The dataset D contains as elements the abstract examples z^(i) (for the i-th example), which could stand for an (input, target) pair z^(i) = (x^(i), y^(i)) in the case of supervised learning, or for just an input z^(i) = x^(i) in the case of unsupervised learning. The algorithm returns the vector of errors e for each example in D, whose mean is the estimated generalization error. The errors on individual examples can be used to compute a confidence interval around the mean (Eq. 5.47). While these confidence intervals are not well justified after use of cross-validation, it is still common practice to use them to declare that algorithm A is better than algorithm B only if the confidence interval of the error of algorithm A lies below and does not intersect the confidence interval of algorithm B.

Define KFoldXV(D, A, L, k):
Require: D, the given dataset, with elements z^(i)
Require: A, the learning algorithm, seen as a function that takes a dataset as input and outputs a learned function
Require: L, the loss function, seen as a function from a learned function f and an example z^(i) ∈ D to a scalar ∈ ℝ
Require: k, the number of folds
    Split D into k mutually exclusive subsets D_i, whose union is D.
    for i from 1 to k do
        f_i = A(D \ D_i)
        for z^(j) in D_i do
            e_j = L(f_i, z^(j))
        end for
    end for
    Return e
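Algorithm 5.1 can be sketched in a few lines of Python. The toy learner A (which ignores inputs and predicts the mean training target), the squared-error loss L, and the toy dataset below are stand-ins for illustration only, not part of the algorithm itself:

```python
import numpy as np

def kfold_errors(D, A, L, k, seed=0):
    # Returns the vector of per-example errors e; its mean estimates the
    # generalization error of learning algorithm A on dataset D.
    m = len(D)
    folds = np.array_split(np.random.default_rng(seed).permutation(m), k)
    e = np.empty(m)
    for fold in folds:
        held_out = set(fold.tolist())
        f = A([D[j] for j in range(m) if j not in held_out])  # f_i = A(D \ D_i)
        for j in fold:
            e[j] = L(f, D[j])                                 # e_j = L(f_i, z^(j))
    return e

# Stand-in learner: ignore the input, predict the mean training target.
A = lambda data: (lambda x: np.mean([t for _, t in data]))
L = lambda f, z: (f(z[0]) - z[1]) ** 2        # squared-error loss
D = [(float(i), 2.0 * i) for i in range(12)]  # 12 toy (input, target) pairs
e = kfold_errors(D, A, L, k=4)
print(e.mean())  # estimated generalization error of the stand-in learner
```

Every example in D is held out exactly once, so all m examples contribute to the error estimate, at the cost of training the learner k times.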
a good estimator is a function whose output is close to the true underlying θ that generated the training data.

For now, we take the frequentist perspective on statistics. That is, we assume that the true parameter value θ is fixed but unknown, while the point estimate θ̂ is a function of the data. Since the data is drawn from a random process, any function of the data is random. Therefore θ̂ is a random variable.

Point estimation can also refer to the estimation of the relationship between input and target variables. We refer to these types of point estimates as function estimators.

As we mentioned above, sometimes we are interested in performing function estimation (or function approximation). Here we are trying to predict a variable y given an input vector x. We assume that there is a function f(x) that describes the approximate relationship between y and x. For example, we may assume that y = f(x) + ε, where ε stands for the part of y that is not predictable from x.
In function estimation, we are interested in approximating f with a model or estimate f̂. Function estimation is really just the same as estimating a parameter θ; the function estimator f̂ is simply a point estimator in function space. The linear regression example (discussed above in Sec. 5.1.4) and the polynomial regression example (discussed in Sec. 5.2) are both examples of scenarios that may be interpreted either as estimating a parameter w or estimating a function f̂ mapping from x to y.

We now review the most commonly studied properties of point estimators and discuss what they tell us about these estimators.
5.4.2 Bias
The bias of an estimator is defined as:

    bias(θ̂_m) = E(θ̂_m) − θ    (5.20)

where the expectation is over the data (seen as samples from a random variable) and θ is the true underlying value of θ used to define the data generating distribution. An estimator θ̂_m is said to be unbiased if bias(θ̂_m) = 0, which implies that E(θ̂_m) = θ. An estimator θ̂_m is said to be asymptotically unbiased if lim_{m→∞} bias(θ̂_m) = 0, which implies that lim_{m→∞} E(θ̂_m) = θ.

Consider a set of samples {x^(1), ..., x^(m)} that are independently and identically distributed according to a Bernoulli distribution with mean θ:

    P(x^(i); θ) = θ^{x^(i)} (1 − θ)^{(1 − x^(i))}.    (5.21)

A common estimator for the θ parameter of this distribution is the mean of the training samples:

    θ̂_m = (1/m) Σ_{i=1}^{m} x^(i).    (5.22)

To determine whether this estimator is biased, we can substitute Eq. 5.22 into Eq. 5.20:

    bias(θ̂_m) = E[θ̂_m] − θ    (5.23)
              = E[(1/m) Σ_{i=1}^{m} x^(i)] − θ    (5.24)
              = (1/m) Σ_{i=1}^{m} E[x^(i)] − θ    (5.25)
              = (1/m) Σ_{i=1}^{m} Σ_{x^(i)=0}^{1} (x^(i) θ^{x^(i)} (1 − θ)^{(1 − x^(i))}) − θ    (5.26)
              = (1/m) Σ_{i=1}^{m} (θ) − θ    (5.27)
              = θ − θ = 0    (5.28)

Since bias(θ̂_m) = 0, we say that our estimator θ̂_m is unbiased.

Now, consider a set of samples {x^(1), ..., x^(m)} that are independently and identically distributed according to a Gaussian distribution p(x^(i)) = N(x^(i); µ, σ²), where i ∈ {1, ..., m}. Recall that the Gaussian probability density function is given by

    p(x^(i); µ, σ²) = (1/√(2πσ²)) exp(−(x^(i) − µ)² / (2σ²)).    (5.29)

A common estimator of the Gaussian mean parameter is known as the sample mean:

    µ̂_m = (1/m) Σ_{i=1}^{m} x^(i)    (5.30)
To determine the bias of the sample mean, we are again interested in calculating its expectation:

    bias(µ̂_m) = E[µ̂_m] − µ    (5.31)
              = E[(1/m) Σ_{i=1}^{m} x^(i)] − µ    (5.32)
              = (1/m) Σ_{i=1}^{m} E[x^(i)] − µ    (5.33)
              = (1/m) Σ_{i=1}^{m} µ − µ    (5.34)
              = µ − µ = 0    (5.35)

Thus we find that the sample mean is an unbiased estimator of the Gaussian mean parameter.

As an example, we compare two different estimators of the variance parameter σ² of a Gaussian distribution. We are interested in knowing if either estimator is biased.

The first estimator of σ² we consider is known as the sample variance:

    σ̂²_m = (1/m) Σ_{i=1}^{m} (x^(i) − µ̂_m)²,    (5.36)

where µ̂_m is the sample mean, defined above. More formally, we are interested in computing

    bias(σ̂²_m) = E[σ̂²_m] − σ²    (5.37)

We begin by evaluating the term E[σ̂²_m]:

    E[σ̂²_m] = E[(1/m) Σ_{i=1}^{m} (x^(i) − µ̂_m)²]    (5.38)
            = ((m − 1)/m) σ²    (5.39)

Returning to Eq. 5.37, we conclude that the bias of σ̂²_m is −σ²/m. Therefore, the sample variance is a biased estimator.
The unbiased sample variance estimator

    σ̃²_m = (1/(m − 1)) Σ_{i=1}^{m} (x^(i) − µ̂_m)²    (5.40)

provides an alternative approach. As the name suggests this estimator is unbiased. That is, we find that E[σ̃²_m] = σ²:

    E[σ̃²_m] = E[(1/(m − 1)) Σ_{i=1}^{m} (x^(i) − µ̂_m)²]    (5.41)
            = (m/(m − 1)) E[σ̂²_m]    (5.42)
            = (m/(m − 1)) ((m − 1)/m) σ²    (5.43)
            = σ².    (5.44)

We have two estimators: one is biased and the other is not. While unbiased estimators are clearly desirable, they are not always the "best" estimators. As we will see we often use biased estimators that possess other important properties.
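The difference between Eq. 5.36 and Eq. 5.40 is easy to check numerically. In NumPy, np.var divides by m by default and by m − 1 when ddof=1. The setting σ² = 4 with m = 10 below is an arbitrary illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, m, trials = 4.0, 10, 100_000

# Each row is one dataset of m Gaussian samples with true variance sigma2.
x = rng.normal(0.0, np.sqrt(sigma2), size=(trials, m))
biased = x.var(axis=1)            # Eq. 5.36: divide by m
unbiased = x.var(axis=1, ddof=1)  # Eq. 5.40: divide by m - 1

# Eq. 5.39 predicts E[biased] = ((m - 1) / m) * sigma2 = 3.6,
# while E[unbiased] = sigma2 = 4.0.
print(biased.mean(), unbiased.mean())
```

Averaging over many simulated datasets, the first estimator systematically undershoots σ² by the factor (m − 1)/m, while the second is correct on average.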
5.4.3 Variance and Standard Error
Another property of the estimator that we might want to consider is how much we expect it to vary as a function of the data sample. Just as we computed the expectation of the estimator to determine its bias, we can compute its variance. The variance of an estimator is simply the variance

    Var(θ̂)    (5.45)

where the random variable is the training set. Alternately, the square root of the variance is called the standard error, denoted SE(θ̂).

The variance or the standard error of an estimator provides a measure of how we would expect the estimate we compute from data to vary as we independently resample the dataset from the underlying data generating process. Just as we might like an estimator to exhibit low bias we would also like it to have relatively low variance.

When we compute any statistic using a finite number of samples, our estimate
of the true underlying parameter is uncertain, in the sense that we could have obtained other samples from the same distribution and their statistics would have
been different. The expected degree of variation in any estimator is a source of error that we want to quantify.

The standard error of the mean is given by

    SE(µ̂_m) = √(Var[(1/m) Σ_{i=1}^{m} x^(i)]) = σ/√m,    (5.46)

where σ² is the true variance of the samples x^(i). The standard error is often estimated by using an estimate of σ. Unfortunately, neither the square root of the sample variance nor the square root of the unbiased estimator of the variance provide an unbiased estimate of the standard deviation. Both approaches tend to underestimate the true standard deviation, but are still used in practice. The square root of the unbiased estimator of the variance is less of an underestimate. For large m, the approximation is quite reasonable.

The standard error of the mean is very useful in machine learning experiments. We often estimate the generalization error by computing the sample mean of the error on the test set.
The number of examples in the test set determines the accuracy of this estimate. Taking advantage of the central limit theorem, which tells us that the mean will be approximately distributed with a normal distribution, we can use the standard error to compute the probability that the true expectation falls in any chosen interval. For example, the 95% confidence interval centered on the mean µ̂_m is

    (µ̂_m − 1.96 SE(µ̂_m), µ̂_m + 1.96 SE(µ̂_m)),    (5.47)

under the normal distribution with mean µ̂_m and variance SE(µ̂_m)². In machine learning experiments, it is common to say that algorithm A is better than algorithm B if the upper bound of the 95% confidence interval for the error of algorithm A is less than the lower bound of the 95% confidence interval for the error of algorithm B.
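Eq. 5.46 and Eq. 5.47 translate directly into code. In this sketch the per-example test losses are synthetic 0/1 errors with an assumed true error rate of 0.12, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 0/1 losses on a test set of 1,000 examples (true error rate 0.12).
test_errors = rng.binomial(1, 0.12, size=1000).astype(float)

mu_hat = test_errors.mean()
# Eq. 5.46: SE(mu_hat) = sigma / sqrt(m), with sigma estimated from the sample.
se = test_errors.std(ddof=1) / np.sqrt(len(test_errors))
# Eq. 5.47: the 95% confidence interval centered on the mean.
lo, hi = mu_hat - 1.96 * se, mu_hat + 1.96 * se
print(f"estimated error {mu_hat:.3f}, 95% CI ({lo:.3f}, {hi:.3f})")
```

Under the comparison convention described above, an algorithm whose entire interval (lo, hi) lies below another algorithm's interval would be declared better.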
We once again consider a set of samples \{x^{(1)}, \ldots, x^{(m)}\} drawn independently and identically from a Bernoulli distribution (recall P(x^{(i)}; \theta) = \theta^{x^{(i)}}(1-\theta)^{(1-x^{(i)})}). This time we are interested in computing the variance of the estimator \hat{\theta}_m = \frac{1}{m}\sum_{i=1}^m x^{(i)}.

\mathrm{Var}\left(\hat{\theta}_m\right) = \mathrm{Var}\left(\frac{1}{m}\sum_{i=1}^m x^{(i)}\right)    (5.48)
CHAPTER 5. MACHINE LEARNING BASICS
= \frac{1}{m^2}\sum_{i=1}^m \mathrm{Var}\left(x^{(i)}\right)    (5.49)
= \frac{1}{m^2}\sum_{i=1}^m \theta(1-\theta)    (5.50)
= \frac{1}{m^2}\, m\,\theta(1-\theta)    (5.51)
= \frac{1}{m}\,\theta(1-\theta)    (5.52)

The variance of the estimator decreases as a function of m, the number of examples in the dataset. This is a common property of popular estimators that we will return to when we discuss consistency (see Sec. 5.4.5).
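A small simulation confirms Eq. 5.52 numerically (an illustrative sketch only; the values of \theta, m, the seed, and the trial count are arbitrary choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
theta, m, trials = 0.3, 50, 100_000

# Draw many independent datasets of m Bernoulli samples and compute
# theta_hat (the sample mean) for each dataset.
samples = rng.binomial(1, theta, size=(trials, m))
theta_hat = samples.mean(axis=1)

empirical = theta_hat.var()
predicted = theta * (1 - theta) / m             # Eq. 5.52
print(empirical, predicted)                     # the two nearly agree
```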
5.4.4 Trading off Bias and Variance to Minimize Mean Squared Error

Bias and variance measure two different sources of error in an estimator. Bias measures the expected deviation from the true value of the function or parameter. Variance, on the other hand, provides a measure of the deviation from the expected estimator value that any particular sampling of the data is likely to cause.

What happens when we are given a choice between two estimators, one with more bias and one with more variance? How do we choose between them? For example, imagine that we are interested in approximating the function shown in Fig. 5.2 and we are only offered the choice between a model with large bias and one that suffers from large variance. How do we choose between them?

The most common way to negotiate this tradeoff is to use cross-validation. Empirically, cross-validation is highly successful on many real-world tasks. Alternatively, we can also compare the mean squared error (MSE) of the estimates:

\mathrm{MSE} = \mathbb{E}[(\hat{\theta}_m - \theta)^2]    (5.53)
= \mathrm{Bias}(\hat{\theta}_m)^2 + \mathrm{Var}(\hat{\theta}_m)    (5.54)

The MSE measures the overall expected deviation, in a squared error sense, between the estimator and the true value of the parameter \theta. As is clear from Eq. 5.54, evaluating the MSE incorporates both the bias and the variance. Desirable estimators are those with small MSE, and these are estimators that manage to keep both their bias and variance somewhat in check.

The relationship between bias and variance is tightly linked to the machine learning concepts of capacity, underfitting, and overfitting. In the case where generalization error is measured by the MSE (where bias and variance are meaningful components of generalization error), increasing capacity tends to increase variance and decrease bias. This is illustrated in Fig. 5.6, where we see again the U-shaped curve of generalization error as a function of capacity.
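The decomposition in Eq. 5.54 can be checked numerically. The shrunken estimator below is a hypothetical example, chosen only so that the bias term is nonzero (all constants are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(2)
theta, m, trials = 0.5, 20, 100_000

# A deliberately biased estimator of a Bernoulli mean: shrink toward zero.
data = rng.binomial(1, theta, size=(trials, m))
theta_hat = 0.9 * data.mean(axis=1)

bias = theta_hat.mean() - theta
var = theta_hat.var()
mse = np.mean((theta_hat - theta) ** 2)

# Eq. 5.54: MSE = Bias^2 + Var (exact for these sample moments).
print(mse, bias ** 2 + var)
```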
5.4.5 Consistency

So far we have discussed the properties of various estimators for a training set of fixed size. Usually, we are also concerned with the behavior of an estimator as the amount of training data grows. In particular, we usually wish that, as the number of data points m in our dataset increases, our point estimates converge to the true value of the corresponding parameters. More formally, we would like that

\lim_{m \to \infty} \hat{\theta}_m \xrightarrow{p} \theta.    (5.55)

The symbol \xrightarrow{p} means that the convergence is in probability, i.e. for any \epsilon > 0, P(|\hat{\theta}_m - \theta| > \epsilon) \to 0 as m \to \infty. The condition described by Eq. 5.55 is known as consistency. It is sometimes referred to as weak consistency, with strong consistency referring to the almost sure convergence of \hat{\theta} to \theta. Almost sure convergence of a sequence of random variables x^{(1)}, x^{(2)}, \ldots to a value x occurs when p(\lim_{m \to \infty} x^{(m)} = x) = 1.

Consistency ensures that the bias induced by the estimator is assured to diminish as the number of data examples grows. However, the reverse is not true: asymptotic unbiasedness does not imply consistency. For example, consider estimating the mean parameter \mu of a normal distribution \mathcal{N}(x; \mu, \sigma^2), with a dataset consisting of m samples: \{x^{(1)}, \ldots, x^{(m)}\}. We could use the first sample x^{(1)} of the dataset as an unbiased estimator: \hat{\theta} = x^{(1)}. In that case, \mathbb{E}(\hat{\theta}_m) = \theta, so the estimator is unbiased no matter how many data points are seen. This, of course, implies that the estimate is asymptotically unbiased. However, this is not a consistent estimator, as it is not the case that \hat{\theta}_m \to \theta as m \to \infty.

5.5 Maximum Likelihood Estimation

Previously, we have seen some definitions of common estimators and analyzed their properties. But where did these estimators come from?
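The unbiased-but-inconsistent estimator described above is easy to see in simulation. This sketch (seed, \mu, and sample sizes are arbitrary illustrative choices) compares \hat{\theta} = x^{(1)} with the sample mean as m grows:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, trials = 2.0, 10_000

for m in (10, 1000):
    data = rng.normal(mu, 1.0, size=(trials, m))
    first = data[:, 0]          # theta_hat = x^(1): unbiased for mu
    mean = data.mean(axis=1)    # sample mean: unbiased AND consistent
    print(m, first.var(), mean.var())
# first.var() stays near sigma^2 = 1 no matter how large m grows (the
# estimator does not converge to mu), while mean.var() shrinks like 1/m.
```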
Rather than guessing that some function might make a good estimator and then analyzing its bias and variance, we would like to have some principle from which we can derive specific functions that are good estimators for different models.

The most common such principle is the maximum likelihood principle.

Consider a set of m examples \mathbb{X} = \{x^{(1)}, \ldots, x^{(m)}\} drawn independently from the true but unknown data generating distribution p_{\mathrm{data}}(x).

Let p_{\mathrm{model}}(x; \theta) be a parametric family of probability distributions over the same space indexed by \theta. In other words, p_{\mathrm{model}}(x; \theta) maps any configuration x to a real number estimating the true probability p_{\mathrm{data}}(x).

The maximum likelihood estimator for \theta is then defined as

\theta_{\mathrm{ML}} = \arg\max_{\theta}\, p_{\mathrm{model}}(\mathbb{X}; \theta)    (5.56)
= \arg\max_{\theta} \prod_{i=1}^m p_{\mathrm{model}}(x^{(i)}; \theta)    (5.57)

This product over many probabilities can be inconvenient for a variety of reasons. For example, it is prone to numerical underflow. To obtain a more convenient but equivalent optimization problem, we observe that taking the logarithm of the likelihood does not change its arg max but does conveniently transform a product
into a sum:

\theta_{\mathrm{ML}} = \arg\max_{\theta} \sum_{i=1}^m \log p_{\mathrm{model}}(x^{(i)}; \theta).    (5.58)
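The underflow issue mentioned above is easy to demonstrate: a long product of probabilities collapses to exactly zero in floating point, while the equivalent sum of logs stays well behaved. (The particular probabilities below are arbitrary illustrative values.)

```python
import numpy as np

rng = np.random.default_rng(4)

# 2,000 probabilities around 0.1: their product is on the order of
# 10^-2000, far below the smallest representable float64, so it
# underflows to 0.0. The sum of logs is a perfectly ordinary number.
p = rng.uniform(0.05, 0.15, size=2000)

print(np.prod(p))          # 0.0 -- numerical underflow
print(np.log(p).sum())     # a finite negative number
```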
Because the arg max does not change when we rescale the cost function, we can divide by m to obtain a version of the criterion that is expressed as an expectation with respect to the empirical distribution \hat{p}_{\mathrm{data}} defined by the training data:

\theta_{\mathrm{ML}} = \arg\max_{\theta}\, \mathbb{E}_{\mathbf{x} \sim \hat{p}_{\mathrm{data}}} \log p_{\mathrm{model}}(x; \theta).    (5.59)

One way to interpret maximum likelihood estimation is to view it as minimizing the dissimilarity between the empirical distribution \hat{p}_{\mathrm{data}} defined by the training set and the model distribution, with the degree of dissimilarity between the two measured by the KL divergence. The KL divergence is given by

D_{\mathrm{KL}}(\hat{p}_{\mathrm{data}} \,\|\, p_{\mathrm{model}}) = \mathbb{E}_{\mathbf{x} \sim \hat{p}_{\mathrm{data}}} \left[\log \hat{p}_{\mathrm{data}}(x) - \log p_{\mathrm{model}}(x)\right].    (5.60)

The term on the left is a function only of the data generating process, not the model. This means when we train the model to minimize the KL divergence, we need only minimize

-\mathbb{E}_{\mathbf{x} \sim \hat{p}_{\mathrm{data}}} \left[\log p_{\mathrm{model}}(x)\right],    (5.61)

which is of course the same as the maximization in Eq. 5.59.

Minimizing this KL divergence corresponds exactly to minimizing the cross-entropy between the distributions. Many authors use the term "cross-entropy" to identify specifically the negative log-likelihood of a Bernoulli or softmax distribution, but that is a misnomer. Any loss consisting of a negative log-likelihood is a cross-entropy between the empirical distribution defined by the training set and the model. For example, mean squared error is the cross-entropy between the empirical distribution and a Gaussian model.

We can thus see maximum likelihood as an attempt to make the model distribution match the empirical distribution \hat{p}_{\mathrm{data}}. Ideally, we would like to match the true data generating distribution p_{\mathrm{data}}, but we have no direct access to this distribution.

While the optimal \theta is the same regardless of whether we are maximizing the likelihood or minimizing the KL divergence, the values of the objective functions are different. In software, we often phrase both as minimizing a cost function. Maximum likelihood thus becomes minimization of the negative log-likelihood (NLL), or equivalently, minimization of the cross-entropy. The perspective of maximum likelihood as minimum KL divergence becomes helpful in this case because the KL divergence has a known minimum value of zero. The negative log-likelihood can actually become negative when x is real-valued.
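As a concrete sketch of Eq. 5.59 (the data, seed, and grid are illustrative choices, not from the text), minimizing the average negative log-likelihood of Bernoulli data over a grid of \theta values recovers the sample mean, which is the closed-form maximum likelihood estimate for this model:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.binomial(1, 0.7, size=500).astype(float)

thetas = np.linspace(0.01, 0.99, 981)
# Average Bernoulli negative log-likelihood (Eq. 5.59 with a minus sign)
# evaluated at every candidate theta on the grid.
nll = -(x.mean() * np.log(thetas) + (1 - x.mean()) * np.log(1 - thetas))

theta_ml = thetas[np.argmin(nll)]
print(theta_ml, x.mean())   # the NLL minimizer coincides with the sample mean
```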
5.5.1 Conditional Log-Likelihood and Mean Squared Error

The maximum likelihood estimator can readily be generalized to the case where our goal is to estimate a conditional probability P(y \mid x; \theta) in order to predict y given x. This is actually the most common situation because it forms the basis for most supervised learning. If \mathbb{X} represents all our inputs and \mathbb{Y} all our observed targets, then the conditional maximum likelihood estimator is

\theta_{\mathrm{ML}} = \arg\max_{\theta}\, P(\mathbb{Y} \mid \mathbb{X}; \theta).    (5.62)
If the examples are assumed to be i.i.d., then this can be decomposed into

\theta_{\mathrm{ML}} = \arg\max_{\theta} \sum_{i=1}^m \log P(y^{(i)} \mid x^{(i)}; \theta).    (5.63)
Linear regression, introduced earlier in Sec. 5.1.4, may be justified as a maximum likelihood procedure. Previously, we motivated linear regression as an algorithm that learns to take an input x and produce an output value \hat{y}. The mapping from x to \hat{y} is chosen to minimize mean squared error, a criterion that we introduced more or less arbitrarily. We now revisit linear regression from the point of view of maximum likelihood estimation. Instead of producing a single prediction \hat{y}, we now think of the model as producing a conditional distribution p(y \mid x). We can imagine that with an infinitely large training set, we might see several training examples with the same input value x but different values of y. The goal of the learning algorithm is now to fit the distribution p(y \mid x) to all of those different y values that are all compatible with x. To derive the same linear regression algorithm we obtained before, we define p(y \mid x) = \mathcal{N}(y; \hat{y}(x; w), \sigma^2). The function \hat{y}(x; w) gives the prediction of the mean of the Gaussian. In this example, we assume that the variance is fixed to some constant \sigma^2 chosen by the user. We will see that this choice of the functional form of p(y \mid x) causes the maximum likelihood estimation procedure to yield the same learning algorithm as we developed before. Since the examples are assumed to be i.i.d., the conditional log-likelihood (Eq. 5.63) is given by

\sum_{i=1}^m \log p(y^{(i)} \mid x^{(i)}; \theta)    (5.64)
= -m \log \sigma - \frac{m}{2} \log(2\pi) - \sum_{i=1}^m \frac{\left\|\hat{y}^{(i)} - y^{(i)}\right\|^2}{2\sigma^2}    (5.65)
where \hat{y}^{(i)} is the output of the linear regression on the i-th input x^{(i)} and m is the number of training examples. Comparing the log-likelihood with the mean squared error,

\mathrm{MSE}_{\mathrm{train}} = \frac{1}{m} \sum_{i=1}^m \left\|\hat{y}^{(i)} - y^{(i)}\right\|^2,    (5.66)

we immediately see that maximizing the log-likelihood with respect to w yields the same estimate of the parameters w as does minimizing the mean squared error. The two criteria have different values but the same location of the optimum. This justifies the use of the MSE as a maximum likelihood estimation procedure. As we will see, the maximum likelihood estimator has several desirable properties.
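The "same location of the optimum" claim can be verified directly. For a toy one-parameter regression (all data, the fixed \sigma, and the grid below are made up for illustration), the w minimizing Eq. 5.66 and the w maximizing Eq. 5.65 land on the same grid point, because the log-likelihood is an affine, decreasing function of the sum of squared residuals:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=200)
y = 1.5 * x + rng.normal(scale=0.5, size=200)

ws = np.linspace(0.0, 3.0, 3001)
# Squared residuals for every candidate slope w (rows) and example (cols).
resid2 = (y[None, :] - ws[:, None] * x[None, :]) ** 2

mse = resid2.mean(axis=1)                                   # Eq. 5.66
sigma = 0.5
loglik = (-len(x) * np.log(sigma)
          - len(x) / 2 * np.log(2 * np.pi)
          - resid2.sum(axis=1) / (2 * sigma ** 2))          # Eq. 5.65

print(ws[np.argmin(mse)], ws[np.argmax(loglik)])  # same location of optimum
```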
5.5.2 Properties of Maximum Likelihood

The main appeal of the maximum likelihood estimator is that it can be shown to be the best estimator asymptotically, as the number of examples m \to \infty, in terms of its rate of convergence as m increases.

Under appropriate conditions, the maximum likelihood estimator has the property of consistency (see Sec. 5.4.5 above), meaning that as the number of training examples approaches infinity, the maximum likelihood estimate of a parameter converges to the true value of the parameter. These conditions are:

• The true distribution p_{\mathrm{data}} must lie within the model family p_{\mathrm{model}}(\cdot; \theta). Otherwise, no estimator can recover p_{\mathrm{data}}.

• The true distribution p_{\mathrm{data}} must correspond to exactly one value of \theta. Otherwise, maximum likelihood can recover the correct p_{\mathrm{data}}, but will not be able to determine which value of \theta was used by the data generating process.
There are other inductive principles besides the maximum likelihood estimator, many of which share the property of being consistent estimators. However, consistent estimators can differ in their statistical efficiency, meaning that one consistent estimator may obtain lower generalization error for a fixed number of samples m, or, equivalently, may require fewer examples to obtain a fixed level of generalization error.

Statistical efficiency is typically studied in the parametric case (as in linear regression), where our goal is to estimate the value of a parameter (and assuming it is possible to identify the true parameter), not the value of a function. A way to measure how close we are to the true parameter is by the expected mean squared error, computing the squared difference between the estimated and true parameter
values, where the expectation is over m training samples from the data generating distribution. That parametric mean squared error decreases as m increases, and for m large, the Cramér-Rao lower bound (Rao, 1945; Cramér, 1946) shows that no consistent estimator has a lower mean squared error than the maximum likelihood estimator.

For these reasons (consistency and efficiency), maximum likelihood is often considered the preferred estimator to use for machine learning. When the number of examples is small enough to yield overfitting behavior, regularization strategies such as weight decay may be used to obtain a biased version of maximum likelihood that has less variance when training data is limited.
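Statistical efficiency can be illustrated with a classic comparison (not from the text, but a standard example): the sample mean and the sample median are both consistent estimators of a Gaussian mean, yet the mean is more efficient, with variance roughly 1/m against \pi/(2m) for the median:

```python
import numpy as np

rng = np.random.default_rng(7)
m, trials = 100, 100_000

# Many datasets of m standard-normal samples; compare the variance of
# two consistent estimators of the center across datasets.
data = rng.normal(0.0, 1.0, size=(trials, m))
var_mean = data.mean(axis=1).var()
var_median = np.median(data, axis=1).var()

print(var_mean, var_median)   # roughly 1/m versus pi/(2m): the median
                              # needs ~57% more samples for equal accuracy
```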
5.6 Bayesian Statistics

So far we have discussed frequentist statistics and approaches based on estimating a single value of \theta, then making all predictions thereafter based on that one estimate. Another approach is to consider all possible values of \theta when making a prediction. The latter is the domain of Bayesian statistics.

As discussed in Sec. 5.4.1, the frequentist perspective is that the true parameter value \theta is fixed but unknown, while the point estimate \hat{\theta} is a random variable on account of its being a function of the dataset (which is seen as random).

The Bayesian perspective on statistics is quite different. The Bayesian uses probability to reflect degrees of certainty of states of knowledge. The dataset is directly observed and so is not random. On the other hand, the true parameter \theta is unknown or uncertain and thus is represented as a random variable.
Before observing the data, we represent our knowledge of \theta using the prior probability distribution, p(\theta) (sometimes referred to as simply "the prior"). Generally, the machine learning practitioner selects a prior distribution that is quite broad (i.e. with high entropy) to reflect a high degree of uncertainty in the value of \theta before observing any data. For example, one might assume a priori that \theta lies in some finite range or volume, with a uniform distribution. Many priors instead reflect a preference for "simpler" solutions (such as smaller magnitude coefficients, or a function that is closer to being constant).

Now consider that we have a set of data samples \{x^{(1)}, \ldots, x^{(m)}\}. We can recover the effect of the data on our belief about \theta by combining the data likelihood p(x^{(1)}, \ldots, x^{(m)} \mid \theta) with the prior via Bayes' rule:

p(\theta \mid x^{(1)}, \ldots, x^{(m)}) = \frac{p(x^{(1)}, \ldots, x^{(m)} \mid \theta)\, p(\theta)}{p(x^{(1)}, \ldots, x^{(m)})}    (5.67)
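Bayes' rule in Eq. 5.67 can be evaluated numerically for Bernoulli data with a uniform prior over \theta on a grid (a rough sketch; the data, seed, and grid resolution are hypothetical choices made only for illustration):

```python
import numpy as np

rng = np.random.default_rng(8)

# Hypothetical data: 40 Bernoulli(0.7) samples; uniform prior on a grid.
x = rng.binomial(1, 0.7, size=40)
s, m = int(x.sum()), len(x)

thetas = np.linspace(0.001, 0.999, 999)
dtheta = thetas[1] - thetas[0]

prior = np.ones_like(thetas)                   # broad, high-entropy prior
likelihood = thetas ** s * (1 - thetas) ** (m - s)
posterior = prior * likelihood
posterior /= posterior.sum() * dtheta          # denominator of Eq. 5.67

print(thetas[np.argmax(posterior)])  # concentrates near the sample mean
```

With a conjugate (Beta) prior the same posterior has a closed form, but the grid version makes the mechanics of Eq. 5.67 explicit.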
CHAPTER 5. MACHINE LEARNING BASICS
In the scenarios where Bayesian estimation is typically used, the prior begins as a relatively uniform or Gaussian distribution with high entropy, and the observation of the data usually causes the posterior to lose entropy and concentrate around a few highly likely values of the parameters.

Relative to maximum likelihood estimation, Bayesian estimation offers two important differences. First, unlike the maximum likelihood approach that makes predictions using a point estimate of θ, the Bayesian approach is to make predictions using a full distribution over θ. For example, after observing m examples, the predicted distribution over the next data sample, x^(m+1), is given by

    p(x^(m+1) | x^(1), …, x^(m)) = ∫ p(x^(m+1) | θ) p(θ | x^(1), …, x^(m)) dθ.    (5.68)
Here each value of θ with positive probability density contributes to the prediction of the next example, with the contribution weighted by the posterior density itself. After having observed {x^(1), …, x^(m)}, if we are still quite uncertain about the value of θ, then this uncertainty is incorporated directly into any predictions we might make.

In Sec. 5.4, we discussed how the frequentist approach addresses the uncertainty in a given point estimate of θ by evaluating its variance. The variance of the estimator is an assessment of how the estimate might change with alternative samplings of the observed data. The Bayesian answer to the question of how to deal with the uncertainty in the estimator is to simply integrate over it, which tends to protect well against overfitting.
This integral is of course just an application of the laws of probability, making the Bayesian approach simple to justify, while the frequentist machinery for constructing an estimator is based on the rather ad hoc decision to summarize all knowledge contained in the dataset with a single point estimate.

The second important difference between the Bayesian approach to estimation and the maximum likelihood approach is due to the contribution of the Bayesian prior distribution. The prior has an influence by shifting probability mass density towards regions of the parameter space that are preferred a priori. In practice, the prior often expresses a preference for models that are simpler or more smooth. Critics of the Bayesian approach identify the prior as a source of subjective human judgment impacting the predictions.
Bayesian methods typically generalize much better when limited training data is available, but typically suffer from high computational cost when the number of training examples is large.
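The update in Eq. 5.67 and the predictive integral in Eq. 5.68 can be made concrete with a model simple enough to solve in closed form. The sketch below (an illustration of ours, not from the text) uses a Beta prior over the parameter θ of a Bernoulli distribution, for which the posterior is again a Beta distribution and the integral in Eq. 5.68 is analytic; a Monte Carlo average over posterior samples confirms the same value.

```python
import numpy as np

def beta_bernoulli_predictive(data, a0=1.0, b0=1.0):
    """Posterior and posterior predictive for a Bernoulli model with a
    Beta(a0, b0) prior on theta (an illustrative worked example).

    Bayes' rule (Eq. 5.67) gives the posterior Beta(a0 + k, b0 + m - k),
    where k is the number of ones among the m samples.  The predictive
    integral (Eq. 5.68) then evaluates to the posterior mean of theta."""
    data = np.asarray(data)
    m, k = data.size, data.sum()
    a_post, b_post = a0 + k, b0 + (m - k)
    # p(x^(m+1) = 1 | data) = integral of theta * p(theta | data) dtheta
    p_next_one = a_post / (a_post + b_post)
    return (a_post, b_post), p_next_one

# Monte Carlo check of the same integral: average the likelihood of
# x^(m+1) = 1 over samples drawn from the posterior.
rng = np.random.default_rng(0)
(a, b), p_exact = beta_bernoulli_predictive([1, 1, 0, 1])
p_mc = rng.beta(a, b, size=200_000).mean()
```

Note how the prediction integrates over all values of θ weighted by the posterior, rather than plugging in a single point estimate.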
Here we consider the Bayesian estimation approach to learning the linear regression parameters. In linear regression, we learn a linear mapping from an input vector x ∈ R^n to predict the value of a scalar y ∈ R. The prediction is parametrized by the vector w ∈ R^n:

    ŷ = w^⊤x.    (5.69)

Given a set of m training samples (X^(train), y^(train)), we can express the prediction of y over the entire training set as:

    ŷ^(train) = X^(train) w.    (5.70)

Expressed as a Gaussian conditional distribution on y^(train), we have
    p(y^(train) | X^(train), w) = N(y^(train); X^(train) w, I)    (5.71)
                                ∝ exp(−(1/2)(y^(train) − X^(train) w)^⊤ (y^(train) − X^(train) w)),    (5.72)

where we follow the standard MSE formulation in assuming that the Gaussian variance on y is one. In what follows, to reduce the notational burden, we refer to (X^(train), y^(train)) as simply (X, y).

To determine the posterior distribution over the model parameter vector w, we first need to specify a prior distribution. The prior should reflect our naive belief about the value of these parameters. While it is sometimes difficult or unnatural to express our prior beliefs in terms of the parameters of the model, in practice we typically assume a fairly broad distribution expressing a high degree of uncertainty about θ.
For real-valued parameters it is common to use a Gaussian as a prior distribution:

    p(w) = N(w; µ₀, Λ₀) ∝ exp(−(1/2)(w − µ₀)^⊤ Λ₀⁻¹ (w − µ₀)),    (5.73)

where µ₀ and Λ₀ are the prior distribution mean vector and covariance matrix respectively.¹

With the prior thus specified, we can now proceed in determining the posterior distribution over the model parameters.

    p(w | X, y) ∝ p(y | X, w) p(w)    (5.74)
¹ Unless there is a reason to assume a particular covariance structure, we typically assume a diagonal covariance matrix.
          ∝ exp(−(1/2)(y − Xw)^⊤(y − Xw)) exp(−(1/2)(w − µ₀)^⊤ Λ₀⁻¹ (w − µ₀))    (5.75)
          ∝ exp(−(1/2)(−2y^⊤Xw + w^⊤X^⊤Xw + w^⊤Λ₀⁻¹w − 2µ₀^⊤Λ₀⁻¹w)).    (5.76)

We now define Λ_m = (X^⊤X + Λ₀⁻¹)⁻¹ and µ_m = Λ_m(X^⊤y + Λ₀⁻¹µ₀). Using these new variables, we find that the posterior may be rewritten as a Gaussian distribution:

    p(w | X, y) ∝ exp(−(1/2)(w − µ_m)^⊤ Λ_m⁻¹ (w − µ_m) + (1/2)µ_m^⊤ Λ_m⁻¹ µ_m)    (5.77)
               ∝ exp(−(1/2)(w − µ_m)^⊤ Λ_m⁻¹ (w − µ_m)).    (5.78)

All terms that do not include the parameter vector w have been omitted; they are implied by the fact that the distribution must be normalized to integrate to 1. Eq. 3.23 shows how to normalize a multivariate Gaussian distribution.

Examining this posterior distribution allows us to gain some intuition for the effect of Bayesian inference. In most situations, we set µ₀ to 0. If we set Λ₀ = (1/α)I, then µ_m gives the same estimate of w as does frequentist linear regression with
a weight decay penalty of αw^⊤w. One difference is that the Bayesian estimate is undefined if α is set to zero; we are not allowed to begin the Bayesian learning process with an infinitely wide prior on w. The more important difference is that the Bayesian estimate provides a covariance matrix, showing how likely all the different values of w are, rather than providing only the estimate µ_m.
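The posterior computation in Eqs. 5.75–5.78 is short enough to sketch directly. The following illustrative NumPy code (function name and data are ours) computes µ_m and Λ_m, and checks the claim above: with µ₀ = 0 and Λ₀ = (1/α)I, the posterior mean µ_m coincides with the frequentist weight-decay (ridge) estimate.

```python
import numpy as np

def bayes_linreg_posterior(X, y, mu0, Lambda0):
    """Posterior N(w; mu_m, Lambda_m) for Bayesian linear regression
    with unit noise variance, following Eqs. 5.75-5.78."""
    Lambda0_inv = np.linalg.inv(Lambda0)
    Lambda_m = np.linalg.inv(X.T @ X + Lambda0_inv)
    mu_m = Lambda_m @ (X.T @ y + Lambda0_inv @ mu0)
    return mu_m, Lambda_m

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([0.5, -1.0, 2.0]) + 0.1 * rng.normal(size=50)

alpha = 0.7
n = X.shape[1]
mu_m, Lambda_m = bayes_linreg_posterior(
    X, y, mu0=np.zeros(n), Lambda0=(1.0 / alpha) * np.eye(n))

# With mu0 = 0 and Lambda0 = (1/alpha) I, the posterior mean equals the
# frequentist weight-decay (ridge) estimate.
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(n), X.T @ y)
```

Unlike the ridge estimate, the Bayesian computation also returns Λ_m, which quantifies how uncertain the posterior remains about each direction in parameter space.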
5.6.1 Maximum A Posteriori (MAP) Estimation
While the most principled approach is to make predictions using the full Bayesian posterior distribution over the parameter θ, it is still often desirable to have a single point estimate. One common reason for desiring a point estimate is that most operations involving the Bayesian posterior for most interesting models are intractable, and a point estimate offers a tractable approximation. Rather than simply returning to the maximum likelihood estimate, we can still gain some of the benefit of the Bayesian approach by allowing the prior to influence the choice of the point estimate. One rational way to do this is to choose the maximum a posteriori (MAP) point estimate. The MAP estimate chooses the point of maximal
posterior probability (or maximal probability density in the more common case of continuous θ):

    θ_MAP = argmax_θ p(θ | x) = argmax_θ log p(x | θ) + log p(θ).    (5.79)
We recognize, above on the right hand side, log p(x | θ), i.e. the standard log-likelihood term, and log p(θ), corresponding to the prior distribution.

As an example, consider a linear regression model with a Gaussian prior on the weights w. If this prior is given by N(w; 0, (1/λ)I), then the log-prior term in Eq. 5.79 is proportional to the familiar λw^⊤w weight decay penalty, plus a term that does not depend on w and does not affect the learning process. MAP Bayesian inference with a Gaussian prior on the weights thus corresponds to weight decay.

As with full Bayesian inference, MAP Bayesian inference has the advantage of leveraging information that is brought by the prior and cannot be found in the training data. This additional information helps to reduce the variance in the MAP point estimate (in comparison to the ML estimate). However, it does so at
the price of increased bias.

Many regularized estimation strategies, such as maximum likelihood learning regularized with weight decay, can be interpreted as making the MAP approximation to Bayesian inference. This view applies when the regularization consists of adding an extra term to the objective function that corresponds to log p(θ). Not all regularization penalties correspond to MAP Bayesian inference. For example, some regularizer terms may not be the logarithm of a probability distribution. Other regularization terms depend on the data, which of course a prior probability distribution is not allowed to do.

MAP Bayesian inference provides a straightforward way to design complicated yet interpretable regularization terms. For example, a more complicated penalty term can be derived by using a mixture of Gaussians, rather than a single Gaussian distribution, as the prior (Nowlan and Hinton, 1992).
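The correspondence between MAP estimation and weight decay can be checked numerically. The sketch below (our illustration; the data, step size, and iteration count are arbitrary choices) ascends the gradient of log p(y | X, w) + log p(w) for linear regression with a Gaussian prior N(w; 0, (1/λ)I), and confirms that the result matches the closed-form weight-decay solution.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=40)
lam = 0.5  # prior precision: p(w) = N(w; 0, (1/lam) I)

# Gradient ascent on log p(y | X, w) + log p(w), which up to additive
# constants equals -0.5 ||y - X w||^2 - 0.5 * lam * w^T w.
w = np.zeros(2)
lr = 0.01
for _ in range(5000):
    grad = X.T @ (y - X @ w) - lam * w
    w += lr * grad

# Closed-form MAP / weight-decay solution for comparison.
w_map = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
```

The same objective viewed as a loss to minimize is the familiar regularized least-squares cost, which is what makes weight decay a MAP approximation here.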
5.7 Supervised Learning Algorithms
Recall from Sec. 5.1.3 that supervised learning algorithms are, roughly speaking, learning algorithms that learn to associate some input with some output, given a training set of examples of inputs x and outputs y. In many cases the outputs y may be difficult to collect automatically and must be provided by a human "supervisor," but the term still applies even when the training set targets were collected automatically.
5.7.1 Probabilistic Supervised Learning
Most supervised learning algorithms in this book are based on estimating a probability distribution p(y | x). We can do this simply by using maximum likelihood estimation to find the best parameter vector θ for a parametric family of distributions p(y | x; θ).

We have already seen that linear regression corresponds to the family

    p(y | x; θ) = N(y; θ^⊤x, I).    (5.80)

We can generalize linear regression to the classification scenario by defining a different family of probability distributions. If we have two classes, class 0 and class 1, then we need only specify the probability of one of these classes. The probability of class 1 determines the probability of class 0, because these two values must add up to 1.

The normal distribution over real-valued numbers that we used for linear regression is parametrized in terms of a mean. Any value we supply for this mean is valid. A distribution over a binary variable is slightly more complicated, because
its mean must always be between 0 and 1. One way to solve this problem is to use the logistic sigmoid function to squash the output of the linear function into the interval (0, 1) and interpret that value as a probability:

    p(y = 1 | x; θ) = σ(θ^⊤x).    (5.81)

This approach is known as logistic regression (a somewhat strange name since we use the model for classification rather than regression).

In the case of linear regression, we were able to find the optimal weights by solving the normal equations. Logistic regression is somewhat more difficult. There is no closed-form solution for its optimal weights. Instead, we must search for them by maximizing the log-likelihood. We can do this by minimizing the negative log-likelihood (NLL) using gradient descent.

This same strategy can be applied to essentially any supervised learning problem, by writing down a parametric family of conditional probability distributions over the right kind of input and output variables.
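A minimal sketch of this logistic regression fit (our illustration; the learning rate and toy data are arbitrary choices) uses gradient descent on the mean negative log-likelihood:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Fit p(y = 1 | x; theta) = sigmoid(theta^T x) (Eq. 5.81) by
    gradient descent on the mean negative log-likelihood."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ theta)
        grad = X.T @ (p - y) / len(y)  # gradient of the mean NLL
        theta -= lr * grad
    return theta

# Toy 1-D data: class 1 for positive inputs, class 0 for negative.
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
theta = fit_logistic(X, y)
preds = (sigmoid(X @ theta) > 0.5).astype(int)
```

The gradient X^⊤(σ(Xθ) − y) has the same form as the linear regression gradient with the sigmoid inserted, which is why the same gradient descent machinery applies.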
5.7.2 Support Vector Machines
One of the most influential approaches to supervised learning is the support vector machine (Boser et al., 1992; Cortes and Vapnik, 1995). This model is similar to logistic regression in that it is driven by a linear function w^⊤x + b. Unlike logistic
regression, the support vector machine does not provide probabilities, but only outputs a class identity. The SVM predicts that the positive class is present when w^⊤x + b is positive. Likewise, it predicts that the negative class is present when w^⊤x + b is negative.

One key innovation associated with support vector machines is the kernel trick. The kernel trick consists of observing that many machine learning algorithms can be written exclusively in terms of dot products between examples. For example, it can be shown that the linear function used by the support vector machine can be rewritten as

    w^⊤x + b = b + Σ_{i=1}^m α_i x^⊤x^(i)    (5.82)
where x^(i) is a training example and α is a vector of coefficients. Rewriting the learning algorithm this way allows us to replace x by the output of a given feature function φ(x) and the dot product with a function k(x, x^(i)) = φ(x) · φ(x^(i)) called a kernel. The · operator represents an inner product analogous to φ(x)^⊤φ(x^(i)). For some feature spaces, we may not use literally the vector inner product. In some infinite dimensional spaces, we need to use other kinds of inner products, for example, inner products based on integration rather than summation. A complete development of these kinds of inner products is beyond the scope of this book.

After replacing dot products with kernel evaluations, we can make predictions using the function

    f(x) = b + Σ_i α_i k(x, x^(i)).    (5.83)
The This function is nonlinear with resp ect to , but thecessing relationship betw x φ( x) kernelbased function is exactly equiv equivalent alent to prepro preprocessing the data by een applying (xall and ) isinputs, linear.then Also, the relationship between α and f (x) is linear. The φ (x) fto learning a linear Xmodel in the new transformed space. kernelbased function is exactly equivalent to preprocessing the data by applying The kernel tric trick k is pow owerful erful for tw two o reasons. First, it allows us to learn mo models dels φ(x) to all inputs, then learning a linear model in the new transformed space. that are nonlinear as a function of x using conv convex ex optimization techniques that are Theteed kernel trickerge is peﬃciently owerful for two reasons. it allows us to learn moand dels guaran guaranteed to conv converge eﬃciently. . This is possibleFirst, because we consider φ ﬁxed that are nonlinear as athe function of x using convex optimization that are optimize only α, i.e., optimization algorithm can view thetechniques decision function guaran teed to conv erge eﬃciently . This is p ossible b ecause w e consider ﬁxed and φ as being linear in a diﬀerent space. Second, the kernel function k often admits α, i.e., optimize onlytation theisoptimization can view the decision an implemen that signiﬁcantly algorithm more computational eﬃcien naiv implementation eﬃcient t thanfunction naively ely k as b eing linear in a diﬀerent space. Second, the k ernel function often admits constructing two φ(x) vectors and explicitly taking their dot pro product. duct. an implementation that is signiﬁcantly more computational eﬃcient than naively In some cases, can even e inﬁnite taking dimensional, whic which h duct. would result in constructing two φ(φx(x ) )vectors andbexplicitly their dot pro an inﬁnite computational cost for the naiv naive, e, explicit approach. 
In man many y cases, can evenfunction be inﬁnite whic h would result in φ(x)tractable k(xIn , x0 some φ ( x ) is a cases, nonlinear, of xdimensional, ev when ) is in even en intractable. tractable. As an inﬁnite computational cost for the naive, explicit approach. In many cases, k(x, x ) is a nonlinear, tractable function 141 of x even when φ (x) is intractable. As
an example of an infinite-dimensional feature space with a tractable kernel, we construct a feature mapping φ(x) over the non-negative integers x. Suppose that this mapping returns a vector containing x ones followed by infinitely many zeros. We can write a kernel function k(x, x^(i)) = min(x, x^(i)) that is exactly equivalent to the corresponding infinite-dimensional dot product.
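This equivalence can be checked numerically by truncating the feature vectors to any finite length of at least max(x, x′); the sketch below (our illustration) compares the truncated dot product with the min kernel:

```python
def phi(x, length=10):
    """Truncated feature map: x ones followed by zeros.  Any length of
    at least max(x, x') reproduces the infinite-dimensional result."""
    return [1] * x + [0] * (length - x)

def min_kernel(x, x_prime):
    return min(x, x_prime)

# The dot product counts the positions where both vectors are 1,
# which is exactly min(x, x').
x, x_prime = 3, 7
dot = sum(a * b for a, b in zip(phi(x), phi(x_prime)))
```

The truncation is harmless here because all coordinates past max(x, x′) contribute zero to the dot product.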
The most commonly used kernel is the Gaussian kernel

    k(u, v) = N(u − v; 0, σ²I)    (5.84)
where N(x; µ, Σ) is the standard normal density. This kernel is also known as the radial basis function (RBF) kernel, because its value decreases along lines in v space radiating outward from u. The Gaussian kernel corresponds to a dot product in an infinite-dimensional space, but the derivation of this space is less straightforward than in our example of the min kernel over the integers.

We can think of the Gaussian kernel as performing a kind of template matching. A training example x associated with training label y becomes a template for class y. When a test point x′ is near x according to Euclidean distance, the Gaussian kernel has a large response, indicating that x′ is very similar to the x template. The model then puts a large weight on the associated training label y. Overall, the prediction will combine many such training labels weighted by the similarity of the corresponding training examples.
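Eqs. 5.83 and 5.84 combine into a small kernel-machine decision function. In the sketch below (our illustration; the coefficients α and b would normally come from a training procedure such as an SVM solver, and the kernel drops the Gaussian normalizing constant, which can be absorbed into α), points near the positive template receive a positive score and points near the negative template a negative one, which is the template-matching behavior just described.

```python
import numpy as np

def gaussian_kernel(u, v, sigma=1.0):
    """Gaussian (RBF) kernel, proportional to N(u - v; 0, sigma^2 I);
    the normalizing constant is omitted (absorbed into alpha)."""
    diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    return np.exp(-diff @ diff / (2.0 * sigma ** 2))

def decision_function(x, examples, alphas, b, sigma=1.0):
    """f(x) = b + sum_i alpha_i k(x, x^(i))  (Eq. 5.83)."""
    return b + sum(a * gaussian_kernel(x, xi, sigma)
                   for a, xi in zip(alphas, examples))

# Two templates: one positive-class example, one negative-class example.
examples = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
alphas = [1.0, -1.0]
b = 0.0

f_near_pos = decision_function([0.1, -0.1], examples, alphas, b)
f_near_neg = decision_function([3.9, 4.2], examples, alphas, b)
```

Because the kernel decays rapidly with Euclidean distance, each evaluation is dominated by the nearby templates, exactly as the template-matching view suggests.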
Support vector machines are not the only algorithm that can be enhanced using the kernel trick. Many other linear models can be enhanced in this way. The category of algorithms that employ the kernel trick is known as kernel machines or kernel methods (Williams and Rasmussen, 1996; Schölkopf et al., 1999).

A major drawback to kernel machines is that the cost of evaluating the decision function is linear in the number of training examples, because the i-th example contributes a term α_i k(x, x^(i)) to the decision function. Support vector machines are able to mitigate this by learning an α vector that contains mostly zeros. Classifying a new example then requires evaluating the kernel function only for the training examples that have nonzero α_i.
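As a sketch of this cost structure, the snippet below evaluates a kernel-machine decision function of the form f(x) = b + Σ_i α_i k(x, x^(i)) with the Gaussian kernel of Eq. 5.84, skipping training examples whose α_i is zero. The helper names, training points, and α values are invented for illustration:

```python
import numpy as np

def gaussian_kernel(u, v, sigma=1.0):
    # k(u, v) = N(u - v; 0, sigma^2 I), the Gaussian (RBF) kernel of Eq. 5.84.
    d = len(u)
    sq_dist = np.sum((u - v) ** 2)
    return (2 * np.pi * sigma**2) ** (-d / 2) * np.exp(-sq_dist / (2 * sigma**2))

def decision(x, X_train, alpha, b=0.0):
    # f(x) = b + sum_i alpha_i k(x, x^(i)).
    # Only examples with nonzero alpha (the "support vectors") contribute,
    # so the cost scales with their number, not the full training set size.
    support = np.flatnonzero(alpha)
    return b + sum(alpha[i] * gaussian_kernel(x, X_train[i]) for i in support)
```

A query near a template with α = +1 yields a positive score, and one near a template with α = −1 a negative score, matching the template-matching picture above.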
These training examples are known as support vectors.

Kernel machines also suffer from a high computational cost of training when the dataset is large. We will revisit this idea in Sec. 5.9. Kernel machines with generic kernels struggle to generalize well. We will explain why in Sec. 5.11. The modern incarnation of deep learning was designed to overcome these limitations of kernel machines. The current deep learning renaissance began when Hinton et al. (2006) demonstrated that a neural network could outperform the RBF kernel SVM on the MNIST benchmark.
CHAPTER 5. MACHINE LEARNING BASICS
5.7.3
Other Simple Supervised Learning Algorithms
We have already briefly encountered another nonprobabilistic supervised learning algorithm, nearest neighbor regression. More generally, k-nearest neighbors is a family of techniques that can be used for classification or regression. As a nonparametric learning algorithm, k-nearest neighbors is not restricted to a fixed number of parameters. We usually think of the k-nearest neighbors algorithm as not having any parameters, but rather implementing a simple function of the training data. In fact, there is not even really a training stage or learning process. Instead, at test time, when we want to produce an output y for a new test input x, we find the k nearest neighbors to x in the training data X. We then return the average of the corresponding y values in the training set. This works for essentially any kind of supervised learning where we can define an average over y values. In the case of classification, we can average over one-hot code vectors c with c_y = 1 and c_i = 0 for all other values of i. We can then interpret the average over these one-hot codes as giving a probability distribution over classes.

As a nonparametric learning algorithm, k-nearest neighbors can achieve very high capacity. For example, suppose we have a multiclass classification task and measure performance with 0-1 loss. In this setting, 1-nearest neighbor converges to double the Bayes error as the number of training examples approaches infinity. The error in excess of the Bayes error results from choosing a single neighbor by breaking ties between equally distant neighbors randomly.
When there is infinite training data, all test points x will have infinitely many training set neighbors at distance zero. If we allow the algorithm to use all of these neighbors to vote, rather than randomly choosing one of them, the procedure converges to the Bayes error rate. The high capacity of k-nearest neighbors allows it to obtain high accuracy given a large training set. However, it does so at high computational cost, and it may generalize very badly given a small, finite training set. One weakness of k-nearest neighbors is that it cannot learn that one feature is more discriminative than another. For example, imagine we have a regression task with x ∈ ℝ^100 drawn from an isotropic Gaussian distribution, but only a single variable x_1 is relevant to the output. Suppose further that this feature simply encodes the output directly, i.e. that y = x_1 in all cases.
Nearest neighbor regression will not be able to detect this simple pattern. The nearest neighbor of most points x will be determined by the large number of features x_2 through x_100, not by the lone feature x_1. Thus the output on small training sets will essentially be random.
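A minimal sketch of k-nearest neighbors classification by averaging one-hot codes, as described above (the function name and the brute-force distance computation are our own illustrative choices):

```python
import numpy as np

def knn_predict_proba(X_train, y_train, x, k, n_classes):
    # Distances from the query to every training example (no training stage).
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]               # indices of the k nearest neighbors
    onehot = np.eye(n_classes)[y_train[nearest]]  # one-hot codes c with c_y = 1
    return onehot.mean(axis=0)                    # average -> distribution over classes

X_train = np.array([[0.0], [1.0], [10.0], [11.0]])
y_train = np.array([0, 0, 1, 1])
p = knn_predict_proba(X_train, y_train, np.array([0.5]), k=2, n_classes=2)
# p -> array([1., 0.]): both nearest neighbors belong to class 0
```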
Another type of learning algorithm that also breaks the input space into regions and has separate parameters for each region is the decision tree (Breiman et al., 1984) and its many variants. As shown in Fig. 5.7, each node of the decision tree is associated with a region in the input space, and internal nodes break that region into one subregion for each child of the node (typically using an axis-aligned cut). Space is thus subdivided into nonoverlapping regions, with a one-to-one correspondence between leaf nodes and input regions. Each leaf node usually maps every point in its input region to the same output. Decision trees are usually trained with specialized algorithms that are beyond the scope of this book.
The learning algorithm can be considered nonparametric if it is allowed to learn a tree of arbitrary size, though decision trees are usually regularized with size constraints that turn them into parametric models in practice. Decision trees as they are typically used, with axis-aligned splits and constant outputs within each node, struggle to solve some problems that are easy even for logistic regression. For example, if we have a two-class problem and the positive class occurs wherever x_2 > x_1, the decision boundary is not axis-aligned. The decision tree will thus need to approximate the decision boundary with many nodes, implementing a step function that constantly walks back and forth across the true decision function with axis-aligned steps.

As we have seen, nearest neighbor predictors and decision trees have many limitations. Nonetheless, they are useful learning algorithms when computational resources are constrained.
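The x_2 > x_1 example can be checked numerically. In this sketch, a hypothetical `best_stump_accuracy` helper searches every single axis-aligned split (a depth-one tree) on a small grid labeled by the diagonal rule; no such split matches the oblique boundary exactly, while the oblique rule itself is trivially perfect:

```python
import numpy as np

# Grid of 2-D points labeled by the diagonal rule y = [x2 > x1].
g = np.arange(10)
a, b = np.meshgrid(g, g)
X = np.column_stack([a.ravel(), b.ravel()])
y = X[:, 1] > X[:, 0]

def best_stump_accuracy(X, y):
    # Best accuracy achievable with one axis-aligned split:
    # pick a feature j, a threshold t, and a side for the positive class.
    best = 0.0
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for pred in (X[:, j] > t, X[:, j] <= t):
                best = max(best, float((pred == y).mean()))
    return best

oblique_acc = float(((X[:, 1] > X[:, 0]) == y).mean())  # the true rule: 1.0
stump_acc = best_stump_accuracy(X, y)                    # strictly below 1.0
```

A deeper tree narrows the gap only by stacking more axis-aligned steps along the diagonal, which is the staircase behavior described above.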
We can also build intuition for more sophisticated learning algorithms by thinking about the similarities and differences between sophisticated algorithms and k-NN or decision tree baselines.

See Murphy (2012), Bishop (2006), Hastie et al. (2001) or other machine learning textbooks for more material on traditional supervised learning algorithms.
5.8
Unsupervised Learning Algorithms
Recall from Sec. 5.1.3 that unsupervised algorithms are those that experience only “features” but not a supervision signal. The distinction between supervised and unsupervised algorithms is not formally and rigidly defined because there is no objective test for distinguishing whether a value is a feature or a target provided by a supervisor. Informally, unsupervised learning refers to most attempts to extract information from a distribution that do not require human labor to annotate examples. The term is usually associated with density estimation, learning to draw samples from a distribution, learning to denoise data from some distribution, finding a manifold that the data lies near, or clustering the data into groups of
related examples.

A classic unsupervised learning task is to find the “best” representation of the data. By ‘best’ we can mean different things, but generally speaking we are looking for a representation that preserves as much information about x as possible while obeying some penalty or constraint aimed at keeping the representation simpler or more accessible than x itself.

There are multiple ways of defining a simpler representation. Three of the most common include lower dimensional representations, sparse representations and independent representations. Low-dimensional representations attempt to compress as much information about x as possible in a smaller representation. Sparse representations (Barlow, 1989; Olshausen and Field, 1996; Hinton and Ghahramani, 1997) embed the dataset into a representation whose entries are mostly zeroes for most inputs.
The use of sparse representations typically requires increasing the dimensionality of the representation, so that the representation becoming mostly zeroes does not discard too much information. This results in an overall structure of the representation that tends to distribute data along the axes of the representation space. Independent representations attempt to disentangle the sources of variation underlying the data distribution such that the dimensions of the representation are statistically independent.

Of course these three criteria are certainly not mutually exclusive. Low-dimensional representations often yield elements that have fewer or weaker dependencies than the original high-dimensional data. This is because one way to reduce the size of a representation is to find and remove redundancies. Identifying and removing more redundancy allows the dimensionality reduction algorithm to achieve more compression while discarding less information.

The notion of representation is one of the central themes of deep learning and therefore one of the central themes in this book. In this section, we develop some simple examples of representation learning algorithms. Together, these example algorithms show how to operationalize all three of the criteria above. Most of the remaining chapters introduce additional representation learning algorithms that develop these criteria in different ways or introduce other criteria.
5.8.1
Principal Components Analysis
In Sec. 2.12, we saw that the principal components analysis algorithm provides a means of compressing data. We can also view PCA as an unsupervised learning algorithm that learns a representation of data. This representation is based on two of the criteria for a simple representation described above.
PCA learns a representation that has lower dimensionality than the original input. It also learns a representation whose elements have no linear correlation with each other. This is a first step toward the criterion of learning representations whose elements are statistically independent. To achieve full independence, a representation learning algorithm must also remove the nonlinear relationships between variables.

PCA learns an orthogonal, linear transformation of the data that projects an input x to a representation z as shown in Fig. 5.8. In Sec. 2.12, we saw that we could learn a one-dimensional representation that best reconstructs the original data (in the sense of mean squared error) and that this representation actually corresponds to the first principal component of the data. Thus we can use PCA
Th us w e can use PCA of the information in the data as possible (again, as measured by leastsquares as a simple anderror). eﬀectiv dimensionalit metho thatPCA preserv es as much reconstruction Ine the follo following, wing,ywreduction e will study ho how wdthe representation of the information in thedata datarepresentation as possible (again, as measured by leastsquares decorrelates the original X. reconstruction error). In the following, we will study how the PCA representation Let us consider the m × n dimensional design matrix X . We will assume that decorrelates the original data representation X . the data has a mean of zero, E[ x] = 0. If this is not the case, the data can easily n dimensional Let us consider the m the design matrix X . preprocessing We willcessing assume that be centered by subtracting step. E mean from all examples in a prepro the data has a mean of zero, × [ x] = 0. If this is not the case, the data can easily The un unbiased biased sample cov covariance ariance asso associated ciated with is giv given en by: step. be centered by subtracting the meanmatrix from all examples in a X prepro cessing 1 asso>ciated with X is given by: The unbiased sample covariance matrix Var[x] = X X. (5.85) m−1 1 Var[x] = 147 X X . (5.85) m 1 −
PCA finds a representation (through linear transformation) z = x⊤W where Var[z] is diagonal.

In Sec. 2.12, we saw that the principal components of a design matrix X are given by the eigenvectors of X⊤X. From this view,

    X⊤X = WΛW⊤.    (5.86)

In this section, we exploit an alternative derivation of the principal components. The principal components may also be obtained via the singular value decomposition. Specifically, they are the right singular vectors of X. To see this, let W be the right singular vectors in the decomposition X = UΣW⊤. We then recover the original eigenvector equation with W as the eigenvector basis:

    X⊤X = (UΣW⊤)⊤ UΣW⊤ = WΣ²W⊤.    (5.87)
The SVD is helpful to show that PCA results in a diagonal Var[z]. Using the SVD of X, we can express the variance of X as:

    Var[x] = (1/(m − 1)) X⊤X    (5.88)
           = (1/(m − 1)) (UΣW⊤)⊤ UΣW⊤    (5.89)
           = (1/(m − 1)) WΣ⊤U⊤UΣW⊤    (5.90)
           = (1/(m − 1)) WΣ²W⊤,    (5.91)

where we use the fact that U⊤U = I because the U matrix of the singular value definition is defined to be orthonormal. This shows that if we take z = x⊤W, we can ensure that the covariance of z is diagonal as required:

    Var[z] = (1/(m − 1)) Z⊤Z    (5.92)
           = (1/(m − 1)) W⊤X⊤XW    (5.93)
           = (1/(m − 1)) W⊤WΣ²W⊤W    (5.94)
           = (1/(m − 1)) Σ²,    (5.95)

where this time we use the fact that W⊤W = I, again from the definition of the SVD.
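This derivation can be verified numerically: taking W to be the right singular vectors of a centered design matrix and projecting each example as z = x⊤W yields an empirically diagonal covariance. A sketch assuming NumPy, with arbitrary synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 500, 3
# Correlated synthetic data, then centered so that the sample mean is zero.
X = rng.normal(size=(m, n)) @ rng.normal(size=(n, n))
X = X - X.mean(axis=0)

# Right singular vectors of X are the principal components.
U, s, Wt = np.linalg.svd(X, full_matrices=False)
W = Wt.T
Z = X @ W                      # each row is z = x^T W

# Var[z] = Z^T Z / (m - 1) should equal diag(s^2) / (m - 1), i.e. be diagonal.
cov_z = Z.T @ Z / (m - 1)
off_diag = np.abs(cov_z - np.diag(np.diag(cov_z))).max()  # ~0 up to rounding
```

The diagonal of `cov_z` matches the squared singular values divided by m − 1, exactly as in Eq. 5.95.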
The above analysis shows that when we project the data x to z, via the linear transformation W, the resulting representation has a diagonal covariance matrix (as given by Σ²), which immediately implies that the individual elements of z are mutually uncorrelated.

This ability of PCA to transform data into a representation where the elements are mutually uncorrelated is a very important property of PCA. It is a simple example of a representation that attempts to disentangle the unknown factors of variation underlying the data. In the case of PCA, this disentangling takes the form of finding a rotation of the input space (described by W) that aligns the principal axes of variance with the basis of the new representation space associated with z.

While correlation is an important category of dependency between elements of the data, we are also interested in learning representations that disentangle more complicated forms of feature dependencies. For this, we will need more than what can be done with a simple linear transformation.
5.8.2
k-means Clustering

Another example of a simple representation learning algorithm is k-means clustering.
The k-means clustering algorithm divides the training set into k different clusters of examples that are near each other. We can thus think of the algorithm as providing a k-dimensional one-hot code vector h representing an input x. If x belongs to cluster i, then h_i = 1 and all other entries of the representation h are zero.

The one-hot code provided by k-means clustering is an example of a sparse representation, because the majority of its entries are zero for every input. Later, we will develop other algorithms that learn more flexible sparse representations, where more than one entry can be nonzero for each input x. One-hot codes are an extreme example of sparse representations that lose many of the benefits of a distributed representation.
The one-hot code still confers some statistical advantages (it naturally conveys the idea that all examples in the same cluster are similar to each other) and it confers the computational advantage that the entire representation may be captured by a single integer.

The k-means algorithm works by initializing k different centroids {µ^(1), . . . , µ^(k)} to different values, then alternating between two different steps until convergence. In one step, each training example is assigned to cluster i, where i is the index of the nearest centroid µ^(i). In the other step, each centroid µ^(i) is updated to the mean of all training examples x^(j) assigned to cluster i.
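The two alternating steps can be sketched as follows. This is a minimal Lloyd-style implementation (the function name, initialization by sampling training points, and fixed iteration count are our own simplifications, and the sketch ignores the empty-cluster edge case):

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize the k centroids to distinct training examples.
    mu = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each example goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned examples.
        mu = np.array([X[assign == i].mean(axis=0) for i in range(k)])
    return assign, mu

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
assign, mu = kmeans(X, k=2)
# The two well-separated pairs end up in different clusters.
```

The returned `assign` vector is exactly the index form of the one-hot code h described above: h_i = 1 for the assigned cluster and 0 elsewhere.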
CHAPTER 5. MACHINE LEARNING BASICS
One difficulty pertaining to clustering is that the clustering problem is inherently ill-posed, in the sense that there is no single criterion that measures how well a clustering of the data corresponds to the real world. We can measure properties of the clustering such as the average Euclidean distance from a cluster centroid to the members of the cluster. This allows us to tell how well we are able to reconstruct the training data from the cluster assignments. We do not know how well the cluster assignments correspond to properties of the real world. Moreover, there may be many different clusterings that all correspond well to some property of the real world. We may hope to find a clustering that relates to one feature but obtain a different, equally valid clustering that is not relevant to our task.
For example, suppose that we run two clustering algorithms on a dataset consisting of images of red trucks, images of red cars, images of gray trucks, and images of gray cars. If we ask each clustering algorithm to find two clusters, one algorithm may find a cluster of cars and a cluster of trucks, while another may find a cluster of red vehicles and a cluster of gray vehicles. Suppose we also run a third clustering algorithm, which is allowed to determine the number of clusters. This may assign the examples to four clusters: red cars, red trucks, gray cars, and gray trucks. This new clustering now at least captures information about both attributes, but it has lost information about similarity. Red cars are in a different cluster from gray cars, just as they are in a different cluster from gray trucks. The output of the clustering algorithm does not tell us that red cars are more similar to gray cars than they are to gray trucks.
They are different from both things, and that is all we know.

These issues illustrate some of the reasons that we may prefer a distributed representation to a one-hot representation. A distributed representation could have two attributes for each vehicle: one representing its color and one representing whether it is a car or a truck. It is still not entirely clear what the optimal distributed representation is (how can the learning algorithm know whether the two attributes we are interested in are color and car-versus-truck rather than manufacturer and age?) but having many attributes reduces the burden on the algorithm to guess which single attribute we care about, and allows us to measure similarity between objects in a fine-grained way by comparing many attributes instead of just testing whether one attribute matches.
5.9
Stochastic Gradient Descent
Nearly all of deep learning is powered by one very important algorithm: stochastic gradient descent or SGD. Stochastic gradient descent is an extension of the gradient
descent algorithm introduced in Sec. 4.3.

A recurring problem in machine learning is that large training sets are necessary for good generalization, but large training sets are also more computationally expensive.

The cost function used by a machine learning algorithm often decomposes as a sum over training examples of some per-example loss function. For example, the negative conditional log-likelihood of the training data can be written as

J(\theta) = E_{x, y \sim \hat{p}_{data}} L(x, y, \theta) = \frac{1}{m} \sum_{i=1}^{m} L(x^{(i)}, y^{(i)}, \theta)    (5.96)

where L is the per-example loss L(x, y, \theta) = -\log p(y \mid x; \theta).
For these additive cost functions, gradient descent requires computing

\nabla_{\theta} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \nabla_{\theta} L(x^{(i)}, y^{(i)}, \theta).    (5.97)

The computational cost of this operation is O(m). As the training set size grows to billions of examples, the time to take a single gradient step becomes prohibitively long.

The insight of stochastic gradient descent is that the gradient is an expectation. The expectation may be approximately estimated using a small set of samples. Specifically, on each step of the algorithm, we can sample a minibatch of examples B = \{x^{(1)}, \ldots, x^{(m')}\} drawn uniformly from the training set. The minibatch size m' is typically chosen to be a relatively small number of examples, ranging from 1 to a few hundred. Crucially, m' is usually held fixed as the training set size m grows.
We may fit a training set with billions of examples using updates computed on only a hundred examples.

The estimate of the gradient is formed as

g = \frac{1}{m'} \sum_{i=1}^{m'} \nabla_{\theta} L(x^{(i)}, y^{(i)}, \theta)    (5.98)

using examples from the minibatch B. The stochastic gradient descent algorithm then follows the estimated gradient downhill:

\theta \leftarrow \theta - \epsilon g,    (5.99)
where \epsilon is the learning rate.
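Eqs. 5.96 through 5.99 translate directly into code. The sketch below applies minibatch SGD to linear regression with squared error as the per-example loss; the function name, learning rate, and minibatch size are illustrative choices, not prescriptions from the text.

```python
import numpy as np

def sgd(X, y, lr=0.1, batch_size=10, n_steps=500, seed=0):
    """Minimize the average per-example loss with minibatch SGD.

    Per-example loss used here: L(x, y, theta) = 0.5 * (x @ theta - y)**2,
    whose gradient with respect to theta is x * (x @ theta - y).
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        # Sample a minibatch of m' examples uniformly from the training set.
        idx = rng.integers(0, len(X), size=batch_size)
        Xb, yb = X[idx], y[idx]
        # g = (1/m') sum_i grad_theta L(x^(i), y^(i), theta)   (Eq. 5.98)
        g = Xb.T @ (Xb @ theta - yb) / batch_size
        # Follow the estimated gradient downhill.   (Eq. 5.99)
        theta = theta - lr * g
    return theta
```

Note that the cost of each update depends only on batch_size, never on the full training set size m, which is the point made in the next paragraphs.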
Gradient descent in general has often been regarded as slow or unreliable. In the past, the application of gradient descent to non-convex optimization problems was regarded as foolhardy or unprincipled. Today, we know that the machine learning models described in Part II work very well when trained with gradient descent. The optimization algorithm may not be guaranteed to arrive at even a local minimum in a reasonable amount of time, but it often finds a very low value of the cost function quickly enough to be useful.

Stochastic gradient descent has many important uses outside the context of deep learning. It is the main way to train large linear models on very large datasets. For a fixed model size, the cost per SGD update does not depend on the training set size m. In practice, we often use a larger model as the training set size increases, but we are not forced to do so. The number of updates required to reach
convergence usually increases with training set size. However, as m approaches infinity, the model will eventually converge to its best possible test error before SGD has sampled every example in the training set. Increasing m further will not extend the amount of training time needed to reach the model's best possible test error. From this point of view, one can argue that the asymptotic cost of training a model with SGD is O(1) as a function of m.

Prior to the advent of deep learning, the main way to learn nonlinear models was to use the kernel trick in combination with a linear model. Many kernel learning algorithms require constructing an m × m matrix G_{i,j} = k(x^{(i)}, x^{(j)}). Constructing this matrix has computational cost O(m^2), which is clearly undesirable for datasets with billions of examples.
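To make the O(m^2) bottleneck concrete, here is a sketch of the Gram matrix a kernel machine must build, with a Gaussian (RBF) kernel standing in as an illustrative choice of k; the function name is an assumption of this sketch.

```python
import numpy as np

def gram_matrix(X, gamma=1.0):
    """G[i, j] = k(x^(i), x^(j)) with the Gaussian kernel exp(-gamma * ||u - v||^2).

    Both time and memory are O(m^2) in the number of examples m: at
    m = 10^9 examples, even one byte per entry would require ~10^18 bytes.
    """
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)
```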
In academia, starting in 2006, deep learning was initially interesting because it was able to generalize to new examples better than competing algorithms when trained on medium-sized datasets with tens of thousands of examples. Soon after, deep learning garnered additional interest in industry, because it provided a scalable way of training nonlinear models on large datasets.

Stochastic gradient descent and many enhancements to it are described further in Chapter 8.
5.10
Building a Machine Learning Algorithm
Nearly all deep learning algorithms can be described as particular instances of a fairly simple recipe: combine a specification of a dataset, a cost function, an optimization procedure and a model.

For example, the linear regression algorithm combines a dataset consisting of
X and y, the cost function

J(w, b) = -E_{x, y \sim \hat{p}_{data}} \log p_{model}(y \mid x),    (5.100)

the model specification p_{model}(y \mid x) = \mathcal{N}(y; x^{\top} w + b, 1), and, in most cases, the optimization algorithm defined by solving for where the gradient of the cost is zero using the normal equations.

By realizing that we can replace any of these components mostly independently from the others, we can obtain a very wide variety of algorithms.

The cost function typically includes at least one term that causes the learning process to perform statistical estimation. The most common cost function is the negative log-likelihood, so that minimizing the cost function causes maximum likelihood estimation.

The cost function may also include additional terms, such as regularization terms. For example, we can add weight decay to the linear regression cost function to obtain

J(w, b) = \lambda \|w\|_2^2 - E_{x, y \sim \hat{p}_{data}} \log p_{model}(y \mid x).    (5.101)

This still allows closed-form optimization.
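As an illustrative sketch, the "solve for where the gradient of the cost is zero" step for linear regression can be written directly. Adding \lambda I inside the normal equations keeps the closed form in the presence of a weight decay term; the exact scaling of lam relative to Eq. 5.101 depends on how the cost is normalized, and for simplicity this sketch also penalizes the bias.

```python
import numpy as np

def fit_linear_regression(X, y, lam=0.0):
    """Solve the normal equations (X^T X + lam * I) params = X^T y."""
    # Append a column of ones so the bias b is learned as an extra weight.
    Xb = np.hstack([X, np.ones((len(X), 1))])
    A = Xb.T @ Xb + lam * np.eye(Xb.shape[1])
    # np.linalg.solve is preferable to forming the matrix inverse explicitly.
    params = np.linalg.solve(A, Xb.T @ y)
    return params[:-1], params[-1]  # weights w, bias b
```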
If we change the model to be nonlinear, then most cost functions can no longer be optimized in closed form. This requires us to choose an iterative numerical optimization procedure, such as gradient descent.

The recipe for constructing a learning algorithm by combining models, costs, and optimization algorithms supports both supervised and unsupervised learning. The linear regression example shows how to support supervised learning. Unsupervised learning can be supported by defining a dataset that contains only X and providing an appropriate unsupervised cost and model. For example, we can obtain the first PCA vector by specifying that our loss function is

J(w) = E_{x \sim \hat{p}_{data}} \|x - r(x; w)\|_2^2    (5.102)

while our model is defined to have w with norm one and reconstruction function r(x) = w^{\top} x \, w.
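As a sketch of the unsupervised case: for centered data, the unit-norm minimizer of Eq. 5.102 is the direction of maximal variance, which can be recovered with an SVD rather than a generic optimizer. The function names below are illustrative.

```python
import numpy as np

def first_pca_vector(X):
    """Unit-norm w minimizing the reconstruction loss of Eq. 5.102."""
    Xc = X - X.mean(axis=0)          # center the data first
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]                     # top right singular vector, ||w|| = 1

def reconstruct(x, w):
    """r(x) = w^T x w: project x onto the line spanned by w."""
    return (w @ x) * w
```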
In some cases, the cost function may be a function that we cannot actually evaluate, for computational reasons. In these cases, we can still approximately minimize it using iterative numerical optimization so long as we have some way of approximating its gradients.

Most machine learning algorithms make use of this recipe, though it may not immediately be obvious. If a machine learning algorithm seems especially unique or
hand-designed, it can usually be understood as using a special-case optimizer. Some models such as decision trees or k-means require special-case optimizers because their cost functions have flat regions that make them inappropriate for minimization by gradient-based optimizers. Recognizing that most machine learning algorithms can be described using this recipe helps to see the different algorithms as part of a taxonomy of methods for doing related tasks that work for similar reasons, rather than as a long list of algorithms that each have separate justifications.
5.11
Challenges Motivating Deep Learning
The simple machine learning algorithms described in this chapter work very well on a wide variety of important problems. However, they have not succeeded in solving the central problems in AI, such as recognizing speech or recognizing objects.

The development of deep learning was motivated in part by the failure of traditional algorithms to generalize well on such AI tasks.

This section is about how the challenge of generalizing to new examples becomes exponentially more difficult when working with high-dimensional data, and how the mechanisms used to achieve generalization in traditional machine learning are insufficient to learn complicated functions in high-dimensional spaces. Such spaces also often impose high computational costs. Deep learning was designed to overcome these and other obstacles.
5.11.1
The Curse of Dimensionality
Many machine learning problems become exceedingly difficult when the number of dimensions in the data is high. This phenomenon is known as the curse of dimensionality. Of particular concern is that the number of possible distinct configurations of a set of variables increases exponentially as the number of variables increases.
[Figure 5.9: As the number of relevant dimensions of the data increases, the number of configurations of interest may grow exponentially; with v distinguishable values along each of d dimensions, there are O(v^d) distinct regions to track.]
The curse of dimensionality arises in many places in computer science, and especially so in machine learning.

One challenge posed by the curse of dimensionality is a statistical challenge. As illustrated in Fig. 5.9, a statistical challenge arises because the number of possible configurations of x is much larger than the number of training examples. To understand the issue, let us consider that the input space is organized into a grid, like in the figure. In low dimensions we can describe this space with a low number of grid cells that are mostly occupied by the data. When generalizing to a new data point, we can usually tell what to do simply by inspecting the training examples that lie in the same cell as the new input. For example, if estimating the probability density at some point x, we can just return the number of training examples in the same unit volume cell as x, divided by the total number of training examples.
If we wish to classify an example, we can return the most common class of training examples in the same cell. If we are doing regression we can average the target values observed over the examples in that cell. But what about the cells for which we have seen no example? Because in high-dimensional spaces the number of configurations is huge, much larger than our number of examples, most configurations will have no training example associated with them.
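The grid-based estimator described above is easy to write down, and counting occupied cells shows how quickly it breaks down as the dimension grows. This is an illustrative sketch; the function names and the unit-cell gridding are assumptions of the sketch.

```python
import numpy as np

def cell_density(X, query):
    """Density estimate at `query`: fraction of examples in its unit cell."""
    cells = np.floor(X).astype(int)
    q_cell = np.floor(query).astype(int)
    return np.mean(np.all(cells == q_cell, axis=1))

def fraction_occupied(X, n_cells_per_dim):
    """Fraction of the n^d grid cells containing at least one example."""
    occupied = {tuple(c) for c in np.floor(X).astype(int)}
    return len(occupied) / n_cells_per_dim ** X.shape[1]
```

With a fixed number of examples, moving from d = 1 to higher d leaves almost every cell empty, so most queries fall in cells with no training data at all.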
How could we possibly say something meaningful about these new configurations? Many traditional machine learning algorithms simply assume that the output at a new point should be approximately the same as the output at the nearest training point.
5.11.2
Local Constancy and Smoothness Regularization
In order to generalize well, machine learning algorithms need to be guided by prior beliefs about what kind of function they should learn. Previously, we have seen these priors incorporated as explicit beliefs in the form of probability distributions over parameters of the model. More informally, we may also discuss prior beliefs as directly influencing the function itself and only indirectly acting on the parameters via their effect on the function. Additionally, we informally discuss prior beliefs as being expressed implicitly, by choosing algorithms that are biased toward choosing some class of functions over another, even though these biases may not be expressed (or even possible to express) in terms of a probability distribution representing our degree of belief in various functions.

Among the most widely used of these implicit "priors" is the smoothness prior or local constancy prior.
This prior states that the function we learn should not change very much within a small region.

Many simpler algorithms rely exclusively on this prior to generalize well, and as a result they fail to scale to the statistical challenges involved in solving AI-level tasks. Throughout this book, we will describe how deep learning introduces additional (explicit and implicit) priors in order to reduce the generalization error on sophisticated tasks. Here, we explain why the smoothness prior alone is insufficient for these tasks.

There are many different ways to implicitly or explicitly express a prior belief that the learned function should be smooth or locally constant. All of these different methods are designed to encourage the learning process to learn a function f^* that
All of these diﬀerent satisﬁes the condition methods are designed to encourage the that f ∗ (x ) ≈learning f∗ (x + pro ) cess to learn a function f(5.103) satisﬁes the condition for most conﬁgurations x and small know w (5.103) a go goo od f (x)change f (x.+In ) other words, if we kno answ answer er for an input x (for example, if≈x is a lab labeled eled training example) then that x for most conﬁgurations and small change . In words, if we go kno a good answ answer er is probably go goood in the neigh neighb borho orhoood of x. other If we hav have e several goo o dwanswers x is athem answ er for an b input if bine labeled example) then that in some neigh neighb orho orhoo oxd(for we example, would com combine (b (by ytraining some form of av averaging eraging or answ erolation) is probably goduce od inanthe neighbthat orhoagrees od of xwith . If we e yseveral gooasd m answers in interp terp terpolation) to pro produce answer as hav man many of them uc uch h as inossible. some neighborhood we would combine them (by some form of averaging or p interpolation) to pro duce an answer that agrees with as many of them as much as An extreme example of the lo local cal constancy approac approach h is the k nearest neighbors possible. 156 An extreme example of the local constancy approach is the k nearest neighbors
CHAPTER 5. MACHINE LEARNING BASICS
family of learning algorithms. These predictors are literally constant over each region containing all the points x that have the same set of k nearest neighbors in the training set. For k = 1, the number of distinguishable regions cannot be more than the number of training examples.

While the k-nearest neighbors algorithm copies the output from nearby training examples, most kernel machines interpolate between training set outputs associated with nearby training examples. An important class of kernels is the family of local kernels where k(u, v) is large when u = v and decreases as u and v grow farther apart from each other. A local kernel can be thought of as a similarity function that performs template matching, by measuring how closely a test example x resembles each training example x^(i).
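This contrast between copying and interpolating can be sketched in a few lines of NumPy (a toy one-dimensional illustration; the training data and the Gaussian bandwidth are arbitrary choices for demonstration, not from the text):

```python
import numpy as np

# Toy 1-D training set: inputs and real-valued targets.
x_train = np.array([0.0, 1.0, 2.0, 3.0])
y_train = np.array([0.0, 1.0, 0.0, 1.0])

def predict_1nn(x):
    """1-nearest neighbor: copy the output of the closest training point.
    The prediction is piecewise constant over regions of input space."""
    return y_train[np.argmin(np.abs(x_train - x))]

def predict_local_kernel(x, bandwidth=0.5):
    """Nadaraya-Watson smoothing with a Gaussian (local) kernel: k(u, v) is
    largest when u = v and decays as they grow apart, so training outputs
    are interpolated rather than copied."""
    w = np.exp(-((x_train - x) ** 2) / (2 * bandwidth ** 2))
    return np.dot(w, y_train) / np.sum(w)

# 1-NN is constant over the region whose nearest neighbor is x_train[1]...
print(predict_1nn(0.9), predict_1nn(1.1))  # both copy y_train[1] = 1.0
# ...while the kernel machine blends nearby outputs smoothly.
print(predict_local_kernel(0.5))
```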
Much of the modern motivation for deep learning is derived from studying the limitations of local template matching and how deep models are able to succeed in cases where local template matching fails (Bengio et al., 2006b).

Decision trees also suffer from the limitations of exclusively smoothness-based learning because they break the input space into as many regions as there are leaves and use a separate parameter (or sometimes many parameters for extensions of decision trees) in each region. If the target function requires a tree with at least n leaves to be represented accurately, then at least n training examples are required to fit the tree. A multiple of n is needed to achieve some level of statistical confidence in the predicted output.

In general, to distinguish O(k) regions in input space, all of these methods require O(k) examples.
Typically there are O(k) parameters, with O(1) parameters associated with each of the O(k) regions. The case of a nearest neighbor scenario, where each training example can be used to define at most one region, is illustrated in Fig. 5.10.

Is there a way to represent a complex function that has many more regions to be distinguished than the number of training examples? Clearly, assuming only smoothness of the underlying function will not allow a learner to do that. For example, imagine that the target function is a kind of checkerboard. A checkerboard contains many variations but there is a simple structure to them. Imagine what happens when the number of training examples is substantially smaller than the number of black and white squares on the checkerboard. Based on only local generalization and the smoothness or local constancy prior, we would be guaranteed to correctly guess the color of a new point only if it lies within the same checkerboard square as a training example.
There is no guarantee that the learner could correctly extend the checkerboard pattern to points lying in squares that do not contain training examples. With this prior alone, the only information that an
example tells us is the color of its square, and the only way to get the colors of the entire checkerboard right is to cover each of its cells with at least one example.

The smoothness assumption and the associated nonparametric learning algorithms work extremely well so long as there are enough examples for the learning algorithm to observe high points on most peaks and low points on most valleys of the true underlying function to be learned. This is generally true when the function to be learned is smooth enough and varies in few enough dimensions. In high dimensions, even a very smooth function can change smoothly but in a different way along each dimension. If the function additionally behaves differently in different regions, it can become extremely complicated to describe with a set of training examples. If the function is complicated (we want to distinguish a huge number of regions compared to the number of examples), is there any hope to generalize well?
The answer to both of these questions is yes. The key insight is that a very large number of regions, e.g., O(2^k), can be defined with O(k) examples, so long as we introduce some dependencies between the regions via additional assumptions about the underlying data generating distribution. In this way, we can actually generalize non-locally (Bengio and Monperrus, 2005; Bengio et al., 2006c). Many different deep learning algorithms provide implicit or explicit assumptions that are reasonable for a broad range of AI tasks in order to capture these advantages.

Other approaches to machine learning often make stronger, task-specific assumptions. For example, we could easily solve the checkerboard task by providing the assumption that the target function is periodic.
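To make the checkerboard example concrete, here is a small sketch (the board size, sample counts, and the sine-based periodic predictor are illustrative choices of ours, not constructions from the text): a purely local 1-nearest-neighbor learner is compared against a model that is simply handed the periodicity assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def checkerboard(x, y):
    """Target function: color of the unit square containing (x, y)."""
    return (np.floor(x) + np.floor(y)).astype(int) % 2

# Far fewer labeled examples than the 64 squares of an 8x8 board.
xt, yt = rng.uniform(0, 8, 20), rng.uniform(0, 8, 20)
labels = checkerboard(xt, yt)

def predict_1nn(x, y):
    """Local constancy only: copy the label of the nearest training point."""
    return labels[np.argmin((xt - x) ** 2 + (yt - y) ** 2)]

def predict_periodic(x, y):
    """Task-specific prior: assume the target is periodic. The sign of
    sin(pi*x)*sin(pi*y) reproduces the checkerboard coloring exactly."""
    return (np.sin(np.pi * x) * np.sin(np.pi * y) < 0).astype(int)

xs, ys = rng.uniform(0, 8, 1000), rng.uniform(0, 8, 1000)
truth = checkerboard(xs, ys)
acc_local = np.mean(np.array([predict_1nn(a, b) for a, b in zip(xs, ys)]) == truth)
acc_periodic = np.mean(predict_periodic(xs, ys) == truth)
print(acc_local, acc_periodic)  # the periodic model is exact; 1-NN is far from it
```

With only 20 examples, the local learner is correct mainly in squares that happen to contain a training point, while the periodic model labels every square correctly.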
Usually we do not include such strong, task-specific assumptions into neural networks so that they can generalize to a much wider variety of structures. AI tasks have structure that is much too complex to be limited to simple, manually specified properties such as periodicity, so we want learning algorithms that embody more general-purpose assumptions. The core idea in deep learning is that we assume that the data was generated by the composition of factors, or features, potentially at multiple levels in a hierarchy. Many other similarly generic assumptions can further improve deep learning algorithms. These apparently mild assumptions allow an exponential gain in the relationship between the number of examples and the number of regions that can be distinguished. These exponential gains are described more precisely in Sec. 6.4.1, Sec. 15.4, and Sec. 15.5.
The exponential advantages conferred by the use of deep, distributed representations counter the exponential challenges posed by the curse of dimensionality.
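As a minimal numerical sketch of such a gain (the construction here, using coordinate signs as the k features, is our own illustrative choice, not one from the text): k binary feature detectors jointly distinguish 2^k regions of input space.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 10  # number of binary features

# Each point is encoded by the sign pattern of its k coordinates.
# The k features jointly carve space into 2**k = 1024 orthants,
# even though only k "detectors" are involved.
points = rng.uniform(-1, 1, size=(100_000, k))
codes = {tuple(row) for row in (points > 0)}
print(len(codes))  # all 1024 regions appear among the samples

# A purely local method, with one parameter per region, would need on
# the order of one training example per region to tell them all apart.
```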
5.11.3 Manifold Learning
An important concept underlying many ideas in machine learning is that of a manifold.

A manifold is a connected region. Mathematically, it is a set of points, associated with a neighborhood around each point. From any given point, the manifold locally appears to be a Euclidean space. In everyday life, we experience the surface of the world as a 2-D plane, but it is in fact a spherical manifold in 3-D space.

The definition of a neighborhood surrounding each point implies the existence of transformations that can be applied to move on the manifold from one position to a neighboring one. In the example of the world's surface as a manifold, one can walk north, south, east, or west.

Although there is a formal mathematical meaning to the term "manifold,"
in machine learning it tends to be used more loosely to designate a connected set of points that can be approximated well by considering only a small number of degrees of freedom, or dimensions, embedded in a higher-dimensional space. Each dimension corresponds to a local direction of variation. See Fig. 5.11 for an example of training data lying near a one-dimensional manifold embedded in two-dimensional space. In the context of machine learning, we allow the dimensionality of the manifold to vary from one point to another. This often happens when a manifold intersects itself. For example, a figure eight is a manifold that has a single dimension in most places but two dimensions at the intersection at the center.
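For instance (a synthetic sketch; the spiral shape and noise level are arbitrary choices of ours), data near a one-dimensional manifold embedded in two-dimensional space can be generated by sweeping a single underlying degree of freedom:

```python
import numpy as np

rng = np.random.default_rng(0)

# One intrinsic coordinate t sweeps out a spiral in the plane; small
# noise keeps the samples near (not exactly on) the 1-D manifold.
t = rng.uniform(0, 4 * np.pi, 500)
radius = 1.0 + 0.1 * t
data = np.column_stack([
    radius * np.cos(t) + rng.normal(0, 0.05, t.shape),
    radius * np.sin(t) + rng.normal(0, 0.05, t.shape),
])

# The ambient representation is 2-D, but a single number per point (t)
# captures almost all of the variation: the local direction of
# variation at each point is one-dimensional.
print(data.shape)  # (500, 2)
```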
Many machine learning problems seem hopeless if we expect the machine learning algorithm to learn functions with interesting variations across all of R^n. Manifold learning algorithms surmount this obstacle by assuming that most of R^n consists of invalid inputs, and that interesting inputs occur only along a collection of manifolds containing a small subset of points, with interesting variations in the output of the learned function occurring only along directions that lie on the manifold, or with interesting variations happening only when we move from one manifold to another. Manifold learning was introduced in the case of continuous-valued data and the unsupervised learning setting, although this probability concentration idea can be generalized to both discrete data and the supervised learning setting: the key assumption remains that probability mass is highly concentrated.
The assumption that the data lies along a low-dimensional manifold may not always be correct or useful. We argue that in the context of AI tasks, such as those that involve processing images, sounds, or text, the manifold assumption is at least approximately correct. The evidence in favor of this assumption consists of two categories of observations.

The first observation in favor of the manifold hypothesis is that the probability distribution over images, text strings, and sounds that occur in real life is highly concentrated. Uniform noise essentially never resembles structured inputs from these domains. Fig. 5.12 shows how, instead, uniformly sampled points look like the patterns of static that appear on analog television sets when no signal is available. Similarly, if you generate a document by picking letters uniformly at random, what is the probability that you will get a meaningful English-language text? Almost
zero, again, because most of the long sequences of letters do not correspond to a natural language sequence: the distribution of natural language sequences occupies a very small volume in the total space of sequences of letters.
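This near-zero probability is easy to check empirically (a crude sketch; the ten-word list below is an arbitrary stand-in for a real dictionary):

```python
import random
import string

random.seed(0)

# A tiny stand-in dictionary of five-letter English words.
words = {"about", "house", "large", "learn", "model",
         "place", "point", "right", "think", "world"}

# Sample uniformly random five-letter strings and count the words found.
n = 100_000
hits = sum("".join(random.choices(string.ascii_lowercase, k=5)) in words
           for _ in range(n))
print(hits / n)  # essentially zero: there are 26**5 (about 11.9 million)
                 # five-letter strings, and almost none are words
```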
Of course, concentrated probability distributions are not sufficient to show that the data lies on a reasonably small number of manifolds. We must also establish that the examples we encounter are connected to each other by other
examples, with each example surrounded by other highly similar examples that may be reached by applying transformations to traverse the manifold. The second argument in favor of the manifold hypothesis is that we can also imagine such neighborhoods and transformations, at least informally. In the case of images, we can certainly think of many possible transformations that allow us to trace out a manifold in image space: we can gradually dim or brighten the lights, gradually move or rotate objects in the image, gradually alter the colors on the surfaces of objects, etc. It remains likely that there are multiple manifolds involved in most applications. For example, the manifold of images of human faces may not be connected to the manifold of images of cat faces.

These thought experiments supporting the manifold hypothesis convey some intuitive reasons supporting it.
More rigorous experiments (Cayton, 2005; Narayanan and Mitter, 2010; Schölkopf et al., 1998; Roweis and Saul, 2000; Tenenbaum et al., 2000; Brand, 2003; Belkin and Niyogi, 2003; Donoho and Grimes, 2003; Weinberger and Saul, 2004) clearly support the hypothesis for a large class of datasets of interest in AI.

When the data lies on a low-dimensional manifold, it can be most natural for machine learning algorithms to represent the data in terms of coordinates on the manifold, rather than in terms of coordinates in R^n. In everyday life, we can think of roads as 1-D manifolds embedded in 3-D space. We give directions to specific addresses in terms of address numbers along these 1-D roads, not in terms of coordinates in 3-D space. Extracting these manifold coordinates is challenging, but holds the promise to improve many machine learning algorithms. This general principle is applied in
many contexts. Fig. 5.13 shows the manifold structure of a dataset consisting of faces. By the end of this book, we will have developed the methods necessary to learn such a manifold structure. In Fig. 20.6, we will see how a machine learning algorithm can successfully accomplish this goal.

This concludes Part I, which has provided the basic concepts in mathematics and machine learning which are employed throughout the remaining parts of the book. You are now prepared to embark upon your study of deep learning.
Part II
Deep Networks: Modern Practices
This part of the book summarizes the state of modern deep learning as it is used to solve practical applications.

Deep learning has a long history and many aspirations. Several approaches have been proposed that have yet to entirely bear fruit. Several ambitious goals have yet to be realized. These less-developed branches of deep learning appear in the final part of the book.

This part focuses only on those approaches that are essentially working technologies that are already used heavily in industry.

Modern deep learning provides a very powerful framework for supervised learning. By adding more layers and more units within a layer, a deep network can represent functions of increasing complexity. Most tasks that consist of mapping an input vector to an output vector, and that are easy for a person to do rapidly, can be accomplished via deep learning, given sufficiently large models and sufficiently large datasets of labeled training examples.
Other tasks, that cannot be described as associating one vector to another, or that are difficult enough that a person would require time to think and reflect in order to accomplish the task, remain beyond the scope of deep learning for now.

This part of the book describes the core parametric function approximation technology that is behind nearly all modern practical applications of deep learning. We begin by describing the feedforward deep network model that is used to represent these functions. Next, we present advanced techniques for regularization and optimization of such models. Scaling these models to large inputs such as high resolution images or long temporal sequences requires specialization. We introduce the convolutional network for scaling to large images and the recurrent neural
network for processing temporal sequences. Finally, we present general guidelines for the practical methodology involved in designing, building, and configuring an application involving deep learning, and review some of the applications of deep learning.

These chapters are the most important for a practitioner: someone who wants to begin implementing and using deep learning algorithms to solve real-world problems today.
Chapter 6
Deep Feedforward Networks

Deep feedforward networks, also often called feedforward neural networks, or multilayer perceptrons (MLPs), are the quintessential deep learning models. The goal of a feedforward network is to approximate some function f*. For example, for a classifier, y = f*(x) maps an input x to a category y. A feedforward network defines a mapping y = f(x; θ) and learns the value of the parameters θ that result in the best function approximation.

These models are called feedforward because information flows through the function being evaluated from x, through the intermediate computations used to define f, and finally to the output y. There are no feedback connections in which outputs of the model are fed back into itself.
When feedforward neural networks are extended to include feedback connections, they are called recurrent neural networks, presented in Chapter 10.

Feedforward networks are of extreme importance to machine learning practitioners. They form the basis of many important commercial applications. For example, the convolutional networks used for object recognition from photos are a specialized kind of feedforward network. Feedforward networks are a conceptual stepping stone on the path to recurrent networks, which power many natural language applications.

Feedforward neural networks are called networks because they are typically represented by composing together many different functions. The model is associated with a directed acyclic graph describing how the functions are composed together. For example, we might have three functions f^(1), f^(2), and f^(3) connected in a chain, to form f(x) = f^(3)(f^(2)(f^(1)(x))). These chain structures are the most
commonly used structures of neural networks. In this case, f^(1) is called the first layer of the network, f^(2) is called the second layer, and so on. The overall length
CHAPTER 6. DEEP FEEDFORWARD NETWORKS
of the chain gives the depth of the model. It is from this terminology that the name "deep learning" arises. The final layer of a feedforward network is called the output layer. During neural network training, we drive f(x) to match f∗(x). The training data provides us with noisy, approximate examples of f∗(x) evaluated at different training points. Each example x is accompanied by a label y ≈ f∗(x). The training examples specify directly what the output layer must do at each point x; it must produce a value that is close to y. The behavior of the other layers is not specified directly by the training data. The learning algorithm must decide how to use those layers to produce the desired output, but the training data does not say what each individual layer should do.
Instead, the learning algorithm must decide how to use these layers to best implement an approximation of f∗. Because the training data does not show the desired output for each of these layers, these layers are called hidden layers.

Finally, these networks are called neural because they are loosely inspired by neuroscience. Each hidden layer of the network is typically vector-valued. The dimensionality of these hidden layers determines the width of the model. Each element of the vector may be interpreted as playing a role analogous to a neuron. Rather than thinking of the layer as representing a single vector-to-vector function, we can also think of the layer as consisting of many units that act in parallel, each representing a vector-to-scalar function. Each unit resembles a neuron in the sense that it receives input from many other units and computes its own activation value. The idea of using many layers of vector-valued representation
is drawn from neuroscience. The choice of the functions f^(i)(x) used to compute these representations is also loosely guided by neuroscientific observations about the functions that biological neurons compute. However, modern neural network research is guided by many mathematical and engineering disciplines, and the goal of neural networks is not to perfectly model the brain. It is best to think of feedforward networks as function approximation machines that are designed to achieve statistical generalization, occasionally drawing some insights from what we know about the brain, rather than as models of brain function.

One way to understand feedforward networks is to begin with linear models and consider how to overcome their limitations. Linear models, such as logistic regression and linear regression, are appealing because they may be fit efficiently
and reliably, either in closed form or with convex optimization. Linear models also have the obvious defect that the model capacity is limited to linear functions, so the model cannot understand the interaction between any two input variables.

To extend linear models to represent nonlinear functions of x, we can apply the linear model not to x itself but to a transformed input φ(x), where φ is a
nonlinear transformation. Equivalently, we can apply the kernel trick described in Sec. 5.7.2, to obtain a nonlinear learning algorithm based on implicitly applying the φ mapping. We can think of φ as providing a set of features describing x, or as providing a new representation for x.

The question is then how to choose the mapping φ.

1. One option is to use a very generic φ, such as the infinite-dimensional φ that is implicitly used by kernel machines based on the RBF kernel. If φ(x) is of high enough dimension, we can always have enough capacity to fit the training set, but generalization to the test set often remains poor. Very generic feature mappings are usually based only on the principle of local smoothness and do not encode enough prior information to solve advanced problems.

2. Another option is to manually engineer φ. Until the advent of deep learning, this was the dominant approach. This approach requires decades of human
effort for each separate task, with practitioners specializing in different domains such as speech recognition or computer vision, and with little transfer between domains.

3. The strategy of deep learning is to learn φ. In this approach, we have a model y = f(x; θ, w) = φ(x; θ)⊤w. We now have parameters θ that we use to learn φ from a broad class of functions, and parameters w that map from φ(x) to the desired output. This is an example of a deep feedforward network, with φ defining a hidden layer. This approach is the only one of the three that gives up on the convexity of the training problem, but the benefits outweigh the harms. In this approach, we parametrize the representation as φ(x; θ) and use the optimization algorithm to find the θ that corresponds to a good representation.
If we wish, this approach can capture the benefit of the first approach by being highly generic—we do so by using a very broad family φ(x; θ). This approach can also capture the benefit of the second approach. Human practitioners can encode their knowledge to help generalization by designing families φ(x; θ) that they expect will perform well. The advantage is that the human designer only needs to find the right general function family rather than finding precisely the right function.

This general principle of improving models by learning features extends beyond the feedforward networks described in this chapter. It is a recurring theme of deep learning that applies to all of the kinds of models described throughout this book. Feedforward networks are the application of this principle to learning deterministic
mappings from x to y that lack feedback connections. Other models presented later will apply these principles to learning stochastic mappings, learning functions with feedback, and learning probability distributions over a single vector.

We begin this chapter with a simple example of a feedforward network. Next, we address each of the design decisions needed to deploy a feedforward network. First, training a feedforward network requires making many of the same design decisions as are necessary for a linear model: choosing the optimizer, the cost function, and the form of the output units. We review these basics of gradient-based learning, then proceed to confront some of the design decisions that are unique to feedforward networks. Feedforward networks have introduced the concept of a hidden layer, and this requires us to choose the activation functions that will be used to compute the hidden layer values.
We must also design the architecture of the network, including how many layers the network should contain, how these networks should be connected to each other, and how many units should be in each layer. Learning in deep neural networks requires computing the gradients of complicated functions. We present the back-propagation algorithm and its modern generalizations, which can be used to efficiently compute these gradients. Finally, we close with some historical perspective.
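Before turning to the example, the chain-based view of a network f(x) = f^(3)(f^(2)(f^(1)(x))) described earlier can be sketched directly as function composition. This is an illustrative toy, not the book's implementation; the layer sizes and weight values are made up for the example.

```python
import numpy as np

# Illustrative sketch only: a three-function chain f(x) = f3(f2(f1(x))).
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))  # first layer: 2 inputs -> 3 units
W2 = rng.normal(size=(2, 3))  # second layer: 3 units -> 2 units
W3 = rng.normal(size=(1, 2))  # output layer: 2 units -> 1 output

def f1(x):
    return np.maximum(0, W1 @ x)

def f2(h):
    return np.maximum(0, W2 @ h)

def f3(h):
    return W3 @ h

x = np.array([1.0, -2.0])
y = f3(f2(f1(x)))  # the chain has length 3; this length gives the model's depth
```

The intermediate results of f1 and f2 are the hidden layers: the training data would constrain only y, not those intermediate vectors.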
6.1 Example: Learning XOR
To make the idea of a feedforward network more concrete, we begin with an example of a fully functioning feedforward network on a very simple task: learning the XOR function.

The XOR function ("exclusive or") is an operation on two binary values, x1 and x2. When exactly one of these binary values is equal to 1, the XOR function returns 1. Otherwise, it returns 0. The XOR function provides the target function y = f∗(x) that we want to learn. Our model provides a function y = f(x; θ) and our learning algorithm will adapt the parameters θ to make f as similar as possible to f∗.

In this simple example, we will not be concerned with statistical generalization. We want our network to perform correctly on the four points X = {[0, 0]⊤, [0, 1]⊤, [1, 0]⊤, and [1, 1]⊤}. We will train the network on all four of these points.
The only challenge is to fit the training set.

We can treat this problem as a regression problem and use a mean squared error loss function. We choose this loss function to simplify the math for this example as much as possible. We will see later that there are other, more appropriate
approaches for modeling binary data. Evaluated on our whole training set, the MSE loss function is

    J(θ) = (1/4) ∑_{x∈X} (f∗(x) − f(x; θ))².   (6.1)

Now we must choose the form of our model, f(x; θ). Suppose that we choose a linear model, with θ consisting of w and b. Our model is defined to be

    f(x; w, b) = x⊤w + b.   (6.2)

We can minimize J(θ) in closed form with respect to w and b using the normal equations.

After solving the normal equations, we obtain w = 0 and b = 1/2. The linear model simply outputs 0.5 everywhere. Why does this happen? Fig. 6.1 shows how a linear model is not able to represent the XOR function. One way to solve this problem is to use a model that learns a different feature space in which a linear model is able to represent the solution.

Specifically, we will introduce a very simple feedforward network with one hidden layer containing two hidden units. See Fig. 6.2 for an illustration of this model. This feedforward network has a vector of hidden units h that are computed by a function f^(1)(x; W, c). The values of these hidden units are then used as the input for a second layer. The second layer is the output layer of the network.
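The result w = 0, b = 1/2 is easy to check numerically. The sketch below (an illustration, not from the book) fits the linear model of Eq. 6.2 to the four XOR points by least squares, which solves the same normal equations.

```python
import numpy as np

# The four XOR inputs and their targets f*(x).
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 0.])

# Append a column of ones so the bias b is fit together with the weights w.
A = np.hstack([X, np.ones((4, 1))])
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = theta[:2], theta[2]
# w comes out as [0, 0] and b as 1/2, so the model outputs 0.5 everywhere.
```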
The output layer is still just a linear regression model, but now it is applied to h rather than to x. The network now contains two functions chained together: h = f^(1)(x; W, c) and y = f^(2)(h; w, b), with the complete model being f(x; W, c, w, b) = f^(2)(f^(1)(x)).

What function should f^(1) compute? Linear models have served us well so far, and it may be tempting to make f^(1) be linear as well. Unfortunately, if f^(1) were linear, then the feedforward network as a whole would remain a linear function of its input. Ignoring the intercept terms for the moment, suppose f^(1)(x) = W⊤x and f^(2)(h) = h⊤w. Then f(x) = w⊤W⊤x. We could represent this function as f(x) = x⊤w′ where w′ = Ww.

Clearly, we must use a nonlinear function to describe the features. Most neural networks do so using an affine transformation controlled by learned parameters, followed by a fixed, nonlinear function called an activation function. We use that strategy here, by defining h = g(W⊤x + c), where W provides the weights of a linear transformation and c the biases. Previously, to describe a linear regression model, we used a vector of weights and a scalar bias parameter to describe an
[Figure 6.1 shows two panels: "Original x Space" with axes x1 and x2, and "Learned h Space" with axes h1 and h2.]

Figure 6.1: Solving the XOR problem by learning a representation. The bold numbers printed on the plot indicate the value that the learned function must output at each point. (Left) A linear model applied directly to the original input cannot implement the XOR function. When x1 = 0, the model's output must increase as x2 increases. When x1 = 1, the model's output must decrease as x2 increases. A linear model must apply a fixed coefficient w2 to x2. The linear model therefore cannot use the value of x1 to change the coefficient on x2 and cannot solve this problem. (Right) In the transformed space represented by the features extracted by a neural network, a linear model can now solve the problem. In our example solution, the two points that must have output 1 have been collapsed into a single point in feature space. In other words, the nonlinear features have mapped both x = [1, 0]⊤ and x = [0, 1]⊤ to a single point in feature space, h = [1, 0]⊤. The linear model can now describe the function as increasing in h1 and decreasing in h2.

In this example, the motivation for learning the feature space is only to make the model capacity greater so that it can fit the training set. In more realistic applications, learned representations can also help the model to generalize.
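The collapse of [0, 1]⊤ and [1, 0]⊤ onto h = [1, 0]⊤ can be checked with a few lines of code. This sketch is only an illustration; it borrows the solution parameters given later in this section (W with all entries 1, c = [0, −1]⊤).

```python
import numpy as np

# Solution parameters quoted from later in this section.
W = np.array([[1., 1.], [1., 1.]])
c = np.array([0., -1.])

def features(x):
    # h = max{0, W^T x + c}: the learned feature space of the right panel
    return np.maximum(0, W.T @ x + c)

points = [np.array(p) for p in ([0., 0.], [0., 1.], [1., 0.], [1., 1.])]
hs = [features(x) for x in points]
# [0, 1] and [1, 0], the two inputs that must output 1, both map to h = [1, 0].
```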
[Figure 6.2 shows the network drawn in two styles: as a graph of individual units, and as a graph of the vectors x, h, and y, with edges labeled by the parameters W and w.]

Figure 6.2: An example of a feedforward network, drawn in two different styles. Specifically, this is the feedforward network we use to solve the XOR example. It has a single hidden layer containing two units. (Left) In this style, we draw every unit as a node in the graph. This style is very explicit and unambiguous but for networks larger than this example it can consume too much space. (Right) In this style, we draw a node in the graph for each entire vector representing a layer's activations. This style is much more compact. Sometimes we annotate the edges in this graph with the name of the parameters that describe the relationship between two layers. Here, we indicate that a matrix W describes the mapping from x to h, and a vector w describes the mapping from h to y. We typically omit the intercept parameters associated with each layer when labeling this kind of drawing.
affine transformation from an input vector to an output scalar. Now, we describe an affine transformation from a vector x to a vector h, so an entire vector of bias parameters is needed. The activation function g is typically chosen to be a function that is applied element-wise, with h_i = g(x⊤W:,i + c_i). In modern neural networks, the default recommendation is to use the rectified linear unit or ReLU (Jarrett et al., 2009; Nair and Hinton, 2010; Glorot et al., 2011a), defined by the activation function g(z) = max{0, z}, depicted in Fig. 6.3.

We can now specify our complete network as

    f(x; W, c, w, b) = w⊤ max{0, W⊤x + c} + b.   (6.3)

We can now specify a solution to the XOR problem. Let

    W = [1 1
         1 1],   (6.4)

    c = [0, −1]⊤,   (6.5)

    w = [1, −2]⊤,   (6.6)
CHAPTER 6. DEEP FEEDFORWARD NETWORKS
Figure 6.3: The rectified linear activation function. This activation function is the default activation function recommended for use with most feedforward neural networks. Applying this function to the output of a linear transformation yields a nonlinear transformation. However, the function remains very close to linear, in the sense that it is a piecewise linear function with two linear pieces. Because rectified linear units are nearly linear, they preserve many of the properties that make linear models easy to optimize with gradient-based methods. They also preserve many of the properties that make linear models generalize well. A common principle throughout computer science is that we can build complicated systems from minimal components. Much as a Turing machine's memory needs only to be able to store 0 or 1 states, we can build a universal function approximator from rectified linear functions.
and b = 0.

We can now walk through the way that the model processes a batch of inputs. Let X be the design matrix containing all four points in the binary input space, with one example per row:

    X = [ 0 0
          0 1
          1 0
          1 1 ].    (6.7)

The first step in the neural network is to multiply the input matrix by the first layer's weight matrix:

    XW = [ 0 0
           1 1
           1 1
           2 2 ].    (6.8)

Next, we add the bias vector c, to obtain

    [ 0 −1
      1  0
      1  0
      2  1 ].    (6.9)

In this space, all of the examples lie along a line with slope 1. As we move along this line, the output needs to begin at 0, then rise to 1, then drop back down to 0. A linear model cannot implement such a function. To finish computing the value of h for each example, we apply the rectified linear transformation:

    [ 0 0
      1 0
      1 0
      2 1 ].    (6.10)

This transformation has changed the relationship between the examples. They no longer lie on a single line. As shown in Fig. 6.1, they now lie in a space where a linear model can solve the problem.

We finish by multiplying by the weight vector w:

    [ 0
      1
      1
      0 ].    (6.11)
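The batch computation above can be reproduced directly (a minimal sketch, assuming NumPy is available; this code is an illustration, not part of the book itself):

```python
import numpy as np

def relu(z):
    """Element-wise rectified linear activation, g(z) = max{0, z}."""
    return np.maximum(0, z)

# Parameters of the hand-specified XOR solution from the text.
W = np.array([[1., 1.],
              [1., 1.]])
c = np.array([0., -1.])
w = np.array([1., -2.])
b = 0.

# Design matrix X: all four points in the binary input space (Eq. 6.7).
X = np.array([[0., 0.],
              [0., 1.],
              [1., 0.],
              [1., 1.]])

XW = X @ W            # Eq. 6.8
H = relu(XW + c)      # Eqs. 6.9-6.10: add the bias, then rectify
y_hat = H @ w + b     # Eq. 6.11: multiply by the weight vector w

print(y_hat)          # [0. 1. 1. 0.] -- the XOR of each row of X
```

The hidden representation H matches Eq. 6.10, and the final output matches Eq. 6.11.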
The neural network has obtained the correct answer for every example in the batch.

In this example, we simply specified the solution, then showed that it obtained zero error. In a real situation, there might be billions of model parameters and billions of training examples, so one cannot simply guess the solution as we did here. Instead, a gradient-based optimization algorithm can find parameters that produce very little error. The solution we described to the XOR problem is at a global minimum of the loss function, so gradient descent could converge to this point. There are other equivalent solutions to the XOR problem that gradient descent could also find. The convergence point of gradient descent depends on the initial values of the parameters. In practice, gradient descent would usually not find clean, easily understood, integer-valued solutions like the one we presented here.
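The global-minimum claim can be checked numerically (a minimal sketch, assuming NumPy; added here for illustration): the mean squared error at the specified parameters is exactly zero, and since squared error is nonnegative, no parameter setting can do better. Perturbing any parameter, for example w, yields a strictly positive loss:

```python
import numpy as np

def mse_loss(W, c, w, b):
    """Mean squared error of the ReLU network on the four XOR cases."""
    X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    y = np.array([0., 1., 1., 0.])          # XOR targets
    h = np.maximum(0, X @ W + c)            # hidden layer
    y_hat = h @ w + b                       # network output
    return np.mean((y_hat - y) ** 2)

W = np.array([[1., 1.], [1., 1.]])
c = np.array([0., -1.])
w = np.array([1., -2.])

loss_star = mse_loss(W, c, w, b=0.)                        # at the solution
loss_pert = mse_loss(W, c, w + np.array([0.1, 0.]), b=0.)  # perturbed w

print(loss_star, loss_pert)   # 0.0 at the solution; strictly positive nearby
```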
6.2
Gradient-Based Learning
Designing and training a neural network is not much different from training any other machine learning model with gradient descent. In Sec. 5.10, we described how to build a machine learning algorithm by specifying an optimization procedure, a cost function, and a model family.

The largest difference between the linear models we have seen so far and neural networks is that the nonlinearity of a neural network causes most interesting loss functions to become non-convex. This means that neural networks are usually trained by using iterative, gradient-based optimizers that merely drive the cost function to a very low value, rather than the linear equation solvers used to train linear regression models or the convex optimization algorithms with global convergence guarantees used to train logistic regression or SVMs. Convex optimization converges starting from any initial parameters (in theory; in practice it is very robust but can encounter numerical problems). Stochastic gradient descent applied to non-convex loss functions has no such convergence guarantee, and is sensitive to the values of the initial parameters. For feedforward neural networks, it is important to initialize all weights to small random values. The biases may be initialized to zero or to small positive values. The iterative gradient-based optimization algorithms used to train feedforward networks and almost all other deep models will be described in detail in Chapter 8, with parameter initialization in particular discussed in Sec. 8.4. For the moment, it suffices to understand that the training algorithm is almost always based on using the gradient to descend the cost function in one way or another. The specific algorithms are improvements and refinements on the ideas of gradient descent, introduced in Sec. 4.3, and,
more specifically, are most often improvements of the stochastic gradient descent algorithm, introduced in Sec. 5.9.

We can, of course, train models such as linear regression and support vector machines with gradient descent too, and in fact this is common when the training set is extremely large. From this point of view, training a neural network is not much different from training any other model. Computing the gradient is slightly more complicated for a neural network, but can still be done efficiently and exactly. Sec. 6.5 will describe how to obtain the gradient using the back-propagation algorithm and modern generalizations of the back-propagation algorithm.

As with other machine learning models, to apply gradient-based learning we must choose a cost function, and we must choose how to represent the output of the model.
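To make the gradient computation concrete, here is a minimal sketch (assuming NumPy; the network and input values are made up for illustration, chosen so that no hidden unit sits at the ReLU kink): for squared error on one example, the analytic gradient with respect to the output weights, ∂L/∂w = 2(ŷ − y)h, agrees with a central finite-difference estimate. Back-propagation (Sec. 6.5) organizes exactly this kind of calculation efficiently for all parameters at once.

```python
import numpy as np

# A tiny network y_hat = w^T relu(W^T x + c) + b with squared error loss.
W = np.array([[1., 1.], [1., 1.]])
c = np.array([0.5, 0.5])          # keeps both hidden units away from the kink
w = np.array([1., -2.])
b = 0.3
x = np.array([1., 0.])
y = 0.

def loss(w_vec):
    h = np.maximum(0, W.T @ x + c)
    y_hat = w_vec @ h + b
    return (y_hat - y) ** 2

# Analytic gradient with respect to w: dL/dw = 2 (y_hat - y) * h.
h = np.maximum(0, W.T @ x + c)
grad_analytic = 2 * (w @ h + b - y) * h

# Central finite differences as an independent check.
eps = 1e-5
grad_numeric = np.zeros_like(w)
for i in range(len(w)):
    e = np.zeros_like(w)
    e[i] = eps
    grad_numeric[i] = (loss(w + e) - loss(w - e)) / (2 * eps)

print(grad_analytic, grad_numeric)   # the two estimates agree
```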
We now revisit these design considerations with special emphasis on the neural networks scenario.

6.2.1 Cost Functions

An important aspect of the design of a deep neural network is the choice of the cost function. Fortunately, the cost functions for neural networks are more or less the same as those for other parametric models, such as linear models.

In most cases, our parametric model defines a distribution p(y | x; θ) and we simply use the principle of maximum likelihood. This means we use the cross-entropy between the training data and the model's predictions as the cost function.

Sometimes, we take a simpler approach, where rather than predicting a complete probability distribution over y, we merely predict some statistic of y conditioned on x. Specialized loss functions allow us to train a predictor of these estimates.

The total cost function used to train a neural network will often combine one
of the primary cost functions described here with a regularization term. We have already seen some simple examples of regularization applied to linear models in Sec. 5.2.2. The weight decay approach used for linear models is also directly applicable to deep neural networks and is among the most popular regularization strategies. More advanced regularization strategies for neural networks will be described in Chapter 7.

6.2.1.1 Learning Conditional Distributions with Maximum Likelihood

Most modern neural networks are trained using maximum likelihood. This means that the cost function is simply the negative log-likelihood, equivalently described
as the cross-entropy between the training data and the model distribution. This cost function is given by

    J(θ) = −E_{x,y∼p̂_data} log p_model(y | x).    (6.12)

The specific form of the cost function changes from model to model, depending on the specific form of log p_model. The expansion of the above equation typically yields some terms that do not depend on the model parameters and may be discarded. For example, as we saw in Sec. 5.5.1, if p_model(y | x) = N(y; f(x; θ), I), then we recover the mean squared error cost,

    J(θ) = ½ E_{x,y∼p̂_data} ‖y − f(x; θ)‖² + const,    (6.13)

up to a scaling factor of ½ and a term that does not depend on θ. The discarded constant is based on the variance of the Gaussian distribution, which in this case we chose not to parametrize.
Previously, we saw that the equivalence between maximum likelihood estimation with a Gaussian output distribution and minimization of mean squared error holds for a linear model, but in fact, the equivalence holds regardless of the f(x; θ) used to predict the mean of the Gaussian.

An advantage of this approach of deriving the cost function from maximum likelihood is that it removes the burden of designing cost functions for each model. Specifying a model p(y | x) automatically determines a cost function log p(y | x).

One recurring theme throughout neural network design is that the gradient of the cost function must be large and predictable enough to serve as a good guide for the learning algorithm. Functions that saturate (become very flat) undermine this objective because they make the gradient become very small. In many cases this happens because the activation functions used to produce the output of the hidden units or the output units saturate. The negative log-likelihood helps to
negativ to negativee loglikelihoo loglikelihood d helps this happ ens b ecause the activ ation functions used to pro duce the output of the avoid this problem for many models. Man Many y output units inv involv olv olvee an exp function hidden units or the output units saturate. The negativ e loglikelihoo d helps to that can saturate when its argumen argumentt is very negative. The log function in the a void this problem for many models. Man y output units inv olv e an function exp negativ negativee loglik loglikeliho eliho elihooo d cost function undo undoes es the exp of some output units. We will logoffunction that can saturate when its argumen t is very negative. in the discuss the interaction b et etwe we ween en the cost function and the The choice output unit in negativ e loglik eliho o d cost function undo es the exp of some output units. We will Sec. 6.2.2 . discuss the interaction b etween the cost function and the choice of output unit in One un unusual usual prop propert ert erty y of the crossentrop crossentropy y cost used to p erform maximum Sec. 6.2.2. lik likeliho eliho elihooo d estimation is that it usually do does es not ha have ve a minimum value when applied One un usual prop ert y of the crossentrop y usedoutput to p erform maximum to the mo models dels commonly used in practice. For cost discrete variables, most likeliho d estimation is that usually do notthey havecannot a minimum value awhen applied mo models delsoare parametrized initsuch a wa way y es that represent probability to zero the mo usedarbitrarily in practice.close Fortodiscrete output variables, most of or dels one,commonly but can come doing so. Logistic regression mo dels are parametrized in such a wa y that they cannot represent a probability is an example of such a mo model. del. For realv realvalued alued output variables, if the mo model del of zero or one, but can come arbitrarily close to doing so. 
Logistic regression is an example of such a model. For real-valued output variables, if the model
can control the density of the output distribution (for example, by learning the variance parameter of a Gaussian output distribution) then it becomes possible to assign extremely high density to the correct training set outputs, resulting in cross-entropy approaching negative infinity. Regularization techniques described in Chapter 7 provide several different ways of modifying the learning problem so that the model cannot reap unlimited reward in this way.

6.2.1.2 Learning Conditional Statistics

Instead of learning a full probability distribution p(y | x; θ) we often want to learn just one conditional statistic of y given x.

For example, we may have a predictor f(x; θ) that we wish to use to predict the mean of y.

If we use a sufficiently powerful neural network, we can think of the neural network as being able to represent any function f from a wide class of functions, with this class being limited only by features such as continuity and boundedness rather than by having a specific parametric form. From this point of view, we
From this class pointofoffunctions, view, we with this class b eing limited only by features suc h as contin uit y and b oundedness can view the cost function as being a functional rather than just a function. A rather thanisbay mapping having a from speciﬁc parametric form. Fbrom pointth of functional functions to real num numb ers. this We can thus us view, think we of can viewas the cost function as being a than functional rather thanajust a function. A learning choosing a function rather merely choosing set of parameters. functional is a mapping from functions toe real numb ers. occur We can us think of W e can design our cost functional to hav have its minimum at th some sp speciﬁc eciﬁc learning as rather thandesign merely choosing a set of parameters. function wechoosing desire. Faorfunction example, we can the cost functional to hav havee its W e can design our cost functional to hav e its minimum occur at some spen eciﬁc x. minim minimum um lie on the function that maps x to the exp expected ected value of y giv given functionanwoptimization e desire. Forproblem example, weresp canect design the costrequires functional to have its Solving with respect to a function a mathematical x to y given to x. minim um lie on theoffunction that mapsed the exp ected alue to tool ol called calculus variations variations, , describ described in Sec. 19.4.2 . Itvis notofnecessary Solving an optimization with ect to a function requires a mathematical understand calculus of problem variations to resp understand the conten content t of this chapter. A Att to ol called c alculus of variations , describ ed in Sec. 19.4.2 . It is not necessary the moment, it is only necessary to understand that calculus of variations ma may y btoe understand calculus of variations to understand the content of this chapter. At used to derive the following two results. 
Our first result derived using calculus of variations is that solving the optimization problem

    f* = arg min_f E_{x,y∼p_data} ‖y − f(x)‖²    (6.14)

yields

    f*(x) = E_{y∼p_data(y|x)}[y],    (6.15)

so long as this function lies within the class we optimize over. In other words, if we could train on infinitely many samples from the true data-generating distribution, minimizing the mean squared error cost function gives a function that predicts the mean of y for each value of x.
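A small numerical illustration of this result (a sketch assuming NumPy; the sample values are made up): restricted to constant predictions c, the expected squared error E(y − c)² is minimized at the sample mean.

```python
import numpy as np

# Samples standing in for draws of y from p_data(y | x) at a fixed x.
y = np.array([0.0, 1.0, 1.0, 5.0])

# Mean squared error of a constant prediction c over the samples.
def mse(c):
    return np.mean((y - c) ** 2)

candidates = np.linspace(-2, 7, 901)       # grid of constant predictions
best = candidates[np.argmin([mse(c) for c in candidates])]

print(best, y.mean())   # the grid minimizer coincides with the sample mean
```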
Different cost functions give different statistics. A second result derived using calculus of variations is that

    f* = arg min_f E_{x,y∼p_data} ‖y − f(x)‖₁    (6.16)

yields a function that predicts the median value of y for each x, so long as such a function may be described by the family of functions we optimize over. This cost function is commonly called mean absolute error.

Unfortunately, mean squared error and mean absolute error often lead to poor results when used with gradient-based optimization. Some output units that saturate produce very small gradients when combined with these cost functions. This is one reason that the cross-entropy cost function is more popular than mean squared error or mean absolute error, even when it is not necessary to estimate an entire distribution p(y | x).

6.2.2 Output Units

The choice of cost function is tightly coupled with the choice of output unit. Most of the time, we simply use the cross-entropy between the data distribution and the model distribution.
The choice of how to represent the output then determines the form of the cross-entropy function.

Any kind of neural network unit that may be used as an output can also be used as a hidden unit. Here, we focus on the use of these units as outputs of the model, but in principle they can be used internally as well. We revisit these units with additional detail about their use as hidden units in Sec. 6.3.

Throughout this section, we suppose that the feedforward network provides a set of hidden features defined by h = f(x; θ). The role of the output layer is then to provide some additional transformation from the features to complete the task that the network must perform.

6.2.2.1 Linear Units for Gaussian Output Distributions

One simple kind of output unit is an output unit based on an affine transformation with no nonlinearity. These are often just called linear units.

Given features h, a layer of linear output units produces a vector ŷ = Wᵀh + b.
Linear output layers are often used to produce the mean of a conditional Gaussian distribution:

    p(y | x) = N(y; ŷ, I).    (6.17)
Maximizing the log-likelihood is then equivalent to minimizing the mean squared error.

The maximum likelihood framework makes it straightforward to learn the covariance of the Gaussian too, or to make the covariance of the Gaussian be a function of the input. However, the covariance must be constrained to be a positive definite matrix for all inputs. It is difficult to satisfy such constraints with a linear output layer, so typically other output units are used to parametrize the covariance. Approaches to modeling the covariance are described shortly, in Sec. 6.2.2.4.

Because linear units do not saturate, they pose little difficulty for gradient-based optimization algorithms and may be used with a wide variety of optimization algorithms.

6.2.2.2 Sigmoid Units for Bernoulli Output Distributions

Many tasks require predicting the value of a binary variable y. Classification problems with two classes can be cast in this form.
The maximum-likelihood approach is to define a Bernoulli distribution over y conditioned on x.

A Bernoulli distribution is defined by just a single number. The neural net needs to predict only P(y = 1 | x). For this number to be a valid probability, it must lie in the interval [0, 1].

Satisfying this constraint requires some careful design effort. Suppose we were to use a linear unit, and threshold its value to obtain a valid probability:

    P(y = 1 | x) = max{0, min{1, w⊤h + b}}.    (6.18)
This would indeed define a valid conditional distribution, but we would not be able to train it very effectively with gradient descent. Any time that w⊤h + b strayed outside the unit interval, the gradient of the output of the model with respect to its parameters would be 0. A gradient of 0 is typically problematic because the learning algorithm no longer has a guide for how to improve the corresponding parameters.

Instead, it is better to use a different approach that ensures there is always a strong gradient whenever the model has the wrong answer. This approach is based on using sigmoid output units combined with maximum likelihood.

A sigmoid output unit is defined by

    ŷ = σ(w⊤h + b)    (6.19)
where σ is the logistic sigmoid function described in Sec. 3.10.

We can think of the sigmoid output unit as having two components. First, it uses a linear layer to compute z = w⊤h + b. Next, it uses the sigmoid activation function to convert z into a probability.

We omit the dependence on x for the moment to discuss how to define a probability distribution over y using the value z. The sigmoid can be motivated by constructing an unnormalized probability distribution P̃(y), which does not sum to 1. We can then divide by an appropriate constant to obtain a valid probability distribution. If we begin with the assumption that the unnormalized log probabilities are linear in y and z, we can exponentiate to obtain the unnormalized probabilities.
We then normalize to see that this yields a Bernoulli distribution controlled by a sigmoidal transformation of z:

    log P̃(y) = yz,    (6.20)
    P̃(y) = exp(yz),    (6.21)
    P(y) = exp(yz) / Σ_{y′=0}^{1} exp(y′z),    (6.22)
    P(y) = σ((2y − 1)z).    (6.23)

Probability distributions based on exponentiation and normalization are common throughout the statistical modeling literature. The z variable defining such a distribution over binary variables is called a logit.

This approach to predicting the probabilities in log-space is natural to use with maximum likelihood learning. Because the cost function used with maximum likelihood is −log P(y | x), the log in the cost function undoes the exp of the sigmoid. Without this effect, the saturation of the sigmoid could prevent gradient-based learning from making good progress. The loss function for maximum likelihood learning of a Bernoulli parametrized by a sigmoid is

    J(θ) = −log P(y | x)    (6.24)
         = −log σ((2y − 1)z)    (6.25)
         = ζ((1 − 2y)z).    (6.26)
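The identities in Eqs. 6.22–6.26 can be verified numerically. A minimal sketch in Python (the logit value z is arbitrary, and sigmoid and softplus are written out directly rather than taken from a library):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softplus(z):
    # zeta(z) = log(1 + exp(z))
    return math.log1p(math.exp(z))

z = 1.7  # arbitrary logit value for the check
for y in (0, 1):
    # Eq. 6.22: normalize the unnormalized probabilities exp(yz).
    p_normalized = math.exp(y * z) / (math.exp(0 * z) + math.exp(1 * z))
    # Eq. 6.23: the same quantity via the sigmoid.
    p_sigmoid = sigmoid((2 * y - 1) * z)
    assert abs(p_normalized - p_sigmoid) < 1e-12
    # Eqs. 6.25-6.26: the negative log-likelihood equals softplus((1 - 2y)z).
    assert abs(-math.log(p_sigmoid) - softplus((1 - 2 * y) * z)) < 1e-12
```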
This derivation makes use of some properties from Sec. 3.10. By rewriting the loss in terms of the softplus function, we can see that it saturates only when (1 − 2y)z is very negative. Saturation thus occurs only when the model already has the right answer: when y = 1 and z is very positive, or y = 0 and z is very negative. When z has the wrong sign, the argument to the softplus function,
(1 − 2y)z, may be simplified to |z|. As |z| becomes large while z has the wrong sign, the softplus function asymptotes toward simply returning its argument |z|. The derivative with respect to z asymptotes to sign(z), so, in the limit of extremely incorrect z, the softplus function does not shrink the gradient at all. This property is very useful because it means that gradient-based learning can act to quickly correct a mistaken z.

When we use other loss functions, such as mean squared error, the loss can saturate anytime σ(z) saturates. The sigmoid activation function saturates to 0 when z becomes very negative and saturates to 1 when z becomes very positive. The gradient can shrink too small to be useful for learning whenever this happens, whether the model has the correct answer or the incorrect answer. For this reason, maximum likelihood is almost always the preferred approach to training sigmoid output units.
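The contrast between the two losses can be checked directly by comparing their gradients with respect to z at a confidently wrong prediction. A minimal sketch (the value z = −20 is an arbitrary illustration, and the derivatives are written out by hand):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll_grad(z, y):
    # d/dz of softplus((1 - 2y) z) = (1 - 2y) * sigmoid((1 - 2y) z),
    # which equals sigmoid(z) - y.
    return (1 - 2 * y) * sigmoid((1 - 2 * y) * z)

def mse_grad(z, y):
    # d/dz of (sigmoid(z) - y)^2 = 2 (sigmoid(z) - y) sigmoid(z) (1 - sigmoid(z))
    s = sigmoid(z)
    return 2.0 * (s - y) * s * (1.0 - s)

# Confidently wrong prediction: y = 1 but z is very negative.
z, y = -20.0, 1
print(nll_grad(z, y))  # close to sign(z) = -1: learning can still proceed
print(mse_grad(z, y))  # vanishingly small: learning stalls
```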
Analytically, the logarithm of the sigmoid is always defined and finite, because the sigmoid returns values restricted to the open interval (0, 1), rather than using the entire closed interval of valid probabilities [0, 1]. In software implementations, to avoid numerical problems, it is best to write the negative log-likelihood as a function of z, rather than as a function of ŷ = σ(z). If the sigmoid function underflows to zero, then taking the logarithm of ŷ yields negative infinity.

Any time we wish to represent a probability distribution over a discrete variable with n possible values, we may use the softmax function. This can be seen as a generalization of the sigmoid function which was used to represent a probability distribution over a binary variable.

Softmax functions are most often used as the output of a classifier, to represent
the probability distribution over n different classes. More rarely, softmax functions can be used inside the model itself, if we wish the model to choose between one of n options for some internal variable.

In the case of binary variables, we wished to produce a single number

    ŷ = P(y = 1 | x).    (6.27)

Because this number needed to lie between 0 and 1, and because we wanted the logarithm of the number to be well-behaved for gradient-based optimization of the log-likelihood, we chose to instead predict a number z = log P̃(y = 1 | x). Exponentiating and normalizing gave us a Bernoulli distribution controlled by the sigmoid function.
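The numerical point made above, that the loss should be written as a function of z rather than of ŷ = σ(z), can be demonstrated with an extreme logit. A sketch in NumPy (the value −800 is chosen only to force underflow in 64-bit floats):

```python
import numpy as np

z, y = -800.0, 1

# Naive route: compute yhat = sigmoid(z) first, then take its log.
yhat = np.exp(z) / (1.0 + np.exp(z))   # exp(-800) underflows to 0.0
with np.errstate(divide="ignore"):
    naive_loss = -np.log(yhat)          # -log(0) is +inf: the loss blows up
print(yhat, naive_loss)                 # 0.0 inf

# Stable route: write the loss directly in terms of z (Eq. 6.26),
# using the standard stable softplus max(a, 0) + log1p(exp(-|a|)).
def softplus(a):
    return np.maximum(a, 0.0) + np.log1p(np.exp(-np.abs(a)))

stable_loss = softplus((1 - 2 * y) * z)
print(stable_loss)                      # 800.0, finite and exact
```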
To generalize to the case of a discrete variable with n values, we now need to produce a vector ŷ, with ŷ_i = P(y = i | x). We require not only that each element of ŷ_i be between 0 and 1, but also that the entire vector sums to 1 so that it represents a valid probability distribution. The same approach that worked for the Bernoulli distribution generalizes to the multinoulli distribution. First, a linear layer predicts unnormalized log probabilities:

    z = W⊤h + b,    (6.28)

where z_i = log P̃(y = i | x). The softmax function can then exponentiate and normalize z to obtain the desired ŷ. Formally, the softmax function is given by

    softmax(z)_i = exp(z_i) / Σ_j exp(z_j).    (6.29)

As with the logistic sigmoid, the use of the exp function works very well when training the softmax to output a target value y using maximum log-likelihood. In this case, we wish to maximize log P(y = i; z) = log softmax(z)_i. Defining the
softmax in terms of exp is natural because the log in the log-likelihood can undo the exp of the softmax:

    log softmax(z)_i = z_i − log Σ_j exp(z_j).    (6.30)
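Eq. 6.30 is easy to check numerically, along with the approximation of the second term by max_j z_j discussed next. A minimal sketch in NumPy (the logit values are made up):

```python
import numpy as np

z = np.array([2.0, -1.0, 0.5])          # hypothetical logits
lse = np.log(np.sum(np.exp(z)))         # the log-sum-exp term of Eq. 6.30

# Eq. 6.30: log softmax(z)_i = z_i - log sum_j exp(z_j)
log_probs = z - lse
assert np.allclose(log_probs, np.log(np.exp(z) / np.sum(np.exp(z))))

# When one logit dominates the rest, the second term is roughly max_j z_j.
z_peaked = np.array([10.0, -5.0, -7.0])
print(np.log(np.sum(np.exp(z_peaked))))  # very close to max = 10
```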
The first term of Eq. 6.30 shows that the input z_i always has a direct contribution to the cost function. Because this term cannot saturate, we know that learning can proceed, even if the contribution of z_i to the second term of Eq. 6.30 becomes very small. When maximizing the log-likelihood, the first term encourages z_i to be pushed up, while the second term encourages all of z to be pushed down.

To gain some intuition for the second term, log Σ_j exp(z_j), observe that this term can be roughly approximated by max_j z_j. This approximation is based on the idea that exp(z_k) is insignificant for any z_k that is noticeably less than max_j z_j. The intuition we can gain from this approximation is that the negative log-likelihood cost function always strongly penalizes the most active incorrect prediction. If the correct answer already has the largest input to the softmax, then the −z_i term
This example j exp correct answer already has the input tocost, the softmax, the z term will then contribute little to the largest ov overall erall training whic which h willthen b e dominated by max log exp ) not = z terms and the will roughly cancel. This−example other examples that( zare yetzcorrectly classiﬁed. will then contribute little ≈ to the overall training cost, which will b e dominated by So far we hav havee discussed only a single example. Ov Overall, erall, unregularized maximum other examples that are not yet correctly classiﬁed. lik likeliho eliho elihooo d will drive the mo model del to learn parameters that drive the softmax to predict So far we have discussed only a single example. Overall, unregularized maximum P 184 likeliho o d will drive the mo del to learn parameters that drive the softmax to predict
the fraction of counts of each outcome observed in the training set:

    softmax(z(x; θ))_i ≈ Σ_{j=1}^{m} 1_{y^(j)=i, x^(j)=x} / Σ_{j=1}^{m} 1_{x^(j)=x}.    (6.31)

Because maximum likelihood is a consistent estimator, this is guaranteed to happen so long as the model family is capable of representing the training distribution. In practice, limited model capacity and imperfect optimization will mean that the model is only able to approximate these fractions.

Many objective functions other than the log-likelihood do not work as well with the softmax function. Specifically, objective functions that do not use a log to undo the exp of the softmax fail to learn when the argument to the exp becomes very negative, causing the gradient to vanish. In particular, squared error is a poor loss function for softmax units, and can fail to train the model to change its output, even when the model makes highly confident incorrect predictions (Bridle, 1990). To understand why these other loss functions can fail, we need to examine the softmax function itself.
To function understand wh why y these otherand losscan functions can fail, examine output, even function when theitself. mo del makes highly conﬁdent incorrect predictions (Bridle, the softmax 1990). To understand why these other loss functions can fail, we need to examine Lik Likee the sigmoid, the softmax activ activation ation can saturate. The sigmoid function has the softmax function itself. a single output that saturates when its input is extremely negative or extremely Likee.the activation canare saturate. The sigmoid function has p ositiv ositive. Insigmoid, the casethe ofsoftmax the softmax, there multiple output values. These a single output that saturates when its input is extremely negative or extremely output values can saturate when the diﬀerences betw between een input values b ecome p ositiv e. In the case of the softmax, there are multiple values. These extreme. When the softmax saturates, many cost functionsoutput based on the softmax output values unless can saturate theinv diﬀerences betweenactiv input values b ecome also saturate, they arewhen able to invert ert the saturating activating ating function. extreme. When the softmax saturates, many cost functions based on the softmax o see thatunless the softmax function responds onds tosaturating the diﬀerence et etw ween its inputs, alsoTsaturate, they are able toresp invert the activbating function. observ observee that the softmax output is inv invarian arian ariantt to adding the same scalar to all of its T o see that the softmax function resp onds to the diﬀerence b etween its inputs, inputs: observe that the softmax output is zinv arian t to adding of its softmax( )= softmax( z + c)the . same scalar to all(6.32) inputs: Using this prop propert ert erty y, we can derive za)numerically softmax( = softmax(zstable + c). variant of the softmax: (6.32) Using this prop erty, wesoftmax( can derive a softmax( numerically of the softmax: z) = z −stable max ziv)ariant . (6.33) i
The reformulated version allows us to evaluate softmax with only small numerical errors even when z contains extremely large or extremely negative numbers. Examining the numerically stable variant, we see that the softmax function is driven by the amount that its arguments deviate from max_i z_i.

An output softmax(z)_i saturates to 1 when the corresponding input is maximal (z_i = max_i z_i) and z_i is much greater than all of the other inputs. The output softmax(z)_i can also saturate to 0 when z_i is not maximal and the maximum is much greater. This is a generalization of the way that sigmoid units saturate, and
can cause similar difficulties for learning if the loss function is not designed to compensate for it.

The argument to the softmax function can be produced in two different ways. The most common is simply to have an earlier layer of the neural network output every element of z, as described above using the linear layer z = W⊤h + b. While straightforward, this approach actually overparametrizes the distribution. The constraint that the n outputs must sum to 1 means that only n − 1 parameters are necessary; the probability of the n-th value may be obtained by subtracting the first n − 1 probabilities from 1. We can thus impose a requirement that one element of z be fixed. For example, we can require that z_n = 0. Indeed, this is exactly what the sigmoid unit does. Defining P(y = 1 | x) = σ(z) is equivalent to defining P(y = 1 | x) = softmax(z)_1 with a two-dimensional z and z_1 = 0. Both the
n − 1 argument and the n argument approaches to the softmax can describe the same set of probability distributions, but have different learning dynamics. In practice, there is rarely much difference between using the overparametrized version or the restricted version, and it is simpler to implement the overparametrized version.

From a neuroscientific point of view, it is interesting to think of the softmax as a way to create a form of competition between the units that participate in it: the softmax outputs always sum to 1 so an increase in the value of one unit necessarily corresponds to a decrease in the value of others. This is analogous to the lateral inhibition that is believed to exist between nearby neurons in the cortex. At the
At the magnitude) it b ecomes a form of winnertakeal winnertakealll (one of the outputs is nearly 1 extreme (when the diﬀerence b et ween the maximal a and the others is large in and the others are nearly 0). magnitude) it b ecomes a form of winnertakeal l (one of the outputs is nearly 1 The name “softmax” can b e somewhat confusing. The function is more closely and the others are nearly 0). related to the argmax function than the max function. The term “soft” derives The can b e somewhat Theand function is more closely from thename fact “softmax” that the softmax function confusing. is con continuous tinuous diﬀerentiable. The related to the argmax function than the max function. The term “soft” derives argmax function, with its result represented as a onehot vector, is not con continuous tinuous from the fact that the softmax function is con tinuous and diﬀerentiable. or diﬀerentiable. The softmax function thus pro provides vides a “softened” version ofThe the argmax function, with its result represented a onehotfunction vector, isis not continuous softmax softmax( (z ) > z. argmax. The corresp corresponding onding soft version of theasmaximum or would diﬀerentiable. softmax provides a “softened” versionbut of the the It p erhapsThe b e better to function call the thus softmax function “softargmax,” softmax ( z ) z. argmax. The corresp onding softcon version of the maximum function is curren current t name is an en entrenched trenched conven ven vention. tion. It would p erhaps b e better to call the softmax function “softargmax,” but the current name is an entrenched convention. The linear, sigmoid, and softmax output units describ described ed ab abo ove are the most common. Neural net networks works can generalize to almost any kind of output lay layer er that The linear, sigmoid, and softmax output units describ ed ab o ve are the most we wish. The principle of maximum likelihoo likelihood d pro provides vides a guide for how to design common. 
Neural networks can generalize to almost any kind of output layer that we wish. The principle of maximum likelihood provides a guide for how to design
a good cost function for nearly any kind of output layer.

In general, if we define a conditional distribution p(y | x; θ), the principle of maximum likelihood suggests we use −log p(y | x; θ) as our cost function.

In general, we can think of the neural network as representing a function f(x; θ). The outputs of this function are not direct predictions of the value y. Instead, f(x; θ) = ω provides the parameters for a distribution over y. Our loss function can then be interpreted as −log p(y; ω(x)).

For example, we may wish to learn the variance of a conditional Gaussian for y, given x. In the simple case, where the variance σ² is a constant, there is a closed form expression because the maximum likelihood estimator of variance is simply the empirical mean of the squared difference between observations y and their expected value. A computationally more expensive approach that does not require writing special-case code is to simply include the variance as one of the
properties of the distribution p(y | x) that is controlled by ω = f(x; θ). The negative log-likelihood −log p(y; ω(x)) will then provide a cost function with the appropriate terms necessary to make our optimization procedure incrementally learn the variance. In the simple case where the standard deviation does not depend on the input, we can make a new parameter in the network that is copied directly into ω. This new parameter might be σ itself or could be a parameter v representing σ², or it could be a parameter β representing 1/σ², depending on how we choose to parametrize the distribution. We may wish our model to predict a different amount of variance in y for different values of x. This is called a heteroscedastic model. In the heteroscedastic case, we simply make the specification of the variance be
A typical way to do this is to formulate the Gaussian distribution using precision, rather than variance, as described in Eq. 3.22. In the multivariate case it is most common to use a diagonal precision matrix

diag(β).     (6.34)

This formulation works well with gradient descent because the formula for the log-likelihood of the Gaussian distribution parametrized by β involves only multiplication by β_i and addition of log β_i. The gradient of multiplication, addition, and logarithm operations is well-behaved. By comparison, if we parametrized the output in terms of variance, we would need to use division. The division function becomes arbitrarily steep near zero. While large gradients can help learning, arbitrarily large gradients usually result in instability. If we parametrized the output in terms of standard deviation, the log-likelihood would still involve division, and would also involve squaring.
The gradient through the squaring operation can vanish near zero, making it difficult to learn parameters that are squared.
CHAPTER 6. DEEP FEEDFORWARD NETWORKS
Regardless of whether we use standard deviation, variance, or precision, we must ensure that the covariance matrix of the Gaussian is positive definite. Because the eigenvalues of the precision matrix are the reciprocals of the eigenvalues of the covariance matrix, this is equivalent to ensuring that the precision matrix is positive definite. If we use a diagonal matrix, or a scalar times the diagonal matrix, then the only condition we need to enforce on the output of the model is positivity. If we suppose that a is the raw activation of the model used to determine the diagonal precision, we can use the softplus function to obtain a positive precision vector: β = ζ(a). This same strategy applies equally if using variance or standard deviation rather than precision, or if using a scalar times identity rather than a diagonal matrix.

It is rare to learn a covariance or precision matrix with richer structure than diagonal.
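The precision-based scheme above can be sketched in a few lines of NumPy. The function names and the per-dimension diagonal treatment are illustrative assumptions of this sketch, not the book's notation:

```python
import numpy as np

def softplus(a):
    # zeta(a) = log(1 + exp(a)); always positive, computed stably
    return np.logaddexp(0.0, a)

def gaussian_nll_precision(y, mu, raw_a):
    """Hypothetical sketch: negative log-likelihood of a diagonal Gaussian
    whose precision beta = softplus(raw_a) comes from an unconstrained raw
    activation, so positivity is enforced automatically."""
    beta = softplus(raw_a)
    # Involves only multiplication by beta and addition of log beta:
    # no division, and no squaring of learned parameters.
    return 0.5 * np.sum(beta * (y - mu) ** 2 - np.log(beta) + np.log(2.0 * np.pi))
```

Because softplus is strictly positive everywhere, no constraint needs to be imposed on the network's raw output.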
If the covariance is full and conditional, then a parametrization must be chosen that guarantees positive-definiteness of the predicted covariance matrix. This can be achieved by writing Σ(x) = B(x)B⊤(x), where B is an unconstrained square matrix. One practical issue if the matrix is full rank is that computing the likelihood is expensive, with a d × d matrix requiring O(d³) computation for the determinant and inverse of Σ(x) (or equivalently, and more commonly done, its eigendecomposition or that of B(x)).

We often want to perform multimodal regression, that is, to predict real values that come from a conditional distribution p(y | x) that can have several different peaks in y space for the same value of x.
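A minimal sketch of this parametrization follows; the small jitter term is my own addition to keep Σ strictly positive definite even when B happens to be rank-deficient, and is not part of the formula above:

```python
import numpy as np

def covariance_from_unconstrained(B, eps=1e-6):
    """Sigma = B B^T (+ eps * I): positive definite for any square matrix B.
    The eps jitter is an assumption for numerical safety."""
    return B @ B.T + eps * np.eye(B.shape[0])

# Evaluating the likelihood with a full d x d covariance costs O(d^3),
# e.g. via a Cholesky factorization or eigendecomposition of Sigma.
```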
In this case, a Gaussian mixture is a natural representation for the output (Jacobs et al., 1991; Bishop, 1994). Neural networks with Gaussian mixtures as their output are often called mixture density networks. A Gaussian mixture output with n components is defined by the conditional probability distribution
p(y | x) = Σ_{i=1}^n p(c = i | x) N(y; μ^(i)(x), Σ^(i)(x)).     (6.35)

The neural network must have three outputs: a vector defining p(c = i | x), a matrix providing μ^(i)(x) for all i, and a tensor providing Σ^(i)(x) for all i. These outputs must satisfy different constraints:

1. Mixture components p(c = i | x): these form a multinoulli distribution over the n different components associated with latent variable c (we consider c to be latent because we do not observe it in the data: given input x and target y, it is not possible to know with certainty which Gaussian component was responsible for y, but we can imagine that y was generated by picking one of them, and make that unobserved choice a random variable), and can
typically be obtained by a softmax over an n-dimensional vector, to guarantee that these outputs are positive and sum to 1.

2. Means μ^(i)(x): these indicate the center or mean associated with the i-th Gaussian component, and are unconstrained (typically with no nonlinearity at all for these output units). If y is a d-vector, then the network must output an n × d matrix containing all n of these d-dimensional vectors. Learning these means with maximum likelihood is slightly more complicated than learning the means of a distribution with only one output mode. We only want to update the mean for the component that actually produced the observation. In practice, we do not know which component produced each observation.
The expression for the negative log-likelihood naturally weights each example's contribution to the loss for each component by the probability that the component produced the example.

3. Covariances Σ^(i)(x): these specify the covariance matrix for each component i. As when learning a single Gaussian component, we typically use a diagonal matrix to avoid needing to compute determinants. As with learning the means of the mixture, maximum likelihood is complicated by needing to assign partial responsibility for each point to each mixture component. Gradient descent will automatically follow the correct process if given the correct specification of the negative log-likelihood under the mixture model.
It has been reported that gradient-based optimization of conditional Gaussian mixtures (on the output of neural networks) can be unreliable, in part because one gets divisions (by the variance) which can be numerically unstable (when some variance gets to be small for a particular example, yielding very large gradients). One solution is to clip gradients (see Sec. 10.11.1), while another is to scale the gradients heuristically (Murray and Larochelle, 2014).

Gaussian mixture outputs are particularly effective in generative models of speech (Schuster, 1999) or movements of physical objects (Graves, 2013). The mixture density strategy gives a way for the network to represent multiple output modes and to control the variance of its output, which is crucial for obtaining a high degree of quality in these real-valued domains.
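A standard way to sidestep the divisions that cause this instability is to evaluate the mixture negative log-likelihood in log space, parametrizing each standard deviation through its logarithm and combining components with the log-sum-exp trick. The one-dimensional sketch below illustrates the idea under assumed array shapes; it is not the implementation used in the cited work:

```python
import numpy as np

def mdn_nll(y, logits, mu, log_sigma):
    """Negative log-likelihood of a 1-D Gaussian mixture (cf. Eq. 6.35),
    computed stably in log space. logits, mu, log_sigma: shape (n,) arrays of
    unnormalized mixture weights, means, and log standard deviations."""
    log_pi = logits - np.logaddexp.reduce(logits)   # log softmax: log p(c=i|x)
    log_comp = (-0.5 * ((y - mu) / np.exp(log_sigma)) ** 2
                - log_sigma - 0.5 * np.log(2.0 * np.pi))
    # -log sum_i p(c=i|x) N(y; mu_i, sigma_i^2), via log-sum-exp
    return -np.logaddexp.reduce(log_pi + log_comp)
```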
An example of a mixture density network is shown in Fig. 6.4.

In general, we may wish to continue to model larger vectors y containing more variables, and to impose richer and richer structures on these output variables. For example, we may wish for our neural network to output a sequence of characters that forms a sentence. In these cases, we may continue to use the principle of maximum likelihood applied to our model p(y; ω(x)), but the model we use
Figure 6.4: Samples drawn from a neural network with a mixture density output layer. The input x is sampled from a uniform distribution and the output y is sampled from p_model(y | x). The neural network is able to learn nonlinear mappings from the input to the parameters of the output distribution. These parameters include the probabilities governing which of three mixture components will generate the output as well as the parameters for each mixture component. Each mixture component is Gaussian with predicted mean and variance. All of these aspects of the output distribution are able to vary with respect to the input x, and to do so in nonlinear ways.
to describe y becomes complex enough to be beyond the scope of this chapter. Chapter 10 describes how to use recurrent neural networks to define such models over sequences, and Part III describes advanced techniques for modeling arbitrary probability distributions.
6.3 Hidden Units
So far we have focused our discussion on design choices for neural networks that are common to most parametric machine learning models trained with gradient-based optimization. Now we turn to an issue that is unique to feedforward neural networks: how to choose the type of hidden unit to use in the hidden layers of the model.

The design of hidden units is an extremely active area of research and does not yet have many definitive guiding theoretical principles.

Rectified linear units are an excellent default choice of hidden unit. Many other types of hidden units are available. It can be difficult to determine when to use which kind (though rectified linear units are usually an acceptable choice). We
describe here some of the basic intuitions motivating each type of hidden unit. These intuitions can be used to suggest when to try out each of these units. It is usually impossible to predict in advance which will work best. The design process consists of trial and error, intuiting that a kind of hidden unit may work well, and then training a network with that kind of hidden unit and evaluating its performance on a validation set.

Some of the hidden units included in this list are not actually differentiable at all input points. For example, the rectified linear function g(z) = max{0, z} is not differentiable at z = 0. This may seem like it invalidates g for use with a gradient-based learning algorithm. In practice, gradient descent still performs well enough for these models to be used for machine learning tasks.
This is in part because neural network training algorithms do not usually arrive at a local minimum of the cost function, but instead merely reduce its value significantly, as shown in Fig. 4.3. These ideas will be described further in Chapter 8. Because we do not expect training to actually reach a point where the gradient is 0, it is acceptable for the minima of the cost function to correspond to points with undefined gradient.

Hidden units that are not differentiable are usually nondifferentiable at only a small number of points. In general, a function g(z) has a left derivative defined by the slope of the function immediately to the left of z and a right derivative defined by the slope of the function immediately to the right of z. A function is differentiable at z only if both the left derivative and the right derivative are defined and equal to each other.
The functions used in the context of neural networks usually have defined left derivatives and defined right derivatives. In the case of g(z) = max{0, z}, the left derivative at z = 0 is 0 and the right derivative is 1. Software implementations of neural network training usually return one of the one-sided derivatives rather than reporting that the derivative is undefined or raising an error. This may be heuristically justified by observing that gradient-based optimization on a digital computer is subject to numerical error anyway. When a function is asked to evaluate g(0), it is very unlikely that the underlying value truly was 0. Instead, it was likely to be some small value that was rounded to 0. In some contexts, more theoretically pleasing justifications are available, but these usually do not apply to neural network training. The important point is that in practice one can safely disregard the nondifferentiability of the hidden unit activation functions described below.
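For instance, a typical software convention (a sketch of the idea, not any particular framework's code) defines the derivative of the rectified linear function at z = 0 to be the left derivative, 0:

```python
import numpy as np

def relu_grad(z):
    # One-sided convention: at z == 0 the derivative is taken to be the
    # left derivative, 0, rather than reported as undefined.
    return (z > 0).astype(float)
```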
Unless indicated otherwise, most hidden units can be described as accepting a vector of inputs x, computing an affine transformation z = W⊤x + b, and then applying an elementwise nonlinear function g(z). Most hidden units are distinguished from each other only by the choice of the form of the activation function g(z).
Rectified linear units use the activation function g(z) = max{0, z}.

Rectified linear units are easy to optimize because they are so similar to linear units. The only difference between a linear unit and a rectified linear unit is that a rectified linear unit outputs zero across half its domain. This makes the derivatives through a rectified linear unit remain large whenever the unit is active. The gradients are not only large but also consistent. The second derivative of the rectifying operation is 0 almost everywhere, and the derivative of the rectifying operation is 1 everywhere that the unit is active. This means that the gradient direction is far more useful for learning than it would be with activation functions that introduce second-order effects.

Rectified linear units are typically used on top of an affine transformation:

h = g(W⊤x + b).     (6.36)
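Eq. 6.36 can be sketched directly in NumPy. The weight scale and seed below are arbitrary illustrative choices; the small positive bias follows the initialization practice discussed in the text:

```python
import numpy as np

def relu_layer(x, W, b):
    # h = g(W^T x + b) with g(z) = max{0, z}, as in Eq. 6.36
    return np.maximum(0.0, W.T @ x + b)

# Small random weights plus a small positive bias (0.1) make it likely that
# every unit is initially active, so derivatives can pass through.
rng = np.random.default_rng(0)          # seed: arbitrary choice
W = 0.01 * rng.standard_normal((3, 2))  # illustrative scale, not prescribed
b = np.full(2, 0.1)
h = relu_layer(np.zeros(3), W, b)       # with x = 0, every unit outputs 0.1
```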
When initializing the parameters of the affine transformation, it can be a good practice to set all elements of b to a small, positive value, such as 0.1. This makes it very likely that the rectified linear units will be initially active for most inputs in the training set and allow the derivatives to pass through.

Several generalizations of rectified linear units exist. Most of these generalizations perform comparably to rectified linear units and occasionally perform better.

One drawback to rectified linear units is that they cannot learn via gradient-based methods on examples for which their activation is zero. A variety of generalizations of rectified linear units guarantee that they receive gradient everywhere.

Three generalizations of rectified linear units are based on using a nonzero slope α_i when z_i < 0: h_i = g(z, α)_i = max(0, z_i) + α_i min(0, z_i).
Absolute value rectification fixes α_i = −1 to obtain g(z) = |z|. It is used for object recognition from images (Jarrett et al., 2009), where it makes sense to seek features that are invariant under a polarity reversal of the input illumination. Other generalizations of rectified linear units are more broadly applicable. A leaky ReLU (Maas et al., 2013) fixes α_i to a small value like 0.01, while a parametric ReLU or PReLU treats α_i as a learnable parameter (He et al., 2015).

Maxout units (Goodfellow et al., 2013a) generalize rectified linear units further. Instead of applying an elementwise function g(z), maxout units divide z into groups of k values. Each maxout unit then outputs the maximum element of one
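The three variants share the single formula h_i = max(0, z_i) + α_i min(0, z_i), so one short sketch covers them; only the choice of α differs:

```python
import numpy as np

def generalized_relu(z, alpha):
    # h_i = max(0, z_i) + alpha_i * min(0, z_i)
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

z = np.array([-2.0, 3.0])
abs_rect = generalized_relu(z, -1.0)  # absolute value rectification: |z| -> [2.0, 3.0]
leaky = generalized_relu(z, 0.01)     # leaky ReLU -> [-0.02, 3.0]
# A PReLU would instead treat alpha as a parameter learned by gradient descent.
```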
of these groups:
g(z)_i = max_{j ∈ G^(i)} z_j     (6.37)

where G^(i) is the set of indices of the inputs for group i, {(i − 1)k + 1, . . . , ik}. This provides a way of learning a piecewise linear function that responds to multiple directions in the input x space.

A maxout unit can learn a piecewise linear, convex function with up to k pieces. Maxout units can thus be seen as learning the activation function itself rather than just the relationship between units. With large enough k, a maxout unit can learn to approximate any convex function with arbitrary fidelity. In particular, a maxout layer with two pieces can learn to implement the same function of the input x as a traditional layer using the rectified linear activation function, absolute value rectification function, or the leaky or parametric ReLU, or can learn to implement a totally different function altogether.
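The maxout computation for a single example can be sketched as follows; the contiguous group layout and the divisibility check are assumptions of this illustration:

```python
import numpy as np

def maxout(z, k):
    """Eq. 6.37 for contiguous groups: split z into groups of k consecutive
    values, G^(i) = {(i-1)k + 1, ..., ik}, and take the max of each group."""
    assert z.size % k == 0, "length of z must be divisible by k"
    return z.reshape(-1, k).max(axis=1)

z = np.array([0.2, -1.0, 3.0, 0.5, 4.0, -2.0])
h = maxout(z, 3)  # groups (0.2, -1.0, 3.0) and (0.5, 4.0, -2.0) -> [3.0, 4.0]
```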
The maxout layer will of course be parametrized differently from any of these other layer types, so the learning dynamics will be different even in the cases where maxout learns to implement the same function of x as one of the other layer types.

Each maxout unit is now parametrized by k weight vectors instead of just one, so maxout units typically need more regularization than rectified linear units. They can work well without regularization if the training set is large and the number of pieces per unit is kept low (Cai et al., 2013).

Maxout units have a few other benefits. In some cases, one can gain some statistical and computational advantages by requiring fewer parameters. Specifically, if the features captured by n different linear filters can be summarized without
losing information by taking the max over each group of k features, then the next Because unit kis times drivenfewer by mw ultiple havee some redunlayer can geteach by with eights.ﬁlters, maxout units hav dancy that helps them to resist a phenomenon called catastr atastrophic ophic for forgetting getting in Because each unit is driven b y m ultiple ﬁlters, maxout units hav e some whic which h neural netw networks orks forget how to p erform tasks that they were trainedredunon in dancy that helps them to resist a phenomenon called c atastr ophic for getting in the past (Go Goo o dfello dfellow w et al. al.,, 2014a). which neural networks forget how to p erform tasks that they were trained on in and all of).these generalizations of them are based on the the Rectiﬁed past (Go olinear dfellounits w et al. , 2014a principle that mo models dels are easier to optimize if their behavior is closer to linear. Rectiﬁed linear units andofall of these of them are optimization based on the This same general principle using lineargeneralizations b ehavior to obtain easier principle that mo dels are easier to optimize if their behavior is closer to linear. also applies in other con contexts texts b esides deep linear netw networks. orks. Recurrent netw networks orks can This principle of using linear b ehavior to obtain easier optimization learnsame from general sequences and pro a sequence of states and outputs. When training produce duce also applies in other contexts b esides deepthrough linear netw orks.time Recurrent networks can them, one needs to propagate information sev several eral steps, which is much learn from andcomputations pro duce a sequence of states and outputs. When btraining easier whensequences some linear (with some directional deriv derivatives atives eing of them, one needs to propagate information through sev eral time steps, which is m magnitude near 1) are inv involv olv olved. ed. 
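As an illustration of Eq. 6.37, a maxout layer can be sketched in a few lines of NumPy. The single shared weight matrix, the layer sizes, and the reshape-based grouping below are illustrative assumptions, not a reference implementation from the text:

```python
import numpy as np

def maxout(x, W, b, k):
    # Compute all m*k affine features, then take the max within each
    # consecutive group of k features (Eq. 6.37).
    z = x @ W + b
    return z.reshape(-1, k).max(axis=1)

rng = np.random.default_rng(0)
n, m, k = 4, 3, 2                      # 3 maxout units, 2 pieces each
W = rng.standard_normal((n, m * k))
b = rng.standard_normal(m * k)
x = rng.standard_normal(n)
print(maxout(x, W, b, k).shape)        # (3,)
```

Setting the second filter of a two-piece group to zero recovers max(z, 0), consistent with the claim that a two-piece maxout layer can implement the same function as a rectified linear layer.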
One of the best-performing recurrent network
CHAPTER 6. DEEP FEEDFORWARD NETWORKS
architectures, the LSTM, propagates information through time via summation—a particular straightforward kind of such linear activation. This is discussed further in Sec. 10.10.

Prior to the introduction of rectified linear units, most neural networks used the logistic sigmoid activation function

g(z) = \sigma(z)        (6.38)

or the hyperbolic tangent activation function

g(z) = \tanh(z).        (6.39)

These activation functions are closely related because \tanh(z) = 2\sigma(2z) - 1.
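The relation \tanh(z) = 2\sigma(2z) - 1 is easy to verify numerically; this quick sketch implements \sigma directly from its definition:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 101)
# tanh(z) = 2*sigmoid(2z) - 1 holds pointwise
print(np.allclose(np.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0))  # True
```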
We have already seen sigmoid units as output units, used to predict the probability that a binary variable is 1. Unlike piecewise linear units, sigmoidal units saturate across most of their domain—they saturate to a high value when z is very positive, saturate to a low value when z is very negative, and are only strongly sensitive to their input when z is near 0. The widespread saturation of sigmoidal units can make gradient-based learning very difficult. For this reason, their use as hidden units in feedforward networks is now discouraged. Their use as output units is compatible with the use of gradient-based learning when an appropriate cost function can undo the saturation of the sigmoid in the output layer.

When a sigmoidal activation function must be used, the hyperbolic tangent activation function typically performs better than the logistic sigmoid. It resembles
the identity function more closely, in the sense that \tanh(0) = 0 while \sigma(0) = \frac{1}{2}. Because tanh is similar to identity near 0, training a deep neural network \hat{y} = w^\top \tanh(U^\top \tanh(V^\top x)) resembles training a linear model \hat{y} = w^\top U^\top V^\top x so long as the activations of the network can be kept small. This makes training the tanh network easier.

Sigmoidal activation functions are more common in settings other than feedforward networks. Recurrent networks, many probabilistic models, and some autoencoders have additional requirements that rule out the use of piecewise linear activation functions and make sigmoidal units more appealing despite the drawbacks of saturation.
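The near-linear behavior of tanh around zero can be observed directly. The small weight scale below is an assumption chosen only to keep every activation close to zero:

```python
import numpy as np

rng = np.random.default_rng(1)
scale = 1e-2                     # keeps all activations near 0
w = scale * rng.standard_normal(3)
U = scale * rng.standard_normal((3, 3))
V = scale * rng.standard_normal((3, 3))
x = rng.standard_normal(3)

deep = w @ np.tanh(U.T @ np.tanh(V.T @ x))   # tanh network
linear = w @ U.T @ V.T @ x                   # corresponding linear model
print(abs(deep - linear))                    # tiny: the two nearly coincide
```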
Many other types of hidden units are possible, but are used less frequently.

In general, a wide variety of differentiable functions perform perfectly well. Many unpublished activation functions perform just as well as the popular ones. To provide a concrete example, the authors tested a feedforward network using h = \cos(Wx + b) on the MNIST dataset and obtained an error rate of less than 1%, which is competitive with results obtained using more conventional activation functions. During research and development of new techniques, it is common to test many different activation functions and find that several variations on standard practice perform comparably. This means that usually new hidden unit types are published only if they are clearly demonstrated to provide a significant improvement.
New hidden unit types that perform roughly comparably to known types are so common as to be uninteresting.

It would be impractical to list all of the hidden unit types that have appeared in the literature. We highlight a few especially useful and distinctive ones.

One possibility is to not have an activation g(z) at all. One can also think of this as using the identity function as the activation function. We have already seen that a linear unit can be useful as the output of a neural network. It may also be used as a hidden unit. If every layer of the neural network consists of only linear transformations, then the network as a whole will be linear. However, it is acceptable for some layers of the neural network to be purely linear. Consider a neural network layer with n inputs and p outputs, h = g(W^\top x + b). We may replace this with two layers, with one layer using weight matrix U and the other using weight matrix V.
If the first layer has no activation function, then we have essentially factored the weight matrix of the original layer based on W. The factored approach is to compute h = g(V^\top U^\top x + b). If U produces q outputs, then U and V together contain only (n + p)q parameters, while W contains np parameters. For small q, this can be a considerable saving in parameters. It comes at the cost of constraining the linear transformation to be low-rank, but these low-rank relationships are often sufficient. Linear hidden units thus offer an effective way of reducing the number of parameters in a network.

Softmax units are another kind of unit that is usually used as an output (as described in Sec. 6.2.2.3) but may sometimes be used as a hidden unit. Softmax units naturally represent a probability distribution over a discrete variable with k possible values, so they may be used as a kind of switch.
These kinds of hidden units are usually only used in more advanced architectures that explicitly learn to manipulate memory, described in Sec. 10.12.
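The parameter accounting for the factored linear layer, (n + p)q versus np, is simple to check, and a linear first layer composed with a second layer does compute a rank-q linear map. The sizes and the row-vector convention below are illustrative assumptions:

```python
import numpy as np

n, p, q = 1000, 1000, 50
dense_params = n * p               # W is n x p
factored_params = (n + p) * q      # U is n x q, V is q x p
print(dense_params, factored_params)   # 1000000 100000

rng = np.random.default_rng(0)
U = rng.standard_normal((n, q))
V = rng.standard_normal((q, p))
x = rng.standard_normal(n)
h_two_layer = (x @ U) @ V          # linear layer followed by a second layer
h_low_rank = x @ (U @ V)           # equivalent rank-q dense layer
print(np.allclose(h_two_layer, h_low_rank))  # True
```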
A few other reasonably common hidden unit types include:

• Radial basis function or RBF unit: h_i = \exp\left( -\frac{1}{\sigma_i^2} \lVert W_{:,i} - x \rVert^2 \right). This function becomes more active as x approaches a template W_{:,i}. Because it saturates to 0 for most x, it can be difficult to optimize.

• Softplus: g(a) = \zeta(a) = \log(1 + e^a). This is a smooth version of the rectifier, introduced by Dugas et al. (2001) for function approximation and by Nair and Hinton (2010) for the conditional distributions of undirected probabilistic models. Glorot et al. (2011a) compared the softplus and rectifier and found better results with the latter. The use of the softplus is generally discouraged. The softplus demonstrates that the performance of hidden unit types can be very counterintuitive—one might expect it to have an advantage over
the rectifier due to being differentiable everywhere or due to saturating less completely, but empirically it does not.

• Hard tanh: this is shaped similarly to the tanh and the rectifier but unlike the latter, it is bounded, g(a) = \max(-1, \min(1, a)). It was introduced by Collobert (2004).

Hidden unit design remains an active area of research and many useful hidden unit types remain to be discovered.
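These units are all one-liners in NumPy. The numerically stable softplus via logaddexp is an implementation choice, not something specified in the text:

```python
import numpy as np

def softplus(a):
    # zeta(a) = log(1 + exp(a)), computed stably for large |a|
    return np.logaddexp(0.0, a)

def hard_tanh(a):
    # Bounded, piecewise linear: max(-1, min(1, a))
    return np.clip(a, -1.0, 1.0)

def rbf(x, template, sigma=1.0):
    # Most active when x is near the template (a column W[:, i])
    return np.exp(-np.sum((template - x) ** 2) / sigma ** 2)

a = np.array([-2.0, 0.0, 2.0])
print(softplus(a))                    # smooth rectifier, ~[0.127, 0.693, 2.127]
print(hard_tanh(a))                   # [-1.  0.  1.]
print(rbf(np.zeros(3), np.zeros(3)))  # 1.0 at the template itself
```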
6.4
Architecture Design
Another key design consideration for neural networks is determining the architecture. The word architecture refers to the overall structure of the network: how many units it should have and how these units should be connected to each other.

Most neural networks are organized into groups of units called layers. Most neural network architectures arrange these layers in a chain structure, with each layer being a function of the layer that preceded it. In this structure, the first layer is given by

h^{(1)} = g^{(1)}\left( W^{(1)\top} x + b^{(1)} \right),        (6.40)

the second layer is given by

h^{(2)} = g^{(2)}\left( W^{(2)\top} h^{(1)} + b^{(2)} \right),        (6.41)

and so on.
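Eqs. 6.40 and 6.41 translate directly into code. The shapes and the choice of a rectified linear g are illustrative assumptions:

```python
import numpy as np

def layer(W, b, h, g):
    # One link of the chain: h' = g(W^T h + b)
    return g(W.T @ h + b)

relu = lambda z: np.maximum(z, 0.0)

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W1, b1 = rng.standard_normal((4, 5)), rng.standard_normal(5)
W2, b2 = rng.standard_normal((5, 2)), rng.standard_normal(2)

h1 = layer(W1, b1, x, relu)     # first layer, Eq. 6.40
h2 = layer(W2, b2, h1, relu)    # second layer, Eq. 6.41
print(h1.shape, h2.shape)       # (5,) (2,)
```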
In these chain-based architectures, the main architectural considerations are to choose the depth of the network and the width of each layer. As we will see, a network with even one hidden layer is sufficient to fit the training set. Deeper networks often are able to use far fewer units per layer and far fewer parameters and often generalize to the test set, but are also often harder to optimize. The ideal network architecture for a task must be found via experimentation guided by monitoring the validation set error.

A linear model, mapping from features to outputs via matrix multiplication, can by definition represent only linear functions. It has the advantage of being easy to train because many loss functions result in convex optimization problems when applied to linear models. Unfortunately, we often want to learn nonlinear functions.

At first glance, we might presume that learning a nonlinear function requires
designing a specialized model family for the kind of nonlinearity we want to learn. Fortunately, feedforward networks with hidden layers provide a universal approximation framework. Specifically, the universal approximation theorem (Hornik et al., 1989; Cybenko, 1989) states that a feedforward network with a linear output layer and at least one hidden layer with any "squashing" activation function (such as the logistic sigmoid activation function) can approximate any Borel measurable function from one finite-dimensional space to another with any desired non-zero amount of error, provided that the network is given enough hidden units. The derivatives of the feedforward network can also approximate the derivatives of the function arbitrarily well (Hornik et al., 1990).
The concept of Borel measurability is beyond the scope of this book; for our purposes it suffices to say that any continuous function on a closed and bounded subset of \mathbb{R}^n is Borel measurable and therefore may be approximated by a neural network. A neural network may also approximate any function mapping from any finite dimensional discrete space to another. While the original theorems were first stated in terms of units with activation functions that saturate both for very negative and for very positive arguments, universal approximation theorems have also been proven for a wider class of activation functions, which includes the now commonly used rectified linear unit (Leshno et al., 1993).

The universal approximation theorem means that regardless of what function we are trying to learn, we know that a large MLP will be able to represent this function.
However, we are not guaranteed that the training algorithm will be able to learn that function. Even if the MLP is able to represent the function, learning can fail for two different reasons. First, the optimization algorithm used for training
may not be able to find the value of the parameters that corresponds to the desired function. Second, the training algorithm might choose the wrong function due to overfitting. Recall from Sec. 5.2.1 that the "no free lunch" theorem shows that there is no universally superior machine learning algorithm. Feedforward networks provide a universal system for representing functions, in the sense that, given a function, there exists a feedforward network that approximates the function. There is no universal procedure for examining a training set of specific examples and choosing a function that will generalize to points not in the training set.

The universal approximation theorem says that there exists a network large enough to achieve any degree of accuracy we desire, but the theorem does not say how large this network will be.
Barron (1993) provides some bounds on the size of a single-layer network needed to approximate a broad class of functions. Unfortunately, in the worst case, an exponential number of hidden units (possibly with one hidden unit corresponding to each input configuration that needs to be distinguished) may be required. This is easiest to see in the binary case: the number of possible binary functions on vectors v \in \{0, 1\}^n is 2^{2^n} and selecting one such function requires 2^n bits, which will in general require O(2^n) degrees of freedom.

In summary, a feedforward network with a single layer is sufficient to represent any function, but the layer may be infeasibly large and may fail to learn and generalize correctly. In many circumstances, using deeper models can reduce the number of units required to represent the desired function and can reduce the amount of generalization error.
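The counting argument for the binary case is a two-line computation:

```python
# There are 2**(2**n) Boolean functions on {0, 1}^n: one output bit
# per input configuration, and 2**n configurations.
for n in range(1, 5):
    configurations = 2 ** n
    functions = 2 ** configurations
    print(n, configurations, functions)
# at n = 4 there are already 65536 functions over 16 configurations
```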
functions whic which h can b e appro approximated ximated eﬃcien eﬃciently tly by an amoun t ofexist generalization arc architecture hitecture with depth greater than some value d, but whic which h require a muc much h larger There existisfamilies of functions whichorcan b eto appro eﬃcien byban d. Inximated mo model del if depth restricted to b e less than equal many cases, thetly num numb er arc hitecture with depth greater than some v alue , but whic h require a muc h larger d of hidden units required by the shallow mo model del is exp exponen onen onential tial in n. Suc Such h results d mo del if depth is restricted to b e less than or equal to . In many cases, the num b er were ﬁrst prov proven en for mo models dels that do not resemble the contin continuous, uous, diﬀeren diﬀerentiable tiable n of hidden units required by the shallow mo del is exp onen tial in . Suc h results neural netw networks orks used for machine learning, but hav havee since b een extended to these w eredels. ﬁrstThe provﬁrst en for mo dels that not resemble contin uous,, diﬀeren mo models. results were fordocircuits of logicthe gates (Håstad 1986). tiable Later neural netw orks used for machine learning, but hav e since b een extended to these work extended these results to linear threshold units with nonnegative weigh weights ts mo dels. The ﬁrst results w ere for circuits of logic gates ( Håstad , 1986 ). Later (Håstad and Goldmann, 1991; Ha Hajnal jnal et al., 1993), and then to netw networks orks with w ork extended these results to linear threshold units with nonnegative weigh ts con continuousv tinuousv tinuousvalued alued activ activations ations (Maass, 1992; Maass et al., 1994). Many mo modern dern (neural Håstadnet and Goldmann , 1991;linear Ha jnal et al.,Leshno 1993), et andal.then to) netw orks with netwo wo works rks use rectiﬁed units. (1993 demonstrated continuousv activ ations (Maass , 1992 Maassolynomial et al., 1994 ). 
Many modern neural networks use rectified linear units. Leshno et al. (1993) demonstrated that shallow networks with a broad family of non-polynomial activation functions, including rectified linear units, have universal approximation properties, but these results do not address the questions of depth or efficiency—they specify only that a sufficiently wide rectifier network could represent any function. Pascanu et al.
(2013b) and Montufar et al. (2014) showed that functions representable with a deep rectifier net can require an exponential number of hidden units with a shallow (one hidden layer) network. More precisely, they showed that piecewise linear networks (which can be obtained from rectifier nonlinearities or maxout units) can represent functions with a number of regions that is exponential in the depth of the network. Fig. 6.5 illustrates how a network with absolute value rectification creates mirror images of the function computed on top of some hidden unit, with respect to the input of that hidden unit. Each hidden unit specifies where to fold the input space in order to create mirror responses (on both sides of the absolute value nonlinearity). By
composing these folding operations, we obtain an exponentially large number of piecewise linear regions which can capture all kinds of regular (e.g., repeating) patterns.
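The folding picture can be reproduced with a one-dimensional sketch: each absolute-value unit below folds [0, 1] around its midpoint, and composing d folds yields 2^d linear pieces. The particular fold and the slope-based piece counting are illustrative assumptions:

```python
import numpy as np

def fold(x):
    # One absolute-value rectification unit: folds [0, 1] onto itself
    # around 0.5, creating mirror responses on both sides.
    return 2.0 * np.abs(x - 0.5)

def count_linear_pieces(depth, num_samples=2 ** 12 + 1):
    x = np.linspace(0.0, 1.0, num_samples)
    y = x
    for _ in range(depth):           # compose the folds
        y = fold(y)
    slopes = np.diff(y) / np.diff(x)
    # A new linear piece begins wherever the slope changes.
    return int(np.sum(slopes[1:] != slopes[:-1]) + 1)

for depth in range(1, 6):
    print(depth, count_linear_pieces(depth))   # 2, 4, 8, 16, 32 pieces
```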
Figure 6.5: An intuitive, geometric explanation of the exponential advantage of deeper rectifier networks formally shown by Pascanu et al. (2014a) and by Montufar et al. (2014). (Left) An absolute value rectification unit has the same output for every pair of mirror points in its input. The mirror axis of symmetry is given by the hyperplane defined by the weights and bias of the unit. A function computed on top of that unit (the green decision surface) will be a mirror image of a simpler pattern across that axis of symmetry. (Center) The function can be obtained by folding the space around the axis of symmetry. (Right) Another repeating pattern can be folded on top of the first (by another downstream unit) to obtain another symmetry (which is now repeated four times, with two hidden layers).

More precisely, the main theorem in Montufar et al. (2014) states that the
number of linear regions carved out by a deep rectifier network with d inputs, depth l, and n units per hidden layer, is

    O\left( \binom{n}{d}^{d(l-1)} n^{d} \right),    (6.42)

i.e., exponential in the depth l. In the case of maxout networks with k filters per unit, the number of linear regions is

    O\left( k^{(l-1)+d} \right).    (6.43)
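To get a feel for these bounds, the sketch below evaluates the quantity inside the O(·) of Eq. 6.42, C(n, d)^{d(l-1)} · n^d, and of Eq. 6.43 for a few arbitrary illustrative choices of width n, input dimension d, filter count k, and depth l, exhibiting the exponential growth in depth.

```python
from math import comb

def rectifier_region_bound(n, d, l):
    """Quantity inside the O(.) of Eq. 6.42: C(n, d)**(d*(l-1)) * n**d."""
    return comb(n, d) ** (d * (l - 1)) * n ** d

def maxout_region_bound(k, d, l):
    """Quantity inside the O(.) of Eq. 6.43: k**((l-1) + d)."""
    return k ** ((l - 1) + d)

# Growing the depth l multiplies the exponent, so for fixed width the
# bound grows exponentially in depth:
shallow = rectifier_region_bound(n=8, d=2, l=2)   # C(8,2)**2 * 8**2
deep    = rectifier_region_bound(n=8, d=2, l=4)   # C(8,2)**6 * 8**2
assert deep == shallow * comb(8, 2) ** 4
```

Going from depth 2 to depth 4 here multiplies the bound by C(8, 2)^4, whereas doubling the width of a fixed-depth network only changes the polynomial factors.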
CHAPTER 6. DEEP FEEDFORWARD NETWORKS
Of course, there is no guarantee that the kinds of functions we want to learn in applications of machine learning (and in particular for AI) share such a property.

We may also want to choose a deep model for statistical reasons. Any time we choose a specific machine learning algorithm, we are implicitly stating some set of prior beliefs we have about what kind of function the algorithm should learn. Choosing a deep model encodes a very general belief that the function we want to learn should involve composition of several simpler functions. This can be interpreted from a representation learning point of view as saying that we believe the learning problem consists of discovering a set of underlying factors of variation that can in turn be described in terms of other, simpler underlying factors of variation.

Alternately, we can interpret the use of a deep architecture as expressing a belief that the function we want to learn is a computer program consisting of multiple steps, where each step makes use of the previous step's output. These intermediate outputs are not necessarily factors of variation, but can instead be analogous to counters or pointers that the network uses to organize its internal processing. Empirically, greater depth does seem to result in better generalization for a wide variety of tasks (Bengio et al., 2007; Erhan et al., 2009; Bengio, 2009; Mesnil et al., 2011; Ciresan et al., 2012; Krizhevsky et al., 2012; Sermanet et al., 2013; Farabet et al., 2013; Couprie et al., 2013; Kahou et al., 2013; Goodfellow et al., 2014d; Szegedy et al., 2014a). See Fig. 6.6 and Fig. 6.7 for examples of some of these empirical results. This suggests that using deep architectures does indeed express a useful prior over the space of functions the model learns.

So far we have described neural networks as being simple chains of layers, with the main considerations being the depth of the network and the width of each layer. In practice, neural networks show considerably more diversity.

Many neural network architectures have been developed for specific tasks. Specialized architectures for computer vision called convolutional networks are described in Chapter 9. Feedforward networks may also be generalized to the recurrent neural networks for sequence processing, described in Chapter 10, which have their own architectural considerations.

In general, the layers need not be connected in a chain, even though this is the most common practice. Many architectures build a main chain but then add extra architectural features to it, such as skip connections going from layer i to layer i + 2 or higher. These skip connections make it easier for the gradient to flow from output layers to layers nearer the input.
Figure 6.6: Empirical results showing that deeper networks generalize better when used to transcribe multidigit numbers from photographs of addresses. Data from Goodfellow et al. (2014d). The test set accuracy consistently increases with increasing depth. See Fig. 6.7 for a control experiment demonstrating that other increases to the model size do not yield the same effect.
Figure 6.7: Deeper models tend to perform better. This is not merely because the model is larger. This experiment from Goodfellow et al. (2014d) shows that increasing the number of parameters in layers of convolutional networks without increasing their depth is not nearly as effective at increasing test set performance. The legend indicates the depth of network used to make each curve and whether the curve represents variation in the size of the convolutional or the fully connected layers. We observe that shallow models in this context overfit at around 20 million parameters while deep ones can benefit from having over 60 million. This suggests that using a deep model expresses a useful preference over the space of functions the model can learn. Specifically, it expresses a belief that the function should consist of many simpler functions composed together. This could result either in learning a representation that is composed in turn of simpler representations (e.g., corners defined in terms of edges) or in learning a program with sequentially dependent steps (e.g., first locate a set of objects, then segment them from each other, then recognize them).
Another key consideration of architecture design is exactly how to connect a pair of layers to each other. In the default neural network layer described by a linear transformation via a matrix W, every input unit is connected to every output unit. Many specialized networks in the chapters ahead have fewer connections, so that each unit in the input layer is connected to only a small subset of units in the output layer. These strategies for reducing the number of connections reduce the number of parameters and the amount of computation required to evaluate the network, but are often highly problem-dependent. For example, convolutional networks, described in Chapter 9, use specialized patterns of sparse connections that are very effective for computer vision problems. In this chapter, it is difficult to give much more specific advice concerning the architecture of a generic neural network. Subsequent chapters develop the particular architectural strategies that have been found to work well for different application domains.
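To see the scale of the savings, the sketch below counts connections for a fully connected layer versus a sparsely connected one in which each output unit sees only k inputs. The layer sizes and the value k = 9 (the size of a small local receptive field) are arbitrary illustrative choices.

```python
# Connection counts: dense layer vs. a layer with k connections per
# output unit (all sizes are arbitrary, for illustration only).

def dense_connections(n_in, n_out):
    """Every input unit connects to every output unit."""
    return n_in * n_out

def sparse_connections(n_in, n_out, k):
    """Each output unit connects to only k of the n_in input units."""
    assert k <= n_in
    return n_out * k

n_in, n_out, k = 10_000, 10_000, 9
assert dense_connections(n_in, n_out) == 100_000_000
assert sparse_connections(n_in, n_out, k) == 90_000
```

Here sparsity cuts the connection count by a factor of n_in / k, which directly reduces both the parameter count (when weights are untied) and the computation needed to evaluate the layer.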
6.5 Back-Propagation and Other Differentiation Algorithms

When we use a feedforward neural network to accept an input x and produce an output ŷ, information flows forward through the network. The inputs x provide the initial information that then propagates up to the hidden units at each layer and finally produces ŷ. This is called forward propagation. During training, forward propagation can continue onward until it produces a scalar cost J(θ). The back-propagation algorithm (Rumelhart et al., 1986a), often simply called backprop, allows the information from the cost to then flow backwards through the network, in order to compute the gradient.

Computing an analytical expression for the gradient is straightforward, but numerically evaluating such an expression can be computationally expensive. The back-propagation algorithm does so using a simple and inexpensive procedure.
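As a deliberately tiny sketch of forward propagation ending in a scalar cost, the pure-Python function below pushes one input through two layers and a squared-error loss. All weights and the target value are hypothetical constants, not anything prescribed by the text.

```python
import math

def forward(x, y_target):
    """Forward propagation: input -> hidden -> output -> scalar cost J."""
    # Layer 1: affine transformation plus tanh (illustrative constants).
    h = math.tanh(0.8 * x + 0.1)
    # Layer 2: affine output unit.
    y_hat = 1.5 * h - 0.2
    # The forward pass terminates in a single scalar: squared error
    # on one training example.
    J = 0.5 * (y_hat - y_target) ** 2
    return J

J = forward(x=1.0, y_target=0.5)
assert isinstance(J, float) and J >= 0.0
```

Back-propagation then runs this computation in reverse, carrying derivative information from the scalar J back toward x and the weights.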
The term back-propagation is often misunderstood as meaning the whole learning algorithm for multilayer neural networks. Actually, back-propagation refers only to the method for computing the gradient, while another algorithm, such as stochastic gradient descent, is used to perform learning using this gradient. Furthermore, back-propagation is often misunderstood as being specific to multilayer neural networks, but in principle it can compute derivatives of any function (for some functions, the correct response is to report that the derivative of the function is undefined). Specifically, we will describe how to compute the gradient ∇_x f(x, y) for an arbitrary function f, where x is a set of variables whose derivatives are desired, and y is an additional set of variables that are inputs to the function
but whose derivatives are not required. In learning algorithms, the gradient we most often require is the gradient of the cost function with respect to the parameters, ∇_θ J(θ). Many machine learning tasks involve computing other derivatives, either as part of the learning process, or to analyze the learned model. The back-propagation algorithm can be applied to these tasks as well, and is not restricted to computing the gradient of the cost function with respect to the parameters. The idea of computing derivatives by propagating information through a network is very general, and can be used to compute values such as the Jacobian of a function f with multiple outputs. We restrict our description here to the most commonly used case where f has a single output.

So far we have discussed neural networks with a relatively informal graph language. To describe the back-propagation algorithm more precisely, it is helpful to have a more precise computational graph language.

Many ways of formalizing computation as graphs are possible. Here, we use each node in the graph to indicate a variable. The variable may be a scalar, vector, matrix, tensor, or even a variable of another type.

To formalize our graphs, we also need to introduce the idea of an operation. An operation is a simple function of one or more variables. Our graph language is accompanied by a set of allowable operations. Functions more complicated than the operations in this set may be described by composing many operations together.

Without loss of generality, we define an operation to return only a single output variable. This does not lose generality because the output variable can have multiple entries, such as a vector. Software implementations of back-propagation usually support operations with multiple outputs, but we avoid this case in our description because it introduces many extra details that are not important to conceptual understanding.

If a variable y is computed by applying an operation to a variable x, then we draw a directed edge from x to y. We sometimes annotate the output node with the name of the operation applied, and other times omit this label when the operation is clear from context.

Examples of computational graphs are shown in Fig. 6.8.
Figure 6.8: Examples of computational graphs. (a) The graph using the × operation to compute z = xy. (b) The graph for the logistic regression prediction ŷ = σ(x^⊤ w + b). Some of the intermediate expressions do not have names in the algebraic expression but need names in the graph. We simply name the i-th such variable u^(i). (c) The computational graph for the expression H = max{0, XW + b}, which computes a design matrix of rectified linear unit activations H given a design matrix containing a minibatch of inputs X. (d) Examples a–c applied at most one operation to each variable, but it is possible to apply more than one operation. Here we show a computation graph that applies more than one operation to the weights w of a linear regression model. The weights are used to make both the prediction ŷ and the weight decay penalty λ ∑_i w_i^2.
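Example (c) of Fig. 6.8 can be written out directly. The sketch below uses plain Python lists of lists for the minibatch X, weight matrix W, and bias vector b; all of the numeric values are illustrative.

```python
def relu_layer(X, W, b):
    """Compute H = max{0, X W + b} for a minibatch X (a list of rows)."""
    H = []
    for row in X:
        out = []
        for j in range(len(b)):
            s = sum(row[i] * W[i][j] for i in range(len(row))) + b[j]
            out.append(max(0.0, s))        # rectified linear activation
        H.append(out)
    return H

X = [[1.0, -2.0],
     [0.0,  1.0]]          # minibatch of two examples, two features
W = [[1.0, 0.5],
     [1.0, 0.0]]           # 2 inputs -> 2 hidden units
b = [0.0, -0.25]

H = relu_layer(X, W, b)
# First example: XW + b = [-1.0, 0.25], so relu gives [0.0, 0.25].
assert H[0] == [0.0, 0.25]
```

In the graph view this single line of algebra is three chained operations (matrix multiplication, addition of b, and the elementwise max with 0), each producing one named intermediate variable.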
The chain rule of calculus (not to be confused with the chain rule of probability) is used to compute the derivatives of functions formed by composing other functions whose derivatives are known. Back-propagation is an algorithm that computes the chain rule, with a specific order of operations that is highly efficient.

Let x be a real number, and let f and g both be functions mapping from a real number to a real number. Suppose that y = g(x) and z = f(g(x)) = f(y). Then the chain rule states that

    dz/dx = (dz/dy)(dy/dx).    (6.44)

We can generalize this beyond the scalar case. Suppose that x ∈ ℝ^m, y ∈ ℝ^n, g maps from ℝ^m to ℝ^n, and f maps from ℝ^n to ℝ. If y = g(x) and z = f(y), then

    ∂z/∂x_i = ∑_j (∂z/∂y_j)(∂y_j/∂x_i).    (6.45)

In vector notation, this may be equivalently written as

    ∇_x z = (∂y/∂x)^⊤ ∇_y z,    (6.46)

where ∂y/∂x is the n × m Jacobian matrix of g.

From this we see that the gradient of a variable x can be obtained by multiplying a Jacobian matrix ∂y/∂x by a gradient ∇_y z. The back-propagation algorithm consists of performing such a Jacobian-gradient product for each operation in the graph.

Usually we do not apply the back-propagation algorithm merely to vectors, but rather to tensors of arbitrary dimensionality. Conceptually, this is exactly the same as back-propagation with vectors. The only difference is how the numbers are arranged in a grid to form a tensor. We could imagine flattening each tensor into a vector before we run back-propagation, computing a vector-valued gradient, and then reshaping the gradient back into a tensor. In this rearranged view, back-propagation is still just multiplying Jacobians by gradients.

To denote the gradient of a value z with respect to a tensor X, we write ∇_X z, just as if X were a vector. The indices into X now have multiple coordinates; for example, a 3-D tensor is indexed by three coordinates. We can abstract this away by using a single variable i to represent the complete tuple of indices. For all possible index tuples i, (∇_X z)_i gives ∂z/∂X_i. This is exactly the same as how for all
possible integer indices i into a vector, (∇_x z)_i gives ∂z/∂x_i. Using this notation, we can write the chain rule as it applies to tensors. If Y = g(X) and z = f(Y), then

    ∇_X z = ∑_j (∇_X Y_j) ∂z/∂Y_j.    (6.47)
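The Jacobian-gradient product of Eq. 6.46 can be checked on a tiny example. The functions below are ours, chosen only to make the algebra easy: g(x) = (x₁ + x₂, x₁x₂) and f(y) = y₁² + y₂, so z = (x₁ + x₂)² + x₁x₂ and both sides of the equation can be written down by hand.

```python
# Checking grad_x z = (dy/dx)^T grad_y z (Eq. 6.46) on a 2-D example:
#   g(x) = (x1 + x2, x1 * x2),  f(y) = y1**2 + y2,  z = f(g(x)).

def grad_via_chain_rule(x1, x2):
    """The Jacobian-gradient product performed by back-propagation."""
    y1, y2 = x1 + x2, x1 * x2
    jac = [[1.0, 1.0],       # row j: (dy_j/dx_1, dy_j/dx_2)
           [x2,  x1]]
    grad_y = [2.0 * y1, 1.0]             # (dz/dy1, dz/dy2)
    # J^T grad_y: sum over outputs j of (dy_j/dx_i) * (dz/dy_j).
    return [sum(jac[j][i] * grad_y[j] for j in range(2)) for i in range(2)]

def grad_direct(x1, x2):
    """Partials of z = (x1 + x2)**2 + x1*x2, computed by hand."""
    return [2.0 * (x1 + x2) + x2, 2.0 * (x1 + x2) + x1]

assert grad_via_chain_rule(1.5, -2.0) == grad_direct(1.5, -2.0)
```

The same pattern, a sum over output indices j of a local derivative times an incoming gradient, is exactly what Eq. 6.47 expresses for tensor-valued nodes, with j ranging over complete tuples of tensor coordinates.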
Using the chain rule, it is straigh straightforward tforward down wn an algebraic expression for X to write do the gradient of a scalar with resp respect ect to an any y no node de in the computational graph that Using the chain rule, it is straigh tforward to write dothat wn an algebraic in expression for pro produced duced that scalar. Ho Howev wev wever, er, actually ev evaluating aluating expression a computer the gradient of a scalar with resp ect to an y no de in the computational graph that in intro tro troduces duces some extra considerations. pro duced that scalar. However, actually evaluating that expression in a computer Sp Speciﬁcally eciﬁcally eciﬁcally, , many subexpressions expressions ma may y be rep repeated eated several times within the intro duces some extra sub considerations. overall expression for the gradien gradient. t. Any pro procedure cedure that computes the gradien gradientt Sp eciﬁcally , many sub expressions ma y be rep eated several times within the will need to choose whether to store these sub subexpressions expressions or to recompute them o verall expression for the gradien t. Any pro cedure that computes the gradien sev several eral times. An example of ho how w these rep repeated eated sub subexpressions expressions arise is given int will whether to store the these sub expressions or twice to recompute them Fig. need 6.9. to In choose some cases, computing same sub subexpression expression would simply sev times.F An example of ho w thesethere rep eated expressions is of given in b e eral wasteful. For or complicated graphs, can bsub e exp exponentially onentiallyarise many these Fig. 6.9 . In some cases, computing the same sub expression twice would simply wasted computations, making a naiv naivee implemen implementation tation of the chain rule infeasible. b e w asteful. 
In other cases, computing the same subexpression twice could be a valid way to reduce memory consumption at the cost of higher runtime.

We begin with a version of the backpropagation algorithm that specifies the actual gradient computation directly (Algorithm 6.2, along with Algorithm 6.1 for the associated forward computation), in the order it will actually be done and according to the recursive application of the chain rule. One could either directly perform these computations or view the description of the algorithm as a symbolic specification of the computational graph for computing the backpropagation. However, this formulation does not make explicit the manipulation and the construction of the symbolic graph that performs the gradient computation. Such a formulation is presented below in Sec. 6.5.6, with Algorithm 6.5, where we also generalize to nodes that contain arbitrary tensors.
First consider a computational graph describing how to compute a single scalar u^(n) (say the loss on a training example). This scalar is the quantity whose gradient we want to obtain, with respect to the n_i input nodes u^(1) to u^(n_i). In other words, we wish to compute ∂u^(n)/∂u^(i) for all i ∈ {1, 2, . . . , n_i}. In the application of backpropagation to computing gradients for gradient descent over parameters, u^(n) will be the cost associated with an example or a minibatch, while u^(1) to u^(n_i) correspond to the parameters of the model.
CHAPTER 6. DEEP FEEDFORWARD NETWORKS
We will assume that the nodes of the graph have been ordered in such a way that we can compute their output one after the other, starting at u^(n_i + 1) and going up to u^(n). As defined in Algorithm 6.1, each node u^(i) is associated with an operation f^(i) and is computed by evaluating the function

u^(i) = f(A^(i))    (6.48)

where A^(i) is the set of all nodes that are parents of u^(i).

Algorithm 6.1  A procedure that performs the computations mapping n_i inputs u^(1) to u^(n_i) to an output u^(n). This defines a computational graph where each node computes numerical value u^(i) by applying a function f^(i) to the set of arguments A^(i) that comprises the values of previous nodes u^(j), j < i, with j ∈ Pa(u^(i)). The input to the computational graph is the vector x, and is set into the first n_i nodes u^(1) to u^(n_i). The output of the computational graph is read off the last (output) node u^(n).

  for i = 1, . . . , n_i do
    u^(i) ← x_i
  end for
  for i = n_i + 1, . . . , n do
    A^(i) ← {u^(j) | j ∈ Pa(u^(i))}
    u^(i) ← f^(i)(A^(i))
  end for
  return u^(n)

That algorithm specifies the forward propagation computation, which we could put in a graph G. In order to perform backpropagation, we can construct a computational graph that depends on G and adds to it an extra set of nodes. These form a subgraph B with one node per node of G. Computation in B proceeds in exactly the reverse of the order of computation in G, and each node of B computes the derivative ∂u^(n)/∂u^(i) associated with the forward graph node u^(i). This is done using the chain rule with respect to the scalar output u^(n):

∂u^(n)/∂u^(j) = Σ_{i : j ∈ Pa(u^(i))} (∂u^(n)/∂u^(i)) (∂u^(i)/∂u^(j))    (6.49)

as specified by Algorithm 6.2. The subgraph B contains exactly one edge for each edge from node u^(j) to node u^(i) of G. The edge from u^(j) to u^(i) is associated with the computation of ∂u^(i)/∂u^(j). In addition, a dot product is performed for each node, between the gradient already computed with respect to nodes u^(i) that are children
of u^(j) and the vector containing the partial derivatives ∂u^(i)/∂u^(j) for the same children nodes u^(i). To summarize, the amount of computation required for performing the backpropagation scales linearly with the number of edges in G, where the computation for each edge corresponds to computing a partial derivative (of one node with respect to one of its parents) as well as performing one multiplication and one addition. Below, we generalize this analysis to tensor-valued nodes, which is just a way to group multiple scalar values in the same node and enable more efficient implementations.

Algorithm 6.2  Simplified version of the backpropagation algorithm for computing the derivatives of u^(n) with respect to the variables in the graph. This example is intended to further understanding by showing a simplified case where all variables are scalars, and we wish to compute the derivatives with respect to u^(1), . . . , u^(n_i). This simplified version computes the derivatives of all nodes in the graph. The computational cost of this algorithm is proportional to the number of edges in the graph, assuming that the partial derivative associated with each edge requires a constant time. This is of the same order as the number of computations for the forward propagation. Each ∂u^(i)/∂u^(j) is a function of the parents u^(j) of u^(i), thus linking the nodes of the forward graph to those added for the backpropagation graph.

  Run forward propagation (Algorithm 6.1 for this example) to obtain the activations of the network
  Initialize grad_table, a data structure that will store the derivatives that have been computed. The entry grad_table[u^(i)] will store the computed value of ∂u^(n)/∂u^(i).
  grad_table[u^(n)] ← 1
  for j = n − 1 down to 1 do
    The next line computes ∂u^(n)/∂u^(j) = Σ_{i : j ∈ Pa(u^(i))} (∂u^(n)/∂u^(i)) (∂u^(i)/∂u^(j)) using stored values:
    grad_table[u^(j)] ← Σ_{i : j ∈ Pa(u^(i))} grad_table[u^(i)] ∂u^(i)/∂u^(j)
  end for
  return {grad_table[u^(i)] | i = 1, . . . , n_i}

The backpropagation algorithm is designed to reduce the number of common subexpressions without regard to memory. Specifically, it performs on the order of one Jacobian product per node in the graph. This can be seen from the fact that backprop (Algorithm 6.2) visits each edge from node u^(j) to node u^(i) of the graph exactly once in order to obtain the associated partial derivative ∂u^(i)/∂u^(j). Backpropagation thus avoids the exponential explosion in repeated subexpressions.
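Algorithms 6.1 and 6.2 can be sketched directly in Python for the scalar case. The graph representation below (explicit parent lists plus hand-supplied local derivatives) is an assumption of this sketch, not the book's code:

```python
# Scalar-case sketch of Algorithm 6.1 (forward) and Algorithm 6.2
# (backward). Nodes are in topological order; inputs come first.

def forward(x, ops):
    """Algorithm 6.1. x: the n_i input values. ops: for each non-input
    node, a pair (f, parent_indices). Returns all node values u."""
    u = list(x)                        # u^(1..n_i) <- x
    for f, parents in ops:             # i = n_i + 1, ..., n
        u.append(f([u[j] for j in parents]))
    return u

def backprop(u, ops, n_inputs, local_grad):
    """Algorithm 6.2. Fills grad_table[j] = du^(n)/du^(j) in reverse
    topological order. local_grad(i, k, u) gives du^(i)/du^(j) for the
    k-th parent j of node i, evaluated at the stored values u."""
    n = len(u)
    grad_table = [0.0] * n
    grad_table[n - 1] = 1.0                        # du^(n)/du^(n) = 1
    for j in range(n - 2, -1, -1):                 # j = n-1 down to 1
        for i in range(n_inputs, n):               # non-input nodes
            parents = ops[i - n_inputs][1]
            for k, p in enumerate(parents):
                if p == j:                         # edge u^(j) -> u^(i)
                    grad_table[j] += grad_table[i] * local_grad(i, k, u)
    return grad_table

# Toy graph (0-based indices): u2 = u0 * u1, u3 = u2 + u0,
# so the output is x1*x2 + x1.
ops = [
    (lambda a: a[0] * a[1], [0, 1]),   # node 2: product
    (lambda a: a[0] + a[1], [2, 0]),   # node 3: sum
]

def local_grad(i, k, u):
    if i == 2:                         # d(u0*u1)/du0 = u1, /du1 = u0
        return u[1] if k == 0 else u[0]
    return 1.0                         # d(u2+u0)/du2 = d(u2+u0)/du0 = 1

u = forward([2.0, 3.0], ops)
g = backprop(u, ops, 2, local_grad)
assert u[-1] == 8.0                    # 2*3 + 2
assert g[:2] == [4.0, 2.0]             # dz/dx1 = x2 + 1, dz/dx2 = x1
```

Note how node 0 is reached through two paths (directly, and through the product node), and the inner loop sums the two contributions, matching Eq. 6.49.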
However, other algorithms may be able to avoid more subexpressions by performing simplifications on the computational graph, or may be able to conserve memory by recomputing rather than storing some subexpressions. We will revisit these ideas after describing the backpropagation algorithm itself.

To clarify the above definition of the backpropagation computation, let us consider the specific graph associated with a fully-connected multilayer MLP. Algorithm 6.3 first shows the forward propagation, which maps parameters to the supervised loss L(ŷ, y) associated with a single (input, target) training example (x, y), with ŷ the output of the neural network when x is provided as input. Algorithm 6.4 then shows the corresponding computation to be done for applying the backpropagation algorithm to this graph.
Algorithm 6.3 and Algorithm 6.4 are demonstrations that are chosen to be simple and straightforward to understand. However, they are specialized to one specific problem.

Modern software implementations are based on the generalized form of backpropagation described in Sec. 6.5.6 below, which can accommodate any computational graph by explicitly manipulating a data structure for representing symbolic computation.

Algebraic expressions and computational graphs both operate on symbols, or variables that do not have specific values. These algebraic and graph-based representations are called symbolic representations. When we actually use or train a neural network, we must assign specific values to these symbols. We replace a symbolic input to the network x with a specific numeric value, such as [1.2, 3.765, −1.8]⊤.

Some approaches
to backpropagation take a computational graph and a set of numerical values for the inputs to the graph, then return a set of numerical values describing the gradient at those input values. We call this approach "symbol-to-number" differentiation. This is the approach used by libraries such as Torch (Collobert et al., 2011b) and Caffe (Jia, 2013).

Another approach is to take a computational graph and add additional nodes to the graph that provide a symbolic description of the desired derivatives. This
Figure 6.9: A computational graph that results in repeated subexpressions when computing the gradient. Let w ∈ ℝ be the input to the graph. We use the same function f : ℝ → ℝ as the operation that we apply at every step of a chain: x = f(w), y = f(x), z = f(y). To compute ∂z/∂w, we apply Eq. 6.44 and obtain:

∂z/∂w    (6.50)
= (∂z/∂y)(∂y/∂x)(∂x/∂w)    (6.51)
= f′(y) f′(x) f′(w)    (6.52)
= f′(f(f(w))) f′(f(w)) f′(w)    (6.53)

Eq. 6.52 suggests an implementation in which we compute the value of f(w) only once and store it in the variable x. This is the approach taken by the backpropagation algorithm. An alternative approach is suggested by Eq. 6.53, where the subexpression f(w) appears more than once. In the alternative approach, f(w) is recomputed each time it is needed. When the memory required to store the value of these expressions is low, the backpropagation approach of Eq. 6.52 is clearly preferable because of its reduced runtime. However, Eq. 6.53 is also a valid implementation of the chain rule, and is useful when memory is limited.
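The two evaluation strategies of Eqs. 6.52 and 6.53 can be written out directly. The choice of f = sin below is an illustrative assumption; any differentiable scalar function works:

```python
import math

# The two evaluations of dz/dw from Fig. 6.9, where z = f(f(f(w))).
# Eq. 6.52 stores the forward values x = f(w), y = f(x);
# Eq. 6.53 recomputes f(w) wherever it appears.

f = math.sin       # example f : R -> R
df = math.cos      # its derivative

def grad_stored(w):
    # Eq. 6.52: one forward pass, intermediate values cached
    x = f(w)
    y = f(x)
    return df(y) * df(x) * df(w)

def grad_recompute(w):
    # Eq. 6.53: no cache; f(w) is re-evaluated inside each factor
    return df(f(f(w))) * df(f(w)) * df(w)

w = 0.3
assert abs(grad_stored(w) - grad_recompute(w)) < 1e-12
```

Both return the same value; they differ only in how many times f is evaluated, which is exactly the runtime/memory trade-off discussed in the caption.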
is the approach taken by Theano (Bergstra et al., 2010; Bastien et al., 2012) and TensorFlow (Abadi et al., 2015).

Algorithm 6.3  Forward propagation through a typical deep neural network and the computation of the cost function. The loss L(ŷ, y) depends on the output ŷ and on the target y (see Sec. 6.2.1.1 for examples of loss functions). To obtain the total cost J, the loss may be added to a regularizer Ω(θ), where θ contains all the parameters (weights and biases). Algorithm 6.4 shows how to compute gradients of J with respect to parameters W and b. For simplicity, this demonstration uses only a single input example x. Practical applications should use a minibatch. See Sec. 6.5.7 for a more realistic demonstration.

Require: Network depth, l
Require: W^(i), i ∈ {1, . . . , l}, the weight matrices of the model
Require: b^(i), i ∈ {1, . . . , l}, the bias parameters of the model
Require: x, the input to process
Require: y, the target output
  h^(0) = x
  for k = 1, . . . , l do
    a^(k) = b^(k) + W^(k) h^(k−1)
    h^(k) = f(a^(k))
  end for
  ŷ = h^(l)
  J = L(ŷ, y) + λΩ(θ)
An example of how this approach works is illustrated in Fig. 6.10. The primary advantage of this approach is that the derivatives are described in the same language as the original expression. Because the derivatives are just another computational graph, it is possible to run backpropagation again, differentiating the derivatives in order to obtain higher derivatives. Computation of higher-order derivatives is described in Sec. 6.5.10.

We will use the latter approach and describe the backpropagation algorithm in terms of constructing a computational graph for the derivatives. Any subset of the graph may then be evaluated using specific numerical values at a later time. This allows us to avoid specifying exactly when each operation should be computed. Instead, a generic graph evaluation engine can evaluate every node as soon as its
parents' values are available.

The description of the symbol-to-symbol based approach subsumes the symbol-to-number approach. The symbol-to-number approach can be understood as performing exactly the same computations as are done in the graph built by the symbol-to-symbol approach. The key difference is that the symbol-to-number
Algorithm 6.4  Backward computation for the deep neural network of Algorithm 6.3, which uses in addition to the input x a target y. This computation yields the gradients on the activations a^(k) for each layer k, starting from the output layer and going backwards to the first hidden layer. From these gradients, which can be interpreted as an indication of how each layer's output should change to reduce error, one can obtain the gradient on the parameters of each layer. The gradients on weights and biases can be immediately used as part of a stochastic gradient update (performing the update right after the gradients have been computed) or used with other gradient-based optimization methods.

  After the forward computation, compute the gradient on the output layer:
  g ← ∇_ŷ J = ∇_ŷ L(ŷ, y)
  for k = l, l − 1, . . . , 1 do
    Convert the gradient on the layer's output into a gradient into the pre-nonlinearity activation (element-wise multiplication if f is element-wise):
    g ← ∇_{a^(k)} J = g ⊙ f′(a^(k))
    Compute gradients on weights and biases (including the regularization term, where needed):
    ∇_{b^(k)} J = g + λ∇_{b^(k)} Ω(θ)
    ∇_{W^(k)} J = g h^(k−1)⊤ + λ∇_{W^(k)} Ω(θ)
    Propagate the gradients w.r.t. the next lower-level hidden layer's activations:
    g ← ∇_{h^(k−1)} J = W^(k)⊤ g
  end for
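Algorithms 6.3 and 6.4 can be sketched together in NumPy. The concrete choices below (f = tanh, squared-error loss L(ŷ, y) = ½‖ŷ − y‖², λ = 0, and the particular layer sizes) are illustrative assumptions for this sketch:

```python
import numpy as np

# NumPy sketch of Algorithm 6.3 (forward) and Algorithm 6.4 (backward)
# for a fully connected net with f = tanh, squared-error loss, and no
# regularizer (lambda = 0). Single example; real code uses minibatches.

rng = np.random.default_rng(0)
sizes = [3, 4, 2]                       # layer widths; depth l = 2
W = [rng.standard_normal((sizes[k + 1], sizes[k])) for k in range(2)]
b = [np.zeros(sizes[k + 1]) for k in range(2)]
x = rng.standard_normal(3)
y = np.array([1.0, -1.0])

# Algorithm 6.3: forward propagation
h = [x]                                 # h^(0) = x
a = [None]
for k in range(2):
    a.append(b[k] + W[k] @ h[k])        # a^(k) = b^(k) + W^(k) h^(k-1)
    h.append(np.tanh(a[k + 1]))         # h^(k) = f(a^(k))
yhat = h[-1]
J = 0.5 * np.sum((yhat - y) ** 2)

# Algorithm 6.4: backward computation
g = yhat - y                            # g <- grad wrt yhat of L
grad_W, grad_b = [None, None], [None, None]
for k in range(2, 0, -1):
    g = g * (1.0 - h[k] ** 2)           # g <- g (elementwise) f'(a^(k))
    grad_b[k - 1] = g.copy()            # gradient wrt b^(k)
    grad_W[k - 1] = np.outer(g, h[k - 1])   # gradient wrt W^(k)
    g = W[k - 1].T @ g                  # propagate to h^(k-1)

# Finite-difference spot check on one weight
eps = 1e-6
Wp = [w.copy() for w in W]
Wp[0][0, 0] += eps
hp = x
for k in range(2):
    hp = np.tanh(b[k] + Wp[k] @ hp)
num = (0.5 * np.sum((hp - y) ** 2) - J) / eps
assert abs(grad_W[0][0, 0] - num) < 1e-4
```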
Figure 6.10: An example of the symbol-to-symbol approach to computing derivatives. In this approach, the backpropagation algorithm does not need to ever access any actual specific numeric values. Instead, it adds nodes to a computational graph describing how to compute these derivatives. A generic graph evaluation engine can later compute the derivatives for any specific numeric values. (Left) In this example, we begin with a graph representing z = f(f(f(w))). (Right) We run the backpropagation algorithm, instructing it to construct the graph for the expression corresponding to dz/dw. In this example, we do not explain how the backpropagation algorithm works. The purpose is only to illustrate what the desired result is: a computational graph with a symbolic description of the derivative.
approach does not expose the graph.

The backpropagation algorithm is very simple. To compute the gradient of some scalar z with respect to one of its ancestors x in the graph, we begin by observing that the gradient with respect to z is given by dz/dz = 1. We can then compute the gradient with respect to each parent of z in the graph by multiplying the current gradient by the Jacobian of the operation that produced z. We continue multiplying by Jacobians, traveling backwards through the graph in this way until we reach x. For any node that may be reached by going backwards from z through two or more paths, we simply sum the gradients arriving from different paths at that node.

More formally, each node in the graph G corresponds to a variable. To achieve maximum generality, we describe this variable as being a tensor V. Tensors can
T o achiev in general hav havee an any y num numb b er of dimensions, and subsume scalars, vectors, ande maximum generality, we describ e this variable as b eing a tensor . Tensor can G matrices. in general have any numb er of dimensions, and subsume scalars, vectors, and We assume that each variable is asso associated ciated with the following subroutines: matrices. that each is asso ciated with the following subroutines: ( ):variable •We assume _ This returns the op operation eration that computes , represen sented ted by the edges coming in into to in the computational graph. For example, ( ): This _ may b e a Python returns the op eration the thatmatrix computes , reprethere or C++ class representing multiplication sen ted by the in the computational or example, • op operation, eration, andedges the coming into function. Supp Suppose ose we graph. ha have ve a vFariable that there may b e a Python or C++ class representing the matrix multiplication ( ) is created by matrix multiplication, C = AB . Then _ op eration, and the function. Supp ose we ha ve a v ariable that returns a p oin ointer ter to an instance of the corresp corresponding onding C++ class. ( ) is created by matrix multiplication, C = AB . Then _ • returns _ a p ointer( to , Gan ): instance This returns thecorresp list of onding variables thatclass. are children of of the C++ in the computational graph G . _ ( , ): This returns the list of variables that are children of •• ( , G ) : GThisgraph in_the computational returns. the list of variables that are parents of in the computational graph G . G _ ( , ) : This returns the list of variables that are parents of inhthe computational graph . •Eac G is also Each op operation eration asso associated ciated with a op operation. eration. This G op operation eration can compute a Jacobianvector pro product duct as describ described ed by Eq. 6.47. This Eac h op eration is also asso ciated with a op eration. This . 
Each operation is also associated with a bprop operation. This bprop operation can compute a Jacobian-vector product as described by Eq. 6.47. This is how the backpropagation algorithm is able to achieve great generality. Each operation is responsible for knowing how to backpropagate through the edges in the graph that it participates in. For example, we might use a matrix multiplication operation to create a variable C = AB. Suppose that the gradient of a scalar z with respect to C is given by G. The matrix multiplication operation is responsible for defining two backpropagation rules, one for each of its input arguments. If we call the bprop method to request the gradient with respect to A given that the gradient on the output is G, then the bprop method of the matrix multiplication operation must state that the gradient with respect to A is given by G B⊤. Likewise, if we call the bprop method to request the gradient with respect to B, then the matrix operation is responsible for implementing the bprop method and specifying that the desired gradient is given by A⊤G. The backpropagation algorithm itself does not need to know any differentiation rules. It only needs to call each operation's bprop rules with the right arguments. Formally,

    op.bprop(inputs, X, G) = Σ_i (∇_X op.f(inputs)_i) G_i        (6.54)

which is just an implementation of the chain rule as expressed in Eq. 6.47. Here, inputs is a list of inputs that are supplied to the operation, op.f is the mathematical function that the operation implements, X is the input whose gradient we wish to compute, and G is the gradient on the output of the operation.
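The two matmul backpropagation rules can be written out directly. The sketch below assumes NumPy arrays and identifies the input X by object identity; the class name MatMul and this particular bprop signature are illustrative, not any library's API:

```python
import numpy as np

class MatMul:
    """Matrix multiplication op C = A B, with one bprop rule per input.
    Illustrative sketch of the behavior described in the text."""
    def f(self, A, B):
        return A @ B

    def bprop(self, inputs, X, G):
        """Gradient of a scalar z w.r.t. input X, given G = dz/dC."""
        A, B = inputs
        if X is A:
            return G @ B.T      # dz/dA = G B^T
        if X is B:
            return A.T @ G      # dz/dB = A^T G
        raise ValueError("X is not an input of this operation")

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[5., 6.], [7., 8.]])
op = MatMul()
G = np.ones((2, 2))             # dz/dC for z = sum of the entries of C
dA = op.bprop([A, B], A, G)     # G @ B.T
dB = op.bprop([A, B], B, G)     # A.T @ G
```

Note that the operation only ever maps the output gradient G to an input gradient; it never needs to know what z is or how the rest of the graph is shaped.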
The bprop method should always pretend that all of its inputs are distinct from each other, even if they are not. For example, if the mul operator is passed two copies of x to compute x², the bprop method should still return x as the derivative with respect to both inputs. The backpropagation algorithm will later add both of these arguments together to obtain 2x, which is the correct total derivative on x.

Software implementations of backpropagation usually provide both the operations and their bprop methods, so that users of deep learning software libraries are able to backpropagate through graphs built using common operations like matrix multiplication, exponents, logarithms, and so on.
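A concrete sketch of this rule for a hypothetical elementwise mul operation follows. The slot-indexed bprop_slot method is a simplification of the bprop(inputs, X, G) interface, used here because object identity cannot distinguish two copies of the same tensor:

```python
import numpy as np

class Mul:
    """Elementwise product op; bprop treats its two argument slots as
    distinct even when both hold the same tensor. Illustrative sketch."""
    def f(self, a, b):
        return a * b

    def bprop_slot(self, inputs, slot, G):
        a, b = inputs
        # d(ab)/da = b for slot 0 and d(ab)/db = a for slot 1,
        # regardless of whether a and b are the same tensor.
        return G * b if slot == 0 else G * a

x = np.array(3.0)
op = Mul()
G = np.array(1.0)       # dz/dz = 1 for z = x * x
# Each slot reports x as its partial derivative...
partials = [op.bprop_slot([x, x], s, G) for s in (0, 1)]
# ...and backpropagation sums the two contributions: 2x = 6.0.
total = sum(partials)
```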
Software engineers who build a new implementation of backpropagation, or advanced users who need to add their own operation to an existing library, must usually derive the op.bprop method for any new operations manually.

The backpropagation algorithm is formally described in Algorithm 6.5.

In Sec. 6.5.2, we motivated backpropagation as a strategy for avoiding computing the same subexpression in the chain rule multiple times. The naive algorithm could have exponential runtime due to these repeated subexpressions. Now that we have specified the backpropagation algorithm, we can understand its computational cost. If we assume that each operation evaluation has roughly the same cost, then we may analyze the computational cost in terms of the number of operations executed. Keep in mind here that we refer to an operation as the fundamental unit of our computational graph, which might actually consist of very many arithmetic operations (for example, we might have a graph that treats matrix